Prometheus Metrics
Overview
The Infra Operator exposes comprehensive Prometheus metrics for complete observability of your infrastructure management. These metrics provide insights into reconciliation performance, AWS API usage, drift detection, resource status, and more.
Metrics Categories
1. Reconciliation Metrics
Track the performance and reliability of Kubernetes reconciliation loops.
infra_operator_reconcile_total
Type: Counter
Labels: resource_type, result
Description: Total number of reconciliations per resource type and result (success, error, requeue)
# Reconciliation rate per resource type
rate(infra_operator_reconcile_total[5m])
# Success rate
sum(rate(infra_operator_reconcile_total{result="success"}[5m])) by (resource_type)
infra_operator_reconcile_duration_seconds
Type: Histogram
Labels: resource_type
Description: Duration of reconciliation operations in seconds
# p95 reconciliation duration
histogram_quantile(0.95, rate(infra_operator_reconcile_duration_seconds_bucket[5m]))
# Average duration per resource type
rate(infra_operator_reconcile_duration_seconds_sum[5m]) / rate(infra_operator_reconcile_duration_seconds_count[5m])
infra_operator_reconcile_errors_total
Type: Counter
Labels: resource_type, error_type
Description: Total number of reconciliation errors by type
Error Types:
get_failed- Failed to get resource from Kubernetessync_failed- Failed to sync with AWSstatus_update_failed- Failed to update statusprovider_failed- Failed to get AWS providerfinalizer_failed- Failed during cleanup
# Error rate per resource type
rate(infra_operator_reconcile_errors_total[5m])
# Error rate percentage
sum(rate(infra_operator_reconcile_errors_total[5m])) / sum(rate(infra_operator_reconcile_total[5m]))
2. Resource Metrics
Track the current state of managed resources.
infra_operator_resources_total
Type: Gauge
Labels: resource_type, status
Description: Current number of managed resources by type and status
Status Values:
ready- Resource is ready and syncednotready- Resource is not readypending- Resource is being createddeleting- Resource is being deletederror- Resource is in error state
# Total resources by type
sum(infra_operator_resources_total) by (resource_type)
# Resources not ready
infra_operator_resources_total{status="notready"}
# Percentage ready
sum(infra_operator_resources_total{status="ready"}) / sum(infra_operator_resources_total)
infra_operator_resource_creation_timestamp_seconds
Type: Gauge
Labels: resource_type, resource_name
Description: Unix timestamp when the resource was created
# Age of resources in hours
(time() - infra_operator_resource_creation_timestamp_seconds) / 3600
3. AWS API Metrics
Monitor AWS API performance, errors, and throttling.
infra_operator_aws_api_calls_total
Type: Counter
Labels: service, operation, result
Description: Total number of AWS API calls
Services: EC2, S3, RDS, ELBv2, IAM, SecretsManager, KMS, Lambda, APIGateway, CloudFront, Route53, ACM, DynamoDB, SQS, SNS, ECR, ECS, EKS, ElastiCache
# API call rate by service
sum(rate(infra_operator_aws_api_calls_total[5m])) by (service)
# Top 10 most called operations
topk(10, sum(rate(infra_operator_aws_api_calls_total[5m])) by (service, operation))
infra_operator_aws_api_call_duration_seconds
Type: Histogram
Labels: service, operation
Description: Duration of AWS API calls in seconds
# p95 latency per service
histogram_quantile(0.95, sum(rate(infra_operator_aws_api_call_duration_seconds_bucket[5m])) by (le, service))
# Slowest operations
topk(10, histogram_quantile(0.95, sum(rate(infra_operator_aws_api_call_duration_seconds_bucket[5m])) by (le, service, operation)))
infra_operator_aws_api_errors_total
Type: Counter
Labels: service, operation, error_code
Description: Total number of AWS API errors
Common Error Codes:
InvalidParameterValueResourceNotFoundExceptionUnauthorizedOperationDependencyViolationResourceAlreadyExists
# Error rate per service
sum(rate(infra_operator_aws_api_errors_total[5m])) by (service)
# Most common error codes
topk(10, sum(rate(infra_operator_aws_api_errors_total[5m])) by (error_code))
infra_operator_aws_api_throttles_total
Type: Counter
Labels: service, operation
Description: Total number of AWS API throttling events (rate limit exceeded)
# Throttling rate per service
sum(rate(infra_operator_aws_api_throttles_total[5m])) by (service)
# Operations being throttled
infra_operator_aws_api_throttles_total > 0
4. Drift Detection Metrics
Monitor configuration drift and automatic healing.
infra_operator_drift_detected_total
Type: Counter
Labels: resource_type, severity
Description: Total number of configuration drifts detected
Severity Levels:
critical- High severity driftwarning- Medium severity driftinfo- Low severity drift
# Drift detection rate
sum(rate(infra_operator_drift_detected_total[5m])) by (resource_type, severity)
# Critical drifts in last hour
increase(infra_operator_drift_detected_total{severity="critical"}[1h])
infra_operator_drift_healed_total
Type: Counter
Labels: resource_type
Description: Total number of drifts automatically healed
# Healing rate
rate(infra_operator_drift_healed_total[5m])
# Healing success percentage
sum(rate(infra_operator_drift_healed_total[5m])) / sum(rate(infra_operator_drift_detected_total[5m]))
infra_operator_drift_healing_failures_total
Type: Counter
Labels: resource_type, error_type
Description: Total number of failed drift healing attempts
# Healing failure rate
rate(infra_operator_drift_healing_failures_total[5m])
infra_operator_drift_detection_duration_seconds
Type: Histogram
Labels: resource_type
Description: Duration of drift detection operations
# p95 drift detection duration
histogram_quantile(0.95, rate(infra_operator_drift_detection_duration_seconds_bucket[5m]))
5. Finalizer Metrics
Track cleanup and deletion performance.
infra_operator_finalizer_duration_seconds
Type: Histogram
Labels: resource_type
Description: Duration of finalizer execution (resource cleanup)
# p95 finalizer duration
histogram_quantile(0.95, rate(infra_operator_finalizer_duration_seconds_bucket[5m]))
# Slowest finalizers
topk(5, histogram_quantile(0.95, sum(rate(infra_operator_finalizer_duration_seconds_bucket[5m])) by (le, resource_type)))
infra_operator_finalizer_errors_total
Type: Counter
Labels: resource_type, error_type
Description: Total number of finalizer execution errors
# Finalizer error rate
rate(infra_operator_finalizer_errors_total[5m])
6. Provider Metrics
Monitor AWS Provider status and credentials.
infra_operator_provider_ready
Type: Gauge
Labels: provider_name, region
Description: Indicates if an AWS Provider is ready (1) or not (0)
# Providers not ready
infra_operator_provider_ready == 0
# Ready percentage
avg(infra_operator_provider_ready)
infra_operator_provider_credential_rotations_total
Type: Counter
Labels: provider_name
Description: Total number of AWS credential rotations
# Credential rotation rate
rate(infra_operator_provider_credential_rotations_total[24h])
7. Workqueue Metrics
Monitor controller work queue depth and throughput.
infra_operator_workqueue_depth
Type: Gauge
Labels: resource_type
Description: Current depth of the work queue for each resource type
# Queue depth per resource type
infra_operator_workqueue_depth
# Deep queues (potential backlog)
infra_operator_workqueue_depth > 100
infra_operator_workqueue_adds_total
Type: Counter
Labels: resource_type
Description: Total number of items added to work queue
# Queue add rate
rate(infra_operator_workqueue_adds_total[5m])
8. Cache Metrics
Monitor caching effectiveness (future feature).
infra_operator_cache_hits_total
Type: Counter
Labels: resource_type
Description: Total number of successful cache lookups
infra_operator_cache_misses_total
Type: Counter
Labels: resource_type
Description: Total number of failed cache lookups
# Cache hit rate
sum(rate(infra_operator_cache_hits_total[5m])) / (sum(rate(infra_operator_cache_hits_total[5m])) + sum(rate(infra_operator_cache_misses_total[5m])))
Common Query Examples
Health & Performance
# Overall health - no errors in last 5 minutes
sum(rate(infra_operator_reconcile_errors_total[5m])) == 0
# Reconciliation throughput
sum(rate(infra_operator_reconcile_total[5m]))
# Average reconciliation duration
rate(infra_operator_reconcile_duration_seconds_sum[5m]) / rate(infra_operator_reconcile_duration_seconds_count[5m])
# Error budget (99.9% SLO)
1 - (sum(rate(infra_operator_reconcile_errors_total[30d])) / sum(rate(infra_operator_reconcile_total[30d])))
Resource Management
# Total managed resources
sum(infra_operator_resources_total)
# Resources in error state
sum(infra_operator_resources_total{status="error"})
# Oldest resources
topk(10, time() - infra_operator_resource_creation_timestamp_seconds)
AWS API Monitoring
# AWS API error rate percentage
sum(rate(infra_operator_aws_api_errors_total[5m])) / sum(rate(infra_operator_aws_api_calls_total[5m]))
# Services with highest error rates
topk(5, sum(rate(infra_operator_aws_api_errors_total[5m])) by (service))
# Throttled operations
sum(rate(infra_operator_aws_api_throttles_total[5m])) by (service, operation) > 0
Drift Monitoring
# Critical drifts detected
sum(increase(infra_operator_drift_detected_total{severity="critical"}[1h]))
# Drift healing success rate
sum(rate(infra_operator_drift_healed_total[5m])) / sum(rate(infra_operator_drift_detected_total[5m]))
# Resources with most drift
topk(5, sum(increase(infra_operator_drift_detected_total[24h])) by (resource_type))
Installation
Deploy ServiceMonitor
Command:
kubectl apply -f config/prometheus/servicemonitor.yaml
This creates:
- ServiceMonitor: Tells Prometheus to scrape the operator
- PodMonitor: Alternative direct pod scraping
- PrometheusRule: Pre-configured alerts
Import Grafana Dashboard
- Open Grafana UI
- Go to Dashboards → Import
- Upload
config/grafana/dashboard.json - Select your Prometheus datasource
- Click Import
Alerting Rules
The operator includes pre-configured Prometheus alerts in config/prometheus/servicemonitor.yaml:
Critical Alerts
- InfraOperatorNoReconciliations: No reconciliations in 10 minutes (operator may be down)
- InfraOperatorProviderNotReady: AWS Provider not ready for >5 minutes
- InfraOperatorDriftDetected: Critical drift detected
Warning Alerts
- InfraOperatorHighErrorRate: Reconciliation error rate >10%
- InfraOperatorSlowReconciliation: p95 reconciliation duration >30s
- InfraOperatorAWSAPIErrors: AWS API error rate >1 error/sec
- InfraOperatorAWSAPIThrottled: AWS API throttling detected
- InfraOperatorDriftHealingFailed: Drift healing failures
- InfraOperatorResourcesNotReady: >5 resources not ready for >15 minutes
- InfraOperatorSlowFinalizer: p95 finalizer duration >60s
Customize Alerts
Edit config/prometheus/servicemonitor.yaml to adjust:
- Thresholds
- Duration (for)
- Severity levels
- Alert labels and annotations
Integration with Existing Monitoring
Prometheus Operator
If you have Prometheus Operator installed:
kubectl apply -f config/prometheus/servicemonitor.yaml
The ServiceMonitor will be automatically discovered.
Manual Prometheus Configuration
Add to your prometheus.yml:
scrape_configs:
- job_name: 'infra-operator'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_control_plane]
regex: controller-manager
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
regex: metrics
action: keep
Datadog Integration
Use Datadog's Prometheus integration:
apiVersion: v1
kind: ConfigMap
metadata:
name: datadog-checks
data:
prometheus.yaml: |
instances:
- prometheus_url: http://infra-operator-controller-manager-metrics-service.infra-operator-system:8443/metrics
namespace: infra_operator
metrics:
- infra_operator_*
New Relic Integration
Use New Relic's Prometheus exporter:
helm repo add newrelic https://helm-charts.newrelic.com
helm install newrelic-prometheus newrelic/nri-prometheus \
--set cluster=infra-operator-cluster \
--set config.scrape_enabled_label=prometheus.io/scrape
Dashboard Panels
The included Grafana dashboard provides:
- Reconciliation Rate - Real-time reconciliation throughput
- Error Rate - Percentage of failed reconciliations
- Reconciliation Duration - p95 and p50 latencies
- Resources by Status - Pie chart of resource states
- AWS API Call Rate - API usage by service
- AWS API Duration - p95 API latencies
- AWS API Errors - Error breakdown by service/operation
- AWS Throttling - Rate limiting events
- Drift Detected - Drift events by severity
- Drift Healing - Success and failure rates
- Provider Status - AWS Provider health
- Workqueue Depth - Queue backlogs
Metric Retention
Recommended Retention Periods
- Short-term (15 days): All metrics at 30s resolution
- Medium-term (90 days): Downsampled to 5m resolution
- Long-term (1 year): Downsampled to 1h resolution (for trends)
Prometheus Configuration
Example:
global:
scrape_interval: 30s
evaluation_interval: 30s
storage:
tsdb:
retention.time: 15d
Thanos/Cortex for Long-term
Use Thanos or Cortex for:
- Long-term storage (>15 days)
- Global view across clusters
- Downsampling
- High availability
Troubleshooting
Metrics Not Appearing
-
Check ServiceMonitor is created:
kubectl get servicemonitor -n infra-operator-system -
Verify Prometheus is scraping:
up{job="infra-operator"} -
Check operator logs:
kubectl logs -n infra-operator-system deployment/infra-operator-controller-manager
High Cardinality
If you have too many resources causing high cardinality:
-
Limit resource metrics:
// Only track essential statuses
ResourcesTotal.WithLabelValues(resourceType, "ready").Inc()
ResourcesTotal.WithLabelValues(resourceType, "notready").Inc() -
Use recording rules:
- record: infra_operator:reconcile_total:rate5m
expr: sum(rate(infra_operator_reconcile_total[5m])) by (resource_type)
Missing Historical Data
If you need historical data:
- Enable long-term storage (Thanos/Cortex)
- Increase retention period
- Use recording rules for pre-aggregated data
Best Practices
- Enable ServiceMonitor for Prometheus Operator — Automatic metric discovery and scraping configuration
- Set appropriate alert thresholds — Configure alerts for error rates >5%, p95 latency >30s, drift detection
- Use Grafana dashboards — Import pre-built dashboards for operator monitoring and AWS API metrics
- Monitor AWS API throttling — Set alerts on infra_operator_aws_api_throttled_total to detect rate limiting
- Track reconciliation duration — Monitor p95 reconciliation times to identify performance bottlenecks
Reference
Metric Naming Convention
All metrics follow the format:
infra_operator_<component>_<metric>_<unit>
Examples:
infra_operator_reconcile_total(counter, no unit)infra_operator_reconcile_duration_seconds(histogram, seconds)infra_operator_resources_total(gauge, count)
Labels
Standard labels used:
resource_type- Type of AWS resource (VPC, Subnet, etc.)service- AWS service name (EC2, S3, RDS, etc.)operation- AWS API operation nameresult- Operation result (success, error, requeue)status- Resource status (ready, notready, pending, deleting)severity- Drift severity (critical, warning, info)error_type- Type of error encounterederror_code- AWS API error codeprovider_name- Name of AWS Providerregion- AWS region
Support
For questions or issues with metrics:
- Check the documentation
- Review example queries
- Join our Slack community
- Open an issue