Drift Detection and Auto-Healing
Overview
Drift Detection is a feature that automatically detects when the actual state of AWS resources differs from the desired state defined in your Kubernetes Custom Resources (CRs), even when the CRs have not been changed.
What is Drift?
Drift occurs when infrastructure is modified outside the operator, such as:
- Manual changes via AWS Console
- Modifications via AWS CLI or other tools
- Changes made by other automation systems
- Accidental modifications by team members
- AWS service updates or migrations
Why Drift Detection Matters
Without drift detection:
- Manual changes can break your infrastructure
- Compliance violations go unnoticed
- Configuration inconsistencies accumulate
- Debugging becomes harder over time
- GitOps workflows lose their single source of truth
With drift detection:
- Automatic detection of all changes
- Configurable auto-healing or alerts
- Detailed drift reports in resource status
- Kubernetes events for monitoring
- Maintains infrastructure consistency
How It Works
Detection Process
- Periodic Checks: Controller checks for drift at configured intervals (default: 5 minutes)
- Comparison: Compares desired state (CR) with actual state (AWS API)
- Classification: Categorizes drifts by severity (low, medium, high)
- Filtering: Applies ignore patterns and severity thresholds
- Action: Auto-heals or alerts based on configuration
Configuration
Enable Drift Detection in AWSProvider
Configure drift detection at the provider level:
Example:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: production-aws
spec:
region: us-east-1
roleARN: arn:aws:iam::123456789012:role/infra-operator-role
# Drift Detection Configuration
driftDetection:
# Enable drift detection (default: true)
enabled: true
# Drift check interval (default: "5m")
# Accepts: "1m", "5m", "15m", "1h", etc.
checkInterval: "5m"
# Auto-heal detected drifts (default: true)
# true: automatically fixes by updating AWS
# false: only alerts via events and status (alert-only mode)
autoHeal: true
# Fields to ignore in drift detection
# Supports wildcards (*)
ignoreFields:
- "tags.aws:*" # Ignore AWS-managed tags
- "lastModified" # Ignore modification timestamps
- "status.*" # Ignore status fields
- "tags.LastUpdated" # Ignore specific tags
# Minimum severity level to report (default: "medium")
# Options: "low", "medium", "high"
# Only drifts at this level or above trigger reconciliation
severityThreshold: "medium"
Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable/disable drift detection |
checkInterval | string | "5m" | Frequency of drift checks |
autoHeal | boolean | true | Automatically fix vs alert only |
ignoreFields | []string | ["tags.aws:*", "lastModified", "status.*"] | Field patterns to ignore |
severityThreshold | string | "medium" | Minimum severity to act on |
Drift Severity Levels
The operator automatically assigns severity levels based on the field that drifted:
High Severity
Fields affecting security, networking, or critical functionality:
- Security groups
- IAM roles
- Encryption configurations
- Public access flags
- CIDR blocks
- Network configurations
Example:
# High severity drift detected
status:
driftDetected: true
driftDetails:
- field: "securityGroupIds[0]"
expected: "sg-prod-123"
actual: "sg-dev-456"
severity: "high"
Medium Severity
Fields affecting functionality but not security:
- Instance types
- Storage sizes
- Connection configurations
- Resource settings
Example:
# Medium severity drift detected
status:
driftDetected: true
driftDetails:
- field: "instanceType"
expected: "t3.large"
actual: "t3.medium"
severity: "medium"
Low Severity
Metadata and cosmetic fields:
- Tags (except security tags)
- Descriptions
- Names
- Non-critical metadata
Example:
# Low severity drift detected
status:
driftDetected: true
driftDetails:
- field: "tags.Description"
expected: "Production VPC"
actual: "Prod VPC"
severity: "low"
Checking Drift Status
View Drift in Resource Status
Command:
kubectl get vpc production-vpc -o yaml
Example:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: VPC
metadata:
name: production-vpc
spec:
cidrBlock: "10.0.0.0/16"
tags:
Environment: production
status:
ready: true
vpcID: vpc-0123456789abcdef
# Drift Detection Status
driftDetected: true
lastDriftCheck: "2024-11-23T10:30:00Z"
driftDetails:
- field: "tags.Environment"
expected: "production"
actual: "prod"
severity: "low"
- field: "enableDnsHostnames"
expected: "true"
actual: "false"
severity: "medium"
View Drift Events
Command:
kubectl get events --field-selector involvedObject.name=production-vpc
Output:
LAST SEEN TYPE REASON MESSAGE
2m Warning DriftDetected Detected 2 drift(s) for VPC vpc-123 (high: 0, medium: 1, low: 1)
2m Warning HighSeverityDrift High severity drift in securityGroupIds: desired=sg-prod, actual=sg-dev
1m Normal DriftHealed Auto-healed 2 drift(s) for VPC vpc-123
Monitor Drift with kubectl
Command:
# Check all resources with drift
kubectl get vpc,elasticip,s3bucket -A -o json | \
jq '.items[] | select(.status.driftDetected == true) | {name: .metadata.name, drifts: .status.driftDetails}'
# Count drifts by severity
kubectl get vpc -o json | \
jq '[.items[].status.driftDetails[]?.severity] | group_by(.) | map({severity: .[0], count: length})'
Auto-Healing
When autoHeal: true, the operator automatically fixes drifts:
How Auto-Healing Works
- Detect Drift: Operator detects difference between CR and AWS
- Evaluate: Checks if drift matches ignore patterns or severity threshold
- Fix: Updates AWS resource to match CR specification
- Verify: Re-checks to confirm drift is resolved
- Log: Creates Kubernetes event and updates status
Example: Auto-Healing Tags
Example:
# CR defines these tags
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: VPC
metadata:
name: prod-vpc
spec:
cidrBlock: "10.0.0.0/16"
tags:
Environment: production
Team: platform
Scenario: Someone manually changes tags in AWS Console:
Environment: prod(changed)Team: platform(no change)Owner: john(added)
Auto-Healing Process:
- Operator detects drift in tags
- Logs drift to status and events
- Automatically updates AWS tags to match CR:
- Restores
Environment: production - Removes
Owner: john(not in CR)
- Restores
- Updates status:
driftDetected: false
Healing Function Implementation
Each resource type has its own healing function:
Go Code:
// Example: VPC healing function
func (r *VPCReconciler) healVPCDrift(ctx context.Context, drifts []DriftItem, vpcCR *VPC) error {
// Convert CR to domain model
v := mapper.CRToDomainVPC(vpcCR)
// Sync AWS state to match CR
if err := r.vpcUseCase.SyncVPC(ctx, v); err != nil {
return fmt.Errorf("failed to heal drift: %w", err)
}
return nil
}
Alert-Only Mode
When autoHeal: false, the operator only reports drifts without fixing them:
Example:
driftDetection:
enabled: true
autoHeal: false # Alert-only mode
checkInterval: "5m"
Behavior:
- Detects all drifts
- Logs events
- Updates status with drift details
- DOES NOT modify AWS resources
- DOES NOT auto-heal
Use Cases:
- Production environments requiring manual approval
- Compliance auditing without auto-remediation
- Testing drift detection before enabling auto-healing
- Resources managed by multiple systems
Ignore Patterns
Field Path Patterns
Use ignore patterns to exclude fields from drift detection:
Example:
driftDetection:
ignoreFields:
# Exact match
- "lastModified"
# Prefix with wildcard
- "tags.aws:*" # Ignore all tags starting with "aws:"
- "status.*" # Ignore all status fields
# Specific tag
- "tags.LastSync" # Ignore this specific tag
Common Ignore Patterns
Example:
# AWS-managed fields
ignoreFields:
- "tags.aws:*"
- "lastModified"
- "createdTime"
- "status.*"
# External system tags
- "tags.terraform:*"
- "tags.cloudformation:*"
# Monitoring tags (externally managed)
- "tags.LastBackup"
- "tags.LastPatched"
- "tags.MonitoringEnabled"
Best Practices
- Start with Alert-Only mode — Begin with autoHeal disabled to understand what changes occur before enabling auto-remediation
- Use appropriate check intervals — Production: 5-10m, Development: 1-2m, balance detection speed vs API costs
- Configure severity thresholds — Set autoHealSeverityThreshold to control which drifts trigger automatic fixes
- Ignore external system tags — Use ignoreFields to exclude AWS-managed tags (aws:, cloudformation:) that change externally
- Monitor drift metrics — Configure alerting on drift_detected_total and drift_healing_failed_total metrics
Troubleshooting
Drift Not Detected
Problem: Drift exists but is not being detected
Solutions:
-
Check if drift detection is enabled:
kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.enabled}' -
Check if check interval has passed:
kubectl get vpc -o jsonpath='{.items[*].status.lastDriftCheck}' -
Check if field is in ignore list:
kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.ignoreFields}'
Auto-Healing Not Working
Problem: Drift detected but not fixing
Solutions:
-
Check if auto-healing is enabled:
kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.autoHeal}' -
Check severity threshold:
# If threshold is "high", medium/low severity drifts won't be auto-healed
kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.severityThreshold}' -
Check operator logs:
kubectl logs -n infra-operator-system deployment/infra-operator-controller-manager | grep drift
False Positive Drifts
Problem: Legitimate external changes flagged as drift
Solutions:
-
Add fields to ignore list:
driftDetection:
ignoreFields:
- "tags.ManagedByExternal"
- "specificField" -
Lower severity threshold:
driftDetection:
severityThreshold: "high" # Ignore low/medium severity
Examples
Example 1: High-Security Production Setup
Example:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: production-aws
spec:
region: us-east-1
roleARN: arn:aws:iam::123456789012:role/infra-operator
driftDetection:
enabled: true
checkInterval: "5m"
autoHeal: true
severityThreshold: "high" # Only auto-heal critical issues
ignoreFields:
- "tags.aws:*"
- "tags.backup:*"
- "lastModified"
Example 2: Development with Full Auto-Healing
Example:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: dev-aws
spec:
region: us-east-1
roleARN: arn:aws:iam::123456789012:role/infra-operator
driftDetection:
enabled: true
checkInterval: "1m" # Faster checks in dev
autoHeal: true
severityThreshold: "low" # Auto-heal everything
ignoreFields:
- "tags.aws:*"
Example 3: Audit Mode (No Auto-Healing)
Example:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: audit-aws
spec:
region: us-east-1
roleARN: arn:aws:iam::123456789012:role/infra-operator
driftDetection:
enabled: true
checkInterval: "15m"
autoHeal: false # Alert only, no changes
severityThreshold: "low" # Report all drifts
ignoreFields: [] # No ignores, detect everything
Related Features
- Status Management - How status is updated
- Deletion Policies - Resource cleanup
- AWS Provider - Credential management
API Reference
See the complete API specification in CRD documentation: