Drift Detection and Auto-Healing

Overview

Drift detection automatically identifies when the actual state of AWS resources differs from the desired state defined in your Kubernetes Custom Resources (CRs), even when the CRs themselves have not changed.

What is Drift?

Drift occurs when infrastructure is modified outside the operator, such as:

Manual changes via AWS Console
Modifications via AWS CLI or other tools
Changes made by other automation systems
Accidental modifications by team members
AWS service updates or migrations

Why Drift Detection Matters

Without drift detection:

Manual changes can break your infrastructure
Compliance violations go unnoticed
Configuration inconsistencies accumulate
Debugging becomes harder over time
GitOps workflows lose their single source of truth

With drift detection:

Automatic detection of all changes
Configurable auto-healing or alerts
Detailed drift reports in resource status
Kubernetes events for monitoring
Maintains infrastructure consistency

How It Works

Detection Process

Periodic Checks: Controller checks for drift at configured intervals (default: 5 minutes)
Comparison: Compares desired state (CR) with actual state (AWS API)
Classification: Categorizes drifts by severity (low, medium, high)
Filtering: Applies ignore patterns and severity thresholds
Action: Auto-heals or alerts based on configuration
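
The comparison step can be pictured as a field-by-field diff between two flattened maps. The following is a minimal sketch, assuming both the CR spec and the AWS state have already been flattened to string key/value pairs. `DriftItem` and `detectDrift` are hypothetical names chosen to mirror the `driftDetails` entries shown on this page; they are not taken from the operator's source.

```go
package main

import (
	"fmt"
	"sort"
)

// DriftItem mirrors one entry of status.driftDetails (field names assumed).
type DriftItem struct {
	Field    string
	Expected string
	Actual   string
}

// detectDrift compares desired (CR) and actual (AWS) state, both flattened
// to field-path -> value maps, and returns one DriftItem per difference.
// A field present on only one side is reported with an empty string on the
// missing side.
func detectDrift(desired, actual map[string]string) []DriftItem {
	var drifts []DriftItem
	for field, want := range desired {
		if got, ok := actual[field]; !ok || got != want {
			drifts = append(drifts, DriftItem{Field: field, Expected: want, Actual: got})
		}
	}
	for field, got := range actual {
		if _, ok := desired[field]; !ok {
			drifts = append(drifts, DriftItem{Field: field, Expected: "", Actual: got})
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Slice(drifts, func(i, j int) bool { return drifts[i].Field < drifts[j].Field })
	return drifts
}

func main() {
	desired := map[string]string{"enableDnsHostnames": "true", "tags.Environment": "production"}
	actual := map[string]string{"enableDnsHostnames": "false", "tags.Environment": "prod", "tags.Owner": "john"}
	for _, d := range detectDrift(desired, actual) {
		fmt.Printf("%s: expected=%q actual=%q\n", d.Field, d.Expected, d.Actual)
	}
}
```

Classification and filtering would then run over the returned slice before any action is taken.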

Configuration

Enable Drift Detection in AWSProvider

Configure drift detection at the provider level:

Example:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: production-aws
spec:
  region: us-east-1
  roleARN: arn:aws:iam::123456789012:role/infra-operator-role

  # Drift Detection Configuration
  driftDetection:
    # Enable drift detection (default: true)
    enabled: true

    # Drift check interval (default: "5m")
    # Accepts: "1m", "5m", "15m", "1h", etc.
    checkInterval: "5m"

    # Auto-heal detected drifts (default: true)
    # true: automatically fixes by updating AWS
    # false: only alerts via events and status (alert-only mode)
    autoHeal: true

    # Fields to ignore in drift detection
    # Supports wildcards (*)
    ignoreFields:
      - "tags.aws:*"           # Ignore AWS-managed tags
      - "lastModified"         # Ignore modification timestamps
      - "status.*"             # Ignore status fields
      - "tags.LastUpdated"     # Ignore specific tags

    # Minimum severity level to report (default: "medium")
    # Options: "low", "medium", "high"
    # Only drifts at this level or above trigger reconciliation
    severityThreshold: "medium"

Configuration Options

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable/disable drift detection
`checkInterval`	string	`"5m"`	Frequency of drift checks
`autoHeal`	boolean	`true`	Automatically fix vs alert only
`ignoreFields`	[]string	`["tags.aws:*", "lastModified", "status.*"]`	Field patterns to ignore
`severityThreshold`	string	`"medium"`	Minimum severity to act on
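
The interval strings documented here ("1m", "5m", "1h") have the same shape as Go duration strings. The sketch below assumes they can be parsed with the standard library's `time.ParseDuration`, and uses the wildcard form of the ignore patterns from the YAML example as defaults. `DriftDetectionConfig`, `defaultConfig`, and `validate` are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// DriftDetectionConfig is a hypothetical in-memory form of spec.driftDetection.
type DriftDetectionConfig struct {
	Enabled           bool
	CheckInterval     string
	AutoHeal          bool
	IgnoreFields      []string
	SeverityThreshold string
}

// defaultConfig returns the default values documented on this page.
func defaultConfig() DriftDetectionConfig {
	return DriftDetectionConfig{
		Enabled:           true,
		CheckInterval:     "5m",
		AutoHeal:          true,
		IgnoreFields:      []string{"tags.aws:*", "lastModified", "status.*"},
		SeverityThreshold: "medium",
	}
}

// validate parses checkInterval and checks that severityThreshold is one of
// the three documented levels. It returns the parsed interval on success.
func (c DriftDetectionConfig) validate() (time.Duration, error) {
	d, err := time.ParseDuration(c.CheckInterval)
	if err != nil {
		return 0, fmt.Errorf("invalid checkInterval %q: %w", c.CheckInterval, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("checkInterval must be positive, got %s", d)
	}
	switch c.SeverityThreshold {
	case "low", "medium", "high":
		// Known level, nothing to do.
	default:
		return 0, fmt.Errorf("unknown severityThreshold %q", c.SeverityThreshold)
	}
	return d, nil
}

func main() {
	cfg := defaultConfig()
	interval, err := cfg.validate()
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("checking for drift every", interval)
}
```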

Drift Severity Levels

The operator automatically assigns severity levels based on the field that drifted:

High Severity

Fields affecting security, networking, or critical functionality:

Security groups
IAM roles
Encryption configurations
Public access flags
CIDR blocks
Network configurations

Example:

# High severity drift detected
status:
  driftDetected: true
  driftDetails:
    - field: "securityGroupIds[0]"
      expected: "sg-prod-123"
      actual: "sg-dev-456"
      severity: "high"

Medium Severity

Fields affecting functionality but not security:

Instance types
Storage sizes
Connection configurations
Resource settings

Example:

# Medium severity drift detected
status:
  driftDetected: true
  driftDetails:
    - field: "instanceType"
      expected: "t3.large"
      actual: "t3.medium"
      severity: "medium"

Low Severity

Metadata and cosmetic fields:

Tags (except security tags)
Descriptions
Names
Non-critical metadata

Example:

# Low severity drift detected
status:
  driftDetected: true
  driftDetails:
    - field: "tags.Description"
      expected: "Production VPC"
      actual: "Prod VPC"
      severity: "low"
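
The exact classification rules are not spelled out on this page. As a rough sketch of how a field path could be mapped to a severity, the example below uses keyword matching. The keyword list and the `classify` function are assumptions made for illustration; the operator's real rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Severity orders drift levels so they can be compared numerically.
type Severity int

const (
	Low Severity = iota
	Medium
	High
)

// String returns the lowercase name used in status.driftDetails.
func (s Severity) String() string {
	switch s {
	case High:
		return "high"
	case Medium:
		return "medium"
	default:
		return "low"
	}
}

// highKeywords are illustrative substrings for security- and network-related
// fields. This list is an assumption, not the operator's actual rule set.
var highKeywords = []string{"securitygroup", "iamrole", "rolearn", "encrypt", "publicaccess", "cidr"}

// classify returns High for fields containing a high-severity keyword, Low
// for tags and plain name/description fields, and Medium for everything else.
func classify(field string) Severity {
	f := strings.ToLower(field)
	for _, k := range highKeywords {
		if strings.Contains(f, k) {
			return High
		}
	}
	if strings.HasPrefix(f, "tags.") || f == "description" || f == "name" {
		return Low
	}
	return Medium
}

func main() {
	for _, f := range []string{"securityGroupIds[0]", "instanceType", "tags.Description"} {
		fmt.Printf("%s -> %s\n", f, classify(f))
	}
}
```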

Checking Drift Status

View Drift in Resource Status

Command:

kubectl get vpc production-vpc -o yaml

Example:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: VPC
metadata:
  name: production-vpc
spec:
  cidrBlock: "10.0.0.0/16"
  tags:
    Environment: production
status:
  ready: true
  vpcID: vpc-0123456789abcdef

  # Drift Detection Status
  driftDetected: true
  lastDriftCheck: "2024-11-23T10:30:00Z"
  driftDetails:
    - field: "tags.Environment"
      expected: "production"
      actual: "prod"
      severity: "low"
    - field: "enableDnsHostnames"
      expected: "true"
      actual: "false"
      severity: "medium"

View Drift Events

Command:

kubectl get events --field-selector involvedObject.name=production-vpc

Output:

LAST SEEN   TYPE      REASON              MESSAGE
2m          Warning   DriftDetected       Detected 2 drift(s) for VPC vpc-123 (high: 0, medium: 1, low: 1)
2m          Warning   HighSeverityDrift   High severity drift in securityGroupIds: desired=sg-prod, actual=sg-dev
1m          Normal    DriftHealed         Auto-healed 2 drift(s) for VPC vpc-123

Monitor Drift with kubectl

Command:

# Check all resources with drift
kubectl get vpc,elasticip,s3bucket -A -o json | \
  jq '.items[] | select(.status.driftDetected == true) | {name: .metadata.name, drifts: .status.driftDetails}'

# Count drifts by severity
kubectl get vpc -o json | \
  jq '[.items[].status.driftDetails[]?.severity] | group_by(.) | map({severity: .[0], count: length})'

Auto-Healing

When autoHeal is set to true, the operator automatically fixes detected drifts.

How Auto-Healing Works

Detect Drift: Operator detects difference between CR and AWS
Evaluate: Skips drifts that match an ignore pattern or fall below the severity threshold
Fix: Updates AWS resource to match CR specification
Verify: Re-checks to confirm drift is resolved
Log: Creates Kubernetes event and updates status
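
The evaluation step combines the severity threshold with the autoHeal flag. One way to express that decision is sketched below, under the assumption that drifts below the threshold are skipped entirely and that alert-only mode still reports the rest. `decide` and the `Action` values are hypothetical names, not the operator's API.

```go
package main

import "fmt"

// Action is what the controller does with a single detected drift.
type Action string

const (
	Skip  Action = "skip"  // below threshold: no event, no fix
	Alert Action = "alert" // report via events and status only
	Heal  Action = "heal"  // update AWS to match the CR
)

// rank gives each documented severity a comparable weight.
// Unknown severities map to 0 and are therefore always skipped.
var rank = map[string]int{"low": 1, "medium": 2, "high": 3}

// decide applies the severity threshold first, then the autoHeal switch.
func decide(severity, threshold string, autoHeal bool) Action {
	if rank[severity] < rank[threshold] {
		return Skip
	}
	if !autoHeal {
		return Alert
	}
	return Heal
}

func main() {
	fmt.Println(decide("low", "medium", true))   // below threshold
	fmt.Println(decide("high", "medium", true))  // healed automatically
	fmt.Println(decide("high", "medium", false)) // alert-only mode
}
```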

Example: Auto-Healing Tags

Example:

# CR defines these tags
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: VPC
metadata:
  name: prod-vpc
spec:
  cidrBlock: "10.0.0.0/16"
  tags:
    Environment: production
    Team: platform

Scenario: Someone manually changes tags in AWS Console:

Environment: prod (changed)
Team: platform (no change)
Owner: john (added)

Auto-Healing Process:

Operator detects drift in tags
Logs drift to status and events
Automatically updates AWS tags to match CR:
- Restores Environment: production
- Removes Owner: john (not in CR)
Updates status: driftDetected: false
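
The tag reconciliation in this scenario amounts to computing which tags to write and which to delete. The sketch below shows that computation on plain maps; `planTagChanges` is a hypothetical helper, and a real implementation would presumably apply the ignore patterns first so that externally managed tags are left alone.

```go
package main

import (
	"fmt"
	"sort"
)

// planTagChanges returns the tags that must be created or overwritten (set)
// and the tag keys that must be deleted (remove) so that the actual tags
// end up identical to the desired tags.
func planTagChanges(desired, actual map[string]string) (set map[string]string, remove []string) {
	set = map[string]string{}
	for k, v := range desired {
		if cur, ok := actual[k]; !ok || cur != v {
			set[k] = v
		}
	}
	for k := range actual {
		if _, ok := desired[k]; !ok {
			remove = append(remove, k)
		}
	}
	sort.Strings(remove) // stable order for logging
	return set, remove
}

func main() {
	desired := map[string]string{"Environment": "production", "Team": "platform"}
	actual := map[string]string{"Environment": "prod", "Team": "platform", "Owner": "john"}

	set, remove := planTagChanges(desired, actual)
	fmt.Println("set:", set)
	fmt.Println("remove:", remove)
}
```

For the scenario above, the plan restores Environment to production, removes Owner, and leaves Team untouched.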

Healing Function Implementation

Each resource type has its own healing function:

Go Code:

// Example: VPC healing function
func (r *VPCReconciler) healVPCDrift(ctx context.Context, drifts []DriftItem, vpcCR *VPC) error {
// Convert CR to domain model
v := mapper.CRToDomainVPC(vpcCR)

// Sync AWS state to match CR
if err := r.vpcUseCase.SyncVPC(ctx, v); err != nil {
        return fmt.Errorf("failed to heal drift: %w", err)
}

return nil
}

Alert-Only Mode

When autoHeal: false, the operator only reports drifts without fixing them:

Example:

driftDetection:
  enabled: true
  autoHeal: false  # Alert-only mode
  checkInterval: "5m"

Behavior:

Detects all drifts
Logs events
Updates status with drift details
DOES NOT modify AWS resources
DOES NOT auto-heal

Use Cases:

Production environments requiring manual approval
Compliance auditing without auto-remediation
Testing drift detection before enabling auto-healing
Resources managed by multiple systems

Ignore Patterns

Field Path Patterns

Use ignore patterns to exclude fields from drift detection:

Example:

driftDetection:
  ignoreFields:
    # Exact match
    - "lastModified"

    # Prefix with wildcard
    - "tags.aws:*"      # Ignore all tags starting with "aws:"
    - "status.*"        # Ignore all status fields

    # Specific tag
    - "tags.LastSync"   # Ignore this specific tag
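
The two pattern forms shown here (exact match and trailing wildcard) can be matched with a few lines of string handling. The sketch below assumes the `*` wildcard only appears at the end of a pattern, which is how every example on this page uses it; `matchesPattern` and `isIgnored` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPattern reports whether field matches pattern. A pattern ending in
// "*" matches any field that starts with the text before the "*"; any other
// pattern must equal the field exactly.
func matchesPattern(pattern, field string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(field, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == field
}

// isIgnored reports whether field matches at least one ignore pattern.
func isIgnored(patterns []string, field string) bool {
	for _, p := range patterns {
		if matchesPattern(p, field) {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"lastModified", "tags.aws:*", "status.*", "tags.LastSync"}
	for _, f := range []string{"tags.aws:createdBy", "tags.Environment", "status.ready", "lastModified"} {
		fmt.Printf("%-20s ignored=%v\n", f, isIgnored(patterns, f))
	}
}
```

Plain prefix matching is used instead of a glob library so that field paths containing brackets, such as securityGroupIds[0], are never interpreted as character classes.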

Common Ignore Patterns

Example:

# AWS-managed fields
ignoreFields:
  - "tags.aws:*"
  - "lastModified"
  - "createdTime"
  - "status.*"

  # External system tags
  - "tags.terraform:*"
  - "tags.cloudformation:*"

  # Monitoring tags (externally managed)
  - "tags.LastBackup"
  - "tags.LastPatched"
  - "tags.MonitoringEnabled"

Best Practices

Start with alert-only mode: Begin with autoHeal disabled to understand what changes occur before enabling auto-remediation
Use appropriate check intervals: 5-10m in production and 1-2m in development, balancing detection speed against AWS API costs
Configure severity thresholds: Set severityThreshold to control which drifts trigger automatic fixes
Ignore external system tags: Use ignoreFields to exclude tags that change outside the operator, such as those prefixed with aws: or cloudformation:
Monitor drift metrics: Configure alerting on the drift_detected_total and drift_healing_failed_total metrics

Troubleshooting

Drift Not Detected

Problem: Drift exists but is not being detected

Solutions:

Check if drift detection is enabled:

kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.enabled}'

Check if check interval has passed:

kubectl get vpc -o jsonpath='{.items[*].status.lastDriftCheck}'

Check if field is in ignore list:

kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.ignoreFields}'

Auto-Healing Not Working

Problem: Drift detected but not fixing

Solutions:

Check if auto-healing is enabled:

kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.autoHeal}'

Check severity threshold:

# If threshold is "high", medium/low severity drifts won't be auto-healed
kubectl get awsprovider -o jsonpath='{.items[*].spec.driftDetection.severityThreshold}'

Check operator logs:

kubectl logs -n infra-operator-system deployment/infra-operator-controller-manager | grep drift

False Positive Drifts

Problem: Legitimate external changes flagged as drift

Solutions:

Add fields to ignore list:

driftDetection:
  ignoreFields:
    - "tags.ManagedByExternal"
    - "specificField"

Raise the severity threshold so that lower-severity drifts are ignored:

driftDetection:
  severityThreshold: "high"  # Ignore low/medium severity

Examples

Example 1: High-Security Production Setup

Example:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: production-aws
spec:
  region: us-east-1
  roleARN: arn:aws:iam::123456789012:role/infra-operator

  driftDetection:
    enabled: true
    checkInterval: "5m"
    autoHeal: true
    severityThreshold: "high"  # Only auto-heal critical issues
    ignoreFields:
      - "tags.aws:*"
      - "tags.backup:*"
      - "lastModified"

Example 2: Development with Full Auto-Healing

Example:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: dev-aws
spec:
  region: us-east-1
  roleARN: arn:aws:iam::123456789012:role/infra-operator

  driftDetection:
    enabled: true
    checkInterval: "1m"        # Faster checks in dev
    autoHeal: true
    severityThreshold: "low"   # Auto-heal everything
    ignoreFields:
      - "tags.aws:*"

Example 3: Audit Mode (No Auto-Healing)

Example:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: audit-aws
spec:
  region: us-east-1
  roleARN: arn:aws:iam::123456789012:role/infra-operator

  driftDetection:
    enabled: true
    checkInterval: "15m"
    autoHeal: false           # Alert only, no changes
    severityThreshold: "low"  # Report all drifts
    ignoreFields: []          # No ignores, detect everything

Related Features

Status Management - How status is updated
Deletion Policies - Resource cleanup
AWS Provider - Credential management

API Reference

See the complete API specification in the CRD documentation.
