Troubleshooting

This guide covers common issues you may encounter when using Infra Operator and how to resolve them.

Diagnostic Commands

Check Operator Status

Command:

# Verify operator is running
kubectl get pods -n infra-operator

# Check operator logs
kubectl logs -n infra-operator deploy/infra-operator --tail=100

# Follow logs in real-time
kubectl logs -n infra-operator deploy/infra-operator -f

Check Resource Status

Command:

# List all resources
kubectl get vpc,subnet,sg,s3,ec2 -A

# Get detailed status
kubectl describe vpc my-vpc

# Check events
kubectl get events -n infra-operator --sort-by='.lastTimestamp'

Check AWSProvider

Command:

# Verify provider is ready
kubectl get awsprovider

# Check provider details
kubectl describe awsprovider aws-production

Common Issues

Operator Not Starting

Symptoms: Operator pod is in CrashLoopBackOff or Error state

Check logs:

kubectl logs -n infra-operator deploy/infra-operator --previous

Common causes:

Missing CRDs

Command:

# Verify CRDs are installed
kubectl get crds | grep aws-infra-operator.runner.codes

# Reinstall if missing
kubectl apply -f chart/crds/

Invalid RBAC

Command:

# Check ServiceAccount
kubectl get sa -n infra-operator

# Check ClusterRole
kubectl get clusterrole | grep infra-operator

Resource limits too low

Example:

# Increase in values.yaml
operator:
  resources:
    limits:
      memory: 512Mi
    requests:
      memory: 256Mi
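
To confirm that memory limits are the cause before raising them, you can inspect why the operator container last terminated. The sketch below is illustrative only: the helper names `explain_termination` and `last_termination_reason` are hypothetical, and it assumes the operator pods live in the `infra-operator` namespace used throughout this guide.

```shell
# Hypothetical helper: map a container's last termination reason to a likely cause.
explain_termination() {
  case "$1" in
    *OOMKilled*) echo "memory limit exceeded: consider raising operator.resources.limits.memory" ;;
    "")          echo "no previous termination recorded" ;;
    *)           echo "terminated with reason '$1': check the previous container logs" ;;
  esac
}

# Hypothetical wrapper: read the last termination reason(s) from every pod
# in the operator namespace (assumed to be infra-operator).
last_termination_reason() {
  kubectl get pods -n infra-operator \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
}
```

A typical call would be `explain_termination "$(last_termination_reason)"`. If the result points to `OOMKilled`, raising the memory limit is likely to help; any other reason suggests looking at the `--previous` logs first.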

AWSProvider Not Ready

Symptoms: AWSProvider shows ready: false

Check:

kubectl describe awsprovider aws-production

Common causes:

Invalid credentials

Command:

# Verify Secret exists
kubectl get secret aws-credentials -n infra-operator

# Test credentials (export them so the AWS CLI can read them)
export AWS_ACCESS_KEY_ID=$(kubectl get secret aws-credentials -n infra-operator \
  -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d)
export AWS_SECRET_ACCESS_KEY=$(kubectl get secret aws-credentials -n infra-operator \
  -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d)
aws sts get-caller-identity
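
Credentials that look correct can still fail if the Secret was created with a stray newline or space in the value. This is a general Kubernetes pitfall, not something specific to Infra Operator. Command substitution strips trailing newlines, so the check has to run on the raw byte stream. A minimal sketch, using a hypothetical helper name `count_whitespace_bytes`:

```shell
# Hypothetical helper: print how many whitespace bytes (spaces, tabs, newlines)
# are present in the value read from stdin. A clean credential prints 0.
count_whitespace_bytes() {
  # Keep only whitespace characters, count them, and strip wc's padding.
  tr -cd '[:space:]' | wc -c | tr -d ' '
}
```

For example, `kubectl get secret aws-credentials -n infra-operator -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d | count_whitespace_bytes` should print `0`; any other value suggests the Secret needs to be recreated without the extra whitespace.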

Wrong region

Example:

# Verify region in provider
spec:
  region: us-east-1  # Must match your AWS resources

IRSA not configured (EKS)

Command:

# Check ServiceAccount annotation
kubectl get sa infra-operator -n infra-operator -o yaml | grep eks.amazonaws.com

# Verify IAM role trust policy
aws iam get-role --role-name infra-operator-role

Resource Stuck in Pending

Symptoms: Resource stays in pending state indefinitely

Check:

kubectl describe vpc my-vpc
kubectl logs -n infra-operator deploy/infra-operator | grep "my-vpc"

Common causes:

AWSProvider not ready

Command:

kubectl get awsprovider
# Ensure provider referenced by resource is ready

AWS API errors

Command:

# Check operator logs for AWS errors
kubectl logs -n infra-operator deploy/infra-operator | grep -i "error\|failed"

Rate limiting

Command:

# Look for throttling errors
kubectl logs -n infra-operator deploy/infra-operator | grep -i "throttl"
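
To judge how severe throttling is, it can help to count matching log lines over a fixed window rather than scan them by eye. The sketch below uses a hypothetical helper, `count_throttling`, and assumes the operator logs AWS error codes verbatim; `RequestLimitExceeded` is the code the EC2 API uses for throttled requests.

```shell
# Hypothetical helper: count log lines on stdin that look like AWS throttling.
# Matches "throttl..." (e.g. ThrottlingException) and EC2's RequestLimitExceeded.
count_throttling() {
  # grep -c exits non-zero when the count is 0, so mask that exit status.
  grep -Eci 'throttl|RequestLimitExceeded' || true
}
```

A possible invocation is `kubectl logs -n infra-operator deploy/infra-operator --since=10m | count_throttling`. A count that keeps growing between runs suggests the operator is retrying against the same limit.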

Resource Won't Delete

Symptoms: Resource stuck in Terminating state

Check:

kubectl get vpc my-vpc -o yaml | grep -A 10 finalizers

Solutions:

Check for dependent resources

Command:

# VPC can't be deleted if it has subnets, IGWs, etc.
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-xxx"
aws ec2 describe-internet-gateways --filters "Name=attachment.vpc-id,Values=vpc-xxx"

Remove finalizer (last resort)

Command:

# WARNING: This may leave AWS resources orphaned
kubectl patch vpc my-vpc -p '{"metadata":{"finalizers":[]}}' --type=merge

Delete with a timeout

Command:
```
# Gives up waiting after 30s; the resource stays in Terminating while finalizers remain
kubectl delete vpc my-vpc --timeout=30s
```

Drift Detected

Symptoms: Resource shows drift between Kubernetes spec and AWS

Check:

kubectl describe vpc my-vpc | grep -A 5 "Drift"

Solutions:

Update spec to match AWS

Command:

# Get current AWS state
aws ec2 describe-vpcs --vpc-ids vpc-xxx

# Update Kubernetes resource to match
kubectl edit vpc my-vpc

Force reconciliation

Command:

# Add annotation to trigger reconcile
kubectl annotate vpc my-vpc force-reconcile="$(date +%s)" --overwrite

Enable auto-remediation

Example:

spec:
  driftDetection:
    enabled: true
    autoRemediate: true

EC2 Instance Won't Start

Symptoms: EC2Instance stuck in pending or stopped

Check:

kubectl describe ec2instance my-instance
aws ec2 describe-instances --instance-ids i-xxx

Common causes:

Invalid AMI

Command:

# Check if AMI exists in region
aws ec2 describe-images --image-ids ami-xxx

Invalid instance type

Command:

# Check that the instance type exists in the current region
aws ec2 describe-instance-types --instance-types t3.micro

Subnet/Security Group issues

Command:

# Verify subnet exists
aws ec2 describe-subnets --subnet-ids subnet-xxx

# Verify security group
aws ec2 describe-security-groups --group-ids sg-xxx

Insufficient capacity

Command:

# Try different AZ or instance type
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=t3.micro
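
The offerings query can be narrowed to a yes/no answer for one Availability Zone. The following is a sketch under stated assumptions: `offered_zones` and `az_offered` are hypothetical helper names, and the zone names in the usage note are placeholders.

```shell
# Hypothetical wrapper: print the zones in the current region that offer
# the instance type given as $1, as whitespace-separated text.
offered_zones() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Hypothetical helper: succeed if the zone given as $1 appears in the
# whitespace-separated list of zones read from stdin.
az_offered() {
  tr -s '[:space:]' '\n' | grep -qxF "$1"
}
```

For instance, `offered_zones t3.micro | az_offered us-east-1a && echo "offered"` prints `offered` only when the zone lists that instance type. Note that an offering does not guarantee capacity at launch time; it only rules out zones where the type is never available.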

S3 Bucket Permission Denied

Symptoms: S3Bucket creation fails with access denied

Check:

kubectl logs -n infra-operator deploy/infra-operator | grep -i "s3\|bucket"

Solutions:

Check IAM permissions

JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:DeleteBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketTagging",
        "s3:PutBucketTagging",
        "s3:GetBucketVersioning",
        "s3:PutBucketVersioning",
        "s3:GetEncryptionConfiguration",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "*"
    }
  ]
}

Bucket name already exists

Command:

# S3 bucket names are globally unique.
# A 403 means the name is taken by another account; a 404 means it is free.
aws s3api head-bucket --bucket my-bucket-name

LocalStack Connection Issues

Symptoms: Resources fail when using LocalStack

Check:

# Verify LocalStack is running
kubectl get pods | grep localstack

# Test connectivity
kubectl run test --rm -it --image=curlimages/curl -- \
  curl http://localstack.default.svc.cluster.local:4566/_localstack/health

Solutions:

Check endpoint URL

Example:

# In AWSProvider
spec:
  endpoint: http://localstack.default.svc.cluster.local:4566

Check LocalStack services

Command:

# List running services
curl -s http://localhost:4566/_localstack/health | jq .
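
When only one service matters, the health output can be reduced to a pass/fail check without `jq`. This is a rough sketch: `localstack_service_ok` is a hypothetical helper, and it assumes the health JSON reports each service as a `"name": "status"` pair with `available` or `running` meaning usable, which may differ between LocalStack versions.

```shell
# Hypothetical helper: succeed if service $1 is reported as "available" or
# "running" in the health JSON read from stdin (status names are an assumption).
localstack_service_ok() {
  grep -Eq "\"$1\"[[:space:]]*:[[:space:]]*\"(available|running)\""
}
```

For example, `curl -s http://localhost:4566/_localstack/health | localstack_service_ok s3 && echo "s3 ok"` prints `s3 ok` only when the S3 service is reported as usable.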

Performance Issues

Slow Reconciliation

Symptoms: Resources take a long time to sync

Solutions:

Increase concurrency

Example:

# In operator deployment
args:
  - --max-concurrent-reconciles=10

Check rate limits

Command:

# Monitor AWS API calls
kubectl logs -n infra-operator deploy/infra-operator | grep -i "rate\|limit"

High Memory Usage

Symptoms: Operator uses excessive memory

Solutions:

Increase memory limits

Example:

operator:
  resources:
    limits:
      memory: 1Gi

Reduce cache size

Example:
```
args:
  - --cache-size=100
```

Getting Help

Collect Debug Information

Command:

# Create debug bundle
mkdir -p debug-bundle
kubectl get pods -n infra-operator -o yaml > debug-bundle/pods.yaml
kubectl logs -n infra-operator deploy/infra-operator > debug-bundle/logs.txt
kubectl get crds | grep aws-infra-operator.runner.codes > debug-bundle/crds.txt
kubectl get awsprovider,vpc,subnet,sg -A -o yaml > debug-bundle/resources.yaml
kubectl get events -n infra-operator > debug-bundle/events.txt
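
Before sharing the bundle, it is worth masking anything that looks like a credential. The filter below is a minimal sketch, not a complete scrubber: `redact_credentials` is a hypothetical helper, and it only handles the three standard AWS variable names when they appear as `KEY: value` or `KEY=value` on a single line.

```shell
# Hypothetical helper: replace the value of well-known AWS credential keys
# with REDACTED. Reads text on stdin and writes the filtered text to stdout.
redact_credentials() {
  sed -E 's/((AWS_ACCESS_KEY_ID|AWS_SECRET_ACCESS_KEY|AWS_SESSION_TOKEN)[[:space:]]*[:=][[:space:]]*).*/\1REDACTED/'
}
```

One way to apply it is `redact_credentials < debug-bundle/resources.yaml > debug-bundle/resources.redacted.yaml`, then share the redacted copy. Secrets stored under other key names are not caught, so a manual review is still advisable.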

Report Issues

When reporting issues, include:

Operator version
Kubernetes version
Cloud provider (AWS/LocalStack)
Resource YAML (with credentials redacted)
Operator logs
Error messages

GitHub Issues: https://github.com/andrebassi/infra-operator/issues
