Infra Operator - Architecture and Design
Introduction
This document explains the architecture of infra-operator, a Kubernetes Operator that provisions AWS resources using the Custom Resources (CRs) pattern. The operator was developed following Kubernetes community best practices and uses AWS SDK Go v2 for direct interaction with AWS services.
Architecture Overview
Main Components
1. Custom Resource Definitions (CRDs)
CRDs define the structure of custom resources that the operator manages.
AWSProvider CRD
Purpose: Centralize AWS credentials and configurations.
Main fields:
region: AWS region (required)roleARN: IAM role ARN for IRSAaccessKeyIDRef/secretAccessKeyRef: Secret references for static credentialsdefaultTags: Tags applied to all resources
Status:
ready: Indicates if credentials are validaccountID: AWS account IDcallerIdentity: Authenticated identity ARNconditions: Status conditions
File: api/v1alpha1/awsprovider_types.go
S3Bucket CRD
Purpose: Manage S3 buckets with complete configuration.
Main fields:
providerRef: Reference to AWSProviderbucketName: Bucket name (globally unique)versioning: Versioning configurationencryption: Encryption configurationlifecycleRules: Lifecycle rulescorsRules: CORS configurationpublicAccessBlock: Public access blockingdeletionPolicy: Deletion policy (Delete, Retain, Orphan)
Status:
ready: Bucket created and configuredarn: Bucket ARNregion: Region where bucket existsbucketDomainName: Bucket domain name
File: api/v1alpha1/s3bucket_types.go
2. Controllers (Reconcilers)
Controllers implement reconciliation logic - they observe the desired state (spec) and current state, and work to converge current state to desired.
AWSProvider Controller
Responsibilities:
- Validate AWS credentials using STS GetCallerIdentity
- Build AWS configuration (aws.Config) based on spec
- Update status with account information
- Re-validate credentials periodically (every 5 minutes)
Reconciliation flow:
File: controllers/awsprovider_controller.go
S3Bucket Controller
Responsibilities:
- Create S3 bucket if it doesn't exist
- Configure versioning, encryption, lifecycle, CORS, tags
- Configure public access block
- Manage lifecycle (creation, update, deletion)
- Implement finalizers for controlled cleanup
Reconciliation flow:
File: controllers/s3bucket_controller.go
3. AWS Helper Package
Helper functions for AWS interaction.
Main functions:
GetAWSConfigFromProvider(): Gets aws.Config from AWSProvider CRbuildAWSConfig(): Builds AWS configuration with credentialsgetSecretValue(): Reads values from Kubernetes Secrets
File: pkg/aws/provider.go
Implemented Design Patterns
1. Reconciliation Loop
The fundamental pattern of Kubernetes Operators. The controller observes the desired state (CR spec) and works continuously to make current state converge to desired.
Characteristics:
- Idempotent: Multiple executions produce the same result
- Edge-triggered and Level-triggered: Reacts to changes and also periodically checks state
- Error handling: Returns error for automatic requeue
2. Finalizers
Mechanism to execute cleanup before deleting a CR.
Implementation in S3Bucket:
const s3BucketFinalizer = "aws-infra-operator.runner.codes/s3bucket-finalizer"
// When creating/updating:
if !controllerutil.ContainsFinalizer(bucket, s3BucketFinalizer) {
controllerutil.AddFinalizer(bucket, s3BucketFinalizer)
r.Update(ctx, bucket)
}
// When deleting:
if !bucket.ObjectMeta.DeletionTimestamp.IsZero() {
if controllerutil.ContainsFinalizer(bucket, s3BucketFinalizer) {
// Execute cleanup
r.deleteBucket(ctx, s3Client, bucket)
// Remove finalizer
controllerutil.RemoveFinalizer(bucket, s3BucketFinalizer)
r.Update(ctx, bucket)
}
}
3. Status Conditions
Following the Kubernetes pattern of conditions to report state.
Structure:
type Condition struct {
Type string // "Ready"
Status string // "True", "False", "Unknown"
LastTransitionTime metav1.Time
Reason string // "BucketReady", "CreationFailed"
Message string // Detailed description
}
4. Provider Pattern
Separation of credentials (AWSProvider) from resources (S3Bucket, RDS, etc).
Advantages:
- Reuse of credentials across multiple resources
- Centralized authentication validation
- Support for multiple providers (multi-account, multi-region)
- Easier credential rotation
5. Deletion Policies
Flexible policy to control what happens to AWS resources when CR is deleted.
Options:
- Delete: Removes AWS resource (default)
- Retain: Keeps AWS resource, removes only the CR
- Orphan: Detaches from operator but keeps everything
Data Flow
Creating an S3Bucket
IRSA Authentication
File Structure
infra-operator/
│
├── api/v1alpha1/ # API Definitions (CRDs)
│ ├── groupversion_info.go # API group registration
│ ├── awsprovider_types.go # AWSProvider CRD struct
│ ├── s3bucket_types.go # S3Bucket CRD struct
│ ├── rdsinstance_types.go # RDS CRD struct (pending controller)
│ ├── ec2instance_types.go # EC2 CRD struct (pending controller)
│ └── sqsqueue_types.go # SQS CRD struct (pending controller)
│
├── controllers/ # Reconciliation Logic
│ ├── awsprovider_controller.go # Validates AWS credentials
│ └── s3bucket_controller.go # Manages S3 buckets
│
├── pkg/ # Shared Libraries
│ └── aws/
│ └── provider.go # Helper functions for AWS config
│
├── config/ # Kubernetes Manifests
│ ├── crd/bases/ # CRD YAML manifests
│ ├── rbac/ # RBAC (ServiceAccount, Role, Binding)
│ ├── manager/ # Deployment and Namespace
│ └── samples/ # Example CRs
│
├── docs/ # Documentation
│ ├── ARCHITECTURE.md # This file
│ └── DEPLOYMENT_GUIDE.md # Deployment guide
│
├── main.go # Operator entry point
├── Dockerfile # Container image build
├── Makefile # Build automation
├── go.mod # Go module definition
├── CLAUDE.md # Complete technical documentation
└── README.md # User-facing documentation
Architectural Decisions
Why Go SDK v2 instead of ACK?
Decision: Use AWS SDK for Go v2 directly.
Alternative considered: AWS Controllers for Kubernetes (ACK)
Reasons:
-
Full Control:
- Custom reconciliation logic
- Business-specific validations
- Custom error handling
-
Deployment Simplicity:
- Single operator for all services
- No need to install multiple ACK controllers
- Lower resource overhead
-
Learning:
- Better understanding of operator patterns
- Deep knowledge of AWS APIs
- Flexibility to implement custom features
-
Dependencies:
- Fewer external dependencies
- Simpler SDK updates
- No dependency on ACK roadmap
Trade-offs:
- More code to maintain
- Need to implement features manually (pagination, retries)
- Responsibility to track AWS API changes
Why Provider Pattern?
Decision: Separate credentials (AWSProvider) from resources (S3, RDS, etc).
Reasons:
- Reuse: One AWSProvider can be used by multiple resources
- Security: Credentials isolated in a specific resource
- Multi-tenancy: Different namespaces can have different providers
- Multi-account: Supports access to multiple AWS accounts
- Rotation: Facilitates credential rotation
Alternative: Direct credentials in each resource (very repetitive and insecure)
Why Deletion Policies?
Decision: Allow users to choose what happens when deleting a CR.
Reasons:
- Safety: Avoid accidental deletion of important data
- Flexibility: Different use cases (dev vs prod)
- Compliance: Some regulations require data retention
- Migration: Facilitates migration to other tools
Implemented policies:
Delete: Removes AWS resource (default for ephemeral environments)Retain: Keeps AWS resource (default for production)Orphan: Detaches but keeps everything
Concurrency and Thread Safety
Controller Runtime
Controller-runtime manages concurrency automatically:
- Work Queue: Serializes reconciliation events
- Max Concurrent Reconciles: Configurable (default: 1)
- Retry Backoff: Exponential backoff for errors
Shared State
No shared state between reconciliations:
- Each reconciliation fetches current state from Kubernetes
- No shared cache between reconciles
- State persisted only in CRs
Rate Limiting
AWS SDK:
- Implements automatic retry with exponential backoff
- Respects AWS rate limits
- Uses context.Context for timeout
Kubernetes Client:
- Configurable rate limiting
- Default: 5 QPS, burst 10
Performance and Scalability
Implemented Optimizations
-
Smart Requeue:
- Ready resources: requeue every 5 minutes
- Resources with error: requeue every 1 minute
- Immediate changes: immediate requeue
-
Minimize API Calls:
- HeadBucket before GetBucket (cheaper)
- Apply configurations only if changed
- Batch updates when possible
-
Status Subresource:
- Status updates don't trigger new reconciliation
- Reduces unnecessary reconciliations
Current Limits
-
Single Region per Provider: One AWSProvider = one region
- Workaround: Create multiple AWSProviders
-
No Pagination: Doesn't list large quantities of resources
- Impact: Limited for operators managing resources
-
Sequential Processing: One resource at a time
- Impact: Limited scalability for many resources
Future Improvements
- Metrics: Prometheus metrics for observability
- Webhooks: Validation and defaults in admission
- Caching: AWS information cache to reduce API calls
- Parallel Reconciliation: MaxConcurrentReconciles > 1
- Health Checks: Detailed health checks for AWS resources
Security
Implemented Principles
-
Least Privilege:
- Minimum necessary RBAC
- Service-specific IAM policies
- Namespaced resources when possible
-
Secrets Management:
- Credentials in Kubernetes Secrets
- IRSA preferred (no long-lived credentials)
- Never log credentials
-
Container Security:
- Non-root user (UID 65532)
- Read-only root filesystem
- No privileged escalation
- Distroless base image
-
Network Security:
- HTTPS for AWS APIs
- Certificate validation
- No credential exposure in status
Threat Model
Considered threats:
- Credential Leakage: Mitigated with IRSA and Secrets
- Unauthorized Access: Mitigated with RBAC
- Privilege Escalation: Mitigated with security context
- MITM Attacks: Mitigated with HTTPS
Observability
Logging
Structured logging with controller-runtime:
- Automatic context (namespace, name, kind)
- Log levels: info, error, debug
- JSON format (optional)
Example:
logger := log.FromContext(ctx)
logger.Info("Creating S3 bucket", "bucket", bucketName)
logger.Error(err, "Failed to create bucket")
Status Reporting
All CRs expose:
status.ready: Boolean indicating if resource is readystatus.conditions: Array of detailed conditionsstatus.lastSyncTime: Timestamp of last reconciliation
Health Checks
Available endpoints:
/healthz: Liveness probe/readyz: Readiness probe/metrics: Prometheus metrics (future)
Testing Strategy
Test Levels
-
Unit Tests (future):
- Controller logic testing
- AWS SDK mock
- Kubernetes client mock
-
Integration Tests (future):
- Test with envtest (fake API server)
- Complete reconciliation testing
- No real AWS dependency
-
E2E Tests (future):
- Test on real cluster
- Test with real AWS
- Validation of created resources
Test Coverage Goals
- Controllers: > 80% coverage
- AWS helpers: > 90% coverage
- E2E: Happy paths and main error cases
Conclusion
The infra-operator was architected following patterns established by the Kubernetes community, focusing on:
- Simplicity: Easy to understand and maintain
- Extensibility: Easy to add new AWS resources
- Security: IRSA, RBAC, least privilege
- Reliability: Finalizers, deletion policies, idempotency
- Observability: Status conditions, structured logging
The architecture allows future growth with addition of new AWS services, webhook implementation, and performance improvements, while maintaining a solid and well-structured foundation.