ElastiCache - Managed In-Memory Cache
Create and manage fully managed Redis or Memcached clusters on AWS for high-performance and scalable applications.
Prerequisite: AWSProvider Configuration
Before creating any AWS resource, you need to configure an AWSProvider that manages credentials and authentication with AWS.
IRSA:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: production-aws
namespace: default
spec:
region: us-east-1
roleARN: arn:aws:iam::123456789012:role/infra-operator-role
defaultTags:
managed-by: infra-operator
environment: production
Static Credentials:
apiVersion: v1
kind: Secret
metadata:
name: aws-credentials
namespace: default
type: Opaque
stringData:
access-key-id: test
secret-access-key: test
---
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
name: localstack
namespace: default
spec:
region: us-east-1
accessKeyIDRef:
name: aws-credentials
key: access-key-id
secretAccessKeyRef:
name: aws-credentials
key: secret-access-key
defaultTags:
managed-by: infra-operator
environment: test
Verify Status:
kubectl get awsprovider
kubectl describe awsprovider production-aws
For production, always use IRSA (IAM Roles for Service Accounts) instead of static credentials.
Create IAM Role for IRSA
To use IRSA in production, you need to create an IAM Role with the necessary permissions:
Trust Policy (trust-policy.json):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:infra-operator-system:infra-operator-controller-manager",
"oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud": "sts.amazonaws.com"
}
}
}
]
}
IAM Policy - ElastiCache (elasticache-policy.json):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"elasticache:CreateCacheCluster",
"elasticache:DeleteCacheCluster",
"elasticache:DescribeCacheClusters",
"elasticache:ModifyCacheCluster",
"elasticache:CreateReplicationGroup",
"elasticache:DeleteReplicationGroup",
"elasticache:DescribeReplicationGroups",
"elasticache:ModifyReplicationGroup",
"elasticache:AddTagsToResource",
"elasticache:RemoveTagsFromResource",
"elasticache:ListTagsForResource",
"elasticache:CreateCacheSubnetGroup",
"elasticache:ModifyCacheSubnetGroup"
],
"Resource": "*"
}
]
}
Create Role with AWS CLI:
# 1. Get OIDC Provider from EKS cluster
export CLUSTER_NAME=my-cluster
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_PROVIDER=$(aws eks describe-cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--query "cluster.identity.oidc.issuer" \
--output text | sed -e "s/^https:\/\///")
# 2. Update trust-policy.json with correct values
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:infra-operator-system:infra-operator-controller-manager",
"${OIDC_PROVIDER}:aud": "sts.amazonaws.com"
}
}
}
]
}
EOF
# 3. Create IAM Role
aws iam create-role \
--role-name infra-operator-elasticache-role \
--assume-role-policy-document file://trust-policy.json \
--description "Role for Infra Operator ElastiCache management"
# 4. Create and attach policy
aws iam put-role-policy \
--role-name infra-operator-elasticache-role \
--policy-name ElastiCacheManagement \
--policy-document file://elasticache-policy.json
# 5. Get Role ARN
aws iam get-role \
--role-name infra-operator-elasticache-role \
--query 'Role.Arn' \
--output text
Annotate Operator ServiceAccount:
# Add annotation to operator ServiceAccount
kubectl annotate serviceaccount infra-operator-controller-manager \
-n infra-operator-system \
eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/infra-operator-elasticache-role
Replace 123456789012 with your AWS Account ID and EXAMPLED539D4633E53DE1B71EXAMPLE with your OIDC provider ID.
Overview
Amazon ElastiCache is a fully managed in-memory caching service that provides performance and scalability for modern applications. It supports Redis and Memcached as cache engines with built-in high availability, security, and monitoring.
Features:
- In-Memory Cache: Sub-millisecond latency for reads/writes
- Two Supported Engines: Redis (more features) and Memcached (simpler)
- Cluster Mode: Redis Cluster for horizontal scalability
- Multi-AZ: Automatic high availability with failover
- Replication: Automatic failover with synchronized replicas
- Encryption: Encryption in transit (TLS) and at rest
- Authentication: Redis AUTH with secure tokens
- Automatic Backups: Snapshots for data recovery (Redis)
- Auto Scaling: Increase/decrease nodes based on demand
- VPC Support: Network isolation in private subnets
- Parameter Groups: Custom cache configuration
- Security Groups: Granular network access control
- ElastiCache Cluster Mode: Data distribution across multiple nodes
- Pub/Sub (Redis): Messaging between applications
- Transactions (Redis): ACID transactions with MULTI/EXEC
Status: ⚠️ Requires LocalStack Pro or Real AWS
Quick Start
Redis Single Node:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: e2e-redis-simple
namespace: default
spec:
providerRef:
name: localstack
# Cluster identifier (2-63 characters)
clusterID: e2e-redis-simple
# Engine and version
engine: redis
engineVersion: "7.0"
# Node size
nodeType: cache.t3.micro
# Number of nodes (single mode)
numCacheNodes: 1
# Tagging
tags:
environment: test
managed-by: infra-operator
purpose: e2e-testing
deletionPolicy: Delete
Memcached Cluster:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: e2e-memcached
namespace: default
spec:
providerRef:
name: localstack
# Identifier
clusterID: e2e-memcached-test
# Engine
engine: memcached
engineVersion: "1.6.17"
# Node size and quantity
nodeType: cache.t3.micro
numCacheNodes: 2
# Tagging
tags:
environment: test
managed-by: infra-operator
engine: memcached
purpose: e2e-testing
deletionPolicy: Delete
Redis Cluster Mode with Replication:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: e2e-redis-cluster
namespace: default
spec:
providerRef:
name: localstack
# Identifier
clusterID: e2e-redis-cluster
# Engine
engine: redis
engineVersion: "7.0"
# Node size
nodeType: cache.t3.small
# Replication group description
replicationGroupDescription: "E2E Test Redis Cluster"
# Cluster Mode
numNodeGroups: 2
replicasPerNodeGroup: 1
automaticFailoverEnabled: true
multiAZEnabled: false
# Tagging
tags:
environment: test
managed-by: infra-operator
cluster-mode: enabled
purpose: e2e-testing
deletionPolicy: Delete
Production with Full HA:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: app-cache-prod
namespace: default
spec:
providerRef:
name: production-aws
# Identifier
clusterID: app-redis-prod
# Engine
engine: redis
engineVersion: "7.0"
# Cluster Mode
replicationGroupDescription: "Production Redis Cluster"
automaticFailoverEnabled: true
multiAZEnabled: true
# Shard configuration
numNodeGroups: 2
replicasPerNodeGroup: 2
nodeType: cache.r6g.xlarge
# Network
subnetGroupName: private-cache-subnet
securityGroupIds:
- sg-0123456789abcdef0
# Backup
snapshotRetentionLimit: 7
snapshotWindow: "03:00-05:00"
# Security
transitEncryptionEnabled: true
atRestEncryptionEnabled: true
authTokenRef:
name: redis-auth
key: token
tags:
Environment: production
Critical: "true"
deletionPolicy: Retain
Apply:
kubectl apply -f elasticache.yaml
Verify Status:
kubectl get elasticachecluster
kubectl describe elasticachecluster e2e-redis-simple
kubectl get elasticachecluster e2e-redis-simple -o yaml
Configuration Reference
Required Fields
Reference to AWSProvider resource for authentication
AWSProvider resource name
Unique cluster identifier (2 to 63 characters)
Rules:
- Alphanumerics and hyphens only
- Cannot start or end with hyphen
- Must be unique per region
- Cannot be modified after creation
Example:
clusterID: app-redis-cache-prod
Cache engine
Options:
redis- Redis (recommended for production)memcached- Memcached (simple, no persistence)
Example:
engine: redis
Cache engine version
Examples:
- Redis:
7.0,6.2,5.0 - Memcached:
1.6.22,1.5.16
Example:
engineVersion: "7.0"
Note: Always specify version precisely
Node type (CPU, RAM, performance)
cache.t3 Family (Burstable - Dev/Test):
cache.t3.micro- 1 vCPU, 1 GB RAM (~$0.017/hour)cache.t3.small- 1 vCPU, 2 GB RAM (~$0.034/hour)cache.t3.medium- 1 vCPU, 4 GB RAM (~$0.068/hour)
cache.r6g Family (Memory Optimized):
cache.r6g.large- 2 vCPU, 16 GB RAM (~$0.280/hour)cache.r6g.xlarge- 4 vCPU, 32 GB RAM (~$0.560/hour)cache.r6g.2xlarge- 8 vCPU, 64 GB RAM (~$1.120/hour)
cache.m6g Family (General Purpose):
cache.m6g.large- 2 vCPU, 8 GB RAM (~$0.140/hour)cache.m6g.xlarge- 4 vCPU, 16 GB RAM (~$0.280/hour)
Example:
nodeType: cache.t3.micro
Number of nodes in cluster (single mode)
Range: 1 - 300 nodes
Note: Use numCacheNodes for single mode. For cluster mode, use numNodeGroups:
numCacheNodes: 1
Optional Fields - Cluster Mode and HA
Replication group description (for cluster mode)
Used with:
numNodeGroups: Number of shardsreplicasPerNodeGroup: Replicas per shardautomaticFailoverEnabled: Automatic failovermultiAZEnabled: Multi-AZ
Example:
replicationGroupDescription: "Production Redis Cluster"
Enable automatic failover (requires Multi-AZ and replication)
Benefits:
- Automatic failover in case of failure
- No manual intervention
- Application continues working
- Requires more replicas
Example:
automaticFailoverEnabled: true
Recommended: true in production
Distribute nodes across multiple availability zones
Benefits:
- Tolerance to AZ failures
- Automatic failover
- ~50% cost increase
Example:
multiAZEnabled: true
Recommended: true in production
Number of shards in cluster mode
Range: 1 - 500 shards
Note: Use with replicationGroupId for cluster mode:
numNodeGroups: 3
Number of replicas per shard
Range: 0 - 5 replicas per shard:
replicasPerNodeGroup: 2
Note: Recommended 1-2 replicas in production
Optional Fields - Network and Security
Cache subnet group name (private subnets)
Important: MUST exist previously in AWS:
subnetGroupName: private-cache-subnet
Create subnet group via AWS CLI:
aws elasticache create-cache-subnet-group \
--cache-subnet-group-name private-cache-subnet \
--cache-subnet-group-description "Private subnets for ElastiCache" \
--subnet-ids subnet-xxx subnet-yyy
AWS Security Group IDs for access control
Example:
securityGroupIds:
- sg-0123456789abcdef0
- sg-0987654321fedcba0
Enable encryption in transit (TLS)
Details:
- Encrypts data in motion on the network
- Requires authToken if enabled
- Performance: minimal overhead
- Recommended: always enabled
Example:
transitEncryptionEnabled: true
Enable encryption at rest
Details:
- Encrypts stored data
- Valid only for Redis (not Memcached)
- Performance: negligible impact
- Regulatory compliance
Example:
atRestEncryptionEnabled: true
Reference to Secret containing Redis authentication token (password)
Structure:
name: Secret namenamespace: Secret namespace (optional)key: Key within Secret
Example:
authTokenRef:
name: redis-auth
key: token
Corresponding Secret:
apiVersion: v1
kind: Secret
metadata:
name: redis-auth
stringData:
token: MySecureRedisToken123!
Note: Always use in production
Optional Fields - Backup and Recovery
Days to retain snapshots (Redis only)
Range: 0 - 35 days
Recommendations:
- Development: 0-1 days
- Staging: 1-7 days
- Production: 7-35 days
Example:
snapshotRetentionLimit: 7
Note: 0 = no automatic snapshots
Daily snapshot window (UTC, format HH:MM-HH:MM)
Default: Automatically selected:
snapshotWindow: "03:00-05:00" # 3AM-5AM UTC
Tip: Choose low-usage time
Automatically update minor versions
Example:
enableAutoMinorVersionUpgrade: true
Optional Fields - Other
Key-value pairs for organization and billing
Example:
tags:
Environment: production
Application: web-app
Team: backend
CostCenter: engineering
ManagedBy: infra-operator
What happens to the cluster when the CR is deleted
Options:
Delete: Cluster is deleted from AWSRetain: Cluster remains in AWS but not managedOrphan: Remove only management
Example:
deletionPolicy: Retain # For production
Status Fields
After the cluster is created, the following status fields are populated:
true when cluster is available and ready to use
Current cluster state
Possible values:
creating- Creatingavailable- Available and readymodifying- Configuration being changedsnapshotting- Snapshot in progressdeleting- Being deletedfailed- Creation error
Full ElastiCache cluster ARN
arn:aws:elasticache:us-east-1:123456789012:cluster:app-redis-cache
Cluster configuration endpoint (cluster mode)
Cache hostname for connection
Connection port
Cluster primary endpoint (replication group)
Primary node hostname
Connection port (default: 6379 Redis, 11211 Memcached)
Read endpoint (read replicas)
Hostname for reads
Connection port
List of individual endpoints for each node
Individual node hostname
Node port
Cluster node type (e.g., cache.t3.micro)
Running engine version
List of member clusters (for replication groups)
Timestamp when cluster was created
Timestamp of last sync with AWS
Cluster status conditions
Condition type (e.g., Ready, Available)
Condition status (True, False, Unknown)
Reason for current status
Descriptive message
Examples
Redis Single Node for Development
Simple cluster for development and testing:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: dev-redis
namespace: default
spec:
providerRef:
name: dev-aws
clusterID: dev-redis-cache
engine: redis
engineVersion: "7.0"
# Small instance for dev
nodeType: cache.t3.micro
numCacheNodes: 1
# No Multi-AZ to save costs
multiAZEnabled: false
# No backup
snapshotRetentionLimit: 0
subnetGroupName: dev-cache-subnet
# No encryption for better performance in dev
transitEncryptionEnabled: false
tags:
Environment: development
Application: my-app
deletionPolicy: Delete
Redis Cluster Mode with Full HA
Cluster for production with high availability:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: production-redis
namespace: default
spec:
providerRef:
name: production-aws
# Replication group for cluster mode
clusterID: app-redis-prod
replicationGroupDescription: "Production Redis Cluster"
# Engine
engine: redis
engineVersion: "7.0"
# Cluster configuration
nodeType: cache.r6g.xlarge
numNodeGroups: 3 # 3 shards
replicasPerNodeGroup: 2 # 2 replicas per shard
# High Availability
multiAZEnabled: true
automaticFailoverEnabled: true
# Network
subnetGroupName: private-cache-subnet
securityGroupIds:
- sg-0123456789abcdef0
# Backup
snapshotRetentionLimit: 7
snapshotWindow: "03:00-05:00"
# Security
transitEncryptionEnabled: true
atRestEncryptionEnabled: true
authTokenRef:
name: redis-auth
key: token
tags:
Environment: production
Critical: "true"
Team: backend
CostCenter: infrastructure
deletionPolicy: Retain
Redis with Full Encryption and Auth
Critical cluster with maximum security:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: secure-redis
namespace: default
spec:
providerRef:
name: production-aws
clusterID: secure-redis-prod
replicationGroupDescription: "Secure Redis Cluster"
engine: redis
engineVersion: "7.0"
# Performance nodes
nodeType: cache.r6g.2xlarge
numNodeGroups: 4
replicasPerNodeGroup: 2
# HA
multiAZEnabled: true
automaticFailoverEnabled: true
# Private network
subnetGroupName: critical-cache-subnet
securityGroupIds:
- sg-critical-redis-001
# Maximum security
transitEncryptionEnabled: true
atRestEncryptionEnabled: true
authTokenRef:
name: redis-auth-token
key: password
# Full backups
snapshotRetentionLimit: 35
snapshotWindow: "02:00-03:00"
preferredMaintenanceWindow: "sun:03:00-sun:04:00"
tags:
Environment: production
CriticalData: "true"
BackupRequired: "true"
Compliance: "required"
deletionPolicy: Retain
Memcached for Session Storage
Memcached cluster for storing sessions:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: session-cache
namespace: default
spec:
providerRef:
name: production-aws
clusterID: session-cache-prod
engine: memcached
engineVersion: "1.6.22"
nodeType: cache.t3.small
numCacheNodes: 3
subnetGroupName: web-cache-subnet
securityGroupIds:
- sg-memcached-001
# Memcached does not support Multi-AZ
multiAZEnabled: false
# No backup (data does not persist)
snapshotRetentionLimit: 0
tags:
Environment: production
Type: session-cache
Application: web-app
deletionPolicy: Delete
Redis for Rate Limiting and Leaderboards
Cluster optimized for fast operations:
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
name: realtime-redis
namespace: default
spec:
providerRef:
name: production-aws
clusterID: realtime-cache
replicationGroupDescription: "Realtime Redis Cluster"
engine: redis
engineVersion: "7.0"
# Optimized for low latency
nodeType: cache.r6g.large
numNodeGroups: 2
replicasPerNodeGroup: 1
# Fast failover
automaticFailoverEnabled: true
multiAZEnabled: true
subnetGroupName: realtime-cache-subnet
securityGroupIds:
- sg-realtime-redis-001
# Security
transitEncryptionEnabled: true
authTokenRef:
name: realtime-redis-auth
key: token
# Normal backups
snapshotRetentionLimit: 3
snapshotWindow: "02:00-03:00"
tags:
Environment: production
Type: realtime-cache
UseCase: rate-limiting-leaderboards
deletionPolicy: Retain
Verification
Verify Status via kubectl
Command:
# List all ElastiCache clusters
kubectl get elasticachecluster
# Get detailed information
kubectl get elasticachecluster app-cache -o yaml
# Watch creation in real-time
kubectl get elasticachecluster app-cache -w
# View events and status
kubectl describe elasticachecluster app-cache
Verify on AWS
AWS CLI:
# List clusters
aws elasticache describe-cache-clusters \
--query 'CacheClusters[].{Id:CacheClusterId,Status:CacheClusterStatus,Engine:Engine,NodeType:CacheNodeType}' \
--output table
# Get full details
aws elasticache describe-cache-clusters \
--cache-cluster-id app-redis-cache \
--output json | jq '.CacheClusters[0]'
# View connection endpoint
aws elasticache describe-cache-clusters \
--cache-cluster-id app-redis-cache \
--query 'CacheClusters[0].CacheNodes[0].Endpoint'
# Test connection (Redis)
redis-cli -h app-redis-cache.xxxxx.cache.amazonaws.com \
-p 6379 \
-a MyToken123! \
PING
# View snapshots
aws elasticache describe-snapshots \
--cache-cluster-id app-redis-cache
LocalStack:
# For testing with LocalStack
export AWS_ENDPOINT_URL=http://localhost:4566
aws elasticache describe-cache-clusters
aws elasticache describe-cache-clusters \
--cache-cluster-id app-redis-cache
redis-cli (Redis):
# Connect to cache
redis-cli -h <endpoint> -p 6379 -a <password>
# Test connection
PING
# Response: PONG
# View server information
INFO server
# View memory usage
INFO memory
# View all keys
KEYS *
# Disconnect
EXIT
memcached-tool (Memcached):
# Connect and get stats
echo "stats" | nc <endpoint> 11211
# View largest memory consumer
memcached-tool <endpoint>:11211 display
# Flush all (clear data)
echo "flush_all" | nc <endpoint> 11211
Expected Output
Example:
status:
ready: true
clusterStatus: available
cacheClusterARN: arn:aws:elasticache:us-east-1:123456789012:cluster:app-redis-cache
primaryEndpoint:
address: app-redis-cache.xxxxx.cache.amazonaws.com
port: 6379
cacheNodeType: cache.t3.micro
engineVersion: "7.0"
clusterCreateTime: "2025-11-22T20:30:15Z"
lastSyncTime: "2025-11-22T20:45:22Z"
conditions:
- type: Ready
status: "True"
reason: ClusterAvailable
message: "ElastiCache cluster is available"
lastTransitionTime: "2025-11-22T20:35:10Z"
Troubleshooting
Cluster stuck in creating for more than 30 minutes
Symptoms: cacheClusterStatus: creating indefinitely
Common causes:
- Subnet group does not exist or is invalid
- Security group not found
- Cluster quota reached
- Connectivity issue with AWS
Solutions:
# Check detailed status
kubectl describe elasticachecluster app-cache
# View operator logs
kubectl logs -n infra-operator-system \
deploy/infra-operator-controller-manager \
--tail=100 | grep -i elasticache
# Verify subnet group exists
aws elasticache describe-cache-subnet-groups \
--cache-subnet-group-name cache-subnet-group
# Verify AWSProvider is ready
kubectl get awsprovider
kubectl describe awsprovider production-aws
# Force sync
kubectl annotate elasticachecluster app-cache \
force-sync="$(date +%s)" --overwrite
Connection timeout when connecting to cache
Symptoms: Connection refused or Cannot connect to host
Common causes:
- Security group does not allow connection
- Cache is not accessible (private)
- Incorrect endpoint/port
- Network not routed correctly
Solutions:
# Get correct endpoint
aws elasticache describe-cache-clusters \
--cache-cluster-id app-redis-cache \
--query 'CacheClusters[0].CacheNodes[0].Endpoint'
# Verify security group allows inbound
aws ec2 describe-security-groups \
--group-ids sg-0123456789abcdef0 \
--query 'SecurityGroups[0].IpPermissions'
# MUST have rule like:
# IpProtocol: tcp, FromPort: 6379, ToPort: 6379
# CidrIp: 10.0.0.0/16 (your VPC)
# If connecting from outside VPC:
# 1. Cache MUST be publicly accessible (not recommended)
# 2. Or use bastion/jump host
# 3. Or use VPN
# Test with telnet/nc
nc -zv app-redis-cache.xxxxx.cache.amazonaws.com 6379
# Test with redis-cli
redis-cli -h app-redis-cache.xxxxx.cache.amazonaws.com \
-p 6379 \
-a password \
PING
Out of Memory (cache full)
Symptoms: Error OOM command not allowed when writing data
Causes:
- Data grew beyond expected
- TTL not configured (data never expires)
- Insufficient node size
Solutions:
# Connect to Redis and view usage
redis-cli -h <endpoint> -a password
INFO memory
# Increase node size (resize)
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"cacheNodeType":"cache.t3.small"}}'
# Increase number of nodes (scale up)
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"numCacheNodes":2}}'
# Clear data manually (Redis)
redis-cli -h <endpoint> -a password
FLUSHDB # Clear only current database
FLUSHALL # Clear all databases
# Implement TTL in application
SET key value EX 3600 # Expires in 1 hour
Slow Performance/High Latency
Symptoms: Slow read/write even in cache
Causes:
- CPU or memory saturated
- Congested network
- Aggressive eviction policy
- Insufficient node size
Solutions:
# View slow queries (Redis)
redis-cli -h <endpoint> -a password
SLOWLOG GET 10
# Increase node size
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"cacheNodeType":"cache.r6g.xlarge"}}'
# Increase replicas to distribute load
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"replicasPerNodeGroup":2}}'
# Enable cluster mode for distribution
# Requires recreating cluster
Slow failover or not working
Symptoms: After failure, cache stays offline for long time
Cause: Automatic failover may be disabled or insufficient replicas
Solutions:
# Verify automatic failover
aws elasticache describe-replication-groups \
--replication-group-id app-redis-rg \
--query 'ReplicationGroups[0].AutomaticFailoverEnabled'
# Enable if not
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"automaticFailoverEnabled":true}}'
# Verify Multi-AZ
aws elasticache describe-replication-groups \
--replication-group-id app-redis-rg \
--query 'ReplicationGroups[0].MultiAZ'
# Enable Multi-AZ
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"multiAZEnabled":true}}'
# Test failover manually
# WARNING: Causes downtime!
aws elasticache test-failover \
--replication-group-id app-redis-rg
# View failover progress
aws elasticache describe-replication-groups \
--replication-group-id app-redis-rg
High ElastiCache costs
Symptoms: AWS account with unexpected cache expenses
Common causes:
- Nodes too large
- Multi-AZ in dev/test
- Snapshots retained too long
- Cross-region replication
Solutions:
# Calculate current cost
# Use AWS Pricing Calculator
# cache.t3.micro: ~$0.017/hour * 730 = ~$12/month
# cache.r6g.xlarge: ~$0.560/hour * 730 = ~$409/month
# Reduce class (if possible)
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"cacheNodeType":"cache.t3.micro"}}'
# Reduce replicas in dev
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"replicasPerNodeGroup":0}}'
# Reduce snapshot retention
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"snapshotRetentionLimit":3}}'
# For dev, delete after use
kubectl delete elasticachecluster dev-cache
Snapshot fails or takes too long
Symptoms: Status stays in snapshotting for hours
Causes:
- Cache too large
- IO saturated during snapshot
- Many transactions during snapshot
Solutions:
# View snapshot status
aws elasticache describe-snapshots \
--cache-cluster-id app-redis-cache
# View completed snapshots
aws elasticache describe-snapshots \
--query 'Snapshots[].{SnapshotName:SnapshotName,Status:SnapshotStatus}'
# Increase snapshot window
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"snapshotWindow":"02:00-04:00"}}'
# Create manual snapshot (Redis)
aws elasticache create-snapshot \
--cache-cluster-id app-redis-cache \
--snapshot-name manual-backup-$(date +%Y%m%d-%H%M%S)
# Delete old snapshots
aws elasticache delete-snapshot \
--snapshot-name snapshot-name
Error deleting cluster (finalizer stuck)
Symptoms: kubectl delete elasticachecluster pending indefinitely
Cause: Finalizer cannot delete cluster
Solutions:
# View details
kubectl describe elasticachecluster app-cache
# View finalizers
kubectl get elasticachecluster app-cache -o yaml | grep finalizers
# Option 1: Change deletionPolicy before deleting
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"deletionPolicy":"Retain"}}'
# Then delete
kubectl delete elasticachecluster app-cache
# Option 2: Force delete CR
kubectl patch elasticachecluster app-cache \
-p '{"metadata":{"finalizers":[]}}' \
--type=merge
# Then delete manually on AWS if needed
aws elasticache delete-cache-cluster \
--cache-cluster-id app-redis-cache
Invalid or expired authentication token
Symptoms: Error WRONGPASS invalid username-password pair when connecting
Cause: Incorrect, expired, or unconfigured auth token
Solutions:
# Verify auth token is configured
kubectl get elasticachecluster app-cache -o yaml | grep authToken
# Test connecting without auth
redis-cli -h <endpoint> -p 6379 PING
# If fails, check on AWS
aws elasticache describe-cache-clusters \
--cache-cluster-id app-redis-cache
# Connect with correct token
redis-cli -h <endpoint> -p 6379 -a MyToken123! PING
# Update token (requires restart)
kubectl patch elasticachecluster app-cache \
--type merge \
-p '{"spec":{"authToken":"NewToken456!"}}'
# Verify transitEncryption is enabled
kubectl get elasticachecluster app-cache -o yaml | grep transitEncryption
Best Practices
- Enable Multi-AZ for production — Automatic failover in case of failure, ~50% higher cost but essential for critical data
- Enable encryption everywhere — Transit encryption (TLS) always enabled, at-rest encryption for sensitive data, auth tokens in production for regulatory compliance
- Use strong auth tokens — Complex tokens (32+ characters), store in Kubernetes Secret, rotate regularly, never hardcode, use different tokens per environment
- Use Cluster Mode for scale — Enable for >100 GB of data, automatic sharding, unlimited horizontal scalability, better performance than single mode
- Configure replicas properly — Minimum 1 replica in production, 2 replicas for high concurrency, read-heavy workloads benefit from more replicas
- Right-size node types — Start with t3.micro, monitor memory usage, use r6g for memory-intensive or m6g for general purpose workloads
- Implement TTL and eviction — Always use TTL on keys, set eviction policy (allkeys-lru), avoid data that never expires, monitor evictions
- Configure backups — Automatic snapshots (7-35 days), manual backups before upgrades, test restore regularly, copy snapshots to another region
- Use connection pooling — Don't create new connections per request, reuse existing connections, configure appropriate timeout
- Tag all resources — Environment (dev/staging/prod), application, cost center, owner/team, critical flag
- Secure network access — Cache in private subnet, restrictive security group, only necessary IPs/SGs, never 0.0.0.0/0
Usage Patterns
1. Session Storage (Memcached)
Store user sessions:
engine: memcached
nodeType: cache.t3.small
numCacheNodes: 3
Benefits:
- Fast and simple
- Share sessions between instances
- No persistence (OK for sessions)
2. Database Query Cache (Redis)
Cache query results:
engine: redis
nodeType: cache.r6g.large
replicationGroupDescription: "Query Cache Redis"
snapshotRetentionLimit: 7
Benefits:
- Reduce database load
- Sub-millisecond latency
- Automatic backup
3. Rate Limiting (Redis)
Limit requests per user/IP:
engine: redis
nodeType: cache.r6g.large
numNodeGroups: 2
replicasPerNodeGroup: 1
automaticFailoverEnabled: true
Benefits:
- Very fast for increments
- Distributed across multiple nodes
- Highly reliable
4. Real-time Leaderboards (Redis)
Real-time ranking with sorted sets:
engine: redis
nodeType: cache.r6g.xlarge
numNodeGroups: 3
replicasPerNodeGroup: 2
Benefits:
- O(log N) operations
- Fast score updates
- Efficient queries
5. Pub/Sub Messaging (Redis)
Messaging system between services:
engine: redis
nodeType: cache.r6g.large
automaticFailoverEnabled: true
multiAZEnabled: true
Benefits:
- Low latency
- Broadcast to multiple subscribers
- Simple pattern matching
6. Analytics Data (Redis)
Real-time data aggregation:
engine: redis
nodeType: cache.r6g.2xlarge
numNodeGroups: 4
replicasPerNodeGroup: 2
snapshotRetentionLimit: 7
Benefits:
- HyperLogLog for approximate counting
- Bitmaps for fast operations
- Streams for temporal data