ElastiCache - Managed In-Memory Cache

Create and manage fully managed Redis or Memcached clusters on AWS for high-performance and scalable applications.

Prerequisite: AWSProvider Configuration

Before creating any AWS resource, you need to configure an AWSProvider that manages credentials and authentication with AWS.

IRSA:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: production-aws
  namespace: default
spec:
  region: us-east-1
  roleARN: arn:aws:iam::123456789012:role/infra-operator-role
  defaultTags:
    managed-by: infra-operator
    environment: production

Static Credentials:

apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: default
type: Opaque
stringData:
  access-key-id: test
  secret-access-key: test
---
apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: AWSProvider
metadata:
  name: localstack
  namespace: default
spec:
  region: us-east-1
  accessKeyIDRef:
    name: aws-credentials
    key: access-key-id
  secretAccessKeyRef:
    name: aws-credentials
    key: secret-access-key
  defaultTags:
    managed-by: infra-operator
    environment: test

Verify Status:

kubectl get awsprovider
kubectl describe awsprovider production-aws

aviso

For production, always use IRSA (IAM Roles for Service Accounts) instead of static credentials.

Create IAM Role for IRSA

To use IRSA in production, you need to create an IAM Role with the necessary permissions:

Trust Policy (trust-policy.json):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:infra-operator-system:infra-operator-controller-manager",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}

IAM Policy - ElastiCache (elasticache-policy.json):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:CreateCacheCluster",
        "elasticache:DeleteCacheCluster",
        "elasticache:DescribeCacheClusters",
        "elasticache:ModifyCacheCluster",
        "elasticache:CreateReplicationGroup",
        "elasticache:DeleteReplicationGroup",
        "elasticache:DescribeReplicationGroups",
        "elasticache:ModifyReplicationGroup",
        "elasticache:AddTagsToResource",
        "elasticache:RemoveTagsFromResource",
        "elasticache:ListTagsForResource",
        "elasticache:CreateCacheSubnetGroup",
        "elasticache:ModifyCacheSubnetGroup"
      ],
      "Resource": "*"
    }
  ]
}

Create Role with AWS CLI:

# 1. Get OIDC Provider from EKS cluster
export CLUSTER_NAME=my-cluster
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

OIDC_PROVIDER=$(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $AWS_REGION \
  --query "cluster.identity.oidc.issuer" \
  --output text | sed -e "s/^https:\/\///")

# 2. Update trust-policy.json with correct values
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:infra-operator-system:infra-operator-controller-manager",
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF

# 3. Create IAM Role
aws iam create-role \
  --role-name infra-operator-elasticache-role \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role for Infra Operator ElastiCache management"

# 4. Create and attach policy
aws iam put-role-policy \
  --role-name infra-operator-elasticache-role \
  --policy-name ElastiCacheManagement \
  --policy-document file://elasticache-policy.json

# 5. Get Role ARN
aws iam get-role \
  --role-name infra-operator-elasticache-role \
  --query 'Role.Arn' \
  --output text

Annotate Operator ServiceAccount:

# Add annotation to operator ServiceAccount
kubectl annotate serviceaccount infra-operator-controller-manager \
  -n infra-operator-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/infra-operator-elasticache-role

observação

Replace 123456789012 with your AWS Account ID and EXAMPLED539D4633E53DE1B71EXAMPLE with your OIDC provider ID.

Overview

Amazon ElastiCache is a fully managed in-memory caching service that provides performance and scalability for modern applications. It supports Redis and Memcached as cache engines with built-in high availability, security, and monitoring.

Features:

In-Memory Cache: Sub-millisecond latency for reads/writes
Two Supported Engines: Redis (more features) and Memcached (simpler)
Cluster Mode: Redis Cluster for horizontal scalability
Multi-AZ: Automatic high availability with failover
Replication: Automatic failover with synchronized replicas
Encryption: Encryption in transit (TLS) and at rest
Authentication: Redis AUTH with secure tokens
Automatic Backups: Snapshots for data recovery (Redis)
Auto Scaling: Increase/decrease nodes based on demand
VPC Support: Network isolation in private subnets
Parameter Groups: Custom cache configuration
Security Groups: Granular network access control
ElastiCache Cluster Mode: Data distribution across multiple nodes
Pub/Sub (Redis): Messaging between applications
Transactions (Redis): ACID transactions with MULTI/EXEC

Status: ⚠️ Requires LocalStack Pro or Real AWS

Quick Start

Redis Single Node:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: e2e-redis-simple
  namespace: default
spec:
  providerRef:
    name: localstack

  # Cluster identifier (2-63 characters)
  clusterID: e2e-redis-simple

  # Engine and version
  engine: redis
  engineVersion: "7.0"

  # Node size
  nodeType: cache.t3.micro

  # Number of nodes (single mode)
  numCacheNodes: 1

  # Tagging
  tags:
    environment: test
    managed-by: infra-operator
    purpose: e2e-testing

  deletionPolicy: Delete

Memcached Cluster:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: e2e-memcached
  namespace: default
spec:
  providerRef:
    name: localstack

  # Identifier
  clusterID: e2e-memcached-test

  # Engine
  engine: memcached
  engineVersion: "1.6.17"

  # Node size and quantity
  nodeType: cache.t3.micro
  numCacheNodes: 2

  # Tagging
  tags:
    environment: test
    managed-by: infra-operator
    engine: memcached
    purpose: e2e-testing

  deletionPolicy: Delete

Redis Cluster Mode with Replication:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: e2e-redis-cluster
  namespace: default
spec:
  providerRef:
    name: localstack

  # Identifier
  clusterID: e2e-redis-cluster

  # Engine
  engine: redis
  engineVersion: "7.0"

  # Node size
  nodeType: cache.t3.small

  # Replication group description
  replicationGroupDescription: "E2E Test Redis Cluster"

  # Cluster Mode
  numNodeGroups: 2
  replicasPerNodeGroup: 1
  automaticFailoverEnabled: true
  multiAZEnabled: false

  # Tagging
  tags:
    environment: test
    managed-by: infra-operator
    cluster-mode: enabled
    purpose: e2e-testing

  deletionPolicy: Delete

Production with Full HA:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: app-cache-prod
  namespace: default
spec:
  providerRef:
    name: production-aws

  # Identifier
  clusterID: app-redis-prod

  # Engine
  engine: redis
  engineVersion: "7.0"

  # Cluster Mode
  replicationGroupDescription: "Production Redis Cluster"
  automaticFailoverEnabled: true
  multiAZEnabled: true

  # Shard configuration
  numNodeGroups: 2
  replicasPerNodeGroup: 2
  nodeType: cache.r6g.xlarge

  # Network
  subnetGroupName: private-cache-subnet
  securityGroupIds:
  - sg-0123456789abcdef0

  # Backup
  snapshotRetentionLimit: 7
  snapshotWindow: "03:00-05:00"

  # Security
  transitEncryptionEnabled: true
  atRestEncryptionEnabled: true
  authTokenRef:
    name: redis-auth
    key: token

  tags:
    Environment: production
    Critical: "true"

  deletionPolicy: Retain

Apply:

kubectl apply -f elasticache.yaml

Verify Status:

kubectl get elasticachecluster
kubectl describe elasticachecluster e2e-redis-simple
kubectl get elasticachecluster e2e-redis-simple -o yaml

Configuration Reference

Required Fields

Reference to AWSProvider resource for authentication

AWSProvider resource name

Unique cluster identifier (2 to 63 characters)

Rules:

Alphanumerics and hyphens only
Cannot start or end with hyphen
Must be unique per region
Cannot be modified after creation

Example:

clusterID: app-redis-cache-prod

Cache engine

Options:

redis - Redis (recommended for production)
memcached - Memcached (simple, no persistence)

Example:

engine: redis

Cache engine version

Examples:

Redis: 7.0, 6.2, 5.0
Memcached: 1.6.22, 1.5.16

Example:

engineVersion: "7.0"

Note: Always specify version precisely

Node type (CPU, RAM, performance)

cache.t3 Family (Burstable - Dev/Test):

cache.t3.micro - 1 vCPU, 1 GB RAM (~$0.017/hour)
cache.t3.small - 1 vCPU, 2 GB RAM (~$0.034/hour)
cache.t3.medium - 1 vCPU, 4 GB RAM (~$0.068/hour)

cache.r6g Family (Memory Optimized):

cache.r6g.large - 2 vCPU, 16 GB RAM (~$0.280/hour)
cache.r6g.xlarge - 4 vCPU, 32 GB RAM (~$0.560/hour)
cache.r6g.2xlarge - 8 vCPU, 64 GB RAM (~$1.120/hour)

cache.m6g Family (General Purpose):

cache.m6g.large - 2 vCPU, 8 GB RAM (~$0.140/hour)
cache.m6g.xlarge - 4 vCPU, 16 GB RAM (~$0.280/hour)

Example:

nodeType: cache.t3.micro

Number of nodes in cluster (single mode)

Range: 1 - 300 nodes

Note: Use numCacheNodes for single mode. For cluster mode, use numNodeGroups:

numCacheNodes: 1

Optional Fields - Cluster Mode and HA

Replication group description (for cluster mode)

Used with:

numNodeGroups: Number of shards
replicasPerNodeGroup: Replicas per shard
automaticFailoverEnabled: Automatic failover
multiAZEnabled: Multi-AZ

Example:

replicationGroupDescription: "Production Redis Cluster"

Enable automatic failover (requires Multi-AZ and replication)

Benefits:

Automatic failover in case of failure
No manual intervention
Application continues working
Requires more replicas

Example:

automaticFailoverEnabled: true

Recommended: true in production

Distribute nodes across multiple availability zones

Benefits:

Tolerance to AZ failures
Automatic failover
~50% cost increase

Example:

multiAZEnabled: true

Recommended: true in production

Number of shards in cluster mode

Range: 1 - 500 shards

Note: Use with replicationGroupId for cluster mode:

numNodeGroups: 3

Number of replicas per shard

Range: 0 - 5 replicas per shard:

replicasPerNodeGroup: 2

Note: Recommended 1-2 replicas in production

Optional Fields - Network and Security

Cache subnet group name (private subnets)

Important: MUST exist previously in AWS:

subnetGroupName: private-cache-subnet

Create subnet group via AWS CLI:

aws elasticache create-cache-subnet-group \
--cache-subnet-group-name private-cache-subnet \
--cache-subnet-group-description "Private subnets for ElastiCache" \
--subnet-ids subnet-xxx subnet-yyy

AWS Security Group IDs for access control

Example:

securityGroupIds:
- sg-0123456789abcdef0
- sg-0987654321fedcba0

Enable encryption in transit (TLS)

Details:

Encrypts data in motion on the network
Requires authToken if enabled
Performance: minimal overhead
Recommended: always enabled

Example:

transitEncryptionEnabled: true

Enable encryption at rest

Details:

Encrypts stored data
Valid only for Redis (not Memcached)
Performance: negligible impact
Regulatory compliance

Example:

atRestEncryptionEnabled: true

Reference to Secret containing Redis authentication token (password)

Structure:

name: Secret name
namespace: Secret namespace (optional)
key: Key within Secret

Example:

authTokenRef:
  name: redis-auth
  key: token

Corresponding Secret:

apiVersion: v1
kind: Secret
metadata:
  name: redis-auth
stringData:
  token: MySecureRedisToken123!

Note: Always use in production

Optional Fields - Backup and Recovery

Days to retain snapshots (Redis only)

Range: 0 - 35 days

Recommendations:

Development: 0-1 days
Staging: 1-7 days
Production: 7-35 days

Example:

snapshotRetentionLimit: 7

Note: 0 = no automatic snapshots

Daily snapshot window (UTC, format HH:MM-HH:MM)

Default: Automatically selected:

snapshotWindow: "03:00-05:00"  # 3AM-5AM UTC

Tip: Choose low-usage time

Automatically update minor versions

Example:

enableAutoMinorVersionUpgrade: true

Optional Fields - Other

Key-value pairs for organization and billing

Example:

tags:
  Environment: production
  Application: web-app
  Team: backend
  CostCenter: engineering
  ManagedBy: infra-operator

What happens to the cluster when the CR is deleted

Options:

Delete: Cluster is deleted from AWS
Retain: Cluster remains in AWS but not managed
Orphan: Remove only management

Example:

deletionPolicy: Retain  # For production

Status Fields

After the cluster is created, the following status fields are populated:

true when cluster is available and ready to use

Current cluster state

Possible values:

creating - Creating
available - Available and ready
modifying - Configuration being changed
snapshotting - Snapshot in progress
deleting - Being deleted
failed - Creation error

Full ElastiCache cluster ARN

arn:aws:elasticache:us-east-1:123456789012:cluster:app-redis-cache

Cluster configuration endpoint (cluster mode)

Cache hostname for connection

Connection port

Cluster primary endpoint (replication group)

Primary node hostname

Connection port (default: 6379 Redis, 11211 Memcached)

Read endpoint (read replicas)

Hostname for reads

Connection port

List of individual endpoints for each node

Individual node hostname

Node port

Cluster node type (e.g., cache.t3.micro)

Running engine version

List of member clusters (for replication groups)

Timestamp when cluster was created

Timestamp of last sync with AWS

Cluster status conditions

Condition type (e.g., Ready, Available)

Condition status (True, False, Unknown)

Reason for current status

Descriptive message

Examples

Redis Single Node for Development

Simple cluster for development and testing:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: dev-redis
  namespace: default
spec:
  providerRef:
    name: dev-aws

  clusterID: dev-redis-cache
  engine: redis
  engineVersion: "7.0"

  # Small instance for dev
  nodeType: cache.t3.micro
  numCacheNodes: 1

  # No Multi-AZ to save costs
  multiAZEnabled: false

  # No backup
  snapshotRetentionLimit: 0

  subnetGroupName: dev-cache-subnet

  # No encryption for better performance in dev
  transitEncryptionEnabled: false

  tags:
    Environment: development
    Application: my-app

  deletionPolicy: Delete

Redis Cluster Mode with Full HA

Cluster for production with high availability:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: production-redis
  namespace: default
spec:
  providerRef:
    name: production-aws

  # Replication group for cluster mode
  clusterID: app-redis-prod
  replicationGroupDescription: "Production Redis Cluster"

  # Engine
  engine: redis
  engineVersion: "7.0"

  # Cluster configuration
  nodeType: cache.r6g.xlarge
  numNodeGroups: 3          # 3 shards
  replicasPerNodeGroup: 2   # 2 replicas per shard

  # High Availability
  multiAZEnabled: true
  automaticFailoverEnabled: true

  # Network
  subnetGroupName: private-cache-subnet
  securityGroupIds:
  - sg-0123456789abcdef0

  # Backup
  snapshotRetentionLimit: 7
  snapshotWindow: "03:00-05:00"

  # Security
  transitEncryptionEnabled: true
  atRestEncryptionEnabled: true
  authTokenRef:
    name: redis-auth
    key: token

  tags:
    Environment: production
    Critical: "true"
    Team: backend
    CostCenter: infrastructure

  deletionPolicy: Retain

Redis with Full Encryption and Auth

Critical cluster with maximum security:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: secure-redis
  namespace: default
spec:
  providerRef:
    name: production-aws

  clusterID: secure-redis-prod
  replicationGroupDescription: "Secure Redis Cluster"

  engine: redis
  engineVersion: "7.0"

  # Performance nodes
  nodeType: cache.r6g.2xlarge
  numNodeGroups: 4
  replicasPerNodeGroup: 2

  # HA
  multiAZEnabled: true
  automaticFailoverEnabled: true

  # Private network
  subnetGroupName: critical-cache-subnet
  securityGroupIds:
  - sg-critical-redis-001

  # Maximum security
  transitEncryptionEnabled: true
  atRestEncryptionEnabled: true
  authTokenRef:
    name: redis-auth-token
    key: password

  # Full backups
  snapshotRetentionLimit: 35
  snapshotWindow: "02:00-03:00"
  preferredMaintenanceWindow: "sun:03:00-sun:04:00"

  tags:
    Environment: production
    CriticalData: "true"
    BackupRequired: "true"
    Compliance: "required"

  deletionPolicy: Retain

Memcached for Session Storage

Memcached cluster for storing sessions:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: session-cache
  namespace: default
spec:
  providerRef:
    name: production-aws

  clusterID: session-cache-prod

  engine: memcached
  engineVersion: "1.6.22"

  nodeType: cache.t3.small
  numCacheNodes: 3

  subnetGroupName: web-cache-subnet
  securityGroupIds:
  - sg-memcached-001

  # Memcached does not support Multi-AZ
  multiAZEnabled: false

  # No backup (data does not persist)
  snapshotRetentionLimit: 0

  tags:
    Environment: production
    Type: session-cache
    Application: web-app

  deletionPolicy: Delete

Redis for Rate Limiting and Leaderboards

Cluster optimized for fast operations:

apiVersion: aws-infra-operator.runner.codes/v1alpha1
kind: ElastiCacheCluster
metadata:
  name: realtime-redis
  namespace: default
spec:
  providerRef:
    name: production-aws

  clusterID: realtime-cache
  replicationGroupDescription: "Realtime Redis Cluster"

  engine: redis
  engineVersion: "7.0"

  # Optimized for low latency
  nodeType: cache.r6g.large
  numNodeGroups: 2
  replicasPerNodeGroup: 1

  # Fast failover
  automaticFailoverEnabled: true
  multiAZEnabled: true

  subnetGroupName: realtime-cache-subnet
  securityGroupIds:
  - sg-realtime-redis-001

  # Security
  transitEncryptionEnabled: true
  authTokenRef:
    name: realtime-redis-auth
    key: token

  # Normal backups
  snapshotRetentionLimit: 3
  snapshotWindow: "02:00-03:00"

  tags:
    Environment: production
    Type: realtime-cache
    UseCase: rate-limiting-leaderboards

  deletionPolicy: Retain

Verification

Verify Status via kubectl

Command:

# List all ElastiCache clusters
kubectl get elasticachecluster

# Get detailed information
kubectl get elasticachecluster app-cache -o yaml

# Watch creation in real-time
kubectl get elasticachecluster app-cache -w

# View events and status
kubectl describe elasticachecluster app-cache

Verify on AWS

AWS CLI:

# List clusters
aws elasticache describe-cache-clusters \
      --query 'CacheClusters[].{Id:CacheClusterId,Status:CacheClusterStatus,Engine:Engine,NodeType:CacheNodeType}' \
      --output table

# Get full details
aws elasticache describe-cache-clusters \
      --cache-cluster-id app-redis-cache \
      --output json | jq '.CacheClusters[0]'

# View connection endpoint
aws elasticache describe-cache-clusters \
      --cache-cluster-id app-redis-cache \
      --query 'CacheClusters[0].CacheNodes[0].Endpoint'

# Test connection (Redis)
redis-cli -h app-redis-cache.xxxxx.cache.amazonaws.com \
              -p 6379 \
              -a MyToken123! \
              PING

# View snapshots
aws elasticache describe-snapshots \
      --cache-cluster-id app-redis-cache

LocalStack:

# For testing with LocalStack
export AWS_ENDPOINT_URL=http://localhost:4566

aws elasticache describe-cache-clusters

aws elasticache describe-cache-clusters \
      --cache-cluster-id app-redis-cache

redis-cli (Redis):

# Connect to cache
redis-cli -h <endpoint> -p 6379 -a <password>

# Test connection
PING
# Response: PONG

# View server information
INFO server

# View memory usage
INFO memory

# View all keys
KEYS *

# Disconnect
EXIT

memcached-tool (Memcached):

# Connect and get stats
echo "stats" | nc <endpoint> 11211

# View largest memory consumer
memcached-tool <endpoint>:11211 display

# Flush all (clear data)
echo "flush_all" | nc <endpoint> 11211

Expected Output

Example:

status:
  ready: true
  clusterStatus: available
  cacheClusterARN: arn:aws:elasticache:us-east-1:123456789012:cluster:app-redis-cache
  primaryEndpoint:
    address: app-redis-cache.xxxxx.cache.amazonaws.com
    port: 6379
  cacheNodeType: cache.t3.micro
  engineVersion: "7.0"
  clusterCreateTime: "2025-11-22T20:30:15Z"
  lastSyncTime: "2025-11-22T20:45:22Z"
  conditions:
  - type: Ready
    status: "True"
    reason: ClusterAvailable
    message: "ElastiCache cluster is available"
    lastTransitionTime: "2025-11-22T20:35:10Z"

Troubleshooting

Cluster stuck in creating for more than 30 minutes

Symptoms: cacheClusterStatus: creating indefinitely

Common causes:

Subnet group does not exist or is invalid
Security group not found
Cluster quota reached
Connectivity issue with AWS

Solutions:

# Check detailed status
kubectl describe elasticachecluster app-cache

# View operator logs
kubectl logs -n infra-operator-system \
      deploy/infra-operator-controller-manager \
      --tail=100 | grep -i elasticache

# Verify subnet group exists
aws elasticache describe-cache-subnet-groups \
      --cache-subnet-group-name cache-subnet-group

# Verify AWSProvider is ready
kubectl get awsprovider
kubectl describe awsprovider production-aws

# Force sync
kubectl annotate elasticachecluster app-cache \
      force-sync="$(date +%s)" --overwrite

Connection timeout when connecting to cache

Symptoms: Connection refused or Cannot connect to host

Common causes:

Security group does not allow connection
Cache is not accessible (private)
Incorrect endpoint/port
Network not routed correctly

Solutions:

# Get correct endpoint
aws elasticache describe-cache-clusters \
      --cache-cluster-id app-redis-cache \
      --query 'CacheClusters[0].CacheNodes[0].Endpoint'

# Verify security group allows inbound
aws ec2 describe-security-groups \
      --group-ids sg-0123456789abcdef0 \
      --query 'SecurityGroups[0].IpPermissions'

# MUST have rule like:
# IpProtocol: tcp, FromPort: 6379, ToPort: 6379
# CidrIp: 10.0.0.0/16 (your VPC)

# If connecting from outside VPC:
# 1. Cache MUST be publicly accessible (not recommended)
# 2. Or use bastion/jump host
# 3. Or use VPN

# Test with telnet/nc
nc -zv app-redis-cache.xxxxx.cache.amazonaws.com 6379

# Test with redis-cli
redis-cli -h app-redis-cache.xxxxx.cache.amazonaws.com \
             -p 6379 \
             -a password \
             PING

Out of Memory (cache full)

Symptoms: Error OOM command not allowed when writing data

Causes:

Data grew beyond expected
TTL not configured (data never expires)
Insufficient node size

Solutions:

# Connect to Redis and view usage
redis-cli -h <endpoint> -a password
INFO memory

# Increase node size (resize)
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"cacheNodeType":"cache.t3.small"}}'

# Increase number of nodes (scale up)
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"numCacheNodes":2}}'

# Clear data manually (Redis)
redis-cli -h <endpoint> -a password
FLUSHDB  # Clear only current database
FLUSHALL # Clear all databases

# Implement TTL in application
SET key value EX 3600  # Expires in 1 hour

Slow Performance/High Latency

Symptoms: Slow read/write even in cache

Causes:

CPU or memory saturated
Congested network
Aggressive eviction policy
Insufficient node size

Solutions:

# View slow queries (Redis)
redis-cli -h <endpoint> -a password
SLOWLOG GET 10

# Increase node size
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"cacheNodeType":"cache.r6g.xlarge"}}'

# Increase replicas to distribute load
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"replicasPerNodeGroup":2}}'

# Enable cluster mode for distribution
# Requires recreating cluster

Slow failover or not working

Symptoms: After failure, cache stays offline for long time

Cause: Automatic failover may be disabled or insufficient replicas

Solutions:

# Verify automatic failover
aws elasticache describe-replication-groups \
      --replication-group-id app-redis-rg \
      --query 'ReplicationGroups[0].AutomaticFailoverEnabled'

# Enable if not
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"automaticFailoverEnabled":true}}'

# Verify Multi-AZ
aws elasticache describe-replication-groups \
      --replication-group-id app-redis-rg \
      --query 'ReplicationGroups[0].MultiAZ'

# Enable Multi-AZ
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"multiAZEnabled":true}}'

# Test failover manually
# WARNING: Causes downtime!
aws elasticache test-failover \
      --replication-group-id app-redis-rg

# View failover progress
aws elasticache describe-replication-groups \
      --replication-group-id app-redis-rg

High ElastiCache costs

Symptoms: AWS account with unexpected cache expenses

Common causes:

Nodes too large
Multi-AZ in dev/test
Snapshots retained too long
Cross-region replication

Solutions:

# Calculate current cost
# Use AWS Pricing Calculator
# cache.t3.micro: ~$0.017/hour * 730 = ~$12/month
# cache.r6g.xlarge: ~$0.560/hour * 730 = ~$409/month

# Reduce class (if possible)
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"cacheNodeType":"cache.t3.micro"}}'

# Reduce replicas in dev
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"replicasPerNodeGroup":0}}'

# Reduce snapshot retention
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"snapshotRetentionLimit":3}}'

# For dev, delete after use
kubectl delete elasticachecluster dev-cache

Snapshot fails or takes too long

Symptoms: Status stays in snapshotting for hours

Causes:

Cache too large
IO saturated during snapshot
Many transactions during snapshot

Solutions:

# View snapshot status
aws elasticache describe-snapshots \
      --cache-cluster-id app-redis-cache

# View completed snapshots
aws elasticache describe-snapshots \
      --query 'Snapshots[].{SnapshotName:SnapshotName,Status:SnapshotStatus}'

# Increase snapshot window
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"snapshotWindow":"02:00-04:00"}}'

# Create manual snapshot (Redis)
aws elasticache create-snapshot \
      --cache-cluster-id app-redis-cache \
      --snapshot-name manual-backup-$(date +%Y%m%d-%H%M%S)

# Delete old snapshots
aws elasticache delete-snapshot \
      --snapshot-name snapshot-name

Error deleting cluster (finalizer stuck)

Symptoms: kubectl delete elasticachecluster pending indefinitely

Cause: Finalizer cannot delete cluster

Solutions:

# View details
kubectl describe elasticachecluster app-cache

# View finalizers
kubectl get elasticachecluster app-cache -o yaml | grep finalizers

# Option 1: Change deletionPolicy before deleting
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"deletionPolicy":"Retain"}}'

# Then delete
kubectl delete elasticachecluster app-cache

# Option 2: Force delete CR
kubectl patch elasticachecluster app-cache \
      -p '{"metadata":{"finalizers":[]}}' \
      --type=merge

# Then delete manually on AWS if needed
aws elasticache delete-cache-cluster \
      --cache-cluster-id app-redis-cache

Invalid or expired authentication token

Symptoms: Error WRONGPASS invalid username-password pair when connecting

Cause: Incorrect, expired, or unconfigured auth token

Solutions:

# Verify auth token is configured
kubectl get elasticachecluster app-cache -o yaml | grep authToken

# Test connecting without auth
redis-cli -h <endpoint> -p 6379 PING

# If fails, check on AWS
aws elasticache describe-cache-clusters \
      --cache-cluster-id app-redis-cache

# Connect with correct token
redis-cli -h <endpoint> -p 6379 -a MyToken123! PING

# Update token (requires restart)
kubectl patch elasticachecluster app-cache \
      --type merge \
      -p '{"spec":{"authToken":"NewToken456!"}}'

# Verify transitEncryption is enabled
kubectl get elasticachecluster app-cache -o yaml | grep transitEncryption

Best Practices

Enable Multi-AZ for production — Automatic failover in case of failure, ~50% higher cost but essential for critical data
Enable encryption everywhere — Transit encryption (TLS) always enabled, at-rest encryption for sensitive data, auth tokens in production for regulatory compliance
Use strong auth tokens — Complex tokens (32+ characters), store in Kubernetes Secret, rotate regularly, never hardcode, use different tokens per environment
Use Cluster Mode for scale — Enable for >100 GB of data, automatic sharding, unlimited horizontal scalability, better performance than single mode
Configure replicas properly — Minimum 1 replica in production, 2 replicas for high concurrency, read-heavy workloads benefit from more replicas
Right-size node types — Start with t3.micro, monitor memory usage, use r6g for memory-intensive or m6g for general purpose workloads
Implement TTL and eviction — Always use TTL on keys, set eviction policy (allkeys-lru), avoid data that never expires, monitor evictions
Configure backups — Automatic snapshots (7-35 days), manual backups before upgrades, test restore regularly, copy snapshots to another region
Use connection pooling — Don't create new connections per request, reuse existing connections, configure appropriate timeout
Tag all resources — Environment (dev/staging/prod), application, cost center, owner/team, critical flag
Secure network access — Cache in private subnet, restrictive security group, only necessary IPs/SGs, never 0.0.0.0/0

Usage Patterns

1. Session Storage (Memcached)

Store user sessions:

engine: memcached
nodeType: cache.t3.small
numCacheNodes: 3

Benefits:

Fast and simple
Share sessions between instances
No persistence (OK for sessions)

2. Database Query Cache (Redis)

Cache query results:

engine: redis
nodeType: cache.r6g.large
replicationGroupDescription: "Query Cache Redis"
snapshotRetentionLimit: 7

Benefits:

Reduce database load
Sub-millisecond latency
Automatic backup

3. Rate Limiting (Redis)

Limit requests per user/IP:

engine: redis
nodeType: cache.r6g.large
numNodeGroups: 2
replicasPerNodeGroup: 1
automaticFailoverEnabled: true

Benefits:

Very fast for increments
Distributed across multiple nodes
Highly reliable

4. Real-time Leaderboards (Redis)

Real-time ranking with sorted sets:

engine: redis
nodeType: cache.r6g.xlarge
numNodeGroups: 3
replicasPerNodeGroup: 2

Benefits:

O(log N) operations
Fast score updates
Efficient queries

5. Pub/Sub Messaging (Redis)

Messaging system between services:

engine: redis
nodeType: cache.r6g.large
automaticFailoverEnabled: true
multiAZEnabled: true

Benefits:

Low latency
Broadcast to multiple subscribers
Simple pattern matching

6. Analytics Data (Redis)

Real-time data aggregation:

engine: redis
nodeType: cache.r6g.2xlarge
numNodeGroups: 4
replicasPerNodeGroup: 2
snapshotRetentionLimit: 7

Benefits:

HyperLogLog for approximate counting
Bitmaps for fast operations
Streams for temporal data

VPC and Subnets

Prerequisite: AWSProvider Configuration​

Create IAM Role for IRSA​

Overview​

Quick Start​

Configuration Reference​

Required Fields​

Optional Fields - Cluster Mode and HA​

Optional Fields - Network and Security​

Optional Fields - Backup and Recovery​

Optional Fields - Other​

Status Fields​

Examples​

Redis Single Node for Development​

Redis Cluster Mode with Full HA​

Redis with Full Encryption and Auth​

Memcached for Session Storage​

Redis for Rate Limiting and Leaderboards​

Verification​

Verify Status via kubectl​

Verify on AWS​

Expected Output​

Troubleshooting​

Cluster stuck in creating for more than 30 minutes​

Connection timeout when connecting to cache​

Out of Memory (cache full)​

Slow Performance/High Latency​

Slow failover or not working​

High ElastiCache costs​

Snapshot fails or takes too long​

Error deleting cluster (finalizer stuck)​

Invalid or expired authentication token​

Best Practices​

Usage Patterns​

1. Session Storage (Memcached)​

2. Database Query Cache (Redis)​

3. Rate Limiting (Redis)​

4. Real-time Leaderboards (Redis)​

5. Pub/Sub Messaging (Redis)​

6. Analytics Data (Redis)​

Related Resources​

Prerequisite: AWSProvider Configuration

Create IAM Role for IRSA

Overview

Quick Start

Configuration Reference

Required Fields

Optional Fields - Cluster Mode and HA

Optional Fields - Network and Security

Optional Fields - Backup and Recovery

Optional Fields - Other

Status Fields

Examples

Redis Single Node for Development

Redis Cluster Mode with Full HA

Redis with Full Encryption and Auth

Memcached for Session Storage

Redis for Rate Limiting and Leaderboards

Verification

Verify Status via kubectl

Verify on AWS

Expected Output

Troubleshooting

Cluster stuck in creating for more than 30 minutes

Connection timeout when connecting to cache

Out of Memory (cache full)

Slow Performance/High Latency

Slow failover or not working

High ElastiCache costs

Snapshot fails or takes too long

Error deleting cluster (finalizer stuck)

Invalid or expired authentication token

Best Practices

Usage Patterns

1. Session Storage (Memcached)

2. Database Query Cache (Redis)

3. Rate Limiting (Redis)

4. Real-time Leaderboards (Redis)

5. Pub/Sub Messaging (Redis)

6. Analytics Data (Redis)

Related Resources