AWS RDS: Setup, Multi-AZ, Read Replicas, Parameter Groups and Cost Optimization
Amazon RDS (Relational Database Service) manages the operational overhead of running a relational database (provisioning, patching, backups, and failover) so you can focus on your application. But knowing how to configure it correctly for production workloads, security, and cost is where the real skill lies.
What RDS Manages for You
- Operating system and database engine patches
- Automated backups and point-in-time recovery
- Storage auto-scaling
- Multi-AZ failover
- Monitoring and alerting via CloudWatch
- Encryption at rest and in transit
What you still manage: schema design, query optimization, indexes, and connection management.
Creating an RDS Instance with Terraform
```hcl
# modules/rds/main.tf
resource "aws_db_subnet_group" "main" {
  name       = "${var.name}-subnet-group"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name = "${var.name}-subnet-group"
  }
}

resource "aws_security_group" "rds" {
  name   = "${var.name}-rds-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.app_security_group_ids # only app servers can connect
  }

  tags = {
    Name = "${var.name}-rds-sg"
  }
}

resource "aws_db_instance" "main" {
  identifier            = var.name
  engine                = "postgres"
  engine_version        = "16.1"
  instance_class        = var.instance_class
  allocated_storage     = var.storage_gb
  max_allocated_storage = var.storage_gb * 3 # auto-scaling ceiling

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password # use AWS Secrets Manager in production

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false # never expose to the internet

  # High availability
  multi_az = var.environment == "prod"

  # Backups
  backup_retention_period = var.environment == "prod" ? 7 : 1
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  # Performance
  parameter_group_name = aws_db_parameter_group.main.name
  storage_type         = "gp3"
  storage_encrypted    = true
  kms_key_id           = var.kms_key_arn

  # Monitoring
  monitoring_interval             = 60 # Enhanced Monitoring every 60s
  monitoring_role_arn             = var.monitoring_role_arn
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  performance_insights_enabled          = true
  performance_insights_retention_period = 7 # days

  # Deletion protection
  deletion_protection       = var.environment == "prod"
  skip_final_snapshot       = var.environment != "prod"
  final_snapshot_identifier = "${var.name}-final-snapshot"

  tags = {
    Environment = var.environment
  }
}
```
Multi-AZ for High Availability
Multi-AZ maintains a synchronous standby replica in a different Availability Zone. Failover is automatic and typically completes in 60-120 seconds.
```
AZ-1 (eu-west-1a)           AZ-2 (eu-west-1b)
+-----------------+         +-----------------+
|  Primary RDS    |-------->|  Standby RDS    |
|  (read/write)   |  sync   |  (no traffic)   |
+-----------------+         +-----------------+
        ^
        |
  DNS endpoint (stays the same after failover)
```
What triggers automatic failover:
- AZ outage
- Primary instance failure
- OS patching on the primary
- Instance class change
Your application connects via the RDS endpoint DNS name, which automatically points to the new primary after failover. No application code change is needed.
```bash
# RDS endpoint format
your-db.abc123xyz.eu-west-1.rds.amazonaws.com

# After failover: same DNS name, now points to the former standby
```
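The DNS name survives failover, but in-flight connections drop and new ones may fail for a minute or two. A small retry wrapper keeps the application resilient during that window. This is a sketch: the retry counts and delay are illustrative, and `ConnectionError` stands in for your driver's transient error (e.g. `psycopg2.OperationalError`).

```python
import time


def retry_on_failover(fn, retries=5, delay=2.0, exceptions=(ConnectionError,)):
    """Call fn, retrying with a fixed delay while the database fails over.

    `exceptions` should be your driver's transient connection errors;
    ConnectionError is used here only so the sketch is self-contained.
    """
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # wait for DNS to point at the new primary
```

In production you would also reopen the connection on each retry rather than reuse a dead one, and prefer exponential backoff with jitter over a fixed delay.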
Read Replicas for Read Scaling
Read replicas serve read-only traffic, reducing load on the primary:
```hcl
resource "aws_db_instance" "read_replica" {
  identifier          = "${var.name}-replica-1"
  replicate_source_db = aws_db_instance.main.identifier
  instance_class      = "db.t3.medium"
  publicly_accessible = false
  skip_final_snapshot = true

  # Read replicas use async replication.
  # Small replication lag is normal (usually <1 second).
}
```
```python
# Application: use the replica for reads, the primary for writes
import os

from sqlalchemy import create_engine

WRITE_DB_URL = os.environ["DATABASE_URL"]      # primary
READ_DB_URL = os.environ["DATABASE_READ_URL"]  # replica

write_engine = create_engine(WRITE_DB_URL, pool_size=5, max_overflow=10)
read_engine = create_engine(READ_DB_URL, pool_size=10, max_overflow=20)

# Use read_engine for SELECT-heavy operations.
# Use write_engine for INSERT, UPDATE, DELETE.
```
Read replicas vs Multi-AZ:
- Multi-AZ is for availability (failover): the standby gets no traffic
- Read replicas are for performance (read scaling): they receive query traffic
- Read replicas can be promoted to standalone instances for disaster recovery
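Promotion to a standalone instance is a single API call. Below is a hedged boto3 sketch: the function and its parameters are illustrative, the client is injected so the AWS call is explicit, and `db_instance_available` is boto3's built-in RDS waiter.

```python
def promote_replica(rds_client, replica_id, wait=True):
    """Promote a read replica to a standalone instance (disaster recovery).

    After promotion the instance accepts writes and no longer replicates
    from the old primary; point your application's write traffic at it.
    """
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    if wait:
        # Block until the promoted instance is available again
        waiter = rds_client.get_waiter("db_instance_available")
        waiter.wait(DBInstanceIdentifier=replica_id)


# Usage (assumes AWS credentials are configured):
# import boto3
# promote_replica(boto3.client("rds", region_name="eu-west-1"), "myapp-replica-1")
```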
Parameter Groups
Parameter groups configure database engine settings:
```hcl
resource "aws_db_parameter_group" "main" {
  family = "postgres16"
  name   = "${var.name}-params"

  parameter {
    name         = "shared_buffers"
    value        = "{DBInstanceClassMemory/32768}" # 25% of RAM for buffer cache (value is in 8 kB pages)
    apply_method = "pending-reboot"                # static parameter
  }

  parameter {
    name  = "work_mem"
    value = "16384" # 16 MB (value in kB) per sort/hash operation
  }

  parameter {
    name  = "maintenance_work_mem"
    value = "524288" # 512 MB (value in kB) for VACUUM, CREATE INDEX
  }

  parameter {
    name  = "effective_cache_size"
    value = "{DBInstanceClassMemory/16384}" # hint to planner: ~50% of RAM is cacheable (8 kB pages)
  }

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries taking more than 1 second
  }

  parameter {
    name  = "log_lock_waits"
    value = "1" # log lock wait events
  }

  parameter {
    name         = "max_connections"
    value        = "200"
    apply_method = "pending-reboot" # static parameter
  }

  parameter {
    name  = "rds.force_ssl"
    value = "1" # require SSL connections
  }
}
```
Automated Backups and Point-in-Time Recovery
RDS takes daily automated snapshots and streams transaction logs continuously:
```bash
# Restore to a specific point in time (AWS CLI)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2025-03-10T08:30:00Z

# Create a manual snapshot (before a risky migration)
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-pre-migration-20250311
```
Best practice: take a manual snapshot before any destructive migration or major change.
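That pre-migration snapshot can be folded into a deploy script. A boto3 sketch (function names and the naming scheme are illustrative; `db_snapshot_available` is boto3's built-in RDS waiter):

```python
import datetime


def snapshot_identifier(db_id, now=None):
    """Build a dated snapshot name like mydb-pre-migration-20250311."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return f"{db_id}-pre-migration-{now:%Y%m%d}"


def create_pre_migration_snapshot(rds_client, db_id):
    """Take a manual snapshot and block until it is available."""
    snap_id = snapshot_identifier(db_id)
    rds_client.create_db_snapshot(
        DBInstanceIdentifier=db_id,
        DBSnapshotIdentifier=snap_id,
    )
    rds_client.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snap_id
    )
    return snap_id


# Usage (assumes AWS credentials are configured):
# import boto3
# create_pre_migration_snapshot(boto3.client("rds", region_name="eu-west-1"), "mydb")
```

Blocking on the waiter matters: running the migration before the snapshot finishes defeats the purpose.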
Secrets Manager Integration
Never hardcode database passwords. Use AWS Secrets Manager with automatic rotation:
```hcl
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "${var.name}/db-password"
  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = var.rotation_lambda_arn

  rotation_rules {
    automatically_after_days = 30
  }
}
```
```python
# Application fetches the password from Secrets Manager at startup
import json
from functools import lru_cache

import boto3


@lru_cache(maxsize=1)
def get_db_password() -> str:
    client = boto3.client("secretsmanager", region_name="eu-west-1")
    response = client.get_secret_value(SecretId="myapp/db-password")
    secret = json.loads(response["SecretString"])
    return secret["password"]
```
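One subtlety with caching the secret: when rotation runs, the cached password goes stale and new connections start failing to authenticate. A sketch of a cache that is invalidated on auth failure; the fetch and connect callables are injected (in practice, the Secrets Manager lookup above and your driver's connect), and `PermissionError` stands in for the driver's auth error.

```python
class SecretCache:
    """Cache a secret, refetching only after an explicit invalidation."""

    def __init__(self, fetch):
        self._fetch = fetch  # e.g. a Secrets Manager lookup
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._fetch()
        return self._value

    def invalidate(self):
        self._value = None


def connect_with_rotation_retry(connect, cache):
    """Try to connect; on an auth error, refetch the secret once and retry."""
    try:
        return connect(cache.get())
    except PermissionError:  # stand-in for your driver's auth error
        cache.invalidate()
        return connect(cache.get())
```

This pattern avoids restarting the application on every rotation while keeping Secrets Manager traffic to one call per rotation.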
Connection Pooling with RDS Proxy
RDS has a maximum connection limit. With many Lambda functions or app instances, you can exhaust connections quickly. RDS Proxy pools and multiplexes connections:
```hcl
resource "aws_db_proxy" "main" {
  name                   = "${var.name}-proxy"
  engine_family          = "POSTGRESQL"
  idle_client_timeout    = 1800
  require_tls            = true
  role_arn               = aws_iam_role.rds_proxy.arn
  vpc_security_group_ids = [aws_security_group.rds_proxy.id]
  vpc_subnet_ids         = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "REQUIRED"
    secret_arn  = aws_secretsmanager_secret.db_password.arn
  }
}
```
RDS Proxy is especially valuable for serverless workloads (Lambda) where connection thrashing is common.
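With `iam_auth = "REQUIRED"`, clients authenticate with short-lived IAM tokens instead of a stored password. A sketch using boto3's `generate_db_auth_token` (a real API that signs the token locally); the endpoint and user names are placeholders, and the client is injected so the sketch stays self-contained.

```python
def build_connect_kwargs(rds_client, host, user, port=5432):
    """Connection parameters for RDS Proxy with IAM auth.

    generate_db_auth_token produces a short-lived (15 min) token signed
    with the caller's IAM credentials; it makes no network call.
    """
    token = rds_client.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user
    )
    return {
        "host": host,
        "port": port,
        "user": user,
        "password": token,     # the token goes in the password field
        "sslmode": "require",  # the proxy sets require_tls = true
    }


# In a Lambda handler (placeholder proxy endpoint and user):
# import boto3, psycopg2
# kwargs = build_connect_kwargs(
#     boto3.client("rds"),
#     "myapp-proxy.proxy-abc123.eu-west-1.rds.amazonaws.com",
#     "app_user",
# )
# conn = psycopg2.connect(**kwargs)
```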
Monitoring and Alerting
```hcl
# CloudWatch alarm for high CPU
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "${var.name}-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 300 # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [var.sns_topic_arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }
}

# Also monitor:
# - FreeStorageSpace             (alert at <10 GB)
# - DatabaseConnections          (alert near max_connections)
# - ReadLatency / WriteLatency   (alert above 20 ms)
# - ReplicaLag                   (alert above 5 seconds for read replicas)
```
Cost Optimization
- Right-size instances: use Performance Insights to check whether your instance is over-provisioned. A db.t3.medium handles most small web apps.
- Reserved Instances: a 1-year reservation saves ~40%; 3-year saves ~60% vs on-demand.
- Aurora Serverless v2: for variable workloads; it scales to zero and bills per ACU-hour.
- Stop dev/test databases: RDS instances can be stopped for up to 7 days. Automate stopping non-production instances outside business hours.
- gp3 over gp2: gp3 gives 3,000 IOPS baseline (vs 100 for gp2) at lower cost.
- Delete old snapshots: automated snapshots expire with the retention period, but manual snapshots accumulate and incur storage costs until you delete them.
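The stop/start idea above is easy to automate with a scheduled Lambda. A sketch with the schedule logic kept pure and the boto3 client injected; the business hours and instance names are assumptions to adjust for your team, and `stop_db_instance` is the real RDS API call.

```python
import datetime


def outside_business_hours(now):
    """True on weekends or outside 08:00-19:00 UTC (adjust to taste)."""
    return now.weekday() >= 5 or not (8 <= now.hour < 19)


def stop_if_idle_window(rds_client, db_id, now=None):
    """Stop a non-production instance outside business hours.

    Note: RDS automatically restarts stopped instances after 7 days,
    so run this on a schedule rather than once.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    if outside_business_hours(now):
        rds_client.stop_db_instance(DBInstanceIdentifier=db_id)
        return True
    return False


# Usage from a scheduled Lambda (assumes AWS credentials):
# import boto3
# stop_if_idle_window(boto3.client("rds"), "myapp-dev")
```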
Common Interview Questions
Q: What is the difference between Multi-AZ and a read replica?
Multi-AZ maintains a synchronous standby in another AZ for automatic failover; it improves availability but does not serve traffic. A read replica uses asynchronous replication to a separate instance that serves read queries; it improves read performance but does not automatically fail over. Some teams use both: Multi-AZ for availability and read replicas for scaling.
Q: What is RDS Proxy and why would you use it with Lambda?
RDS Proxy is a fully managed database proxy that pools and shares database connections. Lambda functions are ephemeral: each invocation may open a new connection, and with thousands of concurrent invocations you can hit RDS connection limits quickly. RDS Proxy multiplexes many Lambda connections into a smaller pool of actual database connections, solving this problem without application changes.
Q: How does point-in-time recovery work in RDS?
RDS continuously streams transaction logs (WAL for PostgreSQL) to S3. Combined with daily automated snapshots, this lets you restore to any second within your retention window. RDS restores the latest snapshot before your target time, then replays the transaction logs up to the exact point you specified.
Practice on Froquiz
Cloud database knowledge is tested in backend and DevOps interviews. Explore our AWS and infrastructure quizzes on Froquiz to test your cloud knowledge.
Summary
- RDS manages patching, backups, failover, and monitoring; you manage schema and queries
- Multi-AZ = high availability (automatic failover in 60-120s); read replicas = read scalability
- Never expose RDS publicly: place it in private subnets and restrict security groups to app servers
- Parameter groups tune engine settings: log slow queries, set appropriate memory parameters
- Always take a manual snapshot before destructive migrations
- Use Secrets Manager for passwords with automatic rotation; never hardcode credentials
- RDS Proxy multiplexes connections, which is essential for Lambda and other ephemeral compute
- Cost: right-size instances, use Reserved Instances for steady workloads, gp3 over gp2