AWS RDS: Setup, Multi-AZ, Read Replicas, Parameter Groups and Cost Optimization
Amazon RDS (Relational Database Service) manages the operational overhead of running a relational database (provisioning, patching, backups, and failover) so you can focus on your application. But knowing how to configure it correctly for production workloads, security, and cost is where the real skill lies.
What RDS Manages for You
- Operating system and database engine patches
- Automated backups and point-in-time recovery
- Storage auto-scaling
- Multi-AZ failover
- Monitoring and alerting via CloudWatch
- Encryption at rest and in transit
What you still manage: schema design, query optimization, indexes, and connection management.
Creating an RDS Instance with Terraform
```hcl
# modules/rds/main.tf
resource "aws_db_subnet_group" "main" {
  name       = "${var.name}-subnet-group"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name = "${var.name}-subnet-group"
  }
}

resource "aws_security_group" "rds" {
  name   = "${var.name}-rds-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.app_security_group_ids # only app servers can connect
  }

  tags = {
    Name = "${var.name}-rds-sg"
  }
}

resource "aws_db_instance" "main" {
  identifier            = var.name
  engine                = "postgres"
  engine_version        = "16.1"
  instance_class        = var.instance_class
  allocated_storage     = var.storage_gb
  max_allocated_storage = var.storage_gb * 3 # auto-scaling ceiling

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password # use AWS Secrets Manager in production

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false # never expose to the internet

  # High availability
  multi_az = var.environment == "prod"

  # Backups
  backup_retention_period = var.environment == "prod" ? 7 : 1
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  # Performance
  parameter_group_name = aws_db_parameter_group.main.name
  storage_type         = "gp3"
  storage_encrypted    = true
  kms_key_id           = var.kms_key_arn

  # Monitoring
  monitoring_interval             = 60 # Enhanced Monitoring every 60s
  monitoring_role_arn             = var.monitoring_role_arn
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  performance_insights_enabled          = true
  performance_insights_retention_period = 7 # days

  # Deletion protection
  deletion_protection       = var.environment == "prod"
  skip_final_snapshot       = var.environment != "prod"
  final_snapshot_identifier = "${var.name}-final-snapshot"

  tags = {
    Environment = var.environment
  }
}
```
Multi-AZ for High Availability
Multi-AZ maintains a synchronous standby replica in a different Availability Zone. Failover is automatic and typically completes in 60-120 seconds.
```
AZ-1 (eu-west-1a)           AZ-2 (eu-west-1b)
+-----------------+         +-----------------+
|  Primary RDS    |-------->|  Standby RDS    |
|  (read/write)   |  sync   |  (no traffic)   |
+-----------------+         +-----------------+
        ^
        |
  DNS endpoint (stays the same after failover)
```
What triggers automatic failover:
- AZ outage
- Primary instance failure
- OS patching on the primary
- Instance class change
Your application connects via the RDS endpoint DNS name, which automatically points to the new primary after failover. No application code change is needed.
```bash
# RDS endpoint format
your-db.abc123xyz.eu-west-1.rds.amazonaws.com

# After failover: same DNS name, now points to the former standby
```
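The DNS name survives failover, but in-flight connections drop and new ones may fail for a minute or two. A small retry wrapper keeps the application resilient during that window. This is a sketch: the retry counts and delay are illustrative, and `ConnectionError` stands in for your driver's transient error (e.g. `psycopg2.OperationalError`).

```python
import time


def retry_on_failover(fn, retries=5, delay=2.0, exceptions=(ConnectionError,)):
    """Call fn, retrying with a fixed delay while the database fails over.

    `exceptions` should be your driver's transient connection errors;
    ConnectionError is used here only so the sketch is self-contained.
    """
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # wait for DNS to point at the new primary
```

In production you would also reopen the connection on each retry rather than reuse a dead one, and prefer exponential backoff with jitter over a fixed delay.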
Read Replicas for Read Scaling
Read replicas serve read-only traffic, reducing load on the primary:
```hcl
resource "aws_db_instance" "read_replica" {
  identifier          = "${var.name}-replica-1"
  replicate_source_db = aws_db_instance.main.identifier
  instance_class      = "db.t3.medium"
  publicly_accessible = false
  skip_final_snapshot = true

  # Read replicas use async replication.
  # Small replication lag is normal (usually <1 second).
}
```
```python
# Application: use the replica for reads, the primary for writes
import os

from sqlalchemy import create_engine

WRITE_DB_URL = os.environ["DATABASE_URL"]      # primary
READ_DB_URL = os.environ["DATABASE_READ_URL"]  # replica

write_engine = create_engine(WRITE_DB_URL, pool_size=5, max_overflow=10)
read_engine = create_engine(READ_DB_URL, pool_size=10, max_overflow=20)

# Use read_engine for SELECT-heavy operations.
# Use write_engine for INSERT, UPDATE, DELETE.
```
Read replicas vs Multi-AZ:
- Multi-AZ is for availability (failover): the standby gets no traffic
- Read replicas are for performance (read scaling): they receive query traffic
- Read replicas can be promoted to standalone instances for disaster recovery
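Promotion to a standalone instance is a single API call. Below is a hedged boto3 sketch: the function and its parameters are illustrative, the client is injected so the AWS call is explicit, and `db_instance_available` is boto3's built-in RDS waiter.

```python
def promote_replica(rds_client, replica_id, wait=True):
    """Promote a read replica to a standalone instance (disaster recovery).

    After promotion the instance accepts writes and no longer replicates
    from the old primary; point your application's write traffic at it.
    """
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    if wait:
        # Block until the promoted instance is available again
        waiter = rds_client.get_waiter("db_instance_available")
        waiter.wait(DBInstanceIdentifier=replica_id)


# Usage (assumes AWS credentials are configured):
# import boto3
# promote_replica(boto3.client("rds", region_name="eu-west-1"), "myapp-replica-1")
```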
Parameter Groups
Parameter groups configure database engine settings:
```hcl
resource "aws_db_parameter_group" "main" {
  family = "postgres16"
  name   = "${var.name}-params"

  parameter {
    name         = "shared_buffers"
    value        = "{DBInstanceClassMemory/32768}" # 25% of RAM for buffer cache (value is in 8 kB pages)
    apply_method = "pending-reboot"                # static parameter
  }

  parameter {
    name  = "work_mem"
    value = "16384" # 16 MB (value in kB) per sort/hash operation
  }

  parameter {
    name  = "maintenance_work_mem"
    value = "524288" # 512 MB (value in kB) for VACUUM, CREATE INDEX
  }

  parameter {
    name  = "effective_cache_size"
    value = "{DBInstanceClassMemory/16384}" # hint to planner: ~50% of RAM is cacheable (8 kB pages)
  }

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries taking more than 1 second
  }

  parameter {
    name  = "log_lock_waits"
    value = "1" # log lock wait events
  }

  parameter {
    name         = "max_connections"
    value        = "200"
    apply_method = "pending-reboot" # static parameter
  }

  parameter {
    name  = "rds.force_ssl"
    value = "1" # require SSL connections
  }
}
```
Automated Backups and Point-in-Time Recovery
RDS takes daily automated snapshots and streams transaction logs continuously:
```bash
# Restore to a specific point in time (AWS CLI)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2025-03-10T08:30:00Z

# Create a manual snapshot (before a risky migration)
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-pre-migration-20250311
```
Best practice: take a manual snapshot before any destructive migration or major change.
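That pre-migration snapshot can be folded into a deploy script. A boto3 sketch (function names and the naming scheme are illustrative; `db_snapshot_available` is boto3's built-in RDS waiter):

```python
import datetime


def snapshot_identifier(db_id, now=None):
    """Build a dated snapshot name like mydb-pre-migration-20250311."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return f"{db_id}-pre-migration-{now:%Y%m%d}"


def create_pre_migration_snapshot(rds_client, db_id):
    """Take a manual snapshot and block until it is available."""
    snap_id = snapshot_identifier(db_id)
    rds_client.create_db_snapshot(
        DBInstanceIdentifier=db_id,
        DBSnapshotIdentifier=snap_id,
    )
    rds_client.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snap_id
    )
    return snap_id


# Usage (assumes AWS credentials are configured):
# import boto3
# create_pre_migration_snapshot(boto3.client("rds", region_name="eu-west-1"), "mydb")
```

Blocking on the waiter matters: running the migration before the snapshot finishes defeats the purpose.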
Secrets Manager Integration
Never hardcode database passwords. Use AWS Secrets Manager with automatic rotation:
```hcl
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "${var.name}/db-password"
  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = var.rotation_lambda_arn

  rotation_rules {
    automatically_after_days = 30
  }
}
```
```python
# Application fetches the password from Secrets Manager at startup
import json
from functools import lru_cache

import boto3


@lru_cache(maxsize=1)
def get_db_password() -> str:
    client = boto3.client("secretsmanager", region_name="eu-west-1")
    response = client.get_secret_value(SecretId="myapp/db-password")
    secret = json.loads(response["SecretString"])
    return secret["password"]
```
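One subtlety with caching the secret: when rotation runs, the cached password goes stale and new connections start failing to authenticate. A sketch of a cache that is invalidated on auth failure; the fetch and connect callables are injected (in practice, the Secrets Manager lookup above and your driver's connect), and `PermissionError` stands in for the driver's auth error.

```python
class SecretCache:
    """Cache a secret, refetching only after an explicit invalidation."""

    def __init__(self, fetch):
        self._fetch = fetch  # e.g. a Secrets Manager lookup
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._fetch()
        return self._value

    def invalidate(self):
        self._value = None


def connect_with_rotation_retry(connect, cache):
    """Try to connect; on an auth error, refetch the secret once and retry."""
    try:
        return connect(cache.get())
    except PermissionError:  # stand-in for your driver's auth error
        cache.invalidate()
        return connect(cache.get())
```

This pattern avoids restarting the application on every rotation while keeping Secrets Manager traffic to one call per rotation.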
Connection Pooling with RDS Proxy
RDS has a maximum connection limit. With many Lambda functions or app instances, you can exhaust connections quickly. RDS Proxy pools and multiplexes connections:
```hcl
resource "aws_db_proxy" "main" {
  name                   = "${var.name}-proxy"
  engine_family          = "POSTGRESQL"
  idle_client_timeout    = 1800
  require_tls            = true
  role_arn               = aws_iam_role.rds_proxy.arn
  vpc_security_group_ids = [aws_security_group.rds_proxy.id]
  vpc_subnet_ids         = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "REQUIRED"
    secret_arn  = aws_secretsmanager_secret.db_password.arn
  }
}
```
RDS Proxy is especially valuable for serverless workloads (Lambda) where connection thrashing is common.
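With `iam_auth = "REQUIRED"`, clients authenticate with short-lived IAM tokens instead of a stored password. A sketch using boto3's `generate_db_auth_token` (a real API that signs the token locally); the endpoint and user names are placeholders, and the client is injected so the sketch stays self-contained.

```python
def build_connect_kwargs(rds_client, host, user, port=5432):
    """Connection parameters for RDS Proxy with IAM auth.

    generate_db_auth_token produces a short-lived (15 min) token signed
    with the caller's IAM credentials; it makes no network call.
    """
    token = rds_client.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user
    )
    return {
        "host": host,
        "port": port,
        "user": user,
        "password": token,     # the token goes in the password field
        "sslmode": "require",  # the proxy sets require_tls = true
    }


# In a Lambda handler (placeholder proxy endpoint and user):
# import boto3, psycopg2
# kwargs = build_connect_kwargs(
#     boto3.client("rds"),
#     "myapp-proxy.proxy-abc123.eu-west-1.rds.amazonaws.com",
#     "app_user",
# )
# conn = psycopg2.connect(**kwargs)
```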
Monitoring and Alerting
```hcl
# CloudWatch alarm for high CPU
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "${var.name}-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 300 # 5 minutes
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [var.sns_topic_arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }
}

# Also monitor:
# - FreeStorageSpace             (alert at <10 GB)
# - DatabaseConnections          (alert near max_connections)
# - ReadLatency / WriteLatency   (alert above 20 ms)
# - ReplicaLag                   (alert above 5 seconds for read replicas)
```
Cost Optimization
- Right-size instances: use Performance Insights to check whether your instance is over-provisioned. A db.t3.medium handles most small web apps.
- Reserved Instances: a 1-year reservation saves ~40%; 3-year saves ~60% vs on-demand.
- Aurora Serverless v2: for variable workloads; it scales to zero and bills per ACU-hour.
- Stop dev/test databases: RDS instances can be stopped for up to 7 days. Automate stopping non-production instances outside business hours.
- gp3 over gp2: gp3 gives 3,000 IOPS baseline (vs 100 for gp2) at lower cost.
- Delete old snapshots: automated snapshots expire with the retention period, but manual snapshots accumulate and incur storage costs until you delete them.
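The stop/start idea above is easy to automate with a scheduled Lambda. A sketch with the schedule logic kept pure and the boto3 client injected; the business hours and instance names are assumptions to adjust for your team, and `stop_db_instance` is the real RDS API call.

```python
import datetime


def outside_business_hours(now):
    """True on weekends or outside 08:00-19:00 UTC (adjust to taste)."""
    return now.weekday() >= 5 or not (8 <= now.hour < 19)


def stop_if_idle_window(rds_client, db_id, now=None):
    """Stop a non-production instance outside business hours.

    Note: RDS automatically restarts stopped instances after 7 days,
    so run this on a schedule rather than once.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    if outside_business_hours(now):
        rds_client.stop_db_instance(DBInstanceIdentifier=db_id)
        return True
    return False


# Usage from a scheduled Lambda (assumes AWS credentials):
# import boto3
# stop_if_idle_window(boto3.client("rds"), "myapp-dev")
```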
Common Interview Questions
Q: What is the difference between Multi-AZ and a read replica?
Multi-AZ maintains a synchronous standby in another AZ for automatic failover; it improves availability but does not serve traffic. A read replica uses asynchronous replication to a separate instance that serves read queries; it improves read performance but does not automatically fail over. Some teams use both: Multi-AZ for availability and read replicas for scaling.
Q: What is RDS Proxy and why would you use it with Lambda?
RDS Proxy is a fully managed database proxy that pools and shares database connections. Lambda functions are ephemeral: each invocation may open a new connection, and with thousands of concurrent invocations you can hit RDS connection limits quickly. RDS Proxy multiplexes many Lambda connections into a smaller pool of actual database connections, solving this problem without application changes.
Q: How does point-in-time recovery work in RDS?
RDS continuously streams transaction logs (WAL for PostgreSQL) to S3. Combined with daily automated snapshots, this lets you restore to any second within your retention window. RDS restores the latest snapshot before your target time, then replays the transaction logs up to the exact point you specified.
Practice on Froquiz
Cloud database knowledge is tested in backend and DevOps interviews. Explore our AWS and infrastructure quizzes on Froquiz to test your cloud knowledge.
Summary
- RDS manages patching, backups, failover, and monitoring; you manage schema and queries
- Multi-AZ = high availability (automatic failover in 60-120s); read replicas = read scalability
- Never expose RDS publicly: place it in private subnets and restrict security groups to app servers
- Parameter groups tune engine settings: log slow queries, set appropriate memory parameters
- Always take a manual snapshot before destructive migrations
- Use Secrets Manager for passwords with automatic rotation; never hardcode credentials
- RDS Proxy multiplexes connections, which is essential for Lambda and other ephemeral compute
- Cost: right-size instances, use Reserved Instances for steady workloads, gp3 over gp2