Scaling Rails with PostgreSQL Read Replicas: Part 3 - Production Excellence
In Part 1 and Part 2, we covered setup and advanced patterns. Now let’s focus on what happens when you actually deploy read replicas to production.
This part covers the operational excellence required for production systems: how to deploy without downtime, monitor effectively, handle failures gracefully, and optimize performance based on real-world usage. These are the hard-won lessons from running read replicas at scale.
What’s Covered
- Zero-Downtime Deployment Strategy
- Comprehensive Monitoring
- Performance Optimization
- Disaster Recovery
- Production Checklist
Zero-Downtime Deployment Strategy
Adding read replicas to an existing production application requires careful planning. Here’s a battle-tested approach:
Step 1: Start with 0% Traffic
Use the gradual rollout strategy from Part 2, starting with 0% traffic to replicas (shadow mode):
# app/services/replica_rollout.rb
class ReplicaRollout
  ROLLOUT_PERCENTAGES = {
    analytics_queries: 0, # Start at 0%
    search_queries: 0,    # Start at 0%
    user_profiles: 0,     # Start at 0%
    default: 0            # Everything uses primary
  }.freeze

  # ... rest of implementation from Part 2
end
During this phase:
- Deploy replica infrastructure - Ensure replicas are receiving data
- Monitor replication lag - Verify replicas stay in sync
- Test with read-only users - Have internal team members manually test
- Collect baseline metrics - CPU, memory, query performance on replicas
# app/jobs/replica_health_validator_job.rb
class ReplicaHealthValidatorJob < ApplicationJob
  def perform
    # Check all replicas are healthy before increasing traffic
    replicas = [:primary_replica, :primary_replica_2]

    health_checks = replicas.map do |replica|
      {
        name: replica,
        lag: check_replication_lag(replica),
        connections: check_connection_count(replica),
        query_success: test_query(replica)
      }
    end

    if health_checks.all? { |check| check[:lag] < 5.seconds && check[:query_success] }
      Rails.logger.info "All replicas healthy, ready for traffic"
      StatsD.event("replicas.ready_for_traffic", "All health checks passed")
    else
      AlertService.notify("Replicas not ready", health_checks)
    end
  end

  private

  def check_replication_lag(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      result = ApplicationRecord.connection.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag"
      ).first
      result['lag'].to_f.seconds
    end
  rescue => e
    Rails.logger.error("Lag check failed for #{replica}: #{e.message}")
    Float::INFINITY
  end

  def check_connection_count(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      ApplicationRecord.connection.execute(
        "SELECT COUNT(*) AS count FROM pg_stat_activity WHERE pid <> pg_backend_pid()"
      ).first['count'].to_i
    end
  rescue
    -1
  end

  def test_query(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      ApplicationRecord.connection.execute("SELECT 1")
      true
    end
  rescue
    false
  end
end
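The validator only helps if it runs on a cadence during the rollout window. A minimal sketch, assuming Sidekiq with the sidekiq-cron gem (the initializer name is made up; swap in whatever scheduler you already run):

# config/initializers/replica_health_schedule.rb (hypothetical file)
# Enqueue the validator every five minutes in production via sidekiq-cron.
if Rails.env.production? && defined?(Sidekiq::Cron::Job)
  Sidekiq::Cron::Job.create(
    name: "replica_health_validation_every_5_min",
    cron: "*/5 * * * *",
    class: "ReplicaHealthValidatorJob"
  )
end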
Step 2: Gradual Traffic Migration
Once shadow mode shows replicas are healthy, gradually migrate traffic:
# app/services/replica_traffic_manager.rb
class ReplicaTrafficManager
  MIGRATION_SCHEDULE = [
    { percentage: 0, duration: 1.day },    # Shadow mode
    { percentage: 5, duration: 1.hour },   # 5% of reads
    { percentage: 10, duration: 2.hours }, # Monitor closely
    { percentage: 25, duration: 4.hours },
    { percentage: 50, duration: 1.day },
    { percentage: 75, duration: 1.day },
    { percentage: 100, duration: nil }     # Full migration
  ].freeze

  def self.current_percentage
    migration_start = Rails.cache.read('replica_migration_start') || Time.current
    elapsed = Time.current - migration_start

    MIGRATION_SCHEDULE.each do |phase|
      return phase[:percentage] if phase[:duration].nil? || elapsed < phase[:duration]

      elapsed -= phase[:duration]
    end

    100 # Full migration complete
  end

  def self.should_use_replica?(feature: :default)
    percentage = current_percentage

    # Critical features migrate last
    percentage = [percentage - 20, 0].max if [:payments, :auth].include?(feature)

    rand(100) < percentage
  end

  def self.with_traffic_management(feature: :default, &block)
    if should_use_replica?(feature: feature) && ReplicaHealthCheck.healthy?
      ApplicationRecord.connected_to(role: :reading, &block)
    else
      yield
    end
  rescue => e
    # Always fallback to primary on errors
    ErrorTracker.track(e, context: { feature: feature })
    yield
  end
end
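As a usage sketch (the controller and query here are illustrative, not part of the earlier setup), the feature label ties each read to its rollout bucket:

# Hypothetical call site for the traffic manager.
class SearchController < ApplicationController
  def index
    @results = ReplicaTrafficManager.with_traffic_management(feature: :search_queries) do
      # .to_a forces the query to run while still inside the chosen connection block.
      Product.search(params[:q]).limit(50).to_a
    end
  end
end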
Step 3: Automated Rollback
Build automatic rollback when issues arise:
# app/services/replica_circuit_breaker.rb
class ReplicaCircuitBreaker
  FAILURE_THRESHOLD = 5
  TIMEOUT_THRESHOLD = 10.seconds
  RECOVERY_TIMEOUT = 1.minute

  class << self
    def call(&block)
      return yield if circuit_open?

      execute_with_breaker(&block)
    end

    private

    def execute_with_breaker(&block)
      start_time = Time.current
      result = ApplicationRecord.connected_to(role: :reading, &block)

      # Reset failures on success
      reset_failure_count
      result
    rescue => e
      record_failure(e, Time.current - start_time)

      # Fallback to primary
      yield
    end

    def circuit_open?
      return false unless last_failure_time

      # Check if we're still in recovery period
      Time.current - last_failure_time < RECOVERY_TIMEOUT
    end

    def record_failure(error, duration)
      increment_failure_count

      if failure_count >= FAILURE_THRESHOLD || duration > TIMEOUT_THRESHOLD
        open_circuit
        AlertService.critical(
          "Replica circuit breaker opened",
          error: error.message,
          failure_count: failure_count,
          duration: duration
        )
      end
    end

    def open_circuit
      Rails.cache.write('replica:circuit:open', true, expires_in: RECOVERY_TIMEOUT)
      Rails.cache.write('replica:circuit:opened_at', Time.current)
    end

    def failure_count
      Rails.cache.read('replica:circuit:failures').to_i
    end

    def increment_failure_count
      Rails.cache.increment('replica:circuit:failures', 1, expires_in: 5.minutes)
    end

    def reset_failure_count
      Rails.cache.delete('replica:circuit:failures')
    end

    def last_failure_time
      Rails.cache.read('replica:circuit:opened_at')
    end
  end
end
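A quick usage sketch (the Order query is illustrative): any read that can tolerate a primary fallback goes through the breaker instead of calling connected_to directly.

# Hypothetical call site for the circuit breaker.
recent_orders = ReplicaCircuitBreaker.call do
  # .to_a forces execution while still inside the selected connection block.
  Order.where(created_at: 1.hour.ago..).to_a
end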
Comprehensive Monitoring
Production read replicas need monitoring at multiple levels. Most teams already use APM tools like New Relic, Datadog, or AppSignal; leverage these instead of building custom monitoring.
APM Integration
Modern APM tools automatically track database metrics, but you need to ensure they distinguish between primary and replica queries:
# config/initializers/datadog.rb (if using DataDog)
Datadog.configure do |c|
  c.tracing.instrument :active_record, service_name: 'postgres' do |config|
    # Tag queries by database role
    config.on_query do |span, event|
      connection = event.payload[:connection]
      role = connection.pool.db_config.configuration_hash[:replica] ? 'replica' : 'primary'

      span.set_tag('db.role', role)
      span.set_tag('db.connection_name', connection.pool.db_config.name)
    end
  end
end

# For New Relic
# config/newrelic.yml
# Enable database query analysis to see replica vs primary distribution
Cloud Provider Monitoring
If using managed databases, leverage their built-in monitoring:
AWS RDS:
# CloudWatch already tracks these for RDS read replicas:
#   - ReplicaLag
#   - ReadIOPS / WriteIOPS
#   - DatabaseConnections
#   - CPUUtilization per replica

# Just add CloudWatch alarms:
#   Alarm: ReplicaLag > 5 (the metric is reported in seconds)
#   Alarm: DatabaseConnections > 80% of max_connections
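If you manage alarms from Ruby rather than the console or Terraform, a minimal sketch with the aws-sdk-cloudwatch gem (the region, instance identifier, and SNS topic ARN are placeholders):

# Hypothetical one-off script or rake task that creates the ReplicaLag alarm.
require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")
cloudwatch.put_metric_alarm(
  alarm_name: "rds-replica-lag-high",
  namespace: "AWS/RDS",
  metric_name: "ReplicaLag",
  dimensions: [{ name: "DBInstanceIdentifier", value: "my-app-replica-1" }],
  statistic: "Average",
  period: 60,              # seconds per datapoint
  evaluation_periods: 5,   # five consecutive minutes over the threshold
  threshold: 5,            # ReplicaLag is reported in seconds
  comparison_operator: "GreaterThanThreshold",
  alarm_actions: ["arn:aws:sns:us-east-1:123456789012:db-alerts"]
)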
Google Cloud SQL / Azure Database: Similar built-in metrics available through their monitoring services.
Custom Metrics for Business Logic
While APM tools handle infrastructure metrics, you still need application-specific monitoring:
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  around_action :track_replica_usage

  private

  def track_replica_usage
    replica_used = false

    # Hook into ActiveRecord to detect replica usage
    subscriber = ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      connection = event.payload[:connection]

      if connection&.pool&.db_config&.configuration_hash&.dig(:replica)
        replica_used = true
      end
    end

    yield

    # Send to your APM
    if defined?(Datadog)
      Datadog::Tracing.active_trace&.set_tag('replica.used', replica_used)
    elsif defined?(NewRelic)
      NewRelic::Agent.add_custom_attributes(replica_used: replica_used)
    end
  ensure
    ActiveSupport::Notifications.unsubscribe(subscriber)
  end
end
Key Metrics to Monitor
Configure your APM dashboard to track:
- Infrastructure (from CloudWatch/APM):
  - Replication lag per replica
  - Connection count by database role
  - Query response time P50/P95/P99 by role
  - Error rate by database
- Application (from APM custom metrics):
  - % of requests using replicas
  - Cache hit rate (higher replica usage should increase this)
  - Replica fallback rate (when replicas fail)
- Business Impact:
  - Page load time before/after replica adoption
  - API response times by endpoint
  - Background job duration changes
# app/services/replica_monitor.rb
class ReplicaMonitor
  class << self
    def check_all
      {
        replication_lag: check_replication_lag,
        connection_stats: check_connections,
        query_performance: check_query_performance,
        disk_usage: check_disk_usage,
        long_running_queries: check_long_queries
      }
    end

    private

    def check_replication_lag
      # Select the replica configurations for the current environment
      replica_configs = ApplicationRecord.configurations
                                         .configs_for(env_name: Rails.env, include_hidden: true)
                                         .select(&:replica?)

      replica_configs.map do |config|
        ApplicationRecord.connected_to(role: :reading, shard: config.name.to_sym) do
          lag_query = <<-SQL
            SELECT
              pg_is_in_recovery() as is_replica,
              pg_last_wal_receive_lsn() as receive_lsn,
              pg_last_wal_replay_lsn() as replay_lsn,
              EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag_seconds,
              pg_last_xact_replay_timestamp() as last_replay_time
          SQL

          result = ApplicationRecord.connection.exec_query(lag_query).first

          {
            name: config.name,
            lag_seconds: result['lag_seconds']&.to_f || 0,
            lag_bytes: calculate_lag_bytes(result['receive_lsn'], result['replay_lsn']),
            last_replay: result['last_replay_time'],
            healthy: (result['lag_seconds']&.to_f || 0) < 5.0
          }
        end
      end
    end

    def check_connections
      ApplicationRecord.connected_to(role: :reading) do
        stats = ApplicationRecord.connection.exec_query(<<-SQL).first
          SELECT
            COUNT(*) FILTER (WHERE state = 'active') as active_connections,
            COUNT(*) FILTER (WHERE state = 'idle') as idle_connections,
            COUNT(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
            COUNT(*) as total_connections,
            MAX(EXTRACT(EPOCH FROM (now() - state_change))) as longest_connection_seconds
          FROM pg_stat_activity
          WHERE pid <> pg_backend_pid()
        SQL

        {
          active: stats['active_connections'],
          idle: stats['idle_connections'],
          idle_in_transaction: stats['idle_in_transaction'],
          total: stats['total_connections'],
          longest_duration: stats['longest_connection_seconds']&.to_f,
          pool_exhaustion_risk: stats['total_connections'] > 80
        }
      end
    end

    def check_query_performance
      ApplicationRecord.connected_to(role: :reading) do
        # Requires the pg_stat_statements extension.
        # On PostgreSQL 13+ the columns are total_exec_time / mean_exec_time / etc.
        slow_queries = ApplicationRecord.connection.exec_query(<<-SQL)
          SELECT
            query,
            calls,
            total_time,
            mean_time,
            max_time,
            stddev_time
          FROM pg_stat_statements
          WHERE query NOT LIKE '%pg_stat_statements%'
          ORDER BY mean_time DESC
          LIMIT 10
        SQL

        slow_queries.map(&:to_h)
      end
    end

    def check_disk_usage
      ApplicationRecord.connected_to(role: :reading) do
        ApplicationRecord.connection.exec_query(
          "SELECT pg_size_pretty(pg_database_size(current_database())) as size"
        ).first['size']
      end
    end

    def check_long_queries
      ApplicationRecord.connected_to(role: :reading) do
        ApplicationRecord.connection.exec_query(<<-SQL).map(&:to_h)
          SELECT pid, state, now() - query_start AS duration, query
          FROM pg_stat_activity
          WHERE state = 'active'
            AND now() - query_start > interval '30 seconds'
            AND pid <> pg_backend_pid()
          ORDER BY duration DESC
        SQL
      end
    end

    def calculate_lag_bytes(receive_lsn, replay_lsn)
      return 0 unless receive_lsn && replay_lsn

      # PostgreSQL LSN format: XXXXXXXX/YYYYYYYY
      receive_bytes = lsn_to_bytes(receive_lsn)
      replay_bytes = lsn_to_bytes(replay_lsn)

      receive_bytes - replay_bytes
    end

    def lsn_to_bytes(lsn)
      high, low = lsn.split('/').map { |x| x.to_i(16) }
      (high << 32) + low
    end
  end
end
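To get these numbers somewhere useful, one option is a small recurring job that pushes them to StatsD next to the APM metrics above. A sketch (the job name and metric name are made up here):

# app/jobs/replica_monitor_job.rb (hypothetical wiring)
class ReplicaMonitorJob < ApplicationJob
  queue_as :monitoring

  def perform
    # Report lag per replica as a gauge; dashboards and alerts hang off this.
    ReplicaMonitor.check_all[:replication_lag].each do |replica|
      StatsD.gauge(
        "db.replica.lag_seconds",
        replica[:lag_seconds],
        tags: ["replica:#{replica[:name]}"]
      )
    end
  end
end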
Application-Level Monitoring
Track how your application uses replicas:
# config/initializers/replica_instrumentation.rb
def extract_operation(sql)
  case sql
  when /^SELECT/i then 'select'
  when /^INSERT/i then 'insert'
  when /^UPDATE/i then 'update'
  when /^DELETE/i then 'delete'
  else 'other'
  end
end

ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  # Determine which connection was used
  connection_name = event.payload[:connection]&.pool&.db_config&.name || 'unknown'
  role = event.payload[:connection]&.pool&.db_config&.configuration_hash&.dig(:replica) ? 'replica' : 'primary'

  # Track metrics
  tags = {
    role: role,
    connection: connection_name,
    operation: extract_operation(event.payload[:sql])
  }

  StatsD.histogram('db.query.duration', event.duration, tags: tags)
  StatsD.increment('db.query.count', tags: tags)

  # Alert on slow replica queries
  if role == 'replica' && event.duration > 1000
    Rails.logger.warn(
      "Slow replica query (#{event.duration.round}ms) on #{connection_name}: " \
      "#{event.payload[:sql].truncate(200)}"
    )
  end
end
Performance Optimization
1. Query Optimization for Replicas
Some queries perform differently on replicas:
# app/models/concerns/replica_optimized.rb
module ReplicaOptimized
  extend ActiveSupport::Concern

  class_methods do
    def replica_optimized_scope(name, &block)
      # Define two versions of the scope
      scope :"#{name}_primary", block
      scope :"#{name}_replica", block

      # Smart scope that picks the right version
      define_singleton_method(name) do |*args|
        if ApplicationRecord.current_role == :reading
          send("#{name}_replica", *args)
        else
          send("#{name}_primary", *args)
        end
      end
    end
  end
end

class Product < ApplicationRecord
  include ReplicaOptimized

  # Regular scope for primary
  replica_optimized_scope :search do |term|
    where("name ILIKE ?", "%#{term}%")
  end

  # Override for replica with better performance
  def self.search_replica(term)
    # Use full-text search on replica (assuming GIN index)
    where("search_vector @@ plainto_tsquery('english', ?)", term)
  end

  # Materialized view only on replica
  def self.top_selling_replica
    # This materialized view is refreshed hourly on replicas
    connection.exec_query(
      "SELECT * FROM top_selling_products_mv LIMIT 100"
    )
  end
end
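The payoff is that call sites stay identical; only the connection role decides which implementation runs. A small illustrative sketch:

# Same call, different query depending on the current role.
Product.search("coffee")                    # primary: ILIKE scan
ApplicationRecord.connected_to(role: :reading) do
  Product.search("coffee").limit(20).to_a   # replica: full-text search
end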
2. Connection Pool Tuning
Optimize pools based on actual usage:
# app/services/connection_pool_optimizer.rb
class ConnectionPoolOptimizer
  def self.tune_pools
    # Inspect every pool the handler knows about (primary and replicas)
    ApplicationRecord.connection_handler.connection_pool_list.each do |pool|
      config = pool.db_config
      stats = pool_statistics(pool)

      if stats[:wait_count] > 10
        # Pool is too small
        recommendation = (pool.size * 1.5).to_i
        Rails.logger.info(
          "Pool #{config.name} should be increased to #{recommendation} " \
          "(current size: #{pool.size}, waiting threads: #{stats[:wait_count]})"
        )
      elsif stats[:usage_ratio] < 0.3
        # Pool is too large
        recommendation = [(pool.size * 0.7).to_i, 5].max
        Rails.logger.info(
          "Pool #{config.name} could be reduced to #{recommendation} " \
          "(current size: #{pool.size}, usage ratio: #{stats[:usage_ratio].round(2)})"
        )
      end
    end
  end

  def self.pool_statistics(pool)
    # ConnectionPool#stat returns size/connections/busy/dead/idle/waiting counts
    stat = pool.stat

    {
      size: stat[:size],
      connections: stat[:connections],
      busy: stat[:busy],
      dead: stat[:dead],
      wait_count: stat[:waiting],
      usage_ratio: stat[:busy].to_f / stat[:size]
    }
  end
end

# app/jobs/connection_pool_monitor_job.rb
class ConnectionPoolMonitorJob < ApplicationJob
  queue_as :monitoring

  def perform
    ConnectionPoolOptimizer.tune_pools

    # Schedule next check
    self.class.set(wait: 5.minutes).perform_later
  end
end

# Start monitoring (in an initializer or deploy task)
ConnectionPoolMonitorJob.perform_later if Rails.env.production?
3. Multi-Region Optimization
For global applications, use region-aware routing:
# app/services/region_aware_router.rb
class RegionAwareRouter
  REGION_REPLICAS = {
    # Map your regions to database configurations
    # 'region_key' => :database_config_name
    # Example:
    # 'us-east' => :primary_replica_us_east,
    # 'eu' => :primary_replica_eu,
  }.freeze

  def self.nearest_replica(request)
    region = detect_region(request)
    REGION_REPLICAS[region] || :primary_replica # Fallback to default replica
  end

  def self.with_nearest_replica(request, &block)
    replica = nearest_replica(request)

    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      yield
    end
  rescue => e
    # Critical: Always fallback to a working replica
    Rails.logger.warn("Region replica failed: #{replica}, error: #{e.message}")
    StatsD.increment('replica.region_fallback', tags: ["region:#{replica}"])

    # Fallback strategy - could be primary or another region
    ApplicationRecord.connected_to(role: :reading, &block)
  end

  private

  def self.detect_region(request)
    # Implement based on your infrastructure:
    #
    # Option 1: CDN/Proxy headers
    #   request.headers['YOUR-CDN-REGION-HEADER']
    # Option 2: Load balancer headers
    #   request.headers['X-AWS-REGION'] or custom headers
    # Option 3: GeoIP lookup (implement your preferred service)
    #   GeoIP.lookup(request.remote_ip).region_code
    # Option 4: User preference/account setting
    #   current_user&.preferred_region
    #
    # Implement your detection logic here
    # Return a key that matches REGION_REPLICAS
  end
end

# Usage pattern - apply selectively
class ApplicationController < ActionController::Base
  # Only use for read-heavy, latency-sensitive endpoints
  def with_regional_replica(&block)
    if should_use_regional_routing?
      RegionAwareRouter.with_nearest_replica(request, &block)
    else
      yield # Use default routing
    end
  end

  private

  def should_use_regional_routing?
    # Implement your logic:
    # - Feature flag controlled
    # - Only for certain controllers/actions
    # - Only for premium users
    # - etc.
  end
end

# Example usage in specific controllers
class ApiController < ApplicationController
  def index
    with_regional_replica do
      @data = YourModel.complex_read_query
    end
  end
end
Disaster Recovery
Automatic Failover
Handle replica failures gracefully:
# app/services/replica_failover_manager.rb
class ReplicaFailoverManager
  class << self
    def healthy_replicas
      @healthy_replicas ||= {}
    end

    def mark_unhealthy(replica_name, error)
      healthy_replicas[replica_name] = {
        healthy: false,
        error: error.message,
        failed_at: Time.current
      }

      AlertService.notify(
        "Replica marked unhealthy: #{replica_name}",
        error: error.message
      )
    end

    def check_replica_health(replica_name, force: false)
      # Skip replicas already marked unhealthy unless we're explicitly re-checking
      return false if !force && healthy_replicas[replica_name]&.dig(:healthy) == false

      ApplicationRecord.connected_to(role: :reading, shard: replica_name) do
        ApplicationRecord.connection.exec_query("SELECT 1")
        true
      end
    rescue => e
      mark_unhealthy(replica_name, e)
      false
    end

    def with_failover(&block)
      available_replicas = [:primary_replica, :primary_replica_2, :primary_replica_3]

      available_replicas.each do |replica|
        next unless check_replica_health(replica)

        begin
          return ApplicationRecord.connected_to(role: :reading, shard: replica, &block)
        rescue => e
          Rails.logger.error("Replica #{replica} failed during query: #{e.message}")
          mark_unhealthy(replica, e)
        end
      end

      # All replicas failed, use primary
      Rails.logger.warn("All replicas unhealthy, falling back to primary")
      yield
    end
  end

  # Background health checker job
  class HealthCheckerJob < ApplicationJob
    queue_as :monitoring

    def perform
      ReplicaFailoverManager.healthy_replicas.each do |replica_name, status|
        next if status[:healthy]

        # Try to recover unhealthy replicas (force a real check past the cached status)
        if Time.current - status[:failed_at] > 5.minutes
          if ReplicaFailoverManager.check_replica_health(replica_name, force: true)
            ReplicaFailoverManager.healthy_replicas[replica_name] = { healthy: true }
            AlertService.notify("Replica recovered: #{replica_name}")
          end
        end
      end

      # Schedule next check
      self.class.set(wait: 30.seconds).perform_later
    end
  end
end
Production Checklist
Before going live with read replicas:
- Monitoring
  - Replication lag alerts (< 5 seconds)
  - Connection pool monitoring
  - Query performance tracking
  - Error rate monitoring
- Testing
  - Load testing with realistic read/write ratios
  - Failover testing (see the sketch after this list)
  - Replication lag simulation
  - Connection pool exhaustion testing
- Operations
  - Runbook for replica issues
  - Automated health checks
  - Circuit breakers configured
  - Gradual rollout plan
- Performance
  - Replica-specific indexes created
  - Connection pools tuned
  - Query routing optimized
  - Region-aware routing (if needed)
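For failover testing, a minimal RSpec sketch (a hypothetical spec, assuming the ReplicaCircuitBreaker from earlier and a test cache store): force the replica path to raise and assert the read still comes back from the primary.

# spec/services/replica_circuit_breaker_spec.rb (hypothetical)
require "rails_helper"

RSpec.describe ReplicaCircuitBreaker do
  it "serves the read from the primary when the replica connection raises" do
    allow(ApplicationRecord).to receive(:connected_to)
      .with(role: :reading)
      .and_raise(ActiveRecord::ConnectionNotEstablished)

    # The block is re-run outside connected_to, i.e. against the primary.
    expect(described_class.call { 42 }).to eq(42)
  end
end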
Further Reading
For production deployment best practices:
- PostgreSQL Documentation
  - Monitoring Database Activity - Understanding pg_stat views
  - High Availability, Load Balancing, and Replication - Production deployment patterns
  - Connection Pooling - Tuning connection parameters
- AWS/Cloud Resources
  - Amazon RDS Read Replicas - If using RDS
  - Working with PostgreSQL Read Replicas - RDS PostgreSQL specific guide
Key Takeaways
Successfully running read replicas in production requires:
- Gradual adoption - Start with shadow mode, increase traffic slowly
- Comprehensive monitoring - Track everything from replication lag to query patterns
- Automatic recovery - Build systems that heal themselves
- Performance focus - Optimize specifically for replica characteristics
Read replicas are powerful but complex. Start simple, measure everything, and build sophistication as your needs grow.