Scaling Rails with PostgreSQL Read Replicas: Part 3 - Production Excellence
In Part 1 and Part 2, we covered setup and advanced patterns. Now let’s focus on what happens when you actually deploy read replicas to production.
This part covers the operational excellence required for production systems: how to deploy without downtime, monitor effectively, handle failures gracefully, and optimize performance based on real-world usage. These are the hard-won lessons from running read replicas at scale.
What’s Covered
- Zero-Downtime Deployment Strategy
- Comprehensive Monitoring
- Performance Optimization
- Disaster Recovery
- Production Checklist
Zero-Downtime Deployment Strategy
Adding read replicas to an existing production application requires careful planning. Here’s a battle-tested approach:
Step 1: Start with 0% Traffic
Use the gradual rollout strategy from Part 2, starting with 0% traffic to replicas (shadow mode):
# app/services/replica_rollout.rb
class ReplicaRollout
  ROLLOUT_PERCENTAGES = {
    analytics_queries: 0, # Start at 0%
    search_queries: 0,    # Start at 0%
    user_profiles: 0,     # Start at 0%
    default: 0            # Everything uses primary
  }.freeze

  # ... rest of implementation from Part 2
end
During this phase:
- Deploy replica infrastructure - Ensure replicas are receiving data
- Monitor replication lag - Verify replicas stay in sync
- Test with read-only users - Have internal team members manually test
- Collect baseline metrics - CPU, memory, query performance on replicas
# app/jobs/replica_health_validator_job.rb
class ReplicaHealthValidatorJob < ApplicationJob
  def perform
    # Check all replicas are healthy before increasing traffic
    replicas = [:primary_replica, :primary_replica_2]

    health_checks = replicas.map do |replica|
      {
        name: replica,
        lag: check_replication_lag(replica),
        connections: check_connection_count(replica),
        query_success: test_query(replica)
      }
    end

    if health_checks.all? { |check| check[:lag] < 5.seconds && check[:query_success] }
      Rails.logger.info "All replicas healthy, ready for traffic"
      StatsD.event("replicas.ready_for_traffic", "All health checks passed")
    else
      AlertService.notify("Replicas not ready", health_checks)
    end
  end

  private

  def check_replication_lag(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      result = ApplicationRecord.connection.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag"
      ).first
      result['lag'].to_f.seconds
    end
  rescue => e
    Rails.logger.error("Lag check failed for #{replica}: #{e.message}")
    Float::INFINITY
  end

  def check_connection_count(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      ApplicationRecord.connection.execute(
        "SELECT COUNT(*) AS count FROM pg_stat_activity WHERE pid <> pg_backend_pid()"
      ).first['count'].to_i
    end
  rescue
    -1
  end

  def test_query(replica)
    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      ApplicationRecord.connection.execute("SELECT 1")
      true
    end
  rescue
    false
  end
end
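The validator only helps if it runs on a cadence during the rollout window. A minimal sketch, assuming Sidekiq with the sidekiq-cron gem (the initializer name is made up; swap in whatever scheduler you already run):

# config/initializers/replica_health_schedule.rb (hypothetical file)
# Enqueue the validator every five minutes in production via sidekiq-cron.
if Rails.env.production? && defined?(Sidekiq::Cron::Job)
  Sidekiq::Cron::Job.create(
    name: "replica_health_validation_every_5_min",
    cron: "*/5 * * * *",
    class: "ReplicaHealthValidatorJob"
  )
end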
Step 2: Gradual Traffic Migration
Once shadow mode shows replicas are healthy, gradually migrate traffic:
# app/services/replica_traffic_manager.rb
class ReplicaTrafficManager
  MIGRATION_SCHEDULE = [
    { percentage: 0, duration: 1.day },    # Shadow mode
    { percentage: 5, duration: 1.hour },   # 5% of reads
    { percentage: 10, duration: 2.hours }, # Monitor closely
    { percentage: 25, duration: 4.hours },
    { percentage: 50, duration: 1.day },
    { percentage: 75, duration: 1.day },
    { percentage: 100, duration: nil }     # Full migration
  ].freeze

  def self.current_percentage
    migration_start = Rails.cache.read('replica_migration_start') || Time.current
    elapsed = Time.current - migration_start

    MIGRATION_SCHEDULE.each do |phase|
      return phase[:percentage] if phase[:duration].nil? || elapsed < phase[:duration]

      elapsed -= phase[:duration]
    end

    100 # Full migration complete
  end

  def self.should_use_replica?(feature: :default)
    percentage = current_percentage

    # Critical features migrate last
    percentage = [percentage - 20, 0].max if [:payments, :auth].include?(feature)

    rand(100) < percentage
  end

  def self.with_traffic_management(feature: :default, &block)
    if should_use_replica?(feature: feature) && ReplicaHealthCheck.healthy?
      ApplicationRecord.connected_to(role: :reading, &block)
    else
      yield
    end
  rescue => e
    # Always fallback to primary on errors
    ErrorTracker.track(e, context: { feature: feature })
    yield
  end
end
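As a usage sketch (the controller and query here are illustrative, not part of the earlier setup), the feature label ties each read to its rollout bucket:

# Hypothetical call site for the traffic manager.
class SearchController < ApplicationController
  def index
    @results = ReplicaTrafficManager.with_traffic_management(feature: :search_queries) do
      # .to_a forces the query to run while still inside the chosen connection block.
      Product.search(params[:q]).limit(50).to_a
    end
  end
end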
Step 3: Automated Rollback
Build automatic rollback when issues arise:
# app/services/replica_circuit_breaker.rb
class ReplicaCircuitBreaker
  FAILURE_THRESHOLD = 5
  TIMEOUT_THRESHOLD = 10.seconds
  RECOVERY_TIMEOUT = 1.minute

  class << self
    def call(&block)
      return yield if circuit_open?

      execute_with_breaker(&block)
    end

    private

    def execute_with_breaker(&block)
      start_time = Time.current
      result = ApplicationRecord.connected_to(role: :reading, &block)

      # Reset failures on success
      reset_failure_count
      result
    rescue => e
      record_failure(e, Time.current - start_time)

      # Fallback to primary
      yield
    end

    def circuit_open?
      return false unless last_failure_time

      # Check if we're still in recovery period
      Time.current - last_failure_time < RECOVERY_TIMEOUT
    end

    def record_failure(error, duration)
      increment_failure_count

      if failure_count >= FAILURE_THRESHOLD || duration > TIMEOUT_THRESHOLD
        open_circuit
        AlertService.critical(
          "Replica circuit breaker opened",
          error: error.message,
          failure_count: failure_count,
          duration: duration
        )
      end
    end

    def open_circuit
      Rails.cache.write('replica:circuit:open', true, expires_in: RECOVERY_TIMEOUT)
      Rails.cache.write('replica:circuit:opened_at', Time.current)
    end

    def failure_count
      Rails.cache.read('replica:circuit:failures').to_i
    end

    def increment_failure_count
      Rails.cache.increment('replica:circuit:failures', 1, expires_in: 5.minutes)
    end

    def reset_failure_count
      Rails.cache.delete('replica:circuit:failures')
    end

    def last_failure_time
      Rails.cache.read('replica:circuit:opened_at')
    end
  end
end
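A quick usage sketch (the Order query is illustrative): any read that can tolerate a primary fallback goes through the breaker instead of calling connected_to directly.

# Hypothetical call site for the circuit breaker.
recent_orders = ReplicaCircuitBreaker.call do
  # .to_a forces execution while still inside the selected connection block.
  Order.where(created_at: 1.hour.ago..).to_a
end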
Comprehensive Monitoring
Production read replicas need monitoring at multiple levels. Most teams already use APM tools like New Relic, Datadog, or AppSignal; leverage these instead of building custom monitoring.
APM Integration
Modern APM tools automatically track database metrics, but you need to ensure they distinguish between primary and replica queries:
# config/initializers/datadog.rb (if using DataDog)
Datadog.configure do |c|
  c.tracing.instrument :active_record, service_name: 'postgres' do |config|
    # Tag queries by database role
    config.on_query do |span, event|
      connection = event.payload[:connection]
      role = connection.pool.db_config.configuration_hash[:replica] ? 'replica' : 'primary'

      span.set_tag('db.role', role)
      span.set_tag('db.connection_name', connection.pool.db_config.name)
    end
  end
end

# For New Relic
# config/newrelic.yml
# Enable database query analysis to see replica vs primary distribution
Cloud Provider Monitoring
If using managed databases, leverage their built-in monitoring:
AWS RDS:
# CloudWatch already tracks these for RDS read replicas:
#   - ReplicaLag
#   - ReadIOPS / WriteIOPS
#   - DatabaseConnections
#   - CPUUtilization per replica

# Just add CloudWatch alarms:
#   Alarm: ReplicaLag > 5 (the metric is reported in seconds)
#   Alarm: DatabaseConnections > 80% of max_connections
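If you manage alarms from Ruby rather than the console or Terraform, a minimal sketch with the aws-sdk-cloudwatch gem (the region, instance identifier, and SNS topic ARN are placeholders):

# Hypothetical one-off script or rake task that creates the ReplicaLag alarm.
require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")
cloudwatch.put_metric_alarm(
  alarm_name: "rds-replica-lag-high",
  namespace: "AWS/RDS",
  metric_name: "ReplicaLag",
  dimensions: [{ name: "DBInstanceIdentifier", value: "my-app-replica-1" }],
  statistic: "Average",
  period: 60,              # seconds per datapoint
  evaluation_periods: 5,   # five consecutive minutes over the threshold
  threshold: 5,            # ReplicaLag is reported in seconds
  comparison_operator: "GreaterThanThreshold",
  alarm_actions: ["arn:aws:sns:us-east-1:123456789012:db-alerts"]
)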
Google Cloud SQL / Azure Database: Similar built-in metrics available through their monitoring services.
Custom Metrics for Business Logic
While APM tools handle infrastructure metrics, you still need application-specific monitoring:
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  around_action :track_replica_usage

  private

  def track_replica_usage
    replica_used = false

    # Hook into ActiveRecord to detect replica usage
    subscriber = ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      connection = event.payload[:connection]

      if connection&.pool&.db_config&.configuration_hash&.dig(:replica)
        replica_used = true
      end
    end

    yield

    # Send to your APM
    if defined?(Datadog)
      Datadog::Tracing.active_trace&.set_tag('replica.used', replica_used)
    elsif defined?(NewRelic)
      NewRelic::Agent.add_custom_attributes(replica_used: replica_used)
    end
  ensure
    ActiveSupport::Notifications.unsubscribe(subscriber)
  end
end
Key Metrics to Monitor
Configure your APM dashboard to track:
- Infrastructure (from CloudWatch/APM):
  - Replication lag per replica
  - Connection count by database role
  - Query response time P50/P95/P99 by role
  - Error rate by database
- Application (from APM custom metrics):
  - % of requests using replicas
  - Cache hit rate (higher replica usage should increase this)
  - Replica fallback rate (when replicas fail)
- Business Impact:
  - Page load time before/after replica adoption
  - API response times by endpoint
  - Background job duration changes
# app/services/replica_monitor.rb
class ReplicaMonitor
  class << self
    def check_all
      {
        replication_lag: check_replication_lag,
        connection_stats: check_connections,
        query_performance: check_query_performance,
        disk_usage: check_disk_usage,
        long_running_queries: check_long_queries
      }
    end

    private

    def check_replication_lag
      # Select the replica configurations for the current environment
      replica_configs = ApplicationRecord.configurations
                                         .configs_for(env_name: Rails.env, include_hidden: true)
                                         .select(&:replica?)

      replica_configs.map do |config|
        ApplicationRecord.connected_to(role: :reading, shard: config.name.to_sym) do
          lag_query = <<-SQL
            SELECT
              pg_is_in_recovery() as is_replica,
              pg_last_wal_receive_lsn() as receive_lsn,
              pg_last_wal_replay_lsn() as replay_lsn,
              EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag_seconds,
              pg_last_xact_replay_timestamp() as last_replay_time
          SQL

          result = ApplicationRecord.connection.exec_query(lag_query).first

          {
            name: config.name,
            lag_seconds: result['lag_seconds']&.to_f || 0,
            lag_bytes: calculate_lag_bytes(result['receive_lsn'], result['replay_lsn']),
            last_replay: result['last_replay_time'],
            healthy: (result['lag_seconds']&.to_f || 0) < 5.0
          }
        end
      end
    end

    def check_connections
      ApplicationRecord.connected_to(role: :reading) do
        stats = ApplicationRecord.connection.exec_query(<<-SQL).first
          SELECT
            COUNT(*) FILTER (WHERE state = 'active') as active_connections,
            COUNT(*) FILTER (WHERE state = 'idle') as idle_connections,
            COUNT(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
            COUNT(*) as total_connections,
            MAX(EXTRACT(EPOCH FROM (now() - state_change))) as longest_connection_seconds
          FROM pg_stat_activity
          WHERE pid <> pg_backend_pid()
        SQL

        {
          active: stats['active_connections'],
          idle: stats['idle_connections'],
          idle_in_transaction: stats['idle_in_transaction'],
          total: stats['total_connections'],
          longest_duration: stats['longest_connection_seconds']&.to_f,
          pool_exhaustion_risk: stats['total_connections'] > 80
        }
      end
    end

    def check_query_performance
      ApplicationRecord.connected_to(role: :reading) do
        # Requires the pg_stat_statements extension.
        # On PostgreSQL 13+ the columns are total_exec_time / mean_exec_time / etc.
        slow_queries = ApplicationRecord.connection.exec_query(<<-SQL)
          SELECT
            query,
            calls,
            total_time,
            mean_time,
            max_time,
            stddev_time
          FROM pg_stat_statements
          WHERE query NOT LIKE '%pg_stat_statements%'
          ORDER BY mean_time DESC
          LIMIT 10
        SQL

        slow_queries.map(&:to_h)
      end
    end

    def check_disk_usage
      ApplicationRecord.connected_to(role: :reading) do
        ApplicationRecord.connection.exec_query(
          "SELECT pg_size_pretty(pg_database_size(current_database())) as size"
        ).first['size']
      end
    end

    def check_long_queries
      ApplicationRecord.connected_to(role: :reading) do
        ApplicationRecord.connection.exec_query(<<-SQL).map(&:to_h)
          SELECT pid, state, now() - query_start AS duration, query
          FROM pg_stat_activity
          WHERE state = 'active'
            AND now() - query_start > interval '30 seconds'
            AND pid <> pg_backend_pid()
          ORDER BY duration DESC
        SQL
      end
    end

    def calculate_lag_bytes(receive_lsn, replay_lsn)
      return 0 unless receive_lsn && replay_lsn

      # PostgreSQL LSN format: XXXXXXXX/YYYYYYYY
      receive_bytes = lsn_to_bytes(receive_lsn)
      replay_bytes = lsn_to_bytes(replay_lsn)

      receive_bytes - replay_bytes
    end

    def lsn_to_bytes(lsn)
      high, low = lsn.split('/').map { |x| x.to_i(16) }
      (high << 32) + low
    end
  end
end
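To get these numbers somewhere useful, one option is a small recurring job that pushes them to StatsD next to the APM metrics above. A sketch (the job name and metric name are made up here):

# app/jobs/replica_monitor_job.rb (hypothetical wiring)
class ReplicaMonitorJob < ApplicationJob
  queue_as :monitoring

  def perform
    # Report lag per replica as a gauge; dashboards and alerts hang off this.
    ReplicaMonitor.check_all[:replication_lag].each do |replica|
      StatsD.gauge(
        "db.replica.lag_seconds",
        replica[:lag_seconds],
        tags: ["replica:#{replica[:name]}"]
      )
    end
  end
end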
Application-Level Monitoring
Track how your application uses replicas:
# config/initializers/replica_instrumentation.rb
def extract_operation(sql)
  case sql
  when /^SELECT/i then 'select'
  when /^INSERT/i then 'insert'
  when /^UPDATE/i then 'update'
  when /^DELETE/i then 'delete'
  else 'other'
  end
end

ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  # Determine which connection was used
  connection_name = event.payload[:connection]&.pool&.db_config&.name || 'unknown'
  role = event.payload[:connection]&.pool&.db_config&.configuration_hash&.dig(:replica) ? 'replica' : 'primary'

  # Track metrics
  tags = {
    role: role,
    connection: connection_name,
    operation: extract_operation(event.payload[:sql])
  }

  StatsD.histogram('db.query.duration', event.duration, tags: tags)
  StatsD.increment('db.query.count', tags: tags)

  # Alert on slow replica queries
  if role == 'replica' && event.duration > 1000
    Rails.logger.warn(
      "Slow replica query (#{event.duration.round}ms) on #{connection_name}: " \
      "#{event.payload[:sql].truncate(200)}"
    )
  end
end
Performance Optimization
1. Query Optimization for Replicas
Some queries perform differently on replicas:
# app/models/concerns/replica_optimized.rb
module ReplicaOptimized
  extend ActiveSupport::Concern

  class_methods do
    def replica_optimized_scope(name, &block)
      # Define two versions of the scope
      scope :"#{name}_primary", block
      scope :"#{name}_replica", block

      # Smart scope that picks the right version
      define_singleton_method(name) do |*args|
        if ApplicationRecord.current_role == :reading
          send("#{name}_replica", *args)
        else
          send("#{name}_primary", *args)
        end
      end
    end
  end
end

class Product < ApplicationRecord
  include ReplicaOptimized

  # Regular scope for primary
  replica_optimized_scope :search do |term|
    where("name ILIKE ?", "%#{term}%")
  end

  # Override for replica with better performance
  def self.search_replica(term)
    # Use full-text search on replica (assuming GIN index)
    where("search_vector @@ plainto_tsquery('english', ?)", term)
  end

  # Materialized view only on replica
  def self.top_selling_replica
    # This materialized view is refreshed hourly on replicas
    connection.exec_query(
      "SELECT * FROM top_selling_products_mv LIMIT 100"
    )
  end
end
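The payoff is that call sites stay identical; only the connection role decides which implementation runs. A small illustrative sketch:

# Same call, different query depending on the current role.
Product.search("coffee")                    # primary: ILIKE scan
ApplicationRecord.connected_to(role: :reading) do
  Product.search("coffee").limit(20).to_a   # replica: full-text search
end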
2. Connection Pool Tuning
Optimize pools based on actual usage:
# app/services/connection_pool_optimizer.rb
class ConnectionPoolOptimizer
  def self.tune_pools
    # Inspect every pool the handler knows about (primary and replicas)
    ApplicationRecord.connection_handler.connection_pool_list.each do |pool|
      config = pool.db_config
      stats = pool_statistics(pool)

      if stats[:wait_count] > 10
        # Pool is too small
        recommendation = (pool.size * 1.5).to_i
        Rails.logger.info(
          "Pool #{config.name} should be increased to #{recommendation} " \
          "(current size: #{pool.size}, waiting threads: #{stats[:wait_count]})"
        )
      elsif stats[:usage_ratio] < 0.3
        # Pool is too large
        recommendation = [(pool.size * 0.7).to_i, 5].max
        Rails.logger.info(
          "Pool #{config.name} could be reduced to #{recommendation} " \
          "(current size: #{pool.size}, usage ratio: #{stats[:usage_ratio].round(2)})"
        )
      end
    end
  end

  def self.pool_statistics(pool)
    # ConnectionPool#stat returns size/connections/busy/dead/idle/waiting counts
    stat = pool.stat

    {
      size: stat[:size],
      connections: stat[:connections],
      busy: stat[:busy],
      dead: stat[:dead],
      wait_count: stat[:waiting],
      usage_ratio: stat[:busy].to_f / stat[:size]
    }
  end
end

# app/jobs/connection_pool_monitor_job.rb
class ConnectionPoolMonitorJob < ApplicationJob
  queue_as :monitoring

  def perform
    ConnectionPoolOptimizer.tune_pools

    # Schedule next check
    self.class.set(wait: 5.minutes).perform_later
  end
end

# Start monitoring (in an initializer or deploy task)
ConnectionPoolMonitorJob.perform_later if Rails.env.production?
3. Multi-Region Optimization
For global applications, use region-aware routing:
# app/services/region_aware_router.rb
class RegionAwareRouter
  REGION_REPLICAS = {
    # Map your regions to database configurations
    # 'region_key' => :database_config_name
    # Example:
    # 'us-east' => :primary_replica_us_east,
    # 'eu' => :primary_replica_eu,
  }.freeze

  def self.nearest_replica(request)
    region = detect_region(request)
    REGION_REPLICAS[region] || :primary_replica # Fallback to default replica
  end

  def self.with_nearest_replica(request, &block)
    replica = nearest_replica(request)

    ApplicationRecord.connected_to(role: :reading, shard: replica) do
      yield
    end
  rescue => e
    # Critical: Always fallback to a working replica
    Rails.logger.warn("Region replica failed: #{replica}, error: #{e.message}")
    StatsD.increment('replica.region_fallback', tags: ["region:#{replica}"])

    # Fallback strategy - could be primary or another region
    ApplicationRecord.connected_to(role: :reading, &block)
  end

  private

  def self.detect_region(request)
    # Implement based on your infrastructure:
    #
    # Option 1: CDN/Proxy headers
    #   request.headers['YOUR-CDN-REGION-HEADER']
    # Option 2: Load balancer headers
    #   request.headers['X-AWS-REGION'] or custom headers
    # Option 3: GeoIP lookup (implement your preferred service)
    #   GeoIP.lookup(request.remote_ip).region_code
    # Option 4: User preference/account setting
    #   current_user&.preferred_region
    #
    # Implement your detection logic here
    # Return a key that matches REGION_REPLICAS
  end
end

# Usage pattern - apply selectively
class ApplicationController < ActionController::Base
  # Only use for read-heavy, latency-sensitive endpoints
  def with_regional_replica(&block)
    if should_use_regional_routing?
      RegionAwareRouter.with_nearest_replica(request, &block)
    else
      yield # Use default routing
    end
  end

  private

  def should_use_regional_routing?
    # Implement your logic:
    # - Feature flag controlled
    # - Only for certain controllers/actions
    # - Only for premium users
    # - etc.
  end
end

# Example usage in specific controllers
class ApiController < ApplicationController
  def index
    with_regional_replica do
      @data = YourModel.complex_read_query
    end
  end
end
Disaster Recovery
Automatic Failover
Handle replica failures gracefully:
# app/services/replica_failover_manager.rb
class ReplicaFailoverManager
  class << self
    def healthy_replicas
      @healthy_replicas ||= {}
    end

    def mark_unhealthy(replica_name, error)
      healthy_replicas[replica_name] = {
        healthy: false,
        error: error.message,
        failed_at: Time.current
      }

      AlertService.notify(
        "Replica marked unhealthy: #{replica_name}",
        error: error.message
      )
    end

    def check_replica_health(replica_name, force: false)
      # Skip replicas already marked unhealthy unless we're explicitly re-checking
      return false if !force && healthy_replicas[replica_name]&.dig(:healthy) == false

      ApplicationRecord.connected_to(role: :reading, shard: replica_name) do
        ApplicationRecord.connection.exec_query("SELECT 1")
        true
      end
    rescue => e
      mark_unhealthy(replica_name, e)
      false
    end

    def with_failover(&block)
      available_replicas = [:primary_replica, :primary_replica_2, :primary_replica_3]

      available_replicas.each do |replica|
        next unless check_replica_health(replica)

        begin
          return ApplicationRecord.connected_to(role: :reading, shard: replica, &block)
        rescue => e
          Rails.logger.error("Replica #{replica} failed during query: #{e.message}")
          mark_unhealthy(replica, e)
        end
      end

      # All replicas failed, use primary
      Rails.logger.warn("All replicas unhealthy, falling back to primary")
      yield
    end
  end

  # Background health checker job
  class HealthCheckerJob < ApplicationJob
    queue_as :monitoring

    def perform
      ReplicaFailoverManager.healthy_replicas.each do |replica_name, status|
        next if status[:healthy]

        # Try to recover unhealthy replicas (force a real check past the cached status)
        if Time.current - status[:failed_at] > 5.minutes
          if ReplicaFailoverManager.check_replica_health(replica_name, force: true)
            ReplicaFailoverManager.healthy_replicas[replica_name] = { healthy: true }
            AlertService.notify("Replica recovered: #{replica_name}")
          end
        end
      end

      # Schedule next check
      self.class.set(wait: 30.seconds).perform_later
    end
  end
end
Production Checklist
Before going live with read replicas:
- Monitoring
  - Replication lag alerts (< 5 seconds)
  - Connection pool monitoring
  - Query performance tracking
  - Error rate monitoring
- Testing
  - Load testing with realistic read/write ratios
  - Failover testing (see the sketch after this list)
  - Replication lag simulation
  - Connection pool exhaustion testing
- Operations
  - Runbook for replica issues
  - Automated health checks
  - Circuit breakers configured
  - Gradual rollout plan
- Performance
  - Replica-specific indexes created
  - Connection pools tuned
  - Query routing optimized
  - Region-aware routing (if needed)
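For failover testing, a minimal RSpec sketch (a hypothetical spec, assuming the ReplicaCircuitBreaker from earlier and a test cache store): force the replica path to raise and assert the read still comes back from the primary.

# spec/services/replica_circuit_breaker_spec.rb (hypothetical)
require "rails_helper"

RSpec.describe ReplicaCircuitBreaker do
  it "serves the read from the primary when the replica connection raises" do
    allow(ApplicationRecord).to receive(:connected_to)
      .with(role: :reading)
      .and_raise(ActiveRecord::ConnectionNotEstablished)

    # The block is re-run outside connected_to, i.e. against the primary.
    expect(described_class.call { 42 }).to eq(42)
  end
end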
Further Reading
For production deployment best practices:
- PostgreSQL Documentation
  - Monitoring Database Activity - Understanding pg_stat views
  - High Availability, Load Balancing, and Replication - Production deployment patterns
  - Connection Pooling - Tuning connection parameters
- AWS/Cloud Resources
  - Amazon RDS Read Replicas - If using RDS
  - Working with PostgreSQL Read Replicas - RDS PostgreSQL specific guide
Key Takeaways
Successfully running read replicas in production requires:
- Gradual adoption - Start with shadow mode, increase traffic slowly
- Comprehensive monitoring - Track everything from replication lag to query patterns
- Automatic recovery - Build systems that heal themselves
- Performance focus - Optimize specifically for replica characteristics
Read replicas are powerful but complex. Start simple, measure everything, and build sophistication as your needs grow.