PostgreSQL Backup Migration: pg-backup-scripts → pgBackRest¶

Overview¶

This document covers the migration from custom pg-backup-scripts to industry-standard pgBackRest for PostgreSQL backups on the forge server.

Migration Date: 2024-01 Status: ✅ Complete

What Changed¶

Removed Components¶

❌ pkgs/pg-backup-scripts.nix (161 lines of custom backup scripts)
❌ systemd.services.pg-zfs-snapshot (coordinated PostgreSQL + ZFS snapshots)
❌ systemd.services.pg-check-stale-backup-label (backup_label verification)
❌ modules/nixos/services/postgresql/backup-integration.nix (Restic integration module)
❌ modules.backup.restic.jobs.postgresql-offsite (complex snapshot mounting/unmounting logic)

Added Components¶

✅ pkgs.pgbackrest (industry-standard PostgreSQL backup tool)
✅ /etc/pgbackrest.conf (45-line configuration file)
✅ systemd.services.pgbackrest-stanza-create (one-time initialization)
✅ systemd.services.pgbackrest-full-backup (daily full backups)
✅ systemd.services.pgbackrest-incr-backup (hourly incremental backups)
✅ systemd.services.pgbackrest-diff-backup (6-hourly differential backups)
✅ systemd.timers.pgbackrest-* (three timers with randomized delays)
✅ systemd.services.pgbackrest-metrics (Prometheus monitoring)
✅ PostgreSQL archive_command using pgBackRest

Simplified Components¶

✨ Sanoid configuration: Removed autosnap = false, now uses standard wal-frequent template
✨ ZFS datasets: No longer require custom snapshot coordination

Architecture¶

Old System (pg-backup-scripts)¶

PostgreSQL → pg_start_backup()
          → ZFS snapshot (coordinated)
          → pg_stop_backup()
          → Mount snapshot read-only
          → Restic backup from snapshot
          → Unmount snapshot
          → Upload to R2

New System (pgBackRest)¶

PostgreSQL → WAL archiving (continuous via archive_command)
          ↓
          → pgBackRest stanza "main"
             ├─ repo1: /mnt/nas-backup/pgbackrest (local NAS)
             ├─ repo2: s3://R2/nix-homelab-prod-servers (offsite)
             ├─ Full backup: Daily 2 AM
             ├─ Incremental backup: Hourly
             └─ Differential backup: Every 6 hours

Configuration¶

pgBackRest Config (`/etc/pgbackrest.conf`)¶

[global]
repo1-path=/mnt/nas-backup/pgbackrest
repo1-retention-full=7
repo1-retention-diff=4

repo2-type=s3
repo2-s3-bucket=nix-homelab-prod-servers
repo2-s3-region=auto
repo2-s3-endpoint=21ee32956d11b5baf662d186bd0b4ab4.r2.cloudflarestorage.com
repo2-s3-key-type=shared
repo2-path=/forge/pgbackrest
repo2-retention-full=30
repo2-retention-diff=14

process-max=2
compress-type=lz4
compress-level=3
delta=y

[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432

PostgreSQL Settings¶

services.postgresql.settings = {
  archive_mode = "on";
  archive_command = "${pkgs.pgbackrest}/bin/pgbackrest --stanza=main archive-push %p";
  archive_timeout = "300";  # 5-minute RPO
};

Backup Schedules¶

Backup Type	Schedule	Retention (Local)	Retention (R2)
Full	Daily @ 2 AM	7 backups	30 backups
Incremental	Hourly	N/A (depends on full)	N/A
Differential	Every 6 hours	4 backups	14 backups

Operations Guide¶

Check Backup Status¶

# View all backups
sudo pgbackrest --stanza=main info

# View detailed JSON output
sudo pgbackrest --stanza=main --output=json info

# Check specific repository
sudo pgbackrest --stanza=main --repo=1 info  # Local NAS
sudo pgbackrest --stanza=main --repo=2 info  # R2 offsite

Manual Backup Operations¶

# Trigger full backup (both repos)
sudo systemctl start pgbackrest-full-backup.service

# Trigger incremental backup
sudo systemctl start pgbackrest-incr-backup.service

# Trigger differential backup
sudo systemctl start pgbackrest-diff-backup.service

# Manual backup to specific repo
sudo pgbackrest --stanza=main --repo=1 --type=full backup

Monitor Backup Jobs¶

# Check service status
systemctl status pgbackrest-full-backup.timer
systemctl status pgbackrest-incr-backup.timer
systemctl status pgbackrest-diff-backup.timer

# View recent backup logs
journalctl -u pgbackrest-full-backup.service -n 50
journalctl -u pgbackrest-incr-backup.service -n 50
journalctl -u pgbackrest-diff-backup.service -n 50

# Check Prometheus metrics
cat /var/lib/node_exporter/textfile_collector/pgbackrest.prom

Disaster Recovery¶

Scenario 1: Restore from Local Backup (Fastest)¶

Recovery Time Objective (RTO): ~10-30 minutes

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Backup current state (safety)
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old

# 3. Restore latest backup from local NAS
sudo pgbackrest --stanza=main --repo=1 restore

# 4. Start PostgreSQL (auto-recovers to latest WAL)
sudo systemctl start postgresql

# 5. Verify database
sudo -u postgres psql -c "SELECT now();"

Scenario 2: Point-in-Time Recovery (PITR)¶

Use Case: Recover to specific timestamp (e.g., before accidental DELETE)

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Backup current state
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old

# 3. Restore base backup
sudo pgbackrest --stanza=main --repo=1 restore

# 4. Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal

# 5. Configure recovery target in postgresql.conf
cat >> /var/lib/postgresql/16/main/postgresql.auto.conf <<EOF
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'promote'
EOF

# 6. Start PostgreSQL and monitor recovery
sudo systemctl start postgresql
sudo journalctl -u postgresql -f

# 7. Verify recovered data
sudo -u postgres psql -c "SELECT * FROM my_table WHERE id = 123;"

# 8. Clean up recovery config
sudo -u postgres psql -c "ALTER SYSTEM RESET restore_command;"
sudo systemctl restart postgresql

Scenario 3: Restore from R2 Offsite (Disaster Recovery)¶

Use Case: Complete NAS failure, restore from cloud

# 1. Ensure R2 credentials are available
cat /run/secrets/restic/r2-prod-env

# 2. Stop PostgreSQL
sudo systemctl stop postgresql

# 3. Backup current state
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old

# 4. Restore from R2 (repo2)
sudo pgbackrest --stanza=main --repo=2 restore

# 5. Start PostgreSQL
sudo systemctl start postgresql

# 6. Verify
sudo -u postgres psql -c "SELECT now();"

Scenario 4: Restore Specific Backup Set¶

# 1. List available backup sets
sudo pgbackrest --stanza=main info

# Output example:
# stanza: main
#     status: ok
#     cipher: none
#
#     db (current)
#         wal archive min/max (16): 00000001000000000000000A/00000001000000000000001F
#
#         full backup: 20240115-020015F
#             timestamp start/stop: 2024-01-15 02:00:15+00 / 2024-01-15 02:05:23+00
#             wal start/stop: 00000001000000000000000A / 00000001000000000000000C
#             database size: 1.2GB, database backup size: 1.2GB
#             repo1: backup set size: 456MB, backup size: 456MB

# 2. Restore specific backup set
sudo systemctl stop postgresql
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old
sudo pgbackrest --stanza=main --repo=1 --set=20240115-020015F restore
sudo systemctl start postgresql

Monitoring¶

Prometheus Metrics¶

Metrics exported to /var/lib/node_exporter/textfile_collector/pgbackrest.prom:

# Backup timestamps (Unix epoch)
pgbackrest_last_full_backup_timestamp{stanza="main",repo="repo1",hostname="forge"}
pgbackrest_last_incr_backup_timestamp{stanza="main",repo="repo1",hostname="forge"}
pgbackrest_last_diff_backup_timestamp{stanza="main",repo="repo1",hostname="forge"}

# Backup counts
pgbackrest_backup_count{stanza="main",repo="repo1",hostname="forge"}

# Repository sizes (bytes)
pgbackrest_repo_size_bytes{stanza="main",repo="repo1",hostname="forge"}
pgbackrest_repo_backup_size_bytes{stanza="main",repo="repo1",hostname="forge"}

Alerting Rules¶

Create Prometheus alerts for:

# Alert if no full backup in 25 hours
- alert: PostgreSQLBackupStale
  expr: time() - pgbackrest_last_full_backup_timestamp > 90000
  labels:
    severity: warning
  annotations:
    summary: "PostgreSQL full backup is stale (>25 hours)"

# Alert if no incremental backup in 2 hours
- alert: PostgreSQLIncrementalBackupStale
  expr: time() - pgbackrest_last_incr_backup_timestamp > 7200
  labels:
    severity: warning
  annotations:
    summary: "PostgreSQL incremental backup is stale (>2 hours)"

# Alert if backup repository size is concerning
- alert: PostgreSQLBackupRepoLarge
  expr: pgbackrest_repo_size_bytes > 100e9  # 100GB
  labels:
    severity: info
  annotations:
    summary: "PostgreSQL backup repository exceeds 100GB"

ZFS Snapshots Integration¶

Role of ZFS Snapshots¶

With pgBackRest handling backups, ZFS snapshots now serve a complementary role:

Fast Local Rollback: Restore in <1 minute vs 10-30 minutes for pgBackRest
Quick Recovery from Recent Issues: Roll back 5-60 minutes without WAL replay
Not PITR-Compliant: Snapshots are crash-consistent, not application-consistent
Use Case: "Oh no, I just dropped that table 2 minutes ago!"

Sanoid Configuration¶

services.sanoid.datasets."tank/services/postgresql/main" = {
  use_template = [ "prod" ];
  recursive = false;
};

services.sanoid.datasets."tank/services/postgresql/main-wal" = {
  use_template = [ "wal-frequent" ];  # 12x5min, 48h hourly, 7d daily
  recursive = false;
};

Quick ZFS Rollback¶

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. List recent snapshots
sudo zfs list -t snapshot tank/services/postgresql/main | tail -n 5

# 3. Rollback to snapshot (DESTROYS NEWER SNAPSHOTS!)
sudo zfs rollback -r tank/services/postgresql/main@autosnap_2024-01-15_14:30:00_hourly

# 4. Start PostgreSQL
sudo systemctl start postgresql

# 5. Verify
sudo -u postgres psql -c "SELECT now();"

⚠️ Warning: zfs rollback destroys all snapshots newer than the target. For point-in-time recovery to arbitrary timestamps, use pgBackRest PITR instead.

Troubleshooting¶

Issue: WAL Archiving Failures¶

Symptoms:

journalctl -u postgresql | grep "archive command failed"

Diagnosis:

# Check pgbackrest can access repositories
sudo -u postgres pgbackrest --stanza=main check

# Manually test archive command
sudo -u postgres pgbackrest --stanza=main archive-push /var/lib/postgresql/16/main/pg_wal/000000010000000000000001

Solutions: - Ensure /mnt/nas-backup is mounted - Verify R2 credentials at /run/secrets/restic/r2-prod-env - Check disk space: df -h /mnt/nas-backup

Issue: Backup Job Failures¶

Symptoms:

systemctl status pgbackrest-full-backup.service
# Output: "failed"

Diagnosis:

# View detailed logs
journalctl -u pgbackrest-full-backup.service -n 100

# Check stanza status
sudo pgbackrest --stanza=main check

Common Causes: 1. Stanza not created: Run sudo systemctl start pgbackrest-stanza-create.service 2. Repository permissions: Check ownership of /mnt/nas-backup/pgbackrest 3. R2 credentials: Verify SOPS secret is decrypted 4. PostgreSQL not running: systemctl status postgresql

Issue: Slow Restore Performance¶

Symptoms: Restore taking >1 hour for small databases

Optimizations:

# Use more processes for restore
sudo pgbackrest --stanza=main --repo=1 --process-max=4 restore

# Use delta restore (only changed blocks)
sudo pgbackrest --stanza=main --repo=1 --delta restore

# Restore from local repo (faster than R2)
sudo pgbackrest --stanza=main --repo=1 restore

Issue: Metrics Not Updating¶

Symptoms: Prometheus metrics stale or missing

Diagnosis:

# Check metrics service
systemctl status pgbackrest-metrics.service
systemctl status pgbackrest-metrics.timer

# View metrics file
cat /var/lib/node_exporter/textfile_collector/pgbackrest.prom

# Check for errors
journalctl -u pgbackrest-metrics.service -n 50

Solutions: - Ensure node_exporter textfile collector is enabled - Verify directory permissions: ls -ld /var/lib/node_exporter/textfile_collector/ - Manually trigger metrics collection: sudo systemctl start pgbackrest-metrics.service

Performance Tuning¶

Backup Performance¶

# In /etc/pgbackrest.conf
process-max=4           # More parallel processes (default: 1)
compress-level=1        # Faster compression (default: 3)
protocol-timeout=1200   # Longer timeout for slow networks

Restore Performance¶

# Fast restore with parallel processes
sudo pgbackrest --stanza=main --repo=1 --process-max=4 --delta restore

Network Optimization¶

# For R2 offsite backups
repo2-storage-verify-tls=n  # Skip TLS verification (faster, less secure)
repo2-bundle=y              # Bundle small files (fewer S3 requests)

Migration Notes¶

What Was NOT Migrated¶

✅ ZFS Snapshots: Still active via Sanoid (fast local rollback)
✅ Sanoid/Syncoid: Still replicating to nas-1 (local DR)
✅ Node Exporter: Still collecting PostgreSQL metrics

Breaking Changes¶

Restic Commands No Longer Work:

# OLD (broken)
restic -r /mnt/nas-backup snapshots --tag postgresql

# NEW
pgbackrest --stanza=main info

Backup Paths Changed:

# OLD
/mnt/nas-backup/<restic-snapshot-id>/var/lib/postgresql/16/main

# NEW
/mnt/nas-backup/pgbackrest/backup/main/<backup-set-id>

Systemd Service Names Changed:

# OLD
systemctl status restic-backups-postgresql-main-base.service

# NEW
systemctl status pgbackrest-full-backup.service

Rollback Plan¶

If pgBackRest fails catastrophically, the old system can be restored:

# 1. Revert Git commits
git log --oneline | grep -i pgbackrest  # Find migration commits
git revert <commit-hash>

# 2. Rebuild
nixos-rebuild switch

# 3. Verify pg-backup-scripts are active
systemctl status pg-zfs-snapshot.service

However, this is unlikely to be necessary because: - pgBackRest is battle-tested and widely used - We tested restore procedures during migration - ZFS snapshots provide immediate fallback for recent data

References¶

Summary¶

Benefits Achieved¶

Reduced Complexity: 161 lines of custom scripts → 45-line config file
Industry Standard: Using battle-tested pgBackRest vs custom solution
Better Features: Incremental, differential, parallel backups, delta restore
Simpler Operations: Single pgbackrest command vs complex Restic workflows
Better Monitoring: Built-in metrics via pgbackrest info --output=json

Trade-offs¶

New Tool: Team must learn pgBackRest (but it's well-documented)
Less ZFS Integration: pgBackRest doesn't coordinate with ZFS snapshots (but doesn't need to)

Conclusion¶

The migration to pgBackRest successfully replaced a complex custom backup system with an industry-standard tool while maintaining all PITR capabilities and simplifying operations. The result is a more maintainable, well-supported backup solution that follows the YAGNI principle.

PostgreSQL Backup Migration: pg-backup-scripts → pgBackRest¶

Overview¶

What Changed¶

Removed Components¶

Added Components¶

Simplified Components¶

Architecture¶

Old System (pg-backup-scripts)¶

New System (pgBackRest)¶

Configuration¶

pgBackRest Config (/etc/pgbackrest.conf)¶

PostgreSQL Settings¶

Backup Schedules¶

Operations Guide¶

Check Backup Status¶

Manual Backup Operations¶

Monitor Backup Jobs¶

Disaster Recovery¶

Scenario 1: Restore from Local Backup (Fastest)¶

Scenario 2: Point-in-Time Recovery (PITR)¶

Scenario 3: Restore from R2 Offsite (Disaster Recovery)¶

Scenario 4: Restore Specific Backup Set¶

Monitoring¶

Prometheus Metrics¶

Alerting Rules¶

ZFS Snapshots Integration¶

Role of ZFS Snapshots¶

Sanoid Configuration¶

Quick ZFS Rollback¶

Troubleshooting¶

Issue: WAL Archiving Failures¶

Issue: Backup Job Failures¶

Issue: Slow Restore Performance¶

Issue: Metrics Not Updating¶

Performance Tuning¶

Backup Performance¶

Restore Performance¶

Network Optimization¶

Migration Notes¶

What Was NOT Migrated¶

Breaking Changes¶

Rollback Plan¶

References¶

Summary¶

Benefits Achieved¶

Trade-offs¶

Conclusion¶

pgBackRest Config (`/etc/pgbackrest.conf`)¶