PostgreSQL Disaster Recovery Flow - Root Cause Analysis & Fix¶
Issue Summary¶
The disaster recovery test failed because the .preseed-completed marker file persisted after PGDATA deletion, preventing the preseed service from running and causing PostgreSQL to create a fresh database instead of restoring from backup.
Root Cause: ZFS Dataset Layering¶
The Problem¶
Two conflicting dataset configurations created overlapping ZFS datasets at the same mountpoint:
- hosts/forge/default.nix (line 320): tank/services/postgres (old name)
- PostgreSQL module storage-integration.nix: tank/services/postgresql (new name)
Both datasets were mounted at /var/lib/postgresql/16/, creating a layered filesystem:
$ zfs list | grep postgres
tank/services/postgres 96K 797G 96K /var/lib/postgresql/16
tank/services/postgresql 23.8M 797G 23.8M /var/lib/postgresql/16
What Happened During DR Test¶
- 14:03 - Preseed service ran, found initialized PGDATA from nixos-bootstrap
- 14:03 - Created .preseed-completed marker in PGDATA
- 14:17 - Attempted to delete PGDATA with rm -rf /var/lib/postgresql/16/*
- 14:17 - Command only deleted files on top dataset (postgresql)
- 14:25 - Restarted services to test DR
- 14:25:34 - Preseed condition check: marker EXISTS on bottom dataset (postgres)
- 14:25:34 - Preseed service SKIPPED (condition not met)
- 14:25:35 - PostgreSQL's initdb ran → fresh database created ❌
The marker file survived because it was on the hidden bottom dataset that wasn't affected by the deletion command.
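The layering described above can be inspected from the kernel's mount table. The following is a minimal sketch, not part of the original tooling: stacked_mounts is a hypothetical helper name, and it assumes input in /proc/self/mounts format, where entries typically appear in mount order, so the last one printed is the visible (top) filesystem.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every source mounted at one mountpoint.
# Reads /proc/self/mounts-style lines on stdin:
#   "source target fstype options dump pass"
stacked_mounts() {
  local target="${1%/}"        # drop one trailing slash so "/x/" matches "/x"
  [ -n "$target" ] || target=/ # keep the root path usable
  awk -v t="$target" '$2 == t { print $1 }'
}

# Usage on the host (more than one line of output means the mounts are stacked):
#   stacked_mounts /var/lib/postgresql/16 < /proc/self/mounts
```

With the two datasets from this incident, the output would be expected to list both tank/services/postgres and tank/services/postgresql.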
Solutions Applied¶
Fix 1: Remove Duplicate Dataset Configuration¶
File: hosts/forge/default.nix
Removed the postgres dataset declaration from forge-specific config. The PostgreSQL module's storage-integration.nix is now the single source of truth.
Before:
services = {
  postgres = {
    recordsize = "8K";
    mountpoint = "/var/lib/postgresql/16";
    properties = { ... };
  };
}
After:
services = {
  # PostgreSQL dataset is now managed by the PostgreSQL module's storage-integration.nix
  # to avoid duplicate dataset creation and configuration conflicts.
  # See: modules/nixos/services/postgresql/storage-integration.nix
}
Fix 2: Merge Advanced ZFS Properties¶
File: modules/nixos/services/postgresql/storage-integration.nix
Added the advanced ZFS properties from forge's config to the module:
"postgresql" = {
  properties = {
    "com.sun:auto-snapshot" = "false";
    logbias = "throughput";
    primarycache = "metadata";
    redundant_metadata = "most";
    sync = "standard";
    # ... other properties
  };
};
Fix 3: Move Marker File Outside PGDATA¶
File: modules/nixos/postgresql-preseed.nix
Changed marker location from inside PGDATA to parent directory:
Before: /var/lib/postgresql/16/.preseed-completed (inside PGDATA)
After: /var/lib/postgresql/.preseed-completed-16 (parent directory)
This prevents the marker from being caught in dataset layering issues.
Changes:
- Added markerFile variable: /var/lib/postgresql/.preseed-completed-${version}
- Updated all marker creation calls to use new path
- Updated systemd condition to check new path
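The marker logic can be sketched in shell to make the intended behaviour concrete. This is an illustration under assumptions, not the module's code: the real markerFile is defined in Nix, the function names marker_path and preseed_should_run are hypothetical, and the sketch assumes the systemd condition lets preseed run only while the marker is absent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the marker path for a PostgreSQL major version.
# The base directory defaults to the parent of PGDATA described above.
marker_path() {
  local version="$1" base="${2:-/var/lib/postgresql}"
  printf '%s/.preseed-completed-%s\n' "$base" "$version"
}

# Hypothetical helper mirroring the assumed condition:
# succeed (preseed may run) only when the marker does not exist.
preseed_should_run() {
  [ ! -e "$(marker_path "$@")" ]
}

# Usage:
#   marker_path 16
#   preseed_should_run 16 && echo "preseed would run"
```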
Deployment Plan¶
Step 1: Clean Up Existing Datasets (On Forge)¶
# SSH to forge
ssh ryan@forge.holthome.net
# Stop PostgreSQL
sudo systemctl stop postgresql
# Check current datasets
zfs list | grep postgres
# Unmount both datasets
sudo zfs umount tank/services/postgresql
sudo zfs umount tank/services/postgres
# Destroy the old duplicate dataset (96K, nearly empty)
sudo zfs destroy tank/services/postgres
# Remount the correct dataset
sudo zfs mount tank/services/postgresql
sudo zfs set mountpoint=/var/lib/postgresql/16 tank/services/postgresql
# Remove old marker if it exists
sudo rm -f /var/lib/postgresql/16/.preseed-completed
sudo rm -f /var/lib/postgresql/.preseed-completed-16
Step 2: Deploy Configuration¶
# On local machine
cd ~/src/nix-config
# Review changes
git diff
# Commit the fixes
git add hosts/forge/default.nix \
  modules/nixos/services/postgresql/storage-integration.nix \
  modules/nixos/postgresql-preseed.nix
git commit -m "fix(forge): resolve PostgreSQL dataset layering and preseed marker issues

- Remove duplicate postgres dataset config from forge/default.nix
- Consolidate dataset management in storage-integration.nix
- Move preseed marker outside PGDATA to parent directory
- Prevents marker from being hidden by ZFS dataset layering

Fixes disaster recovery flow where marker persisted across
PGDATA deletion, preventing automatic restore."
# Build and deploy
nix flake check
nixos-rebuild switch --flake .#forge --target-host ryan@forge.holthome.net
Step 3: Test Disaster Recovery Flow¶
Run this step only after Step 4 has produced a fresh backup, since the restore needs a backup to read from. Then test the DR flow:
# SSH to forge
ssh ryan@forge.holthome.net
# Stop PostgreSQL
sudo systemctl stop postgresql
# Verify backup exists
sudo -u postgres pgbackrest --stanza=main info
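# Delete PGDATA contents completely, including dotfiles
# (a /* glob skips hidden files and is expanded by the invoking shell, not by sudo)
sudo find /var/lib/postgresql/16 -mindepth 1 -delete
# Delete marker (if exists)
sudo rm -f /var/lib/postgresql/.preseed-completed-16
# Start services - preseed should automatically restore
sudo systemctl start postgresql-preseed
sudo systemctl status postgresql-preseed
# Start PostgreSQL so the checks below can connect
sudo systemctl start postgresql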
# Check if restore succeeded
sudo -u postgres psql -l
sudo -u postgres psql -d dispatcharr -c "\dt"
# Verify it's the restored database (not fresh)
sudo -u postgres psql -d dispatcharr -c "SELECT COUNT(*) FROM channels;" # Should have data
Step 4: Take New Backups¶
After deployment and verification:
# Start a full backup
sudo systemctl start pgbackrest-full-backup.service
sudo systemctl status pgbackrest-full-backup.service
# Verify backup completed
sudo -u postgres pgbackrest --stanza=main info
# Check archive status
sudo -u postgres pgbackrest --stanza=main --repo=1 repo-ls archive/main/16-1
Verification Checklist¶
After deployment:
- Only one tank/services/postgresql dataset exists
- Dataset mounted at /var/lib/postgresql/16/
- No tank/services/postgres dataset exists
- Marker file location is /var/lib/postgresql/.preseed-completed-16
- Preseed service has correct ConditionPathExists path
- New full backup created successfully
- DR test: Delete PGDATA + marker → restore works → PostgreSQL starts with restored data
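The dataset items in this checklist could be checked mechanically. The sketch below is one possible approach, with check_single_dataset as a hypothetical name. It assumes tab-separated input as produced by zfs list -H -o name,mountpoint, and it passes only when exactly one dataset uses the mountpoint and that dataset has the expected name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that exactly one dataset is mounted at a path
# and that it is the expected one.
# stdin: "name<TAB>mountpoint" lines (zfs list -H -o name,mountpoint).
check_single_dataset() {
  local want_name="$1" want_mount="${2%/}"
  awk -F '\t' -v n="$want_name" -v m="$want_mount" '
    { mp = $2; sub(/\/$/, "", mp) }          # tolerate a trailing slash
    mp == m {
      count++
      if ($1 == n) found = 1
      print "mounted at " m ": " $1          # report every candidate
    }
    END { exit (count == 1 && found) ? 0 : 1 }
  '
}

# Usage on the host:
#   zfs list -H -o name,mountpoint \
#     | check_single_dataset tank/services/postgresql /var/lib/postgresql/16 \
#     || echo "dataset layout is not as expected" >&2
```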
Manual Marker Management (When Needed)¶
The preseed service automatically creates the marker when:
1. It finds an already-initialized PGDATA (skips restore)
2. It successfully completes a restore
To force a restore (disaster recovery scenario):
# 1. Stop PostgreSQL
sudo systemctl stop postgresql
# 2. Delete PGDATA contents, including dotfiles (a /* glob would skip them)
sudo find /var/lib/postgresql/16 -mindepth 1 -delete
# 3. Delete marker (CRITICAL!)
sudo rm -f /var/lib/postgresql/.preseed-completed-16
# 4. Start services (preseed will run automatically)
sudo systemctl start postgresql
# Preseed service will:
# - Check marker doesn't exist ✓
# - Find empty PGDATA ✓
# - Restore from latest backup ✓
# - Create marker after success ✓
# - PostgreSQL starts with restored data ✓
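One caveat when emptying PGDATA by hand: in bash, a /* glob does not match dotfiles by default, and under sudo the glob is expanded by the invoking user's shell, which may not be able to read a directory owned by postgres. A minimal sketch of a safer way to empty a directory follows. The name empty_dir is hypothetical, and it assumes a find that supports -mindepth and -delete (GNU findutils does).

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove everything inside a directory, hidden entries
# included, while keeping the directory itself (and so its mountpoint).
empty_dir() {
  local dir="$1"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  # -mindepth 1 skips the directory itself; -delete removes children depth-first.
  find "$dir" -mindepth 1 -delete
}

# Usage (as root on the host, with PostgreSQL stopped):
#   empty_dir /var/lib/postgresql/16
```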
Monitoring and Logging¶
Check preseed status:
# Service status
systemctl status postgresql-preseed
# Full logs
sudo journalctl -u postgresql-preseed --no-pager
# Check if marker exists
ls -la /var/lib/postgresql/.preseed-completed-16
# Check PostgreSQL data timestamp
sudo ls -lt /var/lib/postgresql/16/ | head
Technical Notes¶
Why Move Marker Outside PGDATA?¶
When the marker is inside PGDATA (/var/lib/postgresql/16/.preseed-completed):
- It's subject to ZFS dataset mount layering
- Can persist on hidden bottom dataset layers
- Survives rm -rf /var/lib/postgresql/16/* deletion
- Breaks DR flow by preventing preseed execution
When the marker is in parent directory (/var/lib/postgresql/.preseed-completed-16):
- Outside any PGDATA-specific dataset
- Always on the same filesystem layer
- Reliable deletion with rm -f
- DR flow works correctly
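Whether the marker's directory and PGDATA really sit on different filesystems can be confirmed by comparing device IDs. This is a sketch under assumptions: same_filesystem is a hypothetical name, and stat -c '%d' assumes GNU coreutils (or a compatible stat), as found on a typical NixOS host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when two paths live on the same filesystem.
# Returns 0 if the device IDs match, 1 if they differ, 2 if stat fails.
same_filesystem() {
  local a b
  a="$(stat -c '%d' -- "$1")" || return 2
  b="$(stat -c '%d' -- "$2")" || return 2
  [ "$a" = "$b" ]
}

# Usage on the host: with PGDATA on its own dataset, the parent directory
# is expected to report a different device.
#   same_filesystem /var/lib/postgresql /var/lib/postgresql/16 \
#     && echo "marker directory shares a filesystem with PGDATA" >&2
```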
Dataset Naming Convention¶
Going forward, use the PostgreSQL module's storage-integration for dataset management:
- Dataset name: postgresql (not postgres)
- Management: Automatic via storage-integration.nix
- Properties: Defined in module, inherited by all hosts
- Host overrides: Use module options, not direct dataset config
Future Improvements¶
Consider these enhancements:
- Pre-flight checks: Add service that verifies only one PostgreSQL dataset exists
- Marker validation: Script to check marker file location matches systemd condition
- DR documentation: Update docs/postgresql-auto-restore-homelab.md with marker details
- Automated testing: Include DR scenario in NixOS tests
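The marker validation idea could look roughly like the sketch below. It is only an illustration: condition_matches_marker is a hypothetical name, and it assumes the preseed unit expresses its check as a ConditionPathExists= line, optionally negated with a leading "!". The unit text is read from stdin, for example from systemctl cat.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the unit text on stdin contains a
# ConditionPathExists= line that points at the expected marker path.
condition_matches_marker() {
  local expected="$1" line path
  while IFS= read -r line; do
    case "$line" in
      ConditionPathExists=*)
        path="${line#ConditionPathExists=}"
        path="${path#\!}"   # a leading "!" negates the check; strip it to compare
        [ "$path" = "$expected" ] && return 0
        ;;
    esac
  done
  return 1
}

# Usage on the host:
#   systemctl cat postgresql-preseed \
#     | condition_matches_marker /var/lib/postgresql/.preseed-completed-16 \
#     || echo "systemd condition does not match the marker path" >&2
```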
References¶
- Preseed module: modules/nixos/postgresql-preseed.nix
- Storage integration: modules/nixos/services/postgresql/storage-integration.nix
- Forge config: hosts/forge/default.nix
- ZFS dataset module: modules/nixos/storage/datasets.nix