ADR-007: Multi-Tier Disaster Recovery (Preseed Pattern)

Status: Accepted
Date: 2025-12-09
Context: Automated service restoration strategy

Context

Services with persistent data need automated recovery when their ZFS dataset is missing or corrupted. Without automation, restoring a service requires manual intervention, increasing downtime and operational burden.

Several restoration sources exist with different trade-offs:

  • ZFS Syncoid replication: Fast, block-level, preserves snapshots
  • Local ZFS snapshots: Instant, no network dependency
  • Restic file backup: Offsite, geographic redundancy

The question is how to orchestrate these sources for automatic recovery.

Decision

Implement a multi-tier preseed system that attempts restoration in priority order, with Restic excluded from automated restore by default.

Restore Priority Order

  1. Syncoid (Primary): Block-level replication from nas-1
     • Fastest for large datasets
     • Preserves ZFS properties and snapshot lineage
     • Maintains incremental replication for future syncs

  2. Local Snapshots: ZFS snapshots on same host
     • Instant rollback
     • No network dependency
     • Limited to retention window

  3. Restic (Manual DR Only): File-based backup
     • ⚠️ NOT recommended for automated preseed
     • Breaks ZFS lineage (future sends must be full)
     • Use only for true disaster recovery with manual intervention

Default Configuration

restoreMethods = [ "syncoid" "local" ];  # Recommended

Architecture

Service Start
┌─────────────────────┐
│ Dataset exists?     │──Yes──▶ Start service normally
└─────────┬───────────┘
          │ No
┌─────────────────────┐
│ Try Syncoid restore │──Success──▶ Start service
└─────────┬───────────┘
          │ Fail
┌─────────────────────┐
│ Try local snapshot  │──Success──▶ Start service
└─────────┬───────────┘
          │ Fail
    Preseed fails
    (Alert operator)

Consequences

Positive

  • Automatic recovery: Services restore themselves without intervention
  • Preserved ZFS lineage: Syncoid/local restores maintain incremental capability
  • Clear failure signal: Preseed failure indicates an infrastructure issue (e.g. nas-1 unreachable with no local snapshot in retention)
  • Consistent pattern: Same implementation for native and containerized services

Negative

  • No automatic offsite restore: Restic excluded from automation
  • Network dependency: Syncoid requires nas-1 reachability
  • Manual DR for worst case: True disaster requires operator intervention

Why Exclude Restic from Automation

Including Restic in automated preseed has these problems:

  1. Breaks ZFS lineage: After Restic restore, future Syncoid sends must be full (not incremental)
  2. Hides infrastructure issues: if nas-1 is down, the service silently fails over to Restic instead of alerting the operator
  3. Creates cleanup work: Manual re-establishment of replication required
  4. False sense of recovery: Service runs, but backup infrastructure is degraded
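The lineage problem can be made concrete: an incremental `zfs send` requires a snapshot that exists on both source and target, and a Restic restore recreates files without any snapshots. A minimal sketch of that precondition, assuming a hypothetical `has_common_snapshot` helper (in practice the two lists would come from `zfs list -H -o name -t snapshot` on each host):

```shell
# Incremental replication needs at least one snapshot present on both
# the source and the target dataset. A restic (file-level) restore
# produces a dataset with no snapshots, so this check fails and syncoid
# must fall back to a full send.
has_common_snapshot() {
  # $1 = source snapshot names, $2 = target snapshot names
  # (newline-separated lists)
  comm -12 <(sort <<<"$1") <(sort <<<"$2") | grep -q .
}
```

After a Syncoid or local-snapshot restore this check passes and incremental sends continue; after a Restic restore the target list is empty and only a full send is possible.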

When to Include Restic

Only for services where immediate availability is more important than ZFS lineage:

# Use sparingly - only for critical services
restoreMethods = [ "syncoid" "local" "restic" ];

Implementation

Service Module Integration

# In service module
preseed = {
  enable = lib.mkEnableOption "automatic restore before service start";
  repositoryUrl = lib.mkOption { type = lib.types.str; };
  passwordFile = lib.mkOption { type = lib.types.path; };
  restoreMethods = lib.mkOption {
    type = lib.types.listOf (lib.types.enum [ "syncoid" "local" "restic" ]);
    default = [ "syncoid" "local" ];
  };
};

Host Configuration

# Using forgeDefaults helper
modules.services.sonarr = {
  enable = true;
  preseed = forgeDefaults.mkPreseed [ "syncoid" "local" ];
};

Preseed Service Unit

systemd.services."preseed-${serviceName}" = {
  wantedBy = [ ];  # Started as dependency only
  before = [ "${serviceName}.service" ];
  requiredBy = [ "${serviceName}.service" ];

  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };

  script = ''
    if zfs list ${dataset} &>/dev/null; then
      echo "Dataset exists, skipping preseed"
      exit 0
    fi

    # Try restore methods in order...
  '';
};
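The elided restore loop might look like the sketch below. The `restore_syncoid`/`restore_local` functions are hypothetical stand-ins for the real per-method logic (wrapping `syncoid` and `zfs rollback` respectively); only the priority ordering and fail-over behaviour are taken from this ADR:

```shell
# Attempt each configured restore method in priority order and stop at
# the first success. Exiting non-zero makes the preseed unit fail,
# which blocks the dependent service (via requiredBy) and surfaces the
# failure to the operator.
try_preseed() {
  local method
  for method in "$@"; do
    if "restore_${method}"; then
      echo "preseed: restored via ${method}"
      return 0
    fi
    echo "preseed: ${method} restore failed, trying next" >&2
  done
  echo "preseed: all restore methods exhausted" >&2
  return 1
}
```

Invoked with the configured `restoreMethods` list, e.g. `try_preseed syncoid local`.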