ADR-007: Multi-Tier Disaster Recovery (Preseed Pattern)

Status: Accepted
Date: 2025-12-09
Context: Automated service restoration strategy

Context

Services with persistent data need automated recovery when their ZFS dataset is missing or corrupted. Without automation, restoring a service requires manual intervention, increasing downtime and operational burden.

Several restoration sources exist with different trade-offs:

  • ZFS Syncoid replication: Fast, block-level, preserves snapshots
  • Local ZFS snapshots: Instant, no network dependency
  • Restic file backup: Offsite, geographic redundancy

The question is how to orchestrate these sources for automatic recovery.

Decision

Implement a multi-tier preseed system that attempts restoration in priority order, with Restic excluded from automated restore by default.

Restore Priority Order

  1. Syncoid (Primary): Block-level replication from nas-1
     • Fastest for large datasets
     • Preserves ZFS properties and snapshot lineage
     • Maintains incremental replication for future syncs

  2. Local Snapshots: ZFS snapshots on same host
     • Instant rollback
     • No network dependency
     • Limited to retention window

  3. Restic (Manual DR Only): File-based backup
     • ⚠️ NOT recommended for automated preseed
     • Breaks ZFS lineage (future sends must be full)
     • Use only for true disaster recovery with manual intervention

Default Configuration

restoreMethods = [ "syncoid" "local" ];  # Recommended

Architecture

Service Start
┌─────────────────────┐
│ Dataset exists?     │──Yes──▶ Start service normally
└─────────┬───────────┘
          │ No
┌─────────────────────┐
│ Try Syncoid restore │──Success──▶ Start service
└─────────┬───────────┘
          │ Fail
┌─────────────────────┐
│ Try local snapshot  │──Success──▶ Start service
└─────────┬───────────┘
          │ Fail
    Preseed fails
    (Alert operator)

Consequences

Positive

  • Automatic recovery: Services restore themselves without intervention
  • Preserved ZFS lineage: Syncoid/local restores maintain incremental capability
  • Clear failure signal: Preseed failure indicates an infrastructure issue (e.g. nas-1 unreachable with no local snapshot in retention)
  • Consistent pattern: Same implementation for native and containerized services

Negative

  • No automatic offsite restore: Restic excluded from automation
  • Network dependency: Syncoid requires nas-1 reachability
  • Manual DR for worst case: True disaster requires operator intervention

Why Exclude Restic from Automation

Including Restic in automated preseed has these problems:

  1. Breaks ZFS lineage: After Restic restore, future Syncoid sends must be full (not incremental)
  2. Hides infrastructure issues: if nas-1 is down, the service silently fails over to Restic instead of alerting the operator
  3. Creates cleanup work: Manual re-establishment of replication required
  4. False sense of recovery: Service runs, but backup infrastructure is degraded
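The lineage problem can be made concrete: an incremental `zfs send` requires a snapshot that exists on both source and target, and a Restic restore recreates files without any snapshots. A minimal sketch of that precondition, assuming a hypothetical `has_common_snapshot` helper (in practice the two lists would come from `zfs list -H -o name -t snapshot` on each host):

```shell
# Incremental replication needs at least one snapshot present on both
# the source and the target dataset. A restic (file-level) restore
# produces a dataset with no snapshots, so this check fails and syncoid
# must fall back to a full send.
has_common_snapshot() {
  # $1 = source snapshot names, $2 = target snapshot names
  # (newline-separated lists)
  comm -12 <(sort <<<"$1") <(sort <<<"$2") | grep -q .
}
```

After a Syncoid or local-snapshot restore this check passes and incremental sends continue; after a Restic restore the target list is empty and only a full send is possible.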

When to Include Restic

Only for services where immediate availability is more important than ZFS lineage:

# Use sparingly - only for critical services
restoreMethods = [ "syncoid" "local" "restic" ];

Implementation

Service Module Integration

# In service module
preseed = {
  enable = lib.mkEnableOption "automatic restore before service start";
  repositoryUrl = lib.mkOption { type = lib.types.str; };
  passwordFile = lib.mkOption { type = lib.types.path; };
  restoreMethods = lib.mkOption {
    type = lib.types.listOf (lib.types.enum [ "syncoid" "local" "restic" ]);
    default = [ "syncoid" "local" ];
  };
};

Host Configuration

# Using forgeDefaults helper
modules.services.sonarr = {
  enable = true;
  preseed = forgeDefaults.mkPreseed [ "syncoid" "local" ];
};

Preseed Service Unit

systemd.services."preseed-${serviceName}" = {
  wantedBy = [ ];  # Started as dependency only
  before = [ "${serviceName}.service" ];
  requiredBy = [ "${serviceName}.service" ];

  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };

  script = ''
    if zfs list ${dataset} &>/dev/null; then
      echo "Dataset exists, skipping preseed"
      exit 0
    fi

    # Try restore methods in order...
  '';
};
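The elided restore loop might look like the sketch below. The `restore_syncoid`/`restore_local` functions are hypothetical stand-ins for the real per-method logic (wrapping `syncoid` and `zfs rollback` respectively); only the priority ordering and fail-over behaviour are taken from this ADR:

```shell
# Attempt each configured restore method in priority order and stop at
# the first success. Exiting non-zero makes the preseed unit fail,
# which blocks the dependent service (via requiredBy) and surfaces the
# failure to the operator.
try_preseed() {
  local method
  for method in "$@"; do
    if "restore_${method}"; then
      echo "preseed: restored via ${method}"
      return 0
    fi
    echo "preseed: ${method} restore failed, trying next" >&2
  done
  echo "preseed: all restore methods exhausted" >&2
  return 1
}
```

Invoked with the configured `restoreMethods` list, e.g. `try_preseed syncoid local`.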