NixOS Backup System Onboarding Guide¶
Last Updated: 2025-10-08
Overview¶
This guide provides comprehensive instructions for onboarding new hosts and services to the centralized NixOS backup system. The backup system is built on Restic with ZFS snapshot integration, comprehensive monitoring, automated testing, and enterprise-grade features including error analysis and documentation generation.
Table of Contents¶
- System Architecture
- Prerequisites
- Quick Start
- Host Onboarding
  - Step 9: Configure ZFS Replication (Optional)
- Service Onboarding
- Configuration Reference
- Advanced Features
- Monitoring & Alerting
- Troubleshooting
- Best Practices
System Architecture¶
The backup system consists of several integrated components:
Core Components¶
- Restic Backup Engine: Modern, encrypted, deduplicated file-based backup solution
- ZFS Snapshot Integration: Consistent point-in-time backups via ZFS snapshots
- Sanoid/Syncoid (Optional): Automated ZFS snapshot management and block-level replication
- Service Module: Pre-configured backup profiles for common services (UniFi, Omada, 1Password Connect, Attic, System configs)
- Monitoring System: Multi-tier monitoring with Prometheus metrics, error analysis, and notifications
- Automated Testing: Repository verification and restore testing
- Documentation Generator: Self-documenting system with runbooks
Architecture Diagram¶
┌─────────────────────────────────────────────────────────────┐
│ NixOS Host │
├─────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ZFS Pool │ │ Service │ │ System │ │
│ │ Snapshots │ │ Data │ │ Configs │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Restic Backups │ │
│ │ (Encrypted) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ┌────▼────┐ ┌─────▼─────┐ ┌────▼────┐ │
│ │ Primary │ │ Secondary │ │ Cloud │ │
│ │ Repo │ │ Repo │ │ Repo │ │
│ └─────────┘ └───────────┘ └─────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Monitoring & Testing Layer │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ • Prometheus Metrics • Error Analysis │ │
│ │ • Repository Checks • Restore Testing │ │
│ │ • ntfy Notifications • Healthchecks.io │ │
│ │ • Auto Documentation • Audit Logging │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Prerequisites¶
Before onboarding a host, ensure the following are available:
Required¶
- NixOS System: Host must be running NixOS
- Backup Repository: At least one configured Restic repository (local, NAS, or cloud)
- Repository Password: Secure password file for repository encryption
- Network Access: Connectivity to backup destinations
Optional but Recommended¶
- ZFS Filesystem: For consistent snapshots during backup
- NFS Mount Management: For standardized NAS-based backup storage (see NFS Mount Management Guide)
- Node Exporter: For Prometheus metrics export
- SOPS/Age: For secure secret management (repository passwords, credentials)
- Notification Service: ntfy.sh or Healthchecks.io for alerts
SOPS Secret Management¶
Repository passwords and service credentials should be managed via SOPS:
# In secrets.sops.yaml
restic-primary-password: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
restic-b2-env: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
unifi-mongo-credentials: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
# In your host configuration
sops.secrets.restic-primary-password = {
sopsFile = ./secrets.sops.yaml;
owner = "restic-backup";
group = "restic-backup";
mode = "0400";
};
Quick Start¶
Basic Host Backup Configuration¶
The simplest configuration to get started:
# In your host's configuration.nix
{
modules.backup = {
enable = true;
# Configure at least one repository
restic = {
enable = true;
repositories.primary = {
url = "/mnt/nas/backups/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-password.path;
primary = true;
};
# Define backup jobs
jobs.system = {
enable = true;
paths = [
"/etc/nixos"
"/home"
];
repository = "primary";
tags = [ "system" "essential" ];
};
};
# Enable basic monitoring
monitoring = {
enable = true;
ntfy = {
enable = true;
topic = "https://ntfy.sh/my-backups";
};
};
};
}
This minimal configuration provides:
- Daily backups at 02:00
- 14 daily, 8 weekly, 6 monthly, 2 yearly retention
- ntfy notifications on failure
- Structured JSON logging
Host Onboarding¶
Step 1: Import the Backup Module¶
The backup module is located at /modules/nixos/backup.nix and is automatically imported via the default module imports.
Verify it's imported:
# In /modules/nixos/default.nix
{
imports = [
# ... other modules
./backup.nix
./services/backup-services.nix
];
}
Step 2: Configure Repositories¶
Define one or more backup repositories. Repositories can be local, NAS-based, or cloud-based.
Local/NAS Repository¶
For NAS-based repositories, use the NFS mount management module for standardized configuration:
# First, configure the NFS mount (see nfs-mount-management.md)
modules.filesystems.nfs = {
enable = true;
servers.nas = {
address = "nas.holthome.net";
version = "4.2";
};
shares.backups = {
server = "nas";
remotePath = "/export/backups/${config.networking.hostName}";
localPath = "/mnt/nas/backups";
autoMount = false; # Manual mount, only needed during backups
soft = true; # Don't hang if NAS is unavailable
options = [ "noexec" "nosuid" "nodev" ];
};
};
# Then configure the backup repository
modules.backup.restic.repositories = {
primary = {
url = "/mnt/nas/backups";
passwordFile = config.sops.secrets.restic-primary-password.path;
primary = true;
};
};
Note: See the NFS Mount Management Guide for comprehensive NFS configuration options and best practices.
Backblaze B2 Repository¶
modules.backup.restic.repositories = {
b2-cloud = {
url = "b2:bucket-name:/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-b2-password.path;
environmentFile = config.sops.secrets.restic-b2-env.path; # Contains B2 credentials
primary = false;
};
};
Environment file format for B2:
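Restic's Backblaze B2 backend reads its credentials from environment variables, so the file referenced by environmentFile typically looks like the following (the values are placeholders for your B2 key ID and application key):
B2_ACCOUNT_ID=<your-b2-key-id>
B2_ACCOUNT_KEY=<your-b2-application-key>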
SFTP Repository¶
modules.backup.restic.repositories = {
remote-sftp = {
url = "sftp:backup@backup-server.example.com:/backups/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-sftp-password.path;
primary = false;
};
};
Step 3: Configure ZFS Integration (Optional)¶
If your host uses ZFS, enable snapshot integration for consistent backups:
modules.backup.zfs = {
enable = true;
pool = "rpool"; # Your ZFS pool name
datasets = [
"" # Root dataset
"home" # Additional datasets
"var/lib"
];
retention = {
daily = 7;
weekly = 4;
monthly = 3;
};
};
How ZFS Integration Works:
1. Before backup, a snapshot is created for each dataset
2. The snapshot is mounted at /mnt/backup-snapshot
3. Restic backs up from the snapshot (consistent state)
4. After backup, snapshots are cleaned up based on retention
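As a rough illustration, the per-dataset flow is approximately equivalent to the manual shell steps below. The dataset and repository paths come from the examples above; the module's actual commands may differ.
# Create a point-in-time snapshot of the dataset
SNAP="rpool/home@backup-$(date +%Y%m%d-%H%M%S)"
zfs snapshot "$SNAP"
# Mount the read-only snapshot and back it up with Restic
mount -t zfs "$SNAP" /mnt/backup-snapshot
restic -r /mnt/nas/backups backup /mnt/backup-snapshot --tag home
# Unmount; the module prunes old backup snapshots per its retention policy
umount /mnt/backup-snapshot
zfs destroy "$SNAP"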
Step 4: Define Backup Jobs¶
Create backup jobs for different data types:
modules.backup.restic.jobs = {
# System configuration backup
system-config = {
enable = true;
paths = [
"/etc/nixos"
"/etc/systemd"
];
repository = "primary";
tags = [ "system" "configuration" ];
excludePatterns = [
"*.tmp"
"*/.git"
];
};
# Home directories backup
home = {
enable = true;
paths = [
"/home"
];
repository = "primary";
tags = [ "user-data" "home" ];
excludePatterns = [
"*/.cache"
"*/node_modules"
"*/target"
"*/.git"
];
resources = {
memory = "512m";
cpus = "1.0";
};
};
# Application data backup
app-data = {
enable = true;
paths = [
"/var/lib/postgresql"
"/var/lib/mysql"
];
repository = "primary";
tags = [ "databases" "critical" ];
preBackupScript = ''
# Dump databases before backup
echo "Creating database dumps..."
# Add your database dump commands here
'';
postBackupScript = ''
# Cleanup dumps
echo "Cleaning up database dumps..."
'';
};
};
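The preBackupScript in the app-data job above is only a stub. A hedged sketch for PostgreSQL follows, assuming pg_dumpall is on the service's PATH, the job may run commands as the postgres user, and /var/backup/postgresql is added to the job's paths:
modules.backup.restic.jobs.app-data = {
  # ...paths, repository, tags as above...
  preBackupScript = ''
    # Dump all PostgreSQL databases into a directory that is part of the backup
    mkdir -p /var/backup/postgresql
    sudo -u postgres pg_dumpall > /var/backup/postgresql/all.sql
  '';
  postBackupScript = ''
    # Remove the dump once the backup run has finished
    rm -f /var/backup/postgresql/all.sql
  '';
};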
Step 5: Configure Monitoring¶
Enable comprehensive monitoring and alerting:
modules.backup.monitoring = {
enable = true;
# ntfy.sh notifications
ntfy = {
enable = true;
topic = "https://ntfy.sh/my-homelab-backups";
};
# Healthchecks.io monitoring
healthchecks = {
enable = true;
uuidFile = config.sops.secrets.healthchecks-uuid.path;
};
# Immediate failure notifications
onFailure = {
enable = true;
notificationScript = ''
# Custom failure handling
echo "Backup failed for $JOB_NAME on $HOSTNAME"
# Add custom notification logic
'';
};
# Prometheus metrics export
prometheus = {
enable = true;
metricsDir = "/var/lib/node_exporter/textfile_collector";
};
# Error analysis and categorization
errorAnalysis = {
enable = true;
# Uses default error categories or customize
};
};
Step 6: Enable Advanced Features¶
Repository Verification¶
modules.backup.verification = {
enable = true;
schedule = "weekly"; # daily/weekly/monthly
checkData = false; # Set true for full data integrity check
checkDataSubset = "5%"; # Percentage to check when checkData=false
};
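The subset check presumably maps onto restic's --read-data-subset option, which you can also run by hand against a repository (repository path is an example):
restic -r /mnt/nas/backups check --read-data-subset=5%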
Automated Restore Testing¶
modules.backup.restoreTesting = {
enable = true;
schedule = "monthly";
sampleFiles = 10; # Number of random files to test
testDir = "/tmp/backup-restore-test";
retainTestData = false; # Cleanup after test
};
Configuration Validation¶
modules.backup.validation = {
enable = true;
preFlightChecks = {
enable = true;
minFreeSpace = "10G";
networkTimeout = 30;
};
repositoryHealth = {
enable = true;
maxAge = "48h"; # Alert if no backup in 48h
minBackups = 3; # Minimum snapshots to maintain
};
};
Performance Tuning¶
modules.backup.performance = {
cacheDir = "/var/cache/restic";
cacheSizeLimit = "1G";
ioScheduling = {
enable = true;
ioClass = "idle"; # Don't impact other I/O
priority = 7; # Lowest priority
};
};
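Behind the scenes these options presumably translate into systemd resource settings on the backup units. The manual equivalent for a single unit would look roughly like this (the unit name is an example following the restic-backups-<job> pattern used elsewhere in this guide):
systemd.services."restic-backups-system".serviceConfig = {
  IOSchedulingClass = "idle";   # only use disk bandwidth when it is otherwise idle
  IOSchedulingPriority = 7;     # lowest priority within the class
  Nice = 19;                    # optionally deprioritize CPU as well
};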
Security Hardening¶
modules.backup.security = {
enable = true;
restrictNetwork = true; # Only allow access to backup repos
readOnlyRootfs = true; # Read-only root filesystem
auditLogging = true; # Detailed audit logs
};
Documentation Generation¶
modules.backup.documentation = {
enable = true;
outputDir = "/var/lib/backup/docs";
includeMetrics = true;
};
This generates comprehensive documentation including:
- System overview and configuration
- Operational procedures and runbooks
- Troubleshooting guides
- Metrics reference
- Emergency procedures
Step 7: Customize Global Settings¶
modules.backup.restic.globalSettings = {
compression = "auto"; # auto/off/max
readConcurrency = 2; # Concurrent read operations
retention = {
daily = 14;
weekly = 8;
monthly = 6;
yearly = 2;
};
};
modules.backup.schedule = "02:00"; # Backup time (24-hour format)
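For reference, the retention policy above presumably maps onto restic's forget flags; the equivalent manual prune command would be (repository path is an example):
restic -r /mnt/nas/backups forget --keep-daily 14 --keep-weekly 8 --keep-monthly 6 --keep-yearly 2 --prune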
Step 8: Apply Configuration¶
# Build and switch to new configuration
sudo nixos-rebuild switch
# Verify backup services are enabled
systemctl list-timers "restic-*"
# Check service status
systemctl status restic-backups-*
# View logs
journalctl -u restic-backups-* -f
Step 9: Configure ZFS Replication (Optional but Recommended)¶
For hosts with ZFS, add automated snapshot management and replication using Sanoid and Syncoid. This provides block-level replication complementing the file-based Restic backups.
Prerequisites¶
- ZFS-based filesystem on source host
- ZFS dataset on destination (e.g., nas-1)
- SSH key for zfs-replication user
- ZFS permissions granted on both source and destination
Configure Sanoid (Snapshot Management)¶
Add to your host configuration (e.g., hosts/forge/zfs-replication.nix):
{ config, ... }:
{
config = {
# Create dedicated user for ZFS replication
users.users.zfs-replication = {
isSystemUser = true;
group = "zfs-replication";
home = "/var/lib/zfs-replication";
createHome = true;
shell = "/run/current-system/sw/bin/nologin";
description = "ZFS replication service user";
};
users.groups.zfs-replication = {};
# Manage SSH private key via SOPS
sops.secrets."zfs-replication/ssh-key" = {
owner = "zfs-replication";
group = "zfs-replication";
mode = "0600";
path = "/var/lib/zfs-replication/.ssh/id_ed25519";
};
# Create .ssh directory
systemd.tmpfiles.rules = [
"d /var/lib/zfs-replication/.ssh 0700 zfs-replication zfs-replication -"
];
# Configure Sanoid for snapshot management
services.sanoid = {
enable = true;
templates = {
production = {
hourly = 24; # 1 day of hourly snapshots
daily = 7; # 1 week of daily snapshots
weekly = 4; # 1 month of weekly snapshots
monthly = 3; # 3 months of monthly snapshots
yearly = 0; # No yearly snapshots
autosnap = true;
autoprune = true;
};
};
datasets = {
"rpool/safe/home" = {
useTemplate = [ "production" ];
recursive = false;
};
"rpool/safe/persist" = {
useTemplate = [ "production" ];
recursive = false;
};
};
};
# Configure Syncoid for replication
services.syncoid = {
enable = true;
interval = "hourly";
sshKey = "/var/lib/zfs-replication/.ssh/id_ed25519";
commands = {
"rpool/safe/home" = {
target = "zfs-replication@nas-1.holthome.net:backup/forge/zfs-recv/home";
recursive = false;
sendOptions = "w"; # Raw encrypted send
recvOptions = "u"; # Receive without mounting
};
"rpool/safe/persist" = {
target = "zfs-replication@nas-1.holthome.net:backup/forge/zfs-recv/persist";
recursive = false;
sendOptions = "w";
recvOptions = "u";
};
};
};
};
}
Post-Deployment: Verify ZFS Permissions¶
ZFS permissions are now applied automatically via a systemd service (zfs-delegate-permissions.service). After deploying, verify they were applied correctly:
# On source host (e.g., forge)
ssh forge.holthome.net
# Verify the systemd service ran successfully
systemctl status zfs-delegate-permissions.service
# Verify permissions were granted
sudo zfs allow rpool/safe/home
sudo zfs allow rpool/safe/persist
# Expected output should show:
# - sanoid: send,snapshot,hold,destroy
# - zfs-replication: send,snapshot,hold
Note: The configuration includes a systemd.services.zfs-delegate-permissions service that automatically applies ZFS permissions at boot, making the system fully declarative and reproducible.
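A minimal sketch of what such a oneshot unit can look like, assuming pkgs is in scope and using the datasets and delegated permissions listed above:
systemd.services.zfs-delegate-permissions = {
  description = "Delegate ZFS permissions for Sanoid/Syncoid";
  wantedBy = [ "multi-user.target" ];
  after = [ "zfs.target" ];
  path = [ pkgs.zfs ];
  serviceConfig.Type = "oneshot";
  script = ''
    for ds in rpool/safe/home rpool/safe/persist; do
      # sanoid creates and prunes snapshots; zfs-replication only sends them
      zfs allow sanoid send,snapshot,hold,destroy "$ds"
      zfs allow zfs-replication send,snapshot,hold "$ds"
    done
  '';
};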
On destination host (nas-1), the zfs-replication user needs receive permissions:
# On destination host (nas-1)
ssh nas-1.holthome.net
# Grant receive permissions
sudo zfs allow zfs-replication receive,create,mount,hold backup/forge/zfs-recv
# Verify permissions
sudo zfs allow backup/forge/zfs-recv
Verify ZFS Replication Setup¶
# On source host - check services
systemctl status sanoid.timer
systemctl status syncoid.timer
# Trigger initial snapshot
sudo systemctl start sanoid.service
# Check snapshots were created
zfs list -t snapshot | grep autosnap
# Trigger initial replication (will take time for first full send)
sudo systemctl start syncoid.service
# Monitor replication progress
sudo journalctl -u syncoid.service -f
# On destination - verify snapshots arrived
ssh nas-1.holthome.net 'zfs list -t all backup/forge/zfs-recv'
ZFS Replication Monitoring¶
# Check timer schedules
systemctl list-timers sanoid syncoid
# View recent snapshot activity
sudo journalctl -u sanoid.service -n 50
# View recent replication activity
sudo journalctl -u syncoid.service -n 50
# Check for errors
systemctl --state=failed | grep -E "sanoid|syncoid"
# Compare snapshot counts (source vs destination)
echo "Source snapshots:"
zfs list -t snapshot rpool/safe/home | wc -l
echo "Destination snapshots:"
ssh nas-1.holthome.net 'zfs list -t snapshot backup/forge/zfs-recv/home | wc -l'
Note: For comprehensive ZFS replication documentation, see the ZFS Replication Setup Guide.
Benefits of Sanoid/Syncoid¶
Complements Restic by providing:
- Near-instant snapshots (copy-on-write)
- Efficient block-level replication (only changed blocks)
- Fast bare-metal recovery
- Preserves ZFS properties and attributes
- Hourly snapshots with automatic pruning
Use cases:
- Quick rollback to recent snapshot
- Fast recovery of entire datasets
- Replication to off-site ZFS storage
- Disaster recovery with block-level consistency
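For example, rolling a dataset back to a recent Sanoid snapshot (the snapshot name is illustrative; rollback discards everything written after that snapshot):
# Show the most recent snapshots for the dataset
zfs list -t snapshot -o name -s creation rpool/safe/home | tail -n 5
# Roll back to the chosen snapshot (-r also destroys any newer snapshots)
sudo zfs rollback -r rpool/safe/home@autosnap_2025-10-08_02:00:01_hourly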
Service Onboarding¶
The backup system includes pre-configured profiles for common homelab services. These profiles handle service-specific requirements like database dumps, application quiescence, and proper exclusion patterns.
Available Service Profiles¶
- UniFi Controller: MongoDB dumps and configuration backup
- Omada Controller: Database export and configuration
- 1Password Connect: Vault and credentials backup
- Attic Binary Cache: Large dataset backup with optional ZFS send
- System Configuration: NixOS generation and flake tracking
Enabling Service Backups¶
Import the service backup module (already imported by default):
UniFi Controller Backup¶
modules.services.backup-services = {
enable = true;
unifi = {
enable = true;
dataPath = "/var/lib/unifi"; # Default path
mongoCredentialsFile = config.sops.secrets.unifi-mongo-creds.path;
};
};
What it backs up:
- MongoDB database with oplog (point-in-time consistency)
- Configuration files
- Keystore
- Excludes: logs, work, temp directories
SOPS secret format (mongoCredentialsFile):
Omada Controller Backup¶
modules.services.backup-services = {
enable = true;
omada = {
enable = true;
dataPath = "/var/lib/omada";
containerName = "omada"; # If running in container
};
};
What it backs up:
- MongoDB collections (sites, devices)
- Controller data and configuration
- Excludes: logs, work, temp directories
1Password Connect Backup¶
modules.services.backup-services = {
enable = true;
onepassword-connect = {
enable = true;
dataPath = "/var/lib/onepassword-connect/data";
credentialsFile = config.sops.secrets.op-connect-creds.path;
};
};
What it backs up:
- Vault data
- Credentials and sync state
- Excludes: temporary files, cache
Attic Binary Cache Backup¶
modules.services.backup-services = {
enable = true;
attic = {
enable = true;
dataPath = "/var/lib/attic";
# Option 1: Standard Restic backup (smaller caches)
# useZfsSend = false;
# Option 2: ZFS send/receive (recommended for large caches)
useZfsSend = true;
nasDestination = "backup@nas.holthome.net";
};
};
What it backs up:
- Binary cache data
- Cache metadata
- Option for efficient ZFS replication
ZFS Send Method:
- More efficient for large datasets
- Incremental sends to NAS
- Preserves ZFS features (compression, dedup)
System Configuration Backup¶
modules.services.backup-services = {
enable = true;
system = {
enable = true;
paths = [
"/etc/nixos"
"/home/ryan/.config"
"/var/log"
];
excludePatterns = [
"*.tmp"
"*.cache"
"*/.git"
"*/node_modules"
];
};
};
What it backs up:
- NixOS configuration
- System generations list
- Flake lock files
- User configurations
- System logs (with exclusions)
Creating Custom Service Profiles¶
To add a new service profile, edit /modules/nixos/services/backup-services.nix:
# Add to options section
myservice = {
enable = mkEnableOption "MyService backup";
dataPath = mkOption {
type = types.str;
default = "/var/lib/myservice";
description = "Path to MyService data directory";
};
# Additional service-specific options
};
# Add to config section
(mkIf cfg.myservice.enable {
myservice = {
enable = true;
paths = [ cfg.myservice.dataPath ];
repository = "primary";
tags = [ "myservice" "application" ];
preBackupScript = ''
# Service-specific preparation
echo "Preparing MyService for backup..."
# Example: Stop service, dump database, etc.
'';
postBackupScript = ''
# Service-specific cleanup
echo "Cleaning up MyService backup..."
# Example: Restart service, remove temp files
'';
excludePatterns = [
"*/logs/*"
"*/temp/*"
];
resources = {
memory = "512m";
cpus = "1.0";
};
};
})
Configuration Reference¶
Module Options¶
modules.backup.enable¶
- Type: boolean
- Default: false
- Description: Enable the comprehensive backup system
modules.backup.zfs¶
- enable: Enable ZFS snapshot integration
- pool: ZFS pool name (default: "rpool")
- datasets: List of datasets to snapshot
- retention: Snapshot retention policy
modules.backup.restic¶
- enable: Enable Restic backup
- globalSettings: Global Restic configuration
  - compression: "auto", "off", or "max"
  - readConcurrency: Number of concurrent read operations
  - retention: Backup retention policy
- repositories: Repository definitions
  - url: Repository URL
  - passwordFile: Path to password file
  - environmentFile: Path to environment file (optional)
  - primary: Is this the primary repository?
- jobs: Backup job definitions
  - enable: Enable this job
  - paths: List of paths to backup
  - repository: Repository name to use
  - tags: Backup tags
  - excludePatterns: Patterns to exclude
  - preBackupScript: Script to run before backup
  - postBackupScript: Script to run after backup
  - resources: Resource limits
modules.backup.monitoring¶
- enable: Enable monitoring and notifications
- healthchecks: Healthchecks.io integration
- ntfy: ntfy.sh notifications
- onFailure: Immediate failure notifications
- prometheus: Prometheus metrics export
- errorAnalysis: Intelligent error categorization
- logDir: Directory for structured logs
modules.backup.verification¶
- enable: Enable automated repository verification
- schedule: Verification schedule (daily/weekly/monthly)
- checkData: Full data integrity check
- checkDataSubset: Percentage of data to verify
modules.backup.restoreTesting¶
- enable: Enable automated restore testing
- schedule: Testing schedule
- sampleFiles: Number of files to test
- testDir: Directory for test restores
- retainTestData: Keep test data after validation
modules.backup.validation¶
- enable: Enable pre-flight validation
- preFlightChecks: Pre-backup checks
  - enable: Enable checks
  - minFreeSpace: Minimum free space required
  - networkTimeout: Network connectivity timeout
- repositoryHealth: Repository health monitoring
  - enable: Enable health monitoring
  - maxAge: Maximum backup age before alert
  - minBackups: Minimum backup count
modules.backup.performance¶
- cacheDir: Restic cache directory
- cacheSizeLimit: Maximum cache size
- ioScheduling: I/O scheduling optimization
modules.backup.security¶
- enable: Enable security hardening
- restrictNetwork: Restrict network access
- readOnlyRootfs: Read-only root filesystem
- auditLogging: Detailed audit logging
modules.backup.documentation¶
- enable: Enable documentation generation
- outputDir: Documentation output directory
- includeMetrics: Include metrics in docs
modules.backup.schedule¶
- Type: string
- Default: "02:00"
- Description: Backup time in 24-hour format
Advanced Features¶
Multi-Repository Strategy¶
Configure multiple repositories for redundancy:
modules.backup.restic.repositories = {
# Primary: Fast local/NAS storage
primary = {
url = "/mnt/nas/backups/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-primary-password.path;
primary = true;
};
# Secondary: Off-site replication
secondary = {
url = "sftp:backup@remote.example.com:/backups/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-secondary-password.path;
primary = false;
};
# Tertiary: Cloud backup
b2-cloud = {
url = "b2:my-bucket:/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-b2-password.path;
environmentFile = config.sops.secrets.restic-b2-env.path;
primary = false;
};
};
# Configure jobs to use different repositories
modules.backup.restic.jobs = {
# Critical data goes to all repositories
critical-data = {
enable = true;
paths = [ "/var/lib/critical" ];
repository = "primary"; # Will also be replicated to secondary/tertiary
tags = [ "critical" ];
};
# Less critical data only to primary
cache-data = {
enable = true;
paths = [ "/var/cache/apps" ];
repository = "primary";
tags = [ "cache" ];
};
};
Custom Error Categories¶
Customize error analysis rules:
modules.backup.monitoring.errorAnalysis = {
enable = true;
categoryRules = [
{
pattern = "(timeout|connection reset)";
category = "network";
severity = "high";
actionable = true;
retryable = true;
}
{
pattern = "(disk full|out of space)";
category = "storage";
severity = "critical";
actionable = true;
retryable = false;
}
# Add custom rules specific to your environment
];
};
Scheduled Maintenance Windows¶
Adjust backup timing to avoid peak usage:
modules.backup.schedule = "03:30"; # Run at 3:30 AM
# Individual job overrides
modules.backup.restic.jobs.large-dataset = {
# ... other config ...
# Note: Individual job scheduling requires modifying the timer
};
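If one job really does need its own window, a hedged workaround is to override that job's timer directly. The unit name below assumes the restic-backups-<job> naming used elsewhere in this guide, and lib must be in scope:
systemd.timers."restic-backups-large-dataset" = {
  # Run this job Saturdays at 04:00 instead of the global schedule
  timerConfig.OnCalendar = lib.mkForce "Sat *-*-* 04:00:00";
};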
Resource Management¶
Fine-tune resources for different backup jobs:
modules.backup.restic.jobs = {
# Light backup job
configs = {
enable = true;
paths = [ "/etc" ];
repository = "primary";
resources = {
memory = "128m";
memoryReservation = "64m";
cpus = "0.25";
};
};
# Heavy backup job
databases = {
enable = true;
paths = [ "/var/lib/databases" ];
repository = "primary";
resources = {
memory = "2g";
memoryReservation = "1g";
cpus = "2.0";
};
};
};
Monitoring & Alerting¶
Systemd Service Monitoring¶
# List all backup timers
systemctl list-timers "restic-*"
# Check specific backup job status
systemctl status restic-backups-system
# View real-time logs
journalctl -u restic-backups-* -f
# Check last run result
systemctl show -p ActiveEnterTimestamp restic-backups-system
systemctl show -p ActiveState restic-backups-system
Structured Logs¶
All backup events are logged in JSON format:
# View backup job logs
tail -f /var/log/backup/backup-jobs.jsonl | jq
# View error analysis
tail -f /var/log/backup/error-analysis.jsonl | jq
# View restore test results
tail -f /var/log/backup/backup-restore-tests.jsonl | jq
# Query specific events
jq 'select(.event == "backup_failure")' /var/log/backup/*.jsonl
Prometheus Metrics¶
Available metrics (when prometheus.enable = true):
# Backup job duration
restic_backup_duration_seconds{job="system"}
# Last successful backup timestamp
restic_backup_last_success_timestamp{job="system"}
# Backup status (1=success, 0=failure)
restic_backup_status{job="system"}
# Error counts by category
backup_errors_by_category_total{category="network"}
# Error counts by severity
backup_errors_by_severity_total{severity="critical"}
# Repository verification status
restic_verification_status{repository="primary"}
# Restore test results
restic_restore_test_status{repository="primary"}
Alert Rules¶
Example Prometheus alert rules:
groups:
- name: backup.rules
rules:
- alert: BackupJobFailed
expr: restic_backup_status == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup job {{ $labels.job }} failed on {{ $labels.hostname }}"
- alert: BackupJobNotRunning
expr: time() - restic_backup_last_success_timestamp > 86400
for: 1h
labels:
severity: warning
annotations:
summary: "Backup job {{ $labels.job }} hasn't run in 24+ hours"
- alert: HighBackupErrorRate
expr: increase(backup_errors_by_severity_total{severity="critical"}[1h]) > 5
for: 0m
labels:
severity: critical
annotations:
summary: "High backup error rate on {{ $labels.hostname }}"
- alert: RestoreTestFailed
expr: restic_restore_test_status == 0
for: 0m
labels:
severity: high
annotations:
summary: "Restore test failed for repository {{ $labels.repository }}"
Notification Channels¶
ntfy.sh¶
Simple push notifications:
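This is the same ntfy block configured in Step 5:
modules.backup.monitoring.ntfy = {
  enable = true;
  topic = "https://ntfy.sh/my-homelab-backups";  # any ntfy topic URL you subscribe to
};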
Receives notifications:
- On backup failure
- On verification failure
- On restore test failure
Healthchecks.io¶
Dead man's switch monitoring:
modules.backup.monitoring.healthchecks = {
enable = true;
uuidFile = config.sops.secrets.healthchecks-uuid.path;
};
Pings on:
- Backup success
- Backup failure
- Each check-in
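For reference, a Healthchecks.io check-in is just an HTTP request; something along these lines is presumably what the module sends after each run (the secret path is a placeholder for wherever your uuidFile lands):
# Read the check UUID from the SOPS-managed file
UUID=$(cat /run/secrets/healthchecks-uuid)
# Report success
curl -fsS -m 10 --retry 3 "https://hc-ping.com/$UUID"
# Report failure
curl -fsS -m 10 --retry 3 "https://hc-ping.com/$UUID/fail"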
Troubleshooting¶
Common Issues¶
Backup Job Fails with "Permission Denied"¶
Symptoms: Backup fails with permission errors
Solutions:
# Check backup user/group
id restic-backup
# Verify file permissions
ls -la /path/to/backup
# Check SELinux/AppArmor if enabled
getenforce # or apparmor_status
# Fix: Ensure backup user has read access
sudo chown -R restic-backup:restic-backup /path/to/backup
"Repository Not Found" Error¶
Symptoms: Backup fails with repository initialization error
Solutions:
# Check repository URL is accessible
restic -r <repo-url> snapshots
# Verify password file exists
cat /path/to/password/file
# Initialize repository manually
restic -r <repo-url> init
# Check environment file for cloud repos
cat /path/to/env/file
ZFS Snapshot Mount Fails¶
Symptoms: Backup fails with ZFS-related errors
Solutions:
# Check ZFS pool status
zpool status
# List existing snapshots
zfs list -t snapshot
# Clean up stale snapshots
zfs list -t snapshot | grep backup- | awk '{print $1}' | xargs -n1 zfs destroy
# Verify mount point
ls -la /mnt/backup-snapshot
High Memory Usage¶
Symptoms: Backup process consuming excessive memory
Solutions:
# Reduce resources for the job
modules.backup.restic.jobs.problematic-job = {
resources = {
memory = "512m"; # Reduce from higher value
memoryReservation = "256m";
cpus = "1.0";
};
};
# Reduce read concurrency
modules.backup.restic.globalSettings.readConcurrency = 1;
# Clear Restic cache
# rm -rf /var/cache/restic/*
Slow Backup Performance¶
Symptoms: Backups taking too long to complete
Solutions:
# Enable compression
modules.backup.restic.globalSettings.compression = "auto";
# Increase read concurrency
modules.backup.restic.globalSettings.readConcurrency = 4;
# Adjust I/O scheduling
modules.backup.performance.ioScheduling = {
enable = true;
ioClass = "best-effort"; # Instead of "idle"
priority = 4; # Higher priority
};
# Increase cache size
modules.backup.performance.cacheSizeLimit = "2G";
Repository Corruption¶
Symptoms: Repository check fails with errors
Solutions:
# Run repository check
restic -r <repo-url> check
# Attempt repair with rebuild-index
restic -r <repo-url> rebuild-index
# Full data check (slow)
restic -r <repo-url> check --read-data
# If irreparable, restore from secondary repository
Debug Mode¶
Enable verbose logging for troubleshooting:
# Run backup manually with verbose output
sudo -u restic-backup restic -r <repo-url> backup /path --verbose
# Check systemd service logs with all details
journalctl -u restic-backups-system -b --no-pager
# Enable debug logging in Restic
export RESTIC_DEBUG=1
Manual Recovery¶
Restore Single File¶
# List snapshots
restic -r <repo-url> snapshots
# Find file
restic -r <repo-url> find /path/to/file
# Restore from specific snapshot
restic -r <repo-url> restore <snapshot-id> \
--target /tmp/restore \
--include /path/to/file
Full System Restore¶
# 1. Boot from NixOS installation media
# 2. Configure network
systemctl start NetworkManager
nmtui
# 3. Mount target filesystem
mount /dev/sdX /mnt
# 4. Install Restic
nix-shell -p restic
# 5. Restore system
restic -r <repo-url> restore latest --target /mnt
# 6. Install bootloader
nixos-install --root /mnt
# 7. Reboot
reboot
Best Practices¶
NFS Mount Configuration¶
Backup Storage (Occasional Access)¶
For backup destinations that need occasional access, use systemd automount with idle timeout:
fileSystems."/mnt/nas-backup" = {
device = "nas-1.holthome.net:/mnt/backup/forge/restic";
fsType = "nfs";
options = [
"nfsvers=4.2"
"rw"
"noatime"
"x-systemd.automount" # Mount on first access
"x-systemd.idle-timeout=600" # Unmount after 10 min idle
"x-systemd.mount-timeout=30s" # Fail fast if NAS down
];
};
Benefits:
- Won't block boot if NAS is down
- Automatically unmounts after idle period (security + resource efficiency)
- Auto-mounts on first access (transparent to services)
- Available system-wide for any service that needs it
Media Library (Continuous Access)¶
For shared media libraries accessed by multiple services (Plex, Sonarr, Radarr, SABnzbd, etc.), use a single shared mount point:
fileSystems."/mnt/media" = {
device = "nas-1.holthome.net:/mnt/media";
fsType = "nfs";
options = [
"nfsvers=4.2"
"rw"
"noatime"
"x-systemd.automount" # Optional but recommended for resilience
"noauto" # Required with automount
];
};
Best Practices for Shared Media:
- Single mount point: All services reference the same path (e.g., /mnt/media)
- Shared group: Create a media group and add all service users to it
- Consistent permissions: Ensure NFS export and local permissions align (UID/GID mapping)
- Do not create separate mounts per service: This adds complexity and can cause file sync issues
Example user/group configuration:
users.groups.media = { gid = 1500; }; # Match GID across NAS and clients
users.users = {
plex.extraGroups = [ "media" ];
sonarr.extraGroups = [ "media" ];
radarr.extraGroups = [ "media" ];
sabnzbd.extraGroups = [ "media" ];
};
Security¶
- Encrypt Repository Passwords: Always use SOPS or similar for password management
- Rotate Credentials: Periodically rotate repository passwords
- Least Privilege: Run backups with minimal required permissions
- Audit Logs: Enable audit logging for compliance
- Network Isolation: Restrict backup process network access
Reliability¶
- Test Restores: Enable automated restore testing
- Multiple Repositories: Use at least two repositories (local + off-site)
- Monitor Actively: Configure alerts for failures
- Verify Regularly: Enable weekly repository verification
- Document Procedures: Keep runbooks updated
Performance¶
- Schedule Wisely: Run backups during low-activity periods
- Resource Limits: Prevent backup impact on production services
- Compression: Enable auto compression for better efficiency
- Prune Regularly: Configure retention policies appropriately
- Use ZFS Snapshots: Ensure consistent backups
Data Management¶
- Exclude Patterns: Don't backup cache, logs, temp files
- Tag Appropriately: Use tags for easy snapshot identification
- Retention Policy: Balance storage costs with recovery needs
- Pre/Post Scripts: Handle application-specific requirements
- Incremental Backups: Leverage Restic's deduplication
Operational¶
- Monitor Trends: Track backup size and duration over time
- Capacity Planning: Monitor repository storage growth
- Error Analysis: Review error categories weekly
- Update Documentation: Keep generated docs current
- Test Disaster Recovery: Practice full system restores
Example Configurations¶
Complete Production Host¶
{ config, lib, pkgs, ... }:
{
# Enable backup system
modules.backup = {
enable = true;
# ZFS integration
zfs = {
enable = true;
pool = "rpool";
datasets = [ "" "home" "var/lib" ];
retention = {
daily = 7;
weekly = 4;
monthly = 3;
};
};
# Restic configuration
restic = {
enable = true;
globalSettings = {
compression = "auto";
readConcurrency = 2;
retention = {
daily = 14;
weekly = 8;
monthly = 6;
yearly = 2;
};
};
# Multiple repositories
repositories = {
primary = {
url = "/mnt/nas/backups/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-primary-password.path;
primary = true;
};
b2-cloud = {
url = "b2:homelab-backups:/${config.networking.hostName}";
passwordFile = config.sops.secrets.restic-b2-password.path;
environmentFile = config.sops.secrets.restic-b2-env.path;
primary = false;
};
};
# Backup jobs
jobs = {
system = {
enable = true;
paths = [ "/etc/nixos" "/etc/systemd" ];
repository = "primary";
tags = [ "system" "configuration" ];
};
home = {
enable = true;
paths = [ "/home" ];
repository = "primary";
tags = [ "user-data" ];
excludePatterns = [
"*/.cache"
"*/Downloads"
"*/node_modules"
];
};
};
};
# Comprehensive monitoring
monitoring = {
enable = true;
ntfy = {
enable = true;
topic = "https://ntfy.sh/homelab-backups";
};
healthchecks = {
enable = true;
uuidFile = config.sops.secrets.healthchecks-uuid.path;
};
onFailure = {
enable = true;
};
prometheus = {
enable = true;
};
errorAnalysis = {
enable = true;
};
};
# Automated verification
verification = {
enable = true;
schedule = "weekly";
checkDataSubset = "5%";
};
# Restore testing
restoreTesting = {
enable = true;
schedule = "monthly";
sampleFiles = 10;
};
# Validation
validation = {
enable = true;
preFlightChecks = {
enable = true;
minFreeSpace = "10G";
};
repositoryHealth = {
enable = true;
maxAge = "48h";
minBackups = 3;
};
};
# Performance
performance = {
ioScheduling = {
enable = true;
ioClass = "idle";
priority = 7;
};
};
# Security
security = {
enable = true;
restrictNetwork = true;
auditLogging = true;
};
# Documentation
documentation = {
enable = true;
includeMetrics = true;
};
schedule = "02:30";
};
# Enable service backups
modules.services.backup-services = {
enable = true;
# Add service-specific backups as needed
system.enable = true;
};
}
Minimal Laptop Configuration¶
{ config, lib, pkgs, ... }:
{
modules.backup = {
enable = true;
restic = {
enable = true;
repositories.laptop-backup = {
url = "b2:my-laptop-backups:/";
passwordFile = config.sops.secrets.restic-password.path;
environmentFile = config.sops.secrets.restic-b2-env.path;
primary = true;
};
jobs.laptop-data = {
enable = true;
paths = [
"/home/user/Documents"
"/home/user/Pictures"
];
repository = "laptop-backup";
excludePatterns = [
"*/.cache"
"*/Downloads"
];
resources = {
memory = "256m";
cpus = "1.0";
};
};
};
monitoring = {
enable = true;
ntfy = {
enable = true;
topic = "https://ntfy.sh/my-laptop";
};
};
schedule = "22:00"; # Evening backup
};
}
Advanced Troubleshooting¶
Systemd Sandboxing with SSH Keys and Symlinks¶
When using systemd services that need to access SSH keys managed via SOPS (or other secret management that uses symlinks), you may encounter issues with systemd sandboxing.
Symptom: "Identity file not accessible"¶
Even though:
- The file exists
- Manual sudo -u user test -r /var/lib/user/.ssh/id_ed25519 succeeds
- File permissions are correct (600)
Root Cause¶
When PrivateMounts=true is set (common in hardened systemd services), the service runs in a private mount namespace. Symlinks that cross mount namespace boundaries won't resolve correctly unless both the symlink source AND target are explicitly mapped into the service's namespace.
Example problematic setup:
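The sketch below is hypothetical; the service name and secret paths are placeholders mirroring the explanation that follows:
# SOPS leaves a symlink in the user's home pointing at the real secret:
#   /var/lib/user/.ssh/id_ed25519 -> /run/secrets/service/ssh-key
systemd.services.my-service.serviceConfig = {
  PrivateMounts = true;
  ReadOnlyPaths = [ "/var/lib/user/.ssh" ];  # symlink is visible, but its target is not mapped in
};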
With only ReadOnlyPaths=/var/lib/user/.ssh, the symlink exists but the target /run/secrets/service/ssh-key is not visible in the private mount namespace.
Solution: Use BindReadOnlyPaths¶
You must use BindReadOnlyPaths (not ReadOnlyPaths) for both ends of the symlink:
systemd.services.my-service.serviceConfig = {
BindReadOnlyPaths = lib.mkForce [
"/nix/store"
"/etc"
"/bin/sh"
# Both symlink source and target must be explicitly bound
"/var/lib/user/.ssh" # Symlink location
"/run/secrets/service" # Symlink target directory
];
};
Why BindReadOnlyPaths instead of ReadOnlyPaths?
ReadOnlyPathsonly makes paths read-only within the existing namespaceBindReadOnlyPathsexplicitly binds external paths into the private mount namespace- With
PrivateMounts=true, symlink resolution requires both ends to be bound
Common Pitfall: BindReadOnlyPaths=/run¶
Many NixOS modules set BindReadOnlyPaths=/run which makes the entire /run tree read-only. This prevents access to specific subdirectories even if you add ReadOnlyPaths=/run/secrets/....
Solution: Override the entire BindReadOnlyPaths list:
systemd.services.my-service.serviceConfig = {
# Override to remove /run from the list
BindReadOnlyPaths = lib.mkForce [
"/nix/store"
"/etc"
"/bin/sh"
# Now add specific paths under /run as needed
"/run/secrets/service"
];
};
ZFS Replication SSH Issues¶
Symptom: SSH Connection Hangs¶
When testing SSH as the replication user, the connection hangs with a blank screen. Syncoid reports:
CRITICAL ERROR: ssh connection echo test failed with exit code 255
cannot receive: failed to read from stream
Common Causes¶
1. User Shell Set to nologin¶
# Check current shell
getent passwd zfs-replication
# Output: zfs-replication:x:998:998::/var/lib/zfs-replication:/usr/sbin/nologin
Solution: The user needs a working shell:
# On Ubuntu/Debian
sudo usermod -s /usr/bin/bash zfs-replication
# In NixOS configuration
users.users.zfs-replication = {
shell = "/run/current-system/sw/bin/bash"; # Not nologin!
};
Why: Tools like syncoid need to execute multiple commands (zfs list, zfs receive, etc.). With nologin, SSH connections hang waiting for input.
2. Forced Command in authorized_keys¶
# Check for forced command
sudo cat /var/lib/zfs-replication/.ssh/authorized_keys
# Output: command="zfs recv -F pool/dataset" ssh-ed25519 AAAAC3...
Problem: The forced command restricts SSH to only execute that single command. Syncoid needs to:
- Run echo tests to verify connectivity
- List datasets with zfs list
- Check snapshot existence
- Execute zfs receive with dynamic options
Solution: Remove the forced command, keep other restrictions:
# Before (BAD)
command="zfs recv -F pool/dataset",no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAAC3...
# After (GOOD)
no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAAC3...
In NixOS:
users.users.zfs-replication = {
openssh.authorizedKeys.keys = [
# DO NOT add a forced command
"no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAAC3... user@host"
];
};
Security Note: Security is maintained through:
- SSH key-based authentication only
- ZFS delegated permissions (only specific operations allowed)
- SSH restrictions (no-agent-forwarding, no-X11-forwarding, no-pty)
- User has no sudo access
Testing SSH Connectivity¶
# From the source host, as the replication user
sudo -u zfs-replication ssh zfs-replication@destination 'echo OK && hostname'
# Should output:
# OK
# destination-hostname
# If it hangs, check:
# 1. Shell (not nologin)
# 2. authorized_keys (no forced command)
Conclusion¶
The NixOS backup system provides enterprise-grade backup capabilities with:
- Encrypted, deduplicated backups via Restic
- ZFS snapshot integration for consistency
- Pre-configured service profiles
- Comprehensive monitoring and alerting
- Automated verification and testing
- Self-documenting system
For additional help:
- Review generated documentation in /var/lib/backup/docs/
- Check structured logs in /var/log/backup/
- Consult Prometheus metrics for system health
- Review backup module source: /modules/nixos/backup.nix
- See advanced troubleshooting above for systemd sandboxing and SSH issues