Backup Metrics Reference

Overview

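The backup module writes its metrics in the Prometheus text format to the node-exporter textfile collector directory. On every scrape, node-exporter reads the *.prom files in that directory and includes their contents in its own metrics endpoint, so the backup metrics need no separate exporter.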

Metrics Endpoint

URL: http://forge.holthome.net:9100/metrics

All backup-related metrics appear in the node-exporter response alongside the standard system metrics.

Available Metrics

Backup Job Metrics

Generated per backup job after each backup run.

File: /var/lib/node_exporter/textfile_collector/restic_backup_<job_name>.prom

# Duration of backup operation
restic_backup_duration_seconds{job="system",repository="/mnt/nas-backup",hostname="forge"}

# Last successful backup timestamp (Unix timestamp)
restic_backup_last_success_timestamp{job="system",repository="/mnt/nas-backup",hostname="forge"}

# Backup job status (1=success, 0=failure)
restic_backup_status{job="system",repository="/mnt/nas-backup",hostname="forge"}
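As a minimal sketch of how a job could produce this file, assuming a bash wrapper around restic (the function name write_backup_metrics and the TEXTFILE_DIR override are hypothetical, not the module's real interface): the file is written under a temporary name that does not end in .prom, which the textfile collector skips, and then renamed into place so node-exporter never reads a half-written file. On a failed run the sketch carries the previous last-success timestamp forward, so the time-since-last-backup queries still have a series to evaluate.

```shell
# Sketch only: write_backup_metrics and the TEXTFILE_DIR override are
# hypothetical, not the backup module's actual interface.
# Usage: write_backup_metrics <job> <repository> <status 1|0> <duration_seconds>
write_backup_metrics() {
  local job="$1" repo="$2" status="$3" duration="$4"
  local dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
  local file="$dir/restic_backup_${job}.prom"
  local labels="job=\"$job\",repository=\"$repo\",hostname=\"$(uname -n)\""
  local last_success="" tmp

  if [ "$status" = "1" ]; then
    last_success="$(date +%s)"
  elif [ -f "$file" ]; then
    # Carry the previous success time forward so a failed run keeps the series
    last_success="$(awk '/^restic_backup_last_success_timestamp/ {print $NF}' "$file")"
  fi

  # Temporary name without a .prom suffix: the collector ignores it until the
  # rename below, so node-exporter never reads a partially written file
  tmp="$(mktemp "$dir/.restic_backup_${job}.XXXXXX")"
  {
    echo "# HELP restic_backup_duration_seconds Duration of backup operation"
    echo "# TYPE restic_backup_duration_seconds gauge"
    echo "restic_backup_duration_seconds{$labels} $duration"
    echo "# HELP restic_backup_status Backup job status (1=success, 0=failure)"
    echo "# TYPE restic_backup_status gauge"
    echo "restic_backup_status{$labels} $status"
    if [ -n "$last_success" ]; then
      echo "# HELP restic_backup_last_success_timestamp Last successful backup (Unix timestamp)"
      echo "# TYPE restic_backup_last_success_timestamp gauge"
      echo "restic_backup_last_success_timestamp{$labels} $last_success"
    fi
  } > "$tmp"
  chmod 0644 "$tmp"   # mktemp creates files as 0600; node-exporter must read it
  mv "$tmp" "$file"
}
```

A wrapper might call it as write_backup_metrics system /mnt/nas-backup "$status" "$duration" once restic exits.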

Repository Verification Metrics

Generated after repository integrity checks.

File: /var/lib/node_exporter/textfile_collector/restic_verification_<repository_name>.prom

# Duration of verification operation
restic_verification_duration_seconds{repository="nas-primary",hostname="forge"}

# Last verification run timestamp
restic_verification_last_run_timestamp{repository="nas-primary",hostname="forge"}

# Verification status (1=success, 0=failure)
restic_verification_status{repository="nas-primary",hostname="forge"}
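Verification and restore-test files share one shape (duration, last-run timestamp, status), so a single helper could time any command and record its outcome. The sketch below is an assumption about how such a wrapper might look; record_run_metrics and TEXTFILE_DIR are hypothetical names, and the wrapped command's exit code is mapped to the 1/0 status.

```shell
# Sketch only: record_run_metrics and TEXTFILE_DIR are hypothetical names.
# Usage: record_run_metrics <metric_prefix> <repository_name> -- <command...>
record_run_metrics() {
  local prefix="$1" repo="$2"
  shift 2
  [ "${1:-}" = "--" ] && shift
  local dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
  local file="$dir/${prefix}_${repo}.prom"
  local labels="repository=\"$repo\",hostname=\"$(uname -n)\""
  local start end status rc tmp

  # Run the wrapped command; exit code 0 maps to status 1 (success)
  start="$(date +%s)"
  if "$@"; then status=1; rc=0; else rc=$?; status=0; fi
  end="$(date +%s)"

  # Write atomically, as the textfile collector expects
  tmp="$(mktemp "$dir/.${prefix}_${repo}.XXXXXX")"
  {
    echo "# TYPE ${prefix}_duration_seconds gauge"
    echo "${prefix}_duration_seconds{$labels} $((end - start))"
    echo "# TYPE ${prefix}_last_run_timestamp gauge"
    echo "${prefix}_last_run_timestamp{$labels} $end"
    echo "# TYPE ${prefix}_status gauge"
    echo "${prefix}_status{$labels} $status"
  } > "$tmp"
  chmod 0644 "$tmp"
  mv "$tmp" "$file"
  return "$rc"   # propagate the wrapped command's exit code
}
```

For example, record_run_metrics restic_verification nas-primary -- restic -r /mnt/nas-backup check would record a restic check run, assuming nas-primary refers to the repository at /mnt/nas-backup and the repository password is supplied through the environment (for instance RESTIC_PASSWORD_FILE). Passing restic_restore_test as the prefix produces the restore test file in the same way.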

Restore Test Metrics

Generated after automated restore testing.

File: /var/lib/node_exporter/textfile_collector/restic_restore_test_<repository_name>.prom

# Duration of restore test
restic_restore_test_duration_seconds{repository="nas-primary",hostname="forge"}

# Last restore test timestamp
restic_restore_test_last_run_timestamp{repository="nas-primary",hostname="forge"}

# Restore test status (1=success, 0=failure)
restic_restore_test_status{repository="nas-primary",hostname="forge"}

Backup Failure Metrics

Generated when a backup fails.

File: /var/lib/node_exporter/textfile_collector/restic_backup_<job_name>_failure.prom

# Backup failure timestamp
restic_backup_failure_timestamp{job="system",repository="/mnt/nas-backup",hostname="forge"}

# Number of consecutive failures
restic_backup_failure_count{job="system",repository="/mnt/nas-backup",hostname="forge"}
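A consecutive-failure counter needs the previous value, so a writer has to read the existing file before replacing it. A hedged sketch, assuming the count is reset by deleting the file after a successful run (record_backup_failure and clear_backup_failure are hypothetical helpers, and TEXTFILE_DIR is a hypothetical override):

```shell
# Sketch only: hypothetical helpers; resetting the count on success is an assumption.
record_backup_failure() {
  local job="$1" repo="$2"
  local dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
  local file="$dir/restic_backup_${job}_failure.prom"
  local labels="job=\"$job\",repository=\"$repo\",hostname=\"$(uname -n)\""
  local count=0 tmp

  # Read the previous consecutive-failure count, if any, and increment it
  if [ -f "$file" ]; then
    count="$(awk '/^restic_backup_failure_count/ {print $NF}' "$file")"
  fi
  count=$(( ${count:-0} + 1 ))

  tmp="$(mktemp "$dir/.restic_backup_${job}_failure.XXXXXX")"
  {
    echo "# TYPE restic_backup_failure_timestamp gauge"
    echo "restic_backup_failure_timestamp{$labels} $(date +%s)"
    echo "# TYPE restic_backup_failure_count gauge"
    echo "restic_backup_failure_count{$labels} $count"
  } > "$tmp"
  chmod 0644 "$tmp"
  mv "$tmp" "$file"
}

# Call after a successful run so the next failure starts counting from 1
clear_backup_failure() {
  local dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
  rm -f "$dir/restic_backup_${1}_failure.prom"
}
```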

Error Analysis Metrics

Generated by the backup error analysis service.

File: /var/lib/node_exporter/textfile_collector/backup_error_analysis.prom

# Last error analysis run timestamp
backup_error_analysis_last_run_timestamp{hostname="forge"}

# Number of errors found in analysis
backup_error_analysis_errors_found{hostname="forge"}

Documentation Generation Metrics

Generated after backup documentation is created.

File: /var/lib/node_exporter/textfile_collector/backup_documentation.prom

# Last documentation generation timestamp
backup_documentation_last_generated_timestamp{hostname="forge"}

Metric Labels

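Every metric carries a hostname label; job and repository appear only on the metrics they apply to:

hostname: The hostname of the system (e.g., "forge")
job: The backup job name (backup job and backup failure metrics)
repository: The repository name (e.g., "nas-primary") or location (e.g., "/mnt/nas-backup"), depending on the metric

Prometheus adds its own job target label to every scraped series. Unless the node-exporter scrape configuration sets honor_labels: true, the backup job label is therefore stored as exported_job, and a query or alert template that reads job gets the scrape job's name instead.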

Useful Queries

Time Since Last Backup

# Seconds since last successful backup
time() - restic_backup_last_success_timestamp{hostname="forge"}

# Alert if no backup in 24 hours
(time() - restic_backup_last_success_timestamp{hostname="forge"}) > 86400
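To check the same figure on the host without Prometheus, for instance while debugging a stale-backup alert, the timestamp can be read straight from the job's .prom file. A small sketch (backup_age_seconds is a hypothetical helper; it assumes the value is written as a plain decimal number rather than in exponent notation):

```shell
# Sketch only: backup_age_seconds is a hypothetical helper. It prints the
# seconds since restic_backup_last_success_timestamp in the given .prom file.
backup_age_seconds() {
  local file="$1" ts
  ts="$(awk '/^restic_backup_last_success_timestamp/ {print $NF; exit}' "$file")"
  if [ -z "$ts" ]; then
    echo "no restic_backup_last_success_timestamp in $file" >&2
    return 1
  fi
  ts="${ts%%.*}"   # drop any fractional part; shell arithmetic is integer-only
  echo $(( $(date +%s) - ts ))
}
```

For example: age=$(backup_age_seconds /var/lib/node_exporter/textfile_collector/restic_backup_system.prom) && [ "$age" -gt 86400 ] && echo stale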

Backup Failure Detection

# Show current backup status (0 = failed)
restic_backup_status{hostname="forge"}

# Consecutive failure count per job
restic_backup_failure_count{hostname="forge"}

Backup Duration Trends

# Current backup duration
restic_backup_duration_seconds{hostname="forge"}

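# Trend in backup duration (growing backup size?): per-second slope over the last week.
# The duration is a gauge, so use deriv(); rate() is only meaningful for counters.
deriv(restic_backup_duration_seconds{hostname="forge"}[7d])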

Repository Health

# Verification status
restic_verification_status{hostname="forge"}

# Time since last verification
time() - restic_verification_last_run_timestamp{hostname="forge"}

Example Prometheus Alert Rules

groups:
  - name: backup_alerts
    rules:
      - alert: BackupFailed
        expr: restic_backup_status{hostname="forge"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed on {{ $labels.hostname }}"
          description: "Backup job {{ $labels.job }} failed on {{ $labels.hostname }}"

      - alert: BackupStale
        expr: (time() - restic_backup_last_success_timestamp{hostname="forge"}) > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful backup in 24 hours on {{ $labels.hostname }}"

      - alert: BackupDurationIncreasing
        expr: |
          (
            restic_backup_duration_seconds{hostname="forge"}
            /
            restic_backup_duration_seconds{hostname="forge"} offset 1w
          ) > 1.5
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Backup duration up more than 50% versus one week earlier on {{ $labels.hostname }}"

      - alert: VerificationFailed
        expr: restic_verification_status{hostname="forge"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Repository verification failed on {{ $labels.hostname }}"

Testing Metrics

After deploying, verify metrics are available:

# Check if metrics files exist
ls -la /var/lib/node_exporter/textfile_collector/

# View a metrics file
cat /var/lib/node_exporter/textfile_collector/backup_documentation.prom

# Query node-exporter endpoint
curl -s http://localhost:9100/metrics | grep -E 'restic_|backup_'

# Query from Prometheus (if configured)
# Navigate to: http://prometheus:9090/graph
# Query: restic_backup_last_success_timestamp{hostname="forge"}
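The textfile collector also exports node_textfile_scrape_error, which node-exporter sets to 1 when it could not open or read one of the files in the directory. Checking it helps tell a broken metrics file apart from a backup that simply never ran. A sketch of a check over the endpoint output (check_textfile_errors is a hypothetical helper name):

```shell
# Sketch only: check_textfile_errors is a hypothetical helper. It reads
# node-exporter output on stdin; exit 1 = collector error, 2 = metric absent.
check_textfile_errors() {
  awk '
    /^node_textfile_scrape_error/ { seen = 1; if ($NF != 0) bad = 1 }
    END {
      if (!seen) { print "node_textfile_scrape_error not found" > "/dev/stderr"; exit 2 }
      if (bad)   { print "textfile collector reported an error" > "/dev/stderr"; exit 1 }
    }'
}
```

Usage: curl -s http://localhost:9100/metrics | check_textfile_errors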

Metric File Ownership

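All metric files are owned by the restic-backup user and created in /var/lib/node_exporter/textfile_collector/. The node-exporter user only reads them, so each file must be readable by that user.

The common monitoring module creates this directory with the following tmpfiles rule. On its own, the rule grants write access only to node-exporter (and root); if the backup services write their files as restic-backup, they need write access to the directory by some other means, such as a shared group and a group-writable mode: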

systemd.tmpfiles.rules = [
  "d /var/lib/node_exporter/textfile_collector 0755 node-exporter node-exporter -"
];
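A quick way to confirm these permission requirements on a running host is to check that the directory exists and that every .prom file carries the world-read bit, which node-exporter relies on when it is neither the file's owner nor in its group (a conservative check, since group read would also suffice if node-exporter shares the group). A sketch, where check_textfile_permissions is a hypothetical helper and stat -c and find -perm with a symbolic mode assume GNU coreutils and findutils:

```shell
# Sketch only: check_textfile_permissions is a hypothetical helper.
# On success it prints owner:group, mode and path of the directory.
check_textfile_permissions() {
  local dir="${1:-/var/lib/node_exporter/textfile_collector}" bad
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  # List .prom files lacking the "others" read bit
  bad="$(find "$dir" -maxdepth 1 -type f -name '*.prom' ! -perm -o=r)"
  if [ -n "$bad" ]; then
    echo "not world-readable:" >&2
    echo "$bad" >&2
    return 1
  fi
  stat -c '%U:%G %a %n' "$dir"
}
```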