Backup Metrics Reference
Overview
The backup module exports metrics in Prometheus text format to the node-exporter textfile collector directory. The textfile collector reads every *.prom file in that directory on each scrape, so these metrics are exposed automatically at the node-exporter metrics endpoint.
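The node-exporter documentation recommends writing textfile-collector files atomically, so a scrape never sees a half-written file: write to a temporary file in the same directory, then rename it into place. Below is a minimal sketch of that pattern. It assumes bash and GNU coreutils, and `write_prom_file` is a hypothetical helper name, not something the backup module is known to provide.

```shell
#!/usr/bin/env bash
# write_prom_file DIR NAME
#   Reads Prometheus text-format metrics on stdin and installs them as DIR/NAME.prom.
#   Hypothetical helper: it illustrates the write-then-rename pattern only.
write_prom_file() {
  local dir="$1" name="$2" tmp
  # Create the temp file inside DIR so the final mv is a same-filesystem rename,
  # which is atomic. Its name does not end in ".prom", so the textfile collector
  # ignores it while it is still being written.
  tmp="$(mktemp "${dir}/.${name}.prom.XXXXXX")" || return 1
  if ! cat > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # mktemp creates mode 0600 files; make the result readable by node-exporter.
  chmod 0644 "$tmp"
  mv -f "$tmp" "${dir}/${name}.prom"
}

# Example:
#   printf 'backup_documentation_last_generated_timestamp{hostname="forge"} %s\n' "$(date +%s)" \
#     | write_prom_file /var/lib/node_exporter/textfile_collector backup_documentation
```

The temporary name starts with a dot and ends in random characters. That matters because the collector only picks up files matching *.prom.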
Metrics Endpoint
URL: http://forge.holthome.net:9100/metrics
All backup-related metrics are included in the node-exporter response alongside the standard system metrics.
Available Metrics
Backup Job Metrics
Generated per backup job after each backup run.
File: /var/lib/node_exporter/textfile_collector/restic_backup_<job_name>.prom
# Duration of backup operation
restic_backup_duration_seconds{job="system",repository="/mnt/nas-backup",hostname="forge"}
# Last successful backup timestamp (Unix timestamp)
restic_backup_last_success_timestamp{job="system",repository="/mnt/nas-backup",hostname="forge"}
# Backup job status (1=success, 0=failure)
restic_backup_status{job="system",repository="/mnt/nas-backup",hostname="forge"}
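A job wrapper could produce these three metrics in three steps:
- time the backup command;
- derive the status from its exit code;
- carry the previous last-success timestamp forward on failure, so the time-since-last-backup queries keep working.

This is a sketch under stated assumptions: bash with GNU coreutils and awk, a hypothetical `run_backup_with_metrics` helper, and label values that need no escaping. The module's actual implementation may differ.

```shell
#!/usr/bin/env bash
# run_backup_with_metrics DIR JOB REPOSITORY HOST CMD [ARGS...]
#   Runs CMD, then writes DIR/restic_backup_<JOB>.prom and returns CMD's exit code.
#   Hypothetical helper; label values must not contain ", \ or newlines.
run_backup_with_metrics() {
  local dir="$1" job="$2" repo="$3" host="$4"
  shift 4
  local file="${dir}/restic_backup_${job}.prom"
  local labels="job=\"${job}\",repository=\"${repo}\",hostname=\"${host}\""
  local start end rc status last_success tmp

  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)

  if [ "$rc" -eq 0 ]; then
    status=1
    last_success=$end
  else
    status=0
    # Keep the previous success timestamp (if any) so it does not vanish on failure.
    last_success=$(awk '/^restic_backup_last_success_timestamp[{ ]/ { print $NF }' "$file" 2>/dev/null)
  fi

  # Write to a temp name (not ending in .prom), then rename atomically.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  {
    printf '# HELP restic_backup_duration_seconds Duration of backup operation\n'
    printf '# TYPE restic_backup_duration_seconds gauge\n'
    printf 'restic_backup_duration_seconds{%s} %d\n' "$labels" "$((end - start))"
    if [ -n "$last_success" ]; then
      printf '# HELP restic_backup_last_success_timestamp Last successful backup (Unix time)\n'
      printf '# TYPE restic_backup_last_success_timestamp gauge\n'
      printf 'restic_backup_last_success_timestamp{%s} %s\n' "$labels" "$last_success"
    fi
    printf '# HELP restic_backup_status Backup job status (1=success, 0=failure)\n'
    printf '# TYPE restic_backup_status gauge\n'
    printf 'restic_backup_status{%s} %d\n' "$labels" "$status"
  } > "$tmp"
  chmod 0644 "$tmp"
  mv -f "$tmp" "$file"
  return "$rc"
}

# Example:
#   run_backup_with_metrics /var/lib/node_exporter/textfile_collector system /mnt/nas-backup forge \
#     restic -r /mnt/nas-backup backup /etc
```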
Repository Verification Metrics
Generated after repository integrity checks.
File: /var/lib/node_exporter/textfile_collector/restic_verification_<repository_name>.prom
# Duration of verification operation
restic_verification_duration_seconds{repository="nas-primary",hostname="forge"}
# Last verification run timestamp
restic_verification_last_run_timestamp{repository="nas-primary",hostname="forge"}
# Verification status (1=success, 0=failure)
restic_verification_status{repository="nas-primary",hostname="forge"}
Restore Test Metrics
Generated after automated restore testing.
File: /var/lib/node_exporter/textfile_collector/restic_restore_test_<repository_name>.prom
# Duration of restore test
restic_restore_test_duration_seconds{repository="nas-primary",hostname="forge"}
# Last restore test timestamp
restic_restore_test_last_run_timestamp{repository="nas-primary",hostname="forge"}
# Restore test status (1=success, 0=failure)
restic_restore_test_status{repository="nas-primary",hostname="forge"}
Backup Failure Metrics
Generated when a backup fails.
File: /var/lib/node_exporter/textfile_collector/restic_backup_<job_name>_failure.prom
# Backup failure timestamp
restic_backup_failure_timestamp{job="system",repository="/mnt/nas-backup",hostname="forge"}
# Number of consecutive failures
restic_backup_failure_count{job="system",repository="/mnt/nas-backup",hostname="forge"}
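Tracking consecutive failures means reading the previous count back before writing the new one. Below is a sketch that assumes bash and awk and uses two hypothetical helpers:
- record_backup_failure bumps the count and stamps the failure time;
- clear_backup_failures deletes the file after a successful run, so the next failure starts again at 1.

```shell
#!/usr/bin/env bash
# record_backup_failure DIR JOB REPOSITORY HOST
#   Writes DIR/restic_backup_<JOB>_failure.prom with the current time and a
#   consecutive-failure count one higher than the count already in that file.
# clear_backup_failures DIR JOB
#   Deletes that file after a successful run, so the next failure counts from 1.
# Both helpers are hypothetical; label values must not contain ", \ or newlines.
record_backup_failure() {
  local dir="$1" job="$2" repo="$3" host="$4"
  local file="${dir}/restic_backup_${job}_failure.prom"
  local labels="job=\"${job}\",repository=\"${repo}\",hostname=\"${host}\""
  local prev count tmp
  # Previous count, or empty if the file does not exist yet.
  prev=$(awk '/^restic_backup_failure_count[{ ]/ { print $NF }' "$file" 2>/dev/null)
  count=$(( ${prev:-0} + 1 ))
  # Write to a temp name (not ending in .prom), then rename atomically.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  {
    printf 'restic_backup_failure_timestamp{%s} %d\n' "$labels" "$(date +%s)"
    printf 'restic_backup_failure_count{%s} %d\n' "$labels" "$count"
  } > "$tmp"
  chmod 0644 "$tmp"
  mv -f "$tmp" "$file"
}

clear_backup_failures() {
  local dir="$1" job="$2"
  rm -f "${dir}/restic_backup_${job}_failure.prom"
}
```

Deleting the file makes restic_backup_failure_count disappear after a success. Writing a count of 0 instead would keep the series present for dashboards.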
Error Analysis Metrics
Generated by the backup error analysis service.
File: /var/lib/node_exporter/textfile_collector/backup_error_analysis.prom
# Last error analysis run timestamp
backup_error_analysis_last_run_timestamp{hostname="forge"}
# Number of errors found in analysis
backup_error_analysis_errors_found{hostname="forge"}
Documentation Generation Metrics
Generated after backup documentation is created.
File: /var/lib/node_exporter/textfile_collector/backup_documentation.prom
# Last documentation generation timestamp
backup_documentation_last_generated_timestamp{hostname="forge"}
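Metric Labels
The metrics use these labels:
- hostname: the hostname of the system (e.g., "forge"); present on every metric
- job: the backup job name; present on job-specific metrics
- repository: the repository path or URL on backup job and failure metrics (e.g., "/mnt/nas-backup"), or the repository name on verification and restore-test metrics (e.g., "nas-primary")
Prometheus attaches its own job label (the scrape job name) to every scraped series. Unless the node-exporter scrape configuration sets honor_labels: true, the backup job label from these files is stored as exported_job.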
Useful Queries
Time Since Last Backup
# Seconds since last successful backup
time() - restic_backup_last_success_timestamp{hostname="forge"}
# Alert if no backup in 24 hours
(time() - restic_backup_last_success_timestamp{hostname="forge"}) > 86400
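The same staleness check can run on the host itself, without Prometheus, by reading the timestamp straight from a job's .prom file. Below is a sketch assuming bash and awk, integer timestamps, and hypothetical helper names. The 86400-second default mirrors the query above.

```shell
#!/usr/bin/env bash
# backup_age_seconds FILE [NOW]
#   Prints seconds since the restic_backup_last_success_timestamp stored in FILE.
#   NOW defaults to the current Unix time. Returns 1 if the metric is missing.
backup_age_seconds() {
  local file="$1" now="${2:-$(date +%s)}" ts
  ts=$(awk '/^restic_backup_last_success_timestamp[{ ]/ { print $NF; exit }' "$file" 2>/dev/null)
  [ -n "$ts" ] || return 1
  # Drop any fractional part before integer arithmetic.
  echo $(( now - ${ts%.*} ))
}

# check_backup_fresh FILE [MAX_AGE]
#   Returns 0 if the last success is at most MAX_AGE seconds old (default 86400),
#   1 if it is older or the metric is missing.
check_backup_fresh() {
  local age max="${2:-86400}"
  age=$(backup_age_seconds "$1") || return 1
  [ "$age" -le "$max" ]
}

# Example:
#   check_backup_fresh /var/lib/node_exporter/textfile_collector/restic_backup_system.prom || echo "stale"
```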
Backup Failure Detection
# Show current backup status (0 = failed)
restic_backup_status{hostname="forge"}
# Consecutive failure count per job
restic_backup_failure_count{hostname="forge"}
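For a quick look on the host, failed jobs can also be listed directly from the textfile directory. This is the local counterpart of filtering restic_backup_status for 0. Below is a sketch assuming bash, awk, and the file naming shown under Backup Job Metrics; `failed_backup_jobs` is a hypothetical name.

```shell
#!/usr/bin/env bash
# failed_backup_jobs DIR
#   Prints the job label of every restic_backup_status sample equal to 0
#   found in DIR/restic_backup_*.prom, one per line.
failed_backup_jobs() {
  local dir="$1"
  awk '
    /^restic_backup_status[{ ]/ && $NF == 0 {
      # Extract the value of job="..." from the label set.
      if (match($0, /job="[^"]*"/)) {
        print substr($0, RSTART + 5, RLENGTH - 6)
      }
    }
  ' "$dir"/restic_backup_*.prom 2>/dev/null
}

# Example:
#   failed_backup_jobs /var/lib/node_exporter/textfile_collector
```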
Backup Duration Trends
# Current backup duration
restic_backup_duration_seconds{hostname="forge"}
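# Change in backup duration over the past 7 days (positive = backups are taking longer, e.g. growing backup size)
# restic_backup_duration_seconds is a gauge, so use delta(); rate() is only meant for counters
delta(restic_backup_duration_seconds{hostname="forge"}[7d])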
Repository Health
# Verification status
restic_verification_status{hostname="forge"}
# Time since last verification
time() - restic_verification_last_run_timestamp{hostname="forge"}
Example Prometheus Alert Rules
groups:
  - name: backup_alerts
    rules:
      - alert: BackupFailed
        expr: restic_backup_status{hostname="forge"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed on {{ $labels.hostname }}"
          description: "Backup job {{ $labels.job }} failed on {{ $labels.hostname }}"
      - alert: BackupStale
        expr: (time() - restic_backup_last_success_timestamp{hostname="forge"}) > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful backup in 24 hours on {{ $labels.hostname }}"
      - alert: BackupDurationIncreasing
        expr: |
          (
            restic_backup_duration_seconds{hostname="forge"}
            /
            restic_backup_duration_seconds{hostname="forge"} offset 1w
          ) > 1.5
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Backup duration is more than 50% longer than one week ago on {{ $labels.hostname }}"
      - alert: VerificationFailed
        expr: restic_verification_status{hostname="forge"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Repository verification failed on {{ $labels.hostname }}"
Testing Metrics
After deploying, verify metrics are available:
# Check if metrics files exist
ls -la /var/lib/node_exporter/textfile_collector/
# View a metrics file
cat /var/lib/node_exporter/textfile_collector/backup_documentation.prom
# Query node-exporter endpoint
curl -s http://localhost:9100/metrics | grep -E 'restic_|backup_'
# Query from Prometheus (if configured)
# Navigate to: http://prometheus:9090/graph
# Query: restic_backup_last_success_timestamp{hostname="forge"}
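Beyond grepping the endpoint, a rough syntax check of the files can catch malformed lines before node-exporter fails to parse them. Below is a sketch assuming bash and GNU grep. check_textfile_dir is a hypothetical helper, and its pattern simplifies the exposition format (for example, it rejects label values containing }).

Two other tools do more. If promtool is installed, promtool check metrics lints exposition text read from stdin more thoroughly. node-exporter itself also reports textfile read errors through its node_textfile_scrape_error metric.

```shell
#!/usr/bin/env bash
# check_textfile_dir DIR
#   Rough syntax check of every *.prom file in DIR: each non-blank line must be
#   a "#" comment or "name{labels} value [timestamp]". Prints offending lines as
#   FILE:LINE:TEXT and returns 1 if any are found. Simplified pattern, not a full parser.
check_textfile_dir() {
  local dir="$1" f bad=0
  local sample='^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? [^ ]+( -?[0-9]+)?$'
  for f in "$dir"/*.prom; do
    [ -e "$f" ] || continue
    # -v selects lines matching none of the patterns; grep exits 0 if any were selected.
    if grep -HnEv -e '^[[:space:]]*$' -e '^#' -e "$sample" "$f"; then
      bad=1
    fi
  done
  return "$bad"
}

# Example:
#   check_textfile_dir /var/lib/node_exporter/textfile_collector && echo "all files look well-formed"
```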
Metric File Ownership
All metric files are created in /var/lib/node_exporter/textfile_collector/ and owned by the restic-backup user; the directory is readable by the node-exporter user.
The common monitoring module automatically creates this directory with proper permissions: