# Monitoring Strategy: Black-Box vs White-Box
This document establishes the strategic division of monitoring responsibilities between Gatus (black-box) and Prometheus/Alertmanager (white-box) for homelab services.
Last Updated: December 4, 2025
Architecture: Gatus replaces Uptime Kuma as the black-box monitoring solution
## Core Principle

"Use Gatus for user-facing availability (is it up?) and Prometheus for system internals (how well is it running?)."

These two monitoring perspectives are complementary, not redundant:

- A Gatus alert tells you what is broken (user impact)
- A Prometheus alert tells you why it's breaking (system health)
## Black-Box Monitoring (Gatus)

### Purpose

External validation from the user's perspective. Answers: "Is this service available and behaving as a user would expect from the outside?"

### Characteristics
- Knows nothing about internal state - interacts as a client would
- Binary checks: Service is UP ✅ or DOWN ❌
- Immediate alerts: User-facing failures require immediate action
- Public status page: Family/users can see service availability
- Declarative configuration: Services contribute their own endpoints
### What to Monitor in Gatus

Add to Gatus if the service meets ANY of these criteria:

- ✅ Users directly interact with it (web UI, API, network service)
- ✅ You want it displayed on a public status page
- ✅ The check is simple and external (HTTP 200, port open, ping)
- ✅ Failure immediately impacts user experience

### Check Types
| Check Type | Use Case | Example |
|---|---|---|
| HTTP(S) | Web services, APIs | Status 200 + keyword "Login" |
| TCP Port | Database connections | Port 5432 accepting connections |
| DNS | DNS resolution | Query google.com through AdGuard |
| Ping | Host reachability | NAS, other homelab nodes |
| TLS Certificate | Certificate expiry | Certificate valid, >30 days remaining |
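Several of these can be combined on a single endpoint. The sketch below uses the contribution pattern described in this document; the `vaultwarden` name and conditions are illustrative, and it assumes the contribution options map directly onto Gatus endpoint fields:

```nix
# Hypothetical contribution combining status, keyword, and TLS conditions.
# Assumes modules.services.gatus.contributions passes these fields straight
# through to the generated Gatus endpoint definition.
modules.services.gatus.contributions.vaultwarden = {
  name = "Vaultwarden";
  group = "Applications";
  url = "https://vault.holthome.net";
  interval = "60s";
  conditions = [
    "[STATUS] == 200"                 # HTTP: service responds
    "[BODY] == pat(*Vaultwarden*)"    # keyword: expected content present
    "[CERTIFICATE_EXPIRATION] > 720h" # TLS: more than 30 days remaining
  ];
};
```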
### Alert Routing
- Target: Phone notifications (critical/immediate)
- When: User-visible service failures
- Configure in: NixOS configuration via `modules.services.gatus.contributions`
## White-Box Monitoring (Prometheus)

### Purpose
Internal health and performance measurement. Answers: "What is the internal state, load, and performance of this service and its components?"
### Characteristics
- Requires metrics exposure - service must instrument itself
- Quantitative measurements: Gauges, counters, histograms, trends
- Predictive alerts: Warn before failures occur (disk filling, memory leak)
- Historical analysis: Grafana dashboards, capacity planning
### What to Monitor in Prometheus

Add to Prometheus if the service meets ANY of these criteria:

- ✅ Exposes metrics (native exporter or instrumented)
- ✅ Resource usage matters (CPU, memory, disk, network)
- ✅ Needs predictive alerting (trending toward failure)
- ✅ Requires historical trending and dashboards

### Metric Sources
| Source | Purpose | Examples |
|---|---|---|
| node_exporter | System-level metrics | CPU, memory, disk, network, systemd units |
| postgres_exporter | Database internals | Connection pools, query latency, replication lag |
| Application exporters | App-specific metrics | Auth failures, request rates, queue depths |
| Textfile collectors | Custom metrics | ZFS health, GPU usage, container stats |
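Most of these sources are one-line enables in NixOS. A minimal sketch using the stock `services.prometheus.exporters` options; the collector list and textfile directory are illustrative choices:

```nix
# System and database metric sources; collector names and the textfile
# directory are illustrative, not the only valid values.
services.prometheus.exporters = {
  node = {
    enable = true;
    enabledCollectors = [ "systemd" "textfile" ]; # unit states + custom metrics
    extraFlags = [ "--collector.textfile.directory=/var/lib/node_exporter" ];
  };
  postgres = {
    enable = true;
    runAsLocalSuperUser = true; # connect over the local socket as postgres
  };
};
```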
### Alert Types
| Alert Category | Severity | Example |
|---|---|---|
| Resource Critical | Critical | Disk free <10%, memory >90% sustained |
| Degradation | High | Query latency >500ms p95 |
| Predictive | Warning | Disk will fill in 4 hours (trend) |
| Internal Failure | High | Backup job failed, systemd restart loop |
### Alert Routing
- Target: Alertmanager → Slack/Email/Phone (severity-based)
- When: System trending toward failure or internal problems
- Configure in: NixOS configuration (`modules.alerting.rules`)
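As an illustration of a validated-threshold rule, here is a sketch. The schema of the custom `modules.alerting.rules` option is an assumption (taken to mirror standard Prometheus rule fields); the PromQL itself uses real node_exporter metrics:

```nix
# Sketch only: assumes modules.alerting.rules entries mirror Prometheus
# rule fields (alert, expr, for, labels, annotations).
modules.alerting.rules.disk-space-low = {
  alert = "DiskSpaceLow";
  expr = ''
    node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes < 0.10
  '';
  for = "15m";
  labels.severity = "critical";
  annotations.summary = "Less than 10% free on {{ $labels.mountpoint }}";
};
```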
## When to Use BOTH

Use both monitoring systems when:

1. Service is critical AND complex - PostgreSQL, authentication services
2. Different perspectives provide different value - External availability ≠ internal health
3. Alerts serve different purposes - User impact vs operational health
### Example: PostgreSQL
Gatus Check:
- Type: TCP Port
- Target: localhost:5432
- Alert: "Database is completely unreachable"
- Purpose: Fast validation that users/apps can connect
Prometheus Monitoring:

- Exporter: postgres_exporter
- Metrics: Connection pool usage, query latency, replication lag, backup status
- Alerts: High connection count, slow queries, backup failures
- Purpose: Catch slow degradation before total failure
Why both? The TCP check is immediate user-perspective validation, while Prometheus catches internal problems (slow queries, connection exhaustion) before they cause a complete outage.
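A side-by-side sketch of both halves, assuming the contribution options accept raw Gatus endpoint fields such as tcp:// URLs:

```nix
{
  # Black-box: can clients open a connection at all?
  modules.services.gatus.contributions.postgresql = {
    name = "PostgreSQL";
    group = "Databases";
    url = "tcp://localhost:5432";
    interval = "60s";
    conditions = [ "[CONNECTED] == true" ];
  };

  # White-box: connection pools, query latency, replication lag.
  services.prometheus.exporters.postgres.enable = true;
}
```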
## Service-Specific Guidance

### Authentication (PocketID, Keycloak)
| System | Check | Purpose |
|---|---|---|
| Gatus | HTTPS → login page returns 200 + "Login" keyword | Users can access login |
| Prometheus | Auth success/failure rate, request latency | Detect attacks or misconfig |
### DNS (AdGuard Home, Pi-hole)
| System | Check | Purpose |
|---|---|---|
| Gatus | DNS query for google.com succeeds | DNS resolution working |
| Prometheus | systemd unit state, query rate, block rate | Service health, performance |
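A possible contribution for the DNS check; the `dns` sub-attributes follow Gatus's endpoint syntax, and 10.0.0.53 is a stand-in for the actual AdGuard Home listener address:

```nix
# DNS resolution check through the local resolver (address is a placeholder).
modules.services.gatus.contributions.adguard-dns = {
  name = "AdGuard DNS";
  group = "Infrastructure";
  url = "10.0.0.53"; # the DNS server to query
  interval = "60s";
  dns = {
    query-name = "google.com";
    query-type = "A";
  };
  conditions = [ "[DNS_RCODE] == NOERROR" ];
};
```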
### Reverse Proxy (Caddy, Traefik)
| System | Check | Purpose |
|---|---|---|
| Gatus | HTTP(S) checks on proxied services (not Caddy itself) | Detect broken routes |
| Prometheus | systemd unit state, restart count | Caddy service health |
Note: Don't monitor "Caddy itself" in Gatus. Monitor the services behind Caddy. If Caddy is down, all HTTP checks fail - that's your signal.
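On the Prometheus side, node_exporter's systemd collector exposes unit state as `node_systemd_unit_state`, so a rule like this sketch (same assumed `modules.alerting.rules` schema as earlier) covers Caddy's own health:

```nix
# Alert when caddy.service leaves the active state; rule schema assumed
# as in the disk-space sketch above.
modules.alerting.rules.caddy-unit-down = {
  alert = "CaddyUnitDown";
  expr = ''node_systemd_unit_state{name="caddy.service", state="active"} == 0'';
  for = "2m";
  labels.severity = "high";
  annotations.summary = "caddy.service is not active on {{ $labels.instance }}";
};
```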
### Media (Plex, Jellyfin)
| System | Check | Purpose |
|---|---|---|
| Gatus | HTTPS → web UI returns 200 | Status page for family 📊 |
| Prometheus | Memory usage (transcode leaks), CPU (encoding) | Prevent OOM, resource exhaustion |
### Databases (PostgreSQL, MySQL)
| System | Check | Purpose |
|---|---|---|
| Gatus | TCP port check (optional - systemd check may suffice) | Basic reachability |
| Prometheus | Connection pools, query performance, backup status | Internal health, capacity |
### Monitoring Itself (Gatus, Prometheus)
| System | Check | Purpose |
|---|---|---|
| Gatus | ❌ Don't monitor itself (circular dependency) | N/A |
| Prometheus | systemd health check service state | Meta-monitoring |
Critical Pattern: Use systemd health check timers to probe Gatus, then monitor the timer state in Prometheus. This avoids the "monitoring the monitor" complexity trap.
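One way to implement this pattern is a oneshot probe on a timer; node_exporter's systemd collector then surfaces the unit's state to Prometheus. The /health path and port 8080 are assumptions about the local Gatus instance:

```nix
# Timer-driven probe of Gatus; if the curl fails, the unit enters a failed
# state that Prometheus can see via node_systemd_unit_state.
# Assumes pkgs is in scope (standard NixOS module argument).
systemd.services.gatus-healthcheck = {
  description = "Probe the Gatus health endpoint";
  serviceConfig = {
    Type = "oneshot";
    ExecStart = "${pkgs.curl}/bin/curl --fail --max-time 10 http://localhost:8080/health";
  };
};
systemd.timers.gatus-healthcheck = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnBootSec = "2m";
    OnUnitActiveSec = "1m";
  };
};
```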
## Implementation Checklist

### For New Services
When adding a service to your homelab, follow this decision tree:
```text
┌─────────────────────────────────────┐
│ New Service: "example-service"      │
└──────────────────┬──────────────────┘
                   │
                   ↓
        ┌─────────────────────────┐
        │ Do users interact with  │
        │ this service?           │
        └──────┬───────────┬──────┘
           Yes │           │ No
               ↓           │
       ┌────────────────┐  │
       │ Add to Gatus   │  │
       └───────┬────────┘  │
               │           │
               ↓           ↓
┌─────────────────────────────────────┐
│ Does it expose metrics or use       │
│ significant resources?              │
└──────┬─────────────────────────┬────┘
   Yes │                         │ No
       ↓                         ↓
┌───────────────────┐   ┌───────────────────┐
│ Enable Prometheus │   │ Only systemd unit │
│ scrape            │   │ monitoring is     │
│                   │   │ sufficient        │
└───────────────────┘   └───────────────────┘
```
### Prometheus Configuration (Keep Simple)
Current exporters to maintain:
- ✅ node_exporter - System metrics (already configured)
- ✅ postgres_exporter - Database metrics (already configured)
- ✅ Systemd unit state monitoring (already configured)
- ✅ Textfile collectors: ZFS, GPU, containers (already configured)
When to add application exporters:

- Only for critical services where internal metrics provide significant value
- Examples: caddy-security (auth failures), Caddy (request latency), critical APIs
- Default to NO - infrastructure monitoring is usually sufficient (if an exporter is justified, see the sketch after this list)
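When an exporter does clear that bar, wiring it up is usually a single static scrape job. Sketch below; the Caddy admin port 2019 is an assumption about the local setup:

```nix
# Static scrape job for an application exporter; job name and target port
# are illustrative.
services.prometheus.scrapeConfigs = [
  {
    job_name = "caddy";
    static_configs = [ { targets = [ "localhost:2019" ]; } ];
  }
];
```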
### Gatus Configuration (Declarative)
For each user-facing service, add a Gatus contribution in the service's NixOS configuration:
```nix
# In service module or host service file
modules.services.gatus.contributions.myservice = {
  name = "My Service";
  group = "Applications";
  url = "https://myservice.holthome.net/health";
  interval = "60s";
  conditions = [
    "[STATUS] == 200"
    "[RESPONSE_TIME] < 1000"
  ];
};
```
Check types (sketches for the non-HTTP variants follow this list):

1. Web services: HTTP(S) check with keyword/status validation
2. DNS services: DNS query check
3. Databases: TCP port check (if not sufficiently covered by Prometheus)
4. Infrastructure hosts: Ping/ICMP check for critical nodes (NAS, etc.)
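Sketches for the non-HTTP check types, again assuming contributions pass through raw Gatus endpoint fields (nas.holthome.net is a placeholder hostname):

```nix
modules.services.gatus.contributions = {
  # TCP: is the port accepting connections?
  postgres-port = {
    name = "PostgreSQL Port";
    group = "Databases";
    url = "tcp://localhost:5432";
    interval = "60s";
    conditions = [ "[CONNECTED] == true" ];
  };
  # ICMP: is the host reachable at all?
  nas-ping = {
    name = "NAS";
    group = "Infrastructure";
    url = "icmp://nas.holthome.net";
    interval = "60s";
    conditions = [ "[CONNECTED] == true" ];
  };
};
```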
## Alert Philosophy

### Prometheus Alerts
Goal: Predict and prevent failures before user impact
Characteristics:

- Threshold-based (CPU >90% for 5m)
- Trend-based (disk will fill in 4h)
- Internal failures (backup failed, restart loop)
- Route through Alertmanager with severity-based routing
### Gatus Alerts
Goal: Immediate notification of user-visible failures
Characteristics:

- Binary (service up or down)
- External perspective (as users see it)
- Configure in NixOS via contribution pattern
- Direct notifications (phone, critical channels)
### Alert Fatigue Prevention
Anti-Pattern: Don't alert on "guesses" (CPU high, memory high) unless you've validated they predict failures.
Best Practice: Alert on symptoms (service down, backup failed) and validated thresholds (disk <10%).
Homelab Optimization:

- Keep alert count low (high signal-to-noise)
- Validate every alert adds value
- Remove alerts that trigger without actionable issues
## Visualization Strategy

### Gatus
- Purpose: Public status page for users/family
- Audience: Non-technical users
- Content: Service availability (green/red), uptime percentages
### Grafana Dashboards
- Purpose: Operational visibility and analysis
- Audience: Homelab operator (you)
- Content: Resource trends, performance metrics, capacity planning
Separation of Concerns: Users see "Is Plex up?", operators see "Why is Plex using 8GB RAM?"
## Migration Path

### Current State
- ✅ Prometheus + node_exporter deployed
- ✅ postgres_exporter deployed
- ✅ Custom textfile collectors (ZFS, GPU, containers)
- ✅ Alertmanager integrated
- ✅ Gatus deployed with contributory endpoint pattern
### To Implement

- Add Gatus contributions for user-facing services (in NixOS config)
- Review Prometheus alert rules - ensure they follow the "alert on symptoms" philosophy
- Document runbooks for each alert type (what to do when it fires)
- Test alert delivery - verify both Gatus and Prometheus alerts reach you
### Future Enhancements (Optional)

- Add application exporters for critical services (if justified by value)
- Implement predictive alerting for capacity planning
- Create Grafana dashboards for specific service deep-dives
## Anti-Patterns to Avoid

❌ Don't monitor Gatus inside Gatus (circular dependency)
✅ Do use Prometheus + a systemd health check timer to monitor Gatus

❌ Don't add both Gatus and Prometheus checks for the same thing
✅ Do use Gatus for availability, Prometheus for internals (different perspectives)

❌ Don't add a gatus-exporter to expose Gatus metrics to Prometheus
✅ Do use Gatus's native /metrics endpoint (already built in) - see the scrape sketch after this list

❌ Don't monitor Caddy directly in Gatus
✅ Do monitor the services behind Caddy (if Caddy is down, all checks fail)

❌ Don't alert on resource usage without validation
✅ Do alert on symptoms (service down) and validated thresholds (disk <10%)
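For the gatus-exporter anti-pattern above, scraping the built-in endpoint is a one-liner; this sketch assumes Gatus listens locally on port 8080 with metrics enabled in its own configuration:

```nix
# Scrape Gatus's native Prometheus metrics (port is an assumption).
services.prometheus.scrapeConfigs = [
  {
    job_name = "gatus";
    static_configs = [ { targets = [ "localhost:8080" ]; } ];
  }
];
```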
## Decision Framework Summary

### Quick Reference Table
| Question | Gatus | Prometheus | Both |
|---|---|---|---|
| User-facing web service? | ✅ HTTP(S) check | Optional: app metrics | If critical |
| Database service? | Optional: TCP check | ✅ Internal metrics | Usually |
| Infrastructure host? | ✅ Ping check | ✅ node_exporter | Yes |
| Monitoring service itself? | ❌ No | ✅ Health check state | No |
| Reverse proxy? | ❌ Monitor services behind it | ✅ Systemd state | Yes |
| Internal-only service? | ❌ No | ✅ If exposes metrics | No |
### The Mental Model
```text
┌─────────────────────────────────────────────────────────────┐
│                       User Perspective                      │
│                        (Gatus Checks)                       │
│                                                             │
│  "Can users access the service right now?"                  │
│  → HTTP 200? DNS resolving? Port open?                      │
│  → Binary: UP ✅ or DOWN ❌                                  │
│  → Alert immediately on failure                             │
└──────────────────────────────┬──────────────────────────────┘
                               │
                      Service is running
                               │
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                      System Perspective                     │
│                (Prometheus/Grafana Metrics)                 │
│                                                             │
│  "How healthy is the service internally?"                   │
│  → CPU/Memory usage? Disk space? Query latency?             │
│  → Quantitative: Trends, thresholds, predictions            │
│  → Alert on degradation before failure                      │
└─────────────────────────────────────────────────────────────┘
```
## References

- Gatus Module: `modules/nixos/services/gatus/default.nix`
- Prometheus Configuration: `hosts/forge/monitoring.nix`
- Alerting Module: `modules/nixos/alerting.nix`
- Alert Definitions: Co-located with services (e.g., `hosts/forge/services/*.nix`)
## Revision History

- Dec 4, 2025: Updated to use Gatus as the black-box monitoring solution
    - Replaced Uptime Kuma references with Gatus
    - Added declarative configuration patterns via NixOS contributions
    - Updated decision framework and examples
- Nov 5, 2025: Initial document created based on Gemini Pro strategic analysis
    - Established black-box vs white-box monitoring principles
    - Defined clear decision framework for service monitoring
    - Documented anti-patterns and best practices for homelab context