ADR-006: Black-Box vs White-Box Monitoring Strategy
Status: Accepted
Date: 2025-12-09
Context: Monitoring architecture for homelab services
Context
Homelab monitoring needs to answer two distinct questions:
- "Is this service available?" - User perspective
- "How healthy is this service internally?" - Operator perspective
Early implementations used a single monitoring tool for both purposes, leading to either alert fatigue (too many internal alerts) or blind spots (missing user-facing issues).
Decision
Maintain two complementary monitoring systems with clear separation of concerns:
| System | Type | Purpose | Alerts On |
|---|---|---|---|
| Gatus | Black-box | User-facing availability | Service up/down, HTTP status, port reachability |
| Prometheus | White-box | Internal health & performance | Resource usage, query latency, backup failures |
Gatus (Black-Box)
- Monitors services as a user would see them
- Binary checks: UP ✅ or DOWN ❌
- Immediate phone notifications for user-visible failures
- Public status page for family/users
- Services contribute their own endpoints via contributory pattern
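As a rough illustration of what a black-box check does, the sketch below probes a TCP port the way a client would and reduces the result to UP or DOWN. It is not how Gatus is implemented or configured; `tcp_check` and its parameters are hypothetical names used only to show the idea.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True (UP) if a TCP connection to host:port succeeds in time.

    Hypothetical helper for illustration; it knows nothing about the
    service's internals, only whether a client could connect.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

The check deliberately returns a plain boolean: a black-box probe answers "can a user reach it?" and leaves the "why" to internal metrics.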
Prometheus (White-Box)
- Monitors internal service state
- Quantitative metrics: gauges, counters, histograms
- Predictive alerts (disk filling, memory leak)
- Historical analysis via Grafana
- Alert routing through Alertmanager
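Predictive alerts of the "disk filling" kind come down to extrapolating a trend. The sketch below fits a least-squares line to (timestamp, free bytes) samples and projects it forward, similar in spirit to PromQL's `predict_linear`. The function names, the sample data, and the "alert if projected free space reaches zero" rule are assumptions for illustration, not the alert rules deployed here.

```python
def extrapolate_linear(samples, horizon_s):
    """Project a least-squares line through samples forward by horizon_s.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns the projected value at (last timestamp + horizon_s).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def disk_will_fill(samples, horizon_s):
    """True if projected free bytes reach zero within horizon_s."""
    return extrapolate_linear(samples, horizon_s) <= 0


# Hypothetical data: free space drops 10 GB per hour from 100 GB.
free_bytes = [(0, 100e9), (3600, 90e9), (7200, 80e9)]
fills_within_4h = disk_will_fill(free_bytes, 4 * 3600)  # trend leaves ~40 GB
fills_within_9h = disk_will_fill(free_bytes, 9 * 3600)  # trend crosses zero
```

A black-box check cannot produce this kind of warning, because the service is still fully available while the trend develops.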
Consequences
Positive
- Clear responsibility: Each system has one job
- Complementary alerts: Gatus says what is broken, Prometheus says why
- Reduced alert fatigue: Internal issues don't page unless they affect users
- Public status page: Users can self-serve availability checks
- Predictive capability: Prometheus catches issues before they cause outages
Negative
- Two systems to maintain: More infrastructure complexity
- Alert coordination: Must avoid duplicate alerts for same issue
- Learning curve: Need to understand when to use which system
Mitigations
- Use Gatus's native `/metrics` endpoint (no separate exporter)
- Document clear decision framework for each service
- Avoid monitoring Gatus inside Gatus (use Prometheus health check)
Decision Framework

    ┌─────────────────────────────────────────┐
    │ New Service: Do users interact with it? │
    └───────────┬─────────────────────────────┘
                │
           Yes  │  No
            ↓       ↓
        ┌───────┐   └──────────────┐
        │ Add   │                  │
        │ Gatus │                  │
        │ check │                  │
        └───┬───┘                  │
            │                      │
            ↓                      ↓
    ┌───────────────────────────────────────────┐
    │ Does it expose metrics or use significant │
    │ resources?                                │
    └───────┬───────────────────────────────────┘
            │
       Yes  │  No
        ↓       ↓
    ┌───────┐   └─────────────────┐
    │Enable │   │ Only systemd    │
    │Prom   │   │ unit monitoring │
    │scrape │   └─────────────────┘
    └───────┘

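The same two questions can be written as a small function, which may be easier to apply consistently when onboarding a service. This is a sketch of the flowchart only; the `Service` fields and the plan labels are hypothetical names, not identifiers from the actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical description of a service being onboarded."""
    name: str
    user_facing: bool
    exposes_metrics: bool = False
    resource_heavy: bool = False


def monitoring_plan(svc: Service) -> list:
    """Apply the two flowchart questions in order and return the plan."""
    plan = []
    # Question 1: do users interact with it?
    if svc.user_facing:
        plan.append("gatus-check")
    # Question 2: does it expose metrics or use significant resources?
    if svc.exposes_metrics or svc.resource_heavy:
        plan.append("prometheus-scrape")
    else:
        plan.append("systemd-unit-monitoring")
    return plan


# PostgreSQL is reachable by clients and exposes internals worth scraping.
postgres_plan = monitoring_plan(
    Service("postgresql", user_facing=True, exposes_metrics=True)
)
```

Note that the second question is asked on both branches, so a user-facing service without metrics still ends up with a Gatus check plus systemd unit monitoring.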
Examples
PostgreSQL (Both Systems)
Gatus: TCP check on port 5432 - "Can clients connect?"
Prometheus: Connection pool usage, query latency, backup status - "Why is it slow?"
Plex (Gatus Primary)
Gatus: HTTPS check on web UI - Status page for family
Prometheus: Memory usage (transcoding leaks) - Prevent OOM
Caddy (Prometheus Only)
Don't monitor Caddy in Gatus - monitor services behind Caddy. If Caddy is down, all HTTP checks fail - that's the signal.
Anti-Patterns
- ❌ Don't monitor Gatus inside Gatus (circular dependency)
- ✅ Use Prometheus + systemd health check timer to monitor Gatus
- ❌ Don't add the same check to both systems
- ✅ Use Gatus for availability, Prometheus for internals (different perspectives)
- ❌ Don't alert on resource usage without validation
- ✅ Alert on symptoms (service down) and validated thresholds (free disk space <10%)
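One way to picture how these rules interact is a small routing function: user-visible failures page, internal alerts for a service that is already paging are not sent a second time, and unvalidated thresholds never notify. This is a sketch under assumed names (`Alert`, `route`, and the action labels are hypothetical); the real routing lives in Gatus and Alertmanager configuration and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record used only for this illustration."""
    service: str
    source: str             # "gatus" (black-box) or "prometheus" (white-box)
    validated: bool = True  # threshold has been checked against real incidents


def route(alerts):
    """Return one action per alert, in input order.

    page     - phone notification for a user-visible failure
    suppress - internal alert for a service that is already paging
    notify   - non-paging notification for a validated internal threshold
    drop     - unvalidated threshold, not worth anyone's attention yet
    """
    already_down = {a.service for a in alerts if a.source == "gatus"}
    actions = []
    for a in alerts:
        if a.source == "gatus":
            actions.append("page")
        elif a.service in already_down:
            actions.append("suppress")
        elif a.validated:
            actions.append("notify")
        else:
            actions.append("drop")
    return actions
```

Suppressed alerts are not lost in this model: the underlying metrics remain available for diagnosing why the service is down.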
Related
- Monitoring Strategy - Full documentation
- ADR-001: Contributory Infrastructure Pattern - Gatus endpoint contributions