Modular Design Patterns

Last Updated: 2025-12-31

This document establishes standardized design patterns for NixOS service modules based on the refined Caddy and PostgreSQL reference implementations. Following these patterns ensures consistency, maintainability, and type safety across the entire infrastructure configuration.

Design Philosophy

Core Principles

Declarative Configuration: Services declare what they need, not how to achieve it
Type Safety: Use types.submodule for complex configuration structures
Separation of Concerns: Abstract implementation details from service declarations
Automatic Integration: Services automatically register with infrastructure systems
Graceful Migration: Support legacy patterns during transitions

Reference Implementations

Web Services: modules/nixos/services/caddy/default.nix - Structured backend configuration, security options, automatic DNS record generation
Storage Services: modules/nixos/services/postgresql/ - Database provisioning, secure credential handling, systemd integration
Observability Services: modules/nixos/services/loki/default.nix, modules/nixos/services/promtail/default.nix - Complete observability stack with standardized patterns
Monitoring Services: modules/nixos/services/gatus/default.nix - Black-box monitoring with contributory endpoint system and native Prometheus metrics
Shared Types: lib/types.nix - Centralized type definitions for all standardized submodules

Monitoring Strategy: docs/monitoring-strategy.md - Black-box vs white-box monitoring principles, service-specific guidance

Cross-Service Contribution Interfaces

Shared services expose dedicated integration points so downstream modules can declaratively contribute resources without patching implementation details:

- Grafana (modules/nixos/services/grafana/default.nix)
  - Use modules.services.grafana.integrations.<name> to bundle datasources, dashboards, and LoadCredential entries.
  - Each integration can provide a map of datasources plus dashboard providers, and the module handles YAML provisioning + credential wiring automatically.
  - Example: modules.services.grafana.integrations.teslamate.datasources.teslamate = { ... }; (see TeslaMate module for a real reference).
- PostgreSQL → Grafana bridge
  - Any database may declare grafanaDatasources = [ { ... } ] under modules.services.postgresql.databases.<dbName>.
  - Entries capture datasource metadata (host, user, UID, Timescale toggles) and optional dashboard directories; the module emits the correct Grafana integration and attaches password files safely via systemd credentials.
- EMQX MQTT broker
  - Global ACLs live in modules.services.emqx.aclRules, and downstream services can append via modules.services.emqx.integrations.<service>.acls.
  - MQTT users can also be added through the same integration attribute, keeping per-service credentials close to their owners.
  - The module now materializes authz.conf and sets EMQX_AUTHORIZATION__* automatically whenever rules are declared.
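
As a rough illustration of how such an integration point can be built, the sketch below declares an `integrations` option and merges the rules contributed by each downstream service with the global list. The module path `modules.services.example`, the string-typed rules, and the generated file path are assumptions made for this sketch; it mirrors the EMQX naming above but is not the actual module code.

```nix
{ config, lib, ... }:
let
  # Hypothetical module path, used only for this sketch
  cfg = config.modules.services.example;

  # Global rules first, then every rule contributed via integrations.<name>.acls
  allAcls = cfg.aclRules
    ++ lib.concatLists (lib.mapAttrsToList (_: integration: integration.acls) cfg.integrations);
in
{
  options.modules.services.example = {
    aclRules = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ ];
      description = "Global ACL rules owned by the shared service";
    };

    integrations = lib.mkOption {
      type = lib.types.attrsOf (lib.types.submodule {
        options.acls = lib.mkOption {
          type = lib.types.listOf lib.types.str;
          default = [ ];
          description = "ACL rules contributed by one downstream service";
        };
      });
      default = { };
      description = "Per-service contributions, keyed by the contributing service";
    };
  };

  # Only materialize the file when at least one rule is declared
  config.environment.etc."example/authz.conf" = lib.mkIf (allAcls != [ ]) {
    text = lib.concatStringsSep "\n" allAcls + "\n";
  };
}
```

With an interface like this, a downstream module only sets its own `integrations.<service>.acls` list and never touches the shared module's implementation.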

Thin Orchestrator Pattern (Multi-Service Stacks)

When multiple services work together as a cohesive stack (e.g., observability, monitoring), use a thin orchestrator pattern rather than a "god module" that re-exposes all options.

Anti-Pattern: God Module ❌

Don't create meta-modules that proxy every option from underlying services:

# BAD: God module that re-exposes 90% of underlying options
options.modules.services.observability = {
  loki = {
    port = mkOption { ... };
    retentionDays = mkOption { ... };
    storagePath = mkOption { ... };
    # ... 50 more options copied from loki module
  };
  grafana = {
    port = mkOption { ... };
    oidc = mkOption { ... };
    # ... 100 more options copied from grafana module
  };
};

Problems:
- Dual maintenance burden (options defined twice)
- Documentation divergence
- Type synchronization issues
- Makes underlying modules harder to use directly

Correct Pattern: Thin Orchestrator ✅

A thin orchestrator only provides:
1. Master enable toggle - Turn the whole stack on/off
2. Component toggles - Enable/disable individual components
3. Cross-cutting wiring - Connections that span services (e.g., Promtail → Loki URL)
4. Stack-level concerns - Auto-discovery, shared alerts

# GOOD: Thin orchestrator that wires services together
options.modules.services.observability = {
  enable = mkEnableOption "observability stack";

  # Component toggles - no option re-exposure
  loki.enable = mkOption { type = types.bool; default = cfg.enable; };
  promtail.enable = mkOption { type = types.bool; default = cfg.enable; };
  grafana.enable = mkOption { type = types.bool; default = cfg.enable; };
  prometheus.enable = mkOption { type = types.bool; default = false; };

  # Stack-level concern: auto-discovery of metrics endpoints
  autoDiscovery.enable = mkOption { type = types.bool; default = true; };

  # Stack-level concern: shared alerting rules
  alerts.enable = mkOption { type = types.bool; default = true; };
};

config = mkIf cfg.enable {
  # Enable individual modules - they configure themselves
  modules.services.loki.enable = cfg.loki.enable;
  modules.services.promtail.enable = cfg.promtail.enable;
  modules.services.grafana.enable = cfg.grafana.enable;

  # Cross-cutting wiring: connect Promtail to Loki
  modules.services.promtail.lokiUrl = mkIf (cfg.promtail.enable && cfg.loki.enable)
    "http://127.0.0.1:${toString config.modules.services.loki.port}";

  # Cross-cutting wiring: auto-configure Grafana datasources
  modules.services.grafana.autoConfigure = {
    loki = lib.mkDefault cfg.loki.enable;
    prometheus = lib.mkDefault cfg.prometheus.enable;
  };
};

When to Use Thin Orchestrators

✅ Use thin orchestrator when:
- Multiple services form a logical stack (observability, media automation)
- Services need wiring between each other
- You want a simple "enable the whole stack" toggle
- Cross-cutting concerns like auto-discovery need coordination

❌ Don't create orchestrators when:
- Services are independent (each service module stands alone)
- No cross-service wiring is needed
- A simple host-level config suffices

Reference Implementation

Observability Stack: modules/nixos/services/observability/default.nix
~190 lines (vs 876 lines in previous god-module version)
Enables: Loki, Promtail, Grafana, optionally Prometheus
Provides: Auto-discovery of metrics endpoints, stack-level alerts
Wires: Promtail → Loki, Grafana datasources

Host-Level Customization

Users who need to customize individual services configure them directly:

# Host config: Enable stack with thin orchestrator
modules.services.observability.enable = true;

# Customize individual services directly (not through orchestrator)
modules.services.loki.retention = 30;
modules.services.grafana.oidc = { ... };
modules.services.promtail.extraScrapeConfigs = [ ... ];

This provides the best of both worlds:
- Simple enable for common use cases
- Full control when customization is needed
- No option duplication between modules

Creating New Service Modules

Native vs Container Decision

CRITICAL PRINCIPLE: Always prefer native NixOS services over containerized implementations when available.

Decision Framework

When adding a new service, follow this priority order:

1. Check for native NixOS module (search.nixos.org/options)
   - ✅ PREFERRED: Wrap native module with homelab patterns
   - Example: Gatus has services.gatus - use this instead of container
   - Benefits: Better NixOS integration, easier updates, no container overhead
2. If native module doesn't exist, check if upstream provides one
   - Sometimes services have NixOS modules in their own repos
   - Consider contributing the module to nixpkgs
3. Only use containers when:
   - No native NixOS module exists or is practical
   - Service explicitly requires containerization (security isolation)
   - Rapid prototyping before creating native module

Architecture Example: Gatus

The Gatus module demonstrates the preferred approach:

Native Module with Contributory Pattern:
- Wraps native services.gatus NixOS module
- Adds homelab-specific contribution system
- Services register endpoints declaratively

Key Architecture:

# Wrapper around native module adds homelab patterns
config = mkIf cfg.enable {
  # Enable native NixOS service
  services.gatus = {
    enable = true;
    settings = {
      web.port = cfg.port;
      # Endpoints contributed by other services
    };
  };

  # Add homelab integrations
  # - ZFS storage management
  # - Backup integration
  # - Reverse proxy registration
  # - Native Prometheus metrics
  # - Preseed/DR capability
};

Contributory Pattern:

# Services register themselves with Gatus
modules.services.gatus.contributions.plex = {
  name = "Plex";
  group = "Media";
  url = "https://plex.holthome.net/web/index.html";
  interval = "60s";
  conditions = [ "[STATUS] == 200" ];
};
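
One way the wrapper could turn those contributions into Gatus endpoints is sketched below. The option layout (an attribute set of submodules with the five fields used in the example above) and the defaults are assumptions for illustration; the real module may validate more fields or merge differently.

```nix
{ config, lib, ... }:
let
  cfg = config.modules.services.gatus;
in
{
  options.modules.services.gatus = {
    enable = lib.mkEnableOption "Gatus";

    # Assumed shape: one entry per contributing service
    contributions = lib.mkOption {
      type = lib.types.attrsOf (lib.types.submodule {
        options = {
          name = lib.mkOption { type = lib.types.str; description = "Display name"; };
          group = lib.mkOption { type = lib.types.str; description = "Dashboard group"; };
          url = lib.mkOption { type = lib.types.str; description = "URL to probe"; };
          interval = lib.mkOption { type = lib.types.str; default = "60s"; description = "Check interval"; };
          conditions = lib.mkOption {
            type = lib.types.listOf lib.types.str;
            default = [ "[STATUS] == 200" ];
            description = "Gatus conditions that must hold";
          };
        };
      });
      default = { };
      description = "Endpoints registered by other service modules";
    };
  };

  config = lib.mkIf cfg.enable {
    services.gatus = {
      enable = true;
      # attrValues drops the attribute names; each value already has the
      # fields of a Gatus endpoint entry
      settings.endpoints = lib.attrValues cfg.contributions;
    };
  };
}
```

Because the option is an attribute set, two modules contributing under different keys merge without conflict, and a host can override a single endpoint by key.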

Benefits of Native Approach:
- ✅ Simpler implementation (46% less code)
- ✅ No Podman dependency
- ✅ Better systemd integration
- ✅ Automatic NixOS updates (nix flake update)
- ✅ Native privilege management (no container user mapping)
- ✅ Direct filesystem access (no volume mounts)

Service Module Creation Workflow

Step 1: Research & Discovery

Search for native NixOS module:

# Search nixpkgs packages (nix search matches packages, not options)
nix search nixpkgs <service-name>

# Check NixOS options
# https://search.nixos.org/options?query=services.<service>

Evaluate existing module (if found):
Does it provide sufficient configuration options?
Is it actively maintained?
Does it follow modern NixOS patterns?
Decision point:
Native module exists and is sufficient → Wrapper approach (preferred)
Native module incomplete/outdated → Contribute fixes or full implementation
No native module → Container or custom implementation

Step 2: Module Structure

Create your module in modules/nixos/services/<service-name>/:

{ config, lib, pkgs, mylib, ... }:

let
  cfg = config.modules.services.<service-name>;

  # Use mylib for shared types (injected via _module.args)
  sharedTypes = mylib.types;

  # Storage helpers for ZFS dataset management (requires pkgs)
  storageHelpers = mylib.storageHelpers pkgs;
in
{
  options.modules.services.<service-name> = {
    enable = lib.mkEnableOption "<service-name>";

    # Add standardized submodules (choose applicable ones)
    reverseProxy = lib.mkOption {
      type = lib.types.nullOr sharedTypes.reverseProxySubmodule;
      default = null;
      description = "Reverse proxy configuration";
    };

    metrics = lib.mkOption {
      type = lib.types.nullOr sharedTypes.metricsSubmodule;
      default = null;
      description = "Prometheus metrics collection";
    };

    logging = lib.mkOption {
      type = lib.types.nullOr sharedTypes.loggingSubmodule;
      default = null;
      description = "Log shipping configuration";
    };

    backup = lib.mkOption {
      type = lib.types.nullOr sharedTypes.backupSubmodule;
      default = null;
      description = "Backup configuration";
    };

    notifications = lib.mkOption {
      type = lib.types.nullOr sharedTypes.notificationSubmodule;
      default = null;
      description = "Notification channels";
    };

    # Add preseed for disaster recovery (if stateful)
    preseed = {
      enable = lib.mkEnableOption "automatic restore before service start";
      repositoryUrl = lib.mkOption {
        type = lib.types.str;
        description = "URL to Restic repository";
      };
      # ... (see preseed pattern below)
    };
  };

  config = lib.mkIf cfg.enable {
    # Service implementation here
  };
}

Host-Level Contribution Rule

When a host (for example hosts/forge/services/<name>.nix) enables a service module and contributes additional infrastructure resources—ZFS datasets, Sanoid definitions, alert rules, backup jobs, Cloudflare tunnels, etc.—every one of those contributions must be wrapped in a guard tied to the service's enable flag. The canonical pattern is:

let
  serviceEnabled = config.modules.services.<service>.enable or false;
in
{
  config = lib.mkMerge [
    { modules.services.<service> = { enable = true; ... }; }

    (lib.mkIf serviceEnabled {
      modules.storage.datasets.services.<service> = { ... };
      modules.backup.sanoid.datasets."tank/services/<service>" = { ... };
      modules.alerting.rules."<service>-service-down" = { ... };
      modules.services.caddy.virtualHosts.<service>.cloudflare = { ... };
    })
  ];
}

This ensures that disabling a service automatically disables all downstream infrastructure so we never create orphaned datasets, alerts, or backup jobs.

Host-Level Defaults Libraries

For hosts with many services following similar patterns, create a centralized defaults library to reduce duplication. This is particularly useful for:

Standard Sanoid/Syncoid replication configurations
Common alert patterns (service-down, systemd-down)
Backup repository configurations
Authentication/security policies

Reference Implementation: hosts/forge/lib/defaults.nix

# hosts/<hostname>/lib/defaults.nix
{ config, lib }:

let
  resticEnabled = (config.modules.backup.enable or false)
    && (config.modules.backup.restic.enable or false);
in
{
  # Standard backup configuration
  backup = {
    enable = true;
    repository = "nas-primary";
  };

  # ZFS replication helper
  mkSanoidDataset = serviceName: {
    useTemplate = [ "services" ];
    recursive = false;
    autosnap = true;
    autoprune = true;
    replication = {
      targetHost = "nas-1.example.com";
      targetDataset = "backup/${config.networking.hostName}/zfs-recv/${serviceName}";
      # ... replication options
    };
  };

  # Container service-down alert helper
  mkServiceDownAlert = serviceName: displayName: description: {
    type = "promql";
    alertname = "${displayName}ServiceDown";
    expr = ''container_service_active{name="${serviceName}"} == 0'';
    for = "2m";
    severity = "high";
    labels = { service = serviceName; category = "availability"; };
    annotations = {
      summary = "${displayName} service is down on {{ $labels.instance }}";
      description = "The ${displayName} ${description} service is not active.";
      command = "systemctl status podman-${serviceName}.service";
    };
  };

  # Systemd service-down alert helper (for native services)
  mkSystemdServiceDownAlert = serviceName: displayName: description: {
    type = "promql";
    alertname = "${displayName}ServiceDown";
    expr = ''node_systemd_unit_state{name="${serviceName}.service",state="active"} == 0'';
    for = "2m";
    severity = "high";
    labels = { service = serviceName; category = "availability"; };
    annotations = {
      summary = "${displayName} service is down on {{ $labels.instance }}";
      description = "The ${displayName} ${description} service is not active.";
      command = "systemctl status ${serviceName}.service";
    };
  };

  # Preseed/DR configuration (auto-gated by restic)
  mkPreseed = restoreMethods: lib.mkIf resticEnabled {
    enable = true;
    repositoryUrl = "/mnt/nas-backup";
    passwordFile = config.sops.secrets."restic/password".path;
    restoreMethods = restoreMethods;
  };
}

Usage in Service Files:

{ config, lib, ... }:
let
  forgeDefaults = import ../lib/defaults.nix { inherit config lib; };
  serviceEnabled = config.modules.services.myapp.enable or false;
in
{
  config = lib.mkMerge [
    {
      modules.services.myapp = {
        enable = true;
        backup = forgeDefaults.backup;
        preseed = forgeDefaults.mkPreseed [ "syncoid" "local" "restic" ];
      };
    }

    (lib.mkIf serviceEnabled {
      modules.backup.sanoid.datasets."tank/services/myapp" =
        forgeDefaults.mkSanoidDataset "myapp";

      modules.alerting.rules."myapp-service-down" =
        forgeDefaults.mkServiceDownAlert "myapp" "MyApp" "application";
    })
  ];
}

Key Benefits:
- Reduces boilerplate from ~15-20 lines to 1-2 lines per concern
- Ensures consistency across all services on a host
- Single point of change for host-specific infrastructure patterns
- Separates "what the module does" from "how this host deploys it"

When to Create a Defaults Library:
- Host has 5+ services with similar patterns
- Multiple services share the same backup/replication target
- Alert patterns are standardized across services
- Authentication policies are consistent

When NOT to Use Helpers:
- Service has unique, complex alert expressions
- Backup requires custom retention or exclusion patterns
- Replication needs special handling (encrypted sends, different targets)

Step 3: Native Wrapper Pattern (PREFERRED)

When wrapping a native NixOS module:

config = lib.mkIf cfg.enable {
  # 1. Enable and configure native service
  services.<service-name> = {
    enable = true;
    # Pass through relevant configuration
    # Keep it minimal - let native module handle defaults
  };

  # 2. Override systemd service if needed (for ZFS, etc.)
  systemd.services."<service-name>" = {
    # Add dependencies
    after = [ "zfs-mount.service" ];
    requires = [ "zfs-mount.service" ];

    # Override user/permissions if needed
    serviceConfig = {
      User = lib.mkForce "<service-user>";
      Group = lib.mkForce "<service-group>";
      StateDirectory = lib.mkForce ""; # Disable if using ZFS
      ReadWritePaths = [ cfg.dataDir ]; # For sandboxing
    };
  };

  # 3. Add ZFS storage management
  systemd.services."ensure-<service-name>-storage" =
    storageHelpers.mkZfsStorageService {
      dataset = "tank/services/<service-name>";
      mountpoint = cfg.dataDir;
      owner = "<service-user>";
      properties = {
        recordsize = "16K"; # Optimize for workload
        compression = "zstd";
      };
    };

  # 4. Add homelab integrations (backup, monitoring, etc.)
  # See subsequent patterns below
};

Step 4: Add Homelab Integrations

Follow the standardized submodule patterns:

Reverse Proxy (if web service):

modules.services.caddy.virtualHosts."<service>" = lib.mkIf (cfg.reverseProxy != null) {
  enable = cfg.reverseProxy.enable;
  hostName = cfg.reverseProxy.hostName;
  backend = cfg.reverseProxy.backend;
};

Monitoring (see Monitoring Strategy doc):
Add Gatus endpoint contribution (user-facing check)
Configure Prometheus alerts (system health)
Add systemd health check if needed
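
The monitoring items above might look roughly like this inside a service module. The sketch assumes a hypothetical `myapp` module that exposes the `reverseProxy` submodule, and it reuses the Gatus contribution and alert rule fields shown elsewhere in this document; the group name is invented for illustration.

```nix
{ config, lib, ... }:
let
  # Hypothetical service module used only for this sketch
  cfg = config.modules.services.myapp;
in
{
  config = lib.mkIf cfg.enable {
    # Black-box check: only meaningful when the service is published
    # through the reverse proxy
    modules.services.gatus.contributions.myapp = lib.mkIf (cfg.reverseProxy != null) {
      name = "MyApp";
      group = "Applications"; # assumed group name
      url = "https://${cfg.reverseProxy.hostName}";
      interval = "60s";
      conditions = [ "[STATUS] == 200" ];
    };

    # White-box check: alert when the systemd unit is not active
    modules.alerting.rules."myapp-service-down" = {
      type = "promql";
      alertname = "MyAppServiceDown";
      expr = ''node_systemd_unit_state{name="myapp.service",state="active"} == 0'';
      for = "2m";
      severity = "high";
    };
  };
}
```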

Backup (if stateful):

modules.services.backup.jobs."<service-name>" = lib.mkIf (cfg.backup != null) {
  enable = cfg.backup.enable;
  repository = cfg.backup.repository;
  paths = [ cfg.dataDir ];
  useSnapshots = cfg.backup.useSnapshots;
  # ... (see backup pattern)
};

Preseed/DR (for critical services):

# Add pre-start restore logic
# See Disaster Recovery Preseed Pattern doc
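
A minimal sketch of what pre-start restore logic could look like is given below. It is not the documented preseed implementation: it covers only the Restic path (not the syncoid or local restore methods), assumes a hypothetical `myapp` module with `dataDir` and the `preseed` options shown earlier (including `passwordFile`), and assumes snapshots in the repository contain the data directory path.

```nix
{ config, lib, pkgs, ... }:
let
  # Hypothetical service module used only for this sketch
  cfg = config.modules.services.myapp;
in
{
  config = lib.mkIf (cfg.enable && cfg.preseed.enable) {
    systemd.services.preseed-myapp = {
      description = "Restore myapp data from Restic before first start";
      # Must finish before the main unit is started
      requiredBy = [ "myapp.service" ];
      before = [ "myapp.service" ];
      path = [ pkgs.restic ];
      serviceConfig = {
        Type = "oneshot";
        RemainAfterExit = true;
      };
      script = ''
        # Skip the restore when the data directory already has content
        if [ -n "$(ls -A ${cfg.dataDir} 2>/dev/null)" ]; then
          echo "Data present in ${cfg.dataDir}, skipping restore"
          exit 0
        fi

        # Restore the newest snapshot that contains the data directory
        restic --repo ${cfg.preseed.repositoryUrl} \
          --password-file ${cfg.preseed.passwordFile} \
          restore latest --path ${cfg.dataDir} --target /
      '';
    };
  };
}
```

The emptiness check keeps the unit idempotent: on a healthy host it exits immediately, and on a freshly provisioned host it seeds the data before the service starts.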

Step 5: Testing & Validation

Build configuration:

nix build .#nixosConfigurations.<host>.config.system.build.toplevel

Deploy and verify:

# Check service status
systemctl status <service-name>.service

# Verify ZFS dataset
zfs list | grep <service-name>

# Test backup
systemctl start backup-<service-name>.service

# Check monitoring
curl http://localhost:<metrics-port>/metrics

Validate integrations:
Reverse proxy: curl https://<service>.domain.tld
Backup: Check Restic snapshots
Monitoring: Verify Prometheus scrape targets
Logs: Check Loki for service logs

Step 6: Documentation

Add inline comments explaining:
- Why native vs container choice was made
- Any workarounds or special considerations
- Dependencies and assumptions
- Reference to relevant design pattern docs

Common Patterns by Service Type

Web Application

✅ Reverse proxy (Caddy)
✅ Gatus health check contribution
✅ Backup (if stores data)
⚠️ Metrics (only if critical)

Database

✅ Backup with snapshots
✅ Metrics (postgres_exporter, etc.)
✅ Preseed/DR capability
✅ TCP health check (optional)

Infrastructure Service

✅ Systemd monitoring (Prometheus)
✅ Metrics (node_exporter or custom)
⚠️ Backup (if configuration is critical)

Monitoring Service

✅ Systemd health check
✅ Prometheus monitoring (meta-monitoring)
❌ NO recursive monitoring (avoid complexity)

Anti-Patterns to Avoid

❌ Don't create container version without checking for native module
- Always search nixpkgs first
- Containers should be last resort

❌ Don't duplicate functionality that exists in native modules
- Use native module features when available
- Only override what you need to change

❌ Don't skip standardized submodules
- Every service should use applicable patterns
- Consistency makes maintenance easier

❌ Don't create per-service backup scripts
- Use unified backup system
- Declare backup needs, don't implement them

❌ Don't implement custom metric exporters
- Use existing exporters when available
- Consider if metrics are actually needed (see Monitoring Strategy)

Migration Path for Existing Containers

If you have existing container-based services:

Check for native module (may not have existed when originally deployed)
Evaluate migration effort vs benefits
Save container version as .container-backup file
Implement native wrapper with same functionality
Test thoroughly before removing container
Document architecture change with reasoning

Shared Types Library

All standardized submodule types are centralized in lib/types/ (with lib/types.nix kept as a compatibility wrapper) to ensure consistency and reusability across services.

Import Pattern

# Preferred: Use mylib (injected via _module.args)
sharedTypes = mylib.types;

# Alternative: Direct import (for non-module contexts)
sharedTypes = import ../../../lib/types.nix { inherit lib; };

Available Shared Types

sharedTypes.reverseProxySubmodule - Reverse proxy integration with TLS backend support
sharedTypes.metricsSubmodule - Prometheus metrics collection with advanced labeling
sharedTypes.loggingSubmodule - Log shipping with multiline parsing, regex support, and container driver config
sharedTypes.backupSubmodule - Backup integration with retention policies
sharedTypes.notificationSubmodule - Notification channels with escalation
sharedTypes.containerResourcesSubmodule - Container resource management
sharedTypes.datasetSubmodule - ZFS dataset configuration (recordsize, compression, properties)
sharedTypes.healthcheckSubmodule - Container healthcheck configuration (interval, timeout, retries)
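
To make the idea of a shared type concrete, here is a hedged sketch of how a metrics submodule could be defined. The field names follow the ones referenced by the auto-registration description in this document (`port`, `path`, `interface`, `scrapeInterval`, `labels`); the defaults and the file layout are assumptions, and the real definition in lib/types/ may differ.

```nix
# Sketch of a shared type file, imported as: import ./metrics.nix { inherit lib; }
{ lib }:
{
  metricsSubmodule = lib.types.submodule {
    options = {
      enable = lib.mkEnableOption "Prometheus metrics collection";

      port = lib.mkOption {
        type = lib.types.port;
        description = "Port the metrics endpoint listens on";
      };

      path = lib.mkOption {
        type = lib.types.str;
        default = "/metrics";
        description = "HTTP path of the metrics endpoint";
      };

      interface = lib.mkOption {
        type = lib.types.str;
        default = "127.0.0.1"; # assumed default
        description = "Address Prometheus should scrape";
      };

      scrapeInterval = lib.mkOption {
        type = lib.types.str;
        default = "60s"; # assumed default
        description = "How often Prometheus scrapes this target";
      };

      labels = lib.mkOption {
        type = lib.types.attrsOf lib.types.str;
        default = { };
        description = "Extra labels attached to every scraped series";
      };
    };
  };
}
```

Defining the type once means a typo such as `prot = 9187` is rejected at evaluation time in every service that uses it.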

Standardized Submodule Patterns

1. Reverse Proxy Integration

All web services should use the shared reverse proxy type for consistent Caddy integration.

Implementation Pattern

# Use shared type instead of inline definition
reverseProxy = mkOption {
  type = types.nullOr sharedTypes.reverseProxySubmodule;
  default = null;
  description = "Reverse proxy configuration for this service";
};

Auto-Registration Implementation

config = mkIf cfg.enable {
  # Automatic Caddy registration
  modules.services.caddy.virtualHosts."${serviceName}" = mkIf (cfg.reverseProxy != null) {
    enable = cfg.reverseProxy.enable;
    hostName = cfg.reverseProxy.hostName;
    backend = cfg.reverseProxy.backend;
    auth = cfg.reverseProxy.auth;
  };
};

2. Metrics Collection Pattern

Services that expose metrics should use the shared metrics type for automatic Prometheus integration.

Implementation Pattern

# Use shared type with service-specific defaults
metrics = mkOption {
  type = types.nullOr sharedTypes.metricsSubmodule;
  default = {
    enable = true;
    port = 9187;  # Service-specific port (postgres_exporter default; 9090 is Prometheus itself)
    path = "/metrics";
    labels = {
      service_type = "database";
      exporter = "postgres";
      function = "storage";
    };
  };
  description = "Prometheus metrics collection configuration";
};

Auto-Registration Implementation

# No explicit registration required - the observability module automatically
# scans all enabled services under `config.modules.services.*` for metrics submodules

config = mkIf cfg.enable {
  # Services are automatically discovered when they define a metrics submodule
  # The observability module uses discoverMetricsTargets() to find all services with:
  # - (service.metrics or null) != null
  # - (service.metrics.enable or false) == true

  # Generated Prometheus scrape config will include:
  # - job_name: "service-${serviceName}"
  # - targets: ["${interface}:${port}"]
  # - metrics_path: "${path}"
  # - scrape_interval: "${scrapeInterval}"
  # - labels: service.metrics.labels + { service = serviceName; instance = hostName; }
};
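
The discovery step described in the comments above can be sketched as a small pure function. This is an illustration under assumptions, not the actual `discoverMetricsTargets` from the observability module: the argument order, the file name `discover-metrics.nix`, and the `interface` and `scrapeInterval` field names are guesses based on the generated fields listed above.

```nix
# discover-metrics.nix (hypothetical file name)
{ lib }:
let
  # True when a service declares a non-null metrics submodule with enable = true
  hasMetrics = svc:
    (svc.metrics or null) != null && (svc.metrics.enable or false);

  # Build one Prometheus scrape config from a service's metrics submodule
  toScrapeConfig = hostName: name: svc: {
    job_name = "service-${name}";
    metrics_path = svc.metrics.path;
    scrape_interval = svc.metrics.scrapeInterval;
    static_configs = [{
      targets = [ "${svc.metrics.interface}:${toString svc.metrics.port}" ];
      # Service-provided labels, with service and instance always set
      labels = (svc.metrics.labels or { }) // { service = name; instance = hostName; };
    }];
  };
in
{
  # hostName: string; services: attribute set such as config.modules.services
  discoverMetricsTargets = hostName: services:
    lib.mapAttrsToList (toScrapeConfig hostName) (lib.filterAttrs (_: hasMetrics) services);
}
```

A consumer could then assign the result to `services.prometheus.scrapeConfigs`, passing `config.networking.hostName` and `config.modules.services` as arguments.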

3. Log Shipping Pattern

Services that produce logs should use the shared logging type for automatic Promtail/Loki integration.

Implementation Pattern

# Use shared type with enhanced parsing capabilities
logging = mkOption {
  type = types.nullOr sharedTypes.loggingSubmodule;
  default = {
    enable = true;
    journalUnit = "${serviceName}.service";
    labels = {
      service = serviceName;
      service_type = "application";
    };
    parseFormat = "json";  # or "logfmt", "regex", "multiline", "none"
  };
  description = "Log shipping configuration";
};

# Advanced parsing example
logging = mkOption {
  type = types.nullOr sharedTypes.loggingSubmodule;
  default = {
    enable = true;
    logFiles = [ "/var/log/app/error.log" ];
    parseFormat = "multiline";
    multilineConfig = {
      firstLineRegex = "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}";
      maxWaitTime = "3s";
    };
  };
};

Auto-Registration Implementation

# Logging auto-registration follows the same pattern as metrics
# The observability module automatically discovers services with logging submodules

config = mkIf cfg.enable {
  # Promtail automatically discovers services with logging.enable = true
  # Generated configuration includes:
  # - job_name: "service-${serviceName}"
  # - journalUnit or logFiles based on configuration
  # - labels: service.logging.labels + { service = serviceName; }
  # - parseFormat for structured log processing
};
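
A comparable sketch for log discovery is shown below. It maps each enabled logging submodule to a Promtail scrape config, using a journal match when `journalUnit` is set and file targets otherwise. The function name `discoverLogTargets` is hypothetical, and the sketch leaves out the `parseFormat` pipeline stages entirely.

```nix
{ lib }:
let
  # True when a service declares a non-null logging submodule with enable = true
  hasLogging = svc:
    (svc.logging or null) != null && (svc.logging.enable or false);

  toScrapeConfig = name: svc:
    let
      labels = (svc.logging.labels or { }) // { service = name; };
    in
    if (svc.logging.journalUnit or null) != null then {
      job_name = "service-${name}";
      # Read from the systemd journal, filtered to this unit
      journal = {
        matches = "_SYSTEMD_UNIT=${svc.logging.journalUnit}";
        inherit labels;
      };
    } else {
      job_name = "service-${name}";
      # One static target per log file; __path__ tells Promtail what to tail
      static_configs = map
        (path: {
          targets = [ "localhost" ];
          labels = labels // { __path__ = path; };
        })
        (svc.logging.logFiles or [ ]);
    };
in
{
  # services: attribute set such as config.modules.services
  discoverLogTargets = services:
    lib.mapAttrsToList toScrapeConfig (lib.filterAttrs (_: hasLogging) services);
}
```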

4. Unified Backup Integration Pattern ✅ UPDATED 2025-10-29

All stateful services should use the unified backup system with the shared backup type for consistent policy management and automatic discovery.

Implementation Pattern

# Use shared type with service-specific configuration
backup = mkOption {
  type = types.nullOr sharedTypes.backupSubmodule;
  default = {
    enable = true;
    repository = "nas-primary";
    frequency = "daily";
    tags = [ "service-type" "service-name" "data-category" ];
    useSnapshots = false;  # Opt-in for ZFS snapshot coordination
    excludePatterns = [
      "**/cache/**"
      "**/tmp/**"
      "**/*.log"
    ];
  };
  description = "Backup configuration for unified backup system";
};

ZFS Snapshot Integration (Opt-in)

For services that need consistent backups (databases, applications with locks, etc.), enable ZFS snapshot coordination:

# For services requiring snapshot consistency (databases, locked files)
backup = {
  enable = true;
  repository = "nas-primary";
  useSnapshots = true;        # Enable snapshot coordination
  zfsDataset = "tank/services/myservice";  # Required when useSnapshots=true
  frequency = "daily";
  tags = [ "database" "myservice" "critical" ];
  excludePatterns = [
    "**/*.log"        # Exclude logs from snapshots
    "**/cache/**"     # Exclude cache directories
  ];
};

Working Example (from dispatcharr service):

backup = {
  enable = true;
  repository = "nas-primary";
  useSnapshots = true;
  zfsDataset = "tank/services/dispatcharr";
  frequency = "daily";
  tags = [ "iptv" "dispatcharr" "application" ];
};

When to Use Snapshots:
- ✅ Database services (PostgreSQL, SQLite) - consistency critical
- ✅ Application state (Sonarr, Radarr, etc.) - avoid corrupt configs
- ✅ Locked files - services that may have open file handles
- ❌ Static content (Plex media) - snapshots add unnecessary overhead
- ❌ Read-only data - content that doesn't change during backup
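
The requirement that `zfsDataset` be set whenever `useSnapshots = true` could be enforced at evaluation time with NixOS assertions. The sketch below is one possible way to do that and is not taken from the unified backup module; it assumes services live under `config.modules.services` and expose the backup fields shown above.

```nix
{ config, lib, ... }:
let
  # Services that declare a non-null backup submodule with enable = true
  backupEnabled = lib.filterAttrs
    (_: svc: (svc.backup or null) != null && (svc.backup.enable or false))
    config.modules.services;
in
{
  # One assertion per service: snapshots require a dataset to snapshot
  assertions = lib.mapAttrsToList
    (name: svc: {
      assertion =
        !(svc.backup.useSnapshots or false)
        || (svc.backup.zfsDataset or null) != null;
      message = "modules.services.${name}.backup: zfsDataset must be set when useSnapshots = true";
    })
    backupEnabled;
}
```

With a check like this, a missing dataset fails the build with a readable message instead of surfacing later as a failed backup job.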

Auto-Discovery Implementation

# No manual registration required!
# The unified backup system automatically discovers services with backup submodules
# Services simply declare their backup needs, system handles the rest

Key Changes from Legacy Pattern:
- ✅ Automatic Discovery: No manual job registration needed
- ✅ Opt-in Snapshots: Services declare useSnapshots = true when needed
- ✅ Unified Monitoring: All metrics flow through textfile collector
- ✅ Enterprise Verification: Automated integrity checks and restore testing

Critical Configuration Updates Required:

Services that handle databases, application state, or have file locking must enable snapshot coordination:

# Update these service backup configurations:
# 1. Sonarr - SQLite database, configuration files
modules.services.sonarr.backup = {
  enable = true;
  repository = "nas-primary";
  useSnapshots = true;                    # REQUIRED
  zfsDataset = "tank/services/sonarr";    # REQUIRED
  excludePatterns = [ "**/*.log" "**/cache/**" ];
};

# 2. Loki - Database and indexes
modules.services.loki.backup = {
  enable = true;
  repository = "nas-primary";
  useSnapshots = true;                    # REQUIRED
  zfsDataset = "tank/services/loki";      # REQUIRED
  excludePatterns = [ "**/*.tmp" ];
};

Migration Guide: See /docs/unified-backup-design-patterns.md for complete implementation details.

5. Notification Integration Pattern

Services should use the shared notification type for consistent alerting and status reporting.

Implementation Pattern

# Use shared type with escalation support
notifications = mkOption {
  type = types.nullOr sharedTypes.notificationSubmodule;
  default = {
    enable = true;
    channels = {
      onFailure = [ "critical-alerts" "team-slack" ];
      onBackup = [ "backup-status" ];
      onHealthCheck = [ "monitoring-alerts" ];
    };
    customMessages = {
      failure = "${serviceName} service failed on ${config.networking.hostName}";
      backup = "${serviceName} backup completed on ${config.networking.hostName}";
    };
    escalation = {
      afterMinutes = 15;
      channels = [ "on-call-pager" ];
    };
  };
  description = "Notification configuration";
};

6. Container Resource Management Pattern

Containerized services should use systemd resource limits and the shared container resources type.

Implementation Pattern

# For systemd services with resource limits
resources = mkOption {
  type = types.attrsOf types.str;
  default = {
    MemoryMax = "512M";
    MemoryLow = "256M";  # systemd's soft reservation; MemoryReservation= is not a systemd directive
    CPUQuota = "50%";
  };
  description = "Systemd resource limits";
};

# For Podman containers, use shared container resources type
container = mkOption {
  type = types.submodule {
    options = {
      resources = mkOption {
        type = sharedTypes.containerResourcesSubmodule;
        default = {
          memory = "512m";
          memoryReservation = "256m";
          cpus = "1.0";
          cpuQuota = "50%";
        };
      };
      # Additional container-specific options...
    };
  };
};
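
For Podman containers, the declared resources still have to reach the container runtime. A minimal sketch, assuming a hypothetical `myapp` module with the `container.resources` option above and a made-up image reference, passes them as `podman run` flags through the NixOS `virtualisation.oci-containers` options. `cpuQuota` is left out because Podman's `--cpu-quota` flag expects microseconds rather than a percentage, so the conversion is module-specific.

```nix
{ config, lib, ... }:
let
  # Hypothetical service module used only for this sketch
  cfg = config.modules.services.myapp;
  res = cfg.container.resources;
in
{
  config = lib.mkIf cfg.enable {
    virtualisation.oci-containers.containers.myapp = {
      image = "example.org/myapp:latest"; # hypothetical image reference
      extraOptions = [
        "--memory=${res.memory}" # hard memory limit
        "--memory-reservation=${res.memoryReservation}" # soft memory limit
        "--cpus=${res.cpus}" # CPU count limit
      ];
    };
  };
}
```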

Implementation Guidelines

Module Structure Template

Every service module should follow this structure:

# Service module template using shared types
{ config, lib, pkgs, mylib, ... }:

let
  inherit (lib) mkOption mkEnableOption mkIf types;
  cfg = config.modules.services.<service>;
  serviceName = "<service>";
  # Use mylib for shared types (injected via _module.args)
  sharedTypes = mylib.types;
in
{
  options.modules.services.<service> = {
    enable = mkEnableOption "<service> service";

    # Core service configuration
    dataDir = mkOption {
      type = types.path;
      default = "/var/lib/${serviceName}";
      description = "Data directory";
    };

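    port = mkOption {
      type = types.port;
      description = "Service port";
    };

    # Package providing the service binary (referenced by ExecStart below)
    package = mkOption {
      type = types.package;
      default = pkgs.<service>;
      description = "Package to use for the service";
    };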

    # Systemd resource limits
    resources = mkOption {
      type = types.attrsOf types.str;
      default = {
        MemoryMax = "256M";
        CPUQuota = "25%";
      };
      description = "Systemd resource limits";
    };

    # Standardized integration submodules using shared types
    reverseProxy = mkOption {
      type = types.nullOr sharedTypes.reverseProxySubmodule;
      default = null;
      description = "Reverse proxy configuration";
    };

    metrics = mkOption {
      type = types.nullOr sharedTypes.metricsSubmodule;
      default = null;
      description = "Prometheus metrics collection";
    };

    logging = mkOption {
      type = types.nullOr sharedTypes.loggingSubmodule;
      default = null;
      description = "Log shipping configuration";
    };

    backup = mkOption {
      type = types.nullOr sharedTypes.backupSubmodule;
      default = null;
      description = "Backup configuration";
    };

    notifications = mkOption {
      type = types.nullOr sharedTypes.notificationSubmodule;
      default = null;
      description = "Notification configuration";
    };

    # ZFS integration pattern
    zfs = {
      dataset = mkOption {
        type = types.nullOr types.str;
        default = null;
        example = "tank/services/${serviceName}";
        description = "ZFS dataset to mount at dataDir";
      };

      properties = mkOption {
        type = types.attrsOf types.str;
        default = {
          compression = "zstd";
          atime = "off";
          "com.sun:auto-snapshot" = "true";
        };
        description = "ZFS dataset properties";
      };
    };
  };

  config = mkIf cfg.enable {
    # ZFS dataset configuration
    modules.storage.datasets.services.${serviceName} = mkIf (cfg.zfs.dataset != null) {
      mountpoint = cfg.dataDir;
      properties = cfg.zfs.properties;
      owner = serviceName;
      group = serviceName;
      mode = "0750";
    };

    # Core service implementation
    systemd.services."${serviceName}" = {
      description = "<Service> service";
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" ] ++ lib.optionals (cfg.zfs.dataset != null) [ "zfs-mount.service" ];
      wants = lib.optionals (cfg.zfs.dataset != null) [ "zfs-mount.service" ];

      serviceConfig = {
        ExecStart = "${cfg.package}/bin/${serviceName}";
        Restart = "always";
        User = serviceName;
        Group = serviceName;

        # Security hardening
        ProtectSystem = "strict";
        ProtectHome = true;
        PrivateTmp = true;
        NoNewPrivileges = true;

        # ProtectSystem = "strict" mounts the file system read-only,
        # so the data directory must be made writable explicitly
        ReadWritePaths = [ cfg.dataDir ];
      }
      # Resource limits: every entry in cfg.resources (MemoryMax, CPUQuota, ...)
      # is passed through as a systemd directive
      // cfg.resources;
    };

    # User/group creation
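    users.users."${serviceName}" = {
      isSystemUser = true;
      group = serviceName;
      # Keep home off the data directory: a home of dataDir makes permissions
      # revert to 700 (see Directory and Permission Management Pattern)
      home = "/var/empty";
    };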

    users.groups."${serviceName}" = {};

    # Auto-registration with infrastructure systems using structured backend configuration
    modules.services.caddy.virtualHosts.${serviceName} = mkIf (cfg.reverseProxy != null && cfg.reverseProxy.enable) {
      enable = true;
      hostName = cfg.reverseProxy.hostName;

      # Use structured backend configuration from shared types
      backend = cfg.reverseProxy.backend;

      # Authentication configuration from shared types
      auth = cfg.reverseProxy.auth;

      # Security configuration from shared types
      security = cfg.reverseProxy.security;

      # Additional configuration
      extraConfig = cfg.reverseProxy.extraConfig;
    };

    # Metrics auto-registration happens automatically via observability module
    # No explicit configuration needed - the observability module scans all
    # services under modules.services.* for metrics submodules

    # Firewall configuration (localhost only)
    networking.firewall = {
      interfaces.lo.allowedTCPPorts = [ cfg.port ]
        ++ lib.optional (cfg.metrics != null && cfg.metrics.enable) cfg.metrics.port;
    };

    # Directory ownership (if not using ZFS dataset)
    systemd.tmpfiles.rules = lib.mkIf (cfg.zfs.dataset == null) [
      "d ${cfg.dataDir} 0750 ${serviceName} ${serviceName} -"
    ];
  };
}
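
A host would then consume a module built from this template roughly as follows. This is a minimal sketch, not taken from the repository: `myservice`, the host name, the ports, and the dataset name are placeholders, and only submodule fields that the template itself references (`enable`, `hostName`, `backend.port`, `port`) are used. The shared types in lib/types.nix may require or offer more fields.

```nix
# Hypothetical host configuration for a service built from the template
{ ... }:
{
  modules.services.myservice = {
    enable = true;
    port = 8080;

    # Picked up by the Caddy auto-registration in the template
    reverseProxy = {
      enable = true;
      hostName = "myservice.example.com";
      backend.port = 8080;  # expected to equal `port`
    };

    # Discovered by the observability module; no further wiring here
    metrics = {
      enable = true;
      port = 9100;
    };

    # Opt in to a dedicated ZFS dataset mounted at dataDir
    zfs.dataset = "tank/services/myservice";
  };
}
```

Integrations left at their `null` default (logging, backup, notifications in this sketch) are simply skipped by the `mkIf` guards.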

Validation and Assertions¶

Every module should include comprehensive validation:

config = mkIf cfg.enable {
  assertions = [
    {
      assertion = cfg.reverseProxy == null || cfg.reverseProxy.backend.port == cfg.port;
      message = "Reverse proxy backend port must match service port";
    }
    {
      assertion = cfg.metrics == null || cfg.metrics.port != cfg.port;
      message = "Metrics port must be different from service port";
    }
    # Additional validations...
  ];
};
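
To see how such assertions behave, the following standalone sketch evaluates the two checks above against a deliberately misconfigured `cfg`. The `cfg` value is a hand-written stand-in rather than real module output, and the final filter is only an approximation of how NixOS collects failed assertions. It uses builtins only, so it should evaluate with `nix-instantiate --eval --strict` without nixpkgs.

```nix
let
  # Hypothetical stand-in for a module's cfg: metrics shares the service port
  cfg = {
    port = 8080;
    reverseProxy = null;
    metrics = { enable = true; port = 8080; };
  };

  assertions = [
    {
      # `||` short-circuits, so backend.port is never touched when reverseProxy is null
      assertion = cfg.reverseProxy == null || cfg.reverseProxy.backend.port == cfg.port;
      message = "Reverse proxy backend port must match service port";
    }
    {
      assertion = cfg.metrics == null || cfg.metrics.port != cfg.port;
      message = "Metrics port must be different from service port";
    }
  ];
in
# Collect the messages of every assertion that does not hold
map (a: a.message) (builtins.filter (a: !a.assertion) assertions)
# => [ "Metrics port must be different from service port" ]
```

Because every check is guarded by a `== null ||` test, disabling an integration never triggers an evaluation error on its missing fields.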

7. ZFS Integration Pattern¶

Services with persistent storage should use the ZFS dataset pattern for optimized storage management.

Implementation Pattern¶

zfs = {
  dataset = mkOption {
    type = types.nullOr types.str;
    default = null;
    example = "tank/services/${serviceName}";
    description = "ZFS dataset to mount at dataDir";
  };

  properties = mkOption {
    type = types.attrsOf types.str;
    default = {
      compression = "zstd";
      atime = "off";
      "com.sun:auto-snapshot" = "true";
    };
    description = "ZFS dataset properties";
  };
};

# Auto-registration with storage module
modules.storage.datasets.services.${serviceName} = mkIf (cfg.zfs.dataset != null) {
  mountpoint = cfg.dataDir;
  properties = cfg.zfs.properties;
  owner = serviceName;
  group = serviceName;
  mode = "0750";
};
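
On the host side, opting in could look like the sketch below. The service name, dataset name, and the `recordsize` tuning are illustrative assumptions. One detail worth knowing is that a definition of `properties` replaces the option default as a whole (defaults have lower priority than regular definitions), so any default properties that should still apply need to be repeated.

```nix
# Hypothetical host configuration
{ ... }:
{
  modules.services.myservice.zfs = {
    dataset = "tank/services/myservice";

    # Defining properties discards the option default entirely,
    # so the defaults are restated alongside the per-service tuning
    properties = {
      compression = "zstd";
      atime = "off";
      "com.sun:auto-snapshot" = "true";
      recordsize = "16K";  # example of a per-service property
    };
  };
}
```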

8. Directory and Permission Management Pattern¶

Services must use systemd's native StateDirectory mechanism for directory ownership and permissions. This provides a single source of truth and prevents conflicts between tmpfiles and systemd.

Critical Design Principles¶

✅ DO: Native SystemD Services
Use StateDirectory + StateDirectoryMode for directory management
Let systemd create and manage directory ownership
Set UMask to control file creation permissions
NO tmpfiles rules for native services

❌ DON'T: Common Mistakes
Don't use tmpfiles for native systemd services
Don't set home directory to data directory (causes permission reversion)
Don't mix tmpfiles and StateDirectory
Don't rely on ZFS dataset properties for permissions

Implementation Pattern for Native Services¶

# Service module configuration
config = mkIf cfg.enable {
  # Set user home to /var/empty to prevent activation script interference
  users.users.${serviceName} = {
    isSystemUser = true;
    group = serviceName;
    home = lib.mkForce "/var/empty";  # CRITICAL: Prevents 700 permission enforcement
  };

  # SystemD service configuration
  systemd.services.${serviceName} = {
    serviceConfig = {
      # StateDirectory tells systemd to create /var/lib/${serviceName}
      # with ownership set to User:Group
      StateDirectory = serviceName;

      # StateDirectoryMode sets directory permissions (750 = rwxr-x---)
      StateDirectoryMode = "0750";

      # UMask ensures files created by service are 640 (rw-r-----)
      UMask = "0027";

      # User/Group are set by the service (usually from upstream module)
      User = serviceName;
      Group = serviceName;
    };
  };

  # For ZFS datasets, only specify the dataset - NO owner/group/mode
  modules.storage.datasets.services.${serviceName} = mkIf (cfg.zfs.dataset != null) {
    mountpoint = cfg.dataDir;
    properties = cfg.zfs.properties;
    # DO NOT SET: owner, group, mode - these interfere with StateDirectory
  };
};

Implementation Pattern for OCI Containers¶

OCI containers don't support StateDirectory, so they must use tmpfiles:

# Service module configuration for OCI containers
config = mkIf cfg.enable {
  users.users.${serviceName} = {
    isSystemUser = true;
    group = serviceName;
    home = "/var/empty";
  };

  # For OCI containers, use ZFS dataset WITH explicit permissions
  modules.storage.datasets.services.${serviceName} = mkIf (cfg.zfs.dataset != null) {
    mountpoint = cfg.dataDir;
    properties = cfg.zfs.properties;
    owner = serviceName;
    group = serviceName;
    mode = "0750";
    # Note: OCI containers don't support StateDirectory, so we explicitly set
    # permissions via tmpfiles (handled by storage module)
  };
};
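
The container itself then only needs to bind-mount the dataset-backed directory. The sketch below assumes the NixOS `virtualisation.oci-containers` module is used to run the container. The service name, the image reference, and the in-container path `/data` are placeholders.

```nix
# Hypothetical container wiring for an OCI-based service
{ config, lib, ... }:
let
  cfg = config.modules.services.myservice;
in
{
  config = lib.mkIf cfg.enable {
    virtualisation.oci-containers.containers.myservice = {
      image = "registry.example.com/myservice:1.0";  # placeholder image

      # Ownership and mode of the host directory come from the dataset's
      # owner/group/mode, not from the container definition
      volumes = [ "${cfg.dataDir}:/data" ];
    };
  };
}
```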

Storage Module Smart Detection¶

The storage module automatically detects the service type and applies the correct pattern:

# In storage/datasets.nix
systemd.tmpfiles.rules = lib.flatten (lib.mapAttrsToList (serviceName: serviceConfig:
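  let
    # Mount point of this dataset (the directory the rules apply to)
    mountpoint = serviceConfig.mountpoint;

    # Check if explicit permissions are set (OCI containers)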
    hasExplicitPermissions = (serviceConfig.mode or null) != null
                          && (serviceConfig.owner or null) != null
                          && (serviceConfig.group or null) != null;
  in
    if hasExplicitPermissions then [
      # OCI containers: Use explicit permissions via tmpfiles
      "d \"${mountpoint}\" ${serviceConfig.mode} ${serviceConfig.owner} ${serviceConfig.group} - -"
      "z \"${mountpoint}\" ${serviceConfig.mode} ${serviceConfig.owner} ${serviceConfig.group} - -"
    ] else [
      # Native services: No tmpfiles rules - rely on StateDirectory
      # (tmpfiles with "-" defaults to root:root which interferes)
    ]
) cfg.services);
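
The detection logic can be tried in isolation. The sketch below reimplements it with builtins only, so it should evaluate with `nix-instantiate --eval --strict` without nixpkgs, and it feeds in two hand-written sample datasets. The sample data and the name `rulesFor` are illustrative and not part of the storage module.

```nix
let
  # Sample dataset definitions: one with explicit permissions, one without
  services = {
    sonarr = { mountpoint = "/var/lib/sonarr"; owner = "sonarr"; group = "sonarr"; mode = "0750"; };
    loki = { mountpoint = "/var/lib/loki"; };
  };

  # Emit tmpfiles rules only when mode, owner, and group are all set
  rulesFor = name: svc:
    let
      hasExplicitPermissions =
        (svc.mode or null) != null
        && (svc.owner or null) != null
        && (svc.group or null) != null;
    in
    if hasExplicitPermissions then [
      "d \"${svc.mountpoint}\" ${svc.mode} ${svc.owner} ${svc.group} - -"
      "z \"${svc.mountpoint}\" ${svc.mode} ${svc.owner} ${svc.group} - -"
    ] else [ ];
in
# Flatten the per-service lists into one list of rules
builtins.concatLists (builtins.attrValues (builtins.mapAttrs rulesFor services))
```

With this input the result contains the `d` and `z` rules for sonarr and nothing for loki, which matches the split between datasets with explicit permissions and those without.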

Permission Architecture Summary¶

| Service Type | Directory Creation | Permission Management | Home Directory |
|---|---|---|---|
| Native SystemD | `StateDirectory` | `StateDirectoryMode` + `UMask` | `/var/empty` |
| Native + ZFS | ZFS dataset | `owner`/`group`/`mode` in dataset config | `/var/empty` |
| OCI Container | tmpfiles | `mode`/`owner`/`group` in dataset config | `/var/empty` |

IMPORTANT: When using ZFS datasets, StateDirectory only manages permissions for directories it creates. If a ZFS dataset is already mounted at the path, StateDirectory cannot change its permissions. You must explicitly set owner, group, and mode in the dataset configuration.

Examples from Working Implementations¶

Grafana (Native Service with ZFS):

users.users.grafana.home = lib.mkForce "/var/empty";

systemd.services.grafana.serviceConfig = {
  StateDirectory = "grafana";
  StateDirectoryMode = "0750";
  UMask = "0027";
};

# ZFS dataset - MUST include owner/group/mode since ZFS mountpoint pre-exists
# StateDirectory cannot change permissions on pre-existing directories
modules.storage.datasets.services.grafana = {
  mountpoint = "/var/lib/grafana";
  owner = "grafana";
  group = "grafana";
  mode = "0750";
};

Sonarr (OCI Container):

users.users.sonarr.home = "/var/empty";

# No StateDirectory (not supported by OCI)

# ZFS dataset - WITH owner/group/mode
modules.storage.datasets.services.sonarr = {
  mountpoint = "/var/lib/sonarr";
  owner = "sonarr";
  group = "sonarr";
  mode = "0750";
  # Note: OCI containers don't support StateDirectory
};

Backup User Group Membership¶

For backup integration, ensure the backup user can read service data:

# Add service groups to backup user
users.users.restic-backup.extraGroups = [
  "grafana"
  "loki"
  "plex"
  "promtail"
  # Add all services that need backup
];

With 750 permissions (rwxr-x---) on directories and UMask=0027 (new files are created as 640, rw-r-----), the backup user, as a member of the service group, can traverse the directories and read the files.

Common Pitfalls and Solutions¶

Problem: Permissions revert to 700 after nixos-rebuild
Cause: User home directory set to data directory
Solution: Set home = lib.mkForce "/var/empty"

Problem: Directory owned by root:root instead of service user
Cause: tmpfiles rule with "-" for user/group
Solution: Remove tmpfiles rule, use StateDirectory instead

Problem: Backup fails with permission denied
Cause: Files created with 600 permissions (user-only)
Solution: Add exclude patterns for security-sensitive files

Problem: Service fails to start after migration
Cause: StateDirectory not set, directory doesn't exist
Solution: Add StateDirectory = serviceName to serviceConfig

Validation Checklist¶

After implementing directory management:

# 1. Check StateDirectory configuration
systemctl show <service>.service -p StateDirectory -p StateDirectoryMode

# 2. Verify directory ownership
ls -ld /var/lib/<service>  # Should be <service>:<service> drwxr-x---

# 3. Check user home directory
getent passwd <service>  # Should show /var/empty

# 4. Verify tmpfiles rules
systemd-tmpfiles --cat-config | grep <service>
# Native services: Should NOT have tmpfiles entries
# OCI containers: Should have "d" and "z" entries with explicit permissions

# 5. Test permission persistence
sudo systemctl restart <service>.service
ls -ld /var/lib/<service>  # Should still be drwxr-x---

Helper Functions¶

Reusable helper functions for common patterns live in lib/ and in the storage module:

lib/types.nix - ✅ Implemented - Shared type definitions (split into lib/types/*.nix)
lib/monitoring-helpers.nix - ✅ Implemented - Metrics and alert configuration
lib/backup-helpers.nix - ✅ Implemented - Backup job generation
lib/caddy-helpers.nix - ✅ Implemented - Reverse proxy configuration
modules/nixos/storage/helpers-lib.nix - ✅ Implemented - Storage/preseed helpers (via mylib.storageHelpers pkgs)

Migration Strategy¶

Phase 1: Documentation and Standards ✅ COMPLETED¶

✅ Document patterns (this document)
✅ Update copilot-instructions.md with pattern requirements
✅ Create helper libraries (lib/types.nix, lib/types/*.nix)

Phase 2: Infrastructure Implementation ✅ COMPLETED¶

✅ Implement shared type definitions (lib/types.nix)
✅ Create standardized Caddy auto-registration
✅ Implement observability stack (Loki, Promtail)
🔄 Centralized notification system (planned)
🔄 Enhanced backup orchestration (planned)

Phase 3: Service Migration 🔄 IN PROGRESS¶

✅ Migrated core observability services (Grafana, Loki, Promtail)
✅ Migrated web services with reverse proxy patterns
🔄 Migrate remaining containerized services
🔄 Audit and update legacy service configurations
🔄 Deprecate inline type definitions in favor of shared types

Phase 4: Validation and Testing 📋 PLANNED¶

Implement module validation tests
Add integration tests for auto-registration
Document troubleshooting guides
Create migration validation checklist

Current Migration Status¶

✅ Completed Services:
Grafana - Full standardized pattern implementation
Loki - Complete observability stack with ZFS integration
Promtail - Advanced log shipping with multiline parsing
Caddy - Structured backend configuration with auto-registration

🔄 In Progress:
Remaining containerized services (UniFi, Omada, etc.)
Prometheus/Grafana integration for auto-discovery
Centralized backup orchestration

📋 Next Priority:
Migrate all remaining services to use shared types
Implement centralized observability auto-registration
Complete notification system integration

Best Practices¶

Type Safety¶

Always use types.submodule for complex configuration
Provide clear descriptions and examples
Use mkEnableOption for boolean features
Validate configuration with assertions
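
These points can be combined in a few lines. The sketch below is a standalone illustration that assumes `<nixpkgs/lib>` is resolvable through NIX_PATH. The option name `feature` and the submodule are made up for the example and are not one of the shared types.

```nix
let
  lib = import <nixpkgs/lib>;
  inherit (lib) mkOption mkEnableOption types;

  # Hypothetical submodule: a boolean toggle plus a typed, documented port
  featureSubmodule = types.submodule {
    options = {
      enable = mkEnableOption "example feature";
      port = mkOption {
        type = types.port;
        default = 9100;
        example = 9187;
        description = "Port the example feature listens on";
      };
    };
  };

  result = lib.evalModules {
    modules = [
      # Option declaration: null means the feature is not configured at all
      {
        options.feature = mkOption {
          type = types.nullOr featureSubmodule;
          default = null;
          description = "Example feature configuration";
        };
      }
      # Option definition, as a host would write it
      { feature.enable = true; }
    ];
  };
in
{ inherit (result.config.feature) enable port; }
# => { enable = true; port = 9100; }
```

Setting `feature.port` to a value outside the valid port range, or to a string, fails during evaluation instead of surfacing later at runtime.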

Security¶

Default to secure configurations
Use systemd security directives
Handle secrets via SOPS and environment variables
Never expose credentials in process lists or logs
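
One way to follow the secrets guidance is sketched below. It assumes the sops-nix NixOS module is imported and a default SOPS file is configured. The secret name `myservice/env` and the service name are placeholders.

```nix
# Hypothetical secret wiring using sops-nix
{ config, ... }:
{
  # Decrypted at activation time to a file outside the Nix store
  sops.secrets."myservice/env" = {
    mode = "0400";
    restartUnits = [ "myservice.service" ];
  };

  systemd.services.myservice.serviceConfig = {
    # File of KEY=value lines loaded by systemd into the service environment;
    # the secret never appears on the command line or in the unit file
    EnvironmentFile = config.sops.secrets."myservice/env".path;
  };
}
```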

Maintainability¶

Keep implementation details in helper functions
Use descriptive option names
Provide migration paths for breaking changes
Document all breaking changes in commit messages

Performance¶

Use lazy evaluation where possible
Avoid expensive computations in option definitions
Cache complex derivations
Consider resource usage of generated configurations

This document serves as the authoritative guide for all new service modules and the target for migrating existing ones.
