Monitoring

The bundled monitoring stack uses Prometheus + Grafana with Docker-label discovery, so newly spawned world workers (per map or per zone group) show up automatically.

Where the stack lives

  • monitoring/ in the repository — prometheus.yml, the Grafana dashboards, and the alert rules.
  • Docker Compose profiles bring the stack up next to the cluster (docker compose --profile monitoring up).
  • Helm chart values switch between embedded Prometheus and an external one if you already operate cluster-wide observability.
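Docker-label discovery is typically wired up with Prometheus's docker_sd_configs. The sketch below shows the general shape; the label names (prometheus.scrape, prometheus.port) and job name are illustrative, not necessarily what the repository's prometheus.yml uses:

```yaml
scrape_configs:
  - job_name: splintertree-workers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Only scrape containers that opt in via a Docker label.
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to the port advertised in a label.
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: "([^:]+)(?::\\d+)?;(\\d+)"
        replacement: "$1:$2"
        target_label: __address__
```

Because discovery is label-driven, the orchestrator does not need to touch Prometheus when it spawns or drains a worker; the container simply appears in (or disappears from) the target list on the next refresh.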

What is exported

Every NATS publish increments a Prometheus counter labelled by subject. Workers also export:

  • splintertree_world_active_players{map_id, instance_id, zone_group_id} — current population per shard.
  • splintertree_world_tick_seconds_bucket{map_id, instance_id, zone_group_id} — tick duration histogram.
  • splintertree_orchestrator_spawn_total{outcome} — spawn / drain counters.
  • splintertree_gateway_sessions_active — connected sessions per gateway replica.
  • splintertree_web_api_request_seconds_bucket{route, method, status} — HTTP latency histograms.
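The histogram metrics are meant to be queried with histogram_quantile. A few example PromQL queries over the series above (the outcome="failure" label value is an assumption; check the orchestrator's actual label values):

```promql
# p95 tick duration per map over the last 5 minutes
histogram_quantile(0.95,
  sum by (le, map_id) (rate(splintertree_world_tick_seconds_bucket[5m])))

# Total population per map across all shards
sum by (map_id) (splintertree_world_active_players)

# Spawn-failure ratio over the last 10 minutes
sum(rate(splintertree_orchestrator_spawn_total{outcome="failure"}[10m]))
  / sum(rate(splintertree_orchestrator_spawn_total[10m]))
```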

Alerts

Default alert rules cover:

  • Auth latency p99 > 500 ms.
  • Map worker tick p95 > 200 ms.
  • Orchestrator spawn-failure ratio > 5%.
  • Gateway dropped sessions / minute spiking 3× over baseline.

Tune them under monitoring/alerts/ to taste.
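As a reference point, the tick-duration alert can be expressed as a Prometheus rule roughly like this (a sketch only; the group name, severity label, and exact expression in monitoring/alerts/ may differ):

```yaml
groups:
  - name: splintertree-world
    rules:
      - alert: WorldTickSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, map_id) (rate(splintertree_world_tick_seconds_bucket[5m]))
          ) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Map {{ $labels.map_id }} p95 tick above 200 ms"
```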

Logging

Workers log structured JSON to stdout; the Helm chart ships a sample Vector / Loki configuration if you want centralised logs. Per-session correlation IDs flow through every NATS message so traces can follow a player across crates.
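A structured log line carrying the per-session correlation ID might look like the following (field names and values are illustrative; the real schema is defined in the worker crates):

```json
{
  "timestamp": "2024-05-04T12:31:07.412Z",
  "level": "info",
  "message": "player entered zone",
  "map_id": "emberfall",
  "zone_group_id": 3,
  "correlation_id": "9f1c2d4e-7ab0-4c11-9d52-0e6b3f8a21c7"
}
```

Filtering on correlation_id in Loki (or any log store) then reconstructs a single player's path through the gateway, orchestrator, and world workers.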

Tracing

OpenTelemetry exporters wire up through environment variables (OTEL_EXPORTER_OTLP_*). The web API and the orchestrator emit spans by default; gameplay-hot paths in splintertree-world are sampled at 1% to keep overhead low.
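The standard OTLP environment variables are enough to point the exporters at a collector. A minimal sketch, assuming a collector reachable at otel-collector:4317 (the endpoint and service name are placeholders):

```shell
# OTLP exporter target and protocol (standard OpenTelemetry variables).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="splintertree-web-api"
# Head sampling: keep 1% of traces, matching the gameplay-hot-path rate above.
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.01"
```

The same variables work for the orchestrator and world workers; only OTEL_SERVICE_NAME changes per component.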