# Monitoring
The bundled monitoring stack uses Prometheus and Grafana with Docker-label discovery, so newly spawned world workers (per map or per zone group) show up automatically.
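As a sketch of how that discovery works, Prometheus's built-in `docker_sd_configs` can watch the local Docker daemon and keep only containers that opt in via a label. The label names (`splintertree.scrape`, `splintertree.map_id`) are assumptions for illustration, not necessarily what the repository's `prometheus.yml` uses:

```yaml
scrape_configs:
  - job_name: splintertree-workers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock   # discover containers on the local daemon
    relabel_configs:
      # Keep only containers that opt in via a Docker label (hypothetical label name).
      - source_labels: [__meta_docker_container_label_splintertree_scrape]
        regex: "true"
        action: keep
      # Carry the shard identity through to the scraped series.
      - source_labels: [__meta_docker_container_label_splintertree_map_id]
        target_label: map_id
```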
## Where the stack lives
- `monitoring/` in the repository holds `prometheus.yml`, the Grafana dashboards, and the alert rules.
- Docker Compose profiles bring the stack up next to the cluster (`docker compose --profile monitoring up`).
- Helm chart values switch between embedded Prometheus and an external one if you already operate cluster-wide observability.
## What is exported
Every NATS publish increments a Prometheus counter labelled by subject. Workers also export:
- `splintertree_world_active_players{map_id, instance_id, zone_group_id}`: current population per shard.
- `splintertree_world_tick_seconds_bucket{map_id, instance_id, zone_group_id}`: tick duration histogram.
- `splintertree_orchestrator_spawn_total{outcome}`: spawn / drain counters.
- `splintertree_gateway_sessions_active`: connected sessions per gateway replica.
- `splintertree_web_api_request_seconds_bucket{route, method, status}`: HTTP latency histograms.
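A minimal sketch of how a world worker might register and update the first two metrics with the `prometheus` crate. The metric names match the list above; the bucket boundaries and the `record_tick` hook are assumptions, not the actual worker code:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, register_int_gauge_vec, HistogramVec, IntGaugeVec};

static ACTIVE_PLAYERS: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "splintertree_world_active_players",
        "Current population per shard.",
        &["map_id", "instance_id", "zone_group_id"]
    )
    .expect("register gauge")
});

static TICK_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "splintertree_world_tick_seconds",
        "World tick duration in seconds.",
        &["map_id", "instance_id", "zone_group_id"],
        vec![0.01, 0.025, 0.05, 0.1, 0.2, 0.5] // buckets chosen to bracket the 200 ms alert
    )
    .expect("register histogram")
});

// Hypothetical hook called at the end of every world tick.
fn record_tick(map: &str, instance: &str, zone: &str, players: i64, secs: f64) {
    let labels = [map, instance, zone];
    ACTIVE_PLAYERS.with_label_values(&labels).set(players);
    TICK_SECONDS.with_label_values(&labels).observe(secs);
}
```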
## Alerts
Default alert rules cover:
- Auth latency p99 > 500 ms.
- Map worker tick p95 > 200 ms.
- Orchestrator spawn-failure ratio > 5%.
- Gateway dropped sessions / minute spiking 3× over baseline.
Tune them under `monitoring/alerts/` to taste.
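For orientation, the tick-latency rule could be expressed like this; the rule name, `for` window, and severity label are assumptions rather than the shipped defaults:

```yaml
groups:
  - name: splintertree-world
    rules:
      - alert: WorldTickP95High
        expr: >
          histogram_quantile(0.95,
            sum by (le, map_id) (rate(splintertree_world_tick_seconds_bucket[5m])))
          > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "World tick p95 above 200 ms on map {{ $labels.map_id }}"
```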
## Logging
Workers log structured JSON to stdout; the Helm chart ships a sample Vector / Loki configuration if you want centralised logs. Per-session correlation IDs flow through every NATS message so traces can follow a player across crates.
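A hypothetical sketch of the correlation-ID propagation pattern using the `async_nats` crate; the header name, subject, and payload here are assumptions, and only the header-carrying pattern is the point:

```rust
use async_nats::HeaderMap;

async fn publish_with_correlation(
    client: &async_nats::Client,
    correlation_id: &str,
    payload: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let mut headers = HeaderMap::new();
    headers.insert("X-Correlation-Id", correlation_id); // downstream workers log/trace this
    client
        .publish_with_headers("world.session.event", headers, payload.to_vec().into())
        .await?;
    Ok(())
}
```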
## Tracing
OpenTelemetry exporters are wired up through environment variables (`OTEL_EXPORTER_OTLP_*`). The web API and the orchestrator emit spans by default; hot gameplay paths in `splintertree-world` are sampled at 1% to keep overhead low.
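As a starting point, the standard OTLP exporter variables look like this; the collector endpoint is an assumption, and the sampler variables shown are the stock OpenTelemetry way to express a 1% ratio:

```sh
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317   # assumed collector address
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=splintertree-world
export OTEL_TRACES_SAMPLER=parentbased_traceidratio             # ratio-based sampling...
export OTEL_TRACES_SAMPLER_ARG=0.01                             # ...at 1%
```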