Monitoring

The bundled monitoring stack uses Prometheus + Grafana with Docker-label discovery, so newly spawned world workers (per map or per zone group) show up automatically.

Where the stack lives

  • monitoring/ in the repository — prometheus.yml, the Grafana dashboards, and the alert rules.
  • Docker Compose profiles bring the stack up next to the cluster (docker compose --profile monitoring up).
  • Helm chart values switch between embedded Prometheus and an external one if you already operate cluster-wide observability.
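Docker-label discovery is typically wired up with Prometheus's docker_sd_configs. The sketch below shows the general shape; the label names (prometheus.scrape, prometheus.port) and job name are illustrative, not necessarily what the repository's prometheus.yml uses:

```yaml
scrape_configs:
  - job_name: splintertree-workers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Only scrape containers that opt in via a Docker label.
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to the port advertised in a label.
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: "([^:]+)(?::\\d+)?;(\\d+)"
        replacement: "$1:$2"
        target_label: __address__
```

Because discovery is label-driven, the orchestrator does not need to touch Prometheus when it spawns or drains a worker; the container simply appears in (or disappears from) the target list on the next refresh.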

What is exported

Every NATS publish increments a Prometheus counter labelled by subject. Workers also export:

  • splintertree_world_active_players{map_id, instance_id, zone_group_id} — current population per shard.
  • splintertree_world_tick_seconds_bucket{map_id, instance_id, zone_group_id} — tick duration histogram.
  • splintertree_orchestrator_spawn_total{outcome} — spawn / drain counters.
  • splintertree_gateway_sessions_active — connected sessions per gateway replica.
  • splintertree_web_api_request_seconds_bucket{route, method, status} — HTTP latency histograms.
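The histogram metrics are meant to be queried with histogram_quantile. A few example PromQL queries over the series above (the outcome="failure" label value is an assumption; check the orchestrator's actual label values):

```promql
# p95 tick duration per map over the last 5 minutes
histogram_quantile(0.95,
  sum by (le, map_id) (rate(splintertree_world_tick_seconds_bucket[5m])))

# Total population per map across all shards
sum by (map_id) (splintertree_world_active_players)

# Spawn-failure ratio over the last 10 minutes
sum(rate(splintertree_orchestrator_spawn_total{outcome="failure"}[10m]))
  / sum(rate(splintertree_orchestrator_spawn_total[10m]))
```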

Alerts

Default alert rules cover:

  • Auth latency p99 > 500 ms.
  • Map worker tick p95 > 200 ms.
  • Orchestrator spawn-failure ratio > 5%.
  • Gateway dropped sessions / minute spiking 3× over baseline.

Tune them under monitoring/alerts/ to taste.
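As a reference point, the tick-duration alert can be expressed as a Prometheus rule roughly like this (a sketch only; the group name, severity label, and exact expression in monitoring/alerts/ may differ):

```yaml
groups:
  - name: splintertree-world
    rules:
      - alert: WorldTickSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, map_id) (rate(splintertree_world_tick_seconds_bucket[5m]))
          ) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Map {{ $labels.map_id }} p95 tick above 200 ms"
```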

Logging

Workers log structured JSON to stdout; the Helm chart ships a sample Vector / Loki configuration if you want centralised logs. Per-session correlation IDs flow through every NATS message so traces can follow a player across crates.
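A structured log line carrying the per-session correlation ID might look like the following (field names and values are illustrative; the real schema is defined in the worker crates):

```json
{
  "timestamp": "2024-05-04T12:31:07.412Z",
  "level": "info",
  "message": "player entered zone",
  "map_id": "emberfall",
  "zone_group_id": 3,
  "correlation_id": "9f1c2d4e-7ab0-4c11-9d52-0e6b3f8a21c7"
}
```

Filtering on correlation_id in Loki (or any log store) then reconstructs a single player's path through the gateway, orchestrator, and world workers.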

Tracing

OpenTelemetry exporters wire up through environment variables (OTEL_EXPORTER_OTLP_*). The web API and the orchestrator emit spans by default; gameplay-hot paths in splintertree-world are sampled at 1% to keep overhead low.
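The standard OTLP environment variables are enough to point the exporters at a collector. A minimal sketch, assuming a collector reachable at otel-collector:4317 (the endpoint and service name are placeholders):

```shell
# OTLP exporter target and protocol (standard OpenTelemetry variables).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="splintertree-web-api"
# Head sampling: keep 1% of traces, matching the gameplay-hot-path rate above.
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.01"
```

The same variables work for the orchestrator and world workers; only OTEL_SERVICE_NAME changes per component.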