Monitoring, metrics, and tracing for Glyphoxa in production and development.


## πŸ“– Overview

Glyphoxa ships a comprehensive observability stack built on industry-standard components:

| Layer | Technology | Purpose |
| --- | --- | --- |
| Instrumentation | OpenTelemetry Go SDK | Traces and metrics recorded at every pipeline stage |
| Collection | Prometheus | Scrapes the `/metrics` endpoint; stores time-series data |
| Visualisation | Grafana | Pre-built dashboards for latency, throughput, and errors |
| Health probes | Built-in `/healthz` and `/readyz` | Kubernetes-style liveness and readiness checks |

All three services (Glyphoxa, Prometheus, Grafana) are defined in the Docker Compose file and activated with the `alpha` profile:

```sh
docker compose --profile alpha up
```

## πŸ”— OpenTelemetry Integration

### Provider Initialisation

The `observe.InitProvider` function (in `internal/observe/provider.go`) bootstraps the OpenTelemetry SDK at startup. It creates:

1. **MeterProvider** – backed by a Prometheus exporter bridge, so all OTel metrics are exposed on the standard `/metrics` HTTP endpoint.
2. **TracerProvider** – optionally batching spans to an OTLP exporter. When no `TraceExporter` is configured, spans are recorded in-process only (useful for dev / metrics-only deployments).

Both providers are registered as the global OTel providers via `otel.SetMeterProvider` and `otel.SetTracerProvider`.

```go
shutdown, err := observe.InitProvider(ctx, observe.ProviderConfig{
    ServiceName:    "glyphoxa",
    ServiceVersion: version,
    TraceExporter:  otlpExporter, // nil to disable export
})
if err != nil {
    log.Fatalf("init telemetry: %v", err)
}
defer shutdown(ctx)
```

The returned `shutdown` function flushes pending telemetry and closes exporters – call it in a `defer` from `main()`.

### Tracing

The `observe` package exposes convenience helpers for distributed tracing:

| Function | Description |
| --- | --- |
| `observe.Tracer()` | Returns the package-level `trace.Tracer` scoped to `github.com/MrWong99/glyphoxa` |
| `observe.StartSpan(ctx, name, ...opts)` | Starts a new span, returns `(ctx, span)`. Caller must call `span.End()` |
| `observe.CorrelationID(ctx)` | Extracts the 32-character hex trace ID from the span context (empty string if no active span) |
| `observe.Logger(ctx)` | Returns an `slog.Logger` enriched with `trace_id` and `span_id` attributes |
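
A minimal usage sketch of these helpers, following the signatures in the table above (the `processUtterance` wrapper and its downstream `transcribe` call are hypothetical):

```go
func processUtterance(ctx context.Context, audio []byte) error {
    // Start a span for this unit of work and always end it.
    ctx, span := observe.StartSpan(ctx, "utterance.process")
    defer span.End()

    // Loggers from observe.Logger carry trace_id and span_id, so this
    // line can be correlated with the span above in a log aggregator.
    log := observe.Logger(ctx)
    log.Info("processing utterance", "bytes", len(audio))

    // The correlation ID equals the trace ID and can be handed back to
    // clients or attached to outgoing requests.
    log.Debug("correlation", "id", observe.CorrelationID(ctx))

    return transcribe(ctx, audio) // hypothetical downstream call
}
```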

### HTTP Middleware

`observe.Middleware(m *Metrics)` wraps any `http.Handler` and performs the following for every request (see the wiring sketch after this list):

1. Extracts W3C Trace Context (`traceparent` header) from incoming requests, or starts a new trace.
2. Starts a server span named `HTTP <METHOD> <PATH>` with semantic-convention attributes.
3. Sets the `X-Correlation-ID` response header to the trace ID.
4. Injects W3C trace context into response headers for downstream propagation.
5. Records request duration to `glyphoxa.http.request.duration`.
6. Logs request completion with status code, duration, and trace info via `slog`.
7. Ends the span with the `http.response.status_code` attribute.
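
A wiring sketch under one assumption: that `observe.Middleware(m)` follows the common adapter shape `func(http.Handler) http.Handler` (the exact return type is not documented above):

```go
m := observe.DefaultMetrics()

mux := http.NewServeMux()
mux.Handle("/healthz", healthHandler) // hypothetical handler

srv := &http.Server{
    Addr: ":8080",
    // Every request through the wrapped mux gets a server span, a
    // duration measurement, and an X-Correlation-ID response header.
    Handler: observe.Middleware(m)(mux),
}
log.Fatal(srv.ListenAndServe())
```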

### Span Naming Conventions

Spans created across the codebase follow this pattern:

| Span Name | Where |
| --- | --- |
| `HTTP GET /healthz` | HTTP middleware |
| `HTTP POST /api/v1/...` | HTTP middleware |
| Custom operation names | Application code via `observe.StartSpan` |

### Structured Logging Correlation

The `observe.Logger(ctx)` helper bridges tracing and logging. When an active span exists in the context, the returned logger automatically includes `trace_id` and `span_id` fields, allowing log-aggregation tools to correlate log lines with traces:

```json
{
  "level": "INFO",
  "msg": "request completed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "method": "GET",
  "path": "/readyz",
  "status": 200,
  "duration": "1.234ms"
}
```

## πŸ“Š Prometheus Metrics

All metrics are defined in `internal/observe/metrics.go` under the instrumentation scope `github.com/MrWong99/glyphoxa`. The Prometheus exporter bridge translates OTel metric names to Prometheus conventions (dots become underscores, unit suffixes are appended automatically).

### Histogram Bucket Boundaries

All latency histograms share the same bucket boundaries, optimised for voice-pipeline latencies:

```
0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s, 1s, 2.5s, 5s, 10s
```
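
For reference, explicit boundaries like these are typically declared per instrument with the OTel Go SDK, as sketched below; whether Glyphoxa sets them on each instrument or via a metric view is not stated here:

```go
// Declare a histogram with the shared voice-pipeline bucket boundaries.
sttDuration, err := otel.Meter("github.com/MrWong99/glyphoxa").Float64Histogram(
    "glyphoxa.stt.duration",
    metric.WithUnit("s"),
    metric.WithExplicitBucketBoundaries(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)
```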

### Full Metric Reference

#### Histograms (Latency)

| OTel Name | Prometheus Name | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| `glyphoxa.stt.duration` | `glyphoxa_stt_duration_seconds` | seconds | (none) | Latency of speech-to-text transcription |
| `glyphoxa.llm.duration` | `glyphoxa_llm_duration_seconds` | seconds | (none) | Latency of LLM inference |
| `glyphoxa.tts.duration` | `glyphoxa_tts_duration_seconds` | seconds | (none) | Latency of text-to-speech synthesis |
| `glyphoxa.s2s.duration` | `glyphoxa_s2s_duration_seconds` | seconds | (none) | End-to-end speech-to-speech latency (mouth-to-ear) |
| `glyphoxa.tool_execution.duration` | `glyphoxa_tool_execution_duration_seconds` | seconds | (none) | Latency of MCP tool execution |
| `glyphoxa.http.request.duration` | `glyphoxa_http_request_duration_seconds` | seconds | `method`, `path` | HTTP request processing time |
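
Recording into one of these histograms is a plain OTel `Record` call with the elapsed time in seconds; in the sketch below, `m` is the `Metrics` struct described under Convenience Recording Methods, and the `STTDuration` field name is an assumption for illustration:

```go
start := time.Now()
err := runSTT(ctx, audio)                              // hypothetical provider call
m.STTDuration.Record(ctx, time.Since(start).Seconds()) // field name is an assumption
if err != nil {
    return err
}
```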

#### Counters

| OTel Name | Prometheus Name | Labels | Description |
| --- | --- | --- | --- |
| `glyphoxa.provider.requests` | `glyphoxa_provider_requests_total` | `provider`, `kind`, `status` | Total provider API requests |
| `glyphoxa.tool.calls` | `glyphoxa_tool_calls_total` | `tool`, `status` | Total MCP tool invocations |
| `glyphoxa.npc.utterances` | `glyphoxa_npc_utterances_total` | `npc_id` | Total NPC response utterances |
| `glyphoxa.provider.errors` | `glyphoxa_provider_errors_total` | `provider`, `kind` | Total provider errors |

#### Gauges (UpDownCounters)

| OTel Name | Prometheus Name | Labels | Description |
| --- | --- | --- | --- |
| `glyphoxa.active_npcs` | `glyphoxa_active_npcs` | (none) | Number of currently active NPC agents |
| `glyphoxa.active_sessions` | `glyphoxa_active_sessions` | (none) | Number of live voice sessions |
| `glyphoxa.active_participants` | `glyphoxa_active_participants` | (none) | Number of connected participants across all sessions |

### Label Reference

| Label | Used By | Values |
| --- | --- | --- |
| `provider` | `provider.requests`, `provider.errors` | `openai`, `anthropic`, `deepgram`, `elevenlabs`, `ollama`, `coqui`, etc. |
| `kind` | `provider.requests`, `provider.errors` | `llm`, `stt`, `tts`, `s2s`, `embeddings` |
| `status` | `provider.requests`, `tool.calls` | `ok`, `error` |
| `tool` | `tool.calls` | MCP tool name (e.g. `dice_roll`, `lookup_spell`) |
| `npc_id` | `npc.utterances` | NPC identifier (e.g. `bartender_01`, `guard_02`) |
| `method` | `http.request.duration` | HTTP method (`GET`, `POST`, etc.) |
| `path` | `http.request.duration` | Request path (e.g. `/healthz`, `/readyz`) |

### Multi-Tenant Labels

In `--mode=gateway` and `--mode=worker` deployments, all metrics and traces are automatically enriched with tenant context when a `TenantContext` is present:

| Label | Description |
| --- | --- |
| `tenant_id` | Tenant identifier (e.g. `acme-corp`) |
| `campaign_id` | Active campaign identifier |

These labels enable per-tenant filtering in Prometheus and Grafana. Example PromQL queries:

```promql
# P99 LLM latency for a specific tenant
histogram_quantile(0.99, rate(glyphoxa_llm_duration_seconds_bucket{tenant_id="acme-corp"}[5m]))

# Active sessions by tenant
sum(glyphoxa_active_sessions) by (tenant_id)

# Total provider errors for a campaign
sum(rate(glyphoxa_provider_errors_total{tenant_id="acme-corp", campaign_id="curse-of-strahd"}[1h]))
```

The pre-built Grafana dashboard includes a `$tenant_id` template variable for filtering all panels by tenant. In single-process mode (`--mode=full`), these labels are omitted and all metrics are unscoped.
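
For illustration, attribute-based enrichment with the OTel Go SDK typically looks like the sketch below. The `tenant.FromContext` accessor, the `TenantContext` field names, and the `ProviderRequests` counter field are all assumptions; the documented behaviour is only that enrichment happens automatically:

```go
var attrs []attribute.KeyValue
if tc, ok := tenant.FromContext(ctx); ok { // assumed TenantContext accessor
    attrs = append(attrs,
        attribute.String("tenant_id", tc.TenantID),
        attribute.String("campaign_id", tc.CampaignID),
    )
}
// OTel counters and histograms accept per-measurement attributes.
m.ProviderRequests.Add(ctx, 1, metric.WithAttributes(attrs...)) // assumed field name
```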

### Convenience Recording Methods

The `Metrics` struct provides helper methods that apply the correct label sets automatically:

```go
m := observe.DefaultMetrics()

// Record a successful OpenAI LLM request
m.RecordProviderRequest(ctx, "openai", "llm", "ok")

// Record a provider error
m.RecordProviderError(ctx, "elevenlabs", "tts")

// Record a tool call
m.RecordToolCall(ctx, "dice_roll", "ok")

// Record an NPC utterance
m.RecordNPCUtterance(ctx, "bartender_01")

// Adjust gauges
m.ActiveSessions.Add(ctx, 1)   // session started
m.ActiveSessions.Add(ctx, -1)  // session ended
```

## πŸ“ˆ Grafana Dashboards

### Pre-Built Dashboard: Glyphoxa - Alpha Overview

A comprehensive overview dashboard is provisioned automatically at `deployments/compose/grafana/dashboards/glyphoxa-overview.json`. It includes:

| Panel | Type | Description |
| --- | --- | --- |
| Active Sessions | Stat | Current number of live voice sessions. Thresholds: green < 3, yellow 3-5, red > 5 |
| Active NPCs | Stat | Current number of active NPC agents |
| NPC Utterances (rate) | Stat | Utterances per second across all NPCs |
| Provider Errors (rate) | Stat | Errors per second across all providers. Threshold: red > 0.1/s |
| STT Latency (p50 / p95) | Time series | Speech-to-text latency percentiles over time |
| LLM Latency (p50 / p95) | Time series | LLM inference latency percentiles over time |
| TTS Latency (p50 / p95) | Time series | Text-to-speech latency percentiles over time |
| End-to-End S2S Latency (p50 / p95) | Time series | Full mouth-to-ear latency percentiles |
| Provider Requests by Kind | Time series | Request rate broken down by provider kind (llm, stt, tts, etc.) |
| Tool Calls by Tool | Time series | MCP tool invocation rate by tool name |
| HTTP Request Duration (p95) | Time series | p95 HTTP request latency broken down by path |

### Accessing Grafana

|  |  |
| --- | --- |
| URL | http://localhost:3000 |
| Default user | `admin` |
| Default password | `admin` |
| Dashboard folder | Glyphoxa |

Dashboards are auto-provisioned from the filesystem on startup. The provisioning configuration is at:

- Dashboard provider: `deployments/compose/grafana/provisioning/dashboards/default.yml`
- Datasource: `deployments/compose/grafana/provisioning/datasources/prometheus.yml`

The Prometheus datasource is pre-configured to connect to `http://prometheus:9090` and is set as the default.

### Adding Custom Dashboards

Place additional JSON dashboard files in `deployments/compose/grafana/dashboards/`. They are automatically discovered on Grafana startup.


πŸ₯ Health Endpoints

The health subsystem (`internal/health/health.go`) provides Kubernetes-compatible probe endpoints.

### Endpoints

| Endpoint | Method | Purpose | Always Healthy? |
| --- | --- | --- | --- |
| `/healthz` | GET | Liveness probe – is the process alive and serving HTTP? | Yes (always returns 200) |
| `/readyz` | GET | Readiness probe – is the service ready to accept traffic? | No (depends on checker results) |

### Response Format

Both endpoints return JSON with `Content-Type: application/json; charset=utf-8`.

Healthy response (200 OK):

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "providers": "ok"
  }
}
```

Unhealthy response (503 Service Unavailable):

```json
{
  "status": "fail",
  "checks": {
    "database": "fail: connection refused",
    "providers": "ok"
  }
}
```

The `/healthz` endpoint always returns `{"status": "ok"}` with no checks – if the process can respond to HTTP, it is alive.

The `/readyz` endpoint evaluates registered `Checker` functions sequentially. Each checker has a 5-second timeout. If any checker fails or times out, the overall status is `fail` and the HTTP status is 503.
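
In effect, the readiness evaluation behaves like this illustrative sketch (not the actual implementation):

```go
results := make(map[string]string)
status := http.StatusOK
for _, c := range checkers {
    // Each checker gets its own 5-second budget.
    cctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    err := c.Check(cctx)
    cancel()
    if err != nil {
        results[c.Name] = "fail: " + err.Error()
        status = http.StatusServiceUnavailable // any failure fails the whole probe
    } else {
        results[c.Name] = "ok"
    }
}
```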

### Kubernetes Probe Configuration

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```

### Registering Custom Checkers

import "github.com/MrWong99/glyphoxa/internal/health"

h := health.New(
    health.Checker{
        Name:  "database",
        Check: func(ctx context.Context) error {
            return db.PingContext(ctx)
        },
    },
    health.Checker{
        Name:  "providers",
        Check: func(ctx context.Context) error {
            // verify provider connectivity
            return nil
        },
    },
)

h.Register(mux) // adds GET /healthz and GET /readyz

## 🎯 Key Metrics to Monitor

### Latency

The voice pipeline’s perceived quality depends on end-to-end latency. Monitor these in order of impact:

| What | PromQL | Target |
| --- | --- | --- |
| Mouth-to-ear p95 | `histogram_quantile(0.95, sum(rate(glyphoxa_s2s_duration_seconds_bucket[5m])) by (le))` | < 2s |
| STT p95 | `histogram_quantile(0.95, sum(rate(glyphoxa_stt_duration_seconds_bucket[5m])) by (le))` | < 500ms |
| LLM p95 | `histogram_quantile(0.95, sum(rate(glyphoxa_llm_duration_seconds_bucket[5m])) by (le))` | < 1s |
| TTS p95 | `histogram_quantile(0.95, sum(rate(glyphoxa_tts_duration_seconds_bucket[5m])) by (le))` | < 500ms |
| Tool execution p95 | `histogram_quantile(0.95, sum(rate(glyphoxa_tool_execution_duration_seconds_bucket[5m])) by (le))` | < 2s |
| HTTP p95 by path | `histogram_quantile(0.95, sum(rate(glyphoxa_http_request_duration_seconds_bucket[5m])) by (le, path))` | < 100ms |

### Error Rates

| What | PromQL | Alert When |
| --- | --- | --- |
| Provider error rate (total) | `sum(rate(glyphoxa_provider_errors_total[5m]))` | > 0.1/s sustained |
| Provider error rate by provider | `sum(rate(glyphoxa_provider_errors_total[5m])) by (provider)` | Any provider > 0.05/s |
| Provider error rate by kind | `sum(rate(glyphoxa_provider_errors_total[5m])) by (kind)` | STT or TTS errors (breaks pipeline) |
| Failed tool calls | `sum(rate(glyphoxa_tool_calls_total{status="error"}[5m]))` | > 0.1/s |
| Provider error ratio | `sum(rate(glyphoxa_provider_errors_total[5m])) / sum(rate(glyphoxa_provider_requests_total[5m]))` | > 5% |

### Resource Usage

Use standard Go runtime metrics exposed by the Prometheus client alongside Glyphoxa metrics:

| What | PromQL | Notes |
| --- | --- | --- |
| Goroutines | `go_goroutines` | Watch for leaks – sustained growth indicates a bug |
| Heap memory | `go_memstats_heap_alloc_bytes` | Should be stable under load |
| GC pause time | `go_gc_duration_seconds` | High pauses affect voice latency |

### Session Metrics

| What | PromQL | Notes |
| --- | --- | --- |
| Active sessions | `glyphoxa_active_sessions` | Current voice sessions |
| Active NPCs | `glyphoxa_active_npcs` | Currently loaded NPC agents |
| Active participants | `glyphoxa_active_participants` | Connected players across all sessions |
| Utterance throughput | `sum(rate(glyphoxa_npc_utterances_total[5m]))` | NPC responses per second |
| Utterances per NPC | `sum(rate(glyphoxa_npc_utterances_total[5m])) by (npc_id)` | Identifies hot NPCs |

## 🚨 Alerting Recommendations

### Critical Alerts

```yaml
# Provider is completely down
- alert: ProviderDown
  expr: sum(rate(glyphoxa_provider_errors_total[5m])) by (provider, kind) > 0.5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Provider {{ $labels.provider }} ({{ $labels.kind }}) is failing"
    description: "Error rate exceeds 0.5/s for 2 minutes. Voice pipeline is likely broken."

# No active sessions when there should be
- alert: NoActiveSessions
  expr: glyphoxa_active_sessions == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "No active voice sessions"
    description: "No voice sessions have been active for 10 minutes during expected hours."

# End-to-end latency is unacceptable
- alert: HighS2SLatency
  expr: histogram_quantile(0.95, sum(rate(glyphoxa_s2s_duration_seconds_bucket[5m])) by (le)) > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "End-to-end voice latency p95 exceeds 5 seconds"
    description: "Players are experiencing unacceptable delays in NPC responses."
```

### Warning Alerts

```yaml
# LLM latency is degrading
- alert: HighLLMLatency
  expr: histogram_quantile(0.95, sum(rate(glyphoxa_llm_duration_seconds_bucket[5m])) by (le)) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM inference p95 latency exceeds 3 seconds"

# Tool execution is slow
- alert: SlowToolExecution
  expr: histogram_quantile(0.95, sum(rate(glyphoxa_tool_execution_duration_seconds_bucket[5m])) by (le)) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MCP tool execution p95 exceeds 5 seconds"

# Error ratio is climbing
- alert: HighProviderErrorRatio
  expr: >
    sum(rate(glyphoxa_provider_errors_total[5m]))
    /
    sum(rate(glyphoxa_provider_requests_total[5m]))
    > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Provider error ratio exceeds 5%"

# Goroutine leak
- alert: GoroutineLeak
  expr: go_goroutines > 1000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Goroutine count exceeds 1000 for 15 minutes"
    description: "Possible goroutine leak. Investigate with pprof."

# STT latency spike
- alert: HighSTTLatency
  expr: histogram_quantile(0.95, sum(rate(glyphoxa_stt_duration_seconds_bucket[5m])) by (le)) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "STT transcription p95 exceeds 2 seconds"

# TTS latency spike
- alert: HighTTSLatency
  expr: histogram_quantile(0.95, sum(rate(glyphoxa_tts_duration_seconds_bucket[5m])) by (le)) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "TTS synthesis p95 exceeds 2 seconds"
```

βš™οΈ Configuration

### Prometheus Scrape Configuration

Prometheus is configured in `deployments/compose/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: glyphoxa
    static_configs:
      - targets: ["glyphoxa:8080"]
    metrics_path: /metrics
```

The scrape target assumes Glyphoxa is running in the same Docker Compose network on port 8080. The `/metrics` endpoint is served by the Prometheus exporter bridge created in `observe.InitProvider`.

### Docker Compose Services

Both Prometheus and Grafana are activated with the `alpha` profile:

```sh
docker compose --profile alpha up
```

| Service | Port | Profile | Storage |
| --- | --- | --- | --- |
| Prometheus | 9090 | alpha | `prometheus_data` volume, 30-day retention |
| Grafana | 3000 | alpha | `grafana_data` volume |

### Server Configuration

Observability-relevant fields in the Glyphoxa config file (`config.yaml`):

```yaml
server:
  # The listen address also serves as the metrics endpoint
  # Metrics are exposed at http://<listen_addr>/metrics
  listen_addr: ":8080"

  # Log verbosity: debug | info | warn | error
  # Use "debug" during development to see trace IDs in every log line
  log_level: info
```

### OpenTelemetry Provider Configuration

The `observe.ProviderConfig` struct controls OTel SDK setup:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `ServiceName` | `string` | `"glyphoxa"` | Service name reported in all telemetry (resource attribute `service.name`) |
| `ServiceVersion` | `string` | `""` | Service version reported in telemetry (resource attribute `service.version`) |
| `TraceExporter` | `sdktrace.SpanExporter` | `nil` | Span exporter (e.g. OTLP). When `nil`, spans are recorded but not exported |

To export traces to an OTLP-compatible backend (Jaeger, Tempo, Honeycomb, etc.), configure a `TraceExporter` in the `ProviderConfig`:

import "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("tempo:4317"),
    otlptracegrpc.WithInsecure(),
)

shutdown, err := observe.InitProvider(ctx, observe.ProviderConfig{
    ServiceName:    "glyphoxa",
    ServiceVersion: "1.0.0",
    TraceExporter:  exporter,
})

### Environment Variables

Standard OTel environment variables are respected by the Go SDK:

| Variable | Example | Purpose |
| --- | --- | --- |
| `OTEL_SERVICE_NAME` | `glyphoxa` | Override service name |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://tempo:4317` | OTLP collector endpoint |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc` | OTLP transport protocol |
| `OTEL_TRACES_SAMPLER` | `parentbased_traceidratio` | Trace sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | `0.1` | Sample 10% of traces |

πŸ” See also


This site uses Just the Docs, a documentation theme for Jekyll.