Glyphoxa's distributed mode splits the system into a gateway and one or more workers that communicate over gRPC. This document covers the architecture, audio flow, session lifecycle, configuration, deployment, and known gotchas.
For single-process deployments, see Deployment: --mode=full requires no distributed configuration.
Architecture Overview
In distributed mode, three gRPC services connect the gateway and worker:
┌──────────────────────────────────────────────┐
│                   Gateway                    │
│                                              │
│  Discord Bot (VoiceManager)                  │
│    │                                         │
│    ├─ joins voice channel                    │
│    ├─ receives opus from players             │
│    ├─ sends opus to players                  │
│    │                                         │
│  AudioBridge Server                          │
│    │  per-session SessionBridge              │
│    │  toWorker chan   ←── Discord audio      │
│    │  fromWorker chan ──→ Discord audio      │
│    │                                         │
│  Session Orchestrator (PostgreSQL)           │
│  K8s Job Dispatcher                          │
│  Admin API (:8081)                           │
│  Slash Command Handlers                      │
└────────┬────────────┬────────────────────────┘
         │            │
   gRPC  │      gRPC  │  AudioBridgeService
 control │     (bidi  │  (opus stream)
  plane  │    stream) │
         │            │
┌────────┴────────────┴────────────────────────┐
│                   Worker                     │
│                                              │
│  grpcbridge.Connection                       │
│    │  implements audio.Connection            │
│    │  opus decode → PCM (per-user)           │
│    │  PCM → opus encode (NPC output)         │
│    │                                         │
│  Voice Pipeline                              │
│  VAD (Silero) → STT → LLM → TTS → Mixer      │
│                                              │
│  Session Runtime                             │
│  Heartbeat Reporter                          │
└──────────────────────────────────────────────┘
Why Audio Bridge?
An earlier design ("Voice State Proxy") had the gateway capture Discord voice credentials and pass them to the worker, which would then connect to Discord voice directly. This approach was abandoned because:
- Complexity: capturing voice credentials from Discord gateway events required intercepting VOICE_STATE_UPDATE and VOICE_SERVER_UPDATE at the right time, with fragile race conditions
- Suspend/resume: the gateway had to suspend its own voice connection before the worker could take over (causing 60-second hangs)
- Slash commands: with the worker owning voice, the gateway couldn't easily handle Discord slash commands for the same session
The Audio Bridge approach is simpler: the gateway keeps the Discord voice connection permanently and transparently proxies opus frames. The worker is completely Discord-unaware; it receives PCM audio and sends PCM audio through the standard audio.Connection interface.
How Voice Works
Audio Flow: Player → NPC
Player speaks in Discord
    │
    ▼
Discord sends opus packets to gateway's VoiceManager
    │
    ▼
Gateway: opus packet → SessionBridge.toWorker channel
    │
    ▼  (gRPC AudioBridgeService.StreamAudio)
    │
Worker: grpcbridge.Connection.recvLoop()
    ├─ opus decode → PCM (per-user gopus.Decoder)
    └─ demux by user_id into per-participant input channels
    ▼
Voice Pipeline: VAD → STT → LLM → TTS → Mixer
    │
    ▼
Worker: grpcbridge.Connection.sendLoop()
    └─ PCM → opus encode (48 kHz stereo, 20ms frames)
    ▼  (gRPC AudioBridgeService.StreamAudio, return direction)
    │
Gateway: SessionBridge.fromWorker channel → Discord OpusSend
    │
    ▼
All players hear the NPC
Handshake
When a worker starts a session, it connects to the gateway's AudioBridgeService.StreamAudio RPC and sends an initial frame containing only the session_id. The gateway waits up to 10 seconds for this handshake and closes the stream if it does not arrive. The session_id routes the stream to the SessionBridge that was pre-created when the gateway dispatched the session.
Barge-In and Flush
When a player starts speaking while an NPC is outputting audio:
- The worker's VAD detects player speech
- The mixer interrupts the current NPC segment and clears the queue
- grpcbridge.Connection.Flush() drains the local output buffer
- A flush control frame is sent to the gateway over the gRPC stream
- The gateway's SessionBridge.Flush() drains all buffered frames from fromWorker
- Stale NPC audio stops playing immediately on both sides
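Draining a buffered channel without blocking is the core of both flush steps. The sketch below shows one common way to do it in Go; drain is a hypothetical helper, and the real Flush() methods may be implemented differently.

```go
package main

import "fmt"

// drain discards every frame currently buffered in ch without blocking
// and returns how many frames were dropped.
func drain(ch chan []byte) int {
	n := 0
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				return n // channel closed: nothing more to drain
			}
			n++
		default:
			return n // buffer is empty
		}
	}
}

func main() {
	// Capacity mirrors the fromWorker buffer size from the table below.
	fromWorker := make(chan []byte, 1500)
	for i := 0; i < 100; i++ {
		fromWorker <- []byte{0}
	}
	dropped := drain(fromWorker)
	fmt.Println(dropped, len(fromWorker))
}
```

The default case is what makes the loop non-blocking: once the buffer is empty the function returns instead of waiting for the next frame.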
Buffer Sizes
| Buffer | Size | Purpose |
|---|---|---|
| toWorker | 128 frames (~2.5s) | Discord→worker, realtime pace (50 fps) |
| fromWorker | 1500 frames (~30s) | Worker→Discord, handles TTS bursts (faster-than-realtime generation) |
| Per-participant input | 64 frames | Worker-side, per-user PCM |
| Output | 64 frames | Worker-side, NPC audio from mixer |
How Sessions Work
Session Lifecycle
1. Player runs /session start in Discord
   │
2. Gateway slash command handler → SessionOrchestrator.ValidateAndCreate()
   ├─ checks concurrent session limit (license tier)
   ├─ checks monthly quota (QuotaGuard)
   └─ creates session record in PostgreSQL (state: pending)
   │
3. Gateway → K8s Job Dispatcher
   ├─ creates K8s Job from template
   ├─ polls until pod is Ready
   └─ returns worker pod IP + gRPC port
   │
4. Gateway → AudioBridge.NewSessionBridge(sessionID)
   (pre-creates the channel pair for the gRPC stream)
   │
5. Gateway → VoiceManager.JoinVoiceChannel()
   (connects bot to the player's voice channel)
   │
6. Gateway → Worker (gRPC): StartSession(req)
   ├─ worker builds voice pipeline (RuntimeFactory)
   ├─ worker connects to AudioBridgeService.StreamAudio
   ├─ worker sends handshake frame with session_id
   └─ worker starts VAD→STT→LLM→TTS→Mixer loop
   │
7. Worker → Gateway (gRPC): ReportState(session_id, ACTIVE)
   │
8. Audio flows bidirectionally via AudioBridgeService
   │
9. Player runs /session stop (or session times out)
   │
10. Gateway → Worker (gRPC): StopSession(session_id)
   ├─ worker stops pipeline, disconnects audio
   └─ worker reports state ENDED
   │
11. Gateway cleans up: remove bridge, disconnect voice, update DB
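The lifecycle above moves a session through the states pending, active, and ended. One way to guard those transitions is a small lookup table, sketched below. The State type, the allowed map, and canTransition are hypothetical; the orchestrator's real state handling may include more states or different rules.

```go
package main

import "fmt"

// State is a session state as named in the lifecycle above.
type State string

const (
	Pending State = "pending"
	Active  State = "active"
	Ended   State = "ended"
)

// allowed lists the transitions assumed for this sketch: a pending session
// may become active or fail straight to ended; an active session may end.
var allowed = map[State][]State{
	Pending: {Active, Ended},
	Active:  {Ended},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Active)) // true
	fmt.Println(canTransition(Ended, Active))   // false
}
```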
Worker Dispatch (K8s Jobs)
The dispatch.Dispatcher (internal/gateway/dispatch/) creates a Kubernetes Job for each voice session:
- Template: A pre-configured batchv1.Job with environment variables for the worker
- Pod readiness: Polls until the worker pod is Running and Ready (default timeout: 120s)
- Address resolution: Uses the pod IP + gRPC port (default: 50051) as the worker address
- Cleanup: Gateway deletes the K8s Job when the session ends
Key environment variables injected into worker pods:
| Variable | Value | Purpose |
|---|---|---|
| GLYPHOXA_GRPC_ADDR | :50051 | Worker gRPC listen address |
| GLYPHOXA_GATEWAY_ADDR | <gateway-service>:50051 | Gateway gRPC address for callbacks |
| GLYPHOXA_AUDIO_BRIDGE_ADDR | <gateway-service>:50051 | Gateway AudioBridge address |
| GLYPHOXA_DATABASE_DSN | postgres://... | PostgreSQL connection string |
| GLYPHOXA_MCP_GATEWAY_URL | http://... | MCP gateway URL (optional) |
Heartbeat and Failure Detection
Workers send periodic heartbeats to the gateway via SessionGatewayService.Heartbeat. If a worker fails, the gateway detects it through two mechanisms:
- Audio stream disconnect: The AudioBridgeService detects stream closure immediately and fires OnStreamDetach, triggering cleanup without waiting for heartbeat timeout
- Heartbeat timeout: CleanupZombies(timeout) transitions sessions with no heartbeat for >90 seconds to ended state
- Combined: The audio stream detach provides fast detection; the heartbeat timeout is a safety net
Configuration
Gateway Configuration
The gateway is configured primarily through environment variables and the admin API:
# Gateway mode
--mode=gateway
# Required environment variables:
GLYPHOXA_ADMIN_KEY=your-secret-key # Admin API authentication
GLYPHOXA_GRPC_ADDR=:50051 # gRPC listen address
GLYPHOXA_DATABASE_DSN=postgres://... # PostgreSQL with pgvector
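As an illustration of how a process might validate these three variables at startup, the sketch below collects all missing names into one error. gatewayEnv and loadGatewayEnv are hypothetical; only the variable names come from the listing above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// gatewayEnv holds the required gateway settings listed above.
type gatewayEnv struct {
	AdminKey    string
	GRPCAddr    string
	DatabaseDSN string
}

// loadGatewayEnv reads the required variables through lookup and reports all
// missing ones in a single error. Passing os.LookupEnv reads the real
// environment; tests can pass a map-backed function instead.
func loadGatewayEnv(lookup func(string) (string, bool)) (gatewayEnv, error) {
	var missing []string
	get := func(key string) string {
		v, ok := lookup(key)
		if !ok || v == "" {
			missing = append(missing, key)
		}
		return v
	}
	env := gatewayEnv{
		AdminKey:    get("GLYPHOXA_ADMIN_KEY"),
		GRPCAddr:    get("GLYPHOXA_GRPC_ADDR"),
		DatabaseDSN: get("GLYPHOXA_DATABASE_DSN"),
	}
	if len(missing) > 0 {
		return gatewayEnv{}, fmt.Errorf("missing required environment variables: %s", strings.Join(missing, ", "))
	}
	return env, nil
}

func main() {
	if _, err := loadGatewayEnv(os.LookupEnv); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("gateway environment OK")
}
```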
Worker Configuration
Workers receive their configuration via the StartSessionRequest gRPC message, which includes tenant ID, campaign ID, NPC configs, and bot token. Provider configuration (which STT/LLM/TTS to use) comes from a ConfigMap mounted into the worker pod.
Example worker config (mounted as ConfigMap):
providers:
vad:
name: silero
options:
frame_size_ms: 32 # Must be 32 for Silero with 16kHz input
stt:
name: elevenlabs
api_key: ${ELEVENLABS_API_KEY}
options:
language: de # Must be set explicitly for non-English
llm:
name: gemini
api_key: ${GEMINI_API_KEY}
model: gemini-2.0-flash
tts:
name: elevenlabs
api_key: ${ELEVENLABS_API_KEY}
memory:
postgres_dsn: ${GLYPHOXA_DATABASE_DSN}
embedding_dimensions: 768
embeddings:
name: gemini
api_key: ${GEMINI_API_KEY}
model: gemini-embedding-001
NPC Configuration Format
NPCs are defined in the StartSessionRequest and stored in PostgreSQL (npc_definitions table). Key fields:
npcs:
  - name: Heinrich
    personality: "A gruff dwarven blacksmith..."
    engine: cascaded          # Must be "cascaded" (not "cascade")
    voice:
      voice_id: "abc123..."   # voice is a struct, not a plain string
    knowledge_scope:
      - blacksmithing
      - local_gossip
    budget_tier: standard
    gm_helper: false
    address_only: false
Common mistakes:
- engine must be cascaded (not cascade)
- voice is a struct {voice_id: "..."}, not a plain string
- voice_id must match a voice in your TTS provider (e.g., ElevenLabs voice ID)
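The first two mistakes can be caught before a session starts with a small validation step. The sketch below is an assumption-laden illustration: the NPC and Voice types and validateNPC are hypothetical, and the check treats cascaded as the only accepted engine because that is what this page states. The real schema may accept other values.

```go
package main

import (
	"errors"
	"fmt"
)

// Voice mirrors the struct form of the voice field ({voice_id: "..."}).
type Voice struct {
	VoiceID string
}

// NPC is a hypothetical subset of the NPC definition fields.
type NPC struct {
	Name   string
	Engine string
	Voice  Voice
}

// validateNPC checks the two locally verifiable mistakes listed above.
// Whether voice_id exists at the TTS provider cannot be checked offline.
func validateNPC(n NPC) error {
	var errs []error
	if n.Engine != "cascaded" {
		errs = append(errs, fmt.Errorf("npc %q: engine must be \"cascaded\", got %q", n.Name, n.Engine))
	}
	if n.Voice.VoiceID == "" {
		errs = append(errs, fmt.Errorf("npc %q: voice.voice_id must be set", n.Name))
	}
	return errors.Join(errs...) // nil when no problems were found
}

func main() {
	bad := NPC{Name: "Heinrich", Engine: "cascade"}
	fmt.Println(validateNPC(bad))
}
```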
Known Gotchas
gRPC Context Kills Long-Lived Resources
Problem: Using the gRPC RPC context for resources that outlive the RPC causes silent cancellation.
Solution: Always use context.Background() for long-lived pipeline components (VAD sessions, STT connections, TTS streams). Only use the RPC context for the RPC call itself.
// Wrong: pipeline dies when StartSession RPC returns
pipeline := newPipeline(rpcCtx)
// Right: pipeline lives until explicitly stopped
pipeline := newPipeline(context.Background())
DAVE E2EE Voice Encryption
Discord's DAVE (Discord Audio & Video End-to-End Encryption) has been mandatory since 2026-03-01. The gateway must use safedave.NewSession (a thread-safe wrapper around golibdave) when joining voice:
voice.WithDaveSessionCreateFunc(safedave.NewSession)
Without this, the gateway will connect to voice but receive/send encrypted frames that the pipeline can't process. The symptom is silence in both directions.
VAD Frame Size Must Be 32ms
Silero VAD with 16 kHz input requires frame_size_ms: 32. Using 30ms (the default documented elsewhere) causes a dimension mismatch in the ONNX model and panics at runtime.
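The mismatch is visible in the sample counts. Assuming the model's 16 kHz input window is 512 samples, as in recent Silero VAD releases, only a 32ms frame produces the expected size. samplesPerFrame is a hypothetical helper for the arithmetic:

```go
package main

import "fmt"

// samplesPerFrame returns the number of PCM samples in one frame of the
// given duration at the given sample rate.
func samplesPerFrame(sampleRateHz, frameMs int) int {
	return sampleRateHz * frameMs / 1000
}

func main() {
	fmt.Println(samplesPerFrame(16000, 32)) // 512
	fmt.Println(samplesPerFrame(16000, 30)) // 480
}
```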
STT Language Must Be Set Explicitly
For non-English sessions, the STT provider's language option must be set explicitly:
stt:
  options:
    language: de # German
Without this, STT defaults to English and produces garbled transcriptions of non-English speech.
German NPC Routing: Short Word False Matches
The NPC address detection router matches keywords in player speech. German articles like "der", "die", "das" can falsely match NPC names containing those substrings. This was fixed by requiring a minimum keyword length of 4 runes. If you define German NPCs, avoid very short names.
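A minimum rune length filter of this kind can be sketched as follows. The matching strategy (substring match of spoken words against NPC names) is an assumption made for illustration; addressedNPC and minKeywordRunes are hypothetical names, and the real router may tokenize and match differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// minKeywordRunes is the minimum keyword length described above.
const minKeywordRunes = 4

// addressedNPC returns the first NPC whose name contains a spoken word,
// ignoring words shorter than minKeywordRunes. Length is counted in runes
// so that umlauts count as one character.
func addressedNPC(utterance string, npcNames []string) (string, bool) {
	for _, word := range strings.Fields(strings.ToLower(utterance)) {
		word = strings.Trim(word, ".,!?;:\"'")
		if utf8.RuneCountInString(word) < minKeywordRunes {
			continue // too short: "der", "die", "das" are skipped here
		}
		for _, name := range npcNames {
			if strings.Contains(strings.ToLower(name), word) {
				return name, true
			}
		}
	}
	return "", false
}

func main() {
	names := []string{"Alder", "Heinrich"}
	fmt.Println(addressedNPC("Wo ist der Schmied?", names))
	fmt.Println(addressedNPC("Heinrich, hilf mir!", names))
}
```

Without the length check, the article "der" in the first utterance would match the name "Alder".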
zhi Re-Render Drops Manual ConfigMap Edits
If you deploy with zhi and manually edit the ConfigMap (e.g., to add NPCs), running zhi apply will re-render the template and overwrite your changes. Solution: add NPCs to the zhi workspace template, not the live ConfigMap.
Worker Pod DNS Resolution
Worker pods dispatched as K8s Jobs may not have DNS resolution for the gateway service if the DNS entry hasn't propagated. The dispatcher uses pod IP + port directly to avoid this issue.
Deployment on K3s
This section covers the actual deployment topology used on the Glyphoxa home server (K3s at 192.168.178.44).
Cluster Layout
K3s Node (192.168.178.44)
├── glyphoxa namespace
│   ├── Deployment: glyphoxa-gateway (1 replica)
│   │   ├── Discord bot (VoiceManager)
│   │   ├── Admin API (:8081)
│   │   ├── gRPC server (:50051)
│   │   └── AudioBridge server (same gRPC port)
│   ├── Deployment: glyphoxa-postgres (1 replica)
│   │   └── PostgreSQL + pgvector
│   ├── Job: glyphoxa-session-<id> (per-session, created dynamically)
│   │   └── Worker pod (voice pipeline)
│   └── Service: glyphoxa-gateway (ClusterIP)
│       ├── port 8081 → admin API
│       └── port 50051 → gRPC (control + audio)
zhi Workspace
The deployment is managed by zhi at ~/zhi-deploy/glyphoxa-k8s:
glyphoxa-k8s/
├── workspace.yaml               # zhi workspace definition
├── templates/
│   ├── gateway-deployment.yaml
│   ├── gateway-service.yaml
│   ├── postgres-deployment.yaml
│   ├── postgres-service.yaml
│   ├── worker-job-template.yaml
│   ├── configmap.yaml           # Provider config (STT, LLM, TTS, etc.)
│   └── secrets.yaml             # API keys (sealed)
Worker Job Template
The gateway uses a pre-configured Job template to dispatch workers. Key aspects:
- Image: Same image as the gateway (ghcr.io/mrwong99/glyphoxa)
- Command: --mode=worker
- Resources: CPU/memory limits appropriate for the voice pipeline
- activeDeadlineSeconds: 14400 (4 hours max session)
- Environment: Gateway address, audio bridge address, database DSN, API keys
- ConfigMap mount: Provider configuration (STT language, models, etc.)
Comparison: Full Mode vs Distributed Mode
| Aspect | Full Mode (--mode=full) | Distributed Mode (--mode=gateway + --mode=worker) |
|---|---|---|
| Processes | Single binary | Gateway + worker(s) as separate pods |
| Discord voice | Worker connects directly via VoiceOnlyPlatform | Gateway owns connection, proxies via AudioBridge |
| audio.Connection | discord.Connection | grpcbridge.Connection |
| Session control | local.Client (direct function calls) | grpctransport.Client (gRPC with circuit breaker) |
| State callbacks | local.Callback (direct function calls) | gRPC SessionGatewayService |
| Worker lifecycle | In-process | K8s Jobs (created/deleted per session) |
| Multi-tenant | No (single config) | Yes (admin API, per-tenant bots, quota) |
| Scaling | Vertical only | Horizontal (one worker per session) |
See also: Architecture · Multi-Tenant · Audio Pipeline · Deployment · Configuration