🎯 System Overview
Glyphoxa is a real-time voice AI framework that brings AI-driven talking NPCs into live TTRPG voice chat sessions. It captures player speech from a voice channel (Discord, WebRTC), routes it through a streaming AI pipeline (STT → LLM → TTS, or a single speech-to-speech model), and plays back NPC dialogue with distinct voices, personalities, and persistent memory, all within a 1.2-second mouth-to-ear latency target.
The framework is written in Go for native concurrency: every pipeline stage runs as a goroutine connected by channels, enabling true end-to-end streaming in which TTS starts synthesising before the LLM finishes generating.
🗺️ Architecture Diagram
```
┌────────────────────────────────────────────────────────────────────────────┐
│                           Audio Transport Layer                            │
│                    (Discord / WebRTC / Custom Platform)                    │
├────────────────────────┬───────────────────────────────────────────────────┤
│  Audio In              │  Audio Out                                        │
│  ┌───────────────┐     │   ┌───────────────────────────────────────────┐   │
│  │ Per-Speaker   │     │   │ Audio Mixer                               │   │
│  │ Streams       │     │   │  - Priority queue (addressed NPC first)   │   │
│  │ (chan AudioFr)│     │   │  - Barge-in detection (truncate on VAD)   │   │
│  └───────┬───────┘     │   │  - Natural pacing (200-500ms gaps)        │   │
│          │             │   └──────────────────────┬────────────────────┘   │
├──────────┼─────────────┴──────────────────────────┼────────────────────────┤
│          │         Agent Orchestrator + Router    │                        │
│          │     ┌───────────────────────────┐      │                        │
│          └────▶│ Address Detection         │      │                        │
│                │ Turn-Taking / Barge-in    │      │                        │
│                │ DM Commands & Puppet Mode │      │                        │
│                └─────────────┬─────────────┘      │                        │
│            ┌─────────────────┼──────────────┐     │                        │
│     ┌──────┴──────┐   ┌──────┴──────┐  ┌────┴────┐│                        │
│     │   NPC #1    │   │   NPC #2    │  │ NPC #3  ││  ...                   │
│     │   (Agent)   │   │   (Agent)   │  │ (Agent) ││                        │
│     └──────┬──────┘   └──────┬──────┘  └────┬────┘│                        │
├────────────┴─────────────────┴──────────────┴─────┴────────────────────────┤
│                               Voice Engines                                │
│                                                                            │
│  ┌─────────────────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Cascaded Engine         │  │ S2S Engine   │  │ Sentence Cascade      │  │
│  │ STT → LLM → TTS         │  │ Gemini Live  │  │ ⚠️ Experimental       │  │
│  │ (full pipeline)         │  │ OpenAI RT    │  │ Fast+Strong models    │  │
│  └─────────────────────────┘  └──────────────┘  └───────────────────────┘  │
│                                                                            │
├─────────────────────────────────┬──────────────────────────────────────────┤
│        Memory Subsystem         │            MCP Tool Execution            │
│                                 │                                          │
│  ┌───────┐ ┌────────┐ ┌──────┐  │  ┌───────┐ ┌───────┐ ┌────────┐         │
│  │  L1   │ │   L2   │ │  L3  │  │  │ Dice  │ │ Rules │ │ Memory │  ...    │
│  │Session│ │Semantic│ │Graph │  │  │Roller │ │Lookup │ │ Query  │         │
│  │  Log  │ │ Index  │ │ (KG) │  │  └───────┘ └───────┘ └────────┘         │
│  └───────┘ └────────┘ └──────┘  │  Budget Tiers: instant/fast/standard    │
│  └──── all on PostgreSQL ────┘  │                                          │
├─────────────────────────────────┼──────────────────────────────────────────┤
│  Observability (OpenTelemetry)  │  Resilience (Fallback + Circuit Break)   │
│  Metrics · Traces · Middleware  │  LLM / STT / TTS provider failover       │
└─────────────────────────────────┴──────────────────────────────────────────┘
```
Distributed Topology (--mode=gateway + --mode=worker)
In multi-tenant deployments, the system splits into separate processes:
```
┌──────────────────────────────────┐      ┌──────────────────────────────────┐
│             Gateway              │      │              Worker              │
│                                  │      │                                  │
│  Admin API (tenant CRUD)         │      │  VAD → STT → LLM → TTS → Mixer   │
│  Bot Manager (per-tenant bots)   │ gRPC │  Session Runtime                 │
│  Session Orchestrator            │◀────▶│  Discord Voice (direct)          │
│  Usage / Quota Tracking          │      │  MCP Tool Calls                  │
│  Health + Metrics                │      │  Health + Metrics                │
└──────────────────────────────────┘      └──────────────────────────────────┘
                 │                                          │
                 └───────── PostgreSQL (session state) ─────┘
```
The gateway manages tenant lifecycle and session orchestration. Workers run the voice pipeline and connect directly to Discord voice channels; audio never flows through the gateway. Control signals (start/stop/heartbeat) use gRPC. In --mode=full, both roles run in-process with direct function calls instead of gRPC.
See Multi-Tenant Architecture for details.
🔄 Data Flow
The voice pipeline is a streaming chain. Each stage is a goroutine reading from an input channel and writing to an output channel. Stages overlap: TTS starts before the LLM finishes, and audio playback starts before TTS finishes.
Cascaded Path (default)
```
Player speaks
     │
     ▼
┌──────────┐   50-100ms    Local Silero VAD, no network hop
│   VAD    │──────────────▶ Segments speech from silence
└────┬─────┘
     │
     ▼
┌──────────┐   200-300ms   Deepgram streaming / whisper.cpp
│   STT    │──────────────▶ Keyword boost from knowledge graph
└────┬─────┘               Phonetic entity correction on final
     │
     ├──────────────▶ Speculative Memory Pre-fetch (parallel)
     │                Vector search starts on STT partials
     ▼
┌──────────┐   30-50ms     In-memory graph + recent transcript
│ Hot Ctx  │──────────────▶ NPC identity, scene, relationships
│ Assembly │
└────┬─────┘
     │
     ▼
┌──────────┐   300-500ms   GPT-4o-mini / Claude / Gemini Flash
│   LLM    │──────────────▶ Streaming tokens via Go channel
└────┬─────┘               MCP tool calls execute inline
     │
     ▼
┌──────────┐   75-150ms    ElevenLabs Flash / Coqui XTTS
│   TTS    │──────────────▶ Sentence-by-sentence as tokens arrive
└────┬─────┘
     │
     ▼
┌──────────┐   20-50ms     Opus encoding + platform transport
│  Mixer   │──────────────▶ Priority queue, barge-in, pacing
└────┬─────┘
     │
     ▼
NPC voice plays
───────────────────────────────
Total: 650-1100ms (pipelined)
```
S2S Path (speech-to-speech)
Audio goes directly to the S2S provider (Gemini Live or OpenAI Realtime), which handles recognition, generation, and synthesis in a single API call. Audio streams back through the same mixer and transport layer.
Latency: 150-600ms first audio, depending on provider.
Memory Write-back (shared)
After both paths, the complete exchange (player utterance + NPC response) is written to the session transcript (L1). A background goroutine runs phonetic correction and entity extraction for the knowledge graph (L3).
📦 Key Packages
Application Layer (cmd/ and internal/)
| Package | Location | Responsibility |
|---|---|---|
| cmd/glyphoxa | cmd/glyphoxa/ | Entry point. Parses flags, loads config, wires providers, starts the app, handles signals (SIGINT/SIGTERM). |
| internal/app | internal/app/ | Top-level wiring. Creates and connects all subsystems via functional options. Owns App.Run() lifecycle and SessionManager for multi-guild sessions. |
| internal/agent | internal/agent/ | NPCAgent and Router interfaces. NPC identity, scene context, utterance handling. Sub-packages: orchestrator (address detection, turn-taking, utterance buffering), npcstore (PostgreSQL-backed NPC definitions). |
| internal/engine | internal/engine/ | VoiceEngine interface, the core abstraction over the conversational loop. Sub-packages: cascade (STT → LLM → TTS pipeline), s2s (Gemini Live / OpenAI Realtime wrapper). |
| internal/config | internal/config/ | Configuration schema, YAML loader, environment variable overlay, provider registry, file watcher for hot-reload, and config diffing. |
| internal/discord | internal/discord/ | Discord bot layer. Slash command router, interaction handlers (/npc, /session, /entity, /campaign, /recap, /feedback), DM role permissions, voice command filtering, pipeline stats dashboard. |
| internal/mcp | internal/mcp/ | MCP host interface and implementation. Tool registry with budget tiers, latency calibration, LLM-to-MCP bridge. Built-in tools: dice roller, rules lookup, memory query, file I/O. |
| internal/session | internal/session/ | Session lifecycle management. Context window tracking with auto-summarisation, memory guard (L1 write-through), reconnection handling, transcript consolidation. In worker mode, hosts the voice-pipeline Runtime and WorkerHandler. |
| internal/hotctx | internal/hotctx/ | Hot context assembly and formatting. Concurrent fetch of NPC identity (L3), recent transcript (L1), and scene context. Speculative memory pre-fetch on STT partials. Target: <50ms. |
| internal/observe | internal/observe/ | OpenTelemetry metrics (Prometheus exporter), distributed tracing, structured logging, HTTP middleware for latency/status instrumentation, per-provider metric recording. |
| internal/entity | internal/entity/ | Entity management: CRUD operations, YAML campaign import, VTT import (Foundry VTT, Roll20), in-memory store. |
| internal/transcript | internal/transcript/ | Transcript correction pipeline. Phonetic matching against known entity names, LLM-based correction for low-confidence segments, verification. |
| internal/resilience | internal/resilience/ | Provider failover with circuit breakers. LLMFallback, STTFallback, TTSFallback each wrap multiple backends and auto-switch on failure; circuit breakers also guard gRPC clients. |
| internal/health | internal/health/ | HTTP health endpoints. /healthz (liveness) and /readyz (readiness with pluggable checkers). |
| internal/feedback | internal/feedback/ | Closed-alpha feedback storage. Append-only JSON Lines file store. |
| internal/gateway | internal/gateway/ | Multi-tenant gateway: admin API, bot management, session orchestration, usage tracking, gRPC transport. |
Public Libraries (pkg/)
| Package | Location | Responsibility |
|---|---|---|
| pkg/audio | pkg/audio/ | Platform and Connection interfaces for voice channel connectivity. AudioFrame types, drain utilities. Sub-packages: discord (disgo voice adapter, Opus encode/decode), webrtc (Pion-based WebRTC platform, signaling, transport), mixer (priority queue with barge-in, natural pacing, heap-based scheduling), mock. |
| pkg/memory | pkg/memory/ | Three-layer memory interfaces: SessionStore (L1), SemanticIndex (L2), KnowledgeGraph / GraphRAGQuerier (L3). Query options, schema SQL. Sub-packages: postgres (pgx/pgvector implementation, knowledge graph with recursive CTEs, semantic index), mock. |
| pkg/provider | pkg/provider/ | Provider interfaces and implementations for all external AI services. Sub-packages by capability: llm (Provider interface + any-llm-go adapter), stt (Provider interface + Deepgram, whisper.cpp), tts (Provider interface + ElevenLabs, Coqui XTTS), s2s (Provider interface + Gemini Live, OpenAI Realtime), vad (Engine interface + Silero), embeddings (Provider interface + OpenAI, Ollama). Each has a mock sub-package. |
| pkg/memory/export | pkg/memory/export/ | Campaign export/import as .tar.gz archives. |
🧩 Interface-First Design
Every subsystem in Glyphoxa defines a Go interface as its public contract. Concrete implementations satisfy the interface, and compile-time assertions verify correctness at build time.
The Pattern
```go
// 1. Define the interface (in a "types" or package-level file)
type Provider interface {
	StreamCompletion(ctx context.Context, req CompletionRequest) (<-chan Chunk, error)
	Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
	CountTokens(messages []Message) (int, error)
	Capabilities() ModelCapabilities
}

// 2. Implement it (in a provider-specific package)
type AnyLLMProvider struct { /* ... */ }

// 3. Compile-time assertion (top of the implementation file)
var _ llm.Provider = (*AnyLLMProvider)(nil)
```
The `var _ Interface = (*Impl)(nil)` line ensures the compiler checks that `*Impl` satisfies `Interface` at build time, catching missing methods before any test runs.
Where This Pattern Appears
| Interface | Package | Implementations |
|---|---|---|
| audio.Platform | pkg/audio | discord.Platform, webrtc.Platform |
| audio.Connection | pkg/audio | discord.Connection, webrtc.Connection |
| engine.VoiceEngine | internal/engine | cascade.Engine, s2s.Engine, mock.VoiceEngine |
| llm.Provider | pkg/provider/llm | anyllm.Provider, resilience.LLMFallback, mock.Provider |
| stt.Provider | pkg/provider/stt | deepgram.Provider, whisper.Provider, whisper.NativeProvider, resilience.STTFallback, mock.Provider |
| tts.Provider | pkg/provider/tts | elevenlabs.Provider, coqui.Provider, resilience.TTSFallback, mock.Provider |
| s2s.Provider | pkg/provider/s2s | gemini.Provider, openai.Provider, mock.Provider |
| vad.Engine | pkg/provider/vad | mock.Engine (Silero via silero-vad-go) |
| embeddings.Provider | pkg/provider/embeddings | openai.Provider, ollama.Provider, mock.Provider |
| memory.SessionStore | pkg/memory | postgres.Store, session.MemoryGuard, mock.Store |
| memory.KnowledgeGraph | pkg/memory | postgres.KnowledgeGraph, mock.Store |
| mcp.Host | internal/mcp | mcphost.Host, mock.Host |
| agent.NPCAgent | internal/agent | agent.NPC, mock.NPCAgent |
This design means swapping any provider (e.g., replacing ElevenLabs with Coqui XTTS, or Deepgram with whisper.cpp) is a configuration change: the orchestrator never imports a concrete provider package.
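As a hedged illustration of what such a configuration change might look like, the YAML fragment below is a sketch only; the key names are assumptions, not the actual Glyphoxa config schema:

```yaml
# Illustrative config sketch -- key names are assumptions, not the real schema.
providers:
  stt:
    primary: deepgram        # swap to "whisper" without touching orchestrator code
  llm:
    primary: gpt-4o-mini
    fallback: gemini-flash   # resilience.LLMFallback wraps both
  tts:
    primary: elevenlabs
    fallback: coqui          # replacing ElevenLabs with Coqui XTTS is one line
```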
⏱️ Latency Budget
The latency target is under 1.2 seconds mouth-to-ear (from the player finishing speaking to the NPC voice starting to play); the hard limit is 2.0 seconds.
Cascaded Pipeline Breakdown
| Stage | Budget | Hard Limit | Technique |
|---|---|---|---|
| VAD + silence detection | 50–100ms | – | Local Silero VAD. No network hop. Sub-ms inference. |
| STT (streaming final) | 200–300ms | 500ms | Deepgram streaming. Transcript ready ~200ms after speech ends. |
| Speculative pre-fetch | 0ms (overlapped) | – | Vector search + graph query fires on STT partials, in parallel. |
| Hot context assembly | 30–50ms | 150ms | In-memory graph traversal + recent transcript slice. |
| LLM time-to-first-token | 300–500ms | 800ms | GPT-4o-mini or Gemini Flash streaming. |
| TTS time-to-first-byte | 75–150ms | 500ms | ElevenLabs Flash v2.5 streaming. |
| Audio transport overhead | 20–50ms | – | Opus encoding + platform playback. |
| Total (pipelined) | 650–1100ms | 2000ms | Pipelining overlaps STT tail with pre-fetch, LLM streaming with TTS streaming. |
S2S Comparison
| Engine | First Audio | Trade-offs |
|---|---|---|
| Cascaded (pipelined) | 650–1100ms | Full control over voice, model, tools |
| OpenAI Realtime (mini) | 150–400ms | Limited voices, 32k context window |
| OpenAI Realtime (full) | 200–500ms | Better quality, higher cost |
| Gemini Live (flash) | 300–600ms | 128k context, session resumption, free tier |
Why Streaming is Non-Negotiable
Without end-to-end streaming, latencies would be additive rather than overlapping:
```
Sequential: VAD(100) + STT(300) + LLM(500) + TTS(200) + Transport(50) = 1150ms minimum
                                                                        (often 1500-2000ms)

Pipelined:  VAD(100) + STT(200) ──┐
                                  ├── LLM TTFT(400) ──┐
            Pre-fetch (parallel) ─┘                   ├── TTS TTFB(100) + Transport(50)
                                                      └── Total: ~850ms typical
```
Go's channel-based concurrency makes this natural: each stage is a goroutine reading from an input channel and writing to an output channel.
🎛️ Engine Types
Each NPC declares its engine type in configuration. The VoiceEngine interface unifies all types so the orchestrator is engine-agnostic.
Cascaded Engine (cascade.Engine)
The full STT → LLM → TTS pipeline. Each stage is a separate provider, giving maximum flexibility:
- Use when: You need distinct voices (ElevenLabs voice cloning), specific LLM choice (Claude for reasoning, GPT-4o-mini for speed), tool calling, or fine-grained control.
- Latency: 650–1100ms first audio.
- Location: internal/engine/cascade/
S2S Engine (s2s.Engine)
A single API call handles audio-in to audio-out. Wraps Gemini Live or OpenAI Realtime.
- Use when: Lowest latency is the priority and you can accept the provider's built-in voices and limited tool support.
- Latency: 150–600ms first audio.
- Location: internal/engine/s2s/
Sentence Cascade (experimental)
A dual-model approach: a fast model (GPT-4o-mini) generates the opening sentence for immediate TTS playback (~500ms), while a strong model (Claude Sonnet) generates the substantive continuation in parallel. The listener hears a single continuous utterance.
- Use when: You want perceived sub-600ms latency with the quality of a strong model.
- Status: Experimental. See design/05-sentence-cascade.md.
- Location: internal/engine/cascade/ (built as a mode of the cascade engine)
| Engine | First Audio | Voice Control | Tool Calling | Context Window |
|---|---|---|---|---|
| Cascaded | 650–1100ms | Full (any TTS provider) | Full (MCP budget tiers) | Provider-dependent |
| S2S (Gemini) | 300–600ms | Provider voices only | Limited | 128k tokens |
| S2S (OpenAI) | 150–500ms | Provider voices only | Limited | 32k tokens |
| Sentence Cascade | ~500ms perceived | Full | Full | Provider-dependent |
📚 Design Documents
The full design is captured in a series of detailed documents. This architecture overview is the starting point; each design doc goes deep on its topic.
| # | Document | Description |
|---|---|---|
| 00 | Overview | Vision, product principles, core capabilities, performance targets |
| 01 | Architecture | System layers, detailed data flow, audio mixing, streaming requirements |
| 02 | Providers | LLM, STT, TTS, S2S, Audio platform interfaces and provider trade-offs |
| 03 | Memory | Three-layer hybrid memory: session log, semantic index, knowledge graph |
| 04 | MCP Tools | Tool integration, budget tiers, built-in tools, performance constraints |
| 05 | Sentence Cascade | Experimental dual-model cascade for perceived sub-600ms latency |
| 06 | NPC Agents | Agent design, multi-NPC orchestration, address detection, turn-taking |
| 07 | Technology | Why Go, dependency stack, CGo decisions, latency budget breakdown |
| 08 | Open Questions | Unresolved design questions and decisions in progress |
| 09 | Roadmap | Development phases and milestone planning |
| 10 | Knowledge Graph | L3 graph schema, PostgreSQL adjacency tables, recursive CTEs, GraphRAG |
| – | To Be Discussed | Items pending team discussion |
🔗 See Also
- getting-started.md – Setup, build, and first run guide
- providers.md – Provider configuration and swapping guide
- audio-pipeline.md – Deep dive into audio transport, mixing, and barge-in
- memory.md – Memory system usage, hot context, and knowledge graph queries
- design/00-overview.md – Vision and product principles
- design/01-architecture.md – Detailed system architecture
- design/07-technology.md – Technology decisions and dependency stack
- CONTRIBUTING.md – Development setup, code style, and workflow