🎯 System Overview

Glyphoxa is a real-time voice AI framework that brings AI-driven talking NPCs into live TTRPG voice chat sessions. It captures player speech from a voice channel (Discord, WebRTC), routes it through a streaming AI pipeline (STT, LLM, and TTS, or a single speech-to-speech model), and plays back NPC dialogue with distinct voices, personalities, and persistent memory, all within a 1.2-second mouth-to-ear latency target.

Glyphoxa is written in Go for its native concurrency: every pipeline stage runs as a goroutine connected by channels, enabling true end-to-end streaming in which TTS starts synthesising before the LLM finishes generating.


🗺️ Architecture Diagram

┌────────────────────────────────────────────────────────────────────────────┐
│                          Audio Transport Layer                             │
│                    (Discord / WebRTC / Custom Platform)                    │
├───────────────────────┬────────────────────────────────────────────────────┤
│   Audio In            │                Audio Out                           │
│   ┌───────────────┐   │   ┌───────────────────────────────────────────┐    │
│   │ Per-Speaker   │   │   │ Audio Mixer                               │    │
│   │ Streams       │   │   │  - Priority queue (addressed NPC first)   │    │
│   │ (chan AudioFr)│   │   │  - Barge-in detection (truncate on VAD)   │    │
│   └───────┬───────┘   │   │  - Natural pacing (200-500ms gaps)        │    │
│           │           │   └──────────────────────┬────────────────────┘    │
├───────────┼───────────┴──────────────────────────┼─────────────────────────┤
│           │     Agent Orchestrator + Router      │                         │
│           │   ┌─────────────────────────────┐    │                         │
│           │   │  Address Detection          │    │                         │
│           │   │  Turn-Taking / Barge-in     │    │                         │
│           │   │  DM Commands & Puppet Mode  │    │                         │
│           │   └──────────┬──────────────────┘    │                         │
│           │              │                       │                         │
│    ┌──────┴──────┐ ┌─────┴──────┐ ┌──────────┐   │                         │
│    │   NPC #1    │ │   NPC #2   │ │  NPC #3  │ ...                         │
│    │  (Agent)    │ │  (Agent)   │ │ (Agent)  │   │                         │
│    └──────┬──────┘ └─────┬──────┘ └────┬─────┘   │                         │
├───────────┴──────────────┴─────────────┴─────────┴─────────────────────────┤
│                          Voice Engines                                     │
│                                                                            │
│  ┌─────────────────────────┐ ┌──────────────┐ ┌───────────────────────┐    │
│  │  Cascaded Engine        │ │  S2S Engine  │ │  Sentence Cascade     │    │
│  │  STT → LLM → TTS        │ │  Gemini Live │ │  ⚠️ Experimental      │    │
│  │  (full pipeline)        │ │  OpenAI RT   │ │  Fast+Strong models   │    │
│  └─────────────────────────┘ └──────────────┘ └───────────────────────┘    │
│                                                                            │
├──────────────────────────────────┬─────────────────────────────────────────┤
│   Memory Subsystem               │    MCP Tool Execution                   │
│                                  │                                         │
│  ┌───────┐ ┌────────┐ ┌──────┐   │  ┌───────┐ ┌───────┐ ┌────────┐         │
│  │  L1   │ │   L2   │ │  L3  │   │  │ Dice  │ │ Rules │ │ Memory │         │
│  │Session│ │Semantic│ │Graph │   │  │Roller │ │Lookup │ │ Query  │ ...     │
│  │  Log  │ │ Index  │ │(KG)  │   │  └───────┘ └───────┘ └────────┘         │
│  └───────┘ └────────┘ └──────┘   │  Budget Tiers: instant/fast/standard    │
│  ─── all on PostgreSQL ───────   │                                         │
├──────────────────────────────────┴─────────────────────────────────────────┤
│   Observability (OpenTelemetry)  │  Resilience (Fallback + Circuit Break)  │
│   Metrics · Traces · Middleware  │  LLM / STT / TTS provider failover      │
└────────────────────────────────────────────────────────────────────────────┘

Distributed Topology (--mode=gateway + --mode=worker)

In multi-tenant deployments, the system splits into separate processes:

┌─────────────────────────────────┐     ┌──────────────────────────────────┐
│           Gateway               │     │            Worker                │
│                                 │     │                                  │
│  Admin API (tenant CRUD)        │     │  VAD → STT → LLM → TTS → Mixer   │
│  Bot Manager (per-tenant bots)  │ gRPC│  Session Runtime                 │
│  Session Orchestrator     ──────┼──────  Discord Voice (direct)          │
│  Usage / Quota Tracking         │     │  MCP Tool Calls                  │
│  Health + Metrics               │     │  Health + Metrics                │
└─────────────────────────────────┘     └──────────────────────────────────┘
         │                                          │
         └──── PostgreSQL (session state) ──────────┘

The gateway manages tenant lifecycle and session orchestration. Workers run the voice pipeline and connect directly to Discord voice channels; audio never flows through the gateway. Control signals (start/stop/heartbeat) use gRPC. In --mode=full, both roles run in one process with direct function calls instead of gRPC.

See Multi-Tenant Architecture for details.


🔀 Data Flow

The voice pipeline is a streaming chain. Each stage is a goroutine reading from an input channel and writing to an output channel. Stages overlap: TTS starts before the LLM finishes, and audio playback starts before TTS finishes.

Cascaded Path (default)

 Player speaks
      │
      ▼
 ┌──────────┐   50-100ms    Local Silero VAD, no network hop
 │    VAD   │──────────────  Segments speech from silence
 └────┬─────┘
      │
      ▼
 ┌──────────┐   200-300ms   Deepgram streaming / whisper.cpp
 │   STT    │──────────────  Keyword boost from knowledge graph
 └────┬─────┘               Phonetic entity correction on final
      │
      │──────────────────▶ Speculative Memory Pre-fetch (parallel)
      │                    Vector search starts on STT partials
      ▼
 ┌──────────┐   30-50ms     In-memory graph + recent transcript
 │ Hot Ctx  │──────────────  NPC identity, scene, relationships
 │ Assembly │
 └────┬─────┘
      │
      ▼
 ┌──────────┐   300-500ms   GPT-4o-mini / Claude / Gemini Flash
 │   LLM    │──────────────  Streaming tokens via Go channel
 └────┬─────┘               MCP tool calls execute inline
      │
      ▼
 ┌──────────┐   75-150ms    ElevenLabs Flash / Coqui XTTS
 │   TTS    │──────────────  Sentence-by-sentence as tokens arrive
 └────┬─────┘
      │
      ▼
 ┌──────────┐   20-50ms     Opus encoding + platform transport
 │  Mixer   │──────────────  Priority queue, barge-in, pacing
 └────┬─────┘
      │
      ▼
 NPC voice plays
                ────────────
                Total: 650-1100ms (pipelined)

S2S Path (speech-to-speech)

Audio goes directly to the S2S provider (Gemini Live or OpenAI Realtime), which handles recognition, generation, and synthesis in a single API call. Audio streams back through the same mixer and transport layer.

Latency: 150-600ms first audio, depending on provider.

Memory Write-back (shared)

After both paths, the complete exchange (player utterance + NPC response) is written to the session transcript (L1). A background goroutine runs phonetic correction and entity extraction for the knowledge graph (L3).


📦 Key Packages

Application Layer (cmd/ and internal/)

| Package | Location | Responsibility |
|---------|----------|----------------|
| `cmd/glyphoxa` | `cmd/glyphoxa/` | Entry point. Parses flags, loads config, wires providers, starts the app, handles signals (SIGINT/SIGTERM). |
| `internal/app` | `internal/app/` | Top-level wiring. Creates and connects all subsystems via functional options. Owns the `App.Run()` lifecycle and the `SessionManager` for multi-guild sessions. |
| `internal/agent` | `internal/agent/` | `NPCAgent` and `Router` interfaces. NPC identity, scene context, utterance handling. Sub-packages: `orchestrator` (address detection, turn-taking, utterance buffering), `npcstore` (PostgreSQL-backed NPC definitions). |
| `internal/engine` | `internal/engine/` | `VoiceEngine` interface, the core abstraction over the conversational loop. Sub-packages: `cascade` (STT → LLM → TTS pipeline), `s2s` (Gemini Live / OpenAI Realtime wrapper). |
| `internal/config` | `internal/config/` | Configuration schema, YAML loader, environment-variable overlay, provider registry, file watcher for hot-reload, and config diffing. |
| `internal/discord` | `internal/discord/` | Discord bot layer. Slash command router, interaction handlers (`/npc`, `/session`, `/entity`, `/campaign`, `/recap`, `/feedback`), DM role permissions, voice command filtering, pipeline stats dashboard. |
| `internal/mcp` | `internal/mcp/` | MCP host interface and implementation. Tool registry with budget tiers, latency calibration, LLM-to-MCP bridge. Built-in tools: dice roller, rules lookup, memory query, file I/O. |
| `internal/session` | `internal/session/` | Session and voice-pipeline lifecycle: `Runtime` and `WorkerHandler`, context window tracking with auto-summarisation, memory guard (L1 write-through), reconnection handling, transcript consolidation. |
| `internal/hotctx` | `internal/hotctx/` | Hot context assembly and formatting. Concurrent fetch of NPC identity (L3), recent transcript (L1), and scene context. Speculative memory pre-fetch on STT partials. Target: under 50ms. |
| `internal/observe` | `internal/observe/` | Observability: OpenTelemetry metrics (Prometheus exporter), distributed tracing, structured logging, HTTP middleware for latency/status instrumentation, per-provider metric recording. |
| `internal/entity` | `internal/entity/` | Entity management: CRUD operations, YAML campaign import, VTT import (Foundry VTT, Roll20), in-memory store. |
| `internal/transcript` | `internal/transcript/` | Transcript correction pipeline. Phonetic matching against known entity names, LLM-based correction for low-confidence segments, verification. |
| `internal/resilience` | `internal/resilience/` | Provider failover with circuit breakers. `LLMFallback`, `STTFallback`, `TTSFallback` each wrap multiple backends and auto-switch on failure; the same circuit breaker also guards gRPC clients. |
| `internal/health` | `internal/health/` | HTTP health endpoints: `/healthz` (liveness) and `/readyz` (readiness with pluggable checkers). |
| `internal/feedback` | `internal/feedback/` | Closed-alpha feedback storage. Append-only JSON Lines file store. |
| `internal/gateway` | `internal/gateway/` | Multi-tenant gateway: admin API, bot management, session orchestration, usage tracking, gRPC transport. |

Public Libraries (pkg/)

| Package | Location | Responsibility |
|---------|----------|----------------|
| `pkg/audio` | `pkg/audio/` | `Platform` and `Connection` interfaces for voice channel connectivity. `AudioFrame` types, drain utilities. Sub-packages: `discord` (disgo voice adapter, Opus encode/decode), `webrtc` (Pion-based WebRTC platform, signaling, transport), `mixer` (priority queue with barge-in, natural pacing, heap-based scheduling), `mock`. |
| `pkg/memory` | `pkg/memory/` | Three-layer memory interfaces: `SessionStore` (L1), `SemanticIndex` (L2), `KnowledgeGraph` / `GraphRAGQuerier` (L3). Query options, schema SQL. Sub-packages: `postgres` (pgx/pgvector implementation, knowledge graph with recursive CTEs, semantic index), `mock`. |
| `pkg/provider` | `pkg/provider/` | Provider interfaces and implementations for all external AI services. Sub-packages by capability: `llm` (`Provider` interface + any-llm-go adapter), `stt` (`Provider` interface + Deepgram, whisper.cpp), `tts` (`Provider` interface + ElevenLabs, Coqui XTTS), `s2s` (`Provider` interface + Gemini Live, OpenAI Realtime), `vad` (`Engine` interface + Silero), `embeddings` (`Provider` interface + OpenAI, Ollama). Each has a `mock` sub-package. |
| `pkg/memory/export` | `pkg/memory/export/` | Campaign export/import as `.tar.gz` archives. |

🧩 Interface-First Design

Every subsystem in Glyphoxa defines a Go interface as its public contract. Concrete implementations satisfy the interface, and compile-time assertions verify correctness at build time.

The Pattern

```go
// 1. Define the interface (in a "types" or package-level file)
type Provider interface {
    StreamCompletion(ctx context.Context, req CompletionRequest) (<-chan Chunk, error)
    Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
    CountTokens(messages []Message) (int, error)
    Capabilities() ModelCapabilities
}

// 2. Implement it (in a provider-specific package)
type AnyLLMProvider struct { /* ... */ }

// 3. Compile-time assertion (top of the implementation file)
var _ llm.Provider = (*AnyLLMProvider)(nil)
```

The `var _ Interface = (*Impl)(nil)` line makes the compiler verify that `*Impl` satisfies `Interface` at build time, catching missing methods before any test runs.

Where This Pattern Appears

| Interface | Package | Implementations |
|-----------|---------|-----------------|
| `audio.Platform` | `pkg/audio` | `discord.Platform`, `webrtc.Platform` |
| `audio.Connection` | `pkg/audio` | `discord.Connection`, `webrtc.Connection` |
| `engine.VoiceEngine` | `internal/engine` | `cascade.Engine`, `s2s.Engine`, `mock.VoiceEngine` |
| `llm.Provider` | `pkg/provider/llm` | `anyllm.Provider`, `resilience.LLMFallback`, `mock.Provider` |
| `stt.Provider` | `pkg/provider/stt` | `deepgram.Provider`, `whisper.Provider`, `whisper.NativeProvider`, `resilience.STTFallback`, `mock.Provider` |
| `tts.Provider` | `pkg/provider/tts` | `elevenlabs.Provider`, `coqui.Provider`, `resilience.TTSFallback`, `mock.Provider` |
| `s2s.Provider` | `pkg/provider/s2s` | `gemini.Provider`, `openai.Provider`, `mock.Provider` |
| `vad.Engine` | `pkg/provider/vad` | Silero (via `silero-vad-go`), `mock.Engine` |
| `embeddings.Provider` | `pkg/provider/embeddings` | `openai.Provider`, `ollama.Provider`, `mock.Provider` |
| `memory.SessionStore` | `pkg/memory` | `postgres.Store`, `session.MemoryGuard`, `mock.Store` |
| `memory.KnowledgeGraph` | `pkg/memory` | `postgres.KnowledgeGraph`, `mock.Store` |
| `mcp.Host` | `internal/mcp` | `mcphost.Host`, `mock.Host` |
| `agent.NPCAgent` | `internal/agent` | `agent.NPC`, `mock.NPCAgent` |

This design means swapping any provider (e.g., replacing ElevenLabs with Coqui XTTS, or Deepgram with whisper.cpp) is a configuration change: the orchestrator never imports a concrete provider package.
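As a sketch of what such a configuration change might look like (all key names here are hypothetical, not Glyphoxa's actual config schema):

```yaml
# Hypothetical config shape; key names are illustrative only.
npcs:
  - name: innkeeper
    engine: cascade
    stt: deepgram        # swap to whisper for local-only STT
    llm: gpt-4o-mini
    tts: elevenlabs      # swap to coqui for self-hosted voices
```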


⏱️ Latency Budget

The target is under 1.2 seconds mouth-to-ear (from the moment the player finishes speaking to the moment the NPC voice starts playing); the hard limit is 2.0 seconds.

Cascaded Pipeline Breakdown

| Stage | Budget | Hard Limit | Technique |
|-------|--------|------------|-----------|
| VAD + silence detection | 50–100ms | — | Local Silero VAD. No network hop. Sub-ms inference. |
| STT (streaming final) | 200–300ms | 500ms | Deepgram streaming. Transcript ready ~200ms after speech ends. |
| Speculative pre-fetch | 0ms (overlapped) | — | Vector search + graph query fires on STT partials, in parallel. |
| Hot context assembly | 30–50ms | 150ms | In-memory graph traversal + recent transcript slice. |
| LLM time-to-first-token | 300–500ms | 800ms | GPT-4o-mini or Gemini Flash streaming. |
| TTS time-to-first-byte | 75–150ms | 500ms | ElevenLabs Flash v2.5 streaming. |
| Audio transport overhead | 20–50ms | — | Opus encoding + platform playback. |
| Total (pipelined) | 650–1100ms | 2000ms | Pipelining overlaps the STT tail with pre-fetch, and LLM streaming with TTS streaming. |

S2S Comparison

| Engine | First Audio | Trade-offs |
|--------|-------------|------------|
| Cascaded (pipelined) | 650–1100ms | Full control over voice, model, tools |
| OpenAI Realtime (mini) | 150–400ms | Limited voices, 32k context window |
| OpenAI Realtime (full) | 200–500ms | Better quality, higher cost |
| Gemini Live (flash) | 300–600ms | 128k context, session resumption, free tier |

Why Streaming is Non-Negotiable

Without end-to-end streaming, latencies would be additive rather than overlapping:

Sequential:  VAD(100) + STT(300) + LLM(500) + TTS(200) + Transport(50) = 1150ms minimum
                                                                          (often 1500-2000ms)

Pipelined:   VAD(100) + STT(200) ──┐
                                   ├── LLM TTFT(400) ──┐
             Pre-fetch(parallel) ──┘                   ├── TTS TTFB(100) + Transport(50)
                                                       └── Total: ~850ms typical

Go’s channel-based concurrency makes this natural: each stage is a goroutine reading from an input channel and writing to an output channel.


🎙️ Engine Types

Each NPC declares its engine type in configuration. The VoiceEngine interface unifies all types so the orchestrator is engine-agnostic.

Cascaded Engine (cascade.Engine)

The full STT → LLM → TTS pipeline. Each stage is a separate provider, giving maximum flexibility:

  • Use when: You need distinct voices (ElevenLabs voice cloning), specific LLM choice (Claude for reasoning, GPT-4o-mini for speed), tool calling, or fine-grained control.
  • Latency: 650–1100ms first audio.
  • Location: internal/engine/cascade/

S2S Engine (s2s.Engine)

A single API call handles audio-in to audio-out. Wraps Gemini Live or OpenAI Realtime.

  • Use when: Lowest latency is the priority and you can accept the provider’s built-in voices and limited tool support.
  • Latency: 150–600ms first audio.
  • Location: internal/engine/s2s/

Sentence Cascade (experimental)

A dual-model approach: a fast model (GPT-4o-mini) generates the opening sentence for immediate TTS playback (~500ms), while a strong model (Claude Sonnet) generates the substantive continuation in parallel. The listener hears a single continuous utterance.

  • Use when: You want perceived sub-600ms latency with the quality of a strong model.
  • Status: Experimental. See design/05-sentence-cascade.md.
  • Location: internal/engine/cascade/ (built as a mode of the cascade engine)

| Engine | First Audio | Voice Control | Tool Calling | Context Window |
|--------|-------------|---------------|--------------|----------------|
| Cascaded | 650–1100ms | Full (any TTS provider) | Full (MCP budget tiers) | Provider-dependent |
| S2S (Gemini) | 300–600ms | Provider voices only | Limited | 128k tokens |
| S2S (OpenAI) | 150–500ms | Provider voices only | Limited | 32k tokens |
| Sentence Cascade | ~500ms perceived | Full | Full | Provider-dependent |

📚 Design Documents

The full design is captured in a series of detailed documents. This architecture overview is the starting point; each design doc goes deep on its topic.

| # | Document | Description |
|---|----------|-------------|
| 00 | Overview | Vision, product principles, core capabilities, performance targets |
| 01 | Architecture | System layers, detailed data flow, audio mixing, streaming requirements |
| 02 | Providers | LLM, STT, TTS, S2S, audio platform interfaces and provider trade-offs |
| 03 | Memory | Three-layer hybrid memory: session log, semantic index, knowledge graph |
| 04 | MCP Tools | Tool integration, budget tiers, built-in tools, performance constraints |
| 05 | Sentence Cascade | Experimental dual-model cascade for perceived sub-600ms latency |
| 06 | NPC Agents | Agent design, multi-NPC orchestration, address detection, turn-taking |
| 07 | Technology | Why Go, dependency stack, CGo decisions, latency budget breakdown |
| 08 | Open Questions | Unresolved design questions and decisions in progress |
| 09 | Roadmap | Development phases and milestone planning |
| 10 | Knowledge Graph | L3 graph schema, PostgreSQL adjacency tables, recursive CTEs, GraphRAG |
| — | To Be Discussed | Items pending team discussion |

