Glyphoxa is in early alpha. This roadmap prioritizes interchangeable components, clean API boundaries, and usable interfaces over feature completeness. Every phase starts with interface design and ends with at least two concrete implementations to prove the abstraction holds.
Design Principles for Development
- Interface-first, implementation-second. Define the Go interface, write tests against it with mocks, then build concrete providers. If the interface feels wrong during implementation, fix the interface before continuing.
- Every component is swappable. A user should be able to replace any provider (LLM, STT, TTS, audio platform, memory backend, tool server) via configuration, never by editing application code.
- Clear package boundaries. Each package owns one concern. No package imports from a sibling's internals. Communication between layers happens through interfaces and Go channels.
- Configuration-driven wiring. Provider selection, NPC definitions, tool registration, and budget tiers are all declarative config (YAML). The application reads config and wires together the right implementations at startup.
- Test at the boundary. Integration tests run against the interface, not the implementation. A test suite for `LLMProvider` must pass identically for OpenAI, Anthropic, Gemini, and Ollama.
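A compliance suite can be expressed as ordinary Go: a function that takes any implementation of the interface and reports which shared checks it fails. This is an illustrative sketch only; `Provider`, `ComplianceErrors`, and the checks themselves are stand-ins, not the project's actual API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Provider is a minimal stand-in for the real LLM provider interface.
type Provider interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// ComplianceErrors runs the checks every implementation must pass and
// returns one error per failed check.
func ComplianceErrors(p Provider) []error {
	var errs []error

	// Check 1: a plain prompt yields a non-empty completion.
	out, err := p.Complete(context.Background(), "Say hello.")
	if err != nil {
		errs = append(errs, fmt.Errorf("complete: %w", err))
	} else if strings.TrimSpace(out) == "" {
		errs = append(errs, errors.New("empty completion"))
	}

	// Check 2: a cancelled context is honored with an error.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	if _, err := p.Complete(ctx, "ignored"); err == nil {
		errs = append(errs, errors.New("cancelled context not honored"))
	}
	return errs
}

// echoProvider is a trivial in-memory implementation used to demo the suite.
type echoProvider struct{}

func (echoProvider) Complete(ctx context.Context, prompt string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "echo: " + prompt, nil
}

func main() {
	fmt.Println("failures:", len(ComplianceErrors(echoProvider{})))
}
```

Each concrete provider's test file would call the same function with its own constructor, so a new backend inherits the full suite for free.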
Phase 1: Core Interfaces and Project Scaffold ✅
Goal: Establish the package structure, define all primary Go interfaces, and prove end-to-end audio flow with one provider per slot.
Package Structure
```
glyphoxa/
├── cmd/glyphoxa/   # CLI entrypoint, config loading, dependency wiring
├── config/         # Config schema, validation, provider registry
├── engine/         # VoiceEngine interface, CascadedEngine, S2SEngine
│   ├── cascaded/   # STT→LLM→TTS pipeline implementation
│   └── s2s/        # Speech-to-speech engine implementation
├── provider/       # Provider interfaces + concrete implementations
│   ├── llm/        # LLMProvider interface + openai/, anthropic/, gemini/, ollama/
│   ├── stt/        # STTProvider interface + deepgram/, assemblyai/, whisper/
│   ├── tts/        # TTSProvider interface + elevenlabs/, cartesia/, coqui/
│   ├── s2s/        # S2SProvider interface + geminilive/, openairealtime/
│   ├── embeddings/ # EmbeddingsProvider interface + openai/, voyage/, nomic/
│   └── vad/        # VADEngine interface + silero/
├── audio/          # AudioPlatform interface, AudioMixer, frame types
│   ├── discord/    # Discord voice transport
│   └── webrtc/     # WebRTC transport (future)
├── agent/          # NPCAgent, AgentRouter, orchestrator, turn-taking
├── memory/         # MemoryStore interface, hot layer, cold layer
│   ├── session/    # L1 session log (PostgreSQL)
│   ├── semantic/   # L2 vector index (pgvector)
│   └── graph/      # L3 knowledge graph (PostgreSQL adjacency tables)
├── mcp/            # MCP host, tool registry, budget enforcer, calibration
├── transcript/     # Transcript correction pipeline (phonetic match, LLM correction)
└── internal/       # Shared utilities (logging, metrics, test helpers)
```
Interface Definitions
Define and document all primary interfaces. Each interface gets:
- A Go interface type with full godoc
- A `mock/` subpackage with a test double
- An interface compliance test suite that any implementation must pass
Priority interfaces (define first):
- `provider/llm.Provider`: streaming completions, tool calling, token counting, capabilities
- `provider/stt.Provider`: streaming sessions, partials/finals channels, keyword boosting
- `provider/tts.Provider`: streaming synthesis from a text channel to an audio channel, voice profiles
- `audio.Platform`: connect, input/output streams, participant lifecycle
- `engine.VoiceEngine`: the unifying abstraction over cascaded and S2S paths
Secondary interfaces (define alongside first implementations):
- `provider/s2s.Provider`: audio-in/audio-out sessions, tool calling bridge, context injection
- `provider/embeddings.Provider`: single/batch embedding, dimensionality, model ID
- `provider/vad.Engine`: frame processing, speech/silence events
- `memory.KnowledgeGraph`: entity/relationship CRUD, traversal, scoped visibility, identity snapshots
- `memory.GraphRAGQuerier`: combined graph + vector search (optional extension of `KnowledgeGraph`)
- `mcp.Host`: tool discovery, registry, execution, budget enforcement
First Implementation Pass
Build one concrete provider per interface to prove the pipeline works end-to-end:
- STT: Deepgram Nova-3 (streaming WebSocket)
- LLM: OpenAI GPT-4o-mini via `any-llm-go` (streaming with tool calling)
- TTS: ElevenLabs Flash v2.5 (streaming WebSocket)
- VAD: Silero via `silero-vad-go`
- Audio: Discord via `disgo`
End-to-end milestone: Discord bot joins voice → captures audio → VAD → STT → LLM (single static persona) → TTS → plays back. Measure and log latency at every stage boundary.
Config and Wiring
- Define the YAML config schema for provider selection, credentials, and NPC definitions
- Build a provider registry that maps config strings to constructor functions
- Wire everything in `cmd/glyphoxa/`: config load → provider instantiation → engine assembly → run
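The registry can be a plain map from config strings to constructor functions. This sketch uses a made-up `TTSProvider` stand-in and hypothetical names; the real registry would live in `config/` and cover every provider slot, not just TTS.

```go
package main

import (
	"fmt"
	"sort"
)

// TTSProvider is a stand-in for the real provider/tts interface.
type TTSProvider interface{ Name() string }

type ttsFactory func(cfg map[string]string) (TTSProvider, error)

var ttsRegistry = map[string]ttsFactory{}

// RegisterTTS maps a config string ("elevenlabs", "coqui") to a constructor.
// Concrete provider packages would call this from init().
func RegisterTTS(name string, f ttsFactory) { ttsRegistry[name] = f }

// NewTTS instantiates whatever the config names, with an error that lists
// the known providers when the name is unknown.
func NewTTS(name string, cfg map[string]string) (TTSProvider, error) {
	f, ok := ttsRegistry[name]
	if !ok {
		known := make([]string, 0, len(ttsRegistry))
		for k := range ttsRegistry {
			known = append(known, k)
		}
		sort.Strings(known)
		return nil, fmt.Errorf("unknown tts provider %q (known: %v)", name, known)
	}
	return f(cfg)
}

// fakeTTS is a demo implementation.
type fakeTTS struct{ name string }

func (f fakeTTS) Name() string { return f.name }

func main() {
	RegisterTTS("elevenlabs", func(map[string]string) (TTSProvider, error) {
		return fakeTTS{"elevenlabs"}, nil
	})
	p, err := NewTTS("elevenlabs", nil)
	fmt.Println(p.Name(), err)
}
```

Because selection goes through a string key, swapping providers is purely a config edit; no call site names a concrete type.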
Phase 2: Provider Breadth and Memory Foundation ✅
Goal: Prove interchangeability by adding a second provider for each slot. Build the memory subsystem with clean separation between layers.
Second Provider Pass
Add at least one alternative for every provider interface. Run the same compliance test suite against both:
| Interface | Primary | Secondary |
|---|---|---|
| LLM | OpenAI (GPT-4o-mini) | Anthropic (Claude Sonnet) |
| STT | Deepgram | whisper.cpp (local) |
| TTS | ElevenLabs | Coqui XTTS (local) |
| Embeddings | OpenAI text-embedding-3-small | nomic-embed-text (local) |
Switching between providers must be a single config change with zero code modifications. If it's not, the interface is wrong: fix it.
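As a hypothetical example (the field names are illustrative, not the final schema), moving from hosted to fully local providers should be nothing more than:

```yaml
# Before: hosted providers
providers:
  stt: deepgram
  tts: elevenlabs

# After: fully local, same application code
providers:
  stt: whisper-cpp
  tts: coqui-xtts
```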
Memory Subsystem
Build all three memory layers behind clean interfaces:
- L1 Session Log: PostgreSQL storage with full-text index. Continuous transcript writes with speaker labels and timestamps. Interface: `memory/session.Store`.
- L2 Semantic Index: Chunk, embed, and store session content in pgvector. RAG retrieval with metadata filtering. Interface: `memory/semantic.Index`.
- L3 Knowledge Graph: PostgreSQL adjacency tables. `KnowledgeGraph` and `GraphRAGQuerier` interfaces. Entity extraction pipeline from corrected transcripts.
Key design constraint: L1, L2, and L3 share a single PostgreSQL instance but are accessed through separate interfaces. A future migration (e.g., swapping L3 to Neo4j or L2 to a standalone vector DB) should only require implementing the new backend behind the existing interface.
Hot Layer Assembly
Build the orchestratorβs hot context assembly:
- NPC identity snapshot from L3 (`IdentitySnapshot`)
- Recent session transcript from L1
- Scene context from L3
Target: < 50ms assembly time. This runs before every LLM call and must never require an LLM round-trip.
Transcript Correction Pipeline
Implement the multi-stage correction pipeline as a composable chain:
- Phonetic entity match (inline, < 1ms): Double Metaphone + Jaro-Winkler against the known entity list from L3
- LLM transcript correction (background): a cheap LLM corrects remaining entity errors
Each stage is independently testable and skippable via config. The pipeline reads the entity list through the `KnowledgeGraph` interface; it never accesses the database directly.
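As a self-contained illustration of the inline stage, the sketch below implements Jaro-Winkler from scratch and matches a misheard name against an entity list. The Double Metaphone pass is omitted here, and the 0.85 acceptance threshold and entity names are assumptions of this sketch.

```go
package main

import "fmt"

// jaro computes the Jaro similarity of two ASCII strings.
func jaro(a, b string) float64 {
	if a == b {
		return 1
	}
	la, lb := len(a), len(b)
	if la == 0 || lb == 0 {
		return 0
	}
	window := la
	if lb > window {
		window = lb
	}
	window = window/2 - 1
	if window < 0 {
		window = 0
	}
	aM, bM := make([]bool, la), make([]bool, lb)
	matches := 0
	for i := 0; i < la; i++ {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > lb {
			hi = lb
		}
		for j := lo; j < hi; j++ {
			if !bM[j] && a[i] == b[j] {
				aM[i], bM[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count transpositions among matched characters.
	t, j := 0, 0
	for i := 0; i < la; i++ {
		if !aM[i] {
			continue
		}
		for !bM[j] {
			j++
		}
		if a[i] != b[j] {
			t++
		}
		j++
	}
	m := float64(matches)
	return (m/float64(la) + m/float64(lb) + (m-float64(t)/2)/m) / 3
}

// jaroWinkler boosts jaro for shared prefixes (up to 4 chars, scale 0.1).
func jaroWinkler(a, b string) float64 {
	j := jaro(a, b)
	p := 0
	for p < len(a) && p < len(b) && p < 4 && a[p] == b[p] {
		p++
	}
	return j + float64(p)*0.1*(1-j)
}

// bestEntity returns the closest known entity above the threshold, or "".
func bestEntity(heard string, entities []string) string {
	best, bestScore := "", 0.85
	for _, e := range entities {
		if s := jaroWinkler(heard, e); s > bestScore {
			best, bestScore = e, s
		}
	}
	return best
}

func main() {
	entities := []string{"Grimjaw", "Ironhold"}
	fmt.Println(bestEntity("Grimjob", entities)) // STT misheard "Grimjaw"
}
```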
Phase 3: MCP Tools and Budget Enforcement ✅
Goal: Build the tool execution layer with performance budgets that enforce latency guarantees by construction.
MCP Host
Implement the MCP host using `modelcontextprotocol/go-sdk`:
- Tool discovery on server connection
- In-memory tool registry with schema and latency metadata
- Tool execution with timeout enforcement
- Support for both stdio (local) and HTTP/SSE (remote) transports
Budget Enforcer
The budget enforcer is the core of Glyphoxa's tool strategy. It controls which tools the LLM can see based on the active latency tier:
- FAST (≤ 500ms): dice-roller, `memory.query_entities`, file-io, music-ambiance
- STANDARD (≤ 1500ms): FAST tools + `memory.search_sessions`, rules-lookup, session-manager
- DEEP (≤ 4000ms): all tools, including image-gen and web-search
Implementation:
- Strip over-budget tools from function definitions before they reach the LLM
- Tier selection logic in the orchestrator (conversation state, keyword detection, DM commands)
- No prompt-based enforcement: the LLM never sees tools it can't afford to call
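The stripping step reduces to a filter over tool definitions before request serialization. The tool names, measured p99 values, and struct shapes below are illustrative, not the project's actual types.

```go
package main

import "fmt"

// Tier is the active latency budget tier.
type Tier int

const (
	FAST     Tier = iota // ≤ 500ms
	STANDARD             // ≤ 1500ms
	DEEP                 // ≤ 4000ms
)

var tierBudgetMs = map[Tier]int{FAST: 500, STANDARD: 1500, DEEP: 4000}

// ToolDef carries the latency metadata the calibration protocol measures.
type ToolDef struct {
	Name  string
	P99Ms int
}

// VisibleTools keeps only tools whose measured p99 fits the tier's budget;
// the result is what gets serialized into the LLM's function definitions.
func VisibleTools(all []ToolDef, tier Tier) []ToolDef {
	budget := tierBudgetMs[tier]
	var out []ToolDef
	for _, t := range all {
		if t.P99Ms <= budget {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	all := []ToolDef{
		{"dice-roller", 40},
		{"memory.search_sessions", 900},
		{"image-gen", 3500},
	}
	for _, t := range VisibleTools(all, FAST) {
		fmt.Println(t.Name)
	}
}
```

Because filtering happens before the request is built, the guarantee holds by construction: an over-budget tool cannot be called, because the model never learns it exists.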
Calibration Protocol
Build the calibration system:
- Synthetic probe on server connection
- Rolling window measurement (last 100 calls, p50 and p99)
- Health scoring with automatic tier demotion for degraded tools
Built-in Tool Servers
Ship core tools as bundled Go MCP servers:
- dice-roller: `roll(expression)`, `roll_table(table_name)`
- memory tools: `search_sessions`, `query_entities`, `get_summary`, `search_facts` (backed by the L1/L2/L3 interfaces)
- rules-lookup: `search_rules(query, system)` (D&D 5e SRD)
- file-io: `write_file`, `read_file`
MCP Bridge for S2S
Wire MCP tools into S2S sessions:
- Convert `MCPToolDef` schemas to S2S-native function definitions
- Execute tools through the same `ToolCallHandler` as the cascaded path
- Respect budget tiers: only declare tier-appropriate tools
Phase 4: NPC Agents and Orchestration ✅
Goal: Build the agent layer that brings NPCs to life with distinct personalities, memories, and voice profiles.
NPC Agent Schema
Declarative NPC definitions in YAML:
```yaml
npcs:
  - name: Grimjaw
    engine: cascaded
    personality: "Gruff but kind. Speaks in short sentences."
    voice:
      provider: elevenlabs
      voice_id: "Antoni"
      pitch: -2
      speed: 0.9
    knowledge_scope: ["Ironhold", "Missing Shipment"]
    tools: ["dice-roller", "memory.*"]
    budget_tier: standard
```
The agent loader reads this config and wires together the correct VoiceEngine, provider instances, memory scopes, and tool sets. No NPC-specific code; everything is configuration.
Agent Orchestrator
Build the orchestration layer:
- Address detection: Determine which NPC was spoken to (by name, conversational context, or DM command)
- Turn-taking: Priority queue for NPC speech output. Natural pacing with configurable silence gaps
- Cross-NPC awareness: Shared recent-utterance buffer. Each NPCβs context includes what other NPCs just said
- DM override: Voice commands and Discord slash commands to mute, redirect, or puppet NPCs
Audio Mixer
Build the output serializer:
- Priority queue for NPC audio segments
- Barge-in detection (player speech interrupts NPC output)
- Configurable gap between segments (200–500ms ± jitter)
Speculative Pre-fetch
Wire keyword extraction on STT partials to trigger parallel cold-layer queries:
- Entity name detection against L3 entity list
- Temporal reference detection ("last time", "do you remember")
- Results injected into prompt alongside hot context
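A minimal sketch of the trigger side: scan each partial for known entity names and fire a background query at most once per entity, so results are waiting by the time the final transcript arrives. `queryL3` is a made-up stand-in for the real cold-layer lookup.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Prefetcher watches STT partials and speculatively queries the cold layer.
type Prefetcher struct {
	entities map[string]bool
	mu       sync.Mutex
	fired    map[string]bool
	results  chan string
}

func NewPrefetcher(entities []string) *Prefetcher {
	m := make(map[string]bool)
	for _, e := range entities {
		m[strings.ToLower(e)] = true
	}
	return &Prefetcher{entities: m, fired: map[string]bool{}, results: make(chan string, 8)}
}

// OnPartial is called for every STT partial; it is cheap and non-blocking.
func (p *Prefetcher) OnPartial(partial string) {
	for _, w := range strings.Fields(strings.ToLower(partial)) {
		w = strings.Trim(w, ".,!?")
		p.mu.Lock()
		hit := p.entities[w] && !p.fired[w]
		if hit {
			p.fired[w] = true // fire each entity's query at most once
		}
		p.mu.Unlock()
		if hit {
			go func(name string) { p.results <- queryL3(name) }(w)
		}
	}
}

// queryL3 stands in for a graph/pgvector lookup in the cold layer.
func queryL3(entity string) string {
	return "facts about " + entity
}

func main() {
	pf := NewPrefetcher([]string{"Grimjaw"})
	pf.OnPartial("do you remember grim")    // no full entity name yet
	pf.OnPartial("do you remember grimjaw") // fires the background query
	fmt.Println(<-pf.results)
}
```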
Phase 5: S2S Engines, Entity Management, and Platform Breadth ✅
Goal: Complete the S2SEngine implementations and validate the full system with real play groups.
S2S Providers
Implement concrete S2S providers:
- Gemini Live (`gemini-live-2.5-flash-native-audio`): custom WebSocket client
- OpenAI Realtime (`gpt-realtime-mini`): via `go-openai-realtime`
Both must satisfy the S2SProvider interface and pass the same compliance tests. Verify:
- Audio forwarding and playback
- Text context injection (hot layer)
- Tool calling bridge (MCP budget enforcement)
- Session lifecycle for long sessions (context window limits, summarization triggers)
Pre-session Entity Registration
Build the DMβs entity management interface:
- Discord slash commands: `/entity add`, `/entity list`, `/entity remove`
- Campaign config file loader (YAML bulk import)
- VTT import (Foundry VTT JSON, Roll20 JSON)
Experimental: Dual-Model Sentence Cascade
Prototype the sentence cascade with controlled A/B testing:
- Fast model (GPT-4o-mini) generates the opener → TTS starts immediately
- Strong model (Claude Sonnet) continues from forced prefix
- Measure coherence, latency gain, and cost overhead
- Compare against single-model baseline and Cisco-style single-model forced prefix
This is experimental and opt-in per NPC via the `cascade_mode` config.
Closed Alpha
Recruit 3–5 DMs for real session testing. Focus feedback on:
- Voice latency and naturalness
- NPC personality consistency
- Memory accuracy (does the NPC remember correctly?)
- DM workflow (is the control interface usable?)
- Provider switching (did it break anything?)
Second Audio Platform
Add WebRTC support via Pion to validate the AudioPlatform interface:
- Browser-based voice sessions without Discord
- Same pipeline, different transport; no changes above the audio layer
- If the abstraction holds cleanly, the interface is correct
Phase 6: Production Hardening and Observability
Goal: Make Glyphoxa reliable enough for multi-hour sessions with real play groups. Instrument everything, harden failure modes, and establish operational baselines.
Status: Mostly complete (see production scaling plan).
Structured Observability
- OpenTelemetry integration ✅: Traces for the full voice pipeline (VAD → STT → LLM → TTS → playback) with span-per-stage latency
- Prometheus metrics ✅: Endpoint exposing p50/p95/p99 latency per stage, active NPC count, memory query duration, tool execution times, error rates by provider
- Structured logging ✅: Replace ad-hoc `log` calls with `slog` (structured, leveled). Correlation IDs link a single utterance through the entire pipeline
- Health endpoint ✅: `/healthz` and `/readyz` for container orchestration. Include provider connectivity checks
Graceful Degradation
- Provider failover: If an STT/TTS/LLM provider returns errors or exceeds latency hard limits, automatically fall back to a configured secondary (e.g., ElevenLabs → Coqui, Deepgram → whisper.cpp)
- Circuit breakers ✅: Per-provider circuit breaker (closed → open → half-open). Prevent cascading failures when a single external API goes down
- S2S → cascaded fallback: If an S2S session fails mid-conversation, seamlessly restart the NPC on the cascaded engine without losing context
- Memory layer isolation: L1/L2/L3 failures should degrade gracefully (NPC continues without memory rather than crashing)
Session Lifecycle
- Context window management: Automatic summarization when approaching provider context limits. For S2S sessions, trigger summary + re-injection before the window fills
- Long-session support: 4+ hour sessions need periodic memory consolidation (flush hot context to L1, re-summarize, prune stale entries)
- Reconnection: If the audio platform disconnects (Discord voice timeout, WebRTC ICE failure), auto-reconnect and resume NPC state
Configuration Validation
- Startup validation: Fail fast with clear error messages if config references unknown providers, missing credentials, or invalid NPC definitions
- Config hot-reload ✅: Watch the config file for changes and apply non-destructive updates (NPC personality, tool tiers, voice settings) without restarting the session
Resolve Open Design Items
✅ Resolved during implementation.
Address the two items from `to-be-discussed.md`:
- #6 OpenAI Realtime error events: Implement an `OnError(func(error))` callback on `s2s.SessionHandle` (recommended Option B)
- #7 WebRTC `outputCh` ownership: Update interface docs to clarify that write-only channels are caller-owned (Option C); plan a `Send(frame)` + `Close()` struct for v1 (Option D)
Phase 7: DM Experience and Closed Alpha
Goal: Build the DM-facing control surface and run real play sessions to validate the product.
Status: Discord interface complete. Closed alpha pending.
Discord Bot Interface
- Slash commands ✅: `/npc list`, `/npc mute <name>`, `/npc unmute <name>`, `/npc speak <name> <text>` (puppet mode), `/session start`, `/session stop`, `/session recap`
- Entity management ✅: `/entity add`, `/entity list`, `/entity remove`, `/entity import <file>` (YAML or VTT JSON)
- Campaign management ✅: `/campaign load <file>`, `/campaign info`, `/campaign switch`
- Session dashboard ✅: Embed showing active NPCs, latency stats, memory usage, and session duration
Voice Commands
✅ Implemented.
- DM voice shortcuts: "Grimjaw, be quiet" → mute, "Everyone, stop" → mute all, "Grimjaw, say…" → puppet
- Keyword detection on STT partials with low-latency response
Companion Web UI (Stretch)
- Real-time session view: active NPCs, who's speaking, transcript, latency gauges
- NPC editor: personality, voice preview, knowledge scope, tool permissions
- Campaign/entity browser with relationship graph visualization
- This is a stretch goal; Discord-first for alpha
Closed Alpha Program
- Recruit 3–5 DMs for real session testing (2–4 hour sessions, multiple game systems)
- Focus feedback on:
- Voice latency and naturalness
- NPC personality consistency across long sessions
- Memory accuracy ("does the NPC remember correctly?")
- DM workflow ("is the control interface usable mid-session?")
- Provider switching reliability
- Structured feedback forms after each session
- Telemetry dashboards for latency distribution, error rates, provider usage
Phase 8: Polish, Performance, and Public Beta
Goal: Optimize based on alpha feedback, expand platform support, and prepare for public release.
Status: Helm chart and container images complete. Performance and platform work pending.
Performance Optimization
- Latency profiling: Identify and eliminate bottlenecks from alpha telemetry. Target consistent sub-1.2s mouth-to-ear latency
- Connection pooling: Reuse provider connections across NPC turns where possible (especially STT/TTS WebSocket sessions)
- Speculative pre-fetch tuning: Measure hit rate of keyword-based pre-fetch from alpha data. Tune thresholds to minimize wasted queries
- Memory query optimization: Index tuning on PostgreSQL (GIN for FTS, HNSW for pgvector) based on real query patterns
Multi-Platform Audio
- WebRTC production-ready: Browser-based voice sessions without Discord dependency
- Platform abstraction validation: Run identical NPC sessions on Discord and WebRTC to confirm the `AudioPlatform` interface holds
- Additional platforms: Evaluate Mumble, TeamSpeak, or native desktop audio based on community demand
Game System Expansion
- D&D 5e SRD: Pre-indexed and bundled (OGL/CC)
- Pathfinder 2e: Pre-indexed (ORC license)
- User-uploaded rulebooks: PDF/text ingestion into the rules-lookup tool with per-campaign scoping
- System-agnostic mode: Dice roller and memory without system-specific rules
Deployment and Distribution
- Container images ✅: Multi-arch Docker images (amd64, arm64) published to GHCR
- Helm chart ✅: Kubernetes deployment with PostgreSQL, optional GPU node for local providers
- Single-binary self-host: `glyphoxa serve` with embedded migrations and a SQLite fallback for memory (no PostgreSQL required for small deployments)
- Managed cloud offering: Evaluate feasibility based on alpha usage patterns and cost data
Documentation
- User guide: Getting started, configuration reference, NPC authoring guide
- DM handbook: Best practices for NPC design, session management, memory tuning
- Provider comparison: Latency/cost/quality matrix for all supported providers
- API reference: Generated godoc + OpenAPI spec for any HTTP endpoints
Open Items
These remain unresolved and will be addressed through prototyping and alpha feedback:
- Pricing model: $5–15/month range, hybrid subscription + overage. Needs real usage data from alpha
- Self-hosted vs. cloud: Go's single-binary deployment makes both feasible. Open-source core + managed cloud offering is likely. Alpha will inform the split
- Voice consistency across engine switches: S2S ↔ cascaded voice mismatch within a single NPC session. May require voice cloning or accepting the tradeoff
- Game system licensing: D&D 5e SRD (free), Pathfinder 2e (ORC). User-uploaded rulebooks need careful scoping
See also: Overview · Architecture · Providers · Open Questions