End-to-end documentation for the Glyphoxa audio pipeline – from player microphone to NPC voice output.
Overview
Glyphoxa's audio pipeline is a bidirectional streaming system that captures player speech from a voice platform, detects speech boundaries, processes it through an AI engine, and delivers synthesised NPC responses back to all participants. The entire path is built on Go channels and goroutines, enabling true end-to-end streaming where each pipeline stage begins processing before the previous stage completes.
The pipeline consists of six core stages:
- Platform Transport (Inbound) – captures per-participant audio from Discord or WebRTC
- Voice Activity Detection – segments speech from silence using Silero ONNX
- Engine Processing – converts player speech to NPC response (three engine types)
- Audio Mixer – priority-queues NPC speech segments, handles barge-in
- Platform Transport (Outbound) – encodes and transmits NPC audio to all participants
- Utterance Buffer – maintains cross-NPC awareness of recent conversation
Audio Flow Diagram
```
INBOUND (Player -> Engine)

+------------+     +---------------+     +----------+     +--------------+
| Microphone | --> | Platform      | --> | VAD      | --> | Engine       |
| (Player)   |     | Transport     |     | (Silero) |     | (Cascaded /  |
+------------+     | (Discord /    |     +----+-----+     |  S2S /       |
                   |  WebRTC)      |          |           |  Cascade)    |
                   +-------+-------+     SpeechStart/     +------+-------+
                           |             SpeechEnd               |
                   per-participant       gates STT        engine.Response
                   AudioFrame channels                    {Text, Audio <-chan}
                                                                 |
OUTBOUND (Engine -> Speaker)                                     |
                                                                 |
+------------+     +--------------+     +-----------+            |
| Speaker    | <-- | Platform     | <-- | Audio     | <----------+
| (All       |     | Transport    |     | Mixer     |
|  Players)  |     | (Discord /   |     | (Priority |
+------------+     |  WebRTC)     |     |  Queue)   |
                   +------+-------+     +-----+-----+
                          |                   |
                   Opus encode /       Barge-in detection
                   WebRTC send         DM override
                                       Inter-segment gap
```
Detailed data flow:
```
Player speaks
  |
  +-- Discord: Opus packet --> OpusRecv --> Opus decode --> PCM AudioFrame
  |                                                               |
  +-- WebRTC: PeerTransport.AudioInput() --> PCM AudioFrame ------+
                                                                  |
                                                                  v
InputStreams() --> per-participant <-chan AudioFrame
(map[participantID] channel)
  |
  v
VAD.ProcessFrame()
  +-- SpeechStart:    begin buffering
  +-- SpeechContinue: accumulate
  +-- SpeechEnd:      dispatch to engine
  +-- Silence:        discard / reset
  |
  v
VoiceEngine.Process(AudioFrame, PromptContext)
  |
  v
engine.Response{ .Audio <-chan, .Text string, .ToolCalls [] }
  |
  v
AudioSegment{NPCID, Audio, Priority}
  |
  v
Mixer.Enqueue(segment, priority)
  |
  v
dispatch goroutine
  +-- priority queue (max-heap)
  +-- inter-segment gap + jitter
  +-- output callback --> []byte
  |
  v
Connection.OutputStream() chan<- AudioFrame
  |
  +-- Discord: Opus encode --> OpusSend
  +-- WebRTC: PeerTransport.SendAudio()
  |
  v
All participants hear NPC
```
Platform Transports
Platform transports implement the audio.Platform and audio.Connection interfaces, abstracting voice channel connectivity from the rest of the pipeline.
Core Interfaces
```go
// Platform is the entry point for a voice-channel provider.
type Platform interface {
	Connect(ctx context.Context, channelID string) (Connection, error)
}

// Connection represents an active session on a voice channel.
type Connection interface {
	InputStreams() map[string]<-chan AudioFrame // per-participant input
	OutputStream() chan<- AudioFrame            // single mixed output
	OnParticipantChange(cb func(Event))         // join/leave callbacks
	Disconnect() error
}
```
Discord Transport (pkg/audio/discord/)
The Discord transport bridges Discord's Opus-based voice protocol with Glyphoxa's PCM AudioFrame pipeline using the disgoorg/disgo library.
How it works:
| Stage | Detail |
|---|---|
| Inbound | Opus packets arrive via VoiceConnection.OpusRecv. Each SSRC gets its own gopus.Decoder. Decoded PCM frames are delivered to per-participant channels (buffer: 64 frames). |
| Outbound | PCM AudioFrame values written to OutputStream() are encoded to Opus via gopus.Encoder and sent to VoiceConnection.OpusSend. Discord speaking notifications are managed automatically. |
| Codec | 48 kHz stereo Opus, 20 ms frame size (960 samples/channel). |
| Participant tracking | VoiceStateUpdate events detect joins/leaves by guild and channel ID. SSRC-to-user-ID mapping is built lazily as packets arrive. |
| Lifecycle | Disconnect() closes all input channels, removes event handlers, and disconnects the voice connection. Safe to call multiple times. |
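The lazy SSRC-to-decoder mapping from the Inbound row can be sketched as follows. `opusDecoder` is a hypothetical stand-in for the real `gopus.Decoder` (which performs actual Opus decoding); only the one-decoder-per-SSRC, created-on-first-packet behaviour is taken from the description above.

```go
package main

import "fmt"

// opusDecoder stands in for gopus.Decoder; real decoding is elided.
type opusDecoder struct{ ssrc uint32 }

// Decode returns a fixed-size PCM frame: 20 ms at 48 kHz stereo = 1920 samples.
func (d *opusDecoder) Decode(packet []byte) []int16 {
	return make([]int16, 960*2)
}

// decoderPool lazily creates one decoder per SSRC, mirroring the
// per-participant decoding the Discord transport performs.
type decoderPool struct {
	decoders map[uint32]*opusDecoder
}

func newDecoderPool() *decoderPool {
	return &decoderPool{decoders: make(map[uint32]*opusDecoder)}
}

// decoderFor returns the existing decoder for ssrc, creating it on first use.
func (p *decoderPool) decoderFor(ssrc uint32) *opusDecoder {
	d, ok := p.decoders[ssrc]
	if !ok {
		d = &opusDecoder{ssrc: ssrc}
		p.decoders[ssrc] = d
	}
	return d
}

func main() {
	pool := newDecoderPool()
	pcm := pool.decoderFor(41771).Decode([]byte{0xFC})
	fmt.Println(len(pool.decoders), len(pcm)) // 1 1920
}
```

Because Opus decoders are stateful, reusing the same decoder for a given SSRC is required for correct packet-loss concealment across a participant's stream.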
Configuration:
```yaml
providers:
  audio:
    name: discord
    api_key: "Bot MTIz..."
    options:
      guild_id: "123456789012345678"
      dm_role_id: "987654321098765432"
```
When to use: Production Discord bot deployments. This is the primary transport for TTRPG sessions hosted on Discord.
WebRTC Transport (pkg/audio/webrtc/)
The WebRTC transport enables browser-based voice sessions via pion/webrtc, without requiring Discord or any third-party voice platform.
How it works:
| Stage | Detail |
|---|---|
| Inbound | Each peer has a PeerTransport that delivers audio via AudioInput(). A readPeerInput goroutine forwards frames to the per-participant channel. |
| Outbound | A forwardOutput goroutine reads from the output channel and fans out to all connected peers via PeerTransport.SendAudio(). |
| Signaling | HTTP-based signaling server with three endpoints: POST /rooms/{roomID}/join (SDP offer/answer), POST /rooms/{roomID}/ice (ICE candidates), DELETE /rooms/{roomID}/leave. |
| ICE | Configurable STUN servers (default: stun:stun.l.google.com:19302). |
| Lifecycle | OutputWriter provides lifecycle-aware writes that safely drop frames after disconnect instead of panicking. |
Configuration:
```yaml
providers:
  audio:
    name: webrtc
    options:
      stun_servers:
        - "stun:stun.l.google.com:19302"
      sample_rate: 48000
```
When to use: Custom web UIs, browser-based TTRPG tools, or environments where Discord is not available. Currently in alpha – the PeerTransport interface abstracts the pion/webrtc integration so it can be developed independently.
Transport Comparison
| Feature | Discord | WebRTC |
|---|---|---|
| Provider name | discord | webrtc |
| Codec | Opus (48 kHz stereo) | Configurable (default 48 kHz) |
| Client | Discord app | Any WebRTC browser |
| Config | api_key + options.guild_id | options.stun_servers + options.sample_rate |
| Maturity | Production | Alpha |
| Participant tracking | VoiceStateUpdate events | Explicit AddPeer/RemovePeer |
| Multi-room | One connection per channel | One connection per room |
Voice Activity Detection (VAD)
VAD sits between the platform transport and the engine, segmenting continuous audio into discrete speech regions. Only audio classified as speech is forwarded to the engine, saving STT/S2S costs and preventing hallucinated transcriptions of silence.
Interface
```go
// Engine is the factory for VAD sessions.
type Engine interface {
	NewSession(cfg Config) (SessionHandle, error)
}

// SessionHandle processes frames for a single audio stream.
type SessionHandle interface {
	ProcessFrame(frame []byte) (VADEvent, error)
	Reset()
	Close() error
}
```
Silero ONNX Model
Glyphoxa uses Silero VAD as its default (and currently only) VAD backend. Silero is a compact neural network (~1 MB) that runs locally via ONNX Runtime – no network hop, no API key required.
Key characteristics:
- Model: Silero VAD v4/v5 ONNX, loaded once at startup
- Inference: Local CPU via ONNX Runtime (requires the `libonnxruntime` shared library)
- Latency: Sub-millisecond per frame on modern hardware
- Statefulness: Each `SessionHandle` maintains its own internal state (ring buffers, smoothing history), so multiple concurrent participant streams are independent
Detection States
Each call to ProcessFrame() returns one of four VADEventType values:
| Event | Meaning | Pipeline Action |
|---|---|---|
| `VADSpeechStart` | Speech just began | Begin buffering audio frames for the engine |
| `VADSpeechContinue` | Ongoing speech | Continue accumulating frames |
| `VADSpeechEnd` | Speech just ended | Dispatch accumulated audio to the engine |
| `VADSilence` | No speech detected | Discard frame / keep session idle |
The transition from `VADSilence` to `VADSpeechStart` requires the speech probability to exceed `SpeechThreshold`. The reverse transition from active speech to `VADSpeechEnd` requires the probability to drop below `SilenceThreshold` – providing hysteresis that prevents flickering on borderline frames.
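The hysteresis rule can be sketched as a small state machine. This is a simplified illustration of the two-threshold logic only – the real Silero session also applies probability smoothing and ring-buffered history, which are omitted here.

```go
package main

import "fmt"

type VADEventType int

const (
	VADSilence VADEventType = iota
	VADSpeechStart
	VADSpeechContinue
	VADSpeechEnd
)

// hysteresis tracks speech state with two thresholds: entering speech
// requires p > speechThreshold, leaving it requires p < silenceThreshold.
// Probabilities between the two thresholds never flip the state.
type hysteresis struct {
	speechThreshold  float32
	silenceThreshold float32
	inSpeech         bool
}

func (h *hysteresis) classify(p float32) VADEventType {
	switch {
	case !h.inSpeech && p > h.speechThreshold:
		h.inSpeech = true
		return VADSpeechStart
	case h.inSpeech && p < h.silenceThreshold:
		h.inSpeech = false
		return VADSpeechEnd
	case h.inSpeech:
		return VADSpeechContinue
	default:
		return VADSilence
	}
}

func main() {
	h := &hysteresis{speechThreshold: 0.5, silenceThreshold: 0.35}
	for _, p := range []float32{0.2, 0.7, 0.4, 0.6, 0.1} {
		fmt.Printf("p=%.2f -> %d\n", p, h.classify(p))
	}
	// p=0.40 stays in speech: between the two thresholds, the state holds.
}
```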
Configuration and Tuning
```go
cfg := vad.Config{
	SampleRate:       16000, // must match incoming PCM
	FrameSizeMs:      30,    // 10, 20, or 30 ms
	SpeechThreshold:  0.5,   // probability to start speech
	SilenceThreshold: 0.35,  // probability to end speech
}
session, err := vadEngine.NewSession(cfg)
```
Tuning guidelines:
| Parameter | Lower Value | Higher Value |
|---|---|---|
| `SpeechThreshold` | More sensitive – catches quiet speech, more false positives | Less sensitive – misses soft speech, fewer false starts |
| `SilenceThreshold` | Longer speech segments – waits through pauses | Shorter segments – cuts off mid-pause, more fragmented |
| `FrameSizeMs` | Finer granularity, more CPU | Coarser granularity, less CPU |
Recommended starting values: SpeechThreshold: 0.5, SilenceThreshold: 0.35, FrameSizeMs: 30. These defaults work well for typical TTRPG voice sessions where players speak clearly into microphones.
For noisy environments (background music, fan noise), increase SpeechThreshold to 0.6-0.7. For quiet, deliberate speakers, lower SpeechThreshold to 0.4.
Engine Types
The VoiceEngine interface is the core abstraction over the AI processing pipeline. All engines implement the same interface, making them interchangeable per NPC:
```go
type VoiceEngine interface {
	Process(ctx context.Context, input AudioFrame, prompt PromptContext) (*Response, error)
	InjectContext(ctx context.Context, update ContextUpdate) error
	SetTools(tools []llm.ToolDefinition) error
	OnToolCall(handler func(name, args string) (string, error))
	Transcripts() <-chan memory.TranscriptEntry
	Close() error
}
```
Cascaded Engine (STT -> LLM -> TTS)
Package: internal/engine/cascade/ (sentence cascade variant) and the standard cascaded pipeline
The cascaded engine breaks voice processing into three explicit stages, each backed by an independent provider:
- STT – transcribes player audio to text (e.g., Deepgram, Whisper)
- LLM – generates NPC dialogue from the transcript + context (e.g., GPT-4o, Claude Sonnet)
- TTS – synthesises the NPC's response to audio (e.g., ElevenLabs, Coqui)
How it works:
- The agent assembles a `PromptContext` (system prompt, hot context, conversation history, budget tier)
- `Process()` sends the transcript to the LLM with tool definitions gated by the MCP budget tier
- LLM tokens stream back via a Go channel; sentence boundaries trigger incremental TTS synthesis
- The `Response.Audio` channel streams audio chunks as they are synthesised – playback begins before the LLM finishes generating
Strengths: Maximum flexibility – each provider can be swapped independently. Full control over voice selection, model choice, and tool calling. Keyword boosting for fantasy proper nouns in STT.
Trade-off: Highest latency of the three engines due to three sequential network hops (STT + LLM + TTS).
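The streaming hand-off above can be sketched from the consumer's side. The `Response` struct here is a simplified version of `engine.Response` (only `.Text` and `.Audio`), and `playResponse`/`emit` are illustrative names; the point is that because `.Audio` is a channel, playback overlaps with synthesis.

```go
package main

import "fmt"

// Response is a pared-down sketch of engine.Response.
type Response struct {
	Text  string
	Audio <-chan []byte
}

// playResponse forwards audio chunks to emit as they arrive, returning the
// chunk count. Playback starts on the first chunk, before synthesis finishes.
func playResponse(r *Response, emit func([]byte)) int {
	chunks := 0
	for chunk := range r.Audio {
		emit(chunk)
		chunks++
	}
	return chunks
}

func main() {
	audio := make(chan []byte, 2)
	audio <- []byte("chunk-a")
	audio <- []byte("chunk-b")
	close(audio) // synthesis complete

	n := playResponse(&Response{Text: "Well met!", Audio: audio}, func(b []byte) {
		fmt.Printf("play %s\n", b)
	})
	fmt.Println(n) // 2
}
```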
Speech-to-Speech Engine (S2S)
Package: internal/engine/s2s/
The S2S engine wraps a speech-to-speech provider (e.g., OpenAI Realtime, Gemini Live) that handles recognition, generation, and synthesis in a single API call.
How it works:
- Lazily opens an S2S session on the first `Process()` call; reconnects transparently if the session dies
- Audio frames are sent directly via `session.SendAudio()`
- Context is injected via `session.UpdateInstructions()` and `session.InjectTextContext()`
- Response audio streams back on `session.Audio()` and is forwarded to a per-turn channel
- A silence timeout (`defaultTurnTimeout`: 2s) detects end-of-turn when no audio arrives
- Tools are forwarded to the session via `session.SetTools()` and executed via the registered tool handler
Strengths: Lowest latency – a single network hop replaces three. The model handles voice natively.
Trade-off: Less control over voice characteristics, model selection, and STT quality. No keyword boosting for fantasy names. S2S transcripts require more aggressive correction.
Sentence Cascade Engine (Experimental)
Package: internal/engine/cascade/
> [!WARNING]
> Experimental. This is a novel technique not implemented in any known production system. It is inspired by speculative decoding and model cascading research but operates at the sentence level. Significant prototyping is required to validate coherence, latency gains, and the conditions under which it outperforms a single-model approach.
The sentence cascade reduces perceived latency by starting TTS playback with a fast modelโs opening sentence while a stronger model generates the substantive continuation.
How it works:
- Player finishes speaking; STT finalises the transcript
- Fast model (e.g., GPT-4o-mini, Gemini Flash) generates only the first sentence (~200 ms TTFT)
- TTS starts immediately on the first sentence – voice onset within ~500 ms
- Strong model (e.g., Claude Sonnet, GPT-4o) receives the same prompt plus the fast model's first sentence as a forced assistant-role continuation prefix
- Strong model's output streams to TTS sentence-by-sentence – a seamless single utterance
Single-model fast path: If the fast model's entire response is one sentence (detected via `FinishReason`), the strong model is skipped entirely. This avoids unnecessary overhead for simple greetings.
Sentence boundary detection: Sentences are split at `.`, `!`, or `?` followed by whitespace. Partial sentences are flushed when the stream ends.
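A minimal sketch of that splitting rule (the function name `splitSentences` is illustrative; the real implementation operates incrementally on a token stream rather than on a complete string):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitSentences splits at '.', '!', or '?' followed by whitespace, and
// flushes any trailing partial sentence when the input ends.
func splitSentences(text string) []string {
	var sentences []string
	runes := []rune(text)
	start := 0
	for i := 0; i < len(runes); i++ {
		if (runes[i] == '.' || runes[i] == '!' || runes[i] == '?') &&
			i+1 < len(runes) && unicode.IsSpace(runes[i+1]) {
			sentences = append(sentences, strings.TrimSpace(string(runes[start:i+1])))
			start = i + 1
		}
	}
	// Flush the partial sentence left over at stream end.
	if tail := strings.TrimSpace(string(runes[start:])); tail != "" {
		sentences = append(sentences, tail)
	}
	return sentences
}

func main() {
	fmt.Printf("%q\n", splitSentences("Ah, the goblins! They fled east. Hmm"))
	// ["Ah, the goblins!" "They fled east." "Hmm"]
}
```

Note the whitespace requirement keeps abbreviations and decimals ("3.5 gold") from triggering a false boundary, though it is not a full sentence tokenizer.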
Strengths: Sub-600 ms perceived latency for complex responses. The opening reaction sounds natural ("Ah, the goblins!") while the strong model assembles the real answer.
Trade-off: Approximately doubles LLM cost per utterance. Risk of coherence/tone mismatch between models. Only valuable for latency-critical, complex interactions.
Engine Comparison
| Aspect | Cascaded (STT->LLM->TTS) | Speech-to-Speech (S2S) | Sentence Cascade |
|---|---|---|---|
| Latency (end-to-end) | 1.5–3 s | 0.5–1.5 s | 0.5–1 s perceived |
| Voice quality | High (dedicated TTS) | Model-dependent | High (dedicated TTS) |
| Voice control | Full (any TTS voice) | Limited (model voices) | Full (any TTS voice) |
| Model flexibility | Any STT + LLM + TTS | S2S providers only | Any 2 LLMs + TTS |
| Tool calling | Full MCP support | Provider-dependent | Strong model only |
| Keyword boosting | Yes (STT level) | No | Yes (STT level) |
| Cost per utterance | Moderate | Low–moderate | ~2x LLM cost |
| Complexity | Low | Low | High |
| Status | Production | Production | Experimental |
Audio Mixer
Package: pkg/audio/mixer/
The audio mixer sits between engine outputs and the platform transportโs single output stream. Discord limits a bot to one outbound audio stream per guild, so when multiple NPCs need to speak, the mixer serialises their output through a priority queue.
Priority Queue
The PriorityMixer uses a max-heap (container/heap) ordered by priority (descending) with FIFO tie-breaking on insertion sequence:
```go
mixer := mixer.New(outputCallback,
	mixer.WithGap(300*time.Millisecond),
	mixer.WithQueueCapacity(16),
)
```
Scheduling rules:
| Rule | Behaviour |
|---|---|
| Priority ordering | Higher-priority segments play first. DM-designated NPCs get elevated priority. |
| FIFO within priority | Equal-priority segments play in insertion order. |
| Preemption | Enqueueing a segment with higher priority than the currently playing one immediately interrupts it with DMOverride semantics. |
| Streaming playback | Segments stream incrementally – the mixer begins playback before the entire segment is synthesised. Each `AudioSegment.Audio` channel delivers `[]byte` chunks as they arrive. |
Inter-Segment Gaps
A configurable silence gap (default: 300 ms) is inserted between consecutive segments to simulate natural turn-taking:
- Jitter: +/- 1/6 of the base gap is applied randomly to prevent robotic timing
- Zero gap: Setting the gap to zero plays segments back-to-back
- Runtime adjustment: `SetGap(d)` changes the gap before the next segment starts
Barge-In Detection and Behaviour
When a player starts speaking while an NPC is still outputting audio:
- VAD detects player speech on an input stream during NPC playback
- `BargeIn(speakerID)` is called on the mixer
- The current segment is interrupted with `PlayerBargeIn` semantics
- The queue is cleared – all pending NPC segments are drained (the conversation context has changed)
- The barge-in handler fires – the registered callback receives the interrupting player's ID on a new goroutine
- The player's speech is routed to the addressed NPC's engine for processing
Interrupt reasons:
| Reason | Behaviour |
|---|---|
| `PlayerBargeIn` | Hard cut current segment + clear entire queue (player took the floor) |
| `DMOverride` | Hard cut current segment, preserve queue (DM injected priority speech) |
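The difference between the two reasons can be sketched against a toy queue. The `interrupt` helper and the string-based "queue" are illustrative stand-ins; only the cut-vs-preserve semantics come from the table above.

```go
package main

import "fmt"

type InterruptReason int

const (
	PlayerBargeIn InterruptReason = iota
	DMOverride
)

// interrupt cuts the current segment for either reason, but only
// PlayerBargeIn drains the pending queue: the player took the floor,
// so queued NPC responses are stale.
func interrupt(reason InterruptReason, current *string, queue []string) (cut string, remaining []string) {
	cut = *current
	*current = ""
	if reason == PlayerBargeIn {
		return cut, nil // drop everything pending
	}
	return cut, queue // DM override: pending segments survive
}

func main() {
	current := "barkeep-greeting"
	queue := []string{"guard-warning", "villain-taunt"}
	cut, remaining := interrupt(PlayerBargeIn, &current, queue)
	fmt.Println(cut, len(remaining)) // barkeep-greeting 0
}
```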
Mixer Interface
```go
type Mixer interface {
	Enqueue(segment *AudioSegment, priority int)
	Interrupt(reason InterruptReason)
	OnBargeIn(handler func(speakerID string))
	SetGap(d time.Duration)
}
```
Utterance Buffer
Package: internal/agent/orchestrator/
The UtteranceBuffer provides cross-NPC awareness – each NPC can see what other NPCs and players have said recently, enabling coherent multi-NPC scenes (e.g., a tavern with three NPCs who reference each other's dialogue).
How It Works
- Every utterance (player or NPC) is added to the shared buffer via `Add(entry)`
- Before routing a new utterance to an NPC, the orchestrator calls `Recent(excludeNPCID, maxEntries)` to retrieve recent cross-NPC context
- These entries are injected into the target NPC's engine via `InjectContext()` so the NPC's next response reflects what others have said
- The NPC's own utterances are excluded (via `excludeNPCID`) to avoid self-referential context
Eviction Policy
The buffer enforces two limits:
| Limit | Default | Purpose |
|---|---|---|
| Max entries | 20 | Bounds memory usage |
| Max age | 5 minutes | Ensures only recent, relevant context is injected |
Eviction runs on every Add() call. Surviving entries are copied to a fresh backing array to prevent evicted entries from pinning memory.
Buffer Entry Structure
```go
type BufferEntry struct {
	SpeakerID   string    // player user-ID or NPC agent ID
	SpeakerName string    // human-readable name
	Text        string    // utterance text
	NPCID       string    // non-empty when produced by an NPC
	Timestamp   time.Time // when the utterance occurred
}
```
Configuration
```go
buffer := orchestrator.NewUtteranceBuffer(maxSize, maxAge)

// Or via orchestrator options:
orch := orchestrator.New(agents,
	orchestrator.WithBufferSize(30),
	orchestrator.WithBufferDuration(10*time.Minute),
)
```
Choosing an Engine
Use this decision guide to select the right engine for each NPC:
| Criterion | Cascaded (STT->LLM->TTS) | Speech-to-Speech (S2S) | Sentence Cascade |
|---|---|---|---|
| Best for | Most NPCs, full control | Low-latency, simple NPCs | High-importance, complex NPCs |
| Latency | 1.5–3 s | 0.5–1.5 s | ~0.5 s perceived |
| Voice quality | ⭐⭐⭐ Dedicated TTS voices | ⭐⭐ Model-dependent | ⭐⭐⭐ Dedicated TTS voices |
| Response quality | ⭐⭐⭐ Any LLM | ⭐⭐ S2S model only | ⭐⭐⭐ Strong model continuation |
| Cost | $$ | $–$$ | $$$ (~2x LLM) |
| Tool calling | Full MCP tool support | Provider-dependent | Strong model only |
| Fantasy name handling | Good (STT keyword boost) | Poor (no boost) | Good (STT keyword boost) |
| Configuration complexity | Low (3 providers) | Low (1 provider) | High (2 LLMs + TTS + STT) |
| Maturity | Production | Production | Experimental |
Decision Flowchart
```
Is sub-second latency critical for this NPC?
+-- No  --> Use Cascaded (best flexibility and quality)
+-- Yes
    +-- Are complex, multi-sentence responses expected?
    |   +-- Yes --> Consider Sentence Cascade (experimental)
    |   +-- No  --> Use S2S (simplest low-latency option)
    +-- Is fine voice control needed?
        +-- Yes --> Use Cascaded or Sentence Cascade
        +-- No  --> Use S2S
```
Rules of thumb:
- Default choice: Cascaded. It offers the best balance of quality, flexibility, and debuggability.
- Combat callouts, simple greetings: S2S for lowest latency when response quality is less critical.
- Quest-givers, villain monologues, deep lore: Sentence cascade (if willing to accept experimental status and higher cost).
- Budget-constrained: S2S or cascaded with a fast LLM (GPT-4o-mini, Gemini Flash).
See Also
- architecture.md – system-level architecture overview
- providers.md – provider configuration and supported backends
- configuration.md – full configuration reference
- design/01-architecture.md – detailed architecture design document
- design/05-sentence-cascade.md – sentence cascade design rationale and research context