🧩 Overview
Glyphoxa uses an interface-first provider architecture to decouple the voice AI pipeline from any specific vendor or service. Every external AI capability (LLM completions, speech-to-text, text-to-speech, speech-to-speech, embeddings, and voice activity detection) sits behind a Go interface defined in pkg/provider/. Swapping a cloud API for a local model, or migrating between vendors, is a configuration change that instantiates a different struct. No pipeline code needs to change.
Key design principles:
- One interface per capability. Each pipeline stage has exactly one interface. Implementations are in sub-packages named after the vendor or technology.
- Concurrency-safe by contract. All provider interfaces require implementations to be safe for concurrent use from multiple goroutines. Channels returned by streaming methods are closed by the implementation.
- Compile-time enforcement. Every concrete provider includes a var _ Interface = (*Impl)(nil) assertion so that missing methods are caught at compile time, not at runtime.
- Resilience built in. The internal/resilience package wraps any provider in a FallbackGroup with per-entry circuit breakers, enabling automatic failover without changing caller code.
🔌 Provider Interfaces
| Interface | Package | Key Methods | Purpose |
|---|---|---|---|
| llm.Provider | pkg/provider/llm | StreamCompletion, Complete, CountTokens, Capabilities | LLM text completions (streaming and batch), token counting, and model capability introspection |
| stt.Provider | pkg/provider/stt | StartStream (returns SessionHandle) | Opens streaming transcription sessions; SessionHandle exposes SendAudio, Partials, Finals, SetKeywords, Close |
| tts.Provider | pkg/provider/tts | SynthesizeStream, ListVoices, CloneVoice | Streaming text-to-speech synthesis, voice catalogue listing, and voice cloning |
| s2s.Provider | pkg/provider/s2s | Connect (returns SessionHandle), Capabilities | End-to-end speech-to-speech sessions; SessionHandle exposes SendAudio, Audio, Transcripts, OnToolCall, SetTools, UpdateInstructions, InjectTextContext, Interrupt, Close |
| embeddings.Provider | pkg/provider/embeddings | Embed, EmbedBatch, Dimensions, ModelID | Text-to-vector conversion for semantic memory retrieval (L2 index) |
| vad.Engine | pkg/provider/vad | NewSession (returns SessionHandle) | Voice activity detection; SessionHandle exposes ProcessFrame, Reset, Close |
LLM Provider
The LLM interface normalises differences between OpenAI, Anthropic, Google, Groq, and local model APIs. It supports streaming completions via Go channels, tool/function calling, system prompts, and multi-turn conversation history.
type Provider interface {
StreamCompletion(ctx context.Context, req CompletionRequest) (<-chan Chunk, error)
Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
CountTokens(messages []Message) (int, error)
Capabilities() ModelCapabilities
}
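To illustrate the streaming side of the contract, here is a minimal consumption sketch. The Messages, Role, Content, and Text names are assumptions for this example; consult pkg/provider/llm for the real request and chunk shapes.
// streamAnswer is a usage sketch: it assumes CompletionRequest carries a
// Messages slice and that Chunk exposes its text delta as Text.
func streamAnswer(ctx context.Context, p llm.Provider) error {
	chunks, err := p.StreamCompletion(ctx, llm.CompletionRequest{
		Messages: []llm.Message{{Role: "user", Content: "Greet the party."}},
	})
	if err != nil {
		return err
	}
	for c := range chunks { // the provider closes the channel when the stream ends
		fmt.Print(c.Text)
	}
	return nil
}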
STT Provider
The STT interface models real-time transcription as a session-based streaming API. Each session accepts PCM audio frames and emits two transcript streams: low-latency partials (for speculative pre-fetch) and authoritative finals (for the session log and LLM input).
type Provider interface {
StartStream(ctx context.Context, cfg StreamConfig) (SessionHandle, error)
}
The SessionHandle interface exposes SendAudio, Partials, Finals, SetKeywords, and Close.
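A minimal usage sketch of the session flow follows. The StreamConfig fields and the transcript's Text field are assumptions; only the methods listed above come from the interface.
// transcribe feeds PCM frames into a session and prints authoritative finals.
func transcribe(ctx context.Context, p stt.Provider, frames <-chan []byte) error {
	sess, err := p.StartStream(ctx, stt.StreamConfig{}) // config fields omitted
	if err != nil {
		return err
	}
	defer sess.Close()

	go func() {
		for f := range frames {
			sess.SendAudio(f) // error handling elided in this sketch
		}
	}()

	for final := range sess.Finals() { // closed by the implementation on session end
		fmt.Println(final.Text) // transcript field name assumed
	}
	return nil
}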
TTS Provider
The TTS interface accepts a channel of text fragments (piped directly from streaming LLM output) and returns a channel of raw PCM audio bytes. This channel-in/channel-out design enables low-latency pipelining without waiting for the full text.
type Provider interface {
SynthesizeStream(ctx context.Context, text <-chan string, voice VoiceProfile) (<-chan []byte, error)
ListVoices(ctx context.Context) ([]VoiceProfile, error)
CloneVoice(ctx context.Context, samples [][]byte) (*VoiceProfile, error)
}
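Because both sides are channels, wiring streaming LLM output into TTS is a simple pass-through. A sketch using only the signature above:
// speak pipes streaming text fragments into TTS and forwards PCM to a sink.
func speak(ctx context.Context, t tts.Provider, voice tts.VoiceProfile, text <-chan string, sink chan<- []byte) error {
	audio, err := t.SynthesizeStream(ctx, text, voice)
	if err != nil {
		return err
	}
	for pcm := range audio { // closes once the text channel closes and synthesis drains
		sink <- pcm
	}
	return nil
}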
S2S Provider
The S2S (speech-to-speech) interface models providers that handle audio-in to audio-out in a single stateful session, bypassing the separate STT/LLM/TTS pipeline. Sessions are long-lived and support mid-session reconfiguration of instructions, tools, and context.
type Provider interface {
Connect(ctx context.Context, cfg SessionConfig) (SessionHandle, error)
Capabilities() S2SCapabilities
}
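A minimal session sketch using only the SessionHandle methods listed above (SessionConfig fields are omitted):
// converse wires microphone frames in and synthesised PCM out.
func converse(ctx context.Context, p s2s.Provider, mic <-chan []byte, speaker chan<- []byte) error {
	sess, err := p.Connect(ctx, s2s.SessionConfig{}) // config fields omitted
	if err != nil {
		return err
	}
	defer sess.Close()

	go func() {
		for f := range mic {
			sess.SendAudio(f) // error handling elided in this sketch
		}
	}()

	for pcm := range sess.Audio() { // closed by the provider on session end
		speaker <- pcm
	}
	return nil
}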
Embeddings Provider
The embeddings interface converts text to dense float32 vectors for the semantic memory layer (pgvector). It supports both single-text and batch embedding for efficiency.
type Provider interface {
Embed(ctx context.Context, text string) ([]float32, error)
EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
Dimensions() int
ModelID() string
}
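A short sketch showing how the four methods fit together in an indexing path:
// embedBatch embeds a batch of texts and sanity-checks vector sizes
// against the provider's declared dimensionality.
func embedBatch(ctx context.Context, e embeddings.Provider, texts []string) ([][]float32, error) {
	vecs, err := e.EmbedBatch(ctx, texts)
	if err != nil {
		return nil, err
	}
	for _, v := range vecs {
		if len(v) != e.Dimensions() {
			return nil, fmt.Errorf("%s returned a %d-dim vector, want %d", e.ModelID(), len(v), e.Dimensions())
		}
	}
	return vecs, nil
}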
VAD Engine
Voice Activity Detection runs locally with sub-millisecond latency. It is the first stage of the audio pipeline: all audio passes through VAD before reaching STT or S2S engines. The interface is named Engine (not Provider) because VAD is always local and never a remote service.
type Engine interface {
NewSession(cfg Config) (SessionHandle, error)
}
The SessionHandle exposes synchronous ProcessFrame(frame []byte) (VADEvent, error), Reset(), and Close() error.
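A minimal gating sketch (how VADEvent encodes the speech/silence state is not shown here, so the event is left uninspected):
// gateFrames runs every frame through VAD before any downstream stage.
func gateFrames(eng vad.Engine, frames <-chan []byte) error {
	sess, err := eng.NewSession(vad.Config{}) // config fields omitted
	if err != nil {
		return err
	}
	defer sess.Close()

	for f := range frames {
		ev, err := sess.ProcessFrame(f)
		if err != nil {
			return err
		}
		_ = ev // gate downstream delivery on the event (VADEvent shape not shown here)
	}
	return nil
}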
📦 Supported Providers
LLM Providers
| Provider | Package | Backend | Status | Latency Tier | Cost Tier |
|---|---|---|---|---|---|
| OpenAI (GPT-4o, GPT-4o-mini, o1, o3) | pkg/provider/llm/anyllm | any-llm-go | Production | Medium | $$$ |
| Anthropic (Claude Sonnet, Haiku, Opus) | pkg/provider/llm/anyllm | any-llm-go | Production | Medium | $$$ |
| Google Gemini (2.0 Flash, 1.5 Pro) | pkg/provider/llm/anyllm | any-llm-go | Production | Medium | $$ |
| Groq (Llama 3.x) | pkg/provider/llm/anyllm | any-llm-go | Production | Low | $ |
| Ollama (any local model) | pkg/provider/llm/anyllm | any-llm-go | Production | Varies | Free |
| DeepSeek | pkg/provider/llm/anyllm | any-llm-go | Production | Medium | $ |
| Mistral | pkg/provider/llm/anyllm | any-llm-go | Production | Medium | $$ |
| llama.cpp (local server) | pkg/provider/llm/anyllm | any-llm-go | Production | Varies | Free |
| llamafile (local server) | pkg/provider/llm/anyllm | any-llm-go | Production | Varies | Free |
| Mock | pkg/provider/llm/mock | In-memory | Testing | – | – |
All LLM providers are implemented through a single unified adapter (anyllm.Provider) that wraps the mozilla-ai/any-llm-go library. This library provides a consistent interface across all supported backends, with Go channel-based streaming and typed error normalisation. Convenience constructors (NewOpenAI, NewAnthropic, NewGemini, etc.) are available for common providers.
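For example, two backends constructed through the adapter are interchangeable as llm.Provider values (error handling is elided here, matching the resilience example later in this document):
// Both constructors return the same unified adapter; either value satisfies llm.Provider.
openai, _ := anyllm.NewOpenAI("gpt-4o")
claude, _ := anyllm.NewAnthropic("claude-3-5-sonnet-latest")

var active llm.Provider = openai // swapping to claude changes no pipeline code
_, _ = active, claude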
STT Providers
| Provider | Package | Status | Latency Tier | Cost Tier | Keyword Boost |
|---|---|---|---|---|---|
| Deepgram Nova-3 | pkg/provider/stt/deepgram | Production | Low | $ | Yes (at session start) |
| Whisper.cpp (HTTP server) | pkg/provider/stt/whisper | Production | Medium | Free | No |
| Whisper.cpp (native CGO) | pkg/provider/stt/whisper (NativeProvider) | Production | Medium | Free | No |
| Mock | pkg/provider/stt/mock | Testing | – | – | – |
TTS Providers
| Provider | Package | Status | Latency Tier | Cost Tier | Voice Cloning |
|---|---|---|---|---|---|
| ElevenLabs (Flash v2.5) | pkg/provider/tts/elevenlabs | Production | Low | $$$ | Planned |
| Coqui TTS (Standard) | pkg/provider/tts/coqui | Production | Medium | Free | No |
| Coqui XTTS v2 | pkg/provider/tts/coqui (APIModeXTTS) | Production | Medium | Free | Yes |
| Mock | pkg/provider/tts/mock | Testing | – | – | – |
S2S Providers
| Provider | Package | Status | Latency Tier | Cost Tier | Mid-Session Updates |
|---|---|---|---|---|---|
| OpenAI Realtime (gpt-4o-realtime) | pkg/provider/s2s/openai | Production | Low | $$$ | Instructions, tools, interrupt |
| Gemini Live (gemini-2.0-flash-live) | pkg/provider/s2s/gemini | Production | Low | $$ | Context injection only |
| Mock | pkg/provider/s2s/mock | Testing | – | – | – |
Embeddings Providers
| Provider | Package | Status | Latency Tier | Cost Tier | Dimensions |
|---|---|---|---|---|---|
| OpenAI (text-embedding-3-small/large) | pkg/provider/embeddings/openai | Production | Medium | $ | 1536 / 3072 |
| Ollama (nomic-embed-text, mxbai-embed-large, all-minilm) | pkg/provider/embeddings/ollama | Production | Low | Free | 768 / 1024 / 384 |
| Mock | pkg/provider/embeddings/mock | Testing | – | – | – |
VAD Engines
| Provider | Package | Status | Latency | Cost |
|---|---|---|---|---|
| Silero VAD | Planned (via silero-vad-go) | Planned | Sub-ms | Free |
| Mock | pkg/provider/vad/mock | Testing | – | – |
Note: The VAD Engine interface and SessionHandle are fully defined. The Silero VAD implementation is planned; currently only the mock engine is available.
⚙️ Configuring Providers
Providers are configured in the providers section of the Glyphoxa YAML config file. Each entry follows the same name / api_key / base_url / model / options pattern.
Configuration Schema
type ProviderEntry struct {
Name string `yaml:"name"` // Provider implementation name
APIKey string `yaml:"api_key"` // Authentication key
BaseURL string `yaml:"base_url"` // Override default API endpoint
Model string `yaml:"model"` // Model selection within provider
Options map[string]any `yaml:"options"` // Provider-specific options
}
Full Example
providers:
llm:
name: openai
api_key: "${OPENAI_API_KEY}"
model: gpt-4o
stt:
name: deepgram
api_key: "${DEEPGRAM_API_KEY}"
model: nova-3
options:
language: en
sample_rate: 16000
tts:
name: elevenlabs
api_key: "${ELEVENLABS_API_KEY}"
model: eleven_flash_v2_5
options:
output_format: pcm_16000
s2s:
name: openai
api_key: "${OPENAI_API_KEY}"
model: gpt-4o-realtime-preview
embeddings:
name: openai
api_key: "${OPENAI_API_KEY}"
model: text-embedding-3-small
vad:
name: silero
options:
speech_threshold: 0.5
silence_threshold: 0.35
frame_size_ms: 30
Provider-Specific Examples
Local-only stack (no API keys required):
providers:
llm:
name: ollama
model: llama3.1:8b
base_url: http://localhost:11434
stt:
name: whisper
base_url: http://localhost:8080
options:
language: en
silence_threshold_ms: 500
tts:
name: coqui
base_url: http://localhost:5002
options:
language: en
api_mode: standard
embeddings:
name: ollama
model: nomic-embed-text
base_url: http://localhost:11434
S2S with Gemini Live:
providers:
s2s:
name: gemini
api_key: "${GEMINI_API_KEY}"
model: gemini-2.0-flash-live-001
Coqui XTTS with voice cloning:
providers:
tts:
name: coqui
base_url: http://localhost:8002
options:
api_mode: xtts
language: en
🛠️ Adding a New Provider
Follow these steps to add a new provider implementation.
Step 1: Create the package
Create a new directory under the appropriate provider type:
pkg/provider/<type>/<vendor>/
For example, to add an AssemblyAI STT provider:
pkg/provider/stt/assemblyai/
assemblyai.go
assemblyai_test.go
Step 2: Implement the interface
Your provider struct must implement the relevant interface. Here is a minimal skeleton for an STT provider:
package assemblyai
import (
	"context"
	"errors"

	"github.com/MrWong99/glyphoxa/pkg/provider/stt"
)
// Provider implements stt.Provider backed by the AssemblyAI streaming API.
type Provider struct {
apiKey string
model string
}
// New creates a new AssemblyAI Provider.
func New(apiKey string) (*Provider, error) {
if apiKey == "" {
return nil, errors.New("assemblyai: apiKey must not be empty")
}
return &Provider{apiKey: apiKey}, nil
}
// StartStream opens a streaming transcription session.
func (p *Provider) StartStream(ctx context.Context, cfg stt.StreamConfig) (stt.SessionHandle, error) {
	// TODO: dial the streaming endpoint and wrap the connection in an stt.SessionHandle.
	return nil, errors.New("assemblyai: not implemented")
}
Step 3: Add a compile-time interface assertion
Add this line at the package level (typically near the top of the file, after the struct definition):
// Compile-time assertion that Provider satisfies stt.Provider.
var _ stt.Provider = (*Provider)(nil)
This ensures that if the interface changes, the compiler will catch any missing methods immediately rather than surfacing them as runtime panics.
Step 4: Register in the config loader
Add your new provider name to the config loader so it can be selected via YAML configuration. Update the relevant factory function in internal/config/ or the provider registry to handle the new name value.
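The exact factory shape lives in internal/config/ and may differ; as a hedged sketch, the change usually amounts to one new case (the deepgram.New and whisper.New calls below are illustrative, not confirmed signatures):
// Hypothetical registry switch in internal/config/ -- adapt to the real factory.
func newSTTProvider(entry ProviderEntry) (stt.Provider, error) {
	switch entry.Name {
	case "deepgram":
		return deepgram.New(entry.APIKey) // signature assumed for illustration
	case "whisper":
		return whisper.New(entry.BaseURL) // signature assumed for illustration
	case "assemblyai": // the new provider from Step 1
		return assemblyai.New(entry.APIKey)
	default:
		return nil, fmt.Errorf("unknown stt provider %q", entry.Name)
	}
}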
Step 5: Write tests
Every provider package should include tests that verify:
- The constructor rejects invalid inputs (empty API key, empty URL, etc.)
- The compile-time assertion is present
- Message parsing and serialisation are correct (unit-testable without a live service)
- Sessions behave correctly under cancellation and close
Follow the patterns established by existing test files (e.g., deepgram_test.go, whisper_test.go, elevenlabs_test.go).
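For instance, the constructor guard from the Step 2 skeleton can be covered in a few lines:
package assemblyai

import "testing"

// TestNewRejectsEmptyAPIKey verifies the constructor guard from Step 2.
func TestNewRejectsEmptyAPIKey(t *testing.T) {
	if _, err := New(""); err == nil {
		t.Fatal(`New("") should return an error, got nil`)
	}
}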
Step 6: Add a mock (if needed)
If your provider has a session-based interface, consider adding a mock implementation in a mock/ sub-package for use in integration tests. See pkg/provider/stt/mock/ for a reference implementation.
✅ Compile-Time Interface Assertions
Every concrete provider in Glyphoxa includes a compile-time assertion of the form:
var _ stt.Provider = (*Provider)(nil)
This is a standard Go idiom that creates a zero-value pointer to the implementation type and assigns it to a blank variable typed as the interface. The compiler verifies that *Provider satisfies stt.Provider at compile time. If a method is missing or has the wrong signature, the build fails immediately.
Why this matters
Without compile-time assertions, a provider could silently fail to implement an interface. The error would only surface at runtime, when the value is first assigned to an interface variable, potentially deep in the pipeline during a live session. In a real-time voice system where latency matters, this kind of late failure is unacceptable.
Where to find them
Compile-time assertions are placed in every provider and mock package:
// pkg/provider/stt/whisper/whisper.go
var _ stt.Provider = (*Provider)(nil)
// pkg/provider/stt/whisper/native.go
var _ stt.Provider = (*NativeProvider)(nil)
var _ stt.SessionHandle = (*nativeSession)(nil)
// pkg/provider/s2s/openai/openai.go
var _ s2s.Provider = (*Provider)(nil)
var _ s2s.SessionHandle = (*session)(nil)
// pkg/provider/tts/coqui/coqui.go
var _ tts.Provider = (*Provider)(nil)
// pkg/provider/embeddings/ollama/ollama.go
var _ embeddings.Provider = (*Provider)(nil)
// internal/resilience/llm_fallback.go
var _ llm.Provider = (*LLMFallback)(nil)
Session handle interfaces (e.g., stt.SessionHandle, s2s.SessionHandle, vad.SessionHandle) also receive assertions because they are equally important to correctness.
🛡️ Resilience and Failover
The internal/resilience package provides automatic failover and circuit breaking for provider calls. It ensures that a single provider outage does not take down the entire voice pipeline.
Fallback Groups
A FallbackGroup[T] wraps a primary provider and zero or more fallbacks of the same type. When the primary fails (or its circuit breaker is open), the next healthy fallback is tried in registration order.
// Create a fallback group with OpenAI as primary, Anthropic as fallback.
primary, _ := anyllm.NewOpenAI("gpt-4o")
fallback, _ := anyllm.NewAnthropic("claude-3-5-sonnet-latest")
group := resilience.NewLLMFallback(primary, "openai-gpt4o", resilience.FallbackConfig{
CircuitBreaker: resilience.CircuitBreakerConfig{
MaxFailures: 5,
ResetTimeout: 30 * time.Second,
HalfOpenMax: 3,
},
})
group.AddFallback("anthropic-sonnet", fallback)
// Use group as a normal llm.Provider -- failover is transparent.
resp, err := group.Complete(ctx, req)
Type-specific fallback wrappers are provided for each provider interface:
| Wrapper | Interface | Package |
|---|---|---|
| LLMFallback | llm.Provider | internal/resilience |
| STTFallback | stt.Provider | internal/resilience |
| TTSFallback | tts.Provider | internal/resilience |
Each wrapper implements the full provider interface, so callers cannot distinguish a fallback-wrapped provider from a bare one.
Circuit Breaker
Each provider entry in a FallbackGroup has its own CircuitBreaker with three states:
| State | Behaviour |
|---|---|
| Closed (normal) | All calls are forwarded. Consecutive failures are counted. |
| Open (tripped) | Calls are rejected immediately with ErrCircuitOpen. Entered after MaxFailures consecutive failures. |
| Half-Open (probe) | Entered after ResetTimeout elapses. A limited number of probe calls (HalfOpenMax) are allowed through. If they succeed, the breaker closes. If any fail, it re-opens. |
Configuration parameters:
| Parameter | Default | Description |
|---|---|---|
| MaxFailures | 5 | Consecutive failures before the breaker opens |
| ResetTimeout | 30s | How long the breaker stays open before transitioning to half-open |
| HalfOpenMax | 3 | Maximum probe calls in the half-open state |
Failover Scope
Failover covers the initial connection/request only. For streaming providers (STT, TTS, S2S), once a stream is established, mid-stream errors are the caller's responsibility. This is a deliberate design choice: re-establishing a stream mid-utterance would produce incoherent audio output.
Error Propagation
When all providers in a group fail, Execute returns ErrAllFailed wrapping the last error. The FallbackGroup logs each failure at WARN level and each circuit-open skip at DEBUG level, providing visibility into provider health without flooding logs.
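Continuing the FallbackGroup example above, callers can branch on the sentinel to degrade gracefully:
resp, err := group.Complete(ctx, req)
switch {
case errors.Is(err, resilience.ErrAllFailed):
	// primary and every fallback failed (or had an open breaker):
	// degrade gracefully, e.g. play a canned holding line
case err != nil:
	// a non-failover error: bad request, cancelled context, ...
default:
	_ = resp // use the response as normal
}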
⚖️ Engine Trade-offs
Glyphoxa supports three conversation engine modes, each offering a different balance of latency, quality, cost, and flexibility. The engine is selected per-NPC via the engine field in the NPC configuration.
Engine Modes
Cascaded (cascaded) – The traditional pipeline: STT transcribes player speech to text, the LLM generates a response, and TTS synthesises the response into audio. Each stage is independently configurable and swappable.
Speech-to-Speech (s2s) – An end-to-end model (OpenAI Realtime or Gemini Live) handles the entire audio-in to audio-out flow in a single API session. The orchestrator interacts with the NPC exclusively through the VoiceEngine interface; it does not know which engine is behind it.
Sentence Cascade (sentence_cascade) – An experimental dual-model approach: a fast, small model generates the opening sentence (for rapid time-to-first-audio), then a stronger model generates the substantive continuation. Both outputs are piped through TTS.
Comparison Table
| Dimension | Cascaded (STT+LLM+TTS) | S2S (Direct) | Sentence Cascade |
|---|---|---|---|
| End-to-end latency | 1.5–3s (additive) | 0.5–1s | 0.8–1.5s (first sentence fast) |
| Response quality | Highest (full LLM) | Good (constrained by S2S model) | High (strong model for body) |
| Voice quality | Excellent (dedicated TTS) | Good (model-integrated) | Excellent (dedicated TTS) |
| Cost | $–$$ (3 API calls) | $$$ (single premium session) | $$–$$$ (2 LLM + TTS) |
| Flexibility | Full (mix any STT+LLM+TTS) | Limited (single provider) | High (two LLMs + TTS) |
| Tool calling | Full LLM tool support | Provider-specific limits | Full LLM tool support |
| Keyword boosting | Yes (via STT provider) | No (model handles STT) | Yes (via STT provider) |
| Voice cloning | Yes (via TTS provider) | No (fixed voice set) | Yes (via TTS provider) |
| Provider requirements | STT + LLM + TTS | S2S provider | STT + 2x LLM + TTS |
| Status | Production | Production | Experimental |
When to Use Each Engine
Use Cascaded when:
- You need maximum flexibility in provider selection
- Response quality is the top priority
- You need keyword boosting for fantasy proper nouns
- You need voice cloning for custom NPC voices
- Latency of 1.5–3 seconds per turn is acceptable
Use S2S when:
- Sub-second response latency is critical (e.g., fast-paced combat narration)
- You are using OpenAI Realtime or Gemini Live
- Tool calling requirements are within S2S provider limits
- The S2S provider's built-in voices meet your NPC needs
Use Sentence Cascade when:
- You want faster time-to-first-audio than pure cascaded
- You want the quality of a strong model for the substantive response
- You are experimenting with the dual-model approach
Configuration Example
npcs:
- name: "Greymantle the Sage"
engine: cascaded
personality: "A wise old wizard who speaks in measured tones."
voice:
provider: elevenlabs
voice_id: "pNInz6obpgDQGcFmaJgB"
budget_tier: standard
- name: "Quicksilver"
engine: s2s
personality: "A fast-talking rogue who never misses a beat."
voice:
provider: openai
voice_id: "alloy"
budget_tier: fast
- name: "The Oracle"
engine: sentence_cascade
personality: "A mysterious seer who begins with cryptic fragments."
voice:
provider: elevenlabs
voice_id: "21m00Tcm4TlvDq8ikWAM"
cascade_mode: always
cascade:
fast_model: gpt-4o-mini
strong_model: gpt-4o
📚 See also
- architecture.md – System architecture overview
- configuration.md – Full configuration reference
- audio-pipeline.md – Audio pipeline and mixer design
- design/02-providers.md – Design rationale for the provider abstraction layer