🧩 Overview

Glyphoxa uses an interface-first provider architecture to decouple the voice AI pipeline from any specific vendor or service. Every external AI capability – LLM completions, speech-to-text, text-to-speech, speech-to-speech, embeddings, and voice activity detection – sits behind a Go interface defined in pkg/provider/. Swapping a cloud API for a local model, or migrating between vendors, is a configuration change that instantiates a different struct. No pipeline code needs to change.

Key design principles:

  • One interface per capability. Each pipeline stage has exactly one interface. Implementations are in sub-packages named after the vendor or technology.
  • Concurrency-safe by contract. All provider interfaces require implementations to be safe for concurrent use from multiple goroutines. Channels returned by streaming methods are closed by the implementation.
  • Compile-time enforcement. Every concrete provider includes a var _ Interface = (*Impl)(nil) assertion so that missing methods are caught at compile time, not at runtime.
  • Resilience built in. The internal/resilience package wraps any provider in a FallbackGroup with per-entry circuit breakers, enabling automatic failover without changing caller code.

πŸ”Œ Provider Interfaces

Interface Package Key Methods Purpose
llm.Provider pkg/provider/llm StreamCompletion, Complete, CountTokens, Capabilities LLM text completions (streaming and batch), token counting, and model capability introspection
stt.Provider pkg/provider/stt StartStream (returns SessionHandle) Opens streaming transcription sessions; SessionHandle exposes SendAudio, Partials, Finals, SetKeywords, Close
tts.Provider pkg/provider/tts SynthesizeStream, ListVoices, CloneVoice Streaming text-to-speech synthesis, voice catalogue listing, and voice cloning
s2s.Provider pkg/provider/s2s Connect (returns SessionHandle), Capabilities End-to-end speech-to-speech sessions; SessionHandle exposes SendAudio, Audio, Transcripts, OnToolCall, SetTools, UpdateInstructions, InjectTextContext, Interrupt, Close
embeddings.Provider pkg/provider/embeddings Embed, EmbedBatch, Dimensions, ModelID Text-to-vector conversion for semantic memory retrieval (L2 index)
vad.Engine pkg/provider/vad NewSession (returns SessionHandle) Voice activity detection; SessionHandle exposes ProcessFrame, Reset, Close

LLM Provider

The LLM interface normalises differences between OpenAI, Anthropic, Google, Groq, and local model APIs. It supports streaming completions via Go channels, tool/function calling, system prompts, and multi-turn conversation history.

type Provider interface {
    StreamCompletion(ctx context.Context, req CompletionRequest) (<-chan Chunk, error)
    Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
    CountTokens(messages []Message) (int, error)
    Capabilities() ModelCapabilities
}

STT Provider

The STT interface models real-time transcription as a session-based streaming API. Each session accepts PCM audio frames and emits two transcript streams: low-latency partials (for speculative pre-fetch) and authoritative finals (for the session log and LLM input).

type Provider interface {
    StartStream(ctx context.Context, cfg StreamConfig) (SessionHandle, error)
}

The SessionHandle interface exposes SendAudio, Partials, Finals, SetKeywords, and Close.

TTS Provider

The TTS interface accepts a channel of text fragments (piped directly from streaming LLM output) and returns a channel of raw PCM audio bytes. This channel-in/channel-out design enables low-latency pipelining without waiting for the full text.

type Provider interface {
    SynthesizeStream(ctx context.Context, text <-chan string, voice VoiceProfile) (<-chan []byte, error)
    ListVoices(ctx context.Context) ([]VoiceProfile, error)
    CloneVoice(ctx context.Context, samples [][]byte) (*VoiceProfile, error)
}

S2S Provider

The S2S (speech-to-speech) interface models providers that handle audio-in to audio-out in a single stateful session, bypassing the separate STT/LLM/TTS pipeline. Sessions are long-lived and support mid-session reconfiguration of instructions, tools, and context.

type Provider interface {
    Connect(ctx context.Context, cfg SessionConfig) (SessionHandle, error)
    Capabilities() S2SCapabilities
}

Embeddings Provider

The embeddings interface converts text to dense float32 vectors for the semantic memory layer (pgvector). It supports both single-text and batch embedding for efficiency.

type Provider interface {
    Embed(ctx context.Context, text string) ([]float32, error)
    EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
    Dimensions() int
    ModelID() string
}

VAD Engine

Voice Activity Detection runs locally with sub-millisecond latency. It is the first stage of the audio pipeline – all audio passes through VAD before reaching STT or S2S engines. The interface is named Engine (not Provider) because VAD is always local and never a remote service.

type Engine interface {
    NewSession(cfg Config) (SessionHandle, error)
}

The SessionHandle exposes synchronous ProcessFrame(frame []byte) (VADEvent, error), Reset(), and Close() error.


πŸ“¦ Supported Providers

LLM Providers

Provider Package Backend Status Latency Tier Cost Tier
OpenAI (GPT-4o, GPT-4o-mini, o1, o3) pkg/provider/llm/anyllm any-llm-go Production Medium $$$
Anthropic (Claude Sonnet, Haiku, Opus) pkg/provider/llm/anyllm any-llm-go Production Medium $$$
Google Gemini (2.0 Flash, 1.5 Pro) pkg/provider/llm/anyllm any-llm-go Production Medium $$
Groq (Llama 3.x) pkg/provider/llm/anyllm any-llm-go Production Low $
Ollama (any local model) pkg/provider/llm/anyllm any-llm-go Production Varies Free
DeepSeek pkg/provider/llm/anyllm any-llm-go Production Medium $
Mistral pkg/provider/llm/anyllm any-llm-go Production Medium $$
llama.cpp (local server) pkg/provider/llm/anyllm any-llm-go Production Varies Free
llamafile (local server) pkg/provider/llm/anyllm any-llm-go Production Varies Free
Mock pkg/provider/llm/mock In-memory Testing – –

All LLM providers are implemented through a single unified adapter (anyllm.Provider) that wraps the mozilla-ai/any-llm-go library. This library provides a consistent interface across all supported backends, with Go channel-based streaming and typed error normalisation. Convenience constructors (NewOpenAI, NewAnthropic, NewGemini, etc.) are available for common providers.

STT Providers

Provider Package Status Latency Tier Cost Tier Keyword Boost
Deepgram Nova-3 pkg/provider/stt/deepgram Production Low $ Yes (at session start)
Whisper.cpp (HTTP server) pkg/provider/stt/whisper Production Medium Free No
Whisper.cpp (native CGO) pkg/provider/stt/whisper (NativeProvider) Production Medium Free No
Mock pkg/provider/stt/mock Testing – – –

TTS Providers

Provider Package Status Latency Tier Cost Tier Voice Cloning
ElevenLabs (Flash v2.5) pkg/provider/tts/elevenlabs Production Low $$$ Planned
Coqui TTS (Standard) pkg/provider/tts/coqui Production Medium Free No
Coqui XTTS v2 pkg/provider/tts/coqui (APIModeXTTS) Production Medium Free Yes
Mock pkg/provider/tts/mock Testing – – –

S2S Providers

Provider Package Status Latency Tier Cost Tier Mid-Session Updates
OpenAI Realtime (gpt-4o-realtime) pkg/provider/s2s/openai Production Low \(\) Instructions, tools, interrupt
Gemini Live (gemini-2.0-flash-live) pkg/provider/s2s/gemini Production Low $$ Context injection only
Mock pkg/provider/s2s/mock Testing – – –

Embeddings Providers

Provider Package Status Latency Tier Cost Tier Dimensions
OpenAI (text-embedding-3-small/large) pkg/provider/embeddings/openai Production Medium $ 1536 / 3072
Ollama (nomic-embed-text, mxbai-embed-large, all-minilm) pkg/provider/embeddings/ollama Production Low Free 768 / 1024 / 384
Mock pkg/provider/embeddings/mock Testing – – –

VAD Engines

Provider Package Status Latency Cost
Silero VAD Planned (via silero-vad-go) Planned Sub-ms Free
Mock pkg/provider/vad/mock Testing – –

Note: The VAD Engine interface and SessionHandle are fully defined. The Silero VAD implementation is planned; currently only the mock engine is available.


βš™οΈ Configuring Providers

Providers are configured in the providers section of the Glyphoxa YAML config file. Each entry follows the same name / api_key / base_url / model / options pattern.

Configuration Schema

type ProviderEntry struct {
    Name    string         `yaml:"name"`     // Provider implementation name
    APIKey  string         `yaml:"api_key"`  // Authentication key
    BaseURL string         `yaml:"base_url"` // Override default API endpoint
    Model   string         `yaml:"model"`    // Model selection within provider
    Options map[string]any `yaml:"options"`  // Provider-specific options
}

Full Example

providers:
  llm:
    name: openai
    api_key: "${OPENAI_API_KEY}"
    model: gpt-4o

  stt:
    name: deepgram
    api_key: "${DEEPGRAM_API_KEY}"
    model: nova-3
    options:
      language: en
      sample_rate: 16000

  tts:
    name: elevenlabs
    api_key: "${ELEVENLABS_API_KEY}"
    model: eleven_flash_v2_5
    options:
      output_format: pcm_16000

  s2s:
    name: openai
    api_key: "${OPENAI_API_KEY}"
    model: gpt-4o-realtime-preview

  embeddings:
    name: openai
    api_key: "${OPENAI_API_KEY}"
    model: text-embedding-3-small

  vad:
    name: silero
    options:
      speech_threshold: 0.5
      silence_threshold: 0.35
      frame_size_ms: 30

Provider-Specific Examples

Local-only stack (no API keys required):

providers:
  llm:
    name: ollama
    model: llama3.1:8b
    base_url: http://localhost:11434

  stt:
    name: whisper
    base_url: http://localhost:8080
    options:
      language: en
      silence_threshold_ms: 500

  tts:
    name: coqui
    base_url: http://localhost:5002
    options:
      language: en
      api_mode: standard

  embeddings:
    name: ollama
    model: nomic-embed-text
    base_url: http://localhost:11434

S2S with Gemini Live:

providers:
  s2s:
    name: gemini
    api_key: "${GEMINI_API_KEY}"
    model: gemini-2.0-flash-live-001

Coqui XTTS with voice cloning:

providers:
  tts:
    name: coqui
    base_url: http://localhost:8002
    options:
      api_mode: xtts
      language: en

πŸ› οΈ Adding a New Provider

Follow these steps to add a new provider implementation.

Step 1: Create the package

Create a new directory under the appropriate provider type:

pkg/provider/<type>/<vendor>/

For example, to add an AssemblyAI STT provider:

pkg/provider/stt/assemblyai/
    assemblyai.go
    assemblyai_test.go

Step 2: Implement the interface

Your provider struct must implement the relevant interface. Here is a minimal skeleton for an STT provider:

package assemblyai

import (
    "context"
    "github.com/MrWong99/glyphoxa/pkg/provider/stt"
)

// Provider implements stt.Provider backed by the AssemblyAI streaming API.
type Provider struct {
    apiKey string
    model  string
}

// New creates a new AssemblyAI Provider.
func New(apiKey string) (*Provider, error) {
    if apiKey == "" {
        return nil, errors.New("assemblyai: apiKey must not be empty")
    }
    return &Provider{apiKey: apiKey}, nil
}

// StartStream opens a streaming transcription session.
func (p *Provider) StartStream(ctx context.Context, cfg stt.StreamConfig) (stt.SessionHandle, error) {
    // Implementation here...
}

Step 3: Add a compile-time interface assertion

Add this line at the package level (typically near the top of the file, after the struct definition):

// Compile-time assertion that Provider satisfies stt.Provider.
var _ stt.Provider = (*Provider)(nil)

This ensures that if the interface changes, the compiler will catch any missing methods immediately rather than surfacing them as runtime panics.

Step 4: Register in the config loader

Add your new provider name to the config loader so it can be selected via YAML configuration. Update the relevant factory function in internal/config/ or the provider registry to handle the new name value.

Step 5: Write tests

Every provider package should include tests that verify:

  • The constructor rejects invalid inputs (empty API key, empty URL, etc.)
  • The compile-time assertion is present
  • Message parsing and serialisation are correct (unit-testable without a live service)
  • Sessions behave correctly under cancellation and close

Follow the patterns established by existing test files (e.g., deepgram_test.go, whisper_test.go, elevenlabs_test.go).

Step 6: Add a mock (if needed)

If your provider has a session-based interface, consider adding a mock implementation in a mock/ sub-package for use in integration tests. See pkg/provider/stt/mock/ for a reference implementation.


βœ… Compile-Time Interface Assertions

Every concrete provider in Glyphoxa includes a compile-time assertion of the form:

var _ stt.Provider = (*Provider)(nil)

This is a standard Go idiom that creates a zero-value pointer to the implementation type and assigns it to a blank variable typed as the interface. The compiler verifies that *Provider satisfies stt.Provider at compile time. If a method is missing or has the wrong signature, the build fails immediately.

Why this matters

Without compile-time assertions, a provider could silently fail to implement an interface. The error would only surface at runtime when the value is first assigned to an interface variable – potentially deep in the pipeline during a live session. In a real-time voice system where latency matters, this kind of late failure is unacceptable.

Where to find them

Compile-time assertions are placed in every provider and mock package:

// pkg/provider/stt/whisper/whisper.go
var _ stt.Provider = (*Provider)(nil)

// pkg/provider/stt/whisper/native.go
var _ stt.Provider = (*NativeProvider)(nil)
var _ stt.SessionHandle = (*nativeSession)(nil)

// pkg/provider/s2s/openai/openai.go
var _ s2s.Provider = (*Provider)(nil)
var _ s2s.SessionHandle = (*session)(nil)

// pkg/provider/tts/coqui/coqui.go
var _ tts.Provider = (*Provider)(nil)

// pkg/provider/embeddings/ollama/ollama.go
var _ embeddings.Provider = (*Provider)(nil)

// internal/resilience/llm_fallback.go
var _ llm.Provider = (*LLMFallback)(nil)

Session handle interfaces (e.g., stt.SessionHandle, s2s.SessionHandle, vad.SessionHandle) also receive assertions because they are equally important to correctness.


πŸ›‘οΈ Resilience and Failover

The internal/resilience package provides automatic failover and circuit breaking for provider calls. It ensures that a single provider outage does not take down the entire voice pipeline.

Fallback Groups

A FallbackGroup[T] wraps a primary provider and zero or more fallbacks of the same type. When the primary fails (or its circuit breaker is open), the next healthy fallback is tried in registration order.

// Create a fallback group with OpenAI as primary, Anthropic as fallback.
primary, _ := anyllm.NewOpenAI("gpt-4o")
fallback, _ := anyllm.NewAnthropic("claude-3-5-sonnet-latest")

group := resilience.NewLLMFallback(primary, "openai-gpt4o", resilience.FallbackConfig{
    CircuitBreaker: resilience.CircuitBreakerConfig{
        MaxFailures:  5,
        ResetTimeout: 30 * time.Second,
        HalfOpenMax:  3,
    },
})
group.AddFallback("anthropic-sonnet", fallback)

// Use group as a normal llm.Provider -- failover is transparent.
resp, err := group.Complete(ctx, req)

Type-specific fallback wrappers are provided for each provider interface:

Wrapper Interface Package
LLMFallback llm.Provider internal/resilience
STTFallback stt.Provider internal/resilience
TTSFallback tts.Provider internal/resilience

Each wrapper implements the full provider interface, so callers cannot distinguish a fallback-wrapped provider from a bare one.

Circuit Breaker

Each provider entry in a FallbackGroup has its own CircuitBreaker with three states:

State Behaviour
Closed (normal) All calls are forwarded. Consecutive failures are counted.
Open (tripped) Calls are rejected immediately with ErrCircuitOpen. Entered after MaxFailures consecutive failures.
Half-Open (probe) Entered after ResetTimeout elapses. A limited number of probe calls (HalfOpenMax) are allowed through. If they succeed, the breaker closes. If any fail, it re-opens.

Configuration parameters:

Parameter Default Description
MaxFailures 5 Consecutive failures before the breaker opens
ResetTimeout 30s How long the breaker stays open before transitioning to half-open
HalfOpenMax 3 Maximum probe calls in the half-open state

Failover Scope

Failover covers the initial connection/request only. For streaming providers (STT, TTS, S2S), once a stream is established, mid-stream errors are the caller’s responsibility. This is a deliberate design choice: re-establishing a stream mid-utterance would produce incoherent audio output.

Error Propagation

When all providers in a group fail, Execute returns ErrAllFailed wrapping the last error. The FallbackGroup logs each failure at WARN level and each circuit-open skip at DEBUG level, providing visibility into provider health without flooding logs.


βš–οΈ Engine Trade-offs

Glyphoxa supports three conversation engine modes, each offering a different balance of latency, quality, cost, and flexibility. The engine is selected per-NPC via the engine field in the NPC configuration.

Engine Modes

Cascaded (cascaded) – The traditional pipeline: STT transcribes player speech to text, the LLM generates a response, and TTS synthesises the response into audio. Each stage is independently configurable and swappable.

Speech-to-Speech (s2s) – An end-to-end model (OpenAI Realtime or Gemini Live) handles the entire audio-in to audio-out flow in a single API session. The orchestrator interacts with the NPC exclusively through the VoiceEngine interface – it does not know which engine is behind it.

Sentence Cascade (sentence_cascade) – An experimental dual-model approach: a fast, small model generates the opening sentence (for rapid time-to-first-audio), then a stronger model generates the substantive continuation. Both outputs are piped through TTS.

Comparison Table

Dimension Cascaded (STT+LLM+TTS) S2S (Direct) Sentence Cascade
End-to-end latency 1.5–3s (additive) 0.5–1s 0.8–1.5s (first sentence fast)
Response quality Highest (full LLM) Good (constrained by S2S model) High (strong model for body)
Voice quality Excellent (dedicated TTS) Good (model-integrated) Excellent (dedicated TTS)
Cost \(--\)$ (3 API calls) \(\) (single premium session) \($--\)$$ (2 LLM + TTS)
Flexibility Full (mix any STT+LLM+TTS) Limited (single provider) High (two LLMs + TTS)
Tool calling Full LLM tool support Provider-specific limits Full LLM tool support
Keyword boosting Yes (via STT provider) No (model handles STT) Yes (via STT provider)
Voice cloning Yes (via TTS provider) No (fixed voice set) Yes (via TTS provider)
Provider requirements STT + LLM + TTS S2S provider STT + 2x LLM + TTS
Status Production Production Experimental

When to Use Each Engine

Use Cascaded when:

  • You need maximum flexibility in provider selection
  • Response quality is the top priority
  • You need keyword boosting for fantasy proper nouns
  • You need voice cloning for custom NPC voices
  • Latency of 1.5–3 seconds per turn is acceptable

Use S2S when:

  • Sub-second response latency is critical (e.g., fast-paced combat narration)
  • You are using OpenAI Realtime or Gemini Live
  • Tool calling requirements are within S2S provider limits
  • The S2S provider’s built-in voices meet your NPC needs

Use Sentence Cascade when:

  • You want faster time-to-first-audio than pure cascaded
  • You want the quality of a strong model for the substantive response
  • You are experimenting with the dual-model approach

Configuration Example

npcs:
  - name: "Greymantle the Sage"
    engine: cascaded
    personality: "A wise old wizard who speaks in measured tones."
    voice:
      provider: elevenlabs
      voice_id: "pNInz6obpgDQGcFmaJgB"
    budget_tier: standard

  - name: "Quicksilver"
    engine: s2s
    personality: "A fast-talking rogue who never misses a beat."
    voice:
      provider: openai
      voice_id: "alloy"
    budget_tier: fast

  - name: "The Oracle"
    engine: sentence_cascade
    personality: "A mysterious seer who begins with cryptic fragments."
    voice:
      provider: elevenlabs
      voice_id: "21m00Tcm4TlvDq8ikWAM"
    cascade_mode: always
    cascade:
      fast_model: gpt-4o-mini
      strong_model: gpt-4o

πŸ“š See also


This site uses Just the Docs, a documentation theme for Jekyll.