Date: 2026-03-22 Status: Proposal Author: Claude (deep investigation)
Problem Statement
In distributed mode, Glyphoxa splits into a gateway (slash commands, session orchestration) and a worker (voice pipeline). Both need the same bot token to interact with Discord, but Discord enforces a single gateway WebSocket per bot token. Voice connections require gateway events (VOICE_STATE_UPDATE, VOICE_SERVER_UPDATE) that only flow over the gateway WebSocket.
Failed Approaches
| Approach | Why it fails |
|---|---|
| Both connect | Second IDENTIFY invalidates first (close code 4005) |
| Sharding | Guild maps to one shard; both gateway + worker need the same guild |
| Gateway handoff (current) | Gateway suspends, worker takes over, then conn.Open() hangs for 60s |
Root Cause Analysis
How disgo Voice Connections Work
The voice connection flow in disgo (voice/conn.go):
conn.Open(ctx, channelID, selfMute, selfDeaf)
│
├─ voiceStateUpdateFunc(ctx, guildID, channelID, ...)
│  └─ Sends Opcode 4 (VoiceStateUpdate) via bot gateway
│
└─ Blocks on openedChan until voice is ready
   └─ Signaled when SessionDescription is received (final handshake step)
For openedChan to fire, two bot gateway dispatch events must arrive:
- VOICE_STATE_UPDATE → conn.HandleVoiceStateUpdate() stores SessionID
- VOICE_SERVER_UPDATE → conn.HandleVoiceServerUpdate() opens voice WebSocket with State{Token, Endpoint, SessionID}
The voice WebSocket then performs its own handshake: Identify → Ready → UDP → SelectProtocol → SessionDescription → openedChan signaled.
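The ordering of that handshake can be modeled as a small state machine. This is a minimal sketch using only the standard library; `handshakeStep` and `advance` are hypothetical names for illustration and are not disgo identifiers.

```go
package main

import "fmt"

// handshakeStep enumerates the voice WebSocket handshake stages listed above.
// Illustrative only; disgo does not expose a type like this.
type handshakeStep int

const (
	stepIdentify handshakeStep = iota
	stepReady
	stepUDP
	stepSelectProtocol
	stepSessionDescription
)

var stepNames = []string{"Identify", "Ready", "UDP", "SelectProtocol", "SessionDescription"}

func (s handshakeStep) String() string {
	if s < 0 || int(s) >= len(stepNames) {
		return "Unknown"
	}
	return stepNames[s]
}

// advance accepts the next observed step only if it directly follows the
// current one. opened is true once SessionDescription is reached, which is
// the point where openedChan would be signaled.
func advance(current, next handshakeStep) (opened bool, err error) {
	if next != current+1 {
		return false, fmt.Errorf("unexpected step %s after %s", next, current)
	}
	return next == stepSessionDescription, nil
}

func main() {
	cur := stepIdentify
	for _, next := range []handshakeStep{stepReady, stepUDP, stepSelectProtocol, stepSessionDescription} {
		opened, err := advance(cur, next)
		if err != nil {
			fmt.Println("handshake error:", err)
			return
		}
		fmt.Printf("%s -> %s (opened=%v)\n", cur, next, opened)
		cur = next
	}
}
```

Only the final transition reports `opened=true`, which mirrors why a stall at any earlier stage leaves conn.Open() blocked.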
Why the Handoff Fails
The current handoff (gateway_bot.go:SuspendGateway → worker creates VoiceOnlyPlatform) has this sequence:
1. Gateway calls SuspendGateway()
   ├─ client.Gateway.Close(ctx) with CloseNormalClosure
   └─ Clears SessionID, ResumeURL, LastSequenceReceived
2. Worker creates VoiceOnlyPlatform
   ├─ disgo.New(token, WithDefaultGateway())
   └─ client.OpenGateway(ctx) → IDENTIFY → Ready
      └─ Ready returns immediately (guilds listed as "unavailable")
3. Worker calls platform.Connect(ctx, channelID)
   ├─ voiceMgr.CreateConn(guildID)
   └─ conn.Open(ctx, channelID, false, false)
      └─ Sends Opcode 4 → waits on openedChan → HANGS
The most likely cause: The worker sends Opcode 4 before Discord has fully hydrated the new session. After IDENTIFY, Discord sends Ready with guilds as UnavailableGuild. Guild hydration happens via subsequent GUILD_CREATE dispatch events, which arrive asynchronously after Ready. OpenGateway() returns as soon as Ready is received (see gateway.go:649-656), before any GUILD_CREATE events are processed.
Discord likely silently drops or queues Opcode 4 (Voice State Update) for guilds that haven't been hydrated via GUILD_CREATE yet. The voice events (VOICE_STATE_UPDATE, VOICE_SERVER_UPDATE) are never dispatched, so openedChan is never signaled, and conn.Open() blocks until the caller's context times out (60s from gRPC deadline).
Additional contributing factors:
- No wait-for-guild-ready logic in VoiceOnlyPlatform: it calls Connect() immediately after OpenGateway() returns
- The Ready→SetSelfUser dispatch in disgo uses an unbuffered channel (readyChan), creating a window where OpenGateway returns before event handlers have populated caches, though this is unlikely to be the direct cause since the network round-trip for voice events is much longer than the event dispatch delay
- Even if the timing issue were fixed, the fundamental problem remains: the gateway pod cannot handle slash commands while the worker holds the gateway connection
Solution Options (Ranked)
Option B: Voice State Proxy (RECOMMENDED)
Architecture: Gateway bot owns the gateway connection permanently. It joins voice on behalf of the worker, captures the voice credentials, and sends them to the worker via gRPC. The worker connects directly to the Discord voice server (WebSocket + UDP) without needing a bot gateway connection at all.
┌─────────────────────────────────────────────┐
│ Gateway Pod (permanent gateway connection)  │
│                                             │
│ 1. Receives /start slash command            │
│ 2. Sends Opcode 4 (join voice channel)      │
│ 3. Receives VOICE_STATE_UPDATE → sessionID  │
│ 4. Receives VOICE_SERVER_UPDATE → token,    │
│    endpoint                                 │
│ 5. Sends credentials to worker via gRPC     │
│ 6. Keeps gateway open for slash commands    │
└──────────────────┬──────────────────────────┘
                   │ gRPC (voice credentials)
                   ▼
┌─────────────────────────────────────────────┐
│ Worker Pod (no bot gateway needed)          │
│                                             │
│ 1. Receives voice credentials via gRPC      │
│ 2. Creates voice.Conn directly (NewConn)    │
│ 3. Calls HandleVoiceStateUpdate() +         │
│    HandleVoiceServerUpdate() manually       │
│ 4. voice.Conn connects to voice WebSocket   │
│    (wss://endpoint) + UDP directly          │
│ 5. Runs audio pipeline (VAD→STT→LLM→TTS)    │
│ 6. Audio flows directly: Worker ↔ Discord   │
└─────────────────────────────────────────────┘
Why this works: The voice WebSocket (wss://{endpoint}?v=8) and UDP connection are completely independent of the bot gateway. The bot gateway is only needed to:
- Send Opcode 4 (request to join voice): the gateway pod does this
- Receive VOICE_STATE_UPDATE / VOICE_SERVER_UPDATE: the gateway pod captures these
After capturing the credentials, the worker creates a connection with disgo's exported voice.NewConn() and feeds the credentials in through HandleVoiceStateUpdate() and HandleVoiceServerUpdate(); the latter opens the voice gateway with State{Token, Endpoint, SessionID}. No second bot gateway connection is needed.
Key disgo APIs that make this possible:
- voice.NewConn(guildID, userID, voiceStateUpdateFunc, removeFunc, opts...): exported, no dependency on bot.Client
- conn.HandleVoiceStateUpdate(event): public method, accepts event data directly
- conn.HandleVoiceServerUpdate(event): public method, triggers voice gateway connection
- voice.WithConnDaveSessionCreateFunc(golibdave.NewSession): DAVE E2EE works without bot gateway
Code Changes
1. Extend StartSessionRequest with voice credentials
// internal/gateway/contract.go
type StartSessionRequest struct {
// ... existing fields ...
// Voice credentials (populated by gateway before dispatch).
// When set, the worker connects directly to the voice server
// without opening a bot gateway connection.
VoiceSessionID string
VoiceToken string
VoiceEndpoint string
BotUserID string // bot's user snowflake (for voice.NewConn)
}
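Because the new fields select between two connection paths, a half-filled request is worth rejecting explicitly rather than letting it fall through to the legacy path. The following is a minimal sketch under that assumption; `voiceCredentials`, `useProxy`, and `errPartialCredentials` are hypothetical names, and requiring all four fields (including BotUserID) is a stricter rule than the proposal itself states.

```go
package main

import (
	"errors"
	"fmt"
)

// voiceCredentials mirrors the four fields added to StartSessionRequest.
// Hypothetical type for illustration.
type voiceCredentials struct {
	VoiceSessionID string
	VoiceToken     string
	VoiceEndpoint  string
	BotUserID      string
}

var errPartialCredentials = errors.New("voice credentials partially set")

// useProxy reports whether the worker should take the proxy path. All fields
// empty means legacy mode, all fields set means proxy mode, and anything in
// between is an error.
func (c voiceCredentials) useProxy() (bool, error) {
	set := 0
	for _, f := range []string{c.VoiceSessionID, c.VoiceToken, c.VoiceEndpoint, c.BotUserID} {
		if f != "" {
			set++
		}
	}
	switch set {
	case 0:
		return false, nil
	case 4:
		return true, nil
	default:
		return false, fmt.Errorf("%w: %d of 4 fields present", errPartialCredentials, set)
	}
}

func main() {
	full := voiceCredentials{
		VoiceSessionID: "sess",
		VoiceToken:     "tok",
		VoiceEndpoint:  "voice.example.invalid:443",
		BotUserID:      "123",
	}
	proxy, err := full.useProxy()
	fmt.Println(proxy, err)

	partial := voiceCredentials{VoiceToken: "tok"}
	_, err = partial.useProxy()
	fmt.Println(err)
}
```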
2. Gateway captures voice credentials before dispatching to worker
// internal/gateway/sessionctrl.go (new method)
func (gc *GatewaySessionController) captureVoiceCredentials(
	ctx context.Context, guildID, channelID string,
) (sessionID, token, endpoint string, err error) {
	gID, err := snowflake.Parse(guildID)
	if err != nil {
		return "", "", "", fmt.Errorf("parse guild ID: %w", err)
	}
	chID, err := snowflake.Parse(channelID)
	if err != nil {
		return "", "", "", fmt.Errorf("parse channel ID: %w", err)
	}

	type voiceCreds struct {
		sessionID string
		token     string
		endpoint  string
	}
	credsCh := make(chan voiceCreds, 1)
	var (
		mu        sync.Mutex
		creds     voiceCreds
		gotState  bool
		gotServer bool
	)
	// publish delivers creds once both events have arrived. Callers hold mu.
	// The send is non-blocking so a repeated event cannot stall the event
	// dispatcher while the buffer is full.
	publish := func() {
		if !gotState || !gotServer {
			return
		}
		select {
		case credsCh <- creds:
		default:
		}
	}

	// Register temporary event listeners for voice events.
	stateListener := bot.NewListenerFunc(func(e *events.GuildVoiceStateUpdate) {
		// The voice state is carried in the event's VoiceState field.
		vs := e.VoiceState
		if vs.GuildID != gID || vs.UserID != gc.gwBot.Client().ID() {
			return
		}
		mu.Lock()
		defer mu.Unlock()
		creds.sessionID = vs.SessionID
		gotState = true
		publish()
	})
	serverListener := bot.NewListenerFunc(func(e *events.VoiceServerUpdate) {
		if e.GuildID != gID || e.Endpoint == nil {
			return
		}
		mu.Lock()
		defer mu.Unlock()
		creds.token = e.Token
		creds.endpoint = *e.Endpoint
		gotServer = true
		publish()
	})
	gc.gwBot.Client().AddEventListeners(stateListener, serverListener)
	defer gc.gwBot.Client().RemoveEventListeners(stateListener, serverListener)

	// Send Opcode 4 to join voice.
	if err := gc.gwBot.Client().UpdateVoiceState(ctx, gID, &chID, false, false); err != nil {
		return "", "", "", fmt.Errorf("send voice state update: %w", err)
	}
	select {
	case c := <-credsCh:
		return c.sessionID, c.token, c.endpoint, nil
	case <-ctx.Done():
		return "", "", "", ctx.Err()
	}
}
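The join logic in this method (two events, either order, deliver once) can be exercised in isolation from disgo. The sketch below is a standard-library model of that logic only; `credsCollector` and its methods are hypothetical names, and the non-blocking send is a defensive choice so that a duplicate event cannot block a dispatcher goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

type voiceCreds struct {
	sessionID, token, endpoint string
}

// credsCollector joins the two voice events, which may arrive in either order.
type credsCollector struct {
	mu        sync.Mutex
	creds     voiceCreds
	gotState  bool
	gotServer bool
	done      chan voiceCreds // buffered, capacity 1
}

func newCredsCollector() *credsCollector {
	return &credsCollector{done: make(chan voiceCreds, 1)}
}

// onState records the session ID from VOICE_STATE_UPDATE.
func (c *credsCollector) onState(sessionID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.creds.sessionID = sessionID
	c.gotState = true
	c.publishLocked()
}

// onServer records the token and endpoint from VOICE_SERVER_UPDATE.
func (c *credsCollector) onServer(token, endpoint string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.creds.token, c.creds.endpoint = token, endpoint
	c.gotServer = true
	c.publishLocked()
}

// publishLocked delivers the credentials once both halves are present.
// The caller must hold c.mu.
func (c *credsCollector) publishLocked() {
	if !c.gotState || !c.gotServer {
		return
	}
	select {
	case c.done <- c.creds:
	default: // already delivered; drop the duplicate
	}
}

func main() {
	c := newCredsCollector()
	c.onServer("tok", "voice.example.invalid:443") // server update arrives first
	c.onState("sess")
	c.onState("sess") // duplicate event: dropped, does not block
	fmt.Printf("%+v\n", <-c.done)
}
```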
3. Update GatewaySessionController.Start(): join voice before dispatch
// internal/gateway/sessionctrl.go (modified Start())
func (gc *GatewaySessionController) Start(ctx context.Context, req SessionStartRequest) error {
// ... existing validation and session creation ...
if gc.dispatcher != nil {
// Capture voice credentials BEFORE dispatching to worker.
// Gateway stays connected, so no suspend/resume is needed.
voiceCtx, voiceCancel := context.WithTimeout(ctx, 10*time.Second)
defer voiceCancel()
vsID, vToken, vEndpoint, err := gc.captureVoiceCredentials(
voiceCtx, req.GuildID, req.ChannelID)
if err != nil {
_ = gc.orch.Transition(ctx, sessionID, SessionEnded, err.Error())
return fmt.Errorf("gateway: capture voice credentials: %w", err)
}
startReq := StartSessionRequest{
// ... existing fields ...
VoiceSessionID: vsID,
VoiceToken: vToken,
VoiceEndpoint: vEndpoint,
BotUserID: gc.gwBot.Client().ID().String(),
}
// Dispatch to worker (no SuspendGateway call needed!)
// ...
}
return nil
}
4. New worker voice platform: VoiceProxyPlatform
// pkg/audio/discord/voice_proxy.go
// VoiceProxyPlatform connects to a Discord voice server using
// pre-captured credentials (sessionID, token, endpoint) instead of
// opening its own bot gateway connection. This is used in distributed
// mode where the gateway pod owns the gateway and passes voice
// credentials to the worker via gRPC.
type VoiceProxyPlatform struct {
	conn      voice.Conn
	guildID   snowflake.ID
	botUserID snowflake.ID
	readyCh   chan struct{}
	closeOnce sync.Once
}

func NewVoiceProxyPlatform(
	guildIDStr, botUserIDStr string,
	opts ...voice.ConnConfigOpt,
) (*VoiceProxyPlatform, error) {
	guildID, err := snowflake.Parse(guildIDStr)
	if err != nil {
		return nil, fmt.Errorf("parse guild ID: %w", err)
	}
	botUserID, err := snowflake.Parse(botUserIDStr)
	if err != nil {
		return nil, fmt.Errorf("parse bot user ID: %w", err)
	}
	vp := &VoiceProxyPlatform{
		guildID:   guildID,
		botUserID: botUserID,
		readyCh:   make(chan struct{}, 1),
	}
	// No-op: the gateway pod handles Opcode 4.
	noopStateUpdate := func(ctx context.Context, guildID snowflake.ID,
		channelID *snowflake.ID, selfMute, selfDeaf bool) error {
		return nil
	}
	// Signal readiness when SessionDescription arrives (final handshake step).
	allOpts := append([]voice.ConnConfigOpt{
		voice.WithConnEventHandlerFunc(func(_ voice.Gateway, _ voice.Opcode,
			_ int, data voice.GatewayMessageData) {
			if _, ok := data.(voice.GatewayMessageDataSessionDescription); ok {
				select {
				case vp.readyCh <- struct{}{}:
				default:
				}
			}
		}),
	}, opts...)
	vp.conn = voice.NewConn(guildID, botUserID, noopStateUpdate, func() {}, allOpts...)
	return vp, nil
}

// Connect feeds the pre-captured voice credentials into the Conn, which
// triggers the voice WebSocket + UDP handshake internally.
func (vp *VoiceProxyPlatform) Connect(
	ctx context.Context, channelIDStr, voiceSessionID, voiceToken, voiceEndpoint string,
) (audio.Connection, error) {
	channelID, err := snowflake.Parse(channelIDStr)
	if err != nil {
		return nil, fmt.Errorf("parse channel ID: %w", err)
	}
	// Feed the credentials that the gateway captured. UserID must be the
	// bot's user ID (the one passed to NewConn), not the guild ID.
	vp.conn.HandleVoiceStateUpdate(botgateway.EventVoiceStateUpdate{
		VoiceState: discord.VoiceState{
			GuildID:   vp.guildID,
			ChannelID: &channelID,
			UserID:    vp.botUserID,
			SessionID: voiceSessionID,
		},
	})
	vp.conn.HandleVoiceServerUpdate(botgateway.EventVoiceServerUpdate{
		Token:    voiceToken,
		GuildID:  vp.guildID,
		Endpoint: &voiceEndpoint,
	})
	select {
	case <-vp.readyCh:
		return newConnection(vp.conn, vp.guildID), nil
	case <-ctx.Done():
		// ctx is already done here, so clean up with a fresh, bounded context.
		closeCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		vp.conn.Close(closeCtx)
		return nil, fmt.Errorf("voice proxy connect: %w", ctx.Err())
	}
}
5. Update workerFactory.CreateRuntime: use proxy platform when credentials present
// cmd/glyphoxa/worker_factory.go (modified CreateRuntime, step 3)
if req.VoiceSessionID != "" && req.VoiceToken != "" && req.VoiceEndpoint != "" {
// Distributed mode: use pre-captured voice credentials.
proxyPlatform, err := discord.NewVoiceProxyPlatform(
req.GuildID, req.BotUserID,
voice.WithConnDaveSessionCreateFunc(golibdave.NewSession),
)
if err != nil { /* cleanup and return */ }
conn, err = proxyPlatform.Connect(sessionCtx,
req.ChannelID, req.VoiceSessionID, req.VoiceToken, req.VoiceEndpoint)
if err != nil { /* cleanup and return */ }
platform = proxyPlatform // for teardown
} else {
// Full mode or legacy: open own gateway (existing code).
voicePlatform, err := discord.NewVoiceOnlyPlatform(...)
// ...
}
6. Handle voice server updates during session (reconnection)
// New gRPC method: gateway → worker
// internal/gateway/contract.go
type WorkerClient interface {
// ... existing methods ...
UpdateVoiceServer(ctx context.Context, sessionID, token, endpoint string) error
}
// The gateway registers an ongoing listener for VOICE_SERVER_UPDATE.
// If the voice server changes mid-session (Discord migration), it
// forwards the new credentials to the worker.
7. Session teardown: gateway leaves voice
// internal/gateway/sessionctrl.go (modified Stop())
func (gc *GatewaySessionController) Stop(ctx context.Context, sessionID string) error {
// ... existing logic ...
// Leave voice channel (send Opcode 4 with channelID=nil).
if gc.gwBot != nil {
guildID := gc.guildIDForSession(sessionID)
if guildID != 0 {
_ = gc.gwBot.Client().UpdateVoiceState(ctx, guildID, nil, false, false)
}
}
// No ResumeGateway needed: the gateway never suspended.
return nil
}
8. gRPC proto updates
message StartSessionRequest {
// ... existing fields ...
string voice_session_id = 9;
string voice_token = 10;
string voice_endpoint = 11;
string bot_user_id = 12;
}
// New RPC for mid-session voice server updates
service WorkerService {
// ... existing RPCs ...
rpc UpdateVoiceServer(UpdateVoiceServerRequest) returns (google.protobuf.Empty);
}
message UpdateVoiceServerRequest {
string session_id = 1;
string token = 2;
string endpoint = 3;
}
Tradeoffs
| Pro | Con |
|---|---|
| Gateway stays connected, so slash commands always work | Voice credentials pass through gRPC (sensitive, but internal) |
| No suspend/resume dance | Mid-session voice server changes need forwarding |
| Worker needs no bot gateway (less resource usage) | Slightly more complex session start (capture + forward) |
| DAVE E2EE works unchanged on worker | Must handle gateway leaving voice on session end |
| Clean separation: gateway = Discord, worker = audio | New VoiceProxyPlatform to maintain |
| No timing/race issues | Bot shows as "in voice" on gateway pod (cosmetic) |
Option A: Gateway-Owned Voice (Audio Proxy)
Architecture: Gateway bot joins voice AND handles audio I/O. Audio data (opus frames) is streamed between gateway and worker via bidirectional gRPC streaming.
Discord ↔ Gateway (voice + audio I/O) ↔ gRPC stream ↔ Worker (VAD→STT→LLM→TTS)
Pros:
- Simplest conceptual model: the worker never touches Discord
- No voice credential passing
- Gateway handles all Discord interactions
Cons:
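- Latency penalty: Every audio frame crosses an extra gRPC hop (~2-10ms). Relative to a 20ms opus frame that is 10-50% of one frame interval, and the hop is paid in both directions (inbound speech and outbound TTS). This eats into the 1.2s mouth-to-ear target and may push the pipeline past it.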
- Gateway becomes bottleneck: All audio flows through it, increasing CPU/bandwidth
- Major refactor: Need bidirectional gRPC audio streaming, opus frame serialization, mixer changes
- Gateway resource usage: Gateway pods need audio processing capabilities
Verdict: Feasible but suboptimal. The latency penalty and gateway bottleneck make this worse than Option B for a voice-centric application.
Option D: Fix Current Approach
Potential fix: Wait for GUILD_CREATE before joining voice.
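// In NewVoiceOnlyPlatform: register the listener before OpenGateway so the
// guild-ready event cannot be missed, then wait for it after OpenGateway returns.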
readyCh := make(chan struct{}, 1)
client.AddEventListeners(bot.NewListenerFunc(func(e *events.GuildReady) {
if e.Guild.ID == gID {
select {
case readyCh <- struct{}{}:
default:
}
}
}))
if err := client.OpenGateway(ctx); err != nil { ... }
select {
case <-readyCh:
// Guild is ready, safe to join voice
case <-ctx.Done():
return nil, ctx.Err()
}
Pros:
- Minimal code change: just add a wait
- Tests the root cause theory directly
Cons:
- Doesn't solve the fundamental problem: Gateway can't handle slash commands while worker holds the gateway
- Fragile: Depends on Discord's guild hydration timing
- Still requires suspend/resume dance
- Single point of failure: If worker crashes, gateway must resume quickly or bot goes offline
Verdict: Worth trying as a quick diagnostic test to confirm the root cause theory, but not a production solution. Even if it works, the architectural problems remain.
Option C: Per-Tenant Bot Tokens (Already Implemented)
Each tenant already provides their own bot token via WithBotToken(token). The problem isn't shared tokens; it's that a single tenant's bot token can only have one gateway connection.
Not a solution: this is already the case.
Recommendation
Option B (Voice State Proxy) is the clear winner:
- Correctness: Eliminates the single-gateway problem by design
- Performance: Direct worker↔Discord voice connection (no audio proxy overhead)
- Reliability: Gateway never suspends, so slash commands always work
- Simplicity: Uses disgo's existing public APIs (voice.NewConn, HandleVoiceStateUpdate, HandleVoiceServerUpdate)
- Compatibility: DAVE E2EE works unchanged; existing audio pipeline unaffected
Implementation Order
- Add VoiceProxyPlatform in pkg/audio/discord/voice_proxy.go
- Add captureVoiceCredentials to GatewaySessionController
- Extend StartSessionRequest with voice credential fields
- Update gRPC proto + codegen
- Update workerFactory.CreateRuntime to use proxy when credentials present
- Add UpdateVoiceServer gRPC method for mid-session reconnection
- Update Stop() to leave voice channel via gateway
- Remove SuspendGateway / ResumeGateway (dead code after migration)
- Remove VoiceOnlyPlatform (only needed for legacy full mode fallback, or keep for --mode=full)
Testing Strategy
- Unit test VoiceProxyPlatform with mock voice.Conn (verify HandleVoiceStateUpdate / HandleVoiceServerUpdate are called correctly)
- Unit test captureVoiceCredentials with mock event dispatch
- Integration test: gateway captures credentials → worker connects to voice → audio flows
- Manual test with Discord: /start → verify bot joins voice, NPCs respond, /stop → bot leaves
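The first unit test in that list could be shaped along these lines. This is a sketch, not the project's test code: `credentialSink` is a hypothetical interface that reduces voice.Conn to the two handler calls with plain string arguments, and `fakeConn` records calls so order and arguments can be asserted.

```go
package main

import "fmt"

// credentialSink is the slice of voice.Conn that the proxy platform relies
// on, reduced to plain strings. Hypothetical interface for illustration.
type credentialSink interface {
	HandleVoiceStateUpdate(guildID, userID, sessionID string)
	HandleVoiceServerUpdate(guildID, token, endpoint string)
}

// feedCredentials replays captured credentials into the sink in the order the
// proposal uses: voice state first, then voice server.
func feedCredentials(s credentialSink, guildID, userID, sessionID, token, endpoint string) {
	s.HandleVoiceStateUpdate(guildID, userID, sessionID)
	s.HandleVoiceServerUpdate(guildID, token, endpoint)
}

// fakeConn records every call it receives.
type fakeConn struct{ calls []string }

func (f *fakeConn) HandleVoiceStateUpdate(guildID, userID, sessionID string) {
	f.calls = append(f.calls, fmt.Sprintf("state:%s:%s:%s", guildID, userID, sessionID))
}

func (f *fakeConn) HandleVoiceServerUpdate(guildID, token, endpoint string) {
	f.calls = append(f.calls, fmt.Sprintf("server:%s:%s:%s", guildID, token, endpoint))
}

func main() {
	f := &fakeConn{}
	feedCredentials(f, "guild", "bot", "sess", "tok", "voice.example.invalid:443")
	for _, c := range f.calls {
		fmt.Println(c)
	}
}
```

Asserting on the recorded user ID as well as the guild ID guards against passing the wrong snowflake into the voice state update.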