Auditor: Claude (Architecture Audit Agent) Scope: Full architectural audit across session lifecycle, multi-tenancy, voice pipeline, data consistency, configuration, error propagation, and security. Codebase: Glyphoxa main branch at commit 2fc85e5


Executive Summary

This audit reviewed ~170 non-test Go source files across the Glyphoxa codebase. 51 findings were identified: 4 Critical, 14 High, 28 Medium, and 5 Low severity. The most urgent issues are a missing tenant authorization check on session stop (allowing cross-tenant session termination), zombie sessions that permanently leak when they reach “active” state but never heartbeat, knowledge entity queries missing campaign_id filters, and NPC store lacking tenant-level isolation.


Findings

1. Session Lifecycle

1.1 TOCTOU Race in GatewaySessionController.Start

  • Severity: High
  • Location: internal/gateway/sessionctrl.go:152-158
  • Description: Start() acquires gc.mu to check gc.active[req.GuildID], then releases the lock before calling gc.orch.ValidateAndCreate(). A concurrent Start() call for the same guild can pass the check in the window between the unlock and the orchestrator’s atomic create, resulting in two sessions for the same guild.
  • Impact: Two workers dispatched for the same Discord voice channel, audio corruption, doubled resource usage.
  • Suggested fix: Hold the lock through ValidateAndCreate, or move the active-check into the orchestrator’s atomic INSERT (e.g., a UNIQUE constraint on (guild_id) WHERE state != 'ended').

1.2 Zombie Sessions With NULL Heartbeat in Active State

  • Severity: Critical
  • Location: internal/gateway/sessionorch/postgres.go:145-162
  • Description: CleanupZombies filters on last_heartbeat IS NOT NULL AND last_heartbeat < threshold. Sessions that transition to active state but never receive a heartbeat (e.g., worker dies immediately after ReportState but before the first heartbeat tick) have last_heartbeat IS NULL and state = 'active'. These are invisible to both CleanupZombies and CleanupStalePending (which only targets state = 'pending').
  • Impact: Permanently stuck “active” sessions that consume license quota slots forever, preventing the tenant from starting new sessions.
  • Suggested fix: Add a third cleanup sweep: transition sessions with state = 'active' AND last_heartbeat IS NULL AND started_at < now() - interval '5 minutes' to ended. Alternatively, extend CleanupZombies to also catch last_heartbeat IS NULL AND state != 'pending' AND started_at < threshold.

1.3 No State Transition Validation

  • Severity: Medium
  • Location: internal/gateway/sessionorch/postgres.go:67-90
  • Description: Transition() accepts any state transition without validation. There are no guards against invalid transitions like ended → active or active → pending. The UPDATE is unconditional.
  • Impact: If a bug or race condition causes an out-of-order transition, the session can end up in an inconsistent state. For example, a late ReportState(active) arriving after Stop() has already set ended would resurrect a dead session.
  • Suggested fix: Add a WHERE state != 'ended' guard to prevent re-opening ended sessions, and/or add a valid-transitions map (pending→active, pending→ended, active→ended).

1.4 Zombie Cleanup Doesn’t Update Gateway In-Memory State

  • Severity: High
  • Location: cmd/glyphoxa/main.go:625-662, internal/gateway/sessionctrl.go:42-62
  • Description: The zombie cleanup loop correctly transitions stale sessions to ended in the DB and cleans up orphaned K8s Jobs, but does NOT update the GatewaySessionController’s in-memory active map or workerAddrs map. After CleanupZombies runs, gc.active[guildID] still points to the now-ended session. IsActive(guildID) returns true, and /session start is rejected with “session already active.”
  • Impact: After a worker dies and zombie cleanup fires, the guild is permanently locked out of starting new sessions until the gateway is restarted.
  • Suggested fix: The zombie/orphan cleanup should notify the relevant GatewaySessionController to remove stale sessions from its in-memory maps. Alternatively, IsActive could cross-check with the orchestrator’s DB state.

1.5 Concurrent Stop Calls from Multiple Triggers

  • Severity: Medium
  • Location: internal/gateway/sessionctrl.go:310-357
  • Description: Stop can be called concurrently from three sources: (1) user /session stop, (2) onStreamDetach callback when audio bridge detects worker death, and (3) registerDisconnectListener when bot is kicked from voice. None guard against concurrent execution. The voice.Conn double-close could cause a panic in the disgo library.
  • Impact: Log noise from failed RPCs and “not found” K8s errors. Potential panic from double-closing the voice connection.
  • Suggested fix: Add a sync.Once or “stopping” sentinel per session to prevent concurrent Stop execution.

1.6 Full-Mode Session Not Stopped on Process Shutdown

  • Severity: Medium
  • Location: cmd/glyphoxa/main.go:264-285
  • Description: In runFull, the graceful shutdown path calls application.Shutdown() and bot.Close() but does NOT call sessionMgr.Stop(). The active session’s consolidator final consolidation is skipped (conversation history lost), the voice connection is not cleanly disconnected, and transcript recorders may not drain properly.
  • Impact: On SIGTERM during an active session: conversation history lost, Discord shows bot still in voice channel briefly, final transcript entries may be lost.
  • Suggested fix: Explicitly call sessionMgr.Stop(shutdownCtx) in the runFull shutdown path before application.Shutdown().

1.7 WorkerHandler.StopAll Doesn’t Report Ended State

  • Severity: Medium
  • Location: internal/session/worker_handler.go:289-302
  • Description: StopAll (worker graceful shutdown) empties the sessions map and stops runtimes, but does NOT call h.callback.ReportState(sessionID, SessionEnded, ...). Compare with StopSession which does report ended state. The gateway never learns these sessions have ended.
  • Impact: After worker pod shutdown, the gateway still considers those sessions active for up to ~60s (until heartbeat timeout + zombie cleanup). During this window, the guild is locked out and license slots are consumed by dead sessions.
  • Suggested fix: Have StopAll report ended state for each session via the callback, matching StopSession behavior.

1.8 Reconnector Does Not Signal Session Failure After Max Retries

  • Severity: Medium
  • Location: internal/session/reconnect.go:230-236
  • Description: When attemptReconnect exhausts all retries, it logs an error and returns. There is no callback, no session termination, and no state transition. The session remains “active” in the orchestrator but the audio connection is dead. Heartbeats continue flowing, the pipeline continues running on silence.
  • Impact: A persistent network disconnection leaves a zombie session that consumes resources, counts against license limits, and is invisible to the user.
  • Suggested fix: Add an OnFailure callback to ReconnectorConfig that triggers Stop and cleanup after all retries are exhausted.

1.9 Disconnect Listener Leak on Failed Voice Setup

  • Severity: Low
  • Location: internal/gateway/sessionctrl.go:430-465
  • Description: registerDisconnectListener adds a GuildVoiceStateUpdate event listener. If the voice bridge setup fails after the listener is registered but before the cleanup function is stored, the listener leaks.
  • Impact: Minor memory leak of event listeners over many failed session starts.
  • Suggested fix: Register the listener only after the voice bridge is fully set up.

2. Multi-Tenancy Isolation

2.1 StopSession Missing Tenant Authorization

  • Severity: Critical
  • Location: internal/web/handlers_sessions.go:121-140
  • Description: handleStopSession sends a StopWebSessionRequest to the gateway with only the sessionID — the tenant ID from the JWT claims is not included or verified. The gateway’s StopWebSession RPC (in grpctransport/management.go) calls StopSession(ctx, sessionID) without checking that the session belongs to the requesting tenant. Any authenticated user from any tenant can stop any session by knowing (or guessing) its UUID.
  • Impact: Cross-tenant session termination. A malicious tenant_admin could disrupt another tenant’s live game sessions.
  • Suggested fix: Include tenant_id in the StopWebSessionRequest protobuf. The gateway must verify session.TenantID == request.TenantID before stopping.

2.2 GetUser Store Method Not Scoped by Tenant

  • Severity: Medium
  • Location: internal/web/store.go:284-302
  • Description: GetUser() queries WHERE id = $1 without a tenant_id filter. While the handler (handleGetUser at handlers_users.go:51-57) checks user.TenantID != claims.TenantID after fetching, this is a defense-in-depth gap. If any other code path calls GetUser without the post-check, cross-tenant data leaks.
  • Impact: Potential information disclosure if a new handler or API consumer calls GetUser without the tenant check.
  • Suggested fix: Add tenant_id as a parameter to GetUser and include it in the SQL WHERE clause.

2.3 Lore/Knowledge Queries Scoped by Campaign but Not Tenant

  • Severity: Medium
  • Location: internal/web/store.go:583-604 (ListLoreDocuments), store.go:547-562 (CreateLoreDocument)
  • Description: Lore document CRUD is scoped by campaign_id but not tenant_id. The handlers verify the campaign belongs to the tenant first, but the store layer itself has no tenant guard. If a campaign ID collision occurred or a handler missed the ownership check, lore from one tenant could leak to another.
  • Impact: Defense-in-depth gap — currently mitigated by handler-level campaign ownership checks, but fragile against future handler additions.
  • Suggested fix: Add tenant_id to the lore_documents table or join against campaigns with a tenant_id filter in the store queries.

2.4 Knowledge Entity Queries Missing campaign_id Filter

  • Severity: High
  • Location: internal/web/store.go:697-751 (ListKnowledgeEntities), internal/web/store.go:753-771 (DeleteKnowledgeEntity)
  • Description: Both ListKnowledgeEntities and DeleteKnowledgeEntity query the tenant-specific schema (tenant_<id>.entities) but do NOT filter by campaign_id in the SQL WHERE clause. The campaignID parameter is accepted but never included in the query.
  • Impact: Within the same tenant, users with access to Campaign A can see and delete knowledge entities from Campaign B. This violates the campaign-scoped access model.
  • Suggested fix: Add AND campaign_id = $<N> to both queries, using the campaignID parameter that is already passed in.

2.5 NPC Store Has No Tenant-Level Isolation

  • Severity: High
  • Location: internal/agent/npcstore/postgres.go:132-159 (Get), internal/agent/npcstore/postgres.go:219-226 (Delete)
  • Description: The npc_definitions table has no tenant_id column. Get retrieves by NPC ID alone (WHERE id = $1). Delete also uses only WHERE id = $1. The web handlers verify the NPC’s campaign belongs to the current tenant via requireCampaign, but the store has no tenant guard.
  • Impact: Defense-in-depth violation. Any future code path calling npcs.Get() or npcs.Delete() without first verifying campaign ownership will have cross-tenant access.
  • Suggested fix: Add a tenant_id column to npc_definitions and include it in all queries, or move NPC definitions to tenant-specific schemas.

2.6 UpdateUser Store Method Not Scoped by Tenant

  • Severity: Medium
  • Location: internal/web/store.go:826-856
  • Description: UpdateUser uses WHERE id = $1 AND deleted_at IS NULL with no tenant_id filter. Compare to DeleteUser which properly includes AND tenant_id = $2. The handler does a separate tenant check, but the store method can update any user in any tenant.
  • Impact: Defense-in-depth violation — if any future caller invokes UpdateUser without the handler-level tenant check, cross-tenant user modification is possible.
  • Suggested fix: Add AND tenant_id = $<N> to the UpdateUser WHERE clause, consistent with DeleteUser.

2.7 Gateway Admin API Open When API Key Empty

  • Severity: Medium
  • Location: internal/gateway/admin.go:148-175
  • Description: When no API key is configured (a.apiKey == ""), the gateway admin API auth middleware allows all requests without authentication. This is noted as “backward compat” but is dangerous in production.
  • Impact: If deployed without an API key, anyone with network access can create, modify, or delete any tenant, including injecting bot tokens.
  • Suggested fix: Refuse to start the admin API without an API key in production, or default to denying all requests.

2.8 Worker gRPC Calls Not Tenant-Scoped

  • Severity: Medium
  • Location: internal/gateway/grpctransport/client.go:94-210, internal/gateway/grpctransport/server.go:88-169
  • Description: Worker gRPC calls for StopSession, ListNPCs, MuteNPC, UnmuteNPC, SpeakNPC are keyed by sessionID only — no tenantID is passed or verified. The gRPC auth only protects ManagementService RPCs; SessionWorkerService RPCs are unguarded.
  • Impact: If the gateway is compromised or multiple gateways share a worker pool, a session from Tenant A could be manipulated through worker RPCs referencing Tenant B’s session ID.
  • Suggested fix: Include tenant_id in worker RPC requests and have the worker verify it matches the session’s tenant.

2.9 ValidRoles Excludes super_admin (Correctly)

  • Severity: Low (Positive Finding)
  • Location: internal/web/store.go:58-62
  • Description: ValidRoles map deliberately excludes super_admin, preventing tenant_admin users from escalating roles to super_admin via the handleUpdateUser endpoint. This is correctly implemented.
  • Impact: N/A — this is working as intended and prevents privilege escalation.

3. Voice Pipeline

3.1 No Echo Cancellation / Self-Hearing Guard

  • Severity: Critical
  • Location: pkg/audio/discord/connection.go:99-160, internal/app/audio_pipeline.go:86-94
  • Description: The Discord connection’s ReceiveOpusFrame creates an input stream for every user ID that sends audio, including the bot’s own user ID. There is no filtering of the bot’s own user ID anywhere in the pipeline. audioPipeline.Start() iterates conn.InputStreams() and starts a VAD/STT worker for every participant, including the bot itself. Similarly, handleParticipantChange starts a worker for every EventJoin, with no bot-ID exclusion. The gRPC bridge has the same gap.
  • Impact: The NPC’s synthesized speech is decoded, fed into VAD, transcribed by STT, routed to an NPC agent, and generates a response — creating a feedback loop where the NPC talks to itself indefinitely. While Discord usually doesn’t route bot audio back, this is not guaranteed (especially with audio bridges in distributed mode).
  • Suggested fix: Store the bot’s own user ID in the Connection struct and skip it in ReceiveOpusFrame. Alternatively, filter in audioPipeline.startWorker() by checking the participant ID against a known bot ID.

3.2 Cascade Engine Background Goroutine Leak on Fast Close

  • Severity: Medium
  • Location: internal/engine/cascade/cascade.go:440-454
  • Description: Engine.Close() sets e.closed = true, closes e.done, then calls e.wg.Wait(). If a Process() call’s background goroutine is in forwardStrongModelTracked sending to textCh, and TTS has already returned (channel full or closed), the goroutine may block on textCh <- sentence indefinitely. The e.done channel is selected in waitForDone but not in the textCh send paths.
  • Impact: Goroutine leak if Close() is called while a Process() background goroutine is blocked sending text to TTS.
  • Suggested fix: Add case <-e.done: return to all textCh send select blocks, or use a context derived from the engine’s done channel.

3.3 Audio Frame Ordering in gRPC Bridge

  • Severity: Medium
  • Location: pkg/audio/grpcbridge/connection.go
  • Description: The gRPC audio bridge uses a bidirectional stream where frames are sent/received sequentially. gRPC guarantees in-order delivery on a single stream, so frame ordering is preserved in the normal case. However, if the stream reconnects (e.g., after a transient network error), frames sent during the reconnection window are lost with no indication to the pipeline.
  • Impact: Brief audio gaps during gRPC stream reconnection, which could cause STT to produce garbled transcriptions for that segment.
  • Suggested fix: Add sequence numbers to audio frames and detect/log gaps on the receiving end. Consider buffering a small window for reorder tolerance.

3.4 Data Race Between Flush() and sendLoop in gRPC Bridge

  • Severity: High
  • Location: pkg/audio/grpcbridge/connection.go:194-198, pkg/audio/grpcbridge/connection.go:313-324
  • Description: Flush() drains the output channel concurrently while sendLoop() also reads from it. Both goroutines race on <-c.output, causing frames to be non-deterministically consumed by either. Frames consumed by Flush() are silently discarded, while sendLoop may have already buffered a partial opus frame from a previous read.
  • Impact: During barge-in, the race produces audio glitches or corruption on the gateway side. A frame read by Flush is lost; sendLoop may encode against stale partial buffer data.
  • Suggested fix: Remove the local output drain from Flush() and let sendLoop do both the channel drain and buffer reset atomically via the existing flushCh signal.

3.5 STT Session Leak on Rapid VAD Speech Start/End Cycles

  • Severity: High
  • Location: internal/app/audio_pipeline.go:203-238
  • Description: When VADSpeechStart fires, the code opens a new STT session and launches a collectAndRoute goroutine. If a second VADSpeechStart fires before a corresponding VADSpeechEnd (e.g., VAD glitch), the sttSession variable is overwritten without closing the previous session. The old session’s WebSocket/HTTP connection is leaked.
  • Impact: Under rapid false-positive VAD triggers, this could leak many STT sessions (network connections, server-side state), causing resource exhaustion.
  • Suggested fix: Before opening a new STT session on VADSpeechStart, close the existing one if non-nil.

3.6 TTS/LLM Fallback Only Covers Stream Setup, Not Mid-Stream Failures

  • Severity: Medium
  • Location: internal/resilience/tts_fallback.go:33-37, internal/resilience/llm_fallback.go:43
  • Description: The fallback wrappers only apply failover to the initial SynthesizeStream/StreamCompletion call. Mid-stream errors (WebSocket disconnect, API timeout) close the audio channel prematurely. The text that was already consumed from the input channel cannot be re-sent to a fallback provider because channels are one-read. The circuit breaker won’t record this as a failure since the initial call succeeded.
  • Impact: If ElevenLabs drops the WebSocket mid-synthesis, the NPC’s response is truncated with no automatic recovery to a fallback provider.
  • Suggested fix: Buffer the text channel’s contents so they can be replayed to a fallback provider on mid-stream failure. Have the cascade engine detect stream errors and retry with a different provider.

3.7 voiceBridgeReceiver.frameCount Data Race

  • Severity: Medium
  • Location: internal/gateway/voicebridge.go:33,45-46,156
  • Description: frameCount (type uint64) is incremented from the disgo voice receiver goroutine but read from the cleanup goroutine with no synchronization. This is a data race detectable by -race.
  • Impact: Go race detector would flag this. Violates the project’s “race detector always on” convention.
  • Suggested fix: Change frameCount to atomic.Uint64.

3.8 Consolidator Summary Skip Heuristic Drops Legitimate Messages

  • Severity: Low
  • Location: internal/session/consolidator.go:158
  • Description: The consolidator skips “synthetic summary messages” by checking m.Content[0] == '['. Any legitimate user message starting with [ (e.g., “[OOC] hey guys”, “[laughs nervously]”, “[attacks the goblin]”) is incorrectly skipped and permanently lost from the session store.
  • Impact: Transcript entries starting with [ are silently lost. Common tabletop RPG conventions include bracketed OOC messages, action descriptions, and emotes — all would be dropped.
  • Suggested fix: Use a more specific prefix check, e.g., strings.HasPrefix(m.Content, "[Previous conversation summary]").

4. Data Consistency

4.1 Non-Atomic Invite Acceptance

  • Severity: Medium
  • Location: internal/web/handlers_oauth.go:319-347
  • Description: processInvite performs up to 3 separate database operations (UpdateUserTenant, UpdateUser, UseInvite) without a transaction. If the process crashes between operations, the user could be assigned to the tenant but the invite not marked as used (allowing reuse), or the role update could succeed but the tenant assignment fail.
  • Impact: Invite reuse, inconsistent user-tenant-role state after partial failure.
  • Suggested fix: Wrap the three operations in a single database transaction.
  • Severity: Medium
  • Location: internal/web/store.go:405-416
  • Description: DeleteCampaign performs a soft-delete (SET deleted_at = now()) but does not clean up dependent resources: lore documents (mgmt.lore_documents), campaign-NPC links (mgmt.campaign_npcs), and NPC definitions tied to the campaign. These orphaned records consume storage and could resurface if a campaign ID is reused.
  • Impact: Orphaned data after campaign deletion. Not a data corruption issue but a data hygiene concern.
  • Suggested fix: Either cascade the soft-delete to dependent tables, or add a cleanup sweep. For hard references, use ON DELETE CASCADE in the FK definitions.

5. Configuration & Startup

5.1 CORS Defaults to Allow-All

  • Severity: High
  • Location: internal/web/middleware.go:91-123, internal/web/config.go:86-88
  • Description: When AllowedOrigins is empty (which is the default when GLYPHOXA_WEB_ALLOWED_ORIGINS is not set), CORSMiddleware treats it as allowAll = true and sets Access-Control-Allow-Origin: *. This means any website can make authenticated API calls to the Glyphoxa web service if a user’s browser has the JWT.
  • Impact: CSRF-like attacks where a malicious website can interact with the Glyphoxa API on behalf of an authenticated user. Note: the code correctly does NOT send Access-Control-Allow-Credentials: true in allow-all mode, which mitigates cookie-based attacks. But since Glyphoxa uses Bearer tokens (not cookies), this is still exploitable if the token is stored in a way accessible to JS.
  • Suggested fix: Require GLYPHOXA_WEB_ALLOWED_ORIGINS to be explicitly set in production. Add a startup warning when running with wildcard CORS.

5.2 ManagementService gRPC Unauthenticated by Default

  • Severity: Critical
  • Location: cmd/glyphoxa/main.go:558-567
  • Description: When GLYPHOXA_GRPC_MGMT_SECRET is not set, the ManagementService gRPC endpoint is completely unauthenticated. Combined with the gateway admin API (2.7), both management interfaces default to open.
  • Impact: Any network-reachable client can invoke management RPCs: create/delete tenants, start/stop sessions, query usage.
  • Suggested fix: Require the secret in production or refuse to start. At minimum, bind to localhost-only when unauthenticated.

5.3 No API Key Validation for Cloud Providers at Config Time

  • Severity: High
  • Location: internal/config/loader.go:82-89, cmd/glyphoxa/main.go:1100-1115
  • Description: Config validation checks provider names but never validates that api_key is non-empty for providers that require it. LLM provider factories accept empty API keys and create the provider without error. The error only surfaces on the first API call at runtime.
  • Impact: The server starts successfully and appears healthy, but fails on the first player interaction.
  • Suggested fix: Add API key presence validation at the Validate() level for known cloud providers.

5.4 HTTP Servers Lack Read/Write/Idle Timeouts

  • Severity: Medium
  • Location: cmd/glyphoxa/main.go:509-512 (admin), cmd/glyphoxa/main.go:1005-1008 (MCP), cmd/glyphoxa/main.go:1066-1069 (observe)
  • Description: Three http.Server instances are created without ReadTimeout, WriteTimeout, or IdleTimeout. Only the web service sets these.
  • Impact: Slowloris-style denial-of-service attacks can exhaust file descriptors and goroutines.
  • Suggested fix: Set ReadTimeout: 15s, WriteTimeout: 30s, IdleTimeout: 60s on all HTTP servers.

5.5 Database SSL Mode Defaults to “prefer” Instead of “require”

  • Severity: Medium
  • Location: cmd/glyphoxa/main.go:732-750
  • Description: applySSLMode() defaults to sslmode=prefer. A TLS-stripping attacker can force cleartext database connections.
  • Impact: Database credentials and all query data (bot tokens, transcripts) transmitted in cleartext.
  • Suggested fix: Default to sslmode=require in production. Log a warning when prefer is in effect.

5.6 Embedding Dimension Mismatch Silently Defaults to 1536

  • Severity: Medium
  • Location: internal/app/app.go:212-215, cmd/glyphoxa/worker_factory.go:94-97
  • Description: When memory.embedding_dimensions is 0, both code paths silently default to 1536. If the provider uses a different dimension, pgvector columns are created with the wrong size.
  • Impact: Semantic search returns garbage results due to dimension mismatch.
  • Suggested fix: Cross-validate embedding_dimensions against the configured provider’s known defaults.

5.7 No Minimum Length on AdminAPIKey

  • Severity: Medium
  • Location: internal/web/config.go:146-183
  • Description: JWTSecret requires 32+ characters but AdminAPIKey has no minimum length.
  • Impact: Weak admin keys that can be brute-forced.
  • Suggested fix: Add len(c.AdminAPIKey) < 16 validation.

5.8 Vault Token Encryption Enabled After Health Check Failure

  • Severity: Medium
  • Location: cmd/glyphoxa/main.go:469-476
  • Description: When Vault’s health check fails, the code still assigns tokenEncryptor = tc. Subsequent encrypt/decrypt calls will either fail or fall through to plaintext.
  • Impact: Bot tokens may be stored in plaintext or all tenant operations may fail.
  • Suggested fix: Only set tokenEncryptor = tc when Ping succeeds.

6. Error Propagation

6.1 RateLimiter Cleanup Goroutine Leaks

  • Severity: Low
  • Location: internal/web/ratelimit.go:36-37
  • Description: NewRateLimiter starts a background goroutine (go rl.cleanup()) that ticks every 5 minutes indefinitely. There is no Stop() method, no context, and no way to shut down this goroutine. Each NewRateLimiter call spawns a permanent goroutine.
  • Impact: Minor goroutine leak — in practice only 2-3 rate limiters are created at startup (read + write + voice preview). But if rate limiters are ever created dynamically (e.g., per-tenant), this would leak significantly.
  • Suggested fix: Add a context.Context parameter to NewRateLimiter or a Close() method that stops the ticker.

6.2 Discord Connection Operates Silently with nil Opus Encoder

  • Severity: High
  • Location: pkg/audio/discord/connection.go:84-88,199-201
  • Description: If newOpusEncoder() fails during newConnection(), the error is logged but the connection is returned with c.encoder = nil. In ProvideOpusFrame(), when encoder is nil, the code silently discards PCM data and returns nil, nil (silence). The NPC’s TTS audio is generated and transcribed, but players hear complete silence.
  • Impact: Players hear nothing from the NPC. The system appears fully functional from the server side (no errors after the initial log), making this extremely hard to diagnose. The session continues consuming LLM and TTS quota with no audible output.
  • Suggested fix: Return an error from newConnection() when the encoder cannot be created. At minimum, emit periodic warnings in ProvideOpusFrame rather than silently dropping all audio.

6.3 S2S Engine Audio Channel Shared Across Concurrent Process Calls

  • Severity: Medium
  • Location: internal/engine/s2s/engine.go:199-252,261-302
  • Description: Each Process call captures sessionAudioCh (the shared session audio channel) and spawns a forwardAudio goroutine that reads from it. If two Process calls are concurrent (docstring says this is allowed), both goroutines read from the same channel. Audio chunks are non-deterministically split between the two turn channels.
  • Impact: Rapid successive calls to Process produce garbled per-turn audio because chunks are stolen by the previous turn’s forwarder.
  • Suggested fix: Use a single long-lived forwarder goroutine that demuxes audio into the most recent turn’s channel, or serialize Process calls.

6.4 ElevenLabs STT readLoop Swallows Fatal Errors

  • Severity: Medium
  • Location: pkg/provider/stt/elevenlabs/elevenlabs.go:532-548
  • Description: When parseResponse encounters a fatal error like auth_error, quota_exceeded, or invalid_api_key, it logs and returns (zero, false). The readLoop continues calling conn.Read(). The session continues in a degraded state producing no transcripts but still sending audio (wasting bandwidth/quota).
  • Impact: Invalid API key or exceeded quota causes silent transcript loss. The NPC responds based on empty transcripts, leading to confused behavior.
  • Suggested fix: Close the done channel on fatal errors so readLoop exits and the failure propagates upstream.

6.5 No Circuit Breaker on GatewayClient

  • Severity: Medium
  • Location: internal/gateway/grpctransport/server.go:211-251
  • Description: The worker-facing Client wraps every call in a circuit breaker, but the gateway-facing GatewayClient (used by workers for heartbeats and state-report) has none. Gateway outage causes every heartbeat call to block until gRPC deadline.
  • Impact: Gateway outage causes worker goroutines to pile up waiting on gRPC timeouts, potentially exhausting worker resources.
  • Suggested fix: Add a circuit breaker to GatewayClient so heartbeat calls fail fast when the gateway is unreachable.

6.6 Dispatch Context Not Cancelled on Success Path

  • Severity: Low
  • Location: internal/gateway/dispatch/dispatcher.go:108
  • Description: Dispatch() creates ctx, cancel := context.WithTimeout(ctx, d.timeout) and stores the cancel in the session. On the success path, cancel is not called because the context must remain live. The timeout timer goroutine lingers for up to 120s.
  • Impact: Each successful dispatch leaks a timer goroutine for up to 120 seconds. Bounded and temporary.

7. Security

7.1 X-Forwarded-For Trusted Without Verification

  • Severity: High
  • Location: internal/web/ratelimit.go:150-171
  • Description: The clientIP() function trusts X-Forwarded-For and X-Real-IP headers unconditionally. Any client can set these headers to an arbitrary IP, completely bypassing the IP-based rate limiting for unauthenticated requests.
  • Impact: Rate limiting on unauthenticated endpoints (OAuth callbacks, API key login) is ineffective. An attacker can make unlimited requests by rotating the X-Forwarded-For value.
  • Suggested fix: Only trust X-Forwarded-For when behind a known reverse proxy. Add a TrustedProxies config option, and only strip the rightmost untrusted IP from the chain.

7.2 OAuth State Comparison Vulnerable to Timing Attack

  • Severity: Low
  • Location: internal/web/handlers_oauth.go:66, handlers_oauth.go:219
  • Description: OAuth state parameters are compared using != (standard string comparison) rather than subtle.ConstantTimeCompare. This is a timing side-channel, though the state is a random 16-byte hex string with a 5-minute TTL, making exploitation impractical.
  • Impact: Theoretical — the random state and short TTL make a timing attack infeasible in practice, but it violates security best practices.
  • Suggested fix: Use subtle.ConstantTimeCompare for state comparison, consistent with the gRPC shared secret comparison in mgmt_auth.go:80.

7.3 JWT Token Exposed in Redirect URL

  • Severity: High
  • Location: internal/web/handlers_oauth.go:158, handlers_oauth.go:311
  • Description: After OAuth2 callback, the JWT is placed directly in the redirect URL query parameter: "/auth/callback?token=" + url.QueryEscape(token). The token appears in browser history, server access logs, and potentially in Referer headers on subsequent navigation.
  • Impact: JWT token leakage via browser history, proxy logs, or HTTP Referer headers. Anyone with access to these sources can impersonate the user for 24 hours.
  • Suggested fix: Use a short-lived authorization code flow: redirect with a single-use code, then exchange it for the JWT via a POST request. Alternatively, set the token in a secure, httpOnly cookie during the redirect.

7.4 SSRF via Provider Test Endpoint

  • Severity: High
  • Location: internal/web/handlers_providers.go:77-126
  • Description: The handleTestProvider endpoint accepts a user-supplied base_url field and makes HTTP requests to it with API keys attached in headers. No URL validation or allowlisting. An attacker with tenant_admin role can point base_url to internal services (e.g., http://169.254.169.254/ for cloud metadata).
  • Impact: Server-Side Request Forgery (SSRF). Attacker can probe internal infrastructure, access cloud metadata endpoints (AWS/GCP/Azure instance credentials), or exfiltrate data. The server includes API keys/tokens in outbound request headers.
  • Suggested fix: Validate base_url against an allowlist of known provider domains. Block private/link-local IP ranges and localhost. Resolve hostname before request and reject private IPs.

7.5 Gateway Admin API Key Comparison Not Constant-Time

  • Severity: High
  • Location: internal/gateway/admin.go:169
  • Description: The API key comparison uses key != a.apiKey (standard string equality), vulnerable to timing attacks. Compare with handleAPIKeyLogin and mgmt_auth.go which correctly use subtle.ConstantTimeCompare.
  • Impact: An attacker with network access to the admin API can determine the correct key character-by-character by measuring response times.
  • Suggested fix: Use crypto/subtle.ConstantTimeCompare([]byte(key), []byte(a.apiKey)) != 1.

7.6 No JWT Revocation Mechanism

  • Severity: Medium
  • Location: internal/web/auth.go:32-34, internal/web/middleware.go:24-50
  • Description: JWTs are valid for 24 hours (Claims.Expires). There is no token revocation mechanism — no deny-list, no server-side session store, and no way to invalidate a token before expiry.
  • Impact: Compromised tokens cannot be revoked. If a user’s account is deleted or role changed, their JWT remains valid until expiry.
  • Suggested fix: Implement a lightweight token deny-list. Alternatively, shorten token lifetime to 15-30 minutes with a refresh token mechanism.

7.7 WebRTC Signaling Has No Authentication

  • Severity: Medium
  • Location: pkg/audio/webrtc/signaling.go:33-38
  • Description: WebRTC signaling endpoints (/rooms/{roomID}/join, /rooms/{roomID}/ice, /rooms/{roomID}/leave) have no authentication. Any client can join rooms, inject ICE candidates, or disconnect peers.
  • Impact: Unauthorized users can join voice rooms, impersonate others, inject audio, or disconnect legitimate users — a direct path to session hijacking.
  • Suggested fix: Add authentication (session token/API key). Verify user_id matches the authenticated caller.

Summary Table

# Finding Severity Area
1.1 TOCTOU race in session start High Session Lifecycle
1.2 Zombie sessions with NULL heartbeat Critical Session Lifecycle
1.3 No state transition validation Medium Session Lifecycle
1.4 Zombie cleanup doesn’t update in-memory state High Session Lifecycle
1.5 Concurrent Stop from multiple triggers Medium Session Lifecycle
1.6 Full-mode session not stopped on shutdown Medium Session Lifecycle
1.7 StopAll doesn’t report ended state Medium Session Lifecycle
1.8 Reconnector silent failure after max retries Medium Session Lifecycle
1.9 Disconnect listener leak on failure Low Session Lifecycle
2.1 StopSession missing tenant auth Critical Multi-Tenancy
2.2 GetUser not scoped by tenant Medium Multi-Tenancy
2.3 Lore/Knowledge no tenant guard in store Medium Multi-Tenancy
2.4 Knowledge entity queries missing campaign_id High Multi-Tenancy
2.5 NPC store has no tenant isolation High Multi-Tenancy
2.6 UpdateUser not scoped by tenant Medium Multi-Tenancy
2.7 Gateway admin API open when key empty Medium Multi-Tenancy
2.8 Worker gRPC calls not tenant-scoped Medium Multi-Tenancy
3.1 No echo cancellation — NPC self-talk loop Critical Voice Pipeline
3.2 Cascade engine goroutine leak Medium Voice Pipeline
3.3 Audio frame gaps on stream reconnect Medium Voice Pipeline
3.4 Flush/sendLoop data race in gRPC bridge High Voice Pipeline
3.5 STT session leak on rapid VAD cycles High Voice Pipeline
3.6 TTS/LLM fallback no mid-stream recovery Medium Voice Pipeline
3.7 voiceBridgeReceiver frameCount data race Medium Voice Pipeline
3.8 Consolidator drops bracketed messages Low Voice Pipeline
4.1 Non-atomic invite acceptance Medium Data Consistency
4.2 Campaign deletion no cascade Medium Data Consistency
5.1 CORS defaults to allow-all High Config & Startup
5.2 ManagementService gRPC unauthenticated by default Critical Config & Startup
5.3 No API key validation for cloud providers High Config & Startup
5.4 HTTP servers lack timeouts (Slowloris) Medium Config & Startup
5.5 DB SSL defaults to “prefer” not “require” Medium Config & Startup
5.6 Embedding dimension mismatch defaults to 1536 Medium Config & Startup
5.7 No min length on AdminAPIKey Medium Config & Startup
5.8 Vault encryption enabled after health check fail Medium Config & Startup
6.1 RateLimiter goroutine leak Low Error Propagation
6.2 Nil opus encoder silent failure High Error Propagation
6.3 S2S audio channel shared across concurrent calls Medium Error Propagation
6.4 ElevenLabs STT swallows fatal errors Medium Error Propagation
6.5 No circuit breaker on GatewayClient Medium Error Propagation
6.6 Dispatch context timer leak on success Low Error Propagation
7.1 X-Forwarded-For trusted blindly High Security
7.2 OAuth state timing attack Low Security
7.3 JWT in redirect URL High Security
7.4 SSRF via provider test base_url High Security
7.5 Admin API key comparison not constant-time High Security
7.6 No JWT revocation Medium Security
7.7 WebRTC signaling no authentication Medium Security

Critical (4): 1.2, 2.1, 3.1, 5.2 — require immediate attention High (14): 1.1, 1.4, 2.4, 2.5, 3.4, 3.5, 5.1, 5.3, 6.2, 7.1, 7.3, 7.4, 7.5 — address before production Medium (28): 1.3, 1.5, 1.6, 1.7, 1.8, 2.2, 2.3, 2.6, 2.7, 2.8, 3.2, 3.3, 3.6, 3.7, 4.1, 4.2, 5.4, 5.5, 5.6, 5.7, 5.8, 6.3, 6.4, 6.5, 7.6, 7.7 — next sprint Low (5): 1.9, 3.8, 6.1, 6.6, 7.2 — address opportunistically


Positive Observations

  • Tenant schema isolation for memory/knowledge data (per-tenant PostgreSQL schemas) is well-implemented with proper pgx.Identifier.Sanitize() usage preventing SQL injection.
  • Vault Transit encryption for bot tokens at rest is a strong security measure.
  • Circuit breaker implementation is clean and correct with proper half-open probe logic.
  • Session orchestrator design with CleanupZombies and CleanupStalePending shows good awareness of failure modes.
  • gRPC management auth uses subtle.ConstantTimeCompare correctly.
  • MaxBytesMiddleware prevents request body memory exhaustion.
  • Bot token validation uses proper HMAC comparison.
  • Compile-time interface assertions (var _ Interface = (*Impl)(nil)) are consistently used throughout the codebase.

This site uses Just the Docs, a documentation theme for Jekyll.