Auditor: Claude (Architecture Audit Agent) Scope: Full architectural audit across session lifecycle, multi-tenancy, voice pipeline, data consistency, configuration, error propagation, and security. Codebase: Glyphoxa main branch at commit 2fc85e5
Executive Summary
This audit reviewed ~170 non-test Go source files across the Glyphoxa codebase. 51 findings were identified: 4 Critical, 14 High, 28 Medium, and 5 Low severity. The most urgent issues are a missing tenant authorization check on session stop (allowing cross-tenant session termination), zombie sessions that permanently leak when they reach “active” state but never heartbeat, knowledge entity queries missing campaign_id filters, and NPC store lacking tenant-level isolation.
Findings
1. Session Lifecycle
1.1 TOCTOU Race in GatewaySessionController.Start
- Severity: High
- Location:
internal/gateway/sessionctrl.go:152-158 - Description:
Start()acquiresgc.muto checkgc.active[req.GuildID], then releases the lock before callinggc.orch.ValidateAndCreate(). A concurrentStart()call for the same guild can pass the check in the window between the unlock and the orchestrator’s atomic create, resulting in two sessions for the same guild. - Impact: Two workers dispatched for the same Discord voice channel, audio corruption, doubled resource usage.
- Suggested fix: Hold the lock through
ValidateAndCreate, or move the active-check into the orchestrator’s atomicINSERT(e.g., aUNIQUEconstraint on(guild_id) WHERE state != 'ended').
1.2 Zombie Sessions With NULL Heartbeat in Active State
- Severity: Critical
- Location:
internal/gateway/sessionorch/postgres.go:145-162 - Description:
CleanupZombiesfilters onlast_heartbeat IS NOT NULL AND last_heartbeat < threshold. Sessions that transition toactivestate but never receive a heartbeat (e.g., worker dies immediately afterReportStatebut before the first heartbeat tick) havelast_heartbeat IS NULLandstate = 'active'. These are invisible to bothCleanupZombiesandCleanupStalePending(which only targetsstate = 'pending'). - Impact: Permanently stuck “active” sessions that consume license quota slots forever, preventing the tenant from starting new sessions.
- Suggested fix: Add a third cleanup sweep: transition sessions with
state = 'active' AND last_heartbeat IS NULL AND started_at < now() - interval '5 minutes'to ended. Alternatively, extendCleanupZombiesto also catchlast_heartbeat IS NULL AND state != 'pending' AND started_at < threshold.
1.3 No State Transition Validation
- Severity: Medium
- Location:
internal/gateway/sessionorch/postgres.go:67-90 - Description:
Transition()accepts any state transition without validation. There are no guards against invalid transitions likeended → activeoractive → pending. TheUPDATEis unconditional. - Impact: If a bug or race condition causes an out-of-order transition, the session can end up in an inconsistent state. For example, a late
ReportState(active)arriving afterStop()has already setendedwould resurrect a dead session. - Suggested fix: Add a
WHERE state != 'ended'guard to prevent re-opening ended sessions, and/or add a valid-transitions map (pending→active, pending→ended, active→ended).
1.4 Zombie Cleanup Doesn’t Update Gateway In-Memory State
- Severity: High
- Location:
cmd/glyphoxa/main.go:625-662,internal/gateway/sessionctrl.go:42-62 - Description: The zombie cleanup loop correctly transitions stale sessions to
endedin the DB and cleans up orphaned K8s Jobs, but does NOT update theGatewaySessionController’s in-memoryactivemap orworkerAddrsmap. AfterCleanupZombiesruns,gc.active[guildID]still points to the now-ended session.IsActive(guildID)returnstrue, and/session startis rejected with “session already active.” - Impact: After a worker dies and zombie cleanup fires, the guild is permanently locked out of starting new sessions until the gateway is restarted.
- Suggested fix: The zombie/orphan cleanup should notify the relevant
GatewaySessionControllerto remove stale sessions from its in-memory maps. Alternatively,IsActivecould cross-check with the orchestrator’s DB state.
1.5 Concurrent Stop Calls from Multiple Triggers
- Severity: Medium
- Location:
internal/gateway/sessionctrl.go:310-357 - Description:
Stopcan be called concurrently from three sources: (1) user/session stop, (2)onStreamDetachcallback when audio bridge detects worker death, and (3)registerDisconnectListenerwhen bot is kicked from voice. None guard against concurrent execution. Thevoice.Conndouble-close could cause a panic in the disgo library. - Impact: Log noise from failed RPCs and “not found” K8s errors. Potential panic from double-closing the voice connection.
- Suggested fix: Add a
sync.Onceor “stopping” sentinel per session to prevent concurrent Stop execution.
1.6 Full-Mode Session Not Stopped on Process Shutdown
- Severity: Medium
- Location:
cmd/glyphoxa/main.go:264-285 - Description: In
runFull, the graceful shutdown path callsapplication.Shutdown()andbot.Close()but does NOT callsessionMgr.Stop(). The active session’s consolidator final consolidation is skipped (conversation history lost), the voice connection is not cleanly disconnected, and transcript recorders may not drain properly. - Impact: On SIGTERM during an active session: conversation history lost, Discord shows bot still in voice channel briefly, final transcript entries may be lost.
- Suggested fix: Explicitly call
sessionMgr.Stop(shutdownCtx)in therunFullshutdown path beforeapplication.Shutdown().
1.7 WorkerHandler.StopAll Doesn’t Report Ended State
- Severity: Medium
- Location:
internal/session/worker_handler.go:289-302 - Description:
StopAll(worker graceful shutdown) empties the sessions map and stops runtimes, but does NOT callh.callback.ReportState(sessionID, SessionEnded, ...). Compare withStopSessionwhich does report ended state. The gateway never learns these sessions have ended. - Impact: After worker pod shutdown, the gateway still considers those sessions active for up to ~60s (until heartbeat timeout + zombie cleanup). During this window, the guild is locked out and license slots are consumed by dead sessions.
- Suggested fix: Have
StopAllreport ended state for each session via the callback, matchingStopSessionbehavior.
1.8 Reconnector Does Not Signal Session Failure After Max Retries
- Severity: Medium
- Location:
internal/session/reconnect.go:230-236 - Description: When
attemptReconnectexhausts all retries, it logs an error and returns. There is no callback, no session termination, and no state transition. The session remains “active” in the orchestrator but the audio connection is dead. Heartbeats continue flowing, the pipeline continues running on silence. - Impact: A persistent network disconnection leaves a zombie session that consumes resources, counts against license limits, and is invisible to the user.
- Suggested fix: Add an
OnFailurecallback toReconnectorConfigthat triggersStopand cleanup after all retries are exhausted.
1.9 Disconnect Listener Leak on Failed Voice Setup
- Severity: Low
- Location:
internal/gateway/sessionctrl.go:430-465 - Description:
registerDisconnectListeneradds aGuildVoiceStateUpdateevent listener. If the voice bridge setup fails after the listener is registered but before the cleanup function is stored, the listener leaks. - Impact: Minor memory leak of event listeners over many failed session starts.
- Suggested fix: Register the listener only after the voice bridge is fully set up.
2. Multi-Tenancy Isolation
2.1 StopSession Missing Tenant Authorization
- Severity: Critical
- Location:
internal/web/handlers_sessions.go:121-140 - Description:
handleStopSessionsends aStopWebSessionRequestto the gateway with only thesessionID— the tenant ID from the JWT claims is not included or verified. The gateway’sStopWebSessionRPC (ingrpctransport/management.go) callsStopSession(ctx, sessionID)without checking that the session belongs to the requesting tenant. Any authenticated user from any tenant can stop any session by knowing (or guessing) its UUID. - Impact: Cross-tenant session termination. A malicious tenant_admin could disrupt another tenant’s live game sessions.
- Suggested fix: Include
tenant_idin theStopWebSessionRequestprotobuf. The gateway must verifysession.TenantID == request.TenantIDbefore stopping.
2.2 GetUser Store Method Not Scoped by Tenant
- Severity: Medium
- Location:
internal/web/store.go:284-302 - Description:
GetUser()queriesWHERE id = $1without atenant_idfilter. While the handler (handleGetUserathandlers_users.go:51-57) checksuser.TenantID != claims.TenantIDafter fetching, this is a defense-in-depth gap. If any other code path callsGetUserwithout the post-check, cross-tenant data leaks. - Impact: Potential information disclosure if a new handler or API consumer calls
GetUserwithout the tenant check. - Suggested fix: Add
tenant_idas a parameter toGetUserand include it in the SQLWHEREclause.
2.3 Lore/Knowledge Queries Scoped by Campaign but Not Tenant
- Severity: Medium
- Location:
internal/web/store.go:583-604(ListLoreDocuments),store.go:547-562(CreateLoreDocument) - Description: Lore document CRUD is scoped by
campaign_idbut nottenant_id. The handlers verify the campaign belongs to the tenant first, but the store layer itself has no tenant guard. If a campaign ID collision occurred or a handler missed the ownership check, lore from one tenant could leak to another. - Impact: Defense-in-depth gap — currently mitigated by handler-level campaign ownership checks, but fragile against future handler additions.
- Suggested fix: Add
tenant_idto thelore_documentstable or join againstcampaignswith atenant_idfilter in the store queries.
2.4 Knowledge Entity Queries Missing campaign_id Filter
- Severity: High
- Location:
internal/web/store.go:697-751(ListKnowledgeEntities),internal/web/store.go:753-771(DeleteKnowledgeEntity) - Description: Both
ListKnowledgeEntitiesandDeleteKnowledgeEntityquery the tenant-specific schema (tenant_<id>.entities) but do NOT filter bycampaign_idin the SQL WHERE clause. ThecampaignIDparameter is accepted but never included in the query. - Impact: Within the same tenant, users with access to Campaign A can see and delete knowledge entities from Campaign B. This violates the campaign-scoped access model.
- Suggested fix: Add
AND campaign_id = $<N>to both queries, using thecampaignIDparameter that is already passed in.
2.5 NPC Store Has No Tenant-Level Isolation
- Severity: High
- Location:
internal/agent/npcstore/postgres.go:132-159(Get),internal/agent/npcstore/postgres.go:219-226(Delete) - Description: The
npc_definitionstable has notenant_idcolumn.Getretrieves by NPC ID alone (WHERE id = $1).Deletealso uses onlyWHERE id = $1. The web handlers verify the NPC’s campaign belongs to the current tenant viarequireCampaign, but the store has no tenant guard. - Impact: Defense-in-depth violation. Any future code path calling
npcs.Get()ornpcs.Delete()without first verifying campaign ownership will have cross-tenant access. - Suggested fix: Add a
tenant_idcolumn tonpc_definitionsand include it in all queries, or move NPC definitions to tenant-specific schemas.
2.6 UpdateUser Store Method Not Scoped by Tenant
- Severity: Medium
- Location:
internal/web/store.go:826-856 - Description:
UpdateUserusesWHERE id = $1 AND deleted_at IS NULLwith notenant_idfilter. Compare toDeleteUserwhich properly includesAND tenant_id = $2. The handler does a separate tenant check, but the store method can update any user in any tenant. - Impact: Defense-in-depth violation — if any future caller invokes
UpdateUserwithout the handler-level tenant check, cross-tenant user modification is possible. - Suggested fix: Add
AND tenant_id = $<N>to theUpdateUserWHERE clause, consistent withDeleteUser.
2.7 Gateway Admin API Open When API Key Empty
- Severity: Medium
- Location:
internal/gateway/admin.go:148-175 - Description: When no API key is configured (
a.apiKey == ""), the gateway admin API auth middleware allows all requests without authentication. This is noted as “backward compat” but is dangerous in production. - Impact: If deployed without an API key, anyone with network access can create, modify, or delete any tenant, including injecting bot tokens.
- Suggested fix: Refuse to start the admin API without an API key in production, or default to denying all requests.
2.8 Worker gRPC Calls Not Tenant-Scoped
- Severity: Medium
- Location:
internal/gateway/grpctransport/client.go:94-210,internal/gateway/grpctransport/server.go:88-169 - Description: Worker gRPC calls for
StopSession,ListNPCs,MuteNPC,UnmuteNPC,SpeakNPCare keyed bysessionIDonly — notenantIDis passed or verified. The gRPC auth only protects ManagementService RPCs; SessionWorkerService RPCs are unguarded. - Impact: If the gateway is compromised or multiple gateways share a worker pool, a session from Tenant A could be manipulated through worker RPCs referencing Tenant B’s session ID.
- Suggested fix: Include
tenant_idin worker RPC requests and have the worker verify it matches the session’s tenant.
2.9 ValidRoles Excludes super_admin (Correctly)
- Severity: Low (Positive Finding)
- Location:
internal/web/store.go:58-62 - Description:
ValidRolesmap deliberately excludessuper_admin, preventing tenant_admin users from escalating roles tosuper_adminvia thehandleUpdateUserendpoint. This is correctly implemented. - Impact: N/A — this is working as intended and prevents privilege escalation.
3. Voice Pipeline
3.1 No Echo Cancellation / Self-Hearing Guard
- Severity: Critical
- Location:
pkg/audio/discord/connection.go:99-160,internal/app/audio_pipeline.go:86-94 - Description: The Discord connection’s
ReceiveOpusFramecreates an input stream for every user ID that sends audio, including the bot’s own user ID. There is no filtering of the bot’s own user ID anywhere in the pipeline.audioPipeline.Start()iteratesconn.InputStreams()and starts a VAD/STT worker for every participant, including the bot itself. Similarly,handleParticipantChangestarts a worker for everyEventJoin, with no bot-ID exclusion. The gRPC bridge has the same gap. - Impact: The NPC’s synthesized speech is decoded, fed into VAD, transcribed by STT, routed to an NPC agent, and generates a response — creating a feedback loop where the NPC talks to itself indefinitely. While Discord usually doesn’t route bot audio back, this is not guaranteed (especially with audio bridges in distributed mode).
- Suggested fix: Store the bot’s own user ID in the
Connectionstruct and skip it inReceiveOpusFrame. Alternatively, filter inaudioPipeline.startWorker()by checking the participant ID against a known bot ID.
3.2 Cascade Engine Background Goroutine Leak on Fast Close
- Severity: Medium
- Location:
internal/engine/cascade/cascade.go:440-454 - Description:
Engine.Close()setse.closed = true, closese.done, then callse.wg.Wait(). If aProcess()call’s background goroutine is inforwardStrongModelTrackedsending totextCh, and TTS has already returned (channel full or closed), the goroutine may block ontextCh <- sentenceindefinitely. Thee.donechannel is selected inwaitForDonebut not in thetextChsend paths. - Impact: Goroutine leak if Close() is called while a Process() background goroutine is blocked sending text to TTS.
- Suggested fix: Add
case <-e.done: returnto alltextChsend select blocks, or use a context derived from the engine’s done channel.
3.3 Audio Frame Ordering in gRPC Bridge
- Severity: Medium
- Location:
pkg/audio/grpcbridge/connection.go - Description: The gRPC audio bridge uses a bidirectional stream where frames are sent/received sequentially. gRPC guarantees in-order delivery on a single stream, so frame ordering is preserved in the normal case. However, if the stream reconnects (e.g., after a transient network error), frames sent during the reconnection window are lost with no indication to the pipeline.
- Impact: Brief audio gaps during gRPC stream reconnection, which could cause STT to produce garbled transcriptions for that segment.
- Suggested fix: Add sequence numbers to audio frames and detect/log gaps on the receiving end. Consider buffering a small window for reorder tolerance.
3.4 Data Race Between Flush() and sendLoop in gRPC Bridge
- Severity: High
- Location:
pkg/audio/grpcbridge/connection.go:194-198,pkg/audio/grpcbridge/connection.go:313-324 - Description:
Flush()drains the output channel concurrently whilesendLoop()also reads from it. Both goroutines race on<-c.output, causing frames to be non-deterministically consumed by either. Frames consumed byFlush()are silently discarded, whilesendLoopmay have already buffered a partial opus frame from a previous read. - Impact: During barge-in, the race produces audio glitches or corruption on the gateway side. A frame read by Flush is lost; sendLoop may encode against stale partial buffer data.
- Suggested fix: Remove the local output drain from
Flush()and letsendLoopdo both the channel drain and buffer reset atomically via the existingflushChsignal.
3.5 STT Session Leak on Rapid VAD Speech Start/End Cycles
- Severity: High
- Location:
internal/app/audio_pipeline.go:203-238 - Description: When
VADSpeechStartfires, the code opens a new STT session and launches acollectAndRoutegoroutine. If a secondVADSpeechStartfires before a correspondingVADSpeechEnd(e.g., VAD glitch), thesttSessionvariable is overwritten without closing the previous session. The old session’s WebSocket/HTTP connection is leaked. - Impact: Under rapid false-positive VAD triggers, this could leak many STT sessions (network connections, server-side state), causing resource exhaustion.
- Suggested fix: Before opening a new STT session on
VADSpeechStart, close the existing one if non-nil.
3.6 TTS/LLM Fallback Only Covers Stream Setup, Not Mid-Stream Failures
- Severity: Medium
- Location:
internal/resilience/tts_fallback.go:33-37,internal/resilience/llm_fallback.go:43 - Description: The fallback wrappers only apply failover to the initial
SynthesizeStream/StreamCompletioncall. Mid-stream errors (WebSocket disconnect, API timeout) close the audio channel prematurely. The text that was already consumed from the input channel cannot be re-sent to a fallback provider because channels are one-read. The circuit breaker won’t record this as a failure since the initial call succeeded. - Impact: If ElevenLabs drops the WebSocket mid-synthesis, the NPC’s response is truncated with no automatic recovery to a fallback provider.
- Suggested fix: Buffer the text channel’s contents so they can be replayed to a fallback provider on mid-stream failure. Have the cascade engine detect stream errors and retry with a different provider.
3.7 voiceBridgeReceiver.frameCount Data Race
- Severity: Medium
- Location:
internal/gateway/voicebridge.go:33,45-46,156 - Description:
frameCount(typeuint64) is incremented from the disgo voice receiver goroutine but read from the cleanup goroutine with no synchronization. This is a data race detectable by-race. - Impact: Go race detector would flag this. Violates the project’s “race detector always on” convention.
- Suggested fix: Change
frameCounttoatomic.Uint64.
3.8 Consolidator Summary Skip Heuristic Drops Legitimate Messages
- Severity: Low
- Location:
internal/session/consolidator.go:158 - Description: The consolidator skips “synthetic summary messages” by checking
m.Content[0] == '['. Any legitimate user message starting with[(e.g., “[OOC] hey guys”, “[laughs nervously]”, “[attacks the goblin]”) is incorrectly skipped and permanently lost from the session store. - Impact: Transcript entries starting with
[are silently lost. Common tabletop RPG conventions include bracketed OOC messages, action descriptions, and emotes — all would be dropped. - Suggested fix: Use a more specific prefix check, e.g.,
strings.HasPrefix(m.Content, "[Previous conversation summary]").
4. Data Consistency
4.1 Non-Atomic Invite Acceptance
- Severity: Medium
- Location:
internal/web/handlers_oauth.go:319-347 - Description:
processInviteperforms up to 3 separate database operations (UpdateUserTenant,UpdateUser,UseInvite) without a transaction. If the process crashes between operations, the user could be assigned to the tenant but the invite not marked as used (allowing reuse), or the role update could succeed but the tenant assignment fail. - Impact: Invite reuse, inconsistent user-tenant-role state after partial failure.
- Suggested fix: Wrap the three operations in a single database transaction.
4.2 Campaign Deletion Doesn’t Cascade to Lore/NPC Links
- Severity: Medium
- Location:
internal/web/store.go:405-416 - Description:
DeleteCampaignperforms a soft-delete (SET deleted_at = now()) but does not clean up dependent resources: lore documents (mgmt.lore_documents), campaign-NPC links (mgmt.campaign_npcs), and NPC definitions tied to the campaign. These orphaned records consume storage and could resurface if a campaign ID is reused. - Impact: Orphaned data after campaign deletion. Not a data corruption issue but a data hygiene concern.
- Suggested fix: Either cascade the soft-delete to dependent tables, or add a cleanup sweep. For hard references, use ON DELETE CASCADE in the FK definitions.
5. Configuration & Startup
5.1 CORS Defaults to Allow-All
- Severity: High
- Location:
internal/web/middleware.go:91-123,internal/web/config.go:86-88 - Description: When
AllowedOriginsis empty (which is the default whenGLYPHOXA_WEB_ALLOWED_ORIGINSis not set),CORSMiddlewaretreats it asallowAll = trueand setsAccess-Control-Allow-Origin: *. This means any website can make authenticated API calls to the Glyphoxa web service if a user’s browser has the JWT. - Impact: CSRF-like attacks where a malicious website can interact with the Glyphoxa API on behalf of an authenticated user. Note: the code correctly does NOT send
Access-Control-Allow-Credentials: truein allow-all mode, which mitigates cookie-based attacks. But since Glyphoxa uses Bearer tokens (not cookies), this is still exploitable if the token is stored in a way accessible to JS. - Suggested fix: Require
GLYPHOXA_WEB_ALLOWED_ORIGINSto be explicitly set in production. Add a startup warning when running with wildcard CORS.
5.2 ManagementService gRPC Unauthenticated by Default
- Severity: Critical
- Location:
cmd/glyphoxa/main.go:558-567 - Description: When
GLYPHOXA_GRPC_MGMT_SECRETis not set, the ManagementService gRPC endpoint is completely unauthenticated. Combined with the gateway admin API (2.7), both management interfaces default to open. - Impact: Any network-reachable client can invoke management RPCs: create/delete tenants, start/stop sessions, query usage.
- Suggested fix: Require the secret in production or refuse to start. At minimum, bind to localhost-only when unauthenticated.
5.3 No API Key Validation for Cloud Providers at Config Time
- Severity: High
- Location:
internal/config/loader.go:82-89,cmd/glyphoxa/main.go:1100-1115 - Description: Config validation checks provider names but never validates that
api_keyis non-empty for providers that require it. LLM provider factories accept empty API keys and create the provider without error. The error only surfaces on the first API call at runtime. - Impact: The server starts successfully and appears healthy, but fails on the first player interaction.
- Suggested fix: Add API key presence validation at the
Validate()level for known cloud providers.
5.4 HTTP Servers Lack Read/Write/Idle Timeouts
- Severity: Medium
- Location:
cmd/glyphoxa/main.go:509-512(admin),cmd/glyphoxa/main.go:1005-1008(MCP),cmd/glyphoxa/main.go:1066-1069(observe) - Description: Three
http.Serverinstances are created withoutReadTimeout,WriteTimeout, orIdleTimeout. Only the web service sets these. - Impact: Slowloris-style denial-of-service attacks can exhaust file descriptors and goroutines.
- Suggested fix: Set
ReadTimeout: 15s,WriteTimeout: 30s,IdleTimeout: 60son all HTTP servers.
5.5 Database SSL Mode Defaults to “prefer” Instead of “require”
- Severity: Medium
- Location:
cmd/glyphoxa/main.go:732-750 - Description:
applySSLMode()defaults tosslmode=prefer. A TLS-stripping attacker can force cleartext database connections. - Impact: Database credentials and all query data (bot tokens, transcripts) transmitted in cleartext.
- Suggested fix: Default to
sslmode=requirein production. Log a warning whenpreferis in effect.
5.6 Embedding Dimension Mismatch Silently Defaults to 1536
- Severity: Medium
- Location:
internal/app/app.go:212-215,cmd/glyphoxa/worker_factory.go:94-97 - Description: When
memory.embedding_dimensionsis 0, both code paths silently default to 1536. If the provider uses a different dimension, pgvector columns are created with the wrong size. - Impact: Semantic search returns garbage results due to dimension mismatch.
- Suggested fix: Cross-validate
embedding_dimensionsagainst the configured provider’s known defaults.
5.7 No Minimum Length on AdminAPIKey
- Severity: Medium
- Location:
internal/web/config.go:146-183 - Description:
JWTSecretrequires 32+ characters butAdminAPIKeyhas no minimum length. - Impact: Weak admin keys that can be brute-forced.
- Suggested fix: Add
len(c.AdminAPIKey) < 16validation.
5.8 Vault Token Encryption Enabled After Health Check Failure
- Severity: Medium
- Location:
cmd/glyphoxa/main.go:469-476 - Description: When Vault’s health check fails, the code still assigns
tokenEncryptor = tc. Subsequent encrypt/decrypt calls will either fail or fall through to plaintext. - Impact: Bot tokens may be stored in plaintext or all tenant operations may fail.
- Suggested fix: Only set
tokenEncryptor = tcwhen Ping succeeds.
6. Error Propagation
6.1 RateLimiter Cleanup Goroutine Leaks
- Severity: Low
- Location:
internal/web/ratelimit.go:36-37 - Description:
NewRateLimiterstarts a background goroutine (go rl.cleanup()) that ticks every 5 minutes indefinitely. There is noStop()method, no context, and no way to shut down this goroutine. EachNewRateLimitercall spawns a permanent goroutine. - Impact: Minor goroutine leak — in practice only 2-3 rate limiters are created at startup (read + write + voice preview). But if rate limiters are ever created dynamically (e.g., per-tenant), this would leak significantly.
- Suggested fix: Add a
context.Contextparameter toNewRateLimiteror aClose()method that stops the ticker.
6.2 Discord Connection Operates Silently with nil Opus Encoder
- Severity: High
- Location:
pkg/audio/discord/connection.go:84-88,199-201 - Description: If
newOpusEncoder()fails duringnewConnection(), the error is logged but the connection is returned withc.encoder = nil. InProvideOpusFrame(), whenencoderis nil, the code silently discards PCM data and returnsnil, nil(silence). The NPC’s TTS audio is generated and transcribed, but players hear complete silence. - Impact: Players hear nothing from the NPC. The system appears fully functional from the server side (no errors after the initial log), making this extremely hard to diagnose. The session continues consuming LLM and TTS quota with no audible output.
- Suggested fix: Return an error from
newConnection()when the encoder cannot be created. At minimum, emit periodic warnings inProvideOpusFramerather than silently dropping all audio.
6.3 S2S Engine Audio Channel Shared Across Concurrent Process Calls
- Severity: Medium
- Location:
internal/engine/s2s/engine.go:199-252,261-302 - Description: Each
Processcall capturessessionAudioCh(the shared session audio channel) and spawns aforwardAudiogoroutine that reads from it. If twoProcesscalls are concurrent (docstring says this is allowed), both goroutines read from the same channel. Audio chunks are non-deterministically split between the two turn channels. - Impact: Rapid successive calls to
Processproduce garbled per-turn audio because chunks are stolen by the previous turn’s forwarder. - Suggested fix: Use a single long-lived forwarder goroutine that demuxes audio into the most recent turn’s channel, or serialize
Processcalls.
6.4 ElevenLabs STT readLoop Swallows Fatal Errors
- Severity: Medium
- Location:
pkg/provider/stt/elevenlabs/elevenlabs.go:532-548 - Description: When
parseResponseencounters a fatal error likeauth_error,quota_exceeded, orinvalid_api_key, it logs and returns(zero, false). ThereadLoopcontinues callingconn.Read(). The session continues in a degraded state producing no transcripts but still sending audio (wasting bandwidth/quota). - Impact: Invalid API key or exceeded quota causes silent transcript loss. The NPC responds based on empty transcripts, leading to confused behavior.
- Suggested fix: Close the
donechannel on fatal errors soreadLoopexits and the failure propagates upstream.
6.5 No Circuit Breaker on GatewayClient
- Severity: Medium
- Location:
internal/gateway/grpctransport/server.go:211-251 - Description: The worker-facing
Clientwraps every call in a circuit breaker, but the gateway-facingGatewayClient(used by workers for heartbeats and state-report) has none. Gateway outage causes every heartbeat call to block until gRPC deadline. - Impact: Gateway outage causes worker goroutines to pile up waiting on gRPC timeouts, potentially exhausting worker resources.
- Suggested fix: Add a circuit breaker to
GatewayClientso heartbeat calls fail fast when the gateway is unreachable.
6.6 Dispatch Context Not Cancelled on Success Path
- Severity: Low
- Location:
internal/gateway/dispatch/dispatcher.go:108 - Description:
Dispatch()createsctx, cancel := context.WithTimeout(ctx, d.timeout)and stores the cancel in the session. On the success path, cancel is not called because the context must remain live. The timeout timer goroutine lingers for up to 120s. - Impact: Each successful dispatch leaks a timer goroutine for up to 120 seconds. Bounded and temporary.
7. Security
7.1 X-Forwarded-For Trusted Without Verification
- Severity: High
- Location:
internal/web/ratelimit.go:150-171 - Description: The
clientIP()function trustsX-Forwarded-ForandX-Real-IPheaders unconditionally. Any client can set these headers to an arbitrary IP, completely bypassing the IP-based rate limiting for unauthenticated requests. - Impact: Rate limiting on unauthenticated endpoints (OAuth callbacks, API key login) is ineffective. An attacker can make unlimited requests by rotating the
X-Forwarded-Forvalue. - Suggested fix: Only trust
X-Forwarded-Forwhen behind a known reverse proxy. Add aTrustedProxiesconfig option, and only strip the rightmost untrusted IP from the chain.
7.2 OAuth State Comparison Vulnerable to Timing Attack
- Severity: Low
- Location:
internal/web/handlers_oauth.go:66,handlers_oauth.go:219 - Description: OAuth state parameters are compared using
!=(standard string comparison) rather thansubtle.ConstantTimeCompare. This is a timing side-channel, though the state is a random 16-byte hex string with a 5-minute TTL, making exploitation impractical. - Impact: Theoretical — the random state and short TTL make a timing attack infeasible in practice, but it violates security best practices.
- Suggested fix: Use
subtle.ConstantTimeComparefor state comparison, consistent with the gRPC shared secret comparison inmgmt_auth.go:80.
7.3 JWT Token Exposed in Redirect URL
- Severity: High
- Location:
internal/web/handlers_oauth.go:158,handlers_oauth.go:311 - Description: After OAuth2 callback, the JWT is placed directly in the redirect URL query parameter:
"/auth/callback?token=" + url.QueryEscape(token). The token appears in browser history, server access logs, and potentially inRefererheaders on subsequent navigation. - Impact: JWT token leakage via browser history, proxy logs, or HTTP Referer headers. Anyone with access to these sources can impersonate the user for 24 hours.
- Suggested fix: Use a short-lived authorization code flow: redirect with a single-use code, then exchange it for the JWT via a POST request. Alternatively, set the token in a secure, httpOnly cookie during the redirect.
7.4 SSRF via Provider Test Endpoint
- Severity: High
- Location:
internal/web/handlers_providers.go:77-126 - Description: The
handleTestProviderendpoint accepts a user-suppliedbase_urlfield and makes HTTP requests to it with API keys attached in headers. No URL validation or allowlisting. An attacker withtenant_adminrole can pointbase_urlto internal services (e.g.,http://169.254.169.254/for cloud metadata). - Impact: Server-Side Request Forgery (SSRF). Attacker can probe internal infrastructure, access cloud metadata endpoints (AWS/GCP/Azure instance credentials), or exfiltrate data. The server includes API keys/tokens in outbound request headers.
- Suggested fix: Validate
base_urlagainst an allowlist of known provider domains. Block private/link-local IP ranges andlocalhost. Resolve hostname before request and reject private IPs.
7.5 Gateway Admin API Key Comparison Not Constant-Time
- Severity: High
- Location:
internal/gateway/admin.go:169 - Description: The API key comparison uses
key != a.apiKey(standard string equality), vulnerable to timing attacks. Compare withhandleAPIKeyLoginandmgmt_auth.gowhich correctly usesubtle.ConstantTimeCompare. - Impact: An attacker with network access to the admin API can determine the correct key character-by-character by measuring response times.
- Suggested fix: Use
crypto/subtle.ConstantTimeCompare([]byte(key), []byte(a.apiKey)) != 1.
7.6 No JWT Revocation Mechanism
- Severity: Medium
- Location:
internal/web/auth.go:32-34,internal/web/middleware.go:24-50 - Description: JWTs are valid for 24 hours (
Claims.Expires). There is no token revocation mechanism — no deny-list, no server-side session store, and no way to invalidate a token before expiry. - Impact: Compromised tokens cannot be revoked. If a user’s account is deleted or role changed, their JWT remains valid until expiry.
- Suggested fix: Implement a lightweight token deny-list. Alternatively, shorten token lifetime to 15-30 minutes with a refresh token mechanism.
7.7 WebRTC Signaling Has No Authentication
- Severity: Medium
- Location:
pkg/audio/webrtc/signaling.go:33-38 - Description: WebRTC signaling endpoints (
/rooms/{roomID}/join,/rooms/{roomID}/ice,/rooms/{roomID}/leave) have no authentication. Any client can join rooms, inject ICE candidates, or disconnect peers. - Impact: Unauthorized users can join voice rooms, impersonate others, inject audio, or disconnect legitimate users — a direct path to session hijacking.
- Suggested fix: Add authentication (session token/API key). Verify
user_idmatches the authenticated caller.
Summary Table
| # | Finding | Severity | Area |
|---|---|---|---|
| 1.1 | TOCTOU race in session start | High | Session Lifecycle |
| 1.2 | Zombie sessions with NULL heartbeat | Critical | Session Lifecycle |
| 1.3 | No state transition validation | Medium | Session Lifecycle |
| 1.4 | Zombie cleanup doesn’t update in-memory state | High | Session Lifecycle |
| 1.5 | Concurrent Stop from multiple triggers | Medium | Session Lifecycle |
| 1.6 | Full-mode session not stopped on shutdown | Medium | Session Lifecycle |
| 1.7 | StopAll doesn’t report ended state | Medium | Session Lifecycle |
| 1.8 | Reconnector silent failure after max retries | Medium | Session Lifecycle |
| 1.9 | Disconnect listener leak on failure | Low | Session Lifecycle |
| 2.1 | StopSession missing tenant auth | Critical | Multi-Tenancy |
| 2.2 | GetUser not scoped by tenant | Medium | Multi-Tenancy |
| 2.3 | Lore/Knowledge no tenant guard in store | Medium | Multi-Tenancy |
| 2.4 | Knowledge entity queries missing campaign_id | High | Multi-Tenancy |
| 2.5 | NPC store has no tenant isolation | High | Multi-Tenancy |
| 2.6 | UpdateUser not scoped by tenant | Medium | Multi-Tenancy |
| 2.7 | Gateway admin API open when key empty | Medium | Multi-Tenancy |
| 2.8 | Worker gRPC calls not tenant-scoped | Medium | Multi-Tenancy |
| 3.1 | No echo cancellation — NPC self-talk loop | Critical | Voice Pipeline |
| 3.2 | Cascade engine goroutine leak | Medium | Voice Pipeline |
| 3.3 | Audio frame gaps on stream reconnect | Medium | Voice Pipeline |
| 3.4 | Flush/sendLoop data race in gRPC bridge | High | Voice Pipeline |
| 3.5 | STT session leak on rapid VAD cycles | High | Voice Pipeline |
| 3.6 | TTS/LLM fallback no mid-stream recovery | Medium | Voice Pipeline |
| 3.7 | voiceBridgeReceiver frameCount data race | Medium | Voice Pipeline |
| 3.8 | Consolidator drops bracketed messages | Low | Voice Pipeline |
| 4.1 | Non-atomic invite acceptance | Medium | Data Consistency |
| 4.2 | Campaign deletion no cascade | Medium | Data Consistency |
| 5.1 | CORS defaults to allow-all | High | Config & Startup |
| 5.2 | ManagementService gRPC unauthenticated by default | Critical | Config & Startup |
| 5.3 | No API key validation for cloud providers | High | Config & Startup |
| 5.4 | HTTP servers lack timeouts (Slowloris) | Medium | Config & Startup |
| 5.5 | DB SSL defaults to “prefer” not “require” | Medium | Config & Startup |
| 5.6 | Embedding dimension mismatch defaults to 1536 | Medium | Config & Startup |
| 5.7 | No min length on AdminAPIKey | Medium | Config & Startup |
| 5.8 | Vault encryption enabled after health check fail | Medium | Config & Startup |
| 6.1 | RateLimiter goroutine leak | Low | Error Propagation |
| 6.2 | Nil opus encoder silent failure | High | Error Propagation |
| 6.3 | S2S audio channel shared across concurrent calls | Medium | Error Propagation |
| 6.4 | ElevenLabs STT swallows fatal errors | Medium | Error Propagation |
| 6.5 | No circuit breaker on GatewayClient | Medium | Error Propagation |
| 6.6 | Dispatch context timer leak on success | Low | Error Propagation |
| 7.1 | X-Forwarded-For trusted blindly | High | Security |
| 7.2 | OAuth state timing attack | Low | Security |
| 7.3 | JWT in redirect URL | High | Security |
| 7.4 | SSRF via provider test base_url | High | Security |
| 7.5 | Admin API key comparison not constant-time | High | Security |
| 7.6 | No JWT revocation | Medium | Security |
| 7.7 | WebRTC signaling no authentication | Medium | Security |
Critical (4): 1.2, 2.1, 3.1, 5.2 — require immediate attention High (14): 1.1, 1.4, 2.4, 2.5, 3.4, 3.5, 5.1, 5.3, 6.2, 7.1, 7.3, 7.4, 7.5 — address before production Medium (28): 1.3, 1.5, 1.6, 1.7, 1.8, 2.2, 2.3, 2.6, 2.7, 2.8, 3.2, 3.3, 3.6, 3.7, 4.1, 4.2, 5.4, 5.5, 5.6, 5.7, 5.8, 6.3, 6.4, 6.5, 7.6, 7.7 — next sprint Low (5): 1.9, 3.8, 6.1, 6.6, 7.2 — address opportunistically
Positive Observations
- Tenant schema isolation for memory/knowledge data (per-tenant PostgreSQL schemas) is well-implemented with proper
pgx.Identifier.Sanitize()usage preventing SQL injection. - Vault Transit encryption for bot tokens at rest is a strong security measure.
- Circuit breaker implementation is clean and correct with proper half-open probe logic.
- Session orchestrator design with
CleanupZombiesandCleanupStalePendingshows good awareness of failure modes. - gRPC management auth uses
subtle.ConstantTimeComparecorrectly. - MaxBytesMiddleware prevents request body memory exhaustion.
- Bot token validation uses proper HMAC comparison.
- Compile-time interface assertions (
var _ Interface = (*Impl)(nil)) are consistently used throughout the codebase.