Luk went to bed at ~01:00 CET. Claude is running Waves 3-5 autonomously. All decisions made without Lukβs input are documented here for morning review.
Status Tracker
- Wave 3 Lane J: NPC gRPC extensions + recap
- Wave 3 Lane K: Vault + security
- Wave 4 Lane L: End-to-end testing + docs
- Wave 5: Exploratory β Discord deploy + slash command verification
- Deploy updated gateway to K3s
- Verify bot comes online with slash commands
Decisions Log
D1: Vault Transit for bot token encryption (01:00 CET)
Decision: Use Vault HTTP API directly (no Go Vault SDK dependency) for Transit encrypt/decrypt of bot tokens. Graceful degradation: if Vault is unavailable, store tokens in plaintext with a log warning. This avoids adding a heavy dependency and keeps the code simple. Why: Vault SDK adds ~40 transitive deps. The Transit API is 2 HTTP calls (encrypt + decrypt). The admin store is the only consumer.
D2: Vault PKI scope (01:00 CET)
Decision: Set up the Vault PKI role and document cert issuance. Full auto-rotation of gRPC certs is deferred β it requires a sidecar or init container pattern thatβs too complex for one night. For now, the TLS env var support from Wave 2 is sufficient. Manual cert issuance from Vault PKI is documented. Why: Auto-rotation requires either a Vault agent sidecar or a Go-level cert watcher. Both are multi-day efforts. The TLS plumbing is already there.
D3: K8s Secrets for API keys (01:00 CET)
Decision: Move ElevenLabs API key, Gemini API key, and Discord bot token from ConfigMap to K8s Secrets. The gateway deployment mounts them as env vars. The config.yaml in the ConfigMap references env var placeholders. Why: ConfigMaps are not encrypted at rest. Secrets get etcd encryption by default in K3s.
D4: NPC recap in gateway mode (01:00 CET)
Decision: Text-only recap for MVP. The gateway can query the shared PostgreSQL session store for transcript data. Voice recap (TTS narration) is deferred since the gateway doesnβt have TTS providers loaded. Why: Voice recap requires TTS infrastructure on the gateway. Text recap works immediately via shared DB.
D5: Monitoring loop approach (01:00 CET)
Decision: 30-minute CronCreate loop checks agent progress, merges PRs, launches next waves. Stops at 05:00 CET or when all waves complete.
Progress Log
01:54 CET β First check
- Wave 3 agents running (2 processes alive)
- No PRs or branches pushed yet
- Agents still working on proto changes and Vault integration
02:01 CET β Second check
- Both agents still running (2 processes)
- No PRs, no branches, no Discord reports yet
- Expected β these are heavy lanes (proto regen, Vault HTTP client, K8s Secrets)
~02:05 CET β Lane K complete
- Vault Transit client implemented (
internal/gateway/vault/transit.go)- HTTP API based β no
hashicorp/vault/apidependency (D1) - Graceful degradation: Encrypt returns plaintext, Decrypt errors only for vault: prefixed data
- NoopEncryptor for when Vault is not configured
- HTTP API based β no
- PostgresAdminStore updated to encrypt/decrypt bot tokens on write/read
- Constructor now accepts
vault.TokenEncryptor(nil β NoopEncryptor) - Pre-existing plaintext tokens pass through transparently (no βvault:v1:β prefix)
- Constructor now accepts
- Vault PKI client implemented (
internal/gateway/vault/pki.go)- Issues certs from Vault PKI, writes to disk (D2)
- Full auto-rotation deferred per D2
scripts/vault-setup.shconfigures Transit + PKI + policy
- K8s Secrets for sensitive config (D3)
- Added
secrets.*values for API keys, DB DSN, Vault token - Gateway + worker deployments mount via secretKeyRef
- Database DSN from secrets takes precedence over
database.dsn
- Added
- Fixed pre-existing duplicate variable declarations in main.go (orch, callbackBridge, usageStore)
- All tests pass (vault package: 100%, existing tests: unaffected)
Wave 4: Deploy & Test (10:17β10:22 CET)
D6: Migration idempotency fix (10:17 CET)
Decision: Made ALTER TABLE ADD CONSTRAINT unique_active_campaign idempotent by wrapping in a DO $$ IF NOT EXISTS (SELECT FROM pg_constraint WHERE conname = ...) block. PostgreSQL doesnβt support IF NOT EXISTS on ADD CONSTRAINT natively. Why: The gateway crashed on startup because the migration runner re-executes all SQL files on every boot (no version tracking). The constraint already existed from the previous deployment.
D7: K8s dispatcher env vars (10:17 CET)
Decision: Patched the gateway deployment to add two env vars:
GLYPHOXA_K8S_NAMESPACE=glyphoxaβ enables the K8s Job dispatcherGLYPHOXA_JOB_TEMPLATE_CM=glyphoxa-worker-job-templateβ points to the correct ConfigMap (default wasglyphoxa-worker-job, actual name isglyphoxa-worker-job-template) Why: WithoutGLYPHOXA_K8S_NAMESPACE, the dispatcher is disabled entirely and/session startcanβt create worker pods. These should be added to the Helm/zhi deployment manifests permanently.
D8: RBAC ConfigMap read permission (10:18 CET)
Decision: Patched the glyphoxa-job-manager Role to add get verb for ConfigMaps. The gateway service account needs this to load the worker job template. Why: The Role only had permissions for batch/jobs and pods. Loading the job template from a ConfigMap requires core/configmaps read access. This should be added to the zhi deployment manifests.
Deploy result
- Both CI and Docker workflows passed for the migration fix commit
- Gateway pod started cleanly after restart
- Migrations applied successfully
- K8s dispatcher initialized
- Health endpoints:
/healthzok,/readyzok (admin_api: ok, grpc: ok) - Tenant βlukβ created with campaign_id
die-chroniken-von-rabenheim - Discord bot connected:
has_commands=true
Slash commands registered
All 5 commands visible in Discord via the Application Commands API:
/sessionβ start, stop, status, recap/npcβ list, mute, unmute, speak, muteall, unmuteall/entityβ knowledge management/campaignβ campaign management/feedbackβ user feedback
Wave 5: Voice Pipeline Investigation (10:22 CET)
Worker dispatch readiness: READY
The full worker dispatch infrastructure is in place:
- K8s Job template (
glyphoxa-worker-job-templateConfigMap) β correctly configured with:- Image:
ghcr.io/mrwong99/glyphoxa:main - Mode:
--mode=worker - Gateway address:
glyphoxa-gateway.glyphoxa.svc:50051 - Database DSN
- Shared config volume
- Resource limits (500mβ2 CPU, 384Miβ768Mi RAM)
- Image:
- RBAC (
glyphoxa-job-managerRole) β allows job CRUD, pod list/watch, configmap get - Service account (
glyphoxa-gateway) bound to the role - Dispatcher initialized in gateway with 120s timeout for pod readiness
What happens on /session start
- Gateway creates session record in PostgreSQL (with campaign/guild/license constraints)
- Dispatcher stamps the job template with session ID + tenant ID
- K8s Job is created β worker pod starts
- Gateway polls for pod Running + IP (120s timeout)
- Worker connects to Discord voice and runs VADβSTTβLLMβTTS pipeline
- Control flows over gRPC between gateway and worker
To test end-to-end
A human needs to:
- Join a voice channel in the Discord guild (1178124747884736613)
- Run
/session startin a text channel - Worker pod will be dispatched automatically
- Speak β NPCs should respond
Known limitations for first test
- gRPC TLS not configured (insecure, fine for LAN)
- Vault encryption not active (no VAULT_ADDR/VAULT_TOKEN set)
- API keys are in ConfigMap, not K8s Secrets (D3 code exists but K8s manifests not updated)
- Worker pod resource limits may need tuning for actual voice processing
- NPC commands return βno active sessionβ in gateway mode until gRPC extensions fully wire the orchestrator (D4 scope)