PS_AI_Agent/.claude/session_log_2026-02-19.md
j.foucher b888f7fcb6 Update memory: document v1.5.0 mic chunk size fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 18:42:47 +01:00


Session Log — 2026-02-19

Project: PS_AI_Agent (Unreal Engine 5.5)
Machine: Desktop PC (j_foucher)
Working directory: E:\ASTERION\GIT\PS_AI_Agent


Conversation Summary

1. Initial Request

User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE5.5. Reference: existing Convai plugin (gRPC-based, more complex). Goal: simpler version using ElevenLabs. Plugin name requested: PS_AI_Agent_ElevenLabs.

2. Codebase Exploration

Explored the Convai plugin source at ConvAI/Convai/ to understand:

  • Module/settings structure
  • AudioCapture patterns
  • HTTP proxy pattern
  • gRPC streaming architecture (to know what to replace with WebSocket)
  • Convai already had EVoiceType::ElevenLabsVoices — confirming the direction

3. Plugin Created

All source files written from scratch under: Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/

Files created:

  • PS_AI_Agent_ElevenLabs.uplugin
  • PS_AI_Agent_ElevenLabs.Build.cs
  • Public/PS_AI_Agent_ElevenLabs.h — Module + UElevenLabsSettings
  • Public/ElevenLabsDefinitions.h — Enums, structs, protocol constants
  • Public/ElevenLabsWebSocketProxy.h + .cpp — WS session manager
  • Public/ElevenLabsConversationalAgentComponent.h + .cpp — Main NPC component
  • Public/ElevenLabsMicrophoneCaptureComponent.h + .cpp — Mic capture
  • PS_AI_Agent.uproject — Plugin registered

Commit: f0055e8

4. Memory Files Created

To allow context recovery on any machine (including laptop):

  • .claude/MEMORY.md — project structure + patterns (auto-loaded by Claude Code)
  • .claude/elevenlabs_plugin.md — plugin file map + API protocol details
  • .claude/project_context.md — original ask, intent, short/long-term goals
  • Local copy also at C:\Users\j_foucher\.claude\projects\...\memory\

Commit: f0055e8 (with plugin), updated in 4d6ae10

5. .gitignore Updated

Added to existing ignores:

  • Unreal/PS_AI_Agent/Plugins/*/Binaries/
  • Unreal/PS_AI_Agent/Plugins/*/Intermediate/
  • Unreal/PS_AI_Agent/*.sln / *.suo
  • .claude/settings.local.json
  • generate_pptx.py

Commit: 4d6ae10, b114ab0

6. Compile — First Attempt (Errors Found)

Ran Build.bat PS_AI_AgentEditor Win64 Development. Errors:

  • WebSockets listed in .uplugin — it's a module, not a plugin → removed
  • OpenDefaultCaptureStream doesn't exist in UE 5.5 → use OpenAudioCaptureStream
  • FOnAudioCaptureFunction callback uses const void* not const float* → fixed cast
  • TArray::RemoveAt(0, N, false) deprecated → use EAllowShrinking::No
  • AudioCapture is a plugin and must be in .uplugin Plugins array → added
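The FOnAudioCaptureFunction fix can be sketched outside the engine. RunCapture, FirstSample, and the sample data below are hypothetical stand-ins; only the const void* parameter and the cast inside the callback body mirror the UE 5.5 API change:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the UE 5.5 audio capture callback, which hands
// the buffer over as const void* rather than const float*.
using AudioCaptureFn =
    std::function<void(const void* AudioData, int NumFrames, int NumChannels)>;

// Simulates the capture backend invoking the callback with float samples.
void RunCapture(const AudioCaptureFn& Callback) {
    const std::vector<float> Samples = {0.25f, -0.5f, 0.75f, -1.0f};
    Callback(Samples.data(), 2, 2);  // 2 frames of stereo audio
}

// The cast the fix added inside the callback body.
float FirstSample(const void* AudioData) {
    const float* Audio = static_cast<const float*>(AudioData);
    return Audio[0];
}
```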

Commit: bb1a857

7. Compile — Success

Clean build, no warnings, no errors. Output: Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll

Memory updated with confirmed UE 5.5 API patterns. Commit: 3b98edc

8. Documentation — Markdown

Full reference doc written to .claude/PS_AI_Agent_ElevenLabs_Documentation.md:

  • Installation, Project Settings, Quick Start (BP + C++), Components Reference, Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.

Commit: c833ccd

9. Documentation — PowerPoint

20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):

  • File: PS_AI_Agent_ElevenLabs_Documentation.pptx in repo root
  • Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
  • Generator script generate_pptx.py excluded from git via .gitignore

Commit: 1b72026


Session 2 — 2026-02-19 (continued context)

10. API vs Implementation Cross-Check (3 bugs found and fixed)

Cross-referenced elevenlabs_api_reference.md against plugin source. Found 3 protocol bugs:

Bug 1 — Transcript fields wrong:

  • Type: "transcript" → "user_transcript"
  • Event key: "transcript_event" → "user_transcription_event"
  • Field: "message" → "user_transcript"

Bug 2 — Pong format wrong:

  • event_id was nested in pong_event{} → must be top-level
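A minimal sketch of the corrected pong reply, assuming a numeric event_id (the helper name is ours, not the plugin's):

```cpp
#include <cstdint>
#include <string>

// Per the Bug 2 fix: event_id sits at the top level of the pong message,
// not nested inside a pong_event object.
std::string MakePong(int64_t EventId) {
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}
```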

Bug 3 — Client turn mode messages don't exist:

  • "user_turn_start" / "user_turn_end" are not valid API types
  • Replaced: start → "user_activity", end → no-op (server detects silence)

Commit: ae2c9b9

11. SendTextMessage Added

User asked for text input to agent for testing (without mic). Added SendTextMessage(FString) to UElevenLabsWebSocketProxy and UElevenLabsConversationalAgentComponent. Sends {"type":"user_message","text":"..."} — agent replies with audio + text.
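A hedged sketch of the payload SendTextMessage builds; the escaping here covers only quotes and backslashes (the real implementation would use UE's JSON writer), and the helper name is ours:

```cpp
#include <string>

// Builds {"type":"user_message","text":"..."} with minimal escaping.
// A production implementation should use a proper JSON serializer.
std::string MakeUserMessage(const std::string& Text) {
    std::string Escaped;
    for (char C : Text) {
        if (C == '"' || C == '\\') Escaped += '\\';  // escape quote/backslash
        Escaped += C;
    }
    return "{\"type\":\"user_message\",\"text\":\"" + Escaped + "\"}";
}
```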

Commit: b489d11

12. Binary WebSocket Frame Fix

User reported: "Received unexpected binary WebSocket frame" warnings. Root cause: ElevenLabs sends ALL WebSocket frames as binary, never text. OnMessage (text handler) never fires. OnRawMessage must handle everything.

Fix: Implemented OnWsBinaryMessage with fragment reassembly (BinaryFrameBuffer).
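The reassembly logic can be sketched standalone (std::vector standing in for TArray, class and method names ours). UE's OnRawMessage delivers (Data, Size, BytesRemaining), where BytesRemaining == 0 marks the final fragment of the current frame:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class FrameAssembler {
public:
    // Appends a fragment; returns true and fills OutFrame once the frame
    // is complete (i.e. no bytes remain in flight).
    bool OnFragment(const void* Data, size_t Size, size_t BytesRemaining,
                    std::vector<uint8_t>& OutFrame) {
        const uint8_t* Bytes = static_cast<const uint8_t*>(Data);
        Buffer.insert(Buffer.end(), Bytes, Bytes + Size);
        if (BytesRemaining > 0) {
            return false;  // more fragments of this frame still coming
        }
        OutFrame.swap(Buffer);
        Buffer.clear();
        return true;
    }

private:
    std::vector<uint8_t> Buffer;  // the BinaryFrameBuffer equivalent
};
```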

Commit: 669c503

13. JSON vs PCM Discrimination Fix

After binary fix: "Failed to parse WebSocket message as JSON" errors. Root cause: Binary frames contain BOTH JSON control messages AND raw PCM audio.

Fix: Peek at byte[0] of assembled buffer:

  • '{' (0x7B) → UTF-8 JSON → route to OnWsMessage()
  • anything else → raw PCM audio → broadcast to OnAudioReceived
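The peek itself is a one-liner; a standalone sketch (std::vector in place of the plugin's TArray, function name ours):

```cpp
#include <cstdint>
#include <vector>

// A frame whose first byte is '{' (0x7B) is a UTF-8 JSON control message;
// anything else is treated as raw PCM audio.
bool IsJsonFrame(const std::vector<uint8_t>& Frame) {
    return !Frame.empty() && Frame[0] == 0x7B;
}
```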

Commit: 4834567

14. Documentation Updated to v1.1.0

Full rewrite of .claude/PS_AI_Agent_ElevenLabs_Documentation.md:

  • Added Changelog section (v1.0.0 / v1.1.0)
  • Updated audio pipeline (binary PCM path, not Base64 JSON)
  • Added SendTextMessage to all function tables and examples
  • Corrected turn mode docs, transcript docs, OnAgentConnected timing
  • New troubleshooting entries

Commit: e464cfe

15. Test Blueprint Asset Updated

test_AI_Actor.uasset updated in UE Editor.

Commit: 99017f4


Git History (this session)

| Hash | Message |
| --- | --- |
| f0055e8 | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| 4d6ae10 | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| b114ab0 | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| bb1a857 | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| 3b98edc | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| c833ccd | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| 1b72026 | Add PowerPoint documentation and update .gitignore |
| bbeb429 | ElevenLabs API reference doc |
| dbd6161 | TestMap, test actor, DefaultEngine.ini, memory update |
| ae2c9b9 | Fix 3 WebSocket protocol bugs |
| b489d11 | Add SendTextMessage |
| 669c503 | Fix binary WebSocket frames |
| 4834567 | Fix JSON vs binary frame discrimination |
| e464cfe | Update documentation to v1.1.0 |
| 99017f4 | Update test_AI_Actor blueprint asset |

Key Technical Decisions Made This Session

| Decision | Reason |
| --- | --- |
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| AudioCapture in .uplugin Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| WebSockets in Build.cs only | It's a module (no .uplugin file); declaring it in .uplugin causes a build error |
| FOnAudioCaptureFunction uses const void* | UE 5.3+ API change — must cast to float* inside callback |
| EAllowShrinking::No | Bool overload of RemoveAt deprecated in UE 5.5 |
| USoundWaveProcedural for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5s at 60fps heuristic to detect agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| user_activity for client turn | user_turn_start/user_turn_end don't exist in ElevenLabs API |


Session 3 — 2026-02-19 (bug fixes from live testing)

16. Three Runtime Bugs Fixed (v1.2.0)

User reported after live testing:

  1. AI speaks twice — every audio response played double
  2. Cannot speak — mic capture didn't reach ElevenLabs
  3. Latency — requested enable_intermediate_response: true

Bug 1 Root Cause — Double Audio: UE's libwebsockets backend fires both OnMessage() (text callback) and OnRawMessage() (binary callback) for the same incoming frame. We had bound both WebSocket->OnMessage() and WebSocket->OnRawMessage() in Connect(). Result: every audio frame was decoded and enqueued twice → played twice.

Fix: Remove OnMessage binding entirely. OnRawMessage now handles all frames (JSON control messages peeked via first byte, raw PCM otherwise).

Bug 2 Root Cause — Mic Silent: ElevenLabs requires a conversation_initiation_client_data message sent immediately after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.

Fix: Send conversation_initiation_client_data in OnWsConnected() before any other message.

Bug 2 Secondary — Delegate Stacking: StartListening() called Mic->OnAudioCaptured.AddUObject(this, ...) without first removing existing bindings. If called more than once (e.g. after reconnect), delegates stack up and audio is sent multiple times per frame.

Fix: Add Mic->OnAudioCaptured.RemoveAll(this) before AddUObject in StartListening().

Bug 3 — Latency: Added "enable_intermediate_response": true inside custom_llm_extra_body of the conversation_initiation_client_data message. Also added optimize_streaming_latency: 3 in conversation_config_override.tts.
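Putting the Bug 2 and Bug 3 fixes together, the init message presumably looks like the sketch below. The nesting follows this log (custom_llm_extra_body; conversation_config_override.tts); treat it as a sketch to check against the ElevenLabs API reference, not a verified schema:

```cpp
#include <string>

// Sketch of the conversation_initiation_client_data payload sent from
// OnWsConnected(). Field placement is as described in the session notes;
// the constant name is ours.
const std::string kInitMessage = R"({
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "tts": { "optimize_streaming_latency": 3 }
  },
  "custom_llm_extra_body": { "enable_intermediate_response": true }
})";
```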

Files changed:

  • ElevenLabsWebSocketProxy.cpp:
    • Connect(): removed OnMessage binding
    • OnWsConnected(): now sends full conversation_initiation_client_data JSON
  • ElevenLabsConversationalAgentComponent.cpp:
    • StartListening(): added RemoveAll guard before delegate binding


Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)

17. Two More Bugs Found and Fixed (v1.3.0)

User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.

Analysis of log:

  • Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
  • Mic opens and closes correctly — audio capture IS happening
  • Server never responds to mic input → audio reaches ElevenLabs but is ignored

Bug A — TurnMode mismatch in conversation_initiation_client_data: OnWsConnected() hardcoded "mode": "server_vad" in the init message regardless of the component's TurnMode setting. User's Blueprint uses Client turn mode (push-to-talk), so the server was configured for server_vad while the client sent client_vad audio signals.

Fix: Read TurnMode field on the proxy (set from the component before Connect()). Translate EElevenLabsTurnMode::Client → "client_vad", Server → "server_vad".

Bug B — user_activity never sent continuously: In client VAD mode, ElevenLabs requires user_activity to be sent continuously alongside every audio chunk to keep the server's VAD aware the user is speaking. SendUserTurnStart() sent it once on key press, but never again during speech. Without that continuous user_activity signal, the server treated the audio as noise.

Fix: In SendAudioChunk(), automatically send user_activity before each audio chunk when TurnMode == Client. This keeps the signal continuous for the full duration of speech. When the user releases T, StopListening() stops the mic → audio stops → user_activity stops → server detects silence and triggers the agent response.

Bug C — TurnMode not propagated to proxy: UElevenLabsConversationalAgentComponent never told the proxy what TurnMode to use. Added WebSocketProxy->TurnMode = TurnMode before Connect() in StartConversation().
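The Bug A and Bug B fixes can be sketched standalone. FakeSocket and the payload placeholder are hypothetical (the real proxy sends base64-encoded PCM, elided here); the mode strings and the "user_activity before every chunk in Client mode" rule come from this session:

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ETurnMode { Client, Server };

// Hypothetical test double for the WebSocket connection.
struct FakeSocket {
    std::vector<std::string> Sent;
    void Send(std::string Message) { Sent.push_back(std::move(Message)); }
};

// Bug A fix: translate the component's TurnMode to the API's mode string.
std::string TurnModeString(ETurnMode Mode) {
    return Mode == ETurnMode::Client ? "client_vad" : "server_vad";
}

// Bug B fix: in Client mode, keep the server's VAD engaged by sending
// user_activity immediately before every audio chunk.
void SendAudioChunk(FakeSocket& Ws, ETurnMode Mode,
                    const std::vector<uint8_t>& Pcm) {
    if (Mode == ETurnMode::Client) {
        Ws.Send(R"({"type":"user_activity"})");
    }
    Ws.Send("<audio chunk of " + std::to_string(Pcm.size()) + " bytes>");
}
```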

Files changed:

  • ElevenLabsWebSocketProxy.h: added public TurnMode field
  • ElevenLabsWebSocketProxy.cpp:
    • OnWsConnected(): use TurnMode to set correct mode string in init message
    • SendAudioChunk(): auto-send user_activity before each chunk in Client mode
  • ElevenLabsConversationalAgentComponent.cpp:
    • StartConversation(): set WebSocketProxy->TurnMode = TurnMode before Connect()


Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)

18. Root Cause Found and Fixed (v1.4.0)

Log analysis revealed the true root cause:

Exact sequence:

  1. OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
  2. OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
  3. User presses T → StartListening() → bIsListening guard → no-op
  4. User releases T → StopListening() → bIsListening=false, mic CLOSES
  5. User presses T → StartListening() → NOW opens mic (was closed)
  6. User releases T → StopListening() → mic closes — but ElevenLabs never got audio

Root cause: bAutoStartListening = true opens the mic on connect and sets bIsListening = true. In Client/push-to-talk mode, every T-press hits the bIsListening guard and does nothing. Every T-release closes the auto-started mic. The mic was never open during actual speech.

Fix: HandleConnected() now only calls StartListening() when TurnMode == Server. In Client mode, bAutoStartListening is ignored — the user controls listening via T key.
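The guard condition reduces to a single predicate; a minimal sketch with simplified stand-in types (the function name is ours):

```cpp
enum class ETurnMode { Client, Server };

// v1.4.0 guard: auto-listen only applies in Server (VAD) mode; in Client
// push-to-talk mode the key bindings own the mic.
bool ShouldAutoStartListening(bool bAutoStartListening, ETurnMode Mode) {
    return bAutoStartListening && Mode == ETurnMode::Server;
}
```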

File changed:

  • ElevenLabsConversationalAgentComponent.cpp:
    • HandleConnected(): guard bAutoStartListening with TurnMode == Server check


Session 6 — 2026-02-19 (audio chunk size fix)

19. Mic Audio Chunk Accumulation (v1.5.0)

Root cause (from the diagnostic log in Session 5): the log showed hundreds of SendAudioChunk: 158 bytes (TurnMode=Client) lines with zero server responses.

  • 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
  • WASAPI (Windows Audio Session API) fires the FAudioCapture callback at its internal buffer period (~5ms)
  • ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
  • Tiny 5ms fragments arrived at the server but were silently ignored → agent never responded

Fix applied: Added MicAccumulationBuffer TArray<uint8> to UElevenLabsConversationalAgentComponent. OnMicrophoneDataCaptured() appends each callback's converted bytes and only calls SendAudioChunk when >= MicChunkMinBytes (3200 bytes = 100ms) have accumulated.

StopListening() flushes any remaining bytes in the buffer before sending SendUserTurnEnd(), so the last partial chunk of speech is never dropped.

HandleDisconnected() clears the buffer to prevent stale data on reconnect.
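The accumulate/flush/clear behaviour can be sketched standalone (std::vector and std::function in place of the UE types, class name ours; 3200 bytes = 1600 samples = 100 ms at 16 kHz 16-bit mono, per the numbers above):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class MicAccumulator {
public:
    static constexpr size_t MinChunkBytes = 3200;  // MicChunkMinBytes equivalent
    std::function<void(std::vector<uint8_t>)> SendAudioChunk;

    // OnMicrophoneDataCaptured(): append, then send only full >= 100 ms chunks.
    void OnCaptured(const uint8_t* Data, size_t Num) {
        Buffer.insert(Buffer.end(), Data, Data + Num);
        while (Buffer.size() >= MinChunkBytes) {
            SendAudioChunk(std::vector<uint8_t>(
                Buffer.begin(), Buffer.begin() + MinChunkBytes));
            Buffer.erase(Buffer.begin(), Buffer.begin() + MinChunkBytes);
        }
    }

    // StopListening(): flush the last partial chunk before the end-of-turn signal.
    void Flush() {
        if (!Buffer.empty()) {
            SendAudioChunk(Buffer);
            Buffer.clear();
        }
    }

    // HandleDisconnected(): drop stale data so nothing leaks into a reconnect.
    void Clear() { Buffer.clear(); }

private:
    std::vector<uint8_t> Buffer;  // MicAccumulationBuffer equivalent
};
```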

Files changed:

  • ElevenLabsConversationalAgentComponent.h: added MicAccumulationBuffer + MicChunkMinBytes = 3200
  • ElevenLabsConversationalAgentComponent.cpp:
    • OnMicrophoneDataCaptured(): accumulate → send when threshold reached
    • StopListening(): flush remainder before end-of-turn signal
    • HandleDisconnected(): clear accumulation buffer

Commit: 91cf5b1


Next Steps (not done yet)

  • Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
  • Test USoundWaveProcedural underflow behaviour in practice (check for audio glitches)
  • Test SendTextMessage end-to-end in Blueprint
  • Add lip-sync support (future)
  • Add session memory / conversation history (future, matching Convai)
  • Add environment/action context support (future)
  • Consider Signed URL Mode backend implementation