Session Log — 2026-02-19
Project: PS_AI_Agent (Unreal Engine 5.5)
Machine: Desktop PC (j_foucher)
Working directory: E:\ASTERION\GIT\PS_AI_Agent
Conversation Summary
1. Initial Request
User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE5.5.
Reference: existing Convai plugin (gRPC-based, more complex). Goal: simpler version using ElevenLabs.
Plugin name requested: PS_AI_Agent_ElevenLabs.
2. Codebase Exploration
Explored the Convai plugin source at ConvAI/Convai/ to understand:
- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction
3. Plugin Created
All source files written from scratch under:
Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/
Files created:
- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WS session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered
Commit: f0055e8
4. Memory Files Created
To allow context recovery on any machine (including laptop):
- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- Local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`
Commit: f0055e8 (with plugin), updated in 4d6ae10
5. .gitignore Updated
Added to existing ignores:
- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln`
- `/*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`
Commit: 4d6ae10, b114ab0
6. Compile — First Attempt (Errors Found)
Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:
- `WebSockets` listed in `.uplugin` — it's a module, not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- `FOnAudioCaptureFunction` callback uses `const void*`, not `const float*` → fixed cast
- `TArray::RemoveAt(0, N, false)` deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in the `.uplugin` Plugins array → added
Commit: bb1a857
7. Compile — Success
Clean build, no warnings, no errors.
Output: Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll
Memory updated with confirmed UE 5.5 API patterns. Commit: 3b98edc
8. Documentation — Markdown
Full reference doc written to .claude/PS_AI_Agent_ElevenLabs_Documentation.md:
- Installation, Project Settings, Quick Start (BP + C++), Components Reference, Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.
Commit: c833ccd
9. Documentation — PowerPoint
20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):
- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in repo root
- Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
- Generator script `generate_pptx.py` excluded from git via `.gitignore`
Commit: 1b72026
Session 2 — 2026-02-19 (continued context)
10. API vs Implementation Cross-Check (3 bugs found and fixed)
Cross-referenced elevenlabs_api_reference.md against plugin source. Found 3 protocol bugs:
Bug 1 — Transcript fields wrong:
- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`
Bug 2 — Pong format wrong:
- `event_id` was nested in `pong_event{}` → must be top-level
Bug 3 — Client turn mode messages don't exist:
- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)
Commit: ae2c9b9
11. SendTextMessage Added
User asked for text input to agent for testing (without mic).
Added SendTextMessage(FString) to UElevenLabsWebSocketProxy and UElevenLabsConversationalAgentComponent.
Sends {"type":"user_message","text":"..."} — agent replies with audio + text.
Commit: b489d11
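For reference, the envelope above can be sketched in plain C++. The helper names (`EscapeJsonString`, `BuildUserMessage`) are illustrative only, not the plugin's actual functions, which presumably build the payload with UE's JSON utilities:

```cpp
#include <string>

// Hypothetical helper: escape characters that would break a JSON string literal.
std::string EscapeJsonString(const std::string& In)
{
    std::string Out;
    for (char C : In)
    {
        switch (C)
        {
        case '"':  Out += "\\\""; break;
        case '\\': Out += "\\\\"; break;
        case '\n': Out += "\\n";  break;
        case '\r': Out += "\\r";  break;
        case '\t': Out += "\\t";  break;
        default:   Out += C;      break;
        }
    }
    return Out;
}

// Wrap user text in the {"type":"user_message","text":"..."} envelope.
std::string BuildUserMessage(const std::string& Text)
{
    return "{\"type\":\"user_message\",\"text\":\"" + EscapeJsonString(Text) + "\"}";
}
```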
12. Binary WebSocket Frame Fix
User reported: "Received unexpected binary WebSocket frame" warnings.
Root cause: ElevenLabs sends ALL WebSocket frames as binary, never text.
OnMessage (text handler) never fires. OnRawMessage must handle everything.
Fix: Implemented OnWsBinaryMessage with fragment reassembly (BinaryFrameBuffer).
Commit: 669c503
13. JSON vs PCM Discrimination Fix
After binary fix: "Failed to parse WebSocket message as JSON" errors.
Root cause: Binary frames contain BOTH JSON control messages AND raw PCM audio.
Fix: Peek at byte[0] of assembled buffer:
- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`
Commit: 4834567
14. Documentation Updated to v1.1.0
Full rewrite of .claude/PS_AI_Agent_ElevenLabs_Documentation.md:
- Added Changelog section (v1.0.0 / v1.1.0)
- Updated audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected turn mode docs, transcript docs, `OnAgentConnected` timing
- New troubleshooting entries
Commit: e464cfe
15. Test Blueprint Asset Updated
test_AI_Actor.uasset updated in UE Editor.
Commit: 99017f4
Git History (this session)
| Hash | Message |
|---|---|
| f0055e8 | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| 4d6ae10 | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| b114ab0 | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| bb1a857 | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| 3b98edc | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| c833ccd | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| 1b72026 | Add PowerPoint documentation and update .gitignore |
| bbeb429 | ElevenLabs API reference doc |
| dbd6161 | TestMap, test actor, DefaultEngine.ini, memory update |
| ae2c9b9 | Fix 3 WebSocket protocol bugs |
| b489d11 | Add SendTextMessage |
| 669c503 | Fix binary WebSocket frames |
| 4834567 | Fix JSON vs binary frame discrimination |
| e464cfe | Update documentation to v1.1.0 |
| 99017f4 | Update test_AI_Actor blueprint asset |
Key Technical Decisions Made This Session
| Decision | Reason |
|---|---|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| `WebSockets` in Build.cs only | It's a module (no .uplugin file); declaring it in `.uplugin` causes a build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside callback |
| `EAllowShrinking::No` | Bool overload of `RemoveAt` deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5s at 60fps heuristic to detect agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in ElevenLabs API |
Session 3 — 2026-02-19 (bug fixes from live testing)
16. Three Runtime Bugs Fixed (v1.2.0)
User reported after live testing:
- AI speaks twice — every audio response played double
- Cannot speak — mic capture didn't reach ElevenLabs
- Latency — requested `enable_intermediate_response: true`
Bug 1 Root Cause — Double Audio:
UE's libwebsockets backend fires both OnMessage() (text callback) and OnRawMessage() (binary callback) for the same incoming frame.
We had bound both WebSocket->OnMessage() and WebSocket->OnRawMessage() in Connect().
Result: every audio frame was decoded and enqueued twice → played twice.
Fix: Remove OnMessage binding entirely. OnRawMessage now handles all frames (JSON control messages peeked via first byte, raw PCM otherwise).
Bug 2 Root Cause — Mic Silent:
ElevenLabs requires a conversation_initiation_client_data message sent immediately after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.
Fix: Send conversation_initiation_client_data in OnWsConnected() before any other message.
Bug 2 Secondary — Delegate Stacking:
StartListening() called Mic->OnAudioCaptured.AddUObject(this, ...) without first removing existing bindings. If called more than once (e.g. after reconnect), delegates stack up and audio is sent multiple times per frame.
Fix: Add Mic->OnAudioCaptured.RemoveAll(this) before AddUObject in StartListening().
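The stacking behaviour is easy to reproduce with a plain list of callbacks. This is a toy model, not UE delegate code; `FToyDelegate` and `CountSendsAfterRebinding` are invented names used only to show why the `RemoveAll` guard matters:

```cpp
#include <functional>
#include <vector>

// Toy stand-in for a UE multicast delegate.
struct FToyDelegate
{
    std::vector<std::function<void()>> Handlers;
    void Add(std::function<void()> Fn) { Handlers.push_back(std::move(Fn)); }
    void RemoveAll()                   { Handlers.clear(); }
    void Broadcast() const             { for (const auto& Fn : Handlers) Fn(); }
};

// Simulate StartListening() being called twice (e.g. after a reconnect),
// then one mic callback firing. Without clearing first, sends double up.
int CountSendsAfterRebinding(bool bRemoveFirst)
{
    FToyDelegate OnAudioCaptured;
    int Sends = 0;
    for (int Rebinds = 0; Rebinds < 2; ++Rebinds)
    {
        if (bRemoveFirst) OnAudioCaptured.RemoveAll(); // the v1.2.0 guard
        OnAudioCaptured.Add([&Sends] { ++Sends; });
    }
    OnAudioCaptured.Broadcast(); // one captured audio buffer
    return Sends;
}
```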
Bug 3 — Latency:
Added "enable_intermediate_response": true inside custom_llm_extra_body of the conversation_initiation_client_data message. Also added optimize_streaming_latency: 3 in conversation_config_override.tts.
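For reference, an assumed shape of the resulting init message, reconstructed only from the fields named in this log; the exact nesting may differ from the real ElevenLabs schema and from what the plugin emits, so treat it as illustrative. The turn-mode string (`"client_vad"`/`"server_vad"`, see Session 4) also lives in this message, at a position determined by the plugin code:

```cpp
#include <string>

// Assumed payload shape, pieced together from the fields mentioned above.
const std::string InitMessage = R"({
  "type": "conversation_initiation_client_data",
  "custom_llm_extra_body": { "enable_intermediate_response": true },
  "conversation_config_override": {
    "tts": { "optimize_streaming_latency": 3 }
  }
})";
```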
Files changed:
- `ElevenLabsWebSocketProxy.cpp`:
  - `Connect()`: removed `OnMessage` binding
  - `OnWsConnected()`: now sends full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartListening()`: added `RemoveAll` guard before delegate binding
Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)
17. Two More Bugs Found and Fixed (v1.3.0)
User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.
Analysis of log:
- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio reaching ElevenLabs but being ignored
Bug A — TurnMode mismatch in conversation_initiation_client_data:
OnWsConnected() hardcoded "mode": "server_vad" in the init message regardless of the
component's TurnMode setting. User's Blueprint uses Client turn mode (push-to-talk),
so the server was configured for server_vad while the client sent client_vad audio signals.
Fix: Read TurnMode field on the proxy (set from the component before Connect()).
Translate EElevenLabsTurnMode::Client → "client_vad", Server → "server_vad".
Bug B — user_activity never sent continuously:
In client VAD mode, ElevenLabs requires user_activity to be sent continuously
alongside every audio chunk to keep the server's VAD aware the user is speaking.
SendUserTurnStart() sent it once on key press, but never again during speech.
Server-side, without continuous user_activity, the server treated the audio as noise.
Fix: In SendAudioChunk(), automatically send user_activity before each audio chunk
when TurnMode == Client. This keeps the signal continuous for the full duration of speech.
When the user releases T, StopListening() stops the mic → audio stops → user_activity
stops → server detects silence and triggers the agent response.
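The v1.3.0 send path can be sketched as follows; `FProxySketch` and its members are hypothetical names modelling the behaviour, with a string list standing in for the WebSocket:

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ETurnMode { Server, Client };

// Sketch: in Client turn mode, a user_activity signal precedes every audio
// chunk so the server-side VAD knows the user is still speaking.
struct FProxySketch
{
    ETurnMode TurnMode = ETurnMode::Server;
    std::vector<std::string> SentMessages; // stand-in for the WebSocket

    void SendAudioChunk(const std::vector<uint8_t>& Pcm)
    {
        if (TurnMode == ETurnMode::Client)
            SentMessages.push_back(R"({"type":"user_activity"})");
        SentMessages.push_back("<binary pcm: " + std::to_string(Pcm.size()) + " bytes>");
    }
};
```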
Bug C — TurnMode not propagated to proxy:
UElevenLabsConversationalAgentComponent never told the proxy what TurnMode to use.
Added WebSocketProxy->TurnMode = TurnMode before Connect() in StartConversation().
Files changed:
- `ElevenLabsWebSocketProxy.h`: added public `TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
  - `OnWsConnected()`: use `TurnMode` to set the correct mode string in the init message
  - `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`
Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)
18. Root Cause Found and Fixed (v1.4.0)
Log analysis revealed the true root cause:
Exact sequence:
1. OnConnected → `bAutoStartListening=true` → `StartListening()` → `bIsListening=true`, mic opens
2. OnAgentStoppedSpeaking → Blueprint calls `StartListening()` → `bIsListening` guard → no-op (already open)
3. User presses T → `StartListening()` → `bIsListening` guard → no-op
4. User releases T → `StopListening()` → `bIsListening=false`, mic CLOSES
5. User presses T → `StartListening()` → NOW opens mic (was closed)
6. User releases T → `StopListening()` → mic closes — but ElevenLabs never got audio
Root cause: bAutoStartListening = true opens the mic on connect and sets bIsListening = true.
In Client/push-to-talk mode, every T-press hits the bIsListening guard and does nothing.
Every T-release closes the auto-started mic. The mic was never open during actual speech.
Fix: HandleConnected() now only calls StartListening() when TurnMode == Server.
In Client mode, bAutoStartListening is ignored — the user controls listening via T key.
File changed:
- `ElevenLabsConversationalAgentComponent.cpp`: `HandleConnected()`: guard `bAutoStartListening` with a `TurnMode == Server` check
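A toy model of the guard (invented names, plain C++; the real code lives in `HandleConnected()`): auto-listen only applies in Server turn mode, so push-to-talk keeps the mic closed until the user presses T.

```cpp
enum class ETurnMode { Server, Client };

// Minimal model of the v1.4.0 fix.
struct FAgentSketch
{
    ETurnMode TurnMode = ETurnMode::Client;
    bool bAutoStartListening = true;
    bool bIsListening = false;

    void StartListening() { bIsListening = true; }   // T pressed
    void StopListening()  { bIsListening = false; }  // T released

    void HandleConnected()
    {
        // v1.4.0: bAutoStartListening is honoured only in Server mode,
        // so Client (push-to-talk) mode never auto-opens the mic.
        if (bAutoStartListening && TurnMode == ETurnMode::Server)
            StartListening();
    }
};
```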
Session 6 — 2026-02-19 (audio chunk size fix)
19. Mic Audio Chunk Accumulation (v1.5.0)
Root cause (from diagnostic log in Session 5):
Log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses:
- 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- Tiny 5ms fragments arrived at the server but were silently ignored → agent never responded
Fix applied:
Added MicAccumulationBuffer TArray<uint8> to UElevenLabsConversationalAgentComponent.
OnMicrophoneDataCaptured() appends each callback's converted bytes and only calls SendAudioChunk
when >= MicChunkMinBytes (3200 bytes = 100ms) have accumulated.
StopListening() flushes any remaining bytes in the buffer before sending SendUserTurnEnd(),
so the last partial chunk of speech is never dropped.
HandleDisconnected() clears the buffer to prevent stale data on reconnect.
Files changed:
- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `OnMicrophoneDataCaptured()`: accumulate → send when threshold reached
  - `StopListening()`: flush remainder before end-of-turn signal
  - `HandleDisconnected()`: clear accumulation buffer
Commit: 91cf5b1
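The accumulation logic described above can be sketched in standalone C++ (names mirror the log but the struct itself is a hypothetical model, not the component's code):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of the v1.5.0 fix: buffer ~5ms WASAPI callbacks until at least
// 3200 bytes (100ms of 16kHz 16-bit mono PCM) are available, flushing the
// remainder when listening stops so the tail of speech is not dropped.
struct FMicAccumulator
{
    static constexpr size_t MicChunkMinBytes = 3200;
    std::vector<uint8_t> Buffer;
    std::function<void(const std::vector<uint8_t>&)> SendAudioChunk;

    void OnMicrophoneDataCaptured(const uint8_t* Data, size_t Size)
    {
        Buffer.insert(Buffer.end(), Data, Data + Size);
        if (Buffer.size() >= MicChunkMinBytes)
        {
            SendAudioChunk(Buffer);
            Buffer.clear();
        }
    }

    // Called from StopListening() before the end-of-turn signal.
    void Flush()
    {
        if (!Buffer.empty()) { SendAudioChunk(Buffer); Buffer.clear(); }
    }
};
```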
Next Steps (not done yet)
- Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- Test `SendTextMessage` end-to-end in Blueprint
- Add lip-sync support (future)
- Add session memory / conversation history (future, matching Convai)
- Add environment/action context support (future)
- Consider Signed URL Mode backend implementation