Session Log — 2026-02-19
Project: PS_AI_Agent (Unreal Engine 5.5)
Machine: Desktop PC (j_foucher)
Working directory: E:\ASTERION\GIT\PS_AI_Agent
Conversation Summary
1. Initial Request
User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE5.5.
Reference: existing Convai plugin (gRPC-based, more complex). Goal: simpler version using ElevenLabs.
Plugin name requested: PS_AI_Agent_ElevenLabs.
2. Codebase Exploration
Explored the Convai plugin source at ConvAI/Convai/ to understand:
- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction
3. Plugin Created
All source files written from scratch under:
Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/
Files created:
- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WS session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered
Commit: f0055e8
4. Memory Files Created
To allow context recovery on any machine (including laptop):
- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- Local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`
Commit: f0055e8 (with plugin), updated in 4d6ae10
5. .gitignore Updated
Added to existing ignores:
- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln`
- `/*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`
Commit: 4d6ae10, b114ab0
6. Compile — First Attempt (Errors Found)
Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:
- `WebSockets` listed in `.uplugin` — it's a module, not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- `FOnAudioCaptureFunction` callback uses `const void*`, not `const float*` → fixed cast
- `TArray::RemoveAt(0, N, false)` deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in the `.uplugin` Plugins array → added
Commit: bb1a857
7. Compile — Success
Clean build, no warnings, no errors.
Output: Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll
Memory updated with confirmed UE 5.5 API patterns. Commit: 3b98edc
8. Documentation — Markdown
Full reference doc written to .claude/PS_AI_Agent_ElevenLabs_Documentation.md:
- Installation, Project Settings, Quick Start (BP + C++), Components Reference, Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.
Commit: c833ccd
9. Documentation — PowerPoint
20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):
- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in repo root
- Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
- Generator script `generate_pptx.py` excluded from git via `.gitignore`
Commit: 1b72026
Session 2 — 2026-02-19 (continued context)
10. API vs Implementation Cross-Check (3 bugs found and fixed)
Cross-referenced elevenlabs_api_reference.md against plugin source. Found 3 protocol bugs:
Bug 1 — Transcript fields wrong:
- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`
Bug 2 — Pong format wrong:
- `event_id` was nested in `pong_event{}` → must be top-level
Bug 3 — Client turn mode messages don't exist:
- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)
Commit: ae2c9b9
11. SendTextMessage Added
User asked for text input to agent for testing (without mic).
Added SendTextMessage(FString) to UElevenLabsWebSocketProxy and UElevenLabsConversationalAgentComponent.
Sends {"type":"user_message","text":"..."} — agent replies with audio + text.
Commit: b489d11
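For reference, the envelope above can be sketched in plain C++. The helper names (`EscapeJsonString`, `BuildUserMessage`) are illustrative only, not the plugin's actual functions, which presumably build the payload with UE's JSON utilities:

```cpp
#include <string>

// Hypothetical helper: escape characters that would break a JSON string literal.
std::string EscapeJsonString(const std::string& In)
{
    std::string Out;
    for (char C : In)
    {
        switch (C)
        {
        case '"':  Out += "\\\""; break;
        case '\\': Out += "\\\\"; break;
        case '\n': Out += "\\n";  break;
        case '\r': Out += "\\r";  break;
        case '\t': Out += "\\t";  break;
        default:   Out += C;      break;
        }
    }
    return Out;
}

// Wrap user text in the {"type":"user_message","text":"..."} envelope.
std::string BuildUserMessage(const std::string& Text)
{
    return "{\"type\":\"user_message\",\"text\":\"" + EscapeJsonString(Text) + "\"}";
}
```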
12. Binary WebSocket Frame Fix
User reported: "Received unexpected binary WebSocket frame" warnings.
Root cause: ElevenLabs sends ALL WebSocket frames as binary, never text.
OnMessage (text handler) never fires. OnRawMessage must handle everything.
Fix: Implemented OnWsBinaryMessage with fragment reassembly (BinaryFrameBuffer).
Commit: 669c503
13. JSON vs PCM Discrimination Fix
After binary fix: "Failed to parse WebSocket message as JSON" errors.
Root cause: Binary frames contain BOTH JSON control messages AND raw PCM audio.
Fix: Peek at byte[0] of assembled buffer:
- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`
Commit: 4834567
14. Documentation Updated to v1.1.0
Full rewrite of .claude/PS_AI_Agent_ElevenLabs_Documentation.md:
- Added Changelog section (v1.0.0 / v1.1.0)
- Updated audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected turn mode docs, transcript docs, `OnAgentConnected` timing
- New troubleshooting entries
Commit: e464cfe
15. Test Blueprint Asset Updated
test_AI_Actor.uasset updated in UE Editor.
Commit: 99017f4
Git History (this session)
| Hash | Message |
|---|---|
| f0055e8 | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| 4d6ae10 | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| b114ab0 | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| bb1a857 | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| 3b98edc | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| c833ccd | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| 1b72026 | Add PowerPoint documentation and update .gitignore |
| bbeb429 | ElevenLabs API reference doc |
| dbd6161 | TestMap, test actor, DefaultEngine.ini, memory update |
| ae2c9b9 | Fix 3 WebSocket protocol bugs |
| b489d11 | Add SendTextMessage |
| 669c503 | Fix binary WebSocket frames |
| 4834567 | Fix JSON vs binary frame discrimination |
| e464cfe | Update documentation to v1.1.0 |
| 99017f4 | Update test_AI_Actor blueprint asset |
Key Technical Decisions Made This Session
| Decision | Reason |
|---|---|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| `WebSockets` in Build.cs only | It's a module (no .uplugin file); declaring it in `.uplugin` causes a build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside callback |
| `EAllowShrinking::No` | Bool overload of `RemoveAt` deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5s at 60fps heuristic to detect agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in ElevenLabs API |
Session 3 — 2026-02-19 (bug fixes from live testing)
16. Three Runtime Bugs Fixed (v1.2.0)
User reported after live testing:
- AI speaks twice — every audio response played double
- Cannot speak — mic capture didn't reach ElevenLabs
- Latency — requested `enable_intermediate_response: true`
Bug 1 Root Cause — Double Audio:
UE's libwebsockets backend fires both OnMessage() (text callback) and OnRawMessage() (binary callback) for the same incoming frame.
We had bound both WebSocket->OnMessage() and WebSocket->OnRawMessage() in Connect().
Result: every audio frame was decoded and enqueued twice → played twice.
Fix: Remove OnMessage binding entirely. OnRawMessage now handles all frames (JSON control messages peeked via first byte, raw PCM otherwise).
Bug 2 Root Cause — Mic Silent:
ElevenLabs requires a conversation_initiation_client_data message sent immediately after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.
Fix: Send conversation_initiation_client_data in OnWsConnected() before any other message.
Bug 2 Secondary — Delegate Stacking:
StartListening() called Mic->OnAudioCaptured.AddUObject(this, ...) without first removing existing bindings. If called more than once (e.g. after reconnect), delegates stack up and audio is sent multiple times per frame.
Fix: Add Mic->OnAudioCaptured.RemoveAll(this) before AddUObject in StartListening().
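The stacking behaviour is easy to reproduce with a plain list of callbacks. This is a toy model, not UE delegate code; `FToyDelegate` and `CountSendsAfterRebinding` are invented names used only to show why the `RemoveAll` guard matters:

```cpp
#include <functional>
#include <vector>

// Toy stand-in for a UE multicast delegate.
struct FToyDelegate
{
    std::vector<std::function<void()>> Handlers;
    void Add(std::function<void()> Fn) { Handlers.push_back(std::move(Fn)); }
    void RemoveAll()                   { Handlers.clear(); }
    void Broadcast() const             { for (const auto& Fn : Handlers) Fn(); }
};

// Simulate StartListening() being called twice (e.g. after a reconnect),
// then one mic callback firing. Without clearing first, sends double up.
int CountSendsAfterRebinding(bool bRemoveFirst)
{
    FToyDelegate OnAudioCaptured;
    int Sends = 0;
    for (int Rebinds = 0; Rebinds < 2; ++Rebinds)
    {
        if (bRemoveFirst) OnAudioCaptured.RemoveAll(); // the v1.2.0 guard
        OnAudioCaptured.Add([&Sends] { ++Sends; });
    }
    OnAudioCaptured.Broadcast(); // one captured audio buffer
    return Sends;
}
```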
Bug 3 — Latency:
Added "enable_intermediate_response": true inside custom_llm_extra_body of the conversation_initiation_client_data message. Also added optimize_streaming_latency: 3 in conversation_config_override.tts.
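For reference, an assumed shape of the resulting init message, reconstructed only from the fields named in this log; the exact nesting may differ from the real ElevenLabs schema and from what the plugin emits, so treat it as illustrative. The turn-mode string (`"client_vad"`/`"server_vad"`, see Session 4) also lives in this message, at a position determined by the plugin code:

```cpp
#include <string>

// Assumed payload shape, pieced together from the fields mentioned above.
const std::string InitMessage = R"({
  "type": "conversation_initiation_client_data",
  "custom_llm_extra_body": { "enable_intermediate_response": true },
  "conversation_config_override": {
    "tts": { "optimize_streaming_latency": 3 }
  }
})";
```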
Files changed:
- `ElevenLabsWebSocketProxy.cpp`:
  - `Connect()`: removed `OnMessage` binding
  - `OnWsConnected()`: now sends full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartListening()`: added `RemoveAll` guard before delegate binding
Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)
17. Two More Bugs Found and Fixed (v1.3.0)
User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.
Analysis of log:
- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio reaching ElevenLabs but being ignored
Bug A — TurnMode mismatch in conversation_initiation_client_data:
OnWsConnected() hardcoded "mode": "server_vad" in the init message regardless of the
component's TurnMode setting. User's Blueprint uses Client turn mode (push-to-talk),
so the server was configured for server_vad while the client sent client_vad audio signals.
Fix: Read TurnMode field on the proxy (set from the component before Connect()).
Translate EElevenLabsTurnMode::Client → "client_vad", Server → "server_vad".
Bug B — user_activity never sent continuously:
In client VAD mode, ElevenLabs requires user_activity to be sent continuously
alongside every audio chunk to keep the server's VAD aware the user is speaking.
SendUserTurnStart() sent it once on key press, but never again during speech.
Server-side, without continuous user_activity, the server treated the audio as noise.
Fix: In SendAudioChunk(), automatically send user_activity before each audio chunk
when TurnMode == Client. This keeps the signal continuous for the full duration of speech.
When the user releases T, StopListening() stops the mic → audio stops → user_activity
stops → server detects silence and triggers the agent response.
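The v1.3.0 send path can be sketched as follows; `FProxySketch` and its members are hypothetical names modelling the behaviour, with a string list standing in for the WebSocket:

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ETurnMode { Server, Client };

// Sketch: in Client turn mode, a user_activity signal precedes every audio
// chunk so the server-side VAD knows the user is still speaking.
struct FProxySketch
{
    ETurnMode TurnMode = ETurnMode::Server;
    std::vector<std::string> SentMessages; // stand-in for the WebSocket

    void SendAudioChunk(const std::vector<uint8_t>& Pcm)
    {
        if (TurnMode == ETurnMode::Client)
            SentMessages.push_back(R"({"type":"user_activity"})");
        SentMessages.push_back("<binary pcm: " + std::to_string(Pcm.size()) + " bytes>");
    }
};
```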
Bug C — TurnMode not propagated to proxy:
UElevenLabsConversationalAgentComponent never told the proxy what TurnMode to use.
Added WebSocketProxy->TurnMode = TurnMode before Connect() in StartConversation().
Files changed:
- `ElevenLabsWebSocketProxy.h`: added public `TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
  - `OnWsConnected()`: use `TurnMode` to set the correct mode string in the init message
  - `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`
Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)
18. Root Cause Found and Fixed (v1.4.0)
Log analysis revealed the true root cause:
Exact sequence:
1. OnConnected → `bAutoStartListening=true` → `StartListening()` → `bIsListening=true`, mic opens
2. OnAgentStoppedSpeaking → Blueprint calls `StartListening()` → `bIsListening` guard → no-op (already open)
3. User presses T → `StartListening()` → `bIsListening` guard → no-op
4. User releases T → `StopListening()` → `bIsListening=false`, mic CLOSES
5. User presses T → `StartListening()` → NOW opens mic (was closed)
6. User releases T → `StopListening()` → mic closes — but ElevenLabs never got audio
Root cause: bAutoStartListening = true opens the mic on connect and sets bIsListening = true.
In Client/push-to-talk mode, every T-press hits the bIsListening guard and does nothing.
Every T-release closes the auto-started mic. The mic was never open during actual speech.
Fix: HandleConnected() now only calls StartListening() when TurnMode == Server.
In Client mode, bAutoStartListening is ignored — the user controls listening via T key.
File changed:
- `ElevenLabsConversationalAgentComponent.cpp`: `HandleConnected()`: guard `bAutoStartListening` with a `TurnMode == Server` check
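A toy model of the guard (invented names, plain C++; the real code lives in `HandleConnected()`): auto-listen only applies in Server turn mode, so push-to-talk keeps the mic closed until the user presses T.

```cpp
enum class ETurnMode { Server, Client };

// Minimal model of the v1.4.0 fix.
struct FAgentSketch
{
    ETurnMode TurnMode = ETurnMode::Client;
    bool bAutoStartListening = true;
    bool bIsListening = false;

    void StartListening() { bIsListening = true; }   // T pressed
    void StopListening()  { bIsListening = false; }  // T released

    void HandleConnected()
    {
        // v1.4.0: bAutoStartListening is honoured only in Server mode,
        // so Client (push-to-talk) mode never auto-opens the mic.
        if (bAutoStartListening && TurnMode == ETurnMode::Server)
            StartListening();
    }
};
```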
Session 6 — 2026-02-19 (audio chunk size fix)
19. Mic Audio Chunk Accumulation (v1.5.0)
Root cause (from diagnostic log in Session 5):
Log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses:
- 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- Tiny 5ms fragments arrived at the server but were silently ignored → agent never responded
Fix applied:
Added MicAccumulationBuffer TArray<uint8> to UElevenLabsConversationalAgentComponent.
OnMicrophoneDataCaptured() appends each callback's converted bytes and only calls SendAudioChunk
when >= MicChunkMinBytes (3200 bytes = 100ms) have accumulated.
StopListening() flushes any remaining bytes in the buffer before sending SendUserTurnEnd(),
so the last partial chunk of speech is never dropped.
HandleDisconnected() clears the buffer to prevent stale data on reconnect.
Files changed:
- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `OnMicrophoneDataCaptured()`: accumulate → send when threshold reached
  - `StopListening()`: flush remainder before end-of-turn signal
  - `HandleDisconnected()`: clear accumulation buffer
Commit: 91cf5b1
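The accumulation logic described above can be sketched in standalone C++ (names mirror the log but the struct itself is a hypothetical model, not the component's code):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of the v1.5.0 fix: buffer ~5ms WASAPI callbacks until at least
// 3200 bytes (100ms of 16kHz 16-bit mono PCM) are available, flushing the
// remainder when listening stops so the tail of speech is not dropped.
struct FMicAccumulator
{
    static constexpr size_t MicChunkMinBytes = 3200;
    std::vector<uint8_t> Buffer;
    std::function<void(const std::vector<uint8_t>&)> SendAudioChunk;

    void OnMicrophoneDataCaptured(const uint8_t* Data, size_t Size)
    {
        Buffer.insert(Buffer.end(), Data, Data + Size);
        if (Buffer.size() >= MicChunkMinBytes)
        {
            SendAudioChunk(Buffer);
            Buffer.clear();
        }
    }

    // Called from StopListening() before the end-of-turn signal.
    void Flush()
    {
        if (!Buffer.empty()) { SendAudioChunk(Buffer); Buffer.clear(); }
    }
};
```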
Next Steps (not done yet)
- Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- Test `SendTextMessage` end-to-end in Blueprint
- Add lip-sync support (future)
- Add session memory / conversation history (future, matching Convai)
- Add environment/action context support (future)
- Consider Signed URL Mode backend implementation