# Session Log — 2026-02-19

**Project**: PS_AI_Agent (Unreal Engine 5.5)
**Machine**: Desktop PC (j_foucher)
**Working directory**: `E:\ASTERION\GIT\PS_AI_Agent`

---

## Conversation Summary

### 1. Initial Request

User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE 5.5.
Reference: the existing Convai plugin (gRPC-based, more complex). Goal: a simpler version built on ElevenLabs.
Plugin name requested: `PS_AI_Agent_ElevenLabs`.

### 2. Codebase Exploration

Explored the Convai plugin source at `ConvAI/Convai/` to understand:

- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction

### 3. Plugin Created

All source files written from scratch under:
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`

Files created:

- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WebSocket session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered

Commit: `f0055e8`

### 4. Memory Files Created

To allow context recovery on any machine (including the laptop):

- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- Local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`

Commit: `f0055e8` (with plugin), updated in `4d6ae10`

### 5. .gitignore Updated

Added to the existing ignores:

- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln` / `*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`

Commits: `4d6ae10`, `b114ab0`

### 6. Compile — First Attempt (Errors Found)

Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:

- `WebSockets` was listed in the `.uplugin` — it's a module, not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- The `FOnAudioCaptureFunction` callback uses `const void*`, not `const float*` → fixed the cast
- `TArray::RemoveAt(0, N, false)` is deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in the `.uplugin` Plugins array → added

Commit: `bb1a857`

### 7. Compile — Success

Clean build, no warnings, no errors.
Output: `Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll`

Memory updated with confirmed UE 5.5 API patterns. Commit: `3b98edc`

### 8. Documentation — Markdown

Full reference doc written to `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`, covering Installation, Project Settings, Quick Start (BP + C++), Components Reference, Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, and Troubleshooting.

Commit: `c833ccd`

### 9. Documentation — PowerPoint

20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):

- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in the repo root
- Covers all sections with visual layout, code blocks, flow diagrams, and colour-coded elements
- Generator script `generate_pptx.py` excluded from git via `.gitignore`

Commit: `1b72026`

---

## Session 2 — 2026-02-19 (continued context)

### 10. API vs Implementation Cross-Check (3 bugs found and fixed)

Cross-referenced `elevenlabs_api_reference.md` against the plugin source. Found 3 protocol bugs:

**Bug 1 — Transcript fields wrong:**

- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`

**Bug 2 — Pong format wrong:**

- `event_id` was nested in `pong_event{}` → must be top-level

**Bug 3 — Client turn mode messages don't exist:**

- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)
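
As a UE-independent sketch of the corrected wire shapes (plain `std::string` here stands in for whatever JSON writer the plugin actually uses):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Corrected pong: event_id sits at the top level, not nested in a pong_event{} wrapper.
std::string MakePongMessage(int64_t EventId)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}

// Corrected client turn signal: "user_activity" is the valid client-side message;
// there is no "user_turn_end" -- the server infers end-of-turn from silence.
std::string MakeUserActivityMessage()
{
    return "{\"type\":\"user_activity\"}";
}
```
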

Commit: `ae2c9b9`

### 11. SendTextMessage Added

User asked for text input to the agent for testing (without a mic).
Added `SendTextMessage(FString)` to `UElevenLabsWebSocketProxy` and `UElevenLabsConversationalAgentComponent`.
Sends `{"type":"user_message","text":"..."}` — the agent replies with audio + text.

Commit: `b489d11`

### 12. Binary WebSocket Frame Fix

User reported `"Received unexpected binary WebSocket frame"` warnings.
Root cause: ElevenLabs sends **all** WebSocket frames as binary, never text.
`OnMessage` (the text handler) never fires; `OnRawMessage` must handle everything.

Fix: Implemented `OnWsBinaryMessage` with fragment reassembly (`BinaryFrameBuffer`).
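
The reassembly logic can be sketched outside UE like this (`std::vector` in place of `TArray`; the real handler is a member of the proxy, and `BytesRemaining` follows the convention of a raw-message callback where `0` marks the final fragment):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulates raw WebSocket fragments until a full frame is available.
struct FBinaryFrameAssembler
{
    std::vector<uint8_t> BinaryFrameBuffer;

    // Returns true when a complete frame has been assembled into OutFrame.
    bool OnFragment(const uint8_t* Data, size_t Size, size_t BytesRemaining,
                    std::vector<uint8_t>& OutFrame)
    {
        BinaryFrameBuffer.insert(BinaryFrameBuffer.end(), Data, Data + Size);
        if (BytesRemaining > 0)
            return false;              // more fragments of this frame still coming
        OutFrame.swap(BinaryFrameBuffer);
        BinaryFrameBuffer.clear();     // ready for the next frame
        return true;
    }
};
```
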

Commit: `669c503`

### 13. JSON vs PCM Discrimination Fix

After the binary fix: `"Failed to parse WebSocket message as JSON"` errors.
Root cause: binary frames contain BOTH JSON control messages AND raw PCM audio.

Fix: peek at byte[0] of the assembled buffer:

- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`
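
The byte-peek routing is small enough to show in full (standalone sketch, not the plugin's actual member function):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Control messages are UTF-8 JSON objects, so a frame starting with '{' (0x7B)
// goes to the JSON handler; anything else is treated as raw PCM audio.
enum class EFrameKind { Json, Pcm };

EFrameKind ClassifyFrame(const uint8_t* Frame, size_t Size)
{
    return (Size > 0 && Frame[0] == 0x7B) ? EFrameKind::Json : EFrameKind::Pcm;
}
```

Note the design assumption: a PCM frame whose first sample byte happens to be 0x7B would be misrouted to the JSON path, where it would presumably surface as a parse-failure warning rather than play as audio.
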

Commit: `4834567`

### 14. Documentation Updated to v1.1.0

Full rewrite of `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:

- Added a Changelog section (v1.0.0 / v1.1.0)
- Updated the audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected the turn mode docs, transcript docs, and `OnAgentConnected` timing
- New troubleshooting entries

Commit: `e464cfe`

### 15. Test Blueprint Asset Updated

`test_AI_Actor.uasset` updated in the UE Editor.

Commit: `99017f4`

---

## Git History (this session)

| Hash | Message |
|------|---------|
| `f0055e8` | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| `4d6ae10` | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| `b114ab0` | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| `bb1a857` | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| `3b98edc` | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| `c833ccd` | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| `1b72026` | Add PowerPoint documentation and update .gitignore |
| `bbeb429` | ElevenLabs API reference doc |
| `dbd6161` | TestMap, test actor, DefaultEngine.ini, memory update |
| `ae2c9b9` | Fix 3 WebSocket protocol bugs |
| `b489d11` | Add SendTextMessage |
| `669c503` | Fix binary WebSocket frames |
| `4834567` | Fix JSON vs binary frame discrimination |
| `e464cfe` | Update documentation to v1.1.0 |
| `99017f4` | Update test_AI_Actor blueprint asset |

---

## Key Technical Decisions Made This Session

| Decision | Reason |
|----------|--------|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| `WebSockets` in Build.cs only | It's a module (no `.uplugin` file); declaring it in `.uplugin` causes a build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside the callback |
| `EAllowShrinking::No` | Bool overload of `RemoveAt` deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5 s at 60 fps heuristic to detect when the agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in the ElevenLabs API |


---

## Session 3 — 2026-02-19 (bug fixes from live testing)

### 16. Three Runtime Bugs Fixed (v1.2.0)

User reported after live testing:

1. **AI speaks twice** — every audio response played double
2. **Cannot speak** — mic capture didn't reach ElevenLabs
3. **Latency** — requested `enable_intermediate_response: true`

**Bug 1 Root Cause — Double Audio:**
UE's libwebsockets backend fires **both** `OnMessage()` (text callback) **and** `OnRawMessage()` (binary callback) for the same incoming frame. We had bound both `WebSocket->OnMessage()` and `WebSocket->OnRawMessage()` in `Connect()`. Result: every audio frame was decoded and enqueued twice → played twice.

Fix: **Remove the `OnMessage` binding entirely.** `OnRawMessage` now handles all frames (JSON control messages detected by peeking at the first byte, raw PCM otherwise).

**Bug 2 Root Cause — Mic Silent:**
ElevenLabs requires a `conversation_initiation_client_data` message sent **immediately** after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not an optional one.

Fix: Send `conversation_initiation_client_data` in `OnWsConnected()` before any other message.

**Bug 2 Secondary — Delegate Stacking:**
`StartListening()` called `Mic->OnAudioCaptured.AddUObject(this, ...)` without first removing existing bindings. If called more than once (e.g. after a reconnect), delegates stack up and audio is sent multiple times per frame.

Fix: Add `Mic->OnAudioCaptured.RemoveAll(this)` before `AddUObject` in `StartListening()`.

**Bug 3 — Latency:**
Added `"enable_intermediate_response": true` inside `custom_llm_extra_body` of the `conversation_initiation_client_data` message. Also added `optimize_streaming_latency: 3` in `conversation_config_override.tts`.

**Files changed:**

- `ElevenLabsWebSocketProxy.cpp`:
  - `Connect()`: removed `OnMessage` binding
  - `OnWsConnected()`: now sends the full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartListening()`: added `RemoveAll` guard before delegate binding

---

## Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)

### 17. Two More Bugs Found and Fixed (v1.3.0)

User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.

**Analysis of the log:**

- Blueprint is correct: T pressed → StartListening, T released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio is reaching ElevenLabs but being ignored

**Bug A — TurnMode mismatch in conversation_initiation_client_data:**
`OnWsConnected()` hardcoded `"mode": "server_vad"` in the init message regardless of the component's `TurnMode` setting. The user's Blueprint uses Client turn mode (push-to-talk), so the server was configured for server_vad while the client sent client_vad audio signals.

Fix: Read the `TurnMode` field on the proxy (set from the component before `Connect()`). Translate `EElevenLabsTurnMode::Client` → `"client_vad"`, Server → `"server_vad"`.

**Bug B — user_activity never sent continuously:**
In client VAD mode, ElevenLabs requires `user_activity` to be sent **continuously** alongside every audio chunk to keep the server's VAD aware the user is speaking. `SendUserTurnStart()` sent it once on key press, but never again during speech. Without continuous `user_activity`, the server treated the audio as noise.

Fix: In `SendAudioChunk()`, automatically send `user_activity` before each audio chunk when `TurnMode == Client`. This keeps the signal continuous for the full duration of speech. When the user releases T, `StopListening()` stops the mic → audio stops → `user_activity` stops → server detects silence and triggers the agent response.
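
The Bug B fix can be sketched without UE types. `SendJson` and `TransmitAudio` below are hypothetical stand-ins for the proxy's actual send calls (the wire encoding of the audio chunk itself is left abstract):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class ETurnMode { Server, Client };

// In Client (push-to-talk) mode, every audio chunk is preceded by a
// user_activity message so the server's VAD keeps treating the incoming
// audio as intentional speech; in Server mode, no extra signal is needed.
struct FProxySketch
{
    ETurnMode TurnMode = ETurnMode::Server;
    std::vector<std::string> SentJson;  // records control messages (for inspection)
    size_t SentAudioChunks = 0;

    void SendJson(const std::string& Msg) { SentJson.push_back(Msg); }
    void TransmitAudio(const std::vector<uint8_t>&) { ++SentAudioChunks; }

    void SendAudioChunk(const std::vector<uint8_t>& Pcm)
    {
        if (TurnMode == ETurnMode::Client)
            SendJson("{\"type\":\"user_activity\"}"); // keep client VAD signal continuous
        TransmitAudio(Pcm);
    }
};
```
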

**Bug C — TurnMode not propagated to proxy:**
`UElevenLabsConversationalAgentComponent` never told the proxy which TurnMode to use. Added `WebSocketProxy->TurnMode = TurnMode` before `Connect()` in `StartConversation()`.

**Files changed:**

- `ElevenLabsWebSocketProxy.h`: added public `TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
  - `OnWsConnected()`: use `TurnMode` to set the correct mode string in the init message
  - `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`

---

## Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)

### 18. Root Cause Found and Fixed (v1.4.0)

Log analysis revealed the true root cause:

**Exact sequence:**

```
OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
User presses T → StartListening() → bIsListening guard → no-op
User releases T → StopListening() → bIsListening=false, mic CLOSES
User presses T → StartListening() → NOW opens mic (was closed)
User releases T → StopListening() → mic closes — but ElevenLabs never got audio
```

**Root cause:** `bAutoStartListening = true` opens the mic on connect and sets `bIsListening = true`. In Client/push-to-talk mode, every T-press hits the `bIsListening` guard and does nothing, and every T-release closes the auto-started mic. The mic was never open during actual speech.

**Fix:** `HandleConnected()` now only calls `StartListening()` when `TurnMode == Server`. In Client mode, `bAutoStartListening` is ignored — the user controls listening via the T key.

**File changed:**

- `ElevenLabsConversationalAgentComponent.cpp`:
  - `HandleConnected()`: guard `bAutoStartListening` with a `TurnMode == Server` check

---

## Session 6 — 2026-02-19 (audio chunk size fix)

### 19. Mic Audio Chunk Accumulation (v1.5.0)

**Root cause (from the diagnostic log in Session 5):**
The log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses.

- 158 bytes = 79 samples = ~5 ms of audio at 16 kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5 ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100 ms / 3200 bytes)
- Tiny 5 ms fragments arrived at the server but were silently ignored → the agent never responded

**Fix applied:**
Added a `TArray<uint8> MicAccumulationBuffer` to `UElevenLabsConversationalAgentComponent`. `OnMicrophoneDataCaptured()` appends each callback's converted bytes and only calls `SendAudioChunk` when `>= MicChunkMinBytes` (3200 bytes = 100 ms) have accumulated.

`StopListening()` flushes any remaining bytes in the buffer before sending `SendUserTurnEnd()`, so the last partial chunk of speech is never dropped.

`HandleDisconnected()` clears the buffer to prevent stale data on reconnect.
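
The accumulation logic above can be sketched outside UE (`std::vector` in place of `TArray`, a counter in place of the real `SendAudioChunk`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Coalesces ~5 ms (158-byte) WASAPI callbacks into >= 3200-byte (~100 ms)
// chunks, the minimum the ElevenLabs VAD/STT pipeline reportedly accepts.
struct FMicAccumulator
{
    static constexpr size_t MicChunkMinBytes = 3200; // 100 ms of 16 kHz 16-bit mono
    std::vector<uint8_t> MicAccumulationBuffer;
    size_t ChunksSent = 0;

    void SendChunk() { ++ChunksSent; MicAccumulationBuffer.clear(); }

    // Called once per mic callback (~158 bytes each in the observed log).
    void OnCaptured(const uint8_t* Data, size_t Size)
    {
        MicAccumulationBuffer.insert(MicAccumulationBuffer.end(), Data, Data + Size);
        if (MicAccumulationBuffer.size() >= MicChunkMinBytes)
            SendChunk();
    }

    // On StopListening(): flush the partial remainder so the tail of speech survives.
    void Flush()
    {
        if (!MicAccumulationBuffer.empty())
            SendChunk();
    }
};
```

With 158-byte callbacks, 20 callbacks accumulate 3160 bytes (no send); the 21st crosses 3200 and triggers the first chunk.
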

**Files changed:**

- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `OnMicrophoneDataCaptured()`: accumulate → send when the threshold is reached
  - `StopListening()`: flush the remainder before the end-of-turn signal
  - `HandleDisconnected()`: clear the accumulation buffer

Commit: `91cf5b1`

---

## Next Steps (not done yet)

- [ ] Test v1.5.0 in the Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)
- [ ] Add session memory / conversation history (future, matching Convai)
- [ ] Add environment/action context support (future)
- [ ] Consider Signed URL Mode backend implementation