Update memory: document v1.5.0 mic chunk size fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

## Plugin Status

- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- v1.4.0 — push-to-talk fully fixed: bAutoStartListening now ignored in Client turn mode
- v1.5.0 — mic audio chunk size fixed: WASAPI ~5ms callbacks accumulated to 100ms before sending
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- Connection confirmed working end-to-end; audio receive path functional
- `conversation_initiation_client_data` now sent immediately on WS connect (required for mic + latency)
## Audio Chunk Size — CRITICAL

- WASAPI fires mic callbacks every ~5ms → **158 bytes** at 16kHz 16-bit mono
- ElevenLabs VAD/STT requires **≥3200 bytes (100ms)** per chunk; smaller chunks are silently ignored
- Fix: `MicAccumulationBuffer` in the component accumulates chunks; sends only when `>= MicChunkMinBytes` (3200)
- `StopListening()` flushes the remainder so the final partial chunk is never dropped before end-of-turn
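As a sanity check on the numbers above, a minimal plain-C++ helper (hypothetical names, not part of the plugin) reproduces the byte math for 16kHz 16-bit mono PCM:

```cpp
#include <cassert>
#include <cstddef>

// PCM byte math for 16 kHz, 16-bit (2 bytes/sample), mono audio.
constexpr std::size_t kSampleRate     = 16000;
constexpr std::size_t kBytesPerSample = 2;

// Bytes needed to hold Ms milliseconds of audio.
constexpr std::size_t BytesForMs(std::size_t Ms)
{
    return kSampleRate * kBytesPerSample * Ms / 1000;
}

// Whole milliseconds represented by a byte count (integer division).
constexpr std::size_t MsForBytes(std::size_t Bytes)
{
    return Bytes * 1000 / (kSampleRate * kBytesPerSample);
}
```

This confirms 100ms = 3200 bytes, and that a 158-byte WASAPI callback is just under 5ms of audio.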
## ElevenLabs WebSocket Protocol Notes

- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text). UE fires both for the same frame, so binding both caused the double-audio bug
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode (`client_vad`): send `{"type":"user_activity"}` **with every audio chunk** (not just once) — the server needs a continuous signal to know the user is speaking; there is no explicit end-of-turn message, so when chunks stop, silence is detected and the agent responds
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
- **MUST send `conversation_initiation_client_data` immediately after WS connect** — without it, the server won't process client audio (mic appears dead)
- `conversation_initiation_client_data` payload: `conversation_config_override.agent.turn.mode`, `conversation_config_override.tts.optimize_streaming_latency`, `custom_llm_extra_body.enable_intermediate_response`
- `enable_intermediate_response: true` in `custom_llm_extra_body` reduces time-to-first-audio latency
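The first-byte discrimination rule above can be modeled outside UE with a small sketch (plain C++; the enum and function names are illustrative — the real code peeks the buffer handed to `OnRawMessage`):

```cpp
#include <cstddef>
#include <cstdint>

enum class EFrameKind { Json, Pcm, Empty };

// Every ElevenLabs frame arrives as a binary WS frame. byte[0] == '{'
// (0x7B) marks a JSON control message; anything else is raw PCM audio.
EFrameKind ClassifyFrame(const std::uint8_t* Data, std::size_t Size)
{
    if (Size == 0)
    {
        return EFrameKind::Empty;
    }
    return Data[0] == 0x7B ? EFrameKind::Json : EFrameKind::Pcm;
}
```

Note the check runs on the fully reassembled frame (after `BytesRemaining == 0`), not on each fragment.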
## API Keys / Secrets

- ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor
---
## Session 3 — 2026-02-19 (bug fixes from live testing)

### 16. Three Runtime Bugs Fixed (v1.2.0)

User reported after live testing:

1. **AI speaks twice** — every audio response played double
2. **Cannot speak** — mic capture didn't reach ElevenLabs
3. **Latency** — requested `enable_intermediate_response: true`
**Bug 1 Root Cause — Double Audio:**
UE's libwebsockets backend fires **both** `OnMessage()` (text callback) **and** `OnRawMessage()` (binary callback) for the same incoming frame. We had bound both `WebSocket->OnMessage()` and `WebSocket->OnRawMessage()` in `Connect()`, so every audio frame was decoded and enqueued twice → played twice.

Fix: **Remove the `OnMessage` binding entirely.** `OnRawMessage` now handles all frames (JSON control messages detected by peeking the first byte, raw PCM otherwise).
**Bug 2 Root Cause — Mic Silent:**
ElevenLabs requires a `conversation_initiation_client_data` message sent **immediately** after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.

Fix: Send `conversation_initiation_client_data` in `OnWsConnected()` before any other message.
**Bug 2 Secondary — Delegate Stacking:**
`StartListening()` called `Mic->OnAudioCaptured.AddUObject(this, ...)` without first removing existing bindings. If called more than once (e.g. after a reconnect), delegates stack up and audio is sent multiple times per frame.

Fix: Call `Mic->OnAudioCaptured.RemoveAll(this)` before `AddUObject` in `StartListening()`.
**Bug 3 — Latency:**
Added `"enable_intermediate_response": true` inside `custom_llm_extra_body` of the `conversation_initiation_client_data` message, and `optimize_streaming_latency: 3` in `conversation_config_override.tts`.
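Assembled from the payload paths listed in the protocol notes, the init message plausibly looks like the sketch below (the exact envelope and values should be checked against the ElevenLabs docs; `client_vad` shown for push-to-talk):

```json
{
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "agent": { "turn": { "mode": "client_vad" } },
    "tts": { "optimize_streaming_latency": 3 }
  },
  "custom_llm_extra_body": { "enable_intermediate_response": true }
}
```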
**Files changed:**
- `ElevenLabsWebSocketProxy.cpp`:
  - `Connect()`: removed `OnMessage` binding
  - `OnWsConnected()`: now sends full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartListening()`: added `RemoveAll` guard before delegate binding

---
## Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)

### 17. Two More Bugs Found and Fixed (v1.3.0)

User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.

**Analysis of log:**
- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio is reaching ElevenLabs but being ignored
**Bug A — TurnMode mismatch in conversation_initiation_client_data:**
`OnWsConnected()` hardcoded `"mode": "server_vad"` in the init message regardless of the component's `TurnMode` setting. The user's Blueprint uses Client turn mode (push-to-talk), so the server was configured for server_vad while the client sent client_vad audio signals.

Fix: Read the `TurnMode` field on the proxy (set from the component before `Connect()`). Translate `EElevenLabsTurnMode::Client` → `"client_vad"`, Server → `"server_vad"`.
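The translation in the Bug A fix is a two-way mapping; a plain-C++ model of it (enum and function names illustrative, standing in for `EElevenLabsTurnMode`) is:

```cpp
#include <string>

// Stand-in for the plugin's EElevenLabsTurnMode enum.
enum class ETurnMode { Client, Server };

// Translate the component's turn mode into the protocol string
// placed at conversation_config_override.agent.turn.mode.
std::string TurnModeString(ETurnMode Mode)
{
    return Mode == ETurnMode::Client ? "client_vad" : "server_vad";
}
```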
**Bug B — user_activity never sent continuously:**
In client VAD mode, ElevenLabs requires `user_activity` to be sent **continuously** alongside every audio chunk so the server's VAD knows the user is still speaking. `SendUserTurnStart()` sent it once on key press, but never again during speech. Without the continuous `user_activity`, the server treated the audio as noise.

Fix: In `SendAudioChunk()`, automatically send `user_activity` before each audio chunk when `TurnMode == Client`, keeping the signal continuous for the full duration of speech. When the user releases T, `StopListening()` stops the mic → audio stops → `user_activity` stops → the server detects silence and triggers the agent response.
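The Bug B fix can be sketched outside UE with a fake socket that records outgoing frames (all names here are illustrative; the real logic lives in the proxy's `SendAudioChunk()`):

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class ETurnMode { Client, Server };

// Records outgoing frames so the pairing behaviour can be inspected.
struct FFakeSocket
{
    std::vector<std::string> Sent;
    void SendText(const std::string& Json) { Sent.push_back(Json); }
    void SendBinary(std::size_t NumBytes)  { Sent.push_back("pcm:" + std::to_string(NumBytes)); }
};

// In Client turn mode, a user_activity control message precedes every
// audio chunk so the server's VAD keeps seeing the user as speaking.
void SendAudioChunk(FFakeSocket& Ws, ETurnMode Mode, std::size_t NumBytes)
{
    if (Mode == ETurnMode::Client)
    {
        Ws.SendText(R"({"type":"user_activity"})");
    }
    Ws.SendBinary(NumBytes);
}
```

Sending the signal per chunk (rather than once per key press) is what makes "audio stops → silence detected" work as the implicit end-of-turn.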
**Bug C — TurnMode not propagated to proxy:**
`UElevenLabsConversationalAgentComponent` never told the proxy which TurnMode to use. Added `WebSocketProxy->TurnMode = TurnMode` before `Connect()` in `StartConversation()`.
**Files changed:**
- `ElevenLabsWebSocketProxy.h`: added public `TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
  - `OnWsConnected()`: use `TurnMode` to set the correct mode string in the init message
  - `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`

---
## Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)

### 18. Root Cause Found and Fixed (v1.4.0)

Log analysis revealed the true root cause:

**Exact sequence:**

```
OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
User presses T → StartListening() → bIsListening guard → no-op
User releases T → StopListening() → bIsListening=false, mic CLOSES
User presses T → StartListening() → NOW opens mic (was closed)
User releases T → StopListening() → mic closes — but ElevenLabs never got audio
```
**Root cause:** `bAutoStartListening = true` opens the mic on connect and sets `bIsListening = true`. In Client/push-to-talk mode, every T-press hits the `bIsListening` guard and does nothing, and every T-release closes the auto-started mic. The mic was never open during actual speech.

**Fix:** `HandleConnected()` now only calls `StartListening()` when `TurnMode == Server`. In Client mode, `bAutoStartListening` is ignored — the user controls listening via the T key.
**File changed:**
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `HandleConnected()`: guard `bAutoStartListening` with a `TurnMode == Server` check

---
## Session 6 — 2026-02-19 (audio chunk size fix)

### 19. Mic Audio Chunk Accumulation (v1.5.0)

**Root cause (from the diagnostic log in Session 5):**
The log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses.
- 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- The tiny 5ms fragments arrived at the server but were silently ignored → the agent never responded
**Fix applied:**
Added a `TArray<uint8> MicAccumulationBuffer` to `UElevenLabsConversationalAgentComponent`. `OnMicrophoneDataCaptured()` appends each callback's converted bytes and only calls `SendAudioChunk` once `>= MicChunkMinBytes` (3200 bytes = 100ms) have accumulated.

`StopListening()` flushes any remaining bytes in the buffer before sending `SendUserTurnEnd()`, so the last partial chunk of speech is never dropped.

`HandleDisconnected()` clears the buffer to prevent stale data on reconnect.
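The accumulate/flush/reset behaviour described above can be modeled as a small plain-C++ class (a sketch using `std::vector` in place of `TArray`; class and method names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of the v1.5.0 fix: accumulate ~5 ms mic callbacks and emit
// only chunks of at least 3200 bytes (100 ms @ 16 kHz 16-bit mono).
class FMicAccumulator
{
public:
    static constexpr std::size_t MinChunkBytes = 3200;

    // Chunks that would have been passed to SendAudioChunk().
    std::vector<std::vector<std::uint8_t>> SentChunks;

    // Called per mic callback (OnMicrophoneDataCaptured in the plugin).
    void OnCaptured(const std::vector<std::uint8_t>& Bytes)
    {
        Buffer.insert(Buffer.end(), Bytes.begin(), Bytes.end());
        if (Buffer.size() >= MinChunkBytes)
        {
            SentChunks.push_back(Buffer);
            Buffer.clear();
        }
    }

    // Called from StopListening() before the end-of-turn signal, so the
    // final partial chunk of speech is never dropped.
    void Flush()
    {
        if (!Buffer.empty())
        {
            SentChunks.push_back(Buffer);
            Buffer.clear();
        }
    }

    // Called from HandleDisconnected() to drop stale data on reconnect.
    void Reset() { Buffer.clear(); }

private:
    std::vector<std::uint8_t> Buffer;
};
```

With 158-byte WASAPI callbacks, twenty callbacks (3160 bytes) stay below the threshold and the twenty-first triggers the first send.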
**Files changed:**
- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `OnMicrophoneDataCaptured()`: accumulate → send when threshold reached
  - `StopListening()`: flush remainder before end-of-turn signal
  - `HandleDisconnected()`: clear accumulation buffer

Commit: `91cf5b1`

---
## Next Steps (not done yet)

- [ ] Test v1.5.0 in Editor — enable Verbose Logging and verify push-to-talk mic audio reaches ElevenLabs end-to-end (should be the final fix)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)