Compare commits
4 Commits
993a827c7b ... 9f28ed7457

| SHA1 |
|---|
| 9f28ed7457 |
| f7f0b0c45b |
| b888f7fcb6 |
| 91cf5b1bb4 |
@ -43,20 +43,30 @@

## Plugin Status

- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.5.0 — mic audio chunk size fixed: WASAPI 5ms callbacks are accumulated to 100ms before sending
- v1.4.0 — push-to-talk fully fixed: `bAutoStartListening` is now ignored in Client turn mode
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- `conversation_initiation_client_data` now sent immediately on WS connect (required for mic + latency)

## Audio Chunk Size — CRITICAL

- WASAPI fires mic callbacks every ~5ms → **158 bytes** at 16kHz 16-bit mono
- ElevenLabs VAD/STT requires **≥3200 bytes (100ms)** per chunk; smaller chunks are silently ignored
- Fix: `MicAccumulationBuffer` in the component accumulates chunks and sends only when `>= MicChunkMinBytes` (3200)
- `StopListening()` flushes the remainder so the final partial chunk is never dropped before end-of-turn

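As a sanity check on these numbers: 16kHz mono int16 PCM is 32 bytes per millisecond, so 100ms is exactly 3200 bytes and a 5ms callback is about 160 bytes (the logged 158 bytes is 79 samples, just under 5ms). A plain C++ sketch of the arithmetic, not plugin code:

```cpp
#include <cstdint>

// Bytes in a chunk of 16-bit mono PCM for the given sample rate and duration.
constexpr int32_t PcmBytesForMs(int32_t SampleRate, int32_t Ms)
{
    const int32_t BytesPerSample = 2; // int16 PCM
    return SampleRate * BytesPerSample * Ms / 1000;
}

// Number of int16 samples held in a byte count.
constexpr int32_t SamplesForBytes(int32_t Bytes)
{
    return Bytes / 2;
}
```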
## ElevenLabs WebSocket Protocol Notes

- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text) — UE fires both for the same frame → double-audio bug
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode (`client_vad`): send `user_activity` **with every audio chunk** (not just once) — the server needs a continuous signal to know the user is speaking; stopping chunks = silence detected = agent responds
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
- **MUST send `conversation_initiation_client_data` immediately after WS connect** — without it, the server won't process client audio (mic appears dead)
- `conversation_initiation_client_data` payload: `conversation_config_override.agent.turn.mode`, `conversation_config_override.tts.optimize_streaming_latency`, `custom_llm_extra_body.enable_intermediate_response`
- `enable_intermediate_response: true` in `custom_llm_extra_body` reduces time-to-first-audio latency
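The first-byte rule can be sketched outside UE (plain C++; real frames arrive via `OnRawMessage` and may be fragmented, which this sketch ignores):

```cpp
#include <cstddef>
#include <cstdint>

enum class FrameKind { Json, PcmAudio, Empty };

// Classify a completed binary WebSocket frame: ElevenLabs sends JSON control
// messages and raw PCM audio over the same binary channel, so the only
// discriminator is the first byte — '{' (0x7B) means JSON, anything else PCM.
FrameKind ClassifyFrame(const uint8_t* Data, size_t Len)
{
    if (Len == 0) return FrameKind::Empty;
    return Data[0] == 0x7B ? FrameKind::Json : FrameKind::PcmAudio;
}
```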

## API Keys / Secrets

- The ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor

@ -189,9 +189,160 @@ Commit: `99017f4`

---

## Session 3 — 2026-02-19 (bug fixes from live testing)

### 16. Three Runtime Bugs Fixed (v1.2.0)

User reported after live testing:

1. **AI speaks twice** — every audio response played double
2. **Cannot speak** — mic capture didn't reach ElevenLabs
3. **Latency** — requested `enable_intermediate_response: true`

**Bug 1 Root Cause — Double Audio:**

UE's libwebsockets backend fires **both** `OnMessage()` (text callback) **and** `OnRawMessage()` (binary callback) for the same incoming frame. We had bound both `WebSocket->OnMessage()` and `WebSocket->OnRawMessage()` in `Connect()`, so every audio frame was decoded and enqueued twice → played twice.

Fix: **Remove the `OnMessage` binding entirely.** `OnRawMessage` now handles all frames (JSON control messages detected by peeking the first byte, raw PCM otherwise).

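This failure mode is easy to reproduce outside UE. A minimal stand-in for the double binding (plain C++; `FrameEvent` and the handler names are illustrative, not the UE delegate API) shows that with two handlers attached to the same frame event, every packet lands in the playback queue twice:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Minimal multicast event, standing in for UE's WebSocket delegates.
struct FrameEvent
{
    std::vector<std::function<void(int)>> Handlers;
    void Bind(std::function<void(int)> H) { Handlers.push_back(std::move(H)); }
    void Fire(int Frame) { for (auto& H : Handlers) H(Frame); }
};

// How many packets end up in the playback queue after one incoming frame
// when `Bindings` handlers are attached (2 bindings → every frame queued twice).
size_t QueuedAfterOneFrame(int Bindings)
{
    std::vector<int> PlaybackQueue;
    FrameEvent Ev;
    for (int i = 0; i < Bindings; ++i)
        Ev.Bind([&](int Frame) { PlaybackQueue.push_back(Frame); });
    Ev.Fire(42);
    return PlaybackQueue.size();
}
```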
**Bug 2 Root Cause — Mic Silent:**

ElevenLabs requires a `conversation_initiation_client_data` message sent **immediately** after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.

Fix: Send `conversation_initiation_client_data` in `OnWsConnected()` before any other message.

**Bug 2 Secondary — Delegate Stacking:**

`StartListening()` called `Mic->OnAudioCaptured.AddUObject(this, ...)` without first removing existing bindings. If called more than once (e.g. after a reconnect), delegates stack up and audio is sent multiple times per frame.

Fix: Call `Mic->OnAudioCaptured.RemoveAll(this)` before `AddUObject` in `StartListening()`.

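The guard pattern can be sketched with a minimal owner-keyed delegate (plain C++; `OwnedDelegate` and `BindingsAfter` are illustrative stand-ins, not the UE API). With the remove-before-add guard, repeated `StartListening()` calls stay at one binding; without it, they stack:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for a multicast delegate keyed by owner, mirroring the
// RemoveAll(this)-before-AddUObject guard.
struct OwnedDelegate
{
    std::vector<const void*> Owners;
    void RemoveAll(const void* Owner)
    {
        Owners.erase(std::remove(Owners.begin(), Owners.end(), Owner), Owners.end());
    }
    void Add(const void* Owner) { Owners.push_back(Owner); }
};

// Simulate calling StartListening N times; with the guard the binding count
// stays at 1, without it the bindings stack and audio is sent N times per frame.
size_t BindingsAfter(int StartCalls, bool bGuard)
{
    OwnedDelegate Delegate;
    int Component = 0; // stands in for `this`
    for (int i = 0; i < StartCalls; ++i)
    {
        if (bGuard) Delegate.RemoveAll(&Component);
        Delegate.Add(&Component);
    }
    return Delegate.Owners.size();
}
```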
**Bug 3 — Latency:**

Added `"enable_intermediate_response": true` inside `custom_llm_extra_body` of the `conversation_initiation_client_data` message, and `optimize_streaming_latency: 3` in `conversation_config_override.tts`.

**Files changed:**

- `ElevenLabsWebSocketProxy.cpp`:
  - `Connect()`: removed the `OnMessage` binding
  - `OnWsConnected()`: now sends the full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartListening()`: added a `RemoveAll` guard before the delegate binding

---

## Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)

### 17. Two More Bugs Found and Fixed (v1.3.0)

User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.

**Analysis of log:**

- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- The server never responds to mic input → audio is reaching ElevenLabs but being ignored

**Bug A — TurnMode mismatch in `conversation_initiation_client_data`:**

`OnWsConnected()` hardcoded `"mode": "server_vad"` in the init message regardless of the component's `TurnMode` setting. The user's Blueprint uses Client turn mode (push-to-talk), so the server was configured for `server_vad` while the client sent `client_vad` audio signals.

Fix: Read a `TurnMode` field on the proxy (set from the component before `Connect()`). Translate `EElevenLabsTurnMode::Client` → `"client_vad"`, Server → `"server_vad"`.

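The v1.3.0 translation itself is a one-liner; a sketch (the enum name mirrors the plugin's, `TurnModeToString` is illustrative). Note that the later proxy code comments in Session 5+ state the API has no `turn.mode` field, so read this as a record of the v1.3.0 change rather than the final protocol:

```cpp
#include <string>

enum class EElevenLabsTurnMode { Server, Client };

// Map the component's turn mode to the string the v1.3.0 init message used.
// The intent: tell the server "client_vad" when the client drives push-to-talk,
// so both sides agree on who detects turn boundaries.
std::string TurnModeToString(EElevenLabsTurnMode Mode)
{
    return Mode == EElevenLabsTurnMode::Client ? "client_vad" : "server_vad";
}
```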
**Bug B — `user_activity` never sent continuously:**

In client VAD mode, ElevenLabs requires `user_activity` to be sent **continuously** alongside every audio chunk to keep the server's VAD aware that the user is speaking. `SendUserTurnStart()` sent it once on key press, but never again during speech. Server-side, without continuous `user_activity`, the audio was treated as noise.

Fix: In `SendAudioChunk()`, automatically send `user_activity` before each audio chunk when `TurnMode == Client`. This keeps the signal continuous for the full duration of speech. When the user releases T, `StopListening()` stops the mic → audio stops → `user_activity` stops → the server detects silence and triggers the agent response.

**Bug C — TurnMode not propagated to the proxy:**

`UElevenLabsConversationalAgentComponent` never told the proxy which TurnMode to use. Added `WebSocketProxy->TurnMode = TurnMode` before `Connect()` in `StartConversation()`.

**Files changed:**

- `ElevenLabsWebSocketProxy.h`: added a public `TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
  - `OnWsConnected()`: use `TurnMode` to set the correct mode string in the init message
  - `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`

---

## Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)

### 18. Root Cause Found and Fixed (v1.4.0)

Log analysis revealed the true root cause.

**Exact sequence:**

```
OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
User presses T → StartListening() → bIsListening guard → no-op
User releases T → StopListening() → bIsListening=false, mic CLOSES
User presses T → StartListening() → NOW opens mic (was closed)
User releases T → StopListening() → mic closes — but ElevenLabs never got audio
```

**Root cause:** `bAutoStartListening = true` opens the mic on connect and sets `bIsListening = true`. In Client/push-to-talk mode, every T-press hits the `bIsListening` guard and does nothing, while every T-release closes the auto-started mic. The mic was never open during actual speech.

**Fix:** `HandleConnected()` now only calls `StartListening()` when `TurnMode == Server`. In Client mode, `bAutoStartListening` is ignored — the user controls listening via the T key.

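The v1.4.0 fix reduces to a single predicate. A model of the guard (plain C++; the helper name is illustrative, the enum mirrors the plugin's):

```cpp
enum class EElevenLabsTurnMode { Server, Client };

// v1.4.0 guard: auto-start the mic on connect only in Server VAD mode.
// In Client (push-to-talk) mode the user owns StartListening/StopListening,
// so an auto-opened mic would make every T-press a no-op and every T-release
// a mic close — exactly the logged failure sequence.
bool ShouldAutoStartListening(bool bAutoStartListening, EElevenLabsTurnMode Mode)
{
    return bAutoStartListening && Mode == EElevenLabsTurnMode::Server;
}
```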
**File changed:**

- `ElevenLabsConversationalAgentComponent.cpp`:
  - `HandleConnected()`: guard `bAutoStartListening` with a `TurnMode == Server` check

---

## Session 6 — 2026-02-19 (audio chunk size fix)

### 19. Mic Audio Chunk Accumulation (v1.5.0)

**Root cause (from the diagnostic log in Session 5):**

The log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses.

- 158 bytes = 79 samples ≈ 5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- The tiny 5ms fragments arrived at the server but were silently ignored → the agent never responded

**Fix applied:**

Added a `TArray<uint8> MicAccumulationBuffer` to `UElevenLabsConversationalAgentComponent`. `OnMicrophoneDataCaptured()` appends each callback's converted bytes and only calls `SendAudioChunk` once `>= MicChunkMinBytes` (3200 bytes = 100ms) have accumulated.

`StopListening()` flushes any remaining bytes in the buffer before sending `SendUserTurnEnd()`, so the last partial chunk of speech is never dropped.

`HandleDisconnected()` clears the buffer to prevent stale data on reconnect.

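The accumulate-and-flush path can be sketched independently of UE, with `std::vector` standing in for `TArray` and the send path injected as a callback (illustrative, not the plugin's actual class):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Mirrors the v1.5.0 mic path: ~5ms WASAPI callbacks are pooled until at
// least MinChunkBytes (3200 = 100ms at 16kHz int16 mono) are available,
// then sent as one chunk; Flush() sends any remainder at end-of-turn.
class MicAccumulator
{
public:
    explicit MicAccumulator(std::function<void(const std::vector<uint8_t>&)> InSend,
                            size_t InMinChunkBytes = 3200)
        : Send(std::move(InSend)), MinChunkBytes(InMinChunkBytes) {}

    // One capture callback's worth of converted PCM bytes.
    void OnCaptured(const std::vector<uint8_t>& Bytes)
    {
        Buffer.insert(Buffer.end(), Bytes.begin(), Bytes.end());
        if (Buffer.size() >= MinChunkBytes)
        {
            Send(Buffer);
            Buffer.clear();
        }
    }

    // Called from StopListening(): never drop the final partial chunk.
    void Flush()
    {
        if (!Buffer.empty()) Send(Buffer);
        Buffer.clear();
    }

private:
    std::function<void(const std::vector<uint8_t>&)> Send;
    size_t MinChunkBytes;
    std::vector<uint8_t> Buffer;
};
```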
**Files changed:**

- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
  - `OnMicrophoneDataCaptured()`: accumulate → send when the threshold is reached
  - `StopListening()`: flush the remainder before the end-of-turn signal
  - `HandleDisconnected()`: clear the accumulation buffer

Commit: `91cf5b1`

---

## Next Steps (not done yet)

- [ ] Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)

@ -0,0 +1,8 @@

[FilterPlugin]
; This section lists additional files which will be packaged along with your plugin. Paths should be listed relative to the root plugin directory, and
; may include "...", "*", and "?" wildcards to match directories, files, and individual characters respectively.
;
; Examples:
; /README.txt
; /Extras/...
; /Binaries/ThirdParty/*.dll

@ -86,6 +86,10 @@ void UElevenLabsConversationalAgentComponent::StartConversation()

        &UElevenLabsConversationalAgentComponent::HandleInterrupted);
    }

    // Pass configuration to the proxy before connecting.
    WebSocketProxy->TurnMode = TurnMode;
    WebSocketProxy->bSpeculativeTurn = bSpeculativeTurn;

    WebSocketProxy->Connect(AgentID);
}

@ -128,6 +132,9 @@ void UElevenLabsConversationalAgentComponent::StartListening()

        Mic->RegisterComponent();
    }

    // Always remove existing binding first to prevent duplicate delegates stacking
    // up if StartListening is called more than once without a matching StopListening.
    Mic->OnAudioCaptured.RemoveAll(this);
    Mic->OnAudioCaptured.AddUObject(this,
        &UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
    Mic->StartCapture();
@ -147,6 +154,15 @@ void UElevenLabsConversationalAgentComponent::StopListening()

        Mic->OnAudioCaptured.RemoveAll(this);
    }

    // Flush any partially-accumulated mic audio before signalling end-of-turn.
    // This ensures the final words aren't discarded just because the last callback
    // didn't push the buffer over the MicChunkMinBytes threshold.
    if (MicAccumulationBuffer.Num() > 0 && WebSocketProxy && IsConnected())
    {
        WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
    }
    MicAccumulationBuffer.Reset();

    if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
    {
        WebSocketProxy->SendUserTurnEnd();
@ -193,7 +209,12 @@ void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsC

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
    OnAgentConnected.Broadcast(Info);

    // In Client turn mode (push-to-talk), the user controls listening manually via
    // StartListening()/StopListening(). Auto-starting would leave the mic open
    // permanently and interfere with push-to-talk — the T-release StopListening()
    // would close the mic that auto-start opened, leaving the user unable to speak.
    // Only auto-start in Server VAD mode where the mic stays open the whole session.
    if (bAutoStartListening && TurnMode == EElevenLabsTurnMode::Server)
    {
        StartListening();
    }
@ -204,6 +225,7 @@ void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCod

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
    bIsListening = false;
    bAgentSpeaking = false;
    MicAccumulationBuffer.Reset();
    OnAgentDisconnected.Broadcast(StatusCode, Reason);
}

@ -219,14 +241,20 @@ void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<u

}

void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
    if (bEnableUserTranscript)
    {
        OnAgentTranscript.Broadcast(Segment);
    }
}

void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
    if (bEnableAgentTextResponse)
    {
        OnAgentTextResponse.Broadcast(ResponseText);
    }
}

void UElevenLabsConversationalAgentComponent::HandleInterrupted()
{
@ -321,8 +349,18 @@ void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TAr

{
    if (!IsConnected() || !bIsListening) return;

    // Convert this callback's samples to int16 bytes and accumulate.
    // WASAPI fires every ~5ms (158 bytes at 16kHz). ElevenLabs needs ≥100ms
    // (3200 bytes) per chunk for reliable VAD and STT. We hold bytes here
    // until we have enough, then send the whole batch in one WebSocket frame.
    TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
    MicAccumulationBuffer.Append(PCMBytes);

    if (MicAccumulationBuffer.Num() >= MicChunkMinBytes)
    {
        WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
        MicAccumulationBuffer.Reset();
    }
}

TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)

@ -119,20 +119,22 @@ void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(

// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
    const float* InAudio, int32 NumFrames,
    int32 InChannels, int32 InSampleRate)
{
    const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000

    // --- Step 1: Downmix to mono ---
    // NOTE: NumFrames is the number of audio frames (not total samples).
    // Each frame contains InChannels samples (e.g. 2 for stereo).
    // The raw buffer has NumFrames * InChannels total float values.
    TArray<float> Mono;
    if (InChannels == 1)
    {
        Mono = TArray<float>(InAudio, NumFrames);
    }
    else
    {
        Mono.Reserve(NumFrames);
        for (int32 i = 0; i < NumFrames; i++)
        {

@ -72,7 +72,11 @@ void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FS

    WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
    WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
    WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
    // NOTE: We bind ONLY OnRawMessage (binary frames), NOT OnMessage (text frames).
    // UE's WebSocket implementation fires BOTH callbacks for the same frame when using
    // the libwebsockets backend — binding both causes every audio packet to be decoded
    // and played twice. OnRawMessage handles all frame types: raw binary audio AND
    // text-framed JSON (detected by peeking first byte for '{').
    WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);

    WebSocket->Connect();
@ -94,36 +98,58 @@ void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)

{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected (state=%d). Audio dropped."),
            (int32)ConnectionState);
        return;
    }
    if (PCMData.Num() == 0) return;

    UE_LOG(LogElevenLabsWS, Log, TEXT("SendAudioChunk: %d bytes (PCM int16 LE @ 16kHz mono)"), PCMData.Num());

    // Track when the last audio chunk was sent for latency measurement.
    LastAudioChunkSentTime = FPlatformTime::Seconds();

    // ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
    // The server's VAD detects silence to determine end-of-turn.
    // Do NOT send user_activity here — it resets the turn timeout timer
    // and would prevent the server from taking the turn after the user stops speaking.
    const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());

    // Send as compact JSON (no pretty-printing) directly, bypassing SendJsonMessage
    // to avoid the pretty-printed writer and to keep the payload minimal.
    const FString AudioJson = FString::Printf(TEXT("{\"user_audio_chunk\":\"%s\"}"), *Base64Audio);

    // Log the first couple of chunks (truncated) for debugging.
    static int32 AudioChunksSent = 0;
    AudioChunksSent++;
    if (AudioChunksSent <= 2)
    {
        UE_LOG(LogElevenLabsWS, Log, TEXT("  Audio JSON (first 200 chars): %.200s"), *AudioJson);
    }

    if (WebSocket.IsValid() && WebSocket->IsConnected())
    {
        WebSocket->Send(AudioJson);
    }
}

void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
    // No-op: the ElevenLabs API does not require a "start speaking" signal.
    // The server's VAD detects speech from the audio chunks we send.
    // user_activity is a keep-alive/timeout-reset message and should NOT be
    // sent here — it would delay the agent's turn after the user stops.
    UE_LOG(LogElevenLabsWS, Log, TEXT("User turn started (audio chunks will follow)."));
}

void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
    // No explicit "end turn" message exists in the ElevenLabs API.
    // The server detects end-of-speech via VAD when we stop sending audio chunks.
    UserTurnEndTime = FPlatformTime::Seconds();
    bWaitingForResponse = true;
    bFirstAudioResponseLogged = false;
    UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended — stopped sending audio chunks. Server VAD will detect silence."));
}

void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)

@ -155,8 +181,79 @@ void UElevenLabsWebSocketProxy::SendInterrupt()
|
|||||||
// ─────────────────────────────────────────────────────────────────────────────
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
void UElevenLabsWebSocketProxy::OnWsConnected()
|
void UElevenLabsWebSocketProxy::OnWsConnected()
|
||||||
{
|
{
|
||||||
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
|
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Sending conversation_initiation_client_data..."));
|
||||||
// State stays Connecting until we receive the initiation metadata from the server.
|
// State stays Connecting until we receive conversation_initiation_metadata from the server.
|
||||||
|
|
||||||
|
// ElevenLabs requires this message immediately after the WebSocket handshake to
|
||||||
|
// negotiate the session configuration. Without it, the server won't accept audio
|
||||||
|
// from the client (microphone stays silent from server perspective) and default
|
||||||
|
// settings are used (higher latency, no intermediate responses).
|
||||||
|
//
|
||||||
|
// Structure:
|
||||||
|
// {
|
||||||
|
// "type": "conversation_initiation_client_data",
|
||||||
|
// "conversation_config_override": {
|
||||||
|
// "agent": {
|
||||||
|
// "turn": { "turn_timeout": 3, "speculative_turn": true }
|
||||||
|
// },
|
||||||
|
// "tts": {
|
||||||
|
// "optimize_streaming_latency": 3
|
||||||
|
// }
|
||||||
|
// },
|
||||||
|
// "custom_llm_extra_body": {
|
||||||
|
// "enable_intermediate_response": true
|
||||||
|
// }
|
||||||
|
// }
|
||||||
|
|
||||||
|
// Configure turn-taking behaviour.
|
||||||
|
// The ElevenLabs API does NOT have a turn.mode field.
|
||||||
|
// Turn-taking is controlled by the server's VAD and the turn_* parameters.
|
||||||
|
// In push-to-talk (Client mode) the user controls the mic; the server still
|
||||||
|
// uses its VAD to detect the end of speech from the audio chunks it receives.
|
||||||
|
	TSharedPtr<FJsonObject> TurnObj = MakeShareable(new FJsonObject());
	// Lower turn_timeout so the agent responds faster after the user stops speaking.
	// Default is 7s. In push-to-talk (Client mode), the user explicitly signals
	// end-of-turn by releasing the key, so we can use a very short timeout (1s).
	if (TurnMode == EElevenLabsTurnMode::Client)
	{
		TurnObj->SetNumberField(TEXT("turn_timeout"), 1);
	}
	// Speculative turn: start LLM generation during silence before the VAD is
	// fully confident the user finished speaking. Reduces latency by 200-500ms.
	if (bSpeculativeTurn)
	{
		TurnObj->SetBoolField(TEXT("speculative_turn"), true);
	}

	TSharedPtr<FJsonObject> AgentObj = MakeShareable(new FJsonObject());
	AgentObj->SetObjectField(TEXT("turn"), TurnObj);

	TSharedPtr<FJsonObject> TtsObj = MakeShareable(new FJsonObject());
	TtsObj->SetNumberField(TEXT("optimize_streaming_latency"), 3);

	TSharedPtr<FJsonObject> ConversationConfigOverride = MakeShareable(new FJsonObject());
	ConversationConfigOverride->SetObjectField(TEXT("agent"), AgentObj);
	ConversationConfigOverride->SetObjectField(TEXT("tts"), TtsObj);

	// enable_intermediate_response reduces time-to-first-audio by allowing the agent
	// to start speaking before it has finished generating the full response.
	TSharedPtr<FJsonObject> CustomLlmExtraBody = MakeShareable(new FJsonObject());
	CustomLlmExtraBody->SetBoolField(TEXT("enable_intermediate_response"), true);

	TSharedPtr<FJsonObject> InitMsg = MakeShareable(new FJsonObject());
	InitMsg->SetStringField(TEXT("type"), ElevenLabsMessageType::ConversationClientData);
	InitMsg->SetObjectField(TEXT("conversation_config_override"), ConversationConfigOverride);
	InitMsg->SetObjectField(TEXT("custom_llm_extra_body"), CustomLlmExtraBody);

	// NOTE: We bypass SendJsonMessage() here intentionally.
	// SendJsonMessage() guards on WebSocket->IsConnected(), but OnWsConnected fires
	// during the handshake before IsConnected() returns true in some UE WS backends.
	// We know the socket is open at this point, so send directly.
	FString InitJson;
	TSharedRef<TJsonWriter<>> InitWriter = TJsonWriterFactory<>::Create(&InitJson);
	FJsonSerializer::Serialize(InitMsg.ToSharedRef(), InitWriter);
	UE_LOG(LogElevenLabsWS, Log, TEXT("Sending initiation: %s"), *InitJson);
	WebSocket->Send(InitJson);
	}
}
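For reference, with Client turn mode and speculative turn enabled, the initiation message built above serializes to roughly the following. This is an illustrative sketch: field order from `FJsonSerializer` may differ, and the exact `type` string is assumed to be `conversation_initiation_client_data` (the value `ElevenLabsMessageType::ConversationClientData` resolves to).

```json
{
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "agent": {
      "turn": {
        "turn_timeout": 1,
        "speculative_turn": true
      }
    },
    "tts": {
      "optimize_streaming_latency": 3
    }
  },
  "custom_llm_extra_body": {
    "enable_intermediate_response": true
  }
}
```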

void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)

@@ -200,20 +297,53 @@ void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
		return;
	}

	// Log every message type received from the server for debugging.
	UE_LOG(LogElevenLabsWS, Log, TEXT("Received message type: %s"), *MsgType);

	if (MsgType == ElevenLabsMessageType::ConversationInitiation)
	{
		HandleConversationInitiation(Root);
	}
	else if (MsgType == ElevenLabsMessageType::AudioResponse)
	{
		// Log time-to-first-audio: latency between end of user turn and first agent audio.
		if (bWaitingForResponse && !bFirstAudioResponseLogged)
		{
			const double Now = FPlatformTime::Seconds();
			const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
			const double LatencyFromLastChunk = (Now - LastAudioChunkSentTime) * 1000.0;
			UE_LOG(LogElevenLabsWS, Warning,
				TEXT("[LATENCY] Time-to-first-audio: %.0f ms (from turn end), %.0f ms (from last chunk sent)"),
				LatencyFromTurnEnd, LatencyFromLastChunk);
			bFirstAudioResponseLogged = true;
		}
		HandleAudioResponse(Root);
	}
	else if (MsgType == ElevenLabsMessageType::UserTranscript)
	{
		// Log transcription latency.
		if (bWaitingForResponse)
		{
			const double Now = FPlatformTime::Seconds();
			const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
			UE_LOG(LogElevenLabsWS, Warning,
				TEXT("[LATENCY] User transcript received: %.0f ms after turn end"),
				LatencyFromTurnEnd);
			bWaitingForResponse = false;
		}
		HandleTranscript(Root);
	}
	else if (MsgType == ElevenLabsMessageType::AgentResponse)
	{
		// Log agent text response latency.
		if (UserTurnEndTime > 0.0)
		{
			const double Now = FPlatformTime::Seconds();
			const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
			UE_LOG(LogElevenLabsWS, Warning,
				TEXT("[LATENCY] Agent text response: %.0f ms after turn end"),
				LatencyFromTurnEnd);
		}
		HandleAgentResponse(Root);
	}
	else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)
@@ -80,6 +80,29 @@ public:
	UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
	bool bAutoStartListening = true;

	/**
	 * Enable speculative turn: the LLM starts generating a response during
	 * silence before the VAD is fully confident the user has finished speaking.
	 * Reduces latency by 200-500ms but may occasionally produce premature responses.
	 */
	UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Latency")
	bool bSpeculativeTurn = true;

	/**
	 * Forward user speech transcripts (user_transcript events) to the
	 * OnAgentTranscript delegate. Disable to reduce overhead if you don't
	 * need to display what the user said.
	 */
	UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
	bool bEnableUserTranscript = true;

	/**
	 * Forward agent text responses (agent_response events) to the
	 * OnAgentTextResponse delegate. Disable if you only need audio output.
	 */
	UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
	bool bEnableAgentTextResponse = true;

	// ── Events ────────────────────────────────────────────────────────────────

	UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
@@ -230,4 +253,11 @@ private:
	// consider the agent done speaking.
	int32 SilentTickCount = 0;
	static constexpr int32 SilenceThresholdTicks = 30; // ~0.5s at 60fps

	// ── Microphone accumulation ───────────────────────────────────────────────
	// WASAPI fires callbacks every ~5ms (158 bytes at 16kHz 16-bit mono).
	// ElevenLabs needs at least ~100ms (3200 bytes) per chunk for reliable VAD/STT.
	// We accumulate here and only call SendAudioChunk once enough bytes are ready.
	TArray<uint8> MicAccumulationBuffer;
	static constexpr int32 MicChunkMinBytes = 3200; // 100ms @ 16kHz 16-bit mono
};
@@ -183,4 +183,22 @@ private:
	// Accumulation buffer for multi-fragment binary WebSocket frames.
	// ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
	TArray<uint8> BinaryFrameBuffer;

	// ── Latency tracking ─────────────────────────────────────────────────────
	// Timestamp of the last audio chunk sent (user speech).
	double LastAudioChunkSentTime = 0.0;
	// Timestamp when user turn ended (StopListening).
	double UserTurnEndTime = 0.0;
	// Whether we are waiting for the first response after user stopped speaking.
	bool bWaitingForResponse = false;
	// Whether we already logged the first audio response latency for this turn.
	bool bFirstAudioResponseLogged = false;

public:
	// Set by UElevenLabsConversationalAgentComponent before calling Connect().
	// Controls turn_timeout in conversation_initiation_client_data.
	EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;

	// Speculative turn: start LLM generation during silence before full turn confidence.
	bool bSpeculativeTurn = true;
};