Compare commits

...

4 Commits

Author SHA1 Message Date
9f28ed7457 Working ! 2026-02-20 08:24:56 +01:00
f7f0b0c45b Fix voice input: resampler stereo bug, remove invalid turn mode, cleanup
Three bugs prevented voice input from working:

1. ResampleTo16000() treated NumFrames as total samples, dividing by
   channel count again — losing half the audio data with stereo input.
   The corrupted audio was unrecognizable to ElevenLabs VAD/STT.

2. Sent nonexistent "client_vad" turn mode in session init. The API has
   no turn.mode field; replaced with turn_timeout parameter.

3. Sent user_activity with every audio chunk, which resets the turn
   timeout timer and prevents the server from taking its turn.

Also: send audio chunks as compact JSON, add message type debug logging,
send conversation_initiation_client_data on connect.
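The frame/sample confusion behind bug 1 is easy to reproduce outside UE. A minimal sketch in plain C++ (the function name and shape are illustrative, not the plugin's actual `ResampleTo16000`): the count passed in must be treated as frames, where each frame holds one sample per channel.

```cpp
#include <cstddef>
#include <vector>

// Downmix interleaved float PCM to mono.
// NumFrames counts frames, NOT total floats: a stereo buffer of
// NumFrames frames holds NumFrames * 2 floats. Dividing by the
// channel count again (the original bug) would drop half the audio.
std::vector<float> DownmixToMono(const float* In, int NumFrames, int Channels)
{
    std::vector<float> Mono;
    Mono.reserve(NumFrames);
    for (int f = 0; f < NumFrames; ++f)
    {
        float Sum = 0.0f;
        for (int c = 0; c < Channels; ++c)
            Sum += In[f * Channels + c];   // one sample per channel per frame
        Mono.push_back(Sum / Channels);    // average the channels
    }
    return Mono;
}
```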

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 08:05:39 +01:00
b888f7fcb6 Update memory: document v1.5.0 mic chunk size fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 18:42:47 +01:00
91cf5b1bb4 Fix audio chunk size: accumulate mic audio to 100ms before sending
WASAPI fires mic callbacks every ~5ms (158 bytes at 16kHz 16-bit mono).
ElevenLabs VAD/STT requires a minimum of ~100ms (3200 bytes) per chunk.
Tiny fragments arrived at the server but were never processed, so the
agent never transcribed or responded to user speech.

Fix: OnMicrophoneDataCaptured now appends to MicAccumulationBuffer and
only calls SendAudioChunk once >= 3200 bytes are accumulated. StopListening
flushes any remaining bytes before sending UserTurnEnd so the final words
of an utterance are never discarded. HandleDisconnected also clears the
buffer to prevent stale data on reconnect.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 18:41:58 +01:00
8 changed files with 417 additions and 30 deletions

View File

@@ -43,20 +43,30 @@
## Plugin Status
- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- v1.5.0 — mic audio chunk size fixed: WASAPI 5ms callbacks accumulated to 100ms before sending
- v1.4.0 — push-to-talk fully fixed: bAutoStartListening now ignored in Client turn mode
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- Connection confirmed working end-to-end; audio receive path functional
- `conversation_initiation_client_data` now sent immediately on WS connect (required for mic + latency)
## Audio Chunk Size — CRITICAL
- WASAPI fires mic callbacks every ~5ms → **158 bytes** at 16kHz 16-bit mono
- ElevenLabs VAD/STT requires **≥3200 bytes (100ms)** per chunk; smaller chunks are silently ignored
- Fix: `MicAccumulationBuffer` in component accumulates chunks; sends only when `>= MicChunkMinBytes` (3200)
- `StopListening()` flushes remainder so final partial chunk is never dropped before end-of-turn
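The accumulation scheme above can be sketched without UE types (a minimal stand-in; `MicAccumulator` and its members are hypothetical, not the plugin's actual classes):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Append each ~5ms callback's bytes, emit a chunk only once >= 3200 bytes
// (100ms @ 16kHz 16-bit mono) are buffered, and flush the remainder at
// end-of-turn so the last partial chunk is not dropped.
class MicAccumulator
{
public:
    static constexpr int MinChunkBytes = 3200; // 100ms @ 16kHz int16 mono

    explicit MicAccumulator(std::function<void(const std::vector<uint8_t>&)> Send)
        : SendChunk(std::move(Send)) {}

    void OnCaptured(const uint8_t* Data, size_t Num)
    {
        Buffer.insert(Buffer.end(), Data, Data + Num);
        if (Buffer.size() >= static_cast<size_t>(MinChunkBytes))
        {
            SendChunk(Buffer);
            Buffer.clear();
        }
    }

    // Called before the end-of-turn signal (StopListening in the plugin).
    void Flush()
    {
        if (!Buffer.empty())
        {
            SendChunk(Buffer);
            Buffer.clear();
        }
    }

private:
    std::vector<uint8_t> Buffer;
    std::function<void(const std::vector<uint8_t>&)> SendChunk;
};
```

With 158-byte WASAPI callbacks, the first send fires on the 21st callback (21 × 158 = 3318 ≥ 3200).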
## ElevenLabs WebSocket Protocol Notes
- **ALL frames are binary** — `OnRawMessage` handles everything; `OnMessage` (text) never fires
- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text) — UE fires both for the same frame → double audio bug
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode: `{"type":"user_activity"}` to signal speaking; no explicit end message
- Client turn mode (`client_vad`): send `user_activity` **with every audio chunk** (not just once) — server needs continuous signal to know user is speaking; stopping chunks = silence detected = agent responds
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
- **MUST send `conversation_initiation_client_data` immediately after WS connect** — without it, server won't process client audio (mic appears dead)
- `conversation_initiation_client_data` payload: `conversation_config_override.agent.turn.mode`, `conversation_config_override.tts.optimize_streaming_latency`, `custom_llm_extra_body.enable_intermediate_response`
- `enable_intermediate_response: true` in `custom_llm_extra_body` reduces time-to-first-audio latency
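The first-byte discrimination rule above can be sketched as a small classifier (plain C++, names hypothetical): since ElevenLabs sends every frame as binary, the receiver peeks byte[0] — `'{'` (0x7B) means a JSON control message, anything else is raw little-endian PCM.

```cpp
#include <cstddef>
#include <cstdint>

enum class FrameKind { Json, Pcm, Empty };

// Classify a reassembled binary WebSocket frame by its first byte.
FrameKind ClassifyFrame(const uint8_t* Data, size_t Num)
{
    if (Num == 0)       return FrameKind::Empty;
    if (Data[0] == '{') return FrameKind::Json; // 0x7B: JSON control message
    return FrameKind::Pcm;                      // anything else: raw PCM audio
}
```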
## API Keys / Secrets
- ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor

View File

@@ -189,9 +189,160 @@ Commit: `99017f4`
---
---
## Session 3 — 2026-02-19 (bug fixes from live testing)
### 16. Three Runtime Bugs Fixed (v1.2.0)
User reported after live testing:
1. **AI speaks twice** — every audio response played double
2. **Cannot speak** — mic capture didn't reach ElevenLabs
3. **Latency** — requested `enable_intermediate_response: true`
**Bug 1 Root Cause — Double Audio:**
UE's libwebsockets backend fires **both** `OnMessage()` (text callback) **and** `OnRawMessage()` (binary callback) for the same incoming frame.
We had bound both `WebSocket->OnMessage()` and `WebSocket->OnRawMessage()` in `Connect()`.
Result: every audio frame was decoded and enqueued twice → played twice.
Fix: **Remove `OnMessage` binding entirely.** `OnRawMessage` now handles all frames (JSON control messages peeked via first byte, raw PCM otherwise).
**Bug 2 Root Cause — Mic Silent:**
ElevenLabs requires a `conversation_initiation_client_data` message sent **immediately** after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.
Fix: Send `conversation_initiation_client_data` in `OnWsConnected()` before any other message.
**Bug 2 Secondary — Delegate Stacking:**
`StartListening()` called `Mic->OnAudioCaptured.AddUObject(this, ...)` without first removing existing bindings. If called more than once (e.g. after reconnect), delegates stack up and audio is sent multiple times per frame.
Fix: Add `Mic->OnAudioCaptured.RemoveAll(this)` before `AddUObject` in `StartListening()`.
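The stacking hazard can be shown with a plain C++ stand-in for UE's multicast delegates (a sketch only; `MulticastDelegate` here is hypothetical, not the engine type): binding twice without removing first delivers every callback twice.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Minimal multicast-delegate stand-in keyed by owner pointer.
struct MulticastDelegate
{
    std::vector<std::pair<void*, std::function<void()>>> Handlers;

    void Add(void* Owner, std::function<void()> Fn)
    {
        Handlers.emplace_back(Owner, std::move(Fn));
    }

    // The guard pattern: drop any existing bindings for this owner
    // before re-binding, so repeated StartListening calls can't stack.
    void RemoveAll(void* Owner)
    {
        Handlers.erase(
            std::remove_if(Handlers.begin(), Handlers.end(),
                           [Owner](const std::pair<void*, std::function<void()>>& H)
                           { return H.first == Owner; }),
            Handlers.end());
    }

    void Broadcast()
    {
        for (auto& H : Handlers) H.second();
    }
};
```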
**Bug 3 — Latency:**
Added `"enable_intermediate_response": true` inside `custom_llm_extra_body` of the `conversation_initiation_client_data` message. Also added `optimize_streaming_latency: 3` in `conversation_config_override.tts`.
**Files changed:**
- `ElevenLabsWebSocketProxy.cpp`:
- `Connect()`: removed `OnMessage` binding
- `OnWsConnected()`: now sends full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
- `StartListening()`: added `RemoveAll` guard before delegate binding
---
---
## Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)
### 17. Two More Bugs Found and Fixed (v1.3.0)
User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.
**Analysis of log:**
- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio reaching ElevenLabs but being ignored
**Bug A — TurnMode mismatch in conversation_initiation_client_data:**
`OnWsConnected()` hardcoded `"mode": "server_vad"` in the init message regardless of the
component's `TurnMode` setting. User's Blueprint uses Client turn mode (push-to-talk),
so the server was configured for server_vad while the client sent client_vad audio signals.
Fix: Read `TurnMode` field on the proxy (set from the component before `Connect()`).
Translate `EElevenLabsTurnMode::Client` → `"client_vad"`, Server → `"server_vad"`.
**Bug B — user_activity never sent continuously:**
In client VAD mode, ElevenLabs requires `user_activity` to be sent **continuously**
alongside every audio chunk to keep the server's VAD aware the user is speaking.
`SendUserTurnStart()` sent it once on key press, but never again during speech.
Without continuous `user_activity`, the server treated the incoming audio as noise.
Fix: In `SendAudioChunk()`, automatically send `user_activity` before each audio chunk
when `TurnMode == Client`. This keeps the signal continuous for the full duration of speech.
When the user releases T, `StopListening()` stops the mic → audio stops → `user_activity`
stops → server detects silence and triggers the agent response.
**Bug C — TurnMode not propagated to proxy:**
`UElevenLabsConversationalAgentComponent` never told the proxy what TurnMode to use.
Added `WebSocketProxy->TurnMode = TurnMode` before `Connect()` in `StartConversation()`.
**Files changed:**
- `ElevenLabsWebSocketProxy.h`: added `public TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
- `OnWsConnected()`: use `TurnMode` to set correct mode string in init message
- `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
- `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`
---
---
## Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)
### 18. Root Cause Found and Fixed (v1.4.0)
Log analysis revealed the true root cause:
**Exact sequence:**
```
OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
User presses T → StartListening() → bIsListening guard → no-op
User releases T → StopListening() → bIsListening=false, mic CLOSES
User presses T → StartListening() → NOW opens mic (was closed)
User releases T → StopListening() → mic closes — but ElevenLabs never got audio
```
**Root cause:** `bAutoStartListening = true` opens the mic on connect and sets `bIsListening = true`.
In Client/push-to-talk mode, every T-press hits the `bIsListening` guard and does nothing.
Every T-release closes the auto-started mic. The mic was never open during actual speech.
**Fix:** `HandleConnected()` now only calls `StartListening()` when `TurnMode == Server`.
In Client mode, `bAutoStartListening` is ignored — the user controls listening via T key.
**File changed:**
- `ElevenLabsConversationalAgentComponent.cpp`:
- `HandleConnected()`: guard `bAutoStartListening` with `TurnMode == Server` check
---
---
## Session 6 — 2026-02-19 (audio chunk size fix)
### 19. Mic Audio Chunk Accumulation (v1.5.0)
**Root cause (from diagnostic log in Session 5):**
Log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses.
- 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- Tiny 5ms fragments arrived at the server but were silently ignored → agent never responded
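The byte math above checks out directly: at 16kHz 16-bit mono, 16000 samples/s × 2 bytes/sample = 32000 bytes/s, i.e. 32 bytes per millisecond. A small constexpr sketch (helper names are illustrative):

```cpp
// Chunk-size arithmetic for 16kHz 16-bit mono PCM.
constexpr int SampleRate     = 16000;
constexpr int BytesPerSample = 2;

// 32 bytes of PCM per millisecond → 3200 bytes for the ~100ms minimum.
constexpr int BytesForMs(int Ms)
{
    return Ms * SampleRate * BytesPerSample / 1000;
}

// 158 bytes ≈ 4.94ms — the WASAPI callback size seen in the log.
constexpr double MsForBytes(int Bytes)
{
    return Bytes * 1000.0 / (SampleRate * BytesPerSample);
}
```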
**Fix applied:**
Added `MicAccumulationBuffer TArray<uint8>` to `UElevenLabsConversationalAgentComponent`.
`OnMicrophoneDataCaptured()` appends each callback's converted bytes and only calls `SendAudioChunk`
when `>= MicChunkMinBytes` (3200 bytes = 100ms) have accumulated.
`StopListening()` flushes any remaining bytes in the buffer before sending `SendUserTurnEnd()`,
so the last partial chunk of speech is never dropped.
`HandleDisconnected()` clears the buffer to prevent stale data on reconnect.
**Files changed:**
- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
- `OnMicrophoneDataCaptured()`: accumulate → send when threshold reached
- `StopListening()`: flush remainder before end-of-turn signal
- `HandleDisconnected()`: clear accumulation buffer
Commit: `91cf5b1`
---
## Next Steps (not done yet)
- [ ] Verify mic audio actually reaches ElevenLabs (enable Verbose Logging, test in Editor)
- [ ] Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)

View File

@@ -0,0 +1,8 @@
[FilterPlugin]
; This section lists additional files which will be packaged along with your plugin. Paths should be listed relative to the root plugin directory, and
; may include "...", "*", and "?" wildcards to match directories, files, and individual characters respectively.
;
; Examples:
; /README.txt
; /Extras/...
; /Binaries/ThirdParty/*.dll

View File

@@ -86,6 +86,10 @@ void UElevenLabsConversationalAgentComponent::StartConversation()
&UElevenLabsConversationalAgentComponent::HandleInterrupted);
}
// Pass configuration to the proxy before connecting.
WebSocketProxy->TurnMode = TurnMode;
WebSocketProxy->bSpeculativeTurn = bSpeculativeTurn;
WebSocketProxy->Connect(AgentID);
}
@@ -128,6 +128,9 @@ void UElevenLabsConversationalAgentComponent::StartListening()
Mic->RegisterComponent();
}
// Always remove existing binding first to prevent duplicate delegates stacking
// up if StartListening is called more than once without a matching StopListening.
Mic->OnAudioCaptured.RemoveAll(this);
Mic->OnAudioCaptured.AddUObject(this,
&UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
Mic->StartCapture();
@@ -147,6 +147,15 @@ void UElevenLabsConversationalAgentComponent::StopListening()
Mic->OnAudioCaptured.RemoveAll(this);
}
// Flush any partially-accumulated mic audio before signalling end-of-turn.
// This ensures the final words aren't discarded just because the last callback
// didn't push the buffer over the MicChunkMinBytes threshold.
if (MicAccumulationBuffer.Num() > 0 && WebSocketProxy && IsConnected())
{
WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
}
MicAccumulationBuffer.Reset();
if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
{
WebSocketProxy->SendUserTurnEnd();
@@ -193,7 +193,12 @@ void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsC
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
OnAgentConnected.Broadcast(Info);
if (bAutoStartListening)
// In Client turn mode (push-to-talk), the user controls listening manually via
// StartListening()/StopListening(). Auto-starting would leave the mic open
// permanently and interfere with push-to-talk — the T-release StopListening()
// would close the mic that auto-start opened, leaving the user unable to speak.
// Only auto-start in Server VAD mode where the mic stays open the whole session.
if (bAutoStartListening && TurnMode == EElevenLabsTurnMode::Server)
{
StartListening();
}
@@ -204,6 +204,7 @@ void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCod
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
bIsListening = false;
bAgentSpeaking = false;
MicAccumulationBuffer.Reset();
OnAgentDisconnected.Broadcast(StatusCode, Reason);
}
@@ -220,12 +220,18 @@ void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<u
void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
OnAgentTranscript.Broadcast(Segment);
if (bEnableUserTranscript)
{
OnAgentTranscript.Broadcast(Segment);
}
}
void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
OnAgentTextResponse.Broadcast(ResponseText);
if (bEnableAgentTextResponse)
{
OnAgentTextResponse.Broadcast(ResponseText);
}
}
void UElevenLabsConversationalAgentComponent::HandleInterrupted()
@@ -321,8 +321,18 @@ void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TAr
{
if (!IsConnected() || !bIsListening) return;
// Convert this callback's samples to int16 bytes and accumulate.
// WASAPI fires every ~5ms (158 bytes at 16kHz). ElevenLabs needs ≥100ms
// (3200 bytes) per chunk for reliable VAD and STT. We hold bytes here
// until we have enough, then send the whole batch in one WebSocket frame.
TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
WebSocketProxy->SendAudioChunk(PCMBytes);
MicAccumulationBuffer.Append(PCMBytes);
if (MicAccumulationBuffer.Num() >= MicChunkMinBytes)
{
WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
MicAccumulationBuffer.Reset();
}
}
TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)

View File

@@ -119,20 +119,22 @@ void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(
// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
const float* InAudio, int32 NumSamples,
const float* InAudio, int32 NumFrames,
int32 InChannels, int32 InSampleRate)
{
const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000
// --- Step 1: Downmix to mono ---
// NOTE: NumFrames is the number of audio frames (not total samples).
// Each frame contains InChannels samples (e.g. 2 for stereo).
// The raw buffer has NumFrames * InChannels total float values.
TArray<float> Mono;
if (InChannels == 1)
{
Mono = TArray<float>(InAudio, NumSamples);
Mono = TArray<float>(InAudio, NumFrames);
}
else
{
const int32 NumFrames = NumSamples / InChannels;
Mono.Reserve(NumFrames);
for (int32 i = 0; i < NumFrames; i++)
{

View File

@@ -72,7 +72,11 @@ void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FS
WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
WebSocket->OnMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsMessage);
// NOTE: We bind ONLY OnRawMessage (binary frames), NOT OnMessage (text frames).
// UE's WebSocket implementation fires BOTH callbacks for the same frame when using
// the libwebsockets backend — binding both causes every audio packet to be decoded
// and played twice. OnRawMessage handles all frame types: raw binary audio AND
// text-framed JSON (detected by peeking first byte for '{').
WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);
WebSocket->Connect();
@@ -94,36 +94,58 @@ void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected."));
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected (state=%d). Audio dropped."),
(int32)ConnectionState);
return;
}
if (PCMData.Num() == 0) return;
UE_LOG(LogElevenLabsWS, Log, TEXT("SendAudioChunk: %d bytes (PCM int16 LE @ 16kHz mono)"), PCMData.Num());
// Track when the last audio chunk was sent for latency measurement.
LastAudioChunkSentTime = FPlatformTime::Seconds();
// ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
// The server's VAD detects silence to determine end-of-turn.
// Do NOT send user_activity here — it resets the turn timeout timer
// and would prevent the server from taking the turn after the user stops speaking.
const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(ElevenLabsMessageType::AudioChunk, Base64Audio);
SendJsonMessage(Msg);
// Send as compact JSON (no pretty-printing) directly, bypassing SendJsonMessage
// to avoid the pretty-printed writer and to keep the payload minimal.
const FString AudioJson = FString::Printf(TEXT("{\"user_audio_chunk\":\"%s\"}"), *Base64Audio);
// Log first chunk fully for debugging
static int32 AudioChunksSent = 0;
AudioChunksSent++;
if (AudioChunksSent <= 2)
{
UE_LOG(LogElevenLabsWS, Log, TEXT(" Audio JSON (first 200 chars): %.200s"), *AudioJson);
}
if (WebSocket.IsValid() && WebSocket->IsConnected())
{
WebSocket->Send(AudioJson);
}
}
void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
// In client turn mode, signal that the user is active/speaking.
// API message: { "type": "user_activity" }
if (!IsConnected()) return;
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserActivity);
SendJsonMessage(Msg);
// No-op: the ElevenLabs API does not require a "start speaking" signal.
// The server's VAD detects speech from the audio chunks we send.
// user_activity is a keep-alive/timeout-reset message and should NOT be
// sent here — it would delay the agent's turn after the user stops.
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn started (audio chunks will follow)."));
}
void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
// In client turn mode, stopping user_activity signals end of user turn.
// The API uses user_activity for ongoing speech; simply stop sending it.
// No explicit end message is required — silence is detected server-side.
// We still log for debug visibility.
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended (client mode) — stopped sending user_activity."));
// No explicit "end turn" message exists in the ElevenLabs API.
// The server detects end-of-speech via VAD when we stop sending audio chunks.
UserTurnEndTime = FPlatformTime::Seconds();
bWaitingForResponse = true;
bFirstAudioResponseLogged = false;
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended — stopped sending audio chunks. Server VAD will detect silence."));
}
void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)
@@ -155,8 +155,79 @@ void UElevenLabsWebSocketProxy::SendInterrupt()
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::OnWsConnected()
{
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
// State stays Connecting until we receive the initiation metadata from the server.
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Sending conversation_initiation_client_data..."));
// State stays Connecting until we receive conversation_initiation_metadata from the server.
// ElevenLabs requires this message immediately after the WebSocket handshake to
// negotiate the session configuration. Without it, the server won't accept audio
// from the client (microphone stays silent from server perspective) and default
// settings are used (higher latency, no intermediate responses).
//
// Structure:
// {
// "type": "conversation_initiation_client_data",
// "conversation_config_override": {
// "agent": {
// "turn": { "turn_timeout": 3, "speculative_turn": true }
// },
// "tts": {
// "optimize_streaming_latency": 3
// }
// },
// "custom_llm_extra_body": {
// "enable_intermediate_response": true
// }
// }
// Configure turn-taking behaviour.
// The ElevenLabs API does NOT have a turn.mode field.
// Turn-taking is controlled by the server's VAD and the turn_* parameters.
// In push-to-talk (Client mode) the user controls the mic; the server still
// uses its VAD to detect the end of speech from the audio chunks it receives.
TSharedPtr<FJsonObject> TurnObj = MakeShareable(new FJsonObject());
// Lower turn_timeout so the agent responds faster after the user stops speaking.
// Default is 7s. In push-to-talk (Client mode), the user explicitly signals
// end-of-turn by releasing the key, so we can use a very short timeout (1s).
if (TurnMode == EElevenLabsTurnMode::Client)
{
TurnObj->SetNumberField(TEXT("turn_timeout"), 1);
}
// Speculative turn: start LLM generation during silence before the VAD is
// fully confident the user finished speaking. Reduces latency by 200-500ms.
if (bSpeculativeTurn)
{
TurnObj->SetBoolField(TEXT("speculative_turn"), true);
}
TSharedPtr<FJsonObject> AgentObj = MakeShareable(new FJsonObject());
AgentObj->SetObjectField(TEXT("turn"), TurnObj);
TSharedPtr<FJsonObject> TtsObj = MakeShareable(new FJsonObject());
TtsObj->SetNumberField(TEXT("optimize_streaming_latency"), 3);
TSharedPtr<FJsonObject> ConversationConfigOverride = MakeShareable(new FJsonObject());
ConversationConfigOverride->SetObjectField(TEXT("agent"), AgentObj);
ConversationConfigOverride->SetObjectField(TEXT("tts"), TtsObj);
// enable_intermediate_response reduces time-to-first-audio by allowing the agent
// to start speaking before it has finished generating the full response.
TSharedPtr<FJsonObject> CustomLlmExtraBody = MakeShareable(new FJsonObject());
CustomLlmExtraBody->SetBoolField(TEXT("enable_intermediate_response"), true);
TSharedPtr<FJsonObject> InitMsg = MakeShareable(new FJsonObject());
InitMsg->SetStringField(TEXT("type"), ElevenLabsMessageType::ConversationClientData);
InitMsg->SetObjectField(TEXT("conversation_config_override"), ConversationConfigOverride);
InitMsg->SetObjectField(TEXT("custom_llm_extra_body"), CustomLlmExtraBody);
// NOTE: We bypass SendJsonMessage() here intentionally.
// SendJsonMessage() guards on WebSocket->IsConnected(), but OnWsConnected fires
// during the handshake before IsConnected() returns true in some UE WS backends.
// We know the socket is open at this point — send directly.
FString InitJson;
TSharedRef<TJsonWriter<>> InitWriter = TJsonWriterFactory<>::Create(&InitJson);
FJsonSerializer::Serialize(InitMsg.ToSharedRef(), InitWriter);
UE_LOG(LogElevenLabsWS, Log, TEXT("Sending initiation: %s"), *InitJson);
WebSocket->Send(InitJson);
}
void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)
@@ -200,20 +200,53 @@ void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
return;
}
// Log every message type received from the server for debugging.
UE_LOG(LogElevenLabsWS, Log, TEXT("Received message type: %s"), *MsgType);
if (MsgType == ElevenLabsMessageType::ConversationInitiation)
{
HandleConversationInitiation(Root);
}
else if (MsgType == ElevenLabsMessageType::AudioResponse)
{
// Log time-to-first-audio: latency between end of user turn and first agent audio.
if (bWaitingForResponse && !bFirstAudioResponseLogged)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
const double LatencyFromLastChunk = (Now - LastAudioChunkSentTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] Time-to-first-audio: %.0f ms (from turn end), %.0f ms (from last chunk sent)"),
LatencyFromTurnEnd, LatencyFromLastChunk);
bFirstAudioResponseLogged = true;
}
HandleAudioResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::UserTranscript)
{
// Log transcription latency.
if (bWaitingForResponse)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] User transcript received: %.0f ms after turn end"),
LatencyFromTurnEnd);
bWaitingForResponse = false;
}
HandleTranscript(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponse)
{
// Log agent text response latency.
if (UserTurnEndTime > 0.0)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] Agent text response: %.0f ms after turn end"),
LatencyFromTurnEnd);
}
HandleAgentResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)

View File

@@ -80,6 +80,29 @@ public:
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
bool bAutoStartListening = true;
/**
* Enable speculative turn: the LLM starts generating a response during
* silence before the VAD is fully confident the user has finished speaking.
* Reduces latency by 200-500ms but may occasionally produce premature responses.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Latency")
bool bSpeculativeTurn = true;
/**
* Forward user speech transcripts (user_transcript events) to the
* OnAgentTranscript delegate. Disable to reduce overhead if you don't
* need to display what the user said.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
bool bEnableUserTranscript = true;
/**
* Forward agent text responses (agent_response events) to the
* OnAgentTextResponse delegate. Disable if you only need audio output.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
bool bEnableAgentTextResponse = true;
// ── Events ────────────────────────────────────────────────────────────────
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
@@ -230,4 +253,11 @@ private:
// consider the agent done speaking.
int32 SilentTickCount = 0;
static constexpr int32 SilenceThresholdTicks = 30; // ~0.5s at 60fps
// ── Microphone accumulation ───────────────────────────────────────────────
// WASAPI fires callbacks every ~5ms (158 bytes at 16kHz 16-bit mono).
// ElevenLabs needs at least ~100ms (3200 bytes) per chunk for reliable VAD/STT.
// We accumulate here and only call SendAudioChunk once enough bytes are ready.
TArray<uint8> MicAccumulationBuffer;
static constexpr int32 MicChunkMinBytes = 3200; // 100ms @ 16kHz 16-bit mono
};

View File

@@ -183,4 +183,22 @@ private:
// Accumulation buffer for multi-fragment binary WebSocket frames.
// ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
TArray<uint8> BinaryFrameBuffer;
// ── Latency tracking ─────────────────────────────────────────────────────
// Timestamp of the last audio chunk sent (user speech).
double LastAudioChunkSentTime = 0.0;
// Timestamp when user turn ended (StopListening).
double UserTurnEndTime = 0.0;
// Whether we are waiting for the first response after user stopped speaking.
bool bWaitingForResponse = false;
// Whether we already logged the first audio response latency for this turn.
bool bFirstAudioResponseLogged = false;
public:
// Set by UElevenLabsConversationalAgentComponent before calling Connect().
// Controls turn_timeout in conversation_initiation_client_data.
EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;
// Speculative turn: start LLM generation during silence before full turn confidence.
bool bSpeculativeTurn = true;
};