Compare commits


No commits in common. "9f28ed7457e7b59cc041e651730924bd798c173d" and "993a827c7bdda85d0cec0803fee55191835124ac" have entirely different histories.

8 changed files with 30 additions and 417 deletions

View File

@ -43,30 +43,20 @@
## Plugin Status
- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.5.0 — mic audio chunk size fixed: WASAPI 5ms callbacks accumulated to 100ms before sending
- v1.4.0 — push-to-talk fully fixed: bAutoStartListening now ignored in Client turn mode
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- `conversation_initiation_client_data` now sent immediately on WS connect (required for mic + latency)
## Audio Chunk Size — CRITICAL
- WASAPI fires mic callbacks every ~5ms → **158 bytes** at 16kHz 16-bit mono
- ElevenLabs VAD/STT requires **≥3200 bytes (100ms)** per chunk; smaller chunks are silently ignored (see the arithmetic after this list)
- Fix: `MicAccumulationBuffer` in component accumulates chunks; sends only when `>= MicChunkMinBytes` (3200)
- `StopListening()` flushes remainder so final partial chunk is never dropped before end-of-turn
- Connection confirmed working end-to-end; audio receive path functional
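For reference, the arithmetic behind both numbers (16kHz, 16-bit mono = 2 bytes per sample; a sketch using plain `int` in place of UE types):

```
constexpr int SampleRate     = 16000; // Hz
constexpr int BytesPerSample = 2;     // 16-bit mono
constexpr int Bytes5ms   = SampleRate * BytesPerSample * 5   / 1000; // 160 (WASAPI actually delivers ~158 = 79 samples ≈ 4.9ms)
constexpr int Bytes100ms = SampleRate * BytesPerSample * 100 / 1000; // 3200 = MicChunkMinBytes
```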
## ElevenLabs WebSocket Protocol Notes
- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text) — UE fires both for the same frame → double audio bug
- **ALL frames are binary** — `OnRawMessage` handles everything; `OnMessage` (text) never fires
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio (see the sketch after this list)
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode (`client_vad`): send `user_activity` **with every audio chunk** (not just once) — server needs continuous signal to know user is speaking; stopping chunks = silence detected = agent responds
- Client turn mode: `{"type":"user_activity"}` to signal speaking; no explicit end message
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
- **MUST send `conversation_initiation_client_data` immediately after WS connect** — without it, server won't process client audio (mic appears dead)
- `conversation_initiation_client_data` payload: `conversation_config_override.agent.turn.mode`, `conversation_config_override.tts.optimize_streaming_latency`, `custom_llm_extra_body.enable_intermediate_response`
- `enable_intermediate_response: true` in `custom_llm_extra_body` reduces time-to-first-audio latency
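A minimal sketch of the receive-side routing described above (`HandleJsonMessage` and `EnqueuePCMAudio` are placeholder names, not the plugin's actual functions; the parameter list is UE's `IWebSocket::OnRawMessage` signature):

```
void OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining)
{
    // Reassemble fragments until the frame is complete.
    BinaryFrameBuffer.Append(static_cast<const uint8*>(Data), (int32)Size);
    if (BytesRemaining > 0) return;

    // First-byte discrimination: '{' (0x7B) = JSON control message, else raw PCM.
    if (BinaryFrameBuffer.Num() > 0 && BinaryFrameBuffer[0] == '{')
    {
        const FUTF8ToTCHAR Converted((const ANSICHAR*)BinaryFrameBuffer.GetData(), BinaryFrameBuffer.Num());
        HandleJsonMessage(FString(Converted.Length(), Converted.Get()));
    }
    else
    {
        EnqueuePCMAudio(BinaryFrameBuffer); // 16-bit PCM agent audio
    }
    BinaryFrameBuffer.Reset();
}
```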
## API Keys / Secrets
- ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor

View File

@ -189,160 +189,9 @@ Commit: `99017f4`
---
---
## Session 3 — 2026-02-19 (bug fixes from live testing)
### 16. Three Runtime Bugs Fixed (v1.2.0)
User reported after live testing:
1. **AI speaks twice** — every audio response played double
2. **Cannot speak** — mic capture didn't reach ElevenLabs
3. **Latency** — requested `enable_intermediate_response: true`
**Bug 1 Root Cause — Double Audio:**
UE's libwebsockets backend fires **both** `OnMessage()` (text callback) **and** `OnRawMessage()` (binary callback) for the same incoming frame.
We had bound both `WebSocket->OnMessage()` and `WebSocket->OnRawMessage()` in `Connect()`.
Result: every audio frame was decoded and enqueued twice → played twice.
Fix: **Remove `OnMessage` binding entirely.** `OnRawMessage` now handles all frames (JSON control messages peeked via first byte, raw PCM otherwise).
**Bug 2 Root Cause — Mic Silent:**
ElevenLabs requires a `conversation_initiation_client_data` message sent **immediately** after the WebSocket handshake completes. Without it, the server never enters a state where it will accept and process client audio chunks. This is a required session negotiation step, not optional.
Fix: Send `conversation_initiation_client_data` in `OnWsConnected()` before any other message.
**Bug 2 Secondary — Delegate Stacking:**
`StartListening()` called `Mic->OnAudioCaptured.AddUObject(this, ...)` without first removing existing bindings. If called more than once (e.g. after reconnect), delegates stack up and audio is sent multiple times per frame.
Fix: Add `Mic->OnAudioCaptured.RemoveAll(this)` before `AddUObject` in `StartListening()`.
**Bug 3 — Latency:**
Added `"enable_intermediate_response": true` inside `custom_llm_extra_body` of the `conversation_initiation_client_data` message. Also added `optimize_streaming_latency: 3` in `conversation_config_override.tts`.
**Files changed:**
- `ElevenLabsWebSocketProxy.cpp`:
- `Connect()`: removed `OnMessage` binding
- `OnWsConnected()`: now sends full `conversation_initiation_client_data` JSON
- `ElevenLabsConversationalAgentComponent.cpp`:
- `StartListening()`: added `RemoveAll` guard before delegate binding
---
---
## Session 4 — 2026-02-19 (mic still silent — push-to-talk deeper investigation)
### 17. Two More Bugs Found and Fixed (v1.3.0)
User confirmed Bug 1 (double audio) was fixed. Bug 2 (cannot speak) persisted.
**Analysis of log:**
- Blueprint is correct: T Pressed → StartListening, T Released → StopListening (proper push-to-talk)
- Mic opens and closes correctly — audio capture IS happening
- Server never responds to mic input → audio reaching ElevenLabs but being ignored
**Bug A — TurnMode mismatch in conversation_initiation_client_data:**
`OnWsConnected()` hardcoded `"mode": "server_vad"` in the init message regardless of the
component's `TurnMode` setting. User's Blueprint uses Client turn mode (push-to-talk),
so the server was configured for server_vad while the client sent client_vad audio signals.
Fix: Read `TurnMode` field on the proxy (set from the component before `Connect()`).
Translate `EElevenLabsTurnMode::Client``"client_vad"`, Server → `"server_vad"`.
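The translation itself is a one-liner (sketch):
```
const FString ModeString = (TurnMode == EElevenLabsTurnMode::Client)
    ? TEXT("client_vad")
    : TEXT("server_vad");
```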
**Bug B — user_activity never sent continuously:**
In client VAD mode, ElevenLabs requires `user_activity` to be sent **continuously**
alongside every audio chunk to keep the server's VAD aware the user is speaking.
`SendUserTurnStart()` sent it once on key press, but never again during speech.
Without continuous `user_activity`, the server treated the incoming audio as noise.
Fix: In `SendAudioChunk()`, automatically send `user_activity` before each audio chunk
when `TurnMode == Client`. This keeps the signal continuous for the full duration of speech.
When the user releases T, `StopListening()` stops the mic → audio stops → `user_activity`
stops → server detects silence and triggers the agent response.
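As described, the v1.3.0 change in `SendAudioChunk()` amounts to (sketch; `SendUserActivity()` stands in for whatever code sends `{"type":"user_activity"}`):
```
// Before base64-encoding and sending the PCM payload:
if (TurnMode == EElevenLabsTurnMode::Client)
{
    SendUserActivity(); // keeps the server's VAD aware the user is still speaking
}
```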
**Bug C — TurnMode not propagated to proxy:**
`UElevenLabsConversationalAgentComponent` never told the proxy what TurnMode to use.
Added `WebSocketProxy->TurnMode = TurnMode` before `Connect()` in `StartConversation()`.
**Files changed:**
- `ElevenLabsWebSocketProxy.h`: added `public TurnMode` field
- `ElevenLabsWebSocketProxy.cpp`:
- `OnWsConnected()`: use `TurnMode` to set correct mode string in init message
- `SendAudioChunk()`: auto-send `user_activity` before each chunk in Client mode
- `ElevenLabsConversationalAgentComponent.cpp`:
- `StartConversation()`: set `WebSocketProxy->TurnMode = TurnMode` before `Connect()`
---
---
## Session 5 — 2026-02-19 (still can't speak — bAutoStartListening conflict)
### 18. Root Cause Found and Fixed (v1.4.0)
Log analysis revealed the true root cause:
**Exact sequence:**
```
OnConnected → bAutoStartListening=true → StartListening() → bIsListening=true, mic opens
OnAgentStoppedSpeaking → Blueprint calls StartListening() → bIsListening guard → no-op (already open)
User presses T → StartListening() → bIsListening guard → no-op
User releases T → StopListening() → bIsListening=false, mic CLOSES
User presses T → StartListening() → NOW opens mic (was closed)
User releases T → StopListening() → mic closes — but ElevenLabs never got audio
```
**Root cause:** `bAutoStartListening = true` opens the mic on connect and sets `bIsListening = true`.
In Client/push-to-talk mode, every T-press hits the `bIsListening` guard and does nothing.
Every T-release closes the auto-started mic. The mic was never open during actual speech.
**Fix:** `HandleConnected()` now only calls `StartListening()` when `TurnMode == Server`.
In Client mode, `bAutoStartListening` is ignored — the user controls listening via T key.
**File changed:**
- `ElevenLabsConversationalAgentComponent.cpp`:
- `HandleConnected()`: guard `bAutoStartListening` with `TurnMode == Server` check
---
---
## Session 6 — 2026-02-19 (audio chunk size fix)
### 19. Mic Audio Chunk Accumulation (v1.5.0)
**Root cause (from diagnostic log in Session 5):**
Log showed hundreds of `SendAudioChunk: 158 bytes (TurnMode=Client)` lines with zero server responses.
- 158 bytes = 79 samples = ~5ms of audio at 16kHz 16-bit mono
- WASAPI (Windows Audio Session API) fires the `FAudioCapture` callback at its internal buffer period (~5ms)
- ElevenLabs requires a minimum chunk size for its VAD and STT to operate (~100ms / 3200 bytes)
- Tiny 5ms fragments arrived at the server but were silently ignored → agent never responded
**Fix applied:**
Added a `TArray<uint8> MicAccumulationBuffer` member to `UElevenLabsConversationalAgentComponent`.
`OnMicrophoneDataCaptured()` appends each callback's converted bytes and only calls `SendAudioChunk`
when `>= MicChunkMinBytes` (3200 bytes = 100ms) have accumulated.
`StopListening()` flushes any remaining bytes in the buffer before sending `SendUserTurnEnd()`,
so the last partial chunk of speech is never dropped.
`HandleDisconnected()` clears the buffer to prevent stale data on reconnect.
**Files changed:**
- `ElevenLabsConversationalAgentComponent.h`: added `MicAccumulationBuffer` + `MicChunkMinBytes = 3200`
- `ElevenLabsConversationalAgentComponent.cpp`:
- `OnMicrophoneDataCaptured()`: accumulate → send when threshold reached
- `StopListening()`: flush remainder before end-of-turn signal
- `HandleDisconnected()`: clear accumulation buffer
Commit: `91cf5b1`
---
## Next Steps (not done yet)
- [ ] Test v1.5.0 in Editor — verify push-to-talk mic works end-to-end (should be the final fix)
- [ ] Verify mic audio actually reaches ElevenLabs (enable Verbose Logging, test in Editor)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)

View File

@ -1,8 +0,0 @@
[FilterPlugin]
; This section lists additional files which will be packaged along with your plugin. Paths should be listed relative to the root plugin directory, and
; may include "...", "*", and "?" wildcards to match directories, files, and individual characters respectively.
;
; Examples:
; /README.txt
; /Extras/...
; /Binaries/ThirdParty/*.dll

View File

@ -86,10 +86,6 @@ void UElevenLabsConversationalAgentComponent::StartConversation()
&UElevenLabsConversationalAgentComponent::HandleInterrupted);
}
// Pass configuration to the proxy before connecting.
WebSocketProxy->TurnMode = TurnMode;
WebSocketProxy->bSpeculativeTurn = bSpeculativeTurn;
WebSocketProxy->Connect(AgentID);
}
@ -132,9 +128,6 @@ void UElevenLabsConversationalAgentComponent::StartListening()
Mic->RegisterComponent();
}
// Always remove existing binding first to prevent duplicate delegates stacking
// up if StartListening is called more than once without a matching StopListening.
Mic->OnAudioCaptured.RemoveAll(this);
Mic->OnAudioCaptured.AddUObject(this,
&UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
Mic->StartCapture();
@ -154,15 +147,6 @@ void UElevenLabsConversationalAgentComponent::StopListening()
Mic->OnAudioCaptured.RemoveAll(this);
}
// Flush any partially-accumulated mic audio before signalling end-of-turn.
// This ensures the final words aren't discarded just because the last callback
// didn't push the buffer over the MicChunkMinBytes threshold.
if (MicAccumulationBuffer.Num() > 0 && WebSocketProxy && IsConnected())
{
WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
}
MicAccumulationBuffer.Reset();
if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
{
WebSocketProxy->SendUserTurnEnd();
@ -209,12 +193,7 @@ void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsC
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
OnAgentConnected.Broadcast(Info);
// In Client turn mode (push-to-talk), the user controls listening manually via
// StartListening()/StopListening(). Auto-starting would leave the mic open
// permanently and interfere with push-to-talk — the T-release StopListening()
// would close the mic that auto-start opened, leaving the user unable to speak.
// Only auto-start in Server VAD mode where the mic stays open the whole session.
if (bAutoStartListening && TurnMode == EElevenLabsTurnMode::Server)
if (bAutoStartListening)
{
StartListening();
}
@ -225,7 +204,6 @@ void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCod
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
bIsListening = false;
bAgentSpeaking = false;
MicAccumulationBuffer.Reset();
OnAgentDisconnected.Broadcast(StatusCode, Reason);
}
@ -242,18 +220,12 @@ void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<u
void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
if (bEnableUserTranscript)
{
OnAgentTranscript.Broadcast(Segment);
}
OnAgentTranscript.Broadcast(Segment);
}
void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
if (bEnableAgentTextResponse)
{
OnAgentTextResponse.Broadcast(ResponseText);
}
OnAgentTextResponse.Broadcast(ResponseText);
}
void UElevenLabsConversationalAgentComponent::HandleInterrupted()
@ -349,18 +321,8 @@ void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TAr
{
if (!IsConnected() || !bIsListening) return;
// Convert this callback's samples to int16 bytes and accumulate.
// WASAPI fires every ~5ms (158 bytes at 16kHz). ElevenLabs needs ≥100ms
// (3200 bytes) per chunk for reliable VAD and STT. We hold bytes here
// until we have enough, then send the whole batch in one WebSocket frame.
TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
MicAccumulationBuffer.Append(PCMBytes);
if (MicAccumulationBuffer.Num() >= MicChunkMinBytes)
{
WebSocketProxy->SendAudioChunk(MicAccumulationBuffer);
MicAccumulationBuffer.Reset();
}
WebSocketProxy->SendAudioChunk(PCMBytes);
}
TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)
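(The hunk is cut off at this signature. For reference, a float → int16 PCM converter is typically a clamp-and-scale loop; a sketch, not necessarily the plugin's exact body:)

```
TArray<uint8> Out;
Out.Reserve(FloatPCM.Num() * sizeof(int16));
for (const float Sample : FloatPCM)
{
    const int16 Value = (int16)FMath::RoundToInt(FMath::Clamp(Sample, -1.0f, 1.0f) * 32767.0f);
    Out.Add((uint8)(Value & 0xFF));        // little-endian, matching "PCM int16 LE" in the send log
    Out.Add((uint8)((Value >> 8) & 0xFF));
}
return Out;
```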

View File

@ -119,22 +119,20 @@ void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(
// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
const float* InAudio, int32 NumFrames,
const float* InAudio, int32 NumSamples,
int32 InChannels, int32 InSampleRate)
{
const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000
// --- Step 1: Downmix to mono ---
// NOTE: NumFrames is the number of audio frames (not total samples).
// Each frame contains InChannels samples (e.g. 2 for stereo).
// The raw buffer has NumFrames * InChannels total float values.
TArray<float> Mono;
if (InChannels == 1)
{
Mono = TArray<float>(InAudio, NumFrames);
Mono = TArray<float>(InAudio, NumSamples);
}
else
{
const int32 NumFrames = NumSamples / InChannels;
Mono.Reserve(NumFrames);
for (int32 i = 0; i < NumFrames; i++)
{

View File

@ -72,11 +72,7 @@ void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FS
WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
// NOTE: We bind ONLY OnRawMessage (binary frames), NOT OnMessage (text frames).
// UE's WebSocket implementation fires BOTH callbacks for the same frame when using
// the libwebsockets backend — binding both causes every audio packet to be decoded
// and played twice. OnRawMessage handles all frame types: raw binary audio AND
// text-framed JSON (detected by peeking first byte for '{').
WebSocket->OnMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsMessage);
WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);
WebSocket->Connect();
@ -98,58 +94,36 @@ void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected (state=%d). Audio dropped."),
(int32)ConnectionState);
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected."));
return;
}
if (PCMData.Num() == 0) return;
UE_LOG(LogElevenLabsWS, Log, TEXT("SendAudioChunk: %d bytes (PCM int16 LE @ 16kHz mono)"), PCMData.Num());
// Track when the last audio chunk was sent for latency measurement.
LastAudioChunkSentTime = FPlatformTime::Seconds();
// ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
// The server's VAD detects silence to determine end-of-turn.
// Do NOT send user_activity here — it resets the turn timeout timer
// and would prevent the server from taking the turn after the user stops speaking.
const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());
// Send as compact JSON (no pretty-printing) directly, bypassing SendJsonMessage
// to avoid the pretty-printed writer and to keep the payload minimal.
const FString AudioJson = FString::Printf(TEXT("{\"user_audio_chunk\":\"%s\"}"), *Base64Audio);
// Log first chunk fully for debugging
static int32 AudioChunksSent = 0;
AudioChunksSent++;
if (AudioChunksSent <= 2)
{
UE_LOG(LogElevenLabsWS, Log, TEXT(" Audio JSON (first 200 chars): %.200s"), *AudioJson);
}
if (WebSocket.IsValid() && WebSocket->IsConnected())
{
WebSocket->Send(AudioJson);
}
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(ElevenLabsMessageType::AudioChunk, Base64Audio);
SendJsonMessage(Msg);
}
void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
// No-op: the ElevenLabs API does not require a "start speaking" signal.
// The server's VAD detects speech from the audio chunks we send.
// user_activity is a keep-alive/timeout-reset message and should NOT be
// sent here — it would delay the agent's turn after the user stops.
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn started (audio chunks will follow)."));
// In client turn mode, signal that the user is active/speaking.
// API message: { "type": "user_activity" }
if (!IsConnected()) return;
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserActivity);
SendJsonMessage(Msg);
}
void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
// No explicit "end turn" message exists in the ElevenLabs API.
// The server detects end-of-speech via VAD when we stop sending audio chunks.
UserTurnEndTime = FPlatformTime::Seconds();
bWaitingForResponse = true;
bFirstAudioResponseLogged = false;
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended — stopped sending audio chunks. Server VAD will detect silence."));
// In client turn mode, stopping user_activity signals end of user turn.
// The API uses user_activity for ongoing speech; simply stop sending it.
// No explicit end message is required — silence is detected server-side.
// We still log for debug visibility.
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended (client mode) — stopped sending user_activity."));
}
void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)
@ -181,79 +155,8 @@ void UElevenLabsWebSocketProxy::SendInterrupt()
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::OnWsConnected()
{
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Sending conversation_initiation_client_data..."));
// State stays Connecting until we receive conversation_initiation_metadata from the server.
// ElevenLabs requires this message immediately after the WebSocket handshake to
// negotiate the session configuration. Without it, the server won't accept audio
// from the client (microphone stays silent from server perspective) and default
// settings are used (higher latency, no intermediate responses).
//
// Structure:
// {
// "type": "conversation_initiation_client_data",
// "conversation_config_override": {
// "agent": {
// "turn": { "turn_timeout": 3, "speculative_turn": true }
// },
// "tts": {
// "optimize_streaming_latency": 3
// }
// },
// "custom_llm_extra_body": {
// "enable_intermediate_response": true
// }
// }
// Configure turn-taking behaviour.
// The ElevenLabs API does NOT have a turn.mode field.
// Turn-taking is controlled by the server's VAD and the turn_* parameters.
// In push-to-talk (Client mode) the user controls the mic; the server still
// uses its VAD to detect the end of speech from the audio chunks it receives.
TSharedPtr<FJsonObject> TurnObj = MakeShareable(new FJsonObject());
// Lower turn_timeout so the agent responds faster after the user stops speaking.
// Default is 7s. In push-to-talk (Client mode), the user explicitly signals
// end-of-turn by releasing the key, so we can use a very short timeout (1s).
if (TurnMode == EElevenLabsTurnMode::Client)
{
TurnObj->SetNumberField(TEXT("turn_timeout"), 1);
}
// Speculative turn: start LLM generation during silence before the VAD is
// fully confident the user finished speaking. Reduces latency by 200-500ms.
if (bSpeculativeTurn)
{
TurnObj->SetBoolField(TEXT("speculative_turn"), true);
}
TSharedPtr<FJsonObject> AgentObj = MakeShareable(new FJsonObject());
AgentObj->SetObjectField(TEXT("turn"), TurnObj);
TSharedPtr<FJsonObject> TtsObj = MakeShareable(new FJsonObject());
TtsObj->SetNumberField(TEXT("optimize_streaming_latency"), 3);
TSharedPtr<FJsonObject> ConversationConfigOverride = MakeShareable(new FJsonObject());
ConversationConfigOverride->SetObjectField(TEXT("agent"), AgentObj);
ConversationConfigOverride->SetObjectField(TEXT("tts"), TtsObj);
// enable_intermediate_response reduces time-to-first-audio by allowing the agent
// to start speaking before it has finished generating the full response.
TSharedPtr<FJsonObject> CustomLlmExtraBody = MakeShareable(new FJsonObject());
CustomLlmExtraBody->SetBoolField(TEXT("enable_intermediate_response"), true);
TSharedPtr<FJsonObject> InitMsg = MakeShareable(new FJsonObject());
InitMsg->SetStringField(TEXT("type"), ElevenLabsMessageType::ConversationClientData);
InitMsg->SetObjectField(TEXT("conversation_config_override"), ConversationConfigOverride);
InitMsg->SetObjectField(TEXT("custom_llm_extra_body"), CustomLlmExtraBody);
// NOTE: We bypass SendJsonMessage() here intentionally.
// SendJsonMessage() guards on WebSocket->IsConnected(), but OnWsConnected fires
// during the handshake before IsConnected() returns true in some UE WS backends.
// We know the socket is open at this point — send directly.
FString InitJson;
TSharedRef<TJsonWriter<>> InitWriter = TJsonWriterFactory<>::Create(&InitJson);
FJsonSerializer::Serialize(InitMsg.ToSharedRef(), InitWriter);
UE_LOG(LogElevenLabsWS, Log, TEXT("Sending initiation: %s"), *InitJson);
WebSocket->Send(InitJson);
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
// State stays Connecting until we receive the initiation metadata from the server.
}
void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)
@ -297,53 +200,20 @@ void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
return;
}
// Log every message type received from the server for debugging.
UE_LOG(LogElevenLabsWS, Log, TEXT("Received message type: %s"), *MsgType);
if (MsgType == ElevenLabsMessageType::ConversationInitiation)
{
HandleConversationInitiation(Root);
}
else if (MsgType == ElevenLabsMessageType::AudioResponse)
{
// Log time-to-first-audio: latency between end of user turn and first agent audio.
if (bWaitingForResponse && !bFirstAudioResponseLogged)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
const double LatencyFromLastChunk = (Now - LastAudioChunkSentTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] Time-to-first-audio: %.0f ms (from turn end), %.0f ms (from last chunk sent)"),
LatencyFromTurnEnd, LatencyFromLastChunk);
bFirstAudioResponseLogged = true;
}
HandleAudioResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::UserTranscript)
{
// Log transcription latency.
if (bWaitingForResponse)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] User transcript received: %.0f ms after turn end"),
LatencyFromTurnEnd);
bWaitingForResponse = false;
}
HandleTranscript(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponse)
{
// Log agent text response latency.
if (UserTurnEndTime > 0.0)
{
const double Now = FPlatformTime::Seconds();
const double LatencyFromTurnEnd = (Now - UserTurnEndTime) * 1000.0;
UE_LOG(LogElevenLabsWS, Warning,
TEXT("[LATENCY] Agent text response: %.0f ms after turn end"),
LatencyFromTurnEnd);
}
HandleAgentResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)

View File

@ -80,29 +80,6 @@ public:
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
bool bAutoStartListening = true;
/**
* Enable speculative turn: the LLM starts generating a response during
* silence before the VAD is fully confident the user has finished speaking.
* Reduces latency by 200-500ms but may occasionally produce premature responses.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Latency")
bool bSpeculativeTurn = true;
/**
* Forward user speech transcripts (user_transcript events) to the
* OnAgentTranscript delegate. Disable to reduce overhead if you don't
* need to display what the user said.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
bool bEnableUserTranscript = true;
/**
* Forward agent text responses (agent_response events) to the
* OnAgentTextResponse delegate. Disable if you only need audio output.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Events")
bool bEnableAgentTextResponse = true;
// ── Events ────────────────────────────────────────────────────────────────
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
@ -253,11 +230,4 @@ private:
// consider the agent done speaking.
int32 SilentTickCount = 0;
static constexpr int32 SilenceThresholdTicks = 30; // ~0.5s at 60fps
// ── Microphone accumulation ───────────────────────────────────────────────
// WASAPI fires callbacks every ~5ms (158 bytes at 16kHz 16-bit mono).
// ElevenLabs needs at least ~100ms (3200 bytes) per chunk for reliable VAD/STT.
// We accumulate here and only call SendAudioChunk once enough bytes are ready.
TArray<uint8> MicAccumulationBuffer;
static constexpr int32 MicChunkMinBytes = 3200; // 100ms @ 16kHz 16-bit mono
};

View File

@ -183,22 +183,4 @@ private:
// Accumulation buffer for multi-fragment binary WebSocket frames.
// ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
TArray<uint8> BinaryFrameBuffer;
// ── Latency tracking ─────────────────────────────────────────────────────
// Timestamp of the last audio chunk sent (user speech).
double LastAudioChunkSentTime = 0.0;
// Timestamp when user turn ended (StopListening).
double UserTurnEndTime = 0.0;
// Whether we are waiting for the first response after user stopped speaking.
bool bWaitingForResponse = false;
// Whether we already logged the first audio response latency for this turn.
bool bFirstAudioResponseLogged = false;
public:
// Set by UElevenLabsConversationalAgentComponent before calling Connect().
// Controls turn_timeout in conversation_initiation_client_data.
EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;
// Speculative turn: start LLM generation during silence before full turn confidence.
bool bSpeculativeTurn = true;
};