Update plugin documentation to v1.1.0

Reflects all bug fixes and new features added since initial release:
- Binary WS frame handling (JSON vs raw PCM discrimination)
- Corrected transcript message type and field names
- Corrected pong format (top-level event_id)
- Corrected client turn mode (user_activity, no explicit end message)
- New SendTextMessage feature documented with Blueprint + C++ examples
- Added Section 13: Changelog (v1.0.0 / v1.1.0)
- Updated audio pipeline diagram for raw binary PCM output path
- Added OnAgentConnected timing note (fires after initiation_metadata)
- Added FTranscriptSegment clarification (speaker always "user")
- Added API key / git workflow note in Security section
- New troubleshooting entries for binary frames and OnAgentConnected
- New "Test without microphone" common pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
j.foucher 2026-02-19 14:01:09 +01:00
parent 483456728d
commit e464cfe288


@@ -1,9 +1,9 @@
# PS_AI_Agent_ElevenLabs — Plugin Documentation
**Engine**: Unreal Engine 5.5
-**Plugin version**: 1.0.0
-**Status**: Beta
-**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/conversational-ai)
+**Plugin version**: 1.1.0
+**Status**: Beta — tested on UE 5.5 Win64, verified connection and audio pipeline
+**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/eleven-agents/quickstart)
---
@@ -24,6 +24,7 @@
10. [Audio Pipeline](#10-audio-pipeline)
11. [Common Patterns](#11-common-patterns)
12. [Troubleshooting](#12-troubleshooting)
+13. [Changelog](#13-changelog)
---
@@ -44,7 +45,7 @@ UElevenLabsMicrophoneCaptureComponent
UElevenLabsConversationalAgentComponent
• Converts float32 → int16 PCM bytes
-Sends via WebSocket to ElevenLabs
+Base64-encodes and sends via WebSocket
│ (wss://api.elevenlabs.io/v1/convai/conversation)
ElevenLabs Conversational AI Agent
@@ -54,7 +55,7 @@ ElevenLabs Conversational AI Agent
UElevenLabsConversationalAgentComponent
-• Receives Base64 PCM audio chunks
+• Receives raw binary PCM audio frames
• Feeds USoundWaveProcedural → UAudioComponent
@@ -66,6 +67,12 @@ Agent voice plays from the Actor's position in the world
- Blueprint-first: all events and controls are exposed to Blueprint
- Real-time bidirectional: audio streams in both directions simultaneously
- Server VAD (default) or push-to-talk
+- Text input supported (no microphone needed for testing)
+### Wire frame protocol notes
+ElevenLabs sends **all WebSocket frames as binary** (not text frames). The plugin handles two binary frame types automatically:
+- **JSON control frames** (start with `{`) — conversation init, transcripts, agent responses, ping/pong
+- **Raw PCM audio frames** (binary) — agent speech audio, played directly via `USoundWaveProcedural`
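Outside UE, the frame-dispatch rule above can be sketched as follows. Names here are illustrative, not the plugin's actual symbols:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the binary-frame discrimination described above: a frame whose
// first byte is '{' is treated as a UTF-8 JSON control message; anything
// else is raw int16 PCM audio for the playback queue.
enum class FrameKind { Json, RawPcm, Empty };

FrameKind ClassifyFrame(const std::vector<uint8_t>& Frame)
{
    if (Frame.empty())   return FrameKind::Empty;
    if (Frame[0] == '{') return FrameKind::Json;  // control frame
    return FrameKind::RawPcm;                     // agent speech audio
}
```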
---
@@ -103,14 +110,16 @@ Go to **Edit → Project Settings → Plugins → ElevenLabs AI Agent**.
| Setting | Description | Required |
|---|---|---|
-| **API Key** | Your ElevenLabs API key from [elevenlabs.io](https://elevenlabs.io) | Yes (unless using Signed URL Mode) |
-| **Agent ID** | Default agent ID. Create agents at [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai) | Yes (unless set per-component) |
+| **API Key** | Your ElevenLabs API key. Find it at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | Yes (unless using Signed URL Mode or a public agent) |
+| **Agent ID** | Default agent ID. Find it in the URL when editing an agent: `elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>` | Yes (unless set per-component) |
| **Signed URL Mode** | Fetch the WS URL from your own backend (keeps key off client). See [Section 9](#9-security--signed-url-mode) | No |
| **Signed URL Endpoint** | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode = true |
| **Custom WebSocket URL** | Override the default `wss://api.elevenlabs.io/...` endpoint (debug only) | No |
-| **Verbose Logging** | Log every WebSocket JSON frame to Output Log | No |
+| **Verbose Logging** | Log every WebSocket frame type and first bytes to Output Log | No |
-> **Security note**: Never ship with the API key hard-coded in a packaged build. Use Signed URL Mode for production, or load the key at runtime from a secure backend.
+> **Security note**: The API key set in Project Settings is saved to `DefaultEngine.ini`. **Never commit this file with the key in it** — strip the `[ElevenLabsSettings]` section before committing. Use Signed URL Mode for production builds.
+> **Finding your Agent ID**: Go to [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai), click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.
---
@@ -135,7 +144,7 @@ Event BeginPlay
└─► [ElevenLabs Agent] Start Conversation
[ElevenLabs Agent] On Agent Connected
-└─► Print String "Connected! ID: " + Conversation Info → Conversation ID
+└─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID
[ElevenLabs Agent] On Agent Text Response
└─► Set Text (UI widget) ← Response Text
@@ -166,6 +175,17 @@ Input Action "Talk" (Released)
└─► [ElevenLabs Agent] Stop Listening
```
+### Step 5 — Testing without a microphone
+Once connected, use **Send Text Message** instead of speaking:
+```
+[ElevenLabs Agent] On Agent Connected
+└─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"
+```
+The agent will reply with audio and text exactly as if it heard you speak.
---
## 5. Quick Start (C++)
@@ -206,7 +226,10 @@ ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
// Start the conversation:
ElevenLabsAgent->StartConversation();
-// Later, to end it:
+// Send a text message (useful for testing without mic):
+ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));
+// Later, to end:
ElevenLabsAgent->EndConversation();
```
@@ -249,17 +272,18 @@ The **main component** — attach this to any Actor that should be able to speak
|---|---|---|---|
| `AgentID` | `FString` | `""` | Agent ID for this actor. Overrides the project-level default when non-empty. |
| `TurnMode` | `EElevenLabsTurnMode` | `Server` | How speaker turns are detected. See [Section 8](#8-turn-modes). |
-| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is ready. |
+| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is connected and ready. |
#### Functions
| Function | Blueprint | Description |
|---|---|---|
-| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once connected. |
+| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once `OnAgentConnected` fires. |
| `EndConversation()` | Callable | Closes the WebSocket, stops mic, stops audio playback. |
-| `StartListening()` | Callable | Starts microphone capture. In Client mode, also sends `user_turn_start` to ElevenLabs. |
-| `StopListening()` | Callable | Stops microphone capture. In Client mode, also sends `user_turn_end`. |
-| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately. |
+| `StartListening()` | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends `user_activity`. |
+| `StopListening()` | Callable | Stops microphone capture. In Client mode, stops sending `user_activity`. |
+| `SendTextMessage(Text)` | Callable | Sends a text message to the agent without using the microphone. Agent replies with full audio + text. Useful for testing. |
+| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| `IsConnected()` | Pure | Returns true if the WebSocket is open and the conversation is active. |
| `IsListening()` | Pure | Returns true if the microphone is currently capturing. |
| `IsAgentSpeaking()` | Pure | Returns true if agent audio is currently playing. |
@@ -270,13 +294,13 @@ The **main component** — attach this to any Actor that should be able to speak
| Event | Parameters | Fired when |
|---|---|---|
-| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation complete. |
+| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation metadata received. Safe to call `SendTextMessage` here. |
| `OnAgentDisconnected` | `int32 StatusCode`, `FString Reason` | WebSocket closed (graceful or remote). |
| `OnAgentError` | `FString ErrorMessage` | Connection or protocol error. |
-| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | Any transcript arrives (user or agent, tentative or final). |
-| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (complements the audio). |
-| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent. |
-| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (agent done speaking). |
+| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | User speech-to-text transcript received (speaker is always `"user"`). |
+| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (mirrors the audio). |
+| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent (audio playback begins). |
+| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (heuristic — agent done speaking). |
| `OnAgentInterrupted` | — | Agent speech was interrupted (by user or by `InterruptAgent()`). |
---
@@ -292,19 +316,19 @@ A lightweight microphone capture component. Managed automatically by `UElevenLab
| Property | Type | Default | Description |
|---|---|---|---|
-| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples. Range: 0.0–4.0. |
+| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples before resampling. Range: 0.0–4.0. |
#### Functions
| Function | Blueprint | Description |
|---|---|---|
-| `StartCapture()` | Callable | Opens the default audio input device and starts streaming. |
+| `StartCapture()` | Callable | Opens the default audio input device and begins streaming. |
| `StopCapture()` | Callable | Stops streaming and closes the device. |
| `IsCapturing()` | Pure | True while actively capturing. |
#### Delegate
-`OnAudioCaptured` — fires on the game thread with `TArray<float>` PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.
+`OnAudioCaptured` — fires on the **game thread** with `TArray<float>` PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.
---
@@ -321,10 +345,11 @@ Low-level WebSocket session manager. Used internally by `UElevenLabsConversation
|---|---|
| `Connect(AgentID, APIKey)` | Open the WS connection. Parameters override project settings when non-empty. |
| `Disconnect()` | Send close frame and tear down the connection. |
-| `SendAudioChunk(PCMData)` | Send raw int16 LE PCM bytes. Called automatically by the agent component. |
-| `SendUserTurnStart()` | Signal start of user speech (Client turn mode only). |
-| `SendUserTurnEnd()` | Signal end of user speech (Client turn mode only). |
-| `SendInterrupt()` | Ask the agent to stop speaking. |
+| `SendAudioChunk(PCMData)` | Send raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
+| `SendTextMessage(Text)` | Send `{"type":"user_message","text":"..."}`. Agent replies as if it heard speech. |
+| `SendUserTurnStart()` | Client turn mode: sends `{"type":"user_activity"}` to signal user is speaking. |
+| `SendUserTurnEnd()` | Client turn mode: stops sending `user_activity` (no explicit message — server detects silence). |
+| `SendInterrupt()` | Ask the agent to stop speaking: sends `{"type":"interrupt"}`. |
| `GetConnectionState()` | Returns `EElevenLabsConnectionState`. |
| `GetConversationInfo()` | Returns `FElevenLabsConversationInfo`. |
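The `user_message` frame from the table can be sketched as plain string assembly. This is illustrative only — a real implementation would use UE's JSON writer (`TJsonWriter`), and this sketch escapes only quotes and backslashes:

```cpp
#include <string>

// Hand-rolled sketch of the user_message frame described in the table above.
// Only '"' and '\\' are escaped; a production path should use a JSON library.
std::string BuildUserMessageFrame(const std::string& Text)
{
    std::string Escaped;
    for (char C : Text)
    {
        if (C == '"' || C == '\\') Escaped += '\\';
        Escaped += C;
    }
    return "{\"type\":\"user_message\",\"text\":\"" + Escaped + "\"}";
}
```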
@@ -336,11 +361,13 @@ Low-level WebSocket session manager. Used internally by `UElevenLabsConversation
```
Disconnected — No active connection
-Connecting — WebSocket handshake in progress
-Connected — Conversation active and ready
+Connecting — WebSocket handshake in progress / awaiting conversation_initiation_metadata
+Connected — Conversation active and ready (fires OnAgentConnected)
Error — Connection or protocol failure
```
+> Note: State remains `Connecting` until the server sends `conversation_initiation_metadata`. `OnAgentConnected` fires on transition to `Connected`.
### EElevenLabsTurnMode
```
@@ -352,15 +379,15 @@ Client — Your code calls StartListening/StopListening to define turns (push-t
```
### FElevenLabsConversationInfo
```
ConversationID FString — Unique session ID assigned by ElevenLabs
-AgentID FString — The agent that responded
+AgentID FString — The agent ID for this session
```
### FElevenLabsTranscriptSegment
```
Text FString — Transcribed text
-Speaker FString — "user" or "agent"
-bIsFinal bool — false while still speaking, true when the turn is complete
+Speaker FString — "user" (agent text comes via OnAgentTextResponse, not transcript)
+bIsFinal bool — Always true for user transcripts (ElevenLabs sends final only)
```
---
@@ -371,31 +398,31 @@ bIsFinal bool — false while still speaking, true when the turn is complet
ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.
-**When to use**: Casual conversation, hands-free interaction.
+**When to use**: Casual conversation, hands-free interaction, natural dialogue.
```
-StartConversation() → mic streams continuously
+StartConversation() → mic streams continuously (if bAutoStartListening = true)
ElevenLabs detects speech / silence automatically
Agent replies when it detects end-of-speech
```
### Client Controlled (push-to-talk)
-Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`.
+Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`. The plugin sends `{"type":"user_activity"}` while the user is speaking; stopping it signals end of turn.
-**When to use**: Noisy environments, precise control, walkie-talkie style.
+**When to use**: Noisy environments, precise control, walkie-talkie style UI.
```
-Input Pressed → StartListening() → sends user_turn_start + begins audio
-Input Released → StopListening() → stops audio + sends user_turn_end
-Agent replies after user_turn_end
+Input Pressed → StartListening() → streams audio + sends user_activity
+Input Released → StopListening() → stops audio (no explicit end message)
+Server detects silence and hands turn to agent
```
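A minimal sketch of this flow, assuming `user_activity` is re-emitted on each audio tick while listening (the actual send cadence is a plugin implementation detail, not documented here):

```cpp
#include <string>
#include <vector>

// Sketch of Client turn mode: while push-to-talk is held, each audio tick
// also emits a user_activity frame; releasing the key stops both — there is
// no explicit end-of-turn message. Names and cadence are illustrative.
struct TurnModeSketch
{
    bool bListening = false;
    std::vector<std::string> SentFrames; // stand-in for the WebSocket

    void StartListening() { bListening = true; }
    void StopListening()  { bListening = false; } // nothing is sent on release

    void AudioTick()
    {
        if (!bListening) return;
        SentFrames.push_back("{\"type\":\"user_activity\"}");
        // ...the Base64 audio chunk frame would also be sent here
    }
};
```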
---
## 9. Security — Signed URL Mode
-By default, the API key is stored in Project Settings (Engine.ini). This is fine for development but **should not be shipped in packaged builds** as the key could be extracted.
+By default, the API key is stored in Project Settings (`DefaultEngine.ini`). This is fine for development but **should not be shipped in packaged builds** as the key could be extracted.
### Production setup
@@ -407,6 +434,13 @@ By default, the API key is stored in Project Settings (Engine.ini). This is fine
```
4. The plugin fetches this URL before connecting — the API key never leaves your server.
+### Development workflow (API key in project settings)
+- Set the key in **Project Settings → Plugins → ElevenLabs AI Agent**
+- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
+- **Strip this section from `DefaultEngine.ini` before every git commit**
+- Each developer sets the key locally — it does not go in version control
---
## 10. Audio Pipeline
@@ -415,40 +449,63 @@ By default, the API key is stored in Project Settings (Engine.ini). This is fine
```
Device (any sample rate, any channels)
-↓ FAudioCapture (UE built-in)
-↓ Callback: float32 interleaved frames
-↓ Downmix to mono (average channels)
+↓ FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
+↓ Callback: const void* → cast to float32 interleaved frames
+↓ Downmix to mono (average all channels)
↓ Resample to 16000 Hz (linear interpolation)
↓ Apply VolumeMultiplier
-↓ Dispatch to Game Thread
-↓ Convert float32 → int16 LE bytes
+↓ Dispatch to Game Thread (AsyncTask)
+↓ Convert float32 → int16 signed, little-endian bytes
↓ Base64 encode
-↓ WebSocket JSON frame: { "user_audio_chunk": "<base64>" }
+Send as binary WebSocket frame: { "user_audio_chunk": "<base64>" }
```
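Stripped of UE types, the downmix → resample → quantize steps of the pipeline above look roughly like this. Function names are illustrative, not the plugin's actual symbols:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Average all channels of interleaved float32 frames down to mono.
std::vector<float> DownmixToMono(const std::vector<float>& Interleaved, int NumChannels)
{
    std::vector<float> Mono(Interleaved.size() / NumChannels);
    for (size_t Frame = 0; Frame < Mono.size(); ++Frame)
    {
        float Sum = 0.f;
        for (int Ch = 0; Ch < NumChannels; ++Ch)
            Sum += Interleaved[Frame * NumChannels + Ch];
        Mono[Frame] = Sum / NumChannels;
    }
    return Mono;
}

// Linear-interpolation resampler, e.g. 48000 Hz -> 16000 Hz.
std::vector<float> ResampleLinear(const std::vector<float>& In, int InRate, int OutRate)
{
    if (In.empty()) return {};
    const size_t OutLen = In.size() * OutRate / InRate;
    std::vector<float> Out(OutLen);
    for (size_t I = 0; I < OutLen; ++I)
    {
        const double Pos  = static_cast<double>(I) * InRate / OutRate;
        const size_t I0   = static_cast<size_t>(Pos);
        const size_t I1   = std::min(I0 + 1, In.size() - 1);
        const double Frac = Pos - I0;
        Out[I] = static_cast<float>(In[I0] * (1.0 - Frac) + In[I1] * Frac);
    }
    return Out;
}

// Apply gain, clamp to [-1, 1], quantize to int16 (native LE on x86/ARM).
std::vector<int16_t> FloatToInt16(const std::vector<float>& In, float Volume)
{
    std::vector<int16_t> Out(In.size());
    for (size_t I = 0; I < In.size(); ++I)
    {
        const float S = std::clamp(In[I] * Volume, -1.f, 1.f);
        Out[I] = static_cast<int16_t>(std::lround(S * 32767.f));
    }
    return Out;
}
```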
### Output (agent → player)
```
-WebSocket JSON frame: { "type": "audio", "audio_event": { "audio_base_64": "..." } }
-↓ Base64 decode → int16 LE PCM bytes
-↓ Enqueue in thread-safe AudioQueue
+Binary WebSocket frame arrives
+↓ Peek first byte:
+• '{' → UTF-8 JSON: parse type field, dispatch to handler
+• other → raw PCM audio bytes
+↓ [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
+↓ Enqueue in thread-safe AudioQueue (FCriticalSection)
↓ USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from queue
↓ UAudioComponent plays from the Actor's world position (3D spatialized)
```
**Audio format** (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.
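The producer/consumer hand-off above can be sketched without UE types. Names are illustrative — the plugin presumably uses `FCriticalSection` and `TArray<uint8>` rather than the standard-library types shown here:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Sketch of the playback queue: the WebSocket thread pushes decoded PCM
// bytes; the procedural-sound underflow callback pops up to the number of
// bytes the mixer asks for. A mutex guards cross-thread access.
class AudioByteQueue
{
public:
    void Push(const std::vector<uint8_t>& Bytes)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        Queue.insert(Queue.end(), Bytes.begin(), Bytes.end());
    }

    // Called on underflow from the audio thread; returns at most MaxBytes.
    std::vector<uint8_t> Pop(size_t MaxBytes)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        const size_t N = std::min(MaxBytes, Queue.size());
        std::vector<uint8_t> Out(Queue.begin(), Queue.begin() + N);
        Queue.erase(Queue.begin(), Queue.begin() + N);
        return Out;
    }

private:
    std::mutex Mutex;
    std::deque<uint8_t> Queue;
};
```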
+### Silence detection heuristic
+`OnAgentStoppedSpeaking` fires when the `AudioQueue` has been empty for **30 consecutive ticks** (~0.5 s at 60 fps). If the agent has natural pauses, increase `SilenceThresholdTicks` in the header:
+```cpp
+static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
+```
---
## 11. Common Patterns
+### Test the connection without a microphone
+```
+BeginPlay → StartConversation()
+OnAgentConnected → SendTextMessage("Hello, introduce yourself")
+OnAgentTextResponse → Print string (confirms text pipeline works)
+OnAgentStartedSpeaking → (confirms audio pipeline works)
+```
### Show subtitles in UI
```
-OnAgentTranscript event:
-├─ Segment → Speaker == "user" → show in player subtitle widget
-├─ Segment → Speaker == "agent" → show in NPC speech bubble
-└─ Segment → bIsFinal == false → show as "..." (in-progress)
+OnAgentTranscript:
+Segment → Text → show in player subtitle widget (speaker always "user")
+OnAgentTextResponse:
+ResponseText → show in NPC speech bubble
```
### Interrupt the agent when the player starts speaking
@@ -456,13 +513,13 @@ OnAgentTranscript event:
In Server VAD mode ElevenLabs handles this automatically. For manual control:
```
-OnAgentStartedSpeaking → store "agent is speaking" flag
+OnAgentStartedSpeaking → set "agent is speaking" flag
Input Action (any) → if agent is speaking → InterruptAgent()
```
### Multiple NPCs with different agents
-Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. Connections are fully independent.
+Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. WebSocket connections are fully independent.
### Only start the conversation when the player is nearby
@@ -474,7 +531,7 @@ On End Overlap
└─► [ElevenLabs Agent] End Conversation
```
-### Adjusting microphone volume
+### Adjust microphone volume
Get the `UElevenLabsMicrophoneCaptureComponent` from the owner and set `VolumeMultiplier`:
@@ -495,37 +552,68 @@ Ensure the plugin is enabled in `.uproject` and the project was recompiled after
### WebSocket connection fails immediately
- Check the **API Key** is set correctly in Project Settings.
-- Check the **Agent ID** exists in your ElevenLabs account.
-- Enable **Verbose Logging** in Project Settings and check the Output Log for the exact WebSocket URL and error.
-- Make sure your machine has internet access and port 443 (WSS) is not blocked.
+- Check the **Agent ID** exists in your ElevenLabs account (find it in the dashboard URL or via `GET /v1/convai/agents`).
+- Enable **Verbose Logging** in Project Settings and check Output Log for the exact WS URL and error.
+- Ensure port 443 (WSS) is not blocked by your firewall.
+### `OnAgentConnected` never fires
+- Connection was made but `conversation_initiation_metadata` not received yet — check Verbose Logging.
+- If you see `"Binary audio frame"` logs but no `"Conversation initiated"` — the initiation JSON frame may be arriving as a non-`{` binary frame. Check the hex prefix logged at Verbose level.
### No audio from the microphone
- Windows may require microphone permission. Check **Settings → Privacy → Microphone**.
-- Try setting `VolumeMultiplier` to `2.0` to rule out a volume issue.
-- Check the Output Log for `"Failed to open default audio capture stream"`.
+- Try setting `VolumeMultiplier` to `2.0` on the `MicrophoneCaptureComponent`.
+- Check Output Log for `"Failed to open default audio capture stream"`.
### Agent audio is choppy or silent
-- The `USoundWaveProcedural` queue may be underflowing. This can happen if audio chunks arrive with long gaps. Check network latency.
-- Ensure no other component is consuming the same `UAudioComponent`.
+- The `USoundWaveProcedural` queue may be underflowing due to network jitter. Check latency.
+- Verify the audio format matches: plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. mp3_44100), audio will sound garbled — check `agent_output_audio_format` in the `conversation_initiation_metadata` via Verbose Logging.
+- Ensure no other component is using the same `UAudioComponent`.
### `OnAgentStoppedSpeaking` fires too early
-The silence detection threshold is 30 ticks (~0.5 s at 60 fps). If the agent has natural pauses in speech, increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:
+Increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:
```cpp
-static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
+static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s at 60fps
```
### Build error: "Plugin AudioCapture not found"
-Make sure the `AudioCapture` plugin is enabled in your project. It should be auto-enabled via the `.uplugin` dependency, but you can also add it manually to `.uproject`:
+Make sure the `AudioCapture` plugin is enabled. It should be auto-enabled via the `.uplugin` dependency, but you can add it manually to `.uproject`:
```json
{ "Name": "AudioCapture", "Enabled": true }
```
+### `"Received unexpected binary WebSocket frame"` in the log
+This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.
---
-*Documentation generated 2026-02-19 — Plugin v1.0.0 — UE 5.5*
+## 13. Changelog
+### v1.1.0 — 2026-02-19
+**Bug fixes:**
+- **Binary WebSocket frames**: ElevenLabs sends all frames as binary (not text). All frames were previously discarded. Now correctly handled — JSON control frames decoded as UTF-8, raw PCM audio frames routed directly to the audio queue.
+- **Transcript message**: Wrong message type (`"transcript"` → `"user_transcript"`), wrong event key (`"transcript_event"` → `"user_transcription_event"`), wrong text field (`"message"` → `"user_transcript"`).
+- **Pong format**: `event_id` was nested inside a `pong_event` object; corrected to a top-level field per API spec.
+- **Client turn mode**: `user_turn_start`/`user_turn_end` are not valid API messages; replaced with `user_activity` (start) and implicit silence detection (end).
+**New features:**
+- `SendTextMessage(Text)` on both `UElevenLabsConversationalAgentComponent` and `UElevenLabsWebSocketProxy` — send text to the agent without a microphone. Useful for testing.
+- Verbose logging shows binary frame hex preview and JSON frame content prefix.
+- Improved JSON parse error log: now shows the first 80 characters of the failing message.
+### v1.0.0 — 2026-02-19
+Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.
+---
+*Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5*