
PS_AI_Agent_ElevenLabs — Plugin Documentation

Engine: Unreal Engine 5.5
Plugin version: 1.1.0
Status: Beta — tested on UE 5.5 Win64; connection and audio pipeline verified
API: ElevenLabs Conversational AI


Table of Contents

  1. Overview
  2. Installation
  3. Project Settings
  4. Quick Start (Blueprint)
  5. Quick Start (C++)
  6. Components Reference
  7. Data Types Reference
  8. Turn Modes
  9. Security — Signed URL Mode
  10. Audio Pipeline
  11. Common Patterns
  12. Troubleshooting
  13. Changelog

1. Overview

This plugin integrates the ElevenLabs Conversational AI Agent API into Unreal Engine 5.5, enabling real-time voice conversations between a player and an NPC (or any Actor).

How it works

Player microphone
      │
      ▼
UElevenLabsMicrophoneCaptureComponent
  • Captures from default audio device
  • Resamples to 16 kHz mono float32
      │
      ▼
UElevenLabsConversationalAgentComponent
  • Converts float32 → int16 PCM bytes
  • Base64-encodes and sends via WebSocket
      │  (wss://api.elevenlabs.io/v1/convai/conversation)
      ▼
ElevenLabs Conversational AI Agent
  • Transcribes speech
  • Runs LLM
  • Synthesizes voice (ElevenLabs TTS)
      │
      ▼
UElevenLabsConversationalAgentComponent
  • Receives raw binary PCM audio frames
  • Feeds USoundWaveProcedural → UAudioComponent
      │
      ▼
Agent voice plays from the Actor's position in the world

Key properties

  • No gRPC, no third-party libraries — uses UE's built-in WebSockets and AudioCapture modules
  • Blueprint-first: all events and controls are exposed to Blueprint
  • Real-time bidirectional: audio streams in both directions simultaneously
  • Server VAD (default) or push-to-talk
  • Text input supported (no microphone needed for testing)

Wire frame protocol notes

ElevenLabs sends all WebSocket frames as binary (not text frames). The plugin handles two binary frame types automatically:

  • JSON control frames (start with {) — conversation init, transcripts, agent responses, ping/pong
  • Raw PCM audio frames (binary) — agent speech audio, played directly via USoundWaveProcedural
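The two-way split can be illustrated with a standalone sketch in plain C++ (std types rather than UE types; `ClassifyFrame` and `EFrameKind` are illustrative names, not actual plugin symbols):

```cpp
#include <cstdint>
#include <vector>

// Classify an incoming binary WebSocket frame the way the plugin does:
// a frame whose first byte is '{' is treated as a UTF-8 JSON control
// frame; anything else is raw int16 PCM audio.
enum class EFrameKind { Json, RawPcm, Empty };

EFrameKind ClassifyFrame(const std::vector<uint8_t>& Bytes)
{
    if (Bytes.empty())
    {
        return EFrameKind::Empty;
    }
    return Bytes[0] == '{' ? EFrameKind::Json : EFrameKind::RawPcm;
}
```

This single-byte peek is cheap and unambiguous in practice, since valid JSON control frames always begin with `{` and raw PCM has no such guarantee of starting with that byte pattern consistently.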

2. Installation

The plugin lives inside the project, not the engine, so no separate install is needed.

Verify it is enabled

Open Unreal/PS_AI_Agent/PS_AI_Agent.uproject and confirm:

{
  "Name": "PS_AI_Agent_ElevenLabs",
  "Enabled": true
}

First compile

Open the project in the UE 5.5 Editor. It will detect the new plugin and ask to recompile — click Yes. Alternatively, compile from the command line:

"C:\Program Files\Epic Games\UE_5.5\Engine\Build\BatchFiles\Build.bat"
    PS_AI_AgentEditor Win64 Development
    "<repo>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject"
    -WaitMutex

3. Project Settings

Go to Edit → Project Settings → Plugins → ElevenLabs AI Agent.

| Setting | Description | Required |
|---|---|---|
| API Key | Your ElevenLabs API key. Find it at elevenlabs.io/app/settings/api-keys | Yes (unless using Signed URL Mode or a public agent) |
| Agent ID | Default agent ID. Find it in the URL when editing an agent: elevenlabs.io/app/conversational-ai/agents/&lt;AGENT_ID&gt; | Yes (unless set per-component) |
| Signed URL Mode | Fetch the WS URL from your own backend (keeps the key off the client). See Section 9 | No |
| Signed URL Endpoint | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode is enabled |
| Custom WebSocket URL | Override the default wss://api.elevenlabs.io/... endpoint (debug only) | No |
| Verbose Logging | Log every WebSocket frame type and its first bytes to the Output Log | No |

Security note: The API key set in Project Settings is saved to DefaultEngine.ini. Never commit this file with the key in it — strip the [ElevenLabsSettings] section before committing. Use Signed URL Mode for production builds.

Finding your Agent ID: Go to elevenlabs.io/app/conversational-ai, click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.


4. Quick Start (Blueprint)

Step 1 — Add the component to an NPC

  1. Open your NPC Blueprint (or any Actor Blueprint).
  2. In the Components panel, click Add → search for ElevenLabs Conversational Agent.
  3. Select the component. In the Details panel you can optionally set a specific Agent ID (overrides the project default).

Step 2 — Set Turn Mode

In the component's Details panel:

  • Server VAD (default): ElevenLabs automatically detects when the player stops speaking. Microphone streams continuously once connected.
  • Client Controlled: You call Start Listening / Stop Listening manually (push-to-talk).

Step 3 — Wire up events in the Event Graph

Event BeginPlay
    └─► [ElevenLabs Agent] Start Conversation

[ElevenLabs Agent] On Agent Connected
    └─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID

[ElevenLabs Agent] On Agent Text Response
    └─► Set Text (UI widget) ← Response Text

[ElevenLabs Agent] On Agent Transcript
    └─► (optional) display live subtitles ← Segment → Text

[ElevenLabs Agent] On Agent Started Speaking
    └─► Play talking animation on NPC

[ElevenLabs Agent] On Agent Stopped Speaking
    └─► Return to idle animation

[ElevenLabs Agent] On Agent Error
    └─► Print String "Error: " + Error Message

Event EndPlay
    └─► [ElevenLabs Agent] End Conversation

Step 4 — Push-to-talk (Client Controlled mode only)

Input Action "Talk" (Pressed)
    └─► [ElevenLabs Agent] Start Listening

Input Action "Talk" (Released)
    └─► [ElevenLabs Agent] Stop Listening

Step 5 — Testing without a microphone

Once connected, use Send Text Message instead of speaking:

[ElevenLabs Agent] On Agent Connected
    └─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"

The agent will reply with audio and text exactly as if it heard you speak.


5. Quick Start (C++)

1. Add the plugin to your module's Build.cs

PrivateDependencyModuleNames.Add("PS_AI_Agent_ElevenLabs");

2. Include and use

#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsDefinitions.h"

// In your Actor's header:
UPROPERTY(VisibleAnywhere)
UElevenLabsConversationalAgentComponent* ElevenLabsAgent;

// In the constructor:
ElevenLabsAgent = CreateDefaultSubobject<UElevenLabsConversationalAgentComponent>(
    TEXT("ElevenLabsAgent"));

// Override Agent ID at runtime (optional):
ElevenLabsAgent->AgentID = TEXT("your_agent_id_here");
ElevenLabsAgent->TurnMode = EElevenLabsTurnMode::Server;
ElevenLabsAgent->bAutoStartListening = true;

// Bind events:
ElevenLabsAgent->OnAgentConnected.AddDynamic(
    this, &AMyNPC::HandleAgentConnected);
ElevenLabsAgent->OnAgentTextResponse.AddDynamic(
    this, &AMyNPC::HandleAgentResponse);
ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
    this, &AMyNPC::PlayTalkingAnimation);

// Start the conversation:
ElevenLabsAgent->StartConversation();

// Send a text message (useful for testing without mic):
ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));

// Later, to end:
ElevenLabsAgent->EndConversation();

3. Callback signatures

UFUNCTION()
void HandleAgentConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogTemp, Log, TEXT("Connected, ConvID=%s"), *Info.ConversationID);
}

UFUNCTION()
void HandleAgentResponse(const FString& ResponseText)
{
    // Display in UI, drive subtitles, etc.
}

UFUNCTION()
void PlayTalkingAnimation()
{
    // Switch to talking anim montage
}

6. Components Reference

UElevenLabsConversationalAgentComponent

The main component — attach this to any Actor that should be able to speak.

Category: ElevenLabs
Inherits from: UActorComponent

Properties

| Property | Type | Default | Description |
|---|---|---|---|
| AgentID | FString | "" | Agent ID for this actor. Overrides the project-level default when non-empty. |
| TurnMode | EElevenLabsTurnMode | Server | How speaker turns are detected. See Section 8. |
| bAutoStartListening | bool | true | If true, starts mic capture automatically once the WebSocket is connected and ready. |

Functions

| Function | Blueprint | Description |
|---|---|---|
| StartConversation() | Callable | Opens the WebSocket connection. If bAutoStartListening is true, mic capture starts once OnAgentConnected fires. |
| EndConversation() | Callable | Closes the WebSocket, stops the mic, stops audio playback. |
| StartListening() | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends user_activity. |
| StopListening() | Callable | Stops microphone capture. In Client mode, stops sending user_activity. |
| SendTextMessage(Text) | Callable | Sends a text message to the agent without using the microphone. The agent replies with full audio + text. Useful for testing. |
| InterruptAgent() | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| IsConnected() | Pure | Returns true if the WebSocket is open and the conversation is active. |
| IsListening() | Pure | Returns true if the microphone is currently capturing. |
| IsAgentSpeaking() | Pure | Returns true if agent audio is currently playing. |
| GetConversationInfo() | Pure | Returns FElevenLabsConversationInfo (ConversationID, AgentID). |
| GetWebSocketProxy() | Pure | Returns the underlying UElevenLabsWebSocketProxy for advanced use. |

Events

| Event | Parameters | Fired when |
|---|---|---|
| OnAgentConnected | FElevenLabsConversationInfo | WebSocket handshake complete and agent initiation metadata received. Safe to call SendTextMessage here. |
| OnAgentDisconnected | int32 StatusCode, FString Reason | WebSocket closed (graceful or remote). |
| OnAgentError | FString ErrorMessage | Connection or protocol error. |
| OnAgentTranscript | FElevenLabsTranscriptSegment | User speech-to-text transcript received (speaker is always "user"). |
| OnAgentTextResponse | FString ResponseText | Final text response from the agent (mirrors the audio). |
| OnAgentStartedSpeaking | (none) | First audio chunk received from the agent (audio playback begins). |
| OnAgentStoppedSpeaking | (none) | Audio queue empty for ~0.5 s (heuristic: agent done speaking). |
| OnAgentInterrupted | (none) | Agent speech was interrupted (by the user or by InterruptAgent()). |

UElevenLabsMicrophoneCaptureComponent

A lightweight microphone capture component. Managed automatically by UElevenLabsConversationalAgentComponent — you only need to use this directly for advanced scenarios (e.g. custom audio routing).

Category: ElevenLabs
Inherits from: UActorComponent

Properties

| Property | Type | Default | Description |
|---|---|---|---|
| VolumeMultiplier | float | 1.0 | Gain applied to captured samples before resampling. Range: 0.0–4.0. |

Functions

| Function | Blueprint | Description |
|---|---|---|
| StartCapture() | Callable | Opens the default audio input device and begins streaming. |
| StopCapture() | Callable | Stops streaming and closes the device. |
| IsCapturing() | Pure | True while actively capturing. |

Delegate

OnAudioCaptured — fires on the game thread with TArray<float> PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.


UElevenLabsWebSocketProxy

Low-level WebSocket session manager. Used internally by UElevenLabsConversationalAgentComponent. Use this directly only if you need fine-grained protocol control.

Inherits from: UObject
Instantiate via: NewObject&lt;UElevenLabsWebSocketProxy&gt;(Outer)

Key functions

| Function | Description |
|---|---|
| Connect(AgentID, APIKey) | Open the WS connection. Parameters override project settings when non-empty. |
| Disconnect() | Send close frame and tear down the connection. |
| SendAudioChunk(PCMData) | Send raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
| SendTextMessage(Text) | Send `{"type":"user_message","text":"..."}`. Agent replies as if it heard speech. |
| SendUserTurnStart() | Client turn mode: sends `{"type":"user_activity"}` to signal the user is speaking. |
| SendUserTurnEnd() | Client turn mode: stops sending user_activity (no explicit message — server detects silence). |
| SendInterrupt() | Ask the agent to stop speaking: sends `{"type":"interrupt"}`. |
| GetConnectionState() | Returns EElevenLabsConnectionState. |
| GetConversationInfo() | Returns FElevenLabsConversationInfo. |

7. Data Types Reference

EElevenLabsConnectionState

Disconnected  — No active connection
Connecting    — WebSocket handshake in progress / awaiting conversation_initiation_metadata
Connected     — Conversation active and ready (fires OnAgentConnected)
Error         — Connection or protocol failure

Note: State remains Connecting until the server sends conversation_initiation_metadata. OnAgentConnected fires on transition to Connected.

EElevenLabsTurnMode

Server  — ElevenLabs Voice Activity Detection decides when the user stops speaking (recommended)
Client  — Your code calls StartListening/StopListening to define turns (push-to-talk)

FElevenLabsConversationInfo

ConversationID  FString  — Unique session ID assigned by ElevenLabs
AgentID         FString  — The agent ID for this session

FElevenLabsTranscriptSegment

Text      FString  — Transcribed text
Speaker   FString  — "user" (agent text comes via OnAgentTextResponse, not transcript)
bIsFinal  bool     — Always true for user transcripts (ElevenLabs sends final only)

8. Turn Modes

Server VAD (default)

ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.

When to use: Casual conversation, hands-free interaction, natural dialogue.

StartConversation()  →  mic streams continuously (if bAutoStartListening = true)
                        ElevenLabs detects speech / silence automatically
                        Agent replies when it detects end-of-speech

Client Controlled (push-to-talk)

Your code explicitly signals turn boundaries with StartListening() / StopListening(). The plugin sends {"type":"user_activity"} while the user is speaking; stopping it signals end of turn.

When to use: Noisy environments, precise control, walkie-talkie style UI.

Input Pressed   →  StartListening()   →  streams audio + sends user_activity
Input Released  →  StopListening()    →  stops audio (no explicit end message)
                                         Server detects silence and hands turn to agent

9. Security — Signed URL Mode

By default, the API key is stored in Project Settings (DefaultEngine.ini). This is fine for development but should not be shipped in packaged builds as the key could be extracted.

Production setup

  1. Enable Signed URL Mode in Project Settings.
  2. Set Signed URL Endpoint to a URL on your own backend (e.g. https://your-server.com/api/elevenlabs-token).
  3. Your backend authenticates the player and calls the ElevenLabs API to generate a signed WebSocket URL, returning:
    { "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..." }
    
  4. The plugin fetches this URL before connecting — the API key never leaves your server.
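Client-side, handling the backend response reduces to pulling the signed_url string out of the JSON body. A deliberately minimal sketch using a naive string scan (the plugin itself would presumably use UE's JSON module; `ExtractSignedUrl` is an illustrative name, and this assumes the simple flat object shown above):

```cpp
#include <string>

// Naive extraction of the "signed_url" value from the backend's JSON
// response. Illustrative only: a real implementation should use a proper
// JSON parser. Returns an empty string if the key or value is missing.
std::string ExtractSignedUrl(const std::string& Json)
{
    const std::string Key = "\"signed_url\"";
    size_t Pos = Json.find(Key);
    if (Pos == std::string::npos) return {};
    // Skip to the opening quote of the value (after the colon).
    Pos = Json.find('"', Json.find(':', Pos + Key.size()));
    if (Pos == std::string::npos) return {};
    const size_t End = Json.find('"', Pos + 1);
    if (End == std::string::npos) return {};
    return Json.substr(Pos + 1, End - Pos - 1);
}
```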

Development workflow (API key in project settings)

  • Set the key in Project Settings → Plugins → ElevenLabs AI Agent
  • UE saves it to DefaultEngine.ini under [/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]
  • Strip this section from DefaultEngine.ini before every git commit
  • Each developer sets the key locally — it does not go in version control

10. Audio Pipeline

Input (player → agent)

Device (any sample rate, any channels)
  ↓  FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
  ↓  Callback: const void* → cast to float32 interleaved frames
  ↓  Downmix to mono (average all channels)
  ↓  Resample to 16000 Hz (linear interpolation)
  ↓  Apply VolumeMultiplier
  ↓  Dispatch to Game Thread (AsyncTask)
  ↓  Convert float32 → int16 signed, little-endian bytes
  ↓  Base64 encode
  ↓  Send as binary WebSocket frame: { "user_audio_chunk": "<base64>" }
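The numeric steps of this input pipeline (mono downmix, linear-interpolation resample, float32 → int16 LE conversion) can be sketched in standalone C++. This is a minimal illustration using std types; the function names are not the plugin's actual symbols:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Downmix interleaved float32 frames to mono by averaging all channels.
// Assumes NumChannels >= 1 and a whole number of frames.
std::vector<float> DownmixToMono(const std::vector<float>& Interleaved, int NumChannels)
{
    std::vector<float> Mono(Interleaved.size() / NumChannels);
    for (size_t Frame = 0; Frame < Mono.size(); ++Frame)
    {
        float Sum = 0.f;
        for (int Ch = 0; Ch < NumChannels; ++Ch)
            Sum += Interleaved[Frame * NumChannels + Ch];
        Mono[Frame] = Sum / NumChannels;
    }
    return Mono;
}

// Naive linear-interpolation resample from SrcRate to DstRate
// (e.g. 48000 -> 16000). Integer math for the output count avoids
// floating-point truncation off-by-one errors.
std::vector<float> ResampleLinear(const std::vector<float>& In, int SrcRate, int DstRate)
{
    if (In.empty()) return {};
    const size_t OutCount = In.size() * static_cast<size_t>(DstRate) / static_cast<size_t>(SrcRate);
    std::vector<float> Out(OutCount);
    for (size_t i = 0; i < OutCount; ++i)
    {
        const double Pos = i * static_cast<double>(SrcRate) / DstRate;
        const size_t I0 = static_cast<size_t>(Pos);
        const size_t I1 = std::min(I0 + 1, In.size() - 1);
        const double Frac = Pos - I0;
        Out[i] = static_cast<float>(In[I0] * (1.0 - Frac) + In[I1] * Frac);
    }
    return Out;
}

// Clamp float [-1, 1] samples and emit int16 little-endian bytes.
std::vector<uint8_t> FloatToInt16LE(const std::vector<float>& In)
{
    std::vector<uint8_t> Bytes;
    Bytes.reserve(In.size() * 2);
    for (float S : In)
    {
        const float Clamped = std::clamp(S, -1.f, 1.f);
        const int16_t V = static_cast<int16_t>(std::lround(Clamped * 32767.f));
        Bytes.push_back(static_cast<uint8_t>(V & 0xFF));        // low byte first (LE)
        Bytes.push_back(static_cast<uint8_t>((V >> 8) & 0xFF)); // high byte
    }
    return Bytes;
}
```

Linear interpolation is the cheapest workable resampler for speech; the resulting bytes are what gets Base64-encoded into the `user_audio_chunk` frame.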

Output (agent → player)

Binary WebSocket frame arrives
  ↓  Peek first byte:
     • '{' → UTF-8 JSON: parse type field, dispatch to handler
     • other → raw PCM audio bytes
  ↓  [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
  ↓  Enqueue in thread-safe AudioQueue (FCriticalSection)
  ↓  USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from queue
  ↓  UAudioComponent plays from the Actor's world position (3D spatialized)

Audio format (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.

Silence detection heuristic

OnAgentStoppedSpeaking fires when the AudioQueue has been empty for 30 consecutive ticks (~0.5 s at 60 fps). If the agent has natural pauses, increase SilenceThresholdTicks in the header:

static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
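The heuristic amounts to a per-tick counter over the audio queue's empty state. A standalone sketch (illustrative names, not the component's actual members):

```cpp
#include <cstdint>

// Fires the "stopped speaking" signal once the audio queue has been
// empty for SilenceThresholdTicks consecutive ticks while the agent
// was speaking.
struct FSilenceDetector
{
    int32_t SilenceThresholdTicks = 30; // ~0.5 s at 60 fps
    int32_t EmptyTicks = 0;
    bool bSpeaking = false;

    // Called once per frame. Returns true on the tick where the
    // "stopped speaking" event should fire.
    bool Tick(bool bQueueHasAudio)
    {
        if (bQueueHasAudio)
        {
            bSpeaking = true;   // audio arrived: agent is (still) speaking
            EmptyTicks = 0;
            return false;
        }
        if (!bSpeaking)
            return false;       // never started speaking, nothing to end
        if (++EmptyTicks >= SilenceThresholdTicks)
        {
            bSpeaking = false;  // queue stayed empty long enough
            EmptyTicks = 0;
            return true;
        }
        return false;
    }
};
```

Resetting the counter on every non-empty tick is what makes short network hiccups invisible while a sustained gap still ends the utterance.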

11. Common Patterns

Test the connection without a microphone

BeginPlay → StartConversation()

OnAgentConnected → SendTextMessage("Hello, introduce yourself")

OnAgentTextResponse → Print string (confirms text pipeline works)
OnAgentStartedSpeaking → (confirms audio pipeline works)

Show subtitles in UI

OnAgentTranscript:
  Segment → Text  → show in player subtitle widget (speaker always "user")

OnAgentTextResponse:
  ResponseText    → show in NPC speech bubble

Interrupt the agent when the player starts speaking

In Server VAD mode ElevenLabs handles this automatically. For manual control:

OnAgentStartedSpeaking  →  set "agent is speaking" flag
Input Action (any)      →  if agent is speaking → InterruptAgent()

Multiple NPCs with different agents

Each NPC Blueprint has its own UElevenLabsConversationalAgentComponent. Set a different AgentID on each component. WebSocket connections are fully independent.

Only start the conversation when the player is nearby

On Begin Overlap (trigger volume around NPC)
  └─► [ElevenLabs Agent] Start Conversation

On End Overlap
  └─► [ElevenLabs Agent] End Conversation

Adjust microphone volume

Get the UElevenLabsMicrophoneCaptureComponent from the owner and set VolumeMultiplier:

UElevenLabsMicrophoneCaptureComponent* Mic =
    GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (Mic) Mic->VolumeMultiplier = 2.0f;

12. Troubleshooting

Plugin doesn't appear in Project Settings

Ensure the plugin is enabled in .uproject and the project was recompiled after adding it.

WebSocket connection fails immediately

  • Check the API Key is set correctly in Project Settings.
  • Check the Agent ID exists in your ElevenLabs account (find it in the dashboard URL or via GET /v1/convai/agents).
  • Enable Verbose Logging in Project Settings and check Output Log for the exact WS URL and error.
  • Ensure port 443 (WSS) is not blocked by your firewall.

OnAgentConnected never fires

  • Connection was made but conversation_initiation_metadata not received yet — check Verbose Logging.
  • If you see "Binary audio frame" logs but no "Conversation initiated" — the initiation JSON frame may be arriving as a non-{ binary frame. Check the hex prefix logged at Verbose level.

No audio from the microphone

  • Windows may require microphone permission. Check Settings → Privacy → Microphone.
  • Try setting VolumeMultiplier to 2.0 on the MicrophoneCaptureComponent.
  • Check Output Log for "Failed to open default audio capture stream".

Agent audio is choppy or silent

  • The USoundWaveProcedural queue may be underflowing due to network jitter. Check latency.
  • Verify the audio format matches: plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. mp3_44100), audio will sound garbled — check agent_output_audio_format in the conversation_initiation_metadata via Verbose Logging.
  • Ensure no other component is using the same UAudioComponent.

OnAgentStoppedSpeaking fires too early

Increase SilenceThresholdTicks in ElevenLabsConversationalAgentComponent.h:

static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s at 60fps

Build error: "Plugin AudioCapture not found"

Make sure the AudioCapture plugin is enabled. It should be auto-enabled via the .uplugin dependency, but you can add it manually to .uproject:

{ "Name": "AudioCapture", "Enabled": true }

"Received unexpected binary WebSocket frame" in the log

This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.


13. Changelog

v1.1.0 — 2026-02-19

Bug fixes:

  • Binary WebSocket frames: ElevenLabs sends all frames as binary (not text). All frames were previously discarded. Now correctly handled — JSON control frames decoded as UTF-8, raw PCM audio frames routed directly to the audio queue.
  • Transcript message: Wrong message type ("transcript" → "user_transcript"), wrong event key ("transcript_event" → "user_transcription_event"), wrong text field ("message" → "user_transcript").
  • Pong format: event_id was nested inside a pong_event object; corrected to top-level field per API spec.
  • Client turn mode: user_turn_start/user_turn_end are not valid API messages; replaced with user_activity (start) and implicit silence (end).

New features:

  • SendTextMessage(Text) on both UElevenLabsConversationalAgentComponent and UElevenLabsWebSocketProxy — send text to the agent without a microphone. Useful for testing.
  • Verbose logging shows binary frame hex preview and JSON frame content prefix.
  • Improved JSON parse error log now shows the first 80 characters of the failing message.

v1.0.0 — 2026-02-19

Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.


Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5