# PS_AI_Agent_ElevenLabs — Plugin Documentation

**Engine**: Unreal Engine 5.5
**Plugin version**: 1.1.0
**Status**: Beta — tested on UE 5.5 Win64, verified connection and audio pipeline
**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/eleven-agents/quickstart)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Installation](#2-installation)
3. [Project Settings](#3-project-settings)
4. [Quick Start (Blueprint)](#4-quick-start-blueprint)
5. [Quick Start (C++)](#5-quick-start-c)
6. [Components Reference](#6-components-reference)
   - [UElevenLabsConversationalAgentComponent](#uelevenlabsconversationalagentcomponent)
   - [UElevenLabsMicrophoneCaptureComponent](#uelevenlabsmicrophonecapturecomponent)
   - [UElevenLabsWebSocketProxy](#uelevenlabswebsocketproxy)
7. [Data Types Reference](#7-data-types-reference)
8. [Turn Modes](#8-turn-modes)
9. [Security — Signed URL Mode](#9-security--signed-url-mode)
10. [Audio Pipeline](#10-audio-pipeline)
11. [Common Patterns](#11-common-patterns)
12. [Troubleshooting](#12-troubleshooting)
13. [Changelog](#13-changelog)

---

## 1. Overview

This plugin integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5, enabling real-time voice conversations between a player and an NPC (or any Actor).
### How it works

```
Player microphone
        │
        ▼
UElevenLabsMicrophoneCaptureComponent
  • Captures from the default audio device
  • Resamples to 16 kHz mono float32
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Converts float32 → int16 PCM bytes
  • Base64-encodes and sends via WebSocket
        │  (wss://api.elevenlabs.io/v1/convai/conversation)
        ▼
ElevenLabs Conversational AI Agent
  • Transcribes speech
  • Runs LLM
  • Synthesizes voice (ElevenLabs TTS)
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Receives raw binary PCM audio frames
  • Feeds USoundWaveProcedural → UAudioComponent
        │
        ▼
Agent voice plays from the Actor's position in the world
```

### Key properties

- No gRPC, no third-party libraries — uses UE's built-in `WebSockets` and `AudioCapture` modules
- Blueprint-first: all events and controls are exposed to Blueprint
- Real-time bidirectional: audio streams in both directions simultaneously
- Server VAD (default) or push-to-talk
- Text input supported (no microphone needed for testing)

### Wire protocol notes

ElevenLabs sends **all WebSocket frames as binary** (not text frames). The plugin handles two binary frame types automatically:

- **JSON control frames** (start with `{`) — conversation init, transcripts, agent responses, ping/pong
- **Raw PCM audio frames** (binary) — agent speech audio, played directly via `USoundWaveProcedural`

---

## 2. Installation

The plugin lives inside the project, not the engine, so no separate install is needed.

### Verify it is enabled

Open `Unreal/PS_AI_Agent/PS_AI_Agent.uproject` and confirm:

```json
{
  "Name": "PS_AI_Agent_ElevenLabs",
  "Enabled": true
}
```

### First compile

Open the project in the UE 5.5 Editor. It will detect the new plugin and ask to recompile — click **Yes**. Alternatively, compile from the command line:

```
"C:\Program Files\Epic Games\UE_5.5\Engine\Build\BatchFiles\Build.bat" PS_AI_AgentEditor Win64 Development "/Unreal/PS_AI_Agent/PS_AI_Agent.uproject" -WaitMutex
```

---

## 3. Project Settings

Go to **Edit → Project Settings → Plugins → ElevenLabs AI Agent**.

| Setting | Description | Required |
|---|---|---|
| **API Key** | Your ElevenLabs API key. Find it at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | Yes (unless using Signed URL Mode or a public agent) |
| **Agent ID** | Default agent ID. Find it in the URL when editing an agent: `elevenlabs.io/app/conversational-ai/agents/` | Yes (unless set per-component) |
| **Signed URL Mode** | Fetch the WS URL from your own backend (keeps the key off the client). See [Section 9](#9-security--signed-url-mode) | No |
| **Signed URL Endpoint** | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode = true |
| **Custom WebSocket URL** | Override the default `wss://api.elevenlabs.io/...` endpoint (debug only) | No |
| **Verbose Logging** | Log every WebSocket frame type and its first bytes to the Output Log | No |

> **Security note**: The API key set in Project Settings is saved to `DefaultEngine.ini`. **Never commit this file with the key in it** — strip the ElevenLabs settings section before committing. Use Signed URL Mode for production builds.

> **Finding your Agent ID**: Go to [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai), click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.

---

## 4. Quick Start (Blueprint)

### Step 1 — Add the component to an NPC

1. Open your NPC Blueprint (or any Actor Blueprint).
2. In the **Components** panel, click **Add** → search for **ElevenLabs Conversational Agent**.
3. Select the component. In the **Details** panel you can optionally set a specific **Agent ID** (overrides the project default).

### Step 2 — Set Turn Mode

In the component's **Details** panel:

- **Server VAD** (default): ElevenLabs automatically detects when the player stops speaking. The microphone streams continuously once connected.
- **Client Controlled**: You call `Start Listening` / `Stop Listening` manually (push-to-talk).

### Step 3 — Wire up events in the Event Graph

```
Event BeginPlay
 └─► [ElevenLabs Agent] Start Conversation

[ElevenLabs Agent] On Agent Connected
 └─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID

[ElevenLabs Agent] On Agent Text Response
 └─► Set Text (UI widget) ← Response Text

[ElevenLabs Agent] On Agent Transcript
 └─► (optional) display live subtitles ← Segment → Text

[ElevenLabs Agent] On Agent Started Speaking
 └─► Play talking animation on NPC

[ElevenLabs Agent] On Agent Stopped Speaking
 └─► Return to idle animation

[ElevenLabs Agent] On Agent Error
 └─► Print String "Error: " + Error Message

Event EndPlay
 └─► [ElevenLabs Agent] End Conversation
```

### Step 4 — Push-to-talk (Client Controlled mode only)

```
Input Action "Talk" (Pressed)
 └─► [ElevenLabs Agent] Start Listening

Input Action "Talk" (Released)
 └─► [ElevenLabs Agent] Stop Listening
```

### Step 5 — Testing without a microphone

Once connected, use **Send Text Message** instead of speaking:

```
[ElevenLabs Agent] On Agent Connected
 └─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"
```

The agent will reply with audio and text exactly as if it had heard you speak.

---

## 5. Quick Start (C++)

### 1. Add the plugin to your module's Build.cs

```csharp
PrivateDependencyModuleNames.Add("PS_AI_Agent_ElevenLabs");
```

### 2. Include and use

```cpp
#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsDefinitions.h"

// In your Actor's header:
UPROPERTY(VisibleAnywhere)
UElevenLabsConversationalAgentComponent* ElevenLabsAgent;

// In the constructor:
ElevenLabsAgent = CreateDefaultSubobject<UElevenLabsConversationalAgentComponent>(
    TEXT("ElevenLabsAgent"));

// Override Agent ID at runtime (optional):
ElevenLabsAgent->AgentID = TEXT("your_agent_id_here");
ElevenLabsAgent->TurnMode = EElevenLabsTurnMode::Server;
ElevenLabsAgent->bAutoStartListening = true;

// Bind events:
ElevenLabsAgent->OnAgentConnected.AddDynamic(
    this, &AMyNPC::HandleAgentConnected);
ElevenLabsAgent->OnAgentTextResponse.AddDynamic(
    this, &AMyNPC::HandleAgentResponse);
ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
    this, &AMyNPC::PlayTalkingAnimation);

// Start the conversation:
ElevenLabsAgent->StartConversation();

// Send a text message (useful for testing without a mic):
ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));

// Later, to end:
ElevenLabsAgent->EndConversation();
```

### 3. Callback signatures

```cpp
UFUNCTION()
void HandleAgentConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogTemp, Log, TEXT("Connected, ConvID=%s"), *Info.ConversationID);
}

UFUNCTION()
void HandleAgentResponse(const FString& ResponseText)
{
    // Display in UI, drive subtitles, etc.
}

UFUNCTION()
void PlayTalkingAnimation()
{
    // Switch to talking anim montage
}
```

---

## 6. Components Reference

### UElevenLabsConversationalAgentComponent

The **main component** — attach this to any Actor that should be able to speak.

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `AgentID` | `FString` | `""` | Agent ID for this actor. Overrides the project-level default when non-empty. |
| `TurnMode` | `EElevenLabsTurnMode` | `Server` | How speaker turns are detected. See [Section 8](#8-turn-modes). |
| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is connected and ready. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once `OnAgentConnected` fires. |
| `EndConversation()` | Callable | Closes the WebSocket, stops the mic, and stops audio playback. |
| `StartListening()` | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends `user_activity`. |
| `StopListening()` | Callable | Stops microphone capture. In Client mode, stops sending `user_activity`. |
| `SendTextMessage(Text)` | Callable | Sends a text message to the agent without using the microphone. The agent replies with full audio + text. Useful for testing. |
| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| `IsConnected()` | Pure | Returns true if the WebSocket is open and the conversation is active. |
| `IsListening()` | Pure | Returns true if the microphone is currently capturing. |
| `IsAgentSpeaking()` | Pure | Returns true if agent audio is currently playing. |
| `GetConversationInfo()` | Pure | Returns `FElevenLabsConversationInfo` (ConversationID, AgentID). |
| `GetWebSocketProxy()` | Pure | Returns the underlying `UElevenLabsWebSocketProxy` for advanced use. |

#### Events

| Event | Parameters | Fired when |
|---|---|---|
| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation metadata received. Safe to call `SendTextMessage` here. |
| `OnAgentDisconnected` | `int32 StatusCode`, `FString Reason` | WebSocket closed (graceful or remote). |
| `OnAgentError` | `FString ErrorMessage` | Connection or protocol error. |
| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | User speech-to-text transcript received (speaker is always `"user"`). |
| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (mirrors the audio). |
| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent (audio playback begins). |
| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (heuristic — agent done speaking). |
| `OnAgentInterrupted` | — | Agent speech was interrupted (by the user or by `InterruptAgent()`). |

---

### UElevenLabsMicrophoneCaptureComponent

A lightweight microphone capture component. Managed automatically by `UElevenLabsConversationalAgentComponent` — use it directly only for advanced scenarios (e.g. custom audio routing).

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples before resampling. Range: 0.0 – 4.0. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartCapture()` | Callable | Opens the default audio input device and begins streaming. |
| `StopCapture()` | Callable | Stops streaming and closes the device. |
| `IsCapturing()` | Pure | True while actively capturing. |

#### Delegate

`OnAudioCaptured` — fires on the **game thread** with a `TArray<float>` of PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.

---

### UElevenLabsWebSocketProxy

Low-level WebSocket session manager, used internally by `UElevenLabsConversationalAgentComponent`. Use it directly only if you need fine-grained protocol control.

**Inherits from**: `UObject`
**Instantiate via**: `NewObject<UElevenLabsWebSocketProxy>(Outer)`

#### Key functions

| Function | Description |
|---|---|
| `Connect(AgentID, APIKey)` | Open the WS connection. Parameters override project settings when non-empty. |
| `Disconnect()` | Send a close frame and tear down the connection. |
| `SendAudioChunk(PCMData)` | Send raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
| `SendTextMessage(Text)` | Send `{"type":"user_message","text":"..."}`. The agent replies as if it had heard speech. |
| `SendUserTurnStart()` | Client turn mode: sends `{"type":"user_activity"}` to signal the user is speaking. |
| `SendUserTurnEnd()` | Client turn mode: stops sending `user_activity` (no explicit message — the server detects silence). |
| `SendInterrupt()` | Ask the agent to stop speaking: sends `{"type":"interrupt"}`. |
| `GetConnectionState()` | Returns `EElevenLabsConnectionState`. |
| `GetConversationInfo()` | Returns `FElevenLabsConversationInfo`. |

---

## 7. Data Types Reference

### EElevenLabsConnectionState

```
Disconnected — No active connection
Connecting   — WebSocket handshake in progress / awaiting conversation_initiation_metadata
Connected    — Conversation active and ready (fires OnAgentConnected)
Error        — Connection or protocol failure
```

> Note: The state remains `Connecting` until the server sends `conversation_initiation_metadata`. `OnAgentConnected` fires on the transition to `Connected`.

### EElevenLabsTurnMode

```
Server — ElevenLabs Voice Activity Detection decides when the user stops speaking (recommended)
Client — Your code calls StartListening/StopListening to define turns (push-to-talk)
```

### FElevenLabsConversationInfo

```
ConversationID  FString — Unique session ID assigned by ElevenLabs
AgentID         FString — The agent ID for this session
```

### FElevenLabsTranscriptSegment

```
Text      FString — Transcribed text
Speaker   FString — "user" (agent text comes via OnAgentTextResponse, not transcript)
bIsFinal  bool    — Always true for user transcripts (ElevenLabs sends final only)
```

---

## 8. Turn Modes

### Server VAD (default)

ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.

**When to use**: Casual conversation, hands-free interaction, natural dialogue.
```
StartConversation() → mic streams continuously (if bAutoStartListening = true)
ElevenLabs detects speech / silence automatically
Agent replies when it detects end-of-speech
```

### Client Controlled (push-to-talk)

Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`. The plugin sends `{"type":"user_activity"}` while the user is speaking; stopping it signals the end of the turn.

**When to use**: Noisy environments, precise control, walkie-talkie-style UI.

```
Input Pressed  → StartListening() → streams audio + sends user_activity
Input Released → StopListening()  → stops audio (no explicit end message)
Server detects silence and hands the turn to the agent
```

---

## 9. Security — Signed URL Mode

By default, the API key is stored in Project Settings (`DefaultEngine.ini`). This is fine for development but **should not be shipped in packaged builds**, as the key could be extracted.

### Production setup

1. Enable **Signed URL Mode** in Project Settings.
2. Set **Signed URL Endpoint** to a URL on your own backend (e.g. `https://your-server.com/api/elevenlabs-token`).
3. Your backend authenticates the player and calls the ElevenLabs API to generate a signed WebSocket URL, returning:

   ```json
   {
     "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..."
   }
   ```

4. The plugin fetches this URL before connecting — the API key never leaves your server.

### Development workflow (API key in project settings)

- Set the key in **Project Settings → Plugins → ElevenLabs AI Agent**
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **Strip this section from `DefaultEngine.ini` before every git commit**
- Each developer sets the key locally — it does not go in version control

---

## 10. Audio Pipeline

### Input (player → agent)

```
Device (any sample rate, any channels)
  ↓ FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
  ↓ Callback: const void* → cast to float32 interleaved frames
  ↓ Downmix to mono (average all channels)
  ↓ Resample to 16000 Hz (linear interpolation)
  ↓ Apply VolumeMultiplier
  ↓ Dispatch to Game Thread (AsyncTask)
  ↓ Convert float32 → int16 signed, little-endian bytes
  ↓ Base64 encode
  ↓ Send as binary WebSocket frame: { "user_audio_chunk": "<base64>" }
```

### Output (agent → player)

```
Binary WebSocket frame arrives
  ↓ Peek first byte:
      • '{'   → UTF-8 JSON: parse type field, dispatch to handler
      • other → raw PCM audio bytes
  ↓ [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
  ↓ Enqueue in thread-safe AudioQueue (FCriticalSection)
  ↓ USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from queue
  ↓ UAudioComponent plays from the Actor's world position (3D spatialized)
```

**Audio format** (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.

### Silence detection heuristic

`OnAgentStoppedSpeaking` fires when the `AudioQueue` has been empty for **30 consecutive ticks** (~0.5 s at 60 fps). If the agent has natural pauses, increase `SilenceThresholdTicks` in the header:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
```

---

## 11. Common Patterns

### Test the connection without a microphone

```
BeginPlay → StartConversation()
OnAgentConnected → SendTextMessage("Hello, introduce yourself")
OnAgentTextResponse → Print String (confirms the text pipeline works)
OnAgentStartedSpeaking → (confirms the audio pipeline works)
```

### Show subtitles in UI

```
OnAgentTranscript:   Segment → Text → show in the player subtitle widget (speaker always "user")
OnAgentTextResponse: ResponseText → show in the NPC speech bubble
```

### Interrupt the agent when the player starts speaking

In Server VAD mode ElevenLabs handles this automatically.
For manual control:

```
OnAgentStartedSpeaking → set "agent is speaking" flag
Input Action (any) → if agent is speaking → InterruptAgent()
```

### Multiple NPCs with different agents

Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. WebSocket connections are fully independent.

### Only start the conversation when the player is nearby

```
On Begin Overlap (trigger volume around NPC)
 └─► [ElevenLabs Agent] Start Conversation

On End Overlap
 └─► [ElevenLabs Agent] End Conversation
```

### Adjust microphone volume

Get the `UElevenLabsMicrophoneCaptureComponent` from the owner and set `VolumeMultiplier`:

```cpp
UElevenLabsMicrophoneCaptureComponent* Mic =
    GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (Mic)
{
    Mic->VolumeMultiplier = 2.0f;
}
```

---

## 12. Troubleshooting

### Plugin doesn't appear in Project Settings

Ensure the plugin is enabled in the `.uproject` and that the project was recompiled after adding it.

### WebSocket connection fails immediately

- Check that the **API Key** is set correctly in Project Settings.
- Check that the **Agent ID** exists in your ElevenLabs account (find it in the dashboard URL or via `GET /v1/convai/agents`).
- Enable **Verbose Logging** in Project Settings and check the Output Log for the exact WS URL and error.
- Ensure port 443 (WSS) is not blocked by your firewall.

### `OnAgentConnected` never fires

- The connection was made but `conversation_initiation_metadata` has not been received yet — check Verbose Logging.
- If you see `"Binary audio frame"` logs but no `"Conversation initiated"`, the initiation JSON frame may be arriving as a non-`{` binary frame. Check the hex prefix logged at Verbose level.

### No audio from the microphone

- Windows may require microphone permission. Check **Settings → Privacy → Microphone**.
- Try setting `VolumeMultiplier` to `2.0` on the `MicrophoneCaptureComponent`.
- Check the Output Log for `"Failed to open default audio capture stream"`.
### Agent audio is choppy or silent

- The `USoundWaveProcedural` queue may be underflowing due to network jitter. Check latency.
- Verify the audio format matches: the plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. mp3_44100), audio will sound garbled — check `agent_output_audio_format` in the `conversation_initiation_metadata` via Verbose Logging.
- Ensure no other component is using the same `UAudioComponent`.

### `OnAgentStoppedSpeaking` fires too early

Increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s at 60fps
```

### Build error: "Plugin AudioCapture not found"

Make sure the `AudioCapture` plugin is enabled. It should be auto-enabled via the `.uplugin` dependency, but you can add it manually to the `.uproject`:

```json
{
  "Name": "AudioCapture",
  "Enabled": true
}
```

### `"Received unexpected binary WebSocket frame"` in the log

This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.

---

## 13. Changelog

### v1.1.0 — 2026-02-19

**Bug fixes:**

- **Binary WebSocket frames**: ElevenLabs sends all frames as binary (not text). All frames were previously discarded. Now correctly handled — JSON control frames are decoded as UTF-8, and raw PCM audio frames are routed directly to the audio queue.
- **Transcript message**: Wrong message type (`"transcript"` → `"user_transcript"`), wrong event key (`"transcript_event"` → `"user_transcription_event"`), wrong text field (`"message"` → `"user_transcript"`).
- **Pong format**: `event_id` was nested inside a `pong_event` object; corrected to a top-level field per the API spec.
- **Client turn mode**: `user_turn_start`/`user_turn_end` are not valid API messages; replaced with `user_activity` (start) and implicit silence (end).
**New features:**

- `SendTextMessage(Text)` on both `UElevenLabsConversationalAgentComponent` and `UElevenLabsWebSocketProxy` — send text to the agent without a microphone. Useful for testing.
- Verbose logging shows a binary-frame hex preview and a JSON-frame content prefix.
- The JSON parse error log now shows the first 80 characters of the failing message.

### v1.0.0 — 2026-02-19

Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.

---

*Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5*