Compare commits

...

17 Commits

Author SHA1 Message Date
302337b573 memory 2026-02-19 15:02:55 +01:00
99017f4067 Update test_AI_Actor blueprint asset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 14:56:06 +01:00
e464cfe288 Update plugin documentation to v1.1.0
Reflects all bug fixes and new features added since initial release:
- Binary WS frame handling (JSON vs raw PCM discrimination)
- Corrected transcript message type and field names
- Corrected pong format (top-level event_id)
- Corrected client turn mode (user_activity, no explicit end message)
- New SendTextMessage feature documented with Blueprint + C++ examples
- Added Section 13: Changelog (v1.0.0 / v1.1.0)
- Updated audio pipeline diagram for raw binary PCM output path
- Added OnAgentConnected timing note (fires after initiation_metadata)
- Added FTranscriptSegment clarification (speaker always "user")
- Added API key / git workflow note in Security section
- New troubleshooting entries for binary frames and OnAgentConnected
- New "Test without microphone" common pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 14:01:09 +01:00
483456728d Fix: distinguish binary audio frames from binary JSON frames
ElevenLabs sends two kinds of binary WebSocket frames:
  1. JSON control messages (starts with '{') — decode as UTF-8, route to OnWsMessage
  2. Raw PCM audio (binary, does not start with '{') — broadcast directly as audio

Previously all binary frames were decoded as UTF-8 JSON, causing
"Failed to parse WebSocket message as JSON" for every audio frame.

Fix: peek at first byte of assembled frame buffer:
  - '{' → UTF-8 JSON path (null-terminated, routed to existing message handler)
  - anything else → raw PCM path (broadcast directly to OnAudioReceived)

Also: improved "Failed to parse JSON" log to show first 80 chars of message,
and added verbose hex dump of binary audio frame prefix for diagnostics.
Compiles cleanly on UE 5.5 Win64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:54:34 +01:00
669c503d06 Fix: handle binary WebSocket frames from ElevenLabs
ElevenLabs sends all JSON messages as binary WS frames, not text frames.
The OnRawMessage callback receives them; we were logging them as warnings
and discarding the data entirely — causing no events to fire at all.

Fix: accumulate binary frame fragments (BytesRemaining > 0 = more coming),
reassemble into a complete buffer, decode as UTF-8 JSON string, then route
through the existing OnWsMessage text handler unchanged.

Added BinaryFrameBuffer (TArray<uint8>) to proxy header for accumulation.
Compiles cleanly on UE 5.5 Win64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:51:47 +01:00
b489d1174c Add SendTextMessage to agent component and WebSocket proxy
Sends {"type":"user_message","text":"..."} to the ElevenLabs API.
Agent responds with audio + text exactly as if it heard spoken input.
Useful for testing without a microphone and for text-only NPC interactions.

Available in Blueprint on UElevenLabsConversationalAgentComponent.
Compiles cleanly on UE 5.5 Win64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:49:03 +01:00
ae2c9b92e8 Fix 3 WebSocket protocol bugs found by API cross-check
Bug 1 — Transcript handler: wrong type string + wrong JSON fields
  - type was "transcript", API sends "user_transcript"
  - event key was "transcript_event", API uses "user_transcription_event"
  - field was "message", API uses "user_transcript"
  - removed non-existent "speaker"/"is_final" fields; speaker is always "user"

Bug 2 — Pong format: event_id must be top-level, not nested in pong_event
  - Was: {"type":"pong","pong_event":{"event_id":1}}
  - Fixed: {"type":"pong","event_id":1}

Bug 3 — Client turn mode: user_turn_start/end don't exist in the API
  - SendUserTurnStart now sends {"type":"user_activity"} (correct API message)
  - SendUserTurnEnd now a no-op with log (no explicit end message in API)
  - Renamed constants in ElevenLabsDefinitions.h accordingly

Also added AgentResponseCorrection and ConversationClientData constants.
Compiles cleanly on UE 5.5 Win64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:43:08 +01:00
dbd61615a9 Add TestMap, test actor asset, update DefaultEngine.ini and memory
- DefaultEngine.ini: set GameDefaultMap + EditorStartupMap to TestMap
  (API key stripped — set locally via Project Settings, not committed)
- Content/TestMap.umap: initial test level
- Content/test_AI_Actor.uasset: initial test actor
- .claude/MEMORY.md: document API key handling, add memory file index,
  note private git server and TestMap as default map

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:40:08 +01:00
bbeb4294a8 Add ElevenLabs API reference doc for future Claude sessions
Covers: WebSocket protocol (all message types), Agent ID location,
Signed URL auth, REST agents API, audio format, UE5 integration notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:35:35 +01:00
2bb503ae40 Add session log for 2026-02-19
Full record of everything done in today's session: plugin creation,
compile fixes, documentation (Markdown + PowerPoint), git history,
technical decisions made, and next steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 13:15:24 +01:00
1b7202603f Add PowerPoint documentation and update .gitignore
- PS_AI_Agent_ElevenLabs_Documentation.pptx: 20-slide dark-themed presentation
  covering overview, installation, quick start (BP + C++), component reference,
  data types, turn modes, security, audio pipeline, patterns, troubleshooting.
- .gitignore: exclude generate_pptx.py (dev tool, not needed in repo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 13:14:33 +01:00
c833ccd66d Add plugin documentation for PS_AI_Agent_ElevenLabs
Covers: installation, project settings, quick start (Blueprint + C++),
full component/API reference, turn modes, security/signed URL mode,
audio pipeline diagram, common patterns, and troubleshooting guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 13:07:49 +01:00
3b98edcf92 Update memory: document confirmed UE 5.5 API patterns and plugin compile status
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 13:02:33 +01:00
bb1a857e86 Fix compile errors in PS_AI_Agent_ElevenLabs plugin
- Remove WebSockets from .uplugin (it is a module, not a plugin)
- Add AudioCapture plugin dependency to .uplugin
- Fix FOnAudioCaptureFunction: use OpenAudioCaptureStream (not deprecated
  OpenDefaultCaptureStream) and correct callback signature (const void* per UE 5.3+)
- Cast void* to float* inside OnAudioGenerate for float sample processing
- Fix TArray::RemoveAt: use EAllowShrinking::No instead of deprecated bool overload

Plugin now compiles cleanly with no errors or warnings on UE 5.5 / Win64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 13:02:11 +01:00
b114ab063d Broaden .gitignore: use glob for all plugin Binaries/Intermediate
Replaces the plugin-specific path with a wildcard pattern so any
future plugin under Plugins/ is automatically excluded from version control.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 12:58:46 +01:00
4d6ae103db Update .gitignore: exclude plugin build artifacts and local Claude settings
Keeps existing ignores intact. Adds:
- Plugin Binaries/ and Intermediate/ (generated on first compile)
- *.sln / *.suo (generated by UE project regeneration)
- .claude/settings.local.json (machine-specific, not shared)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 12:58:31 +01:00
f0055e85ed Add PS_AI_Agent_ElevenLabs plugin (initial implementation)
Adds a new UE5.5 plugin integrating the ElevenLabs Conversational AI Agent
via WebSocket. No gRPC or third-party libs required.

Plugin components:
- UElevenLabsSettings: API key + Agent ID in Project Settings
- UElevenLabsWebSocketProxy: full WS session lifecycle, JSON message handling,
  ping/pong keepalive, Base64 PCM audio send/receive
- UElevenLabsConversationalAgentComponent: ActorComponent for NPC voice
  conversation, orchestrates mic capture -> WS -> procedural audio playback
- UElevenLabsMicrophoneCaptureComponent: wraps Audio::FAudioCapture,
  resamples to 16kHz mono, dispatches on game thread

Also adds .claude/ memory files (project context, plugin notes, patterns)
so Claude Code can restore full context on any machine after a git pull.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-19 12:57:48 +01:00
23 changed files with 3314 additions and 1 deletion

.claude/MEMORY.md Normal file
# Project Memory PS_AI_Agent
> This file is committed to the repository so it is available on any machine.
> Claude Code reads it automatically at session start (via the auto-memory system)
> when the working directory is inside this repo.
> **Keep it under ~180 lines** — lines beyond 200 are truncated by the system.
---
## Project Location
- Repo root: `<repo_root>/` (wherever this is cloned)
- UE5 project: `<repo_root>/Unreal/PS_AI_Agent/`
- `.uproject`: `<repo_root>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject`
- Engine: **Unreal Engine 5.5** — Win64 primary target
- Default test map: `/Game/TestMap.TestMap`
## Plugins
| Plugin | Path | Purpose |
|--------|------|---------|
| Convai (reference) | `<repo_root>/ConvAI/Convai/` | gRPC + protobuf streaming to Convai API. Has ElevenLabs voice type enum in `ConvaiDefinitions.h`. Used as architectural reference. |
| **PS_AI_Agent_ElevenLabs** | `<repo_root>/Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/` | Our ElevenLabs Conversational AI integration. See `.claude/elevenlabs_plugin.md` for full details. |
## User Preferences
- Plugin naming: `PS_AI_Agent_<Service>` (e.g. `PS_AI_Agent_ElevenLabs`)
- Save memory frequently during long sessions
- Goal: ElevenLabs Conversational AI integration — simpler than Convai, no gRPC
- Full original ask + intent: see `.claude/project_context.md`
- Git remote is a **private server** — no public exposure risk
## Key UE5 Plugin Patterns
- Settings object: `UCLASS(config=Engine, defaultconfig)` inheriting `UObject`, registered via `ISettingsModule`
- Module startup: `NewObject<USettings>(..., RF_Standalone)` + `AddToRoot()`
- WebSocket: `FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), Headers)`
- `WebSockets` is a **module** (Build.cs only) — NOT a plugin, don't put it in `.uplugin`
- Audio capture: `Audio::FAudioCapture::OpenAudioCaptureStream()` (UE 5.3+, replaces deprecated `OpenCaptureStream`)
- `AudioCapture` IS a plugin — declare it in `.uplugin` Plugins array
- Callback type: `FOnAudioCaptureFunction` = `TFunction<void(const void*, int32, int32, int32, double, bool)>`
- Cast `const void*` to `const float*` inside — device sends float32 interleaved
- Procedural audio playback: `USoundWaveProcedural` + `OnSoundWaveProceduralUnderflow` delegate
- Audio capture callbacks arrive on a **background thread** — always marshal to game thread with `AsyncTask(ENamedThreads::GameThread, ...)`
- Resample mic audio to **16000 Hz mono** before sending to ElevenLabs
- `TArray::RemoveAt(idx, count, EAllowShrinking::No)` — bool overload deprecated in UE 5.5
## Plugin Status
- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- Connection confirmed working end-to-end; audio receive path functional
## ElevenLabs WebSocket Protocol Notes
- **ALL frames are binary** — `OnRawMessage` handles everything; `OnMessage` (text) never fires
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode: `{"type":"user_activity"}` to signal speaking; no explicit end message
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
## API Keys / Secrets
- ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **The key is stripped from `DefaultEngine.ini` before every commit** — do not commit it
- Each developer sets the key locally; it does not go in git
## Claude Memory Files in This Repo
| File | Contents |
|------|----------|
| `.claude/MEMORY.md` | This file — project structure, patterns, status |
| `.claude/elevenlabs_plugin.md` | Plugin file map, ElevenLabs WS protocol, design decisions |
| `.claude/elevenlabs_api_reference.md` | Full ElevenLabs API reference (WS messages, REST, signed URL, Agent ID location) |
| `.claude/project_context.md` | Original ask, intent, short/long-term goals |
| `.claude/session_log_2026-02-19.md` | Full session record: steps, commits, technical decisions, next steps |
| `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` | User-facing Markdown reference doc |

# PS_AI_Agent_ElevenLabs — Plugin Documentation
**Engine**: Unreal Engine 5.5
**Plugin version**: 1.1.0
**Status**: Beta — tested on UE 5.5 Win64, verified connection and audio pipeline
**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/eleven-agents/quickstart)
---
## Table of Contents
1. [Overview](#1-overview)
2. [Installation](#2-installation)
3. [Project Settings](#3-project-settings)
4. [Quick Start (Blueprint)](#4-quick-start-blueprint)
5. [Quick Start (C++)](#5-quick-start-c)
6. [Components Reference](#6-components-reference)
- [UElevenLabsConversationalAgentComponent](#uelevenlabsconversationalagentcomponent)
- [UElevenLabsMicrophoneCaptureComponent](#uelevenlabsmicrophonecapturecomponent)
- [UElevenLabsWebSocketProxy](#uelevenlabswebsocketproxy)
7. [Data Types Reference](#7-data-types-reference)
8. [Turn Modes](#8-turn-modes)
9. [Security — Signed URL Mode](#9-security--signed-url-mode)
10. [Audio Pipeline](#10-audio-pipeline)
11. [Common Patterns](#11-common-patterns)
12. [Troubleshooting](#12-troubleshooting)
13. [Changelog](#13-changelog)
---
## 1. Overview
This plugin integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5, enabling real-time voice conversations between a player and an NPC (or any Actor).
### How it works
```
Player microphone
        ↓
UElevenLabsMicrophoneCaptureComponent
  • Captures from default audio device
  • Resamples to 16 kHz mono float32
        ↓
UElevenLabsConversationalAgentComponent
  • Converts float32 → int16 PCM bytes
  • Base64-encodes and sends via WebSocket
        ↓  (wss://api.elevenlabs.io/v1/convai/conversation)
ElevenLabs Conversational AI Agent
  • Transcribes speech
  • Runs LLM
  • Synthesizes voice (ElevenLabs TTS)
        ↓
UElevenLabsConversationalAgentComponent
  • Receives raw binary PCM audio frames
  • Feeds USoundWaveProcedural → UAudioComponent
        ↓
Agent voice plays from the Actor's position in the world
```
### Key properties
- No gRPC, no third-party libraries — uses UE's built-in `WebSockets` and `AudioCapture` modules
- Blueprint-first: all events and controls are exposed to Blueprint
- Real-time bidirectional: audio streams in both directions simultaneously
- Server VAD (default) or push-to-talk
- Text input supported (no microphone needed for testing)
### Wire frame protocol notes
ElevenLabs sends **all WebSocket frames as binary** (not text frames). The plugin handles two binary frame types automatically:
- **JSON control frames** (start with `{`) — conversation init, transcripts, agent responses, ping/pong
- **Raw PCM audio frames** (binary) — agent speech audio, played directly via `USoundWaveProcedural`
---
## 2. Installation
The plugin lives inside the project, not the engine, so no separate install is needed.
### Verify it is enabled
Open `Unreal/PS_AI_Agent/PS_AI_Agent.uproject` and confirm:
```json
{
"Name": "PS_AI_Agent_ElevenLabs",
"Enabled": true
}
```
### First compile
Open the project in the UE 5.5 Editor. It will detect the new plugin and ask to recompile — click **Yes**. Alternatively, compile from the command line:
```
"C:\Program Files\Epic Games\UE_5.5\Engine\Build\BatchFiles\Build.bat"
PS_AI_AgentEditor Win64 Development
"<repo>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject"
-WaitMutex
```
---
## 3. Project Settings
Go to **Edit → Project Settings → Plugins → ElevenLabs AI Agent**.
| Setting | Description | Required |
|---|---|---|
| **API Key** | Your ElevenLabs API key. Find it at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | Yes (unless using Signed URL Mode or a public agent) |
| **Agent ID** | Default agent ID. Find it in the URL when editing an agent: `elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>` | Yes (unless set per-component) |
| **Signed URL Mode** | Fetch the WS URL from your own backend (keeps key off client). See [Section 9](#9-security--signed-url-mode) | No |
| **Signed URL Endpoint** | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode = true |
| **Custom WebSocket URL** | Override the default `wss://api.elevenlabs.io/...` endpoint (debug only) | No |
| **Verbose Logging** | Log every WebSocket frame type and first bytes to Output Log | No |
> **Security note**: The API key set in Project Settings is saved to `DefaultEngine.ini`. **Never commit this file with the key in it** — strip the `[ElevenLabsSettings]` section before committing. Use Signed URL Mode for production builds.
> **Finding your Agent ID**: Go to [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai), click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.
---
## 4. Quick Start (Blueprint)
### Step 1 — Add the component to an NPC
1. Open your NPC Blueprint (or any Actor Blueprint).
2. In the **Components** panel, click **Add** → search for **ElevenLabs Conversational Agent**.
3. Select the component. In the **Details** panel you can optionally set a specific **Agent ID** (overrides the project default).
### Step 2 — Set Turn Mode
In the component's **Details** panel:
- **Server VAD** (default): ElevenLabs automatically detects when the player stops speaking. Microphone streams continuously once connected.
- **Client Controlled**: You call `Start Listening` / `Stop Listening` manually (push-to-talk).
### Step 3 — Wire up events in the Event Graph
```
Event BeginPlay
└─► [ElevenLabs Agent] Start Conversation
[ElevenLabs Agent] On Agent Connected
└─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID
[ElevenLabs Agent] On Agent Text Response
└─► Set Text (UI widget) ← Response Text
[ElevenLabs Agent] On Agent Transcript
└─► (optional) display live subtitles ← Segment → Text
[ElevenLabs Agent] On Agent Started Speaking
└─► Play talking animation on NPC
[ElevenLabs Agent] On Agent Stopped Speaking
└─► Return to idle animation
[ElevenLabs Agent] On Agent Error
└─► Print String "Error: " + Error Message
Event EndPlay
└─► [ElevenLabs Agent] End Conversation
```
### Step 4 — Push-to-talk (Client Controlled mode only)
```
Input Action "Talk" (Pressed)
└─► [ElevenLabs Agent] Start Listening
Input Action "Talk" (Released)
└─► [ElevenLabs Agent] Stop Listening
```
### Step 5 — Testing without a microphone
Once connected, use **Send Text Message** instead of speaking:
```
[ElevenLabs Agent] On Agent Connected
└─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"
```
The agent will reply with audio and text exactly as if it heard you speak.
---
## 5. Quick Start (C++)
### 1. Add the plugin to your module's Build.cs
```csharp
PrivateDependencyModuleNames.Add("PS_AI_Agent_ElevenLabs");
```
### 2. Include and use
```cpp
#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsDefinitions.h"
// In your Actor's header:
UPROPERTY(VisibleAnywhere)
UElevenLabsConversationalAgentComponent* ElevenLabsAgent;
// In the constructor:
ElevenLabsAgent = CreateDefaultSubobject<UElevenLabsConversationalAgentComponent>(
TEXT("ElevenLabsAgent"));
// Override Agent ID at runtime (optional):
ElevenLabsAgent->AgentID = TEXT("your_agent_id_here");
ElevenLabsAgent->TurnMode = EElevenLabsTurnMode::Server;
ElevenLabsAgent->bAutoStartListening = true;
// Bind events:
ElevenLabsAgent->OnAgentConnected.AddDynamic(
this, &AMyNPC::HandleAgentConnected);
ElevenLabsAgent->OnAgentTextResponse.AddDynamic(
this, &AMyNPC::HandleAgentResponse);
ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
this, &AMyNPC::PlayTalkingAnimation);
// Start the conversation:
ElevenLabsAgent->StartConversation();
// Send a text message (useful for testing without mic):
ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));
// Later, to end:
ElevenLabsAgent->EndConversation();
```
### 3. Callback signatures
```cpp
UFUNCTION()
void HandleAgentConnected(const FElevenLabsConversationInfo& Info)
{
UE_LOG(LogTemp, Log, TEXT("Connected, ConvID=%s"), *Info.ConversationID);
}
UFUNCTION()
void HandleAgentResponse(const FString& ResponseText)
{
// Display in UI, drive subtitles, etc.
}
UFUNCTION()
void PlayTalkingAnimation()
{
// Switch to talking anim montage
}
```
---
## 6. Components Reference
### UElevenLabsConversationalAgentComponent
The **main component** — attach this to any Actor that should be able to speak.
**Category**: ElevenLabs
**Inherits from**: `UActorComponent`
#### Properties
| Property | Type | Default | Description |
|---|---|---|---|
| `AgentID` | `FString` | `""` | Agent ID for this actor. Overrides the project-level default when non-empty. |
| `TurnMode` | `EElevenLabsTurnMode` | `Server` | How speaker turns are detected. See [Section 8](#8-turn-modes). |
| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is connected and ready. |
#### Functions
| Function | Blueprint | Description |
|---|---|---|
| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once `OnAgentConnected` fires. |
| `EndConversation()` | Callable | Closes the WebSocket, stops mic, stops audio playback. |
| `StartListening()` | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends `user_activity`. |
| `StopListening()` | Callable | Stops microphone capture. In Client mode, stops sending `user_activity`. |
| `SendTextMessage(Text)` | Callable | Sends a text message to the agent without using the microphone. Agent replies with full audio + text. Useful for testing. |
| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| `IsConnected()` | Pure | Returns true if the WebSocket is open and the conversation is active. |
| `IsListening()` | Pure | Returns true if the microphone is currently capturing. |
| `IsAgentSpeaking()` | Pure | Returns true if agent audio is currently playing. |
| `GetConversationInfo()` | Pure | Returns `FElevenLabsConversationInfo` (ConversationID, AgentID). |
| `GetWebSocketProxy()` | Pure | Returns the underlying `UElevenLabsWebSocketProxy` for advanced use. |
#### Events
| Event | Parameters | Fired when |
|---|---|---|
| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation metadata received. Safe to call `SendTextMessage` here. |
| `OnAgentDisconnected` | `int32 StatusCode`, `FString Reason` | WebSocket closed (graceful or remote). |
| `OnAgentError` | `FString ErrorMessage` | Connection or protocol error. |
| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | User speech-to-text transcript received (speaker is always `"user"`). |
| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (mirrors the audio). |
| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent (audio playback begins). |
| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (heuristic — agent done speaking). |
| `OnAgentInterrupted` | — | Agent speech was interrupted (by user or by `InterruptAgent()`). |
---
### UElevenLabsMicrophoneCaptureComponent
A lightweight microphone capture component. Managed automatically by `UElevenLabsConversationalAgentComponent` — you only need to use this directly for advanced scenarios (e.g. custom audio routing).
**Category**: ElevenLabs
**Inherits from**: `UActorComponent`
#### Properties
| Property | Type | Default | Description |
|---|---|---|---|
| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples before resampling. Range: 0.0–4.0. |
#### Functions
| Function | Blueprint | Description |
|---|---|---|
| `StartCapture()` | Callable | Opens the default audio input device and begins streaming. |
| `StopCapture()` | Callable | Stops streaming and closes the device. |
| `IsCapturing()` | Pure | True while actively capturing. |
#### Delegate
`OnAudioCaptured` — fires on the **game thread** with `TArray<float>` PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.
---
### UElevenLabsWebSocketProxy
Low-level WebSocket session manager. Used internally by `UElevenLabsConversationalAgentComponent`. Use this directly only if you need fine-grained protocol control.
**Inherits from**: `UObject`
**Instantiate via**: `NewObject<UElevenLabsWebSocketProxy>(Outer)`
#### Key functions
| Function | Description |
|---|---|
| `Connect(AgentID, APIKey)` | Open the WS connection. Parameters override project settings when non-empty. |
| `Disconnect()` | Send close frame and tear down the connection. |
| `SendAudioChunk(PCMData)` | Send raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
| `SendTextMessage(Text)` | Send `{"type":"user_message","text":"..."}`. Agent replies as if it heard speech. |
| `SendUserTurnStart()` | Client turn mode: sends `{"type":"user_activity"}` to signal user is speaking. |
| `SendUserTurnEnd()` | Client turn mode: stops sending `user_activity` (no explicit message — server detects silence). |
| `SendInterrupt()` | Ask the agent to stop speaking: sends `{"type":"interrupt"}`. |
| `GetConnectionState()` | Returns `EElevenLabsConnectionState`. |
| `GetConversationInfo()` | Returns `FElevenLabsConversationInfo`. |
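The client control messages above have small, fixed JSON shapes. A sketch of building two of them with plain string formatting (illustrative: the plugin itself uses UE's JSON utilities, and these helper names are hypothetical):

```cpp
#include <string>

// Pong reply to a server ping. Note: event_id is top-level,
// not nested inside a "pong_event" object.
std::string make_pong(int event_id)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(event_id) + "}";
}

// Client turn mode: signal that the user is speaking.
// There is no corresponding explicit end-of-turn message.
std::string make_user_activity()
{
    return "{\"type\":\"user_activity\"}";
}
```

A `user_message` builder would additionally need proper JSON string escaping for the text payload, which is why real code should go through a JSON library rather than concatenation.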
---
## 7. Data Types Reference
### EElevenLabsConnectionState
```
Disconnected — No active connection
Connecting — WebSocket handshake in progress / awaiting conversation_initiation_metadata
Connected — Conversation active and ready (fires OnAgentConnected)
Error — Connection or protocol failure
```
> Note: State remains `Connecting` until the server sends `conversation_initiation_metadata`. `OnAgentConnected` fires on transition to `Connected`.
### EElevenLabsTurnMode
```
Server — ElevenLabs Voice Activity Detection decides when the user stops speaking (recommended)
Client — Your code calls StartListening/StopListening to define turns (push-to-talk)
```
### FElevenLabsConversationInfo
```
ConversationID FString — Unique session ID assigned by ElevenLabs
AgentID FString — The agent ID for this session
```
### FElevenLabsTranscriptSegment
```
Text FString — Transcribed text
Speaker FString — "user" (agent text comes via OnAgentTextResponse, not transcript)
bIsFinal bool — Always true for user transcripts (ElevenLabs sends final only)
```
---
## 8. Turn Modes
### Server VAD (default)
ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.
**When to use**: Casual conversation, hands-free interaction, natural dialogue.
```
StartConversation() → mic streams continuously (if bAutoStartListening = true)
ElevenLabs detects speech / silence automatically
Agent replies when it detects end-of-speech
```
### Client Controlled (push-to-talk)
Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`. The plugin sends `{"type":"user_activity"}` while the user is speaking; stopping it signals end of turn.
**When to use**: Noisy environments, precise control, walkie-talkie style UI.
```
Input Pressed → StartListening() → streams audio + sends user_activity
Input Released → StopListening() → stops audio (no explicit end message)
Server detects silence and hands turn to agent
```
---
## 9. Security — Signed URL Mode
By default, the API key is stored in Project Settings (`DefaultEngine.ini`). This is fine for development but **should not be shipped in packaged builds** as the key could be extracted.
### Production setup
1. Enable **Signed URL Mode** in Project Settings.
2. Set **Signed URL Endpoint** to a URL on your own backend (e.g. `https://your-server.com/api/elevenlabs-token`).
3. Your backend authenticates the player and calls the ElevenLabs API to generate a signed WebSocket URL, returning:
```json
{ "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..." }
```
4. The plugin fetches this URL before connecting — the API key never leaves your server.
### Development workflow (API key in project settings)
- Set the key in **Project Settings → Plugins → ElevenLabs AI Agent**
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **Strip this section from `DefaultEngine.ini` before every git commit**
- Each developer sets the key locally — it does not go in version control
---
## 10. Audio Pipeline
### Input (player → agent)
```
Device (any sample rate, any channels)
↓ FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
↓ Callback: const void* → cast to float32 interleaved frames
↓ Downmix to mono (average all channels)
↓ Resample to 16000 Hz (linear interpolation)
↓ Apply VolumeMultiplier
↓ Dispatch to Game Thread (AsyncTask)
↓ Convert float32 → int16 signed, little-endian bytes
↓ Base64 encode
↓ Send as binary WebSocket frame: { "user_audio_chunk": "<base64>" }
```
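The downmix, resample, and int16 conversion stages above can be sketched as standalone C++ (illustrative approximations under the stated format assumptions; the plugin's actual implementations live inside the capture and agent components):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Downmix interleaved float32 frames to mono by averaging channels.
std::vector<float> downmix_mono(const std::vector<float>& interleaved, int channels)
{
    std::vector<float> mono(interleaved.size() / channels);
    for (size_t i = 0; i < mono.size(); ++i) {
        float sum = 0.0f;
        for (int c = 0; c < channels; ++c)
            sum += interleaved[i * channels + c];
        mono[i] = sum / channels;
    }
    return mono;
}

// Linear-interpolation resample from src_rate to dst_rate (e.g. 48000 -> 16000).
std::vector<float> resample_linear(const std::vector<float>& in, int src_rate, int dst_rate)
{
    if (in.empty()) return {};
    const size_t out_len = (size_t)((double)in.size() * dst_rate / src_rate);
    std::vector<float> out(out_len);
    for (size_t i = 0; i < out_len; ++i) {
        double pos = (double)i * src_rate / dst_rate;   // position in input samples
        size_t i0 = (size_t)pos;
        size_t i1 = (i0 + 1 < in.size()) ? i0 + 1 : i0;
        double frac = pos - (double)i0;
        out[i] = (float)(in[i0] * (1.0 - frac) + in[i1] * frac);
    }
    return out;
}

// Convert float samples in [-1, 1] to int16 signed little-endian bytes.
std::vector<uint8_t> to_int16_le(const std::vector<float>& in)
{
    std::vector<uint8_t> out;
    out.reserve(in.size() * 2);
    for (float s : in) {
        float clamped = s < -1.0f ? -1.0f : (s > 1.0f ? 1.0f : s);
        int16_t v = (int16_t)std::lrintf(clamped * 32767.0f);
        out.push_back((uint8_t)(v & 0xFF));          // low byte first (LE)
        out.push_back((uint8_t)((v >> 8) & 0xFF));
    }
    return out;
}
```

The resulting byte array is what gets Base64-encoded into the `user_audio_chunk` message.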
### Output (agent → player)
```
Binary WebSocket frame arrives
↓ Peek first byte:
• '{' → UTF-8 JSON: parse type field, dispatch to handler
• other → raw PCM audio bytes
↓ [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
↓ Enqueue in thread-safe AudioQueue (FCriticalSection)
↓ USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from queue
↓ UAudioComponent plays from the Actor's world position (3D spatialized)
```
**Audio format** (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.
### Silence detection heuristic
`OnAgentStoppedSpeaking` fires when the `AudioQueue` has been empty for **30 consecutive ticks** (~0.5 s at 60 fps). If the agent has natural pauses, increase `SilenceThresholdTicks` in the header:
```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
```
---
## 11. Common Patterns
### Test the connection without a microphone
```
BeginPlay → StartConversation()
OnAgentConnected → SendTextMessage("Hello, introduce yourself")
OnAgentTextResponse → Print string (confirms text pipeline works)
OnAgentStartedSpeaking → (confirms audio pipeline works)
```
### Show subtitles in UI
```
OnAgentTranscript:
Segment → Text → show in player subtitle widget (speaker always "user")
OnAgentTextResponse:
ResponseText → show in NPC speech bubble
```
### Interrupt the agent when the player starts speaking
In Server VAD mode ElevenLabs handles this automatically. For manual control:
```
OnAgentStartedSpeaking → set "agent is speaking" flag
Input Action (any) → if agent is speaking → InterruptAgent()
```
### Multiple NPCs with different agents
Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. WebSocket connections are fully independent.
### Only start the conversation when the player is nearby
```
On Begin Overlap (trigger volume around NPC)
└─► [ElevenLabs Agent] Start Conversation
On End Overlap
└─► [ElevenLabs Agent] End Conversation
```
### Adjust microphone volume
Get the `UElevenLabsMicrophoneCaptureComponent` from the owner and set `VolumeMultiplier`:
```cpp
UElevenLabsMicrophoneCaptureComponent* Mic =
GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (Mic) Mic->VolumeMultiplier = 2.0f;
```
---
## 12. Troubleshooting
### Plugin doesn't appear in Project Settings
Ensure the plugin is enabled in `.uproject` and the project was recompiled after adding it.
### WebSocket connection fails immediately
- Check the **API Key** is set correctly in Project Settings.
- Check the **Agent ID** exists in your ElevenLabs account (find it in the dashboard URL or via `GET /v1/convai/agents`).
- Enable **Verbose Logging** in Project Settings and check Output Log for the exact WS URL and error.
- Ensure port 443 (WSS) is not blocked by your firewall.
### `OnAgentConnected` never fires
- Connection was made but `conversation_initiation_metadata` not received yet — check Verbose Logging.
- If you see `"Binary audio frame"` logs but no `"Conversation initiated"` — the initiation JSON frame may be arriving as a non-`{` binary frame. Check the hex prefix logged at Verbose level.
### No audio from the microphone
- Windows may require microphone permission. Check **Settings → Privacy → Microphone**.
- Try setting `VolumeMultiplier` to `2.0` on the `MicrophoneCaptureComponent`.
- Check Output Log for `"Failed to open default audio capture stream"`.
### Agent audio is choppy or silent
- The `USoundWaveProcedural` queue may be underflowing due to network jitter. Check latency.
- Verify the audio format matches: plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. mp3_44100), audio will sound garbled — check `agent_output_audio_format` in the `conversation_initiation_metadata` via Verbose Logging.
- Ensure no other component is using the same `UAudioComponent`.
### `OnAgentStoppedSpeaking` fires too early
Increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:
```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s at 60fps
```
### Build error: "Plugin AudioCapture not found"
Make sure the `AudioCapture` plugin is enabled. It should be auto-enabled via the `.uplugin` dependency, but you can add it manually to `.uproject`:
```json
{ "Name": "AudioCapture", "Enabled": true }
```
### `"Received unexpected binary WebSocket frame"` in the log
This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.
---
## 13. Changelog
### v1.1.0 — 2026-02-19
**Bug fixes:**
- **Binary WebSocket frames**: ElevenLabs sends all frames as binary (not text). All frames were previously discarded. Now correctly handled — JSON control frames decoded as UTF-8, raw PCM audio frames routed directly to the audio queue.
- **Transcript message**: Wrong message type (`"transcript"` → `"user_transcript"`), wrong event key (`"transcript_event"` → `"user_transcription_event"`), wrong text field (`"message"` → `"user_transcript"`).
- **Pong format**: `event_id` was nested inside a `pong_event` object; corrected to top-level field per API spec.
- **Client turn mode**: `user_turn_start`/`user_turn_end` are not valid API messages; replaced with `user_activity` (start) and implicit silence (end).
**New features:**
- `SendTextMessage(Text)` on both `UElevenLabsConversationalAgentComponent` and `UElevenLabsWebSocketProxy` — send text to the agent without a microphone. Useful for testing.
- Verbose logging shows binary frame hex preview and JSON frame content prefix.
- Improved JSON parse error log now shows the first 80 characters of the failing message.
### v1.0.0 — 2026-02-19
Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.
---
*Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5*



@ -0,0 +1,463 @@
# ElevenLabs Conversational AI API Reference
> Saved for Claude Code sessions. Auto-loaded via `.claude/` directory.
> Last updated: 2026-02-19
---
## 1. Agent ID — Where to Find It
### In the Dashboard (UI)
1. Go to **https://elevenlabs.io/app/conversational-ai**
2. Click on your agent to open it
3. The **Agent ID** is shown in the agent settings page — typically in the URL bar and/or in the agent's "General" settings tab
- URL pattern: `https://elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>`
- Also visible in the "API" or "Overview" tab of the agent editor (copy button available)
### Via API
```http
GET https://api.elevenlabs.io/v1/convai/agents
xi-api-key: YOUR_API_KEY
```
Returns a list of all agents with their `agent_id` strings.
### Via API (single agent)
```http
GET https://api.elevenlabs.io/v1/convai/agents/{agent_id}
xi-api-key: YOUR_API_KEY
```
### Agent ID Format
- Type: `string`
- Returned on agent creation via `POST /v1/convai/agents/create`
- Used as URL path param and WebSocket query param throughout the API
---
## 2. WebSocket Conversational AI
### Connection URL
```
wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<AGENT_ID>
```
Regional alternatives:
| Region | URL |
|--------|-----|
| Default (Global) | `wss://api.elevenlabs.io/` |
| US | `wss://api.us.elevenlabs.io/` |
| EU | `wss://api.eu.residency.elevenlabs.io/` |
| India | `wss://api.in.residency.elevenlabs.io/` |
### Authentication
- **Public agents**: No key required, just `agent_id` query param
- **Private agents**: Use a **Signed URL** (see Section 4) instead of direct `agent_id`
- **Server-side** (backend): Pass `xi-api-key` as an HTTP upgrade header
```
Headers:
xi-api-key: YOUR_API_KEY
```
> ⚠️ Never expose your API key client-side. For browser/mobile apps, use Signed URLs.
---
## 3. WebSocket Protocol — Message Reference
### Audio Format
- **Input (mic → server)**: PCM 16-bit signed, **16000 Hz**, mono, little-endian, Base64-encoded
- **Output (server → client)**: Base64-encoded audio (format specified in `conversation_initiation_metadata`)
---
### Messages FROM Server (Subscribe / Receive)
#### `conversation_initiation_metadata`
Sent immediately after connection. Contains conversation ID and audio format specs.
```json
{
"type": "conversation_initiation_metadata",
"conversation_initiation_metadata_event": {
"conversation_id": "string",
"agent_output_audio_format": "pcm_16000 | mp3_44100 | ...",
"user_input_audio_format": "pcm_16000"
}
}
```
#### `audio`
Agent speech audio chunk.
```json
{
"type": "audio",
"audio_event": {
"audio_base_64": "BASE64_PCM_BYTES",
"event_id": 42
}
}
```
#### `user_transcript`
Transcribed text of what the user said.
```json
{
"type": "user_transcript",
"user_transcription_event": {
"user_transcript": "Hello, how are you?"
}
}
```
#### `agent_response`
The text the agent is saying (arrives in parallel with audio).
```json
{
"type": "agent_response",
"agent_response_event": {
"agent_response": "I'm doing great, thanks!"
}
}
```
#### `agent_response_correction`
Sent after an interruption — shows what was truncated.
```json
{
"type": "agent_response_correction",
"agent_response_correction_event": {
"original_agent_response": "string",
"corrected_agent_response": "string"
}
}
```
#### `interruption`
Signals that a specific audio event was interrupted.
```json
{
"type": "interruption",
"interruption_event": {
"event_id": 42
}
}
```
#### `ping`
Keepalive ping from server. Client must reply with `pong`.
```json
{
"type": "ping",
"ping_event": {
"event_id": 1,
"ping_ms": 150
}
}
```
#### `client_tool_call`
Requests the client execute a tool (custom tools integration).
```json
{
"type": "client_tool_call",
"client_tool_call": {
"tool_name": "string",
"tool_call_id": "string",
"parameters": {}
}
}
```
#### `contextual_update`
Text context added to conversation state (non-interrupting).
```json
{
"type": "contextual_update",
"contextual_update_event": {
"text": "string"
}
}
```
#### `vad_score`
Voice Activity Detection confidence score (0.0 – 1.0).
```json
{
"type": "vad_score",
"vad_score_event": {
"vad_score": 0.85
}
}
```
#### `internal_tentative_agent_response`
Preliminary agent text during LLM generation (not final).
```json
{
"type": "internal_tentative_agent_response",
"tentative_agent_response_internal_event": {
"tentative_agent_response": "string"
}
}
```
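All the server messages above share the shape `{ "type": ..., "<type>_event": {...} }`, so dispatch starts by reading the top-level `type` field. A naive standalone sketch (the UE plugin uses `FJsonSerializer`; this string scan is for illustration only and does not handle nested or escaped occurrences of the key):

```cpp
#include <string>

// Extract the top-level "type" field from a server message.
// Illustration only - real code should use a JSON parser.
std::string ExtractMessageType(const std::string& Json)
{
    const std::string Key = "\"type\":";
    size_t Pos = Json.find(Key);
    if (Pos == std::string::npos) return "";
    Pos = Json.find('"', Pos + Key.size());   // opening quote of the value
    if (Pos == std::string::npos) return "";
    const size_t End = Json.find('"', Pos + 1);
    if (End == std::string::npos) return "";
    return Json.substr(Pos + 1, End - Pos - 1);
}
```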
---
### Messages TO Server (Publish / Send)
#### `user_audio_chunk`
Microphone audio data. Send continuously during user speech.
```json
{
"user_audio_chunk": "BASE64_PCM_16BIT_16KHZ_MONO"
}
```
Audio must be: **PCM 16-bit signed, 16000 Hz, mono, little-endian**, then Base64-encoded.
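The Base64 step can be sketched with a minimal standard-C++ encoder (illustrative only — UE code would typically use `FBase64::Encode` from Core):

```cpp
#include <string>
#include <vector>
#include <cstdint>

// Minimal Base64 encoder (RFC 4648 alphabet, '=' padding).
std::string Base64Encode(const std::vector<uint8_t>& In)
{
    static const char* Tbl =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string Out;
    size_t i = 0;
    for (; i + 3 <= In.size(); i += 3)          // full 3-byte groups
    {
        uint32_t V = (In[i] << 16) | (In[i + 1] << 8) | In[i + 2];
        Out += Tbl[(V >> 18) & 63]; Out += Tbl[(V >> 12) & 63];
        Out += Tbl[(V >> 6) & 63];  Out += Tbl[V & 63];
    }
    const size_t Rem = In.size() - i;           // 0, 1 or 2 trailing bytes
    if (Rem == 1)
    {
        uint32_t V = In[i] << 16;
        Out += Tbl[(V >> 18) & 63]; Out += Tbl[(V >> 12) & 63]; Out += "==";
    }
    else if (Rem == 2)
    {
        uint32_t V = (In[i] << 16) | (In[i + 1] << 8);
        Out += Tbl[(V >> 18) & 63]; Out += Tbl[(V >> 12) & 63];
        Out += Tbl[(V >> 6) & 63];  Out += '=';
    }
    return Out;
}
```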
#### `pong`
Reply to server `ping` to keep connection alive.
```json
{
"type": "pong",
"event_id": 1
}
```
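Note that `event_id` is a top-level field, not nested inside a `pong_event` object (getting this wrong was a v1.0.0 plugin bug). A standalone sketch of the reply builder:

```cpp
#include <string>

// Build the pong reply for a server ping. Per the API, event_id is
// top-level - not nested inside a pong_event object.
std::string BuildPong(int EventId)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}
```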
#### `conversation_initiation_client_data`
Override agent configuration at connection time. Send before or just after connecting.
```json
{
"type": "conversation_initiation_client_data",
"conversation_config_override": {
"agent": {
"prompt": { "prompt": "Custom system prompt override" },
"first_message": "Hello! How can I help?",
"language": "en"
},
"tts": {
"voice_id": "string",
"speed": 1.0,
"stability": 0.5,
"similarity_boost": 0.75
}
},
"dynamic_variables": {
"user_name": "Alice",
"session_id": 12345
}
}
```
Config override ranges:
- `tts.speed`: 0.7 – 1.2
- `tts.stability`: 0.0 – 1.0
- `tts.similarity_boost`: 0.0 – 1.0
#### `client_tool_result`
Response to a `client_tool_call` from the server.
```json
{
"type": "client_tool_result",
"tool_call_id": "string",
"result": "tool output string",
"is_error": false
}
```
#### `contextual_update`
Inject context without interrupting the conversation.
```json
{
"type": "contextual_update",
"text": "User just entered room 4B"
}
```
#### `user_message`
Send a text message (no mic audio needed).
```json
{
"type": "user_message",
"text": "What is the weather like?"
}
```
#### `user_activity`
Signal that user is active (for turn detection in client mode).
```json
{
"type": "user_activity"
}
```
---
## 4. Signed URL (Private Agents)
Used for browser/mobile clients to authenticate without exposing the API key.
### Flow
1. **Backend** calls ElevenLabs API to get a temporary signed URL
2. Backend returns signed URL to client
3. **Client** opens WebSocket to the signed URL (no API key needed)
### Get Signed URL
```http
GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=<AGENT_ID>
xi-api-key: YOUR_API_KEY
```
Optional query params:
- `include_conversation_id=true` — generates unique conversation ID, prevents URL reuse
- `branch_id` — specific agent branch
Response:
```json
{
"signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..."
}
```
Client connects to `signed_url` directly — no headers needed.
---
## 5. Agents REST API
Base URL: `https://api.elevenlabs.io`
Auth header: `xi-api-key: YOUR_API_KEY`
### Create Agent
```http
POST /v1/convai/agents/create
Content-Type: application/json
{
"name": "My NPC Agent",
"conversation_config": {
"agent": {
"first_message": "Hello adventurer!",
"prompt": { "prompt": "You are a wise tavern keeper in a fantasy world." },
"language": "en"
}
}
}
```
Response includes `agent_id`.
### List Agents
```http
GET /v1/convai/agents?page_size=30&search=&sort_by=created_at&sort_direction=desc
```
Response:
```json
{
"agents": [
{
"agent_id": "abc123xyz",
"name": "My NPC Agent",
"created_at_unix_secs": 1708300000,
"last_call_time_unix_secs": null,
"archived": false,
"tags": []
}
],
"has_more": false,
"next_cursor": null
}
```
### Get Agent
```http
GET /v1/convai/agents/{agent_id}
```
### Update Agent
```http
PATCH /v1/convai/agents/{agent_id}
Content-Type: application/json
{ "name": "Updated Name", "conversation_config": { ... } }
```
### Delete Agent
```http
DELETE /v1/convai/agents/{agent_id}
```
---
## 6. Turn Modes
### Server VAD (Default / Recommended)
- ElevenLabs server detects when user stops speaking
- Client streams audio continuously
- Server handles all turn-taking automatically
### Client Turn Mode
- Client explicitly signals turn boundaries
- Send `user_activity` to indicate user is speaking
- Use when you have your own VAD or push-to-talk UI
---
## 7. Audio Pipeline (UE5 Implementation Notes)
```
Microphone (FAudioCapture)
→ float32 samples at device rate (e.g. 44100 Hz stereo)
→ Resample to 16000 Hz mono
→ Convert float32 → int16 little-endian
→ Base64-encode
→ Send as {"user_audio_chunk": "BASE64"}
Server → {"type":"audio","audio_event":{"audio_base_64":"BASE64"}}
→ Base64-decode
→ Raw PCM bytes
→ Push to USoundWaveProcedural
→ UAudioComponent plays back
```
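The resample step above is plain linear interpolation. A standalone sketch, assuming the input has already been mixed down to mono (the plugin does the same in its microphone capture component):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Linear-interpolation resampler: mono float samples, InRate -> OutRate.
std::vector<float> ResampleLinear(const std::vector<float>& In,
                                  int32_t InRate, int32_t OutRate)
{
    if (In.empty() || InRate <= 0 || OutRate <= 0) return {};
    const double Step = (double)InRate / (double)OutRate;
    const size_t OutNum = (size_t)((In.size() - 1) / Step) + 1;
    std::vector<float> Out(OutNum);
    for (size_t o = 0; o < OutNum; ++o)
    {
        const double Pos  = o * Step;            // fractional source index
        const size_t i    = (size_t)Pos;
        const double Frac = Pos - i;
        const float  A    = In[i];
        const float  B    = (i + 1 < In.size()) ? In[i + 1] : In[i];
        Out[o] = (float)(A + (B - A) * Frac);
    }
    return Out;
}
```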
### Float32 → Int16 Conversion (C++)
```cpp
static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatSamples)
{
TArray<uint8> Bytes;
Bytes.SetNumUninitialized(FloatSamples.Num() * 2);
for (int32 i = 0; i < FloatSamples.Num(); i++)
{
float Clamped = FMath::Clamp(FloatSamples[i], -1.f, 1.f);
int16 Sample = (int16)(Clamped * 32767.f);
Bytes[i * 2] = (uint8)(Sample & 0xFF); // Low byte
Bytes[i * 2 + 1] = (uint8)((Sample >> 8) & 0xFF); // High byte
}
return Bytes;
}
```
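The playback side reverses this conversion. Note the plugin feeds the raw int16 bytes straight to `USoundWaveProcedural`, which accepts 16-bit PCM directly, so this inverse is only needed if you want float samples (e.g. for visualisation or analysis):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Inverse of the conversion above: int16 little-endian bytes -> float [-1, 1].
std::vector<float> Int16BytesToFloatPCM(const std::vector<uint8_t>& Bytes)
{
    std::vector<float> Out(Bytes.size() / 2);
    for (size_t i = 0; i < Out.size(); ++i)
    {
        const int16_t Sample =
            (int16_t)((uint16_t)Bytes[i * 2] | ((uint16_t)Bytes[i * 2 + 1] << 8));
        Out[i] = Sample / 32768.f;
    }
    return Out;
}
```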
---
## 8. Quick Integration Checklist (UE5 Plugin)
- [ ] Set `AgentID` in `UElevenLabsSettings` (Project Settings → ElevenLabs AI Agent)
- Or override per-component via `UElevenLabsConversationalAgentComponent::AgentID`
- [ ] Set `API_Key` in settings (or leave empty for public agents)
- [ ] Add `UElevenLabsConversationalAgentComponent` to your NPC actor
- [ ] Set `TurnMode` (default: `Server` — recommended)
- [ ] Bind to events: `OnAgentConnected`, `OnAgentTranscript`, `OnAgentTextResponse`, `OnAgentStartedSpeaking`, `OnAgentStoppedSpeaking`
- [ ] Call `StartConversation()` to begin
- [ ] Call `EndConversation()` when done
---
## 9. Key API URLs Reference
| Purpose | URL |
|---------|-----|
| Dashboard | https://elevenlabs.io/app/conversational-ai |
| API Keys | https://elevenlabs.io/app/settings/api-keys |
| WebSocket endpoint | wss://api.elevenlabs.io/v1/convai/conversation |
| Agents list | GET https://api.elevenlabs.io/v1/convai/agents |
| Agent by ID | GET https://api.elevenlabs.io/v1/convai/agents/{agent_id} |
| Create agent | POST https://api.elevenlabs.io/v1/convai/agents/create |
| Signed URL | GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url |
| WS protocol docs | https://elevenlabs.io/docs/eleven-agents/api-reference/eleven-agents/websocket |
| Quickstart | https://elevenlabs.io/docs/eleven-agents/quickstart |


@ -0,0 +1,61 @@
# PS_AI_Agent_ElevenLabs Plugin
## Location
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`
## File Map
```
PS_AI_Agent_ElevenLabs.uplugin
Source/PS_AI_Agent_ElevenLabs/
PS_AI_Agent_ElevenLabs.Build.cs
Public/
PS_AI_Agent_ElevenLabs.h FPS_AI_Agent_ElevenLabsModule + UElevenLabsSettings
ElevenLabsDefinitions.h Enums, structs, ElevenLabsMessageType/Audio constants
ElevenLabsWebSocketProxy.h/.cpp UObject managing one WS session
ElevenLabsConversationalAgentComponent.h/.cpp Main ActorComponent (attach to NPC)
ElevenLabsMicrophoneCaptureComponent.h/.cpp Mic capture, resample, dispatch to game thread
Private/
(implementations of the above)
```
## ElevenLabs Conversational AI Protocol
- **WebSocket URL**: `wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>`
- **Auth**: HTTP upgrade header `xi-api-key: <key>` (set in Project Settings)
- **All frames**: JSON text (no binary frames used by the API)
- **Audio format**: PCM 16-bit signed, 16000 Hz, mono, little-endian — Base64-encoded in JSON
### Client → Server messages
| Type field value | Payload |
|---|---|
| *(none: the key itself is the type)* `user_audio_chunk` | `{ "user_audio_chunk": "<base64 PCM>" }` |
| `user_turn_start` | `{ "type": "user_turn_start" }` |
| `user_turn_end` | `{ "type": "user_turn_end" }` |
| `interrupt` | `{ "type": "interrupt" }` |
| `pong` | `{ "type": "pong", "pong_event": { "event_id": N } }` |
### Server → Client messages (field: `type`)
| type value | Key nested object | Notes |
|---|---|---|
| `conversation_initiation_metadata` | `conversation_initiation_metadata_event.conversation_id` | Marks WS ready |
| `audio` | `audio_event.audio_base_64` | Base64 PCM from agent |
| `transcript` | `transcript_event.{speaker, message, is_final}` | User or agent speech |
| `agent_response` | `agent_response_event.agent_response` | Final agent text |
| `interruption` | — | Agent stopped mid-sentence |
| `ping` | `ping_event.event_id` | Must reply with pong |
## Key Design Decisions
- **No gRPC / no ThirdParty libs** — pure UE WebSockets + HTTP, builds out of the box
- Audio resampled in-plugin: device rate → 16000 Hz mono (linear interpolation)
- `USoundWaveProcedural` for real-time agent audio playback (queue-driven)
- Silence heuristic: 30 game-thread ticks (~0.5 s at 60 fps) with no new audio → agent done speaking
- `bSignedURLMode` setting: fetch a signed WS URL from your own backend (keeps API key off client)
- Two turn modes: `Server VAD` (ElevenLabs detects speech end) and `Client Controlled` (push-to-talk)
## Build Dependencies (Build.cs)
Core, CoreUObject, Engine, InputCore, Json, JsonUtilities, WebSockets, HTTP,
AudioMixer, AudioCaptureCore, AudioCapture, Voice, SignalProcessing
## Status
- **Session 1** (2026-02-19): All source files written, registered in .uproject. Not yet compiled.
- **TODO**: Open in UE 5.5 Editor → compile → test basic WS connection with a test agent ID.
- **Watch out**: Verify `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature vs UE 5.5 API.


@ -0,0 +1,79 @@
# Project Context & Original Ask
## What the user wants to build
A **UE5 plugin** that integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5,
allowing an in-game NPC (or any Actor) to hold a real-time voice conversation with a player.
### The original request (paraphrased)
> "I want to create a plugin to use ElevenLabs Conversational Agent in Unreal Engine 5.5.
> I previously used the Convai plugin which does what I want, but I prefer ElevenLabs quality.
> The goal is to create a plugin in the existing Unreal Project to make a first step for integration.
> Convai AI plugin may be too big in terms of functionality for the new project, but it is the final goal.
> You can use the Convai source code to find the right way to make the ElevenLabs version —
> it should be very similar."
### Plugin name
`PS_AI_Agent_ElevenLabs`
---
## User's mental model / intent
1. **Short-term**: A working first-step plugin — minimal but functional — that can:
- Connect to ElevenLabs Conversational AI via WebSocket
- Capture microphone audio from the player
- Stream it to ElevenLabs in real time
- Play back the agent's voice response
- Surface key events (transcript, agent text, speaking state) to Blueprint
2. **Long-term**: Match the full feature set of Convai — character IDs, session memory,
actions/environment context, lip-sync, etc. — but powered by ElevenLabs instead.
3. **Key preference**: Simpler than Convai. No gRPC, no protobuf, no ThirdParty precompiled
libraries. ElevenLabs' Conversational AI API uses plain WebSocket + JSON, which maps
naturally to UE's built-in `WebSockets` module.
---
## How we used Convai as a reference
We studied the Convai plugin source (`ConvAI/Convai/`) to understand:
- **Module structure**: `UConvaiSettings` + `IModuleInterface` + `ISettingsModule` registration
- **Audio capture pattern**: `Audio::FAudioCapture`, ring buffers, thread-safe dispatch to game thread
- **Audio playback pattern**: `USoundWaveProcedural` fed from a queue
- **Component architecture**: `UConvaiChatbotComponent` (NPC side) + `UConvaiPlayerComponent` (player side)
- **HTTP proxy pattern**: `UConvaiAPIBaseProxy` base class for async REST calls
- **Voice type enum**: Convai already had `EVoiceType::ElevenLabsVoices` — confirming ElevenLabs
is a natural fit
We then replaced gRPC/protobuf with **WebSocket + JSON** to match the ElevenLabs API, and
simplified the architecture to the minimum needed for a first working version.
---
## What was built (Session 1 — 2026-02-19)
All source files created and registered. See `.claude/elevenlabs_plugin.md` for full file map and protocol details.
### Components created
| Class | Role |
|---|---|
| `UElevenLabsSettings` | Project Settings UI — API key, Agent ID, security options |
| `UElevenLabsWebSocketProxy` | Manages one WS session: connect, send audio, handle all server message types |
| `UElevenLabsConversationalAgentComponent` | ActorComponent to attach to any NPC — orchestrates mic + WS + playback |
| `UElevenLabsMicrophoneCaptureComponent` | Wraps `Audio::FAudioCapture`, resamples to 16 kHz mono |
### Not yet done (next sessions)
- Compile & test in UE 5.5 Editor
- Verify `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature for UE 5.5
- Add lip-sync support (future)
- Add session memory / conversation history (future)
- Add environment/action context support (future, matching Convai's full feature set)
---
## Notes on the ElevenLabs API
- Docs: https://elevenlabs.io/docs/conversational-ai
- Create agents at: https://elevenlabs.io/app/conversational-ai
- API keys at: https://elevenlabs.io (dashboard)


@ -0,0 +1,200 @@
# Session Log — 2026-02-19
**Project**: PS_AI_Agent (Unreal Engine 5.5)
**Machine**: Desktop PC (j_foucher)
**Working directory**: `E:\ASTERION\GIT\PS_AI_Agent`
---
## Conversation Summary
### 1. Initial Request
User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE5.5.
Reference: existing Convai plugin (gRPC-based, more complex). Goal: simpler version using ElevenLabs.
Plugin name requested: `PS_AI_Agent_ElevenLabs`.
### 2. Codebase Exploration
Explored the Convai plugin source at `ConvAI/Convai/` to understand:
- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction
### 3. Plugin Created
All source files written from scratch under:
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`
Files created:
- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WS session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered
Commit: `f0055e8`
### 4. Memory Files Created
To allow context recovery on any machine (including laptop):
- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- Local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`
Commit: `f0055e8` (with plugin), updated in `4d6ae10`
### 5. .gitignore Updated
Added to existing ignores:
- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln` / `*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`
Commit: `4d6ae10`, `b114ab0`
### 6. Compile — First Attempt (Errors Found)
Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:
- `WebSockets` listed in `.uplugin` — it's a module not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- `FOnAudioCaptureFunction` callback uses `const void*` not `const float*` → fixed cast
- `TArray::RemoveAt(0, N, false)` deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in `.uplugin` Plugins array → added
Commit: `bb1a857`
### 7. Compile — Success
Clean build, no warnings, no errors.
Output: `Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll`
Memory updated with confirmed UE 5.5 API patterns. Commit: `3b98edc`
### 8. Documentation — Markdown
Full reference doc written to `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:
- Installation, Project Settings, Quick Start (BP + C++), Components Reference,
Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.
Commit: `c833ccd`
### 9. Documentation — PowerPoint
20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):
- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in repo root
- Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
- Generator script `generate_pptx.py` excluded from git via .gitignore
Commit: `1b72026`
---
## Session 2 — 2026-02-19 (continued context)
### 10. API vs Implementation Cross-Check (3 bugs found and fixed)
Cross-referenced `elevenlabs_api_reference.md` against plugin source. Found 3 protocol bugs:
**Bug 1 — Transcript fields wrong:**
- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`
**Bug 2 — Pong format wrong:**
- `event_id` was nested in `pong_event{}` → must be top-level
**Bug 3 — Client turn mode messages don't exist:**
- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)
Commit: `ae2c9b9`
### 11. SendTextMessage Added
User asked for text input to agent for testing (without mic).
Added `SendTextMessage(FString)` to `UElevenLabsWebSocketProxy` and `UElevenLabsConversationalAgentComponent`.
Sends `{"type":"user_message","text":"..."}` — agent replies with audio + text.
Commit: `b489d11`
### 12. Binary WebSocket Frame Fix
User reported: `"Received unexpected binary WebSocket frame"` warnings.
Root cause: ElevenLabs sends **ALL WebSocket frames as binary**, never text.
`OnMessage` (text handler) never fires. `OnRawMessage` must handle everything.
Fix: Implemented `OnWsBinaryMessage` with fragment reassembly (`BinaryFrameBuffer`).
Commit: `669c503`
### 13. JSON vs PCM Discrimination Fix
After binary fix: `"Failed to parse WebSocket message as JSON"` errors.
Root cause: Binary frames contain BOTH JSON control messages AND raw PCM audio.
Fix: Peek at byte[0] of assembled buffer:
- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`
Commit: `4834567`
### 14. Documentation Updated to v1.1.0
Full rewrite of `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:
- Added Changelog section (v1.0.0 / v1.1.0)
- Updated audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected turn mode docs, transcript docs, `OnAgentConnected` timing
- New troubleshooting entries
Commit: `e464cfe`
### 15. Test Blueprint Asset Updated
`test_AI_Actor.uasset` updated in UE Editor.
Commit: `99017f4`
---
## Git History (this session)
| Hash | Message |
|------|---------|
| `f0055e8` | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| `4d6ae10` | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| `b114ab0` | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| `bb1a857` | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| `3b98edc` | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| `c833ccd` | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| `1b72026` | Add PowerPoint documentation and update .gitignore |
| `bbeb429` | ElevenLabs API reference doc |
| `dbd6161` | TestMap, test actor, DefaultEngine.ini, memory update |
| `ae2c9b9` | Fix 3 WebSocket protocol bugs |
| `b489d11` | Add SendTextMessage |
| `669c503` | Fix binary WebSocket frames |
| `4834567` | Fix JSON vs binary frame discrimination |
| `e464cfe` | Update documentation to v1.1.0 |
| `99017f4` | Update test_AI_Actor blueprint asset |
---
## Key Technical Decisions Made This Session
| Decision | Reason |
|----------|--------|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| `WebSockets` in Build.cs only | It's a module (no `.uplugin` file), declaring it in `.uplugin` causes build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside callback |
| `EAllowShrinking::No` | Bool overload of `RemoveAt` deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5s at 60fps heuristic to detect agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in ElevenLabs API |
---
## Next Steps (not done yet)
- [ ] Verify mic audio actually reaches ElevenLabs (enable Verbose Logging, test in Editor)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)
- [ ] Add session memory / conversation history (future, matching Convai)
- [ ] Add environment/action context support (future)
- [ ] Consider Signed URL Mode backend implementation

.gitignore vendored

@ -4,3 +4,17 @@ Unreal/PS_AI_Agent/Binaries/
Unreal/PS_AI_Agent/Intermediate/
Unreal/PS_AI_Agent/Saved/
ConvAI/Convai/Binaries/
# All plugin build artifacts (Binaries + Intermediate for any plugin)
Unreal/PS_AI_Agent/Plugins/*/Binaries/
Unreal/PS_AI_Agent/Plugins/*/Intermediate/
# UE5 generated solution files
Unreal/PS_AI_Agent/*.sln
Unreal/PS_AI_Agent/*.suo
# Claude Code local session settings (machine-specific, memory files in .claude/ are kept)
.claude/settings.local.json
# Documentation generator script (dev tool, output .pptx is committed instead)
generate_pptx.py

Binary file not shown.


@ -1,7 +1,8 @@
[/Script/EngineSettings.GameMapsSettings]
GameDefaultMap=/Engine/Maps/Templates/OpenWorld
GameDefaultMap=/Game/TestMap.TestMap
EditorStartupMap=/Game/TestMap.TestMap
[/Script/Engine.RendererSettings]
r.AllowStaticLighting=False
@ -90,3 +91,4 @@ ConnectionType=USBOnly
bUseManualIPAddress=False
ManualIPAddress=

Binary file not shown.


@ -17,6 +17,10 @@
"TargetAllowList": [
"Editor"
]
},
{
"Name": "PS_AI_Agent_ElevenLabs",
"Enabled": true
}
]
}


@ -0,0 +1,35 @@
{
"FileVersion": 3,
"Version": 1,
"VersionName": "1.0.0",
"FriendlyName": "PS AI Agent - ElevenLabs",
"Description": "Integrates ElevenLabs Conversational AI Agent into Unreal Engine 5.5. Supports real-time voice conversation via WebSocket, microphone capture, and audio playback.",
"Category": "AI",
"CreatedBy": "ASTERION",
"CreatedByURL": "",
"DocsURL": "https://elevenlabs.io/docs/conversational-ai",
"MarketplaceURL": "",
"SupportURL": "",
"CanContainContent": false,
"IsBetaVersion": true,
"IsExperimentalVersion": false,
"Installed": false,
"Modules": [
{
"Name": "PS_AI_Agent_ElevenLabs",
"Type": "Runtime",
"LoadingPhase": "PreDefault",
"PlatformAllowList": [
"Win64",
"Mac",
"Linux"
]
}
],
"Plugins": [
{
"Name": "AudioCapture",
"Enabled": true
}
]
}


@ -0,0 +1,40 @@
// Copyright ASTERION. All Rights Reserved.
using UnrealBuildTool;
public class PS_AI_Agent_ElevenLabs : ModuleRules
{
public PS_AI_Agent_ElevenLabs(ReadOnlyTargetRules Target) : base(Target)
{
DefaultBuildSettings = BuildSettingsVersion.Latest;
PCHUsage = PCHUsageMode.UseExplicitOrSharedPCHs;
PublicDependencyModuleNames.AddRange(new string[]
{
"Core",
"CoreUObject",
"Engine",
"InputCore",
// JSON serialization for WebSocket message payloads
"Json",
"JsonUtilities",
// WebSocket for ElevenLabs Conversational AI real-time API
"WebSockets",
// HTTP for REST calls (agent metadata, auth, etc.)
"HTTP",
// Audio capture (microphone input)
"AudioMixer",
"AudioCaptureCore",
"AudioCapture",
"Voice",
"SignalProcessing",
});
PrivateDependencyModuleNames.AddRange(new string[]
{
"Projects",
// For ISettingsModule (Project Settings integration)
"Settings",
});
}
}

View File

@@ -0,0 +1,345 @@
// Copyright ASTERION. All Rights Reserved.
#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsMicrophoneCaptureComponent.h"
#include "PS_AI_Agent_ElevenLabs.h"
#include "Components/AudioComponent.h"
#include "Sound/SoundWaveProcedural.h"
#include "GameFramework/Actor.h"
#include "Engine/World.h"
DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsAgent, Log, All);
// ─────────────────────────────────────────────────────────────────────────────
// Constructor
// ─────────────────────────────────────────────────────────────────────────────
UElevenLabsConversationalAgentComponent::UElevenLabsConversationalAgentComponent()
{
PrimaryComponentTick.bCanEverTick = true;
// Tick is only used to detect silence (i.e. the agent has stopped speaking).
// Disable ticking if that detection is not needed, to save per-frame cost.
PrimaryComponentTick.TickInterval = 1.0f / 60.0f;
}
// ─────────────────────────────────────────────────────────────────────────────
// Lifecycle
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::BeginPlay()
{
Super::BeginPlay();
InitAudioPlayback();
}
void UElevenLabsConversationalAgentComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
{
EndConversation();
Super::EndPlay(EndPlayReason);
}
void UElevenLabsConversationalAgentComponent::TickComponent(float DeltaTime, ELevelTick TickType,
FActorComponentTickFunction* ThisTickFunction)
{
Super::TickComponent(DeltaTime, TickType, ThisTickFunction);
if (bAgentSpeaking)
{
FScopeLock Lock(&AudioQueueLock);
if (AudioQueue.Num() == 0)
{
SilentTickCount++;
if (SilentTickCount >= SilenceThresholdTicks)
{
bAgentSpeaking = false;
SilentTickCount = 0;
OnAgentStoppedSpeaking.Broadcast();
}
}
else
{
SilentTickCount = 0;
}
}
}
// ─────────────────────────────────────────────────────────────────────────────
// Control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::StartConversation()
{
if (!WebSocketProxy)
{
WebSocketProxy = NewObject<UElevenLabsWebSocketProxy>(this);
WebSocketProxy->OnConnected.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleConnected);
WebSocketProxy->OnDisconnected.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleDisconnected);
WebSocketProxy->OnError.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleError);
WebSocketProxy->OnAudioReceived.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleAudioReceived);
WebSocketProxy->OnTranscript.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleTranscript);
WebSocketProxy->OnAgentResponse.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleAgentResponse);
WebSocketProxy->OnInterrupted.AddDynamic(this,
&UElevenLabsConversationalAgentComponent::HandleInterrupted);
}
WebSocketProxy->Connect(AgentID);
}
void UElevenLabsConversationalAgentComponent::EndConversation()
{
StopListening();
StopAgentAudio();
if (WebSocketProxy)
{
WebSocketProxy->Disconnect();
WebSocketProxy = nullptr;
}
}
void UElevenLabsConversationalAgentComponent::StartListening()
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsAgent, Warning, TEXT("StartListening: not connected."));
return;
}
if (bIsListening) return;
bIsListening = true;
if (TurnMode == EElevenLabsTurnMode::Client)
{
WebSocketProxy->SendUserTurnStart();
}
// Find the microphone component on our owner actor, or create one.
UElevenLabsMicrophoneCaptureComponent* Mic =
GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (!Mic)
{
Mic = NewObject<UElevenLabsMicrophoneCaptureComponent>(GetOwner(),
TEXT("ElevenLabsMicrophone"));
Mic->RegisterComponent();
}
Mic->OnAudioCaptured.AddUObject(this,
&UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
Mic->StartCapture();
UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture started."));
}
void UElevenLabsConversationalAgentComponent::StopListening()
{
if (!bIsListening) return;
bIsListening = false;
if (UElevenLabsMicrophoneCaptureComponent* Mic =
GetOwner() ? GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>() : nullptr)
{
Mic->StopCapture();
Mic->OnAudioCaptured.RemoveAll(this);
}
if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
{
WebSocketProxy->SendUserTurnEnd();
}
UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture stopped."));
}
void UElevenLabsConversationalAgentComponent::SendTextMessage(const FString& Text)
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsAgent, Warning, TEXT("SendTextMessage: not connected. Call StartConversation() first."));
return;
}
WebSocketProxy->SendTextMessage(Text);
}
void UElevenLabsConversationalAgentComponent::InterruptAgent()
{
if (WebSocketProxy) WebSocketProxy->SendInterrupt();
StopAgentAudio();
}
// ─────────────────────────────────────────────────────────────────────────────
// State queries
// ─────────────────────────────────────────────────────────────────────────────
bool UElevenLabsConversationalAgentComponent::IsConnected() const
{
return WebSocketProxy && WebSocketProxy->IsConnected();
}
const FElevenLabsConversationInfo& UElevenLabsConversationalAgentComponent::GetConversationInfo() const
{
static FElevenLabsConversationInfo Empty;
return WebSocketProxy ? WebSocketProxy->GetConversationInfo() : Empty;
}
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket event handlers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsConversationInfo& Info)
{
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
OnAgentConnected.Broadcast(Info);
if (bAutoStartListening)
{
StartListening();
}
}
void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCode, const FString& Reason)
{
UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
bIsListening = false;
bAgentSpeaking = false;
OnAgentDisconnected.Broadcast(StatusCode, Reason);
}
void UElevenLabsConversationalAgentComponent::HandleError(const FString& ErrorMessage)
{
UE_LOG(LogElevenLabsAgent, Error, TEXT("Agent error: %s"), *ErrorMessage);
OnAgentError.Broadcast(ErrorMessage);
}
void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<uint8>& PCMData)
{
EnqueueAgentAudio(PCMData);
}
void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
OnAgentTranscript.Broadcast(Segment);
}
void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
OnAgentTextResponse.Broadcast(ResponseText);
}
void UElevenLabsConversationalAgentComponent::HandleInterrupted()
{
StopAgentAudio();
OnAgentInterrupted.Broadcast();
}
// ─────────────────────────────────────────────────────────────────────────────
// Audio playback
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::InitAudioPlayback()
{
AActor* Owner = GetOwner();
if (!Owner) return;
// USoundWaveProcedural lets us push raw PCM data at runtime.
ProceduralSoundWave = NewObject<USoundWaveProcedural>(this);
ProceduralSoundWave->SetSampleRate(ElevenLabsAudio::SampleRate);
ProceduralSoundWave->NumChannels = ElevenLabsAudio::Channels;
ProceduralSoundWave->Duration = INDEFINITELY_LOOPING_DURATION;
ProceduralSoundWave->SoundGroup = SOUNDGROUP_Voice;
ProceduralSoundWave->bLooping = false;
// Create the audio component attached to the owner.
AudioPlaybackComponent = NewObject<UAudioComponent>(Owner, TEXT("ElevenLabsAudioPlayback"));
AudioPlaybackComponent->RegisterComponent();
AudioPlaybackComponent->bAutoActivate = false;
AudioPlaybackComponent->SetSound(ProceduralSoundWave);
// When the procedural sound wave needs more audio data, pull from our queue.
ProceduralSoundWave->OnSoundWaveProceduralUnderflow =
FOnSoundWaveProceduralUnderflow::CreateUObject(
this, &UElevenLabsConversationalAgentComponent::OnProceduralUnderflow);
}
void UElevenLabsConversationalAgentComponent::OnProceduralUnderflow(
USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired)
{
FScopeLock Lock(&AudioQueueLock);
if (AudioQueue.Num() == 0) return;
const int32 BytesRequired = SamplesRequired * sizeof(int16);
const int32 BytesToPush = FMath::Min(AudioQueue.Num(), BytesRequired);
InProceduralWave->QueueAudio(AudioQueue.GetData(), BytesToPush);
AudioQueue.RemoveAt(0, BytesToPush, EAllowShrinking::No);
}
void UElevenLabsConversationalAgentComponent::EnqueueAgentAudio(const TArray<uint8>& PCMData)
{
{
FScopeLock Lock(&AudioQueueLock);
AudioQueue.Append(PCMData);
}
// Start playback if not already playing.
if (!bAgentSpeaking)
{
bAgentSpeaking = true;
SilentTickCount = 0;
OnAgentStartedSpeaking.Broadcast();
if (AudioPlaybackComponent && !AudioPlaybackComponent->IsPlaying())
{
AudioPlaybackComponent->Play();
}
}
}
void UElevenLabsConversationalAgentComponent::StopAgentAudio()
{
if (AudioPlaybackComponent && AudioPlaybackComponent->IsPlaying())
{
AudioPlaybackComponent->Stop();
}
FScopeLock Lock(&AudioQueueLock);
AudioQueue.Empty();
if (bAgentSpeaking)
{
bAgentSpeaking = false;
SilentTickCount = 0;
OnAgentStoppedSpeaking.Broadcast();
}
}
// ─────────────────────────────────────────────────────────────────────────────
// Microphone → WebSocket
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TArray<float>& FloatPCM)
{
if (!IsConnected() || !bIsListening) return;
TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
WebSocketProxy->SendAudioChunk(PCMBytes);
}
TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)
{
TArray<uint8> Out;
Out.Reserve(FloatPCM.Num() * 2);
for (float Sample : FloatPCM)
{
// Clamp to [-1,1] then scale to int16 range
const float Clamped = FMath::Clamp(Sample, -1.0f, 1.0f);
const int16 Int16Sample = static_cast<int16>(Clamped * 32767.0f);
// Little-endian
Out.Add(static_cast<uint8>(Int16Sample & 0xFF));
Out.Add(static_cast<uint8>((Int16Sample >> 8) & 0xFF));
}
return Out;
}

View File

@@ -0,0 +1,171 @@
// Copyright ASTERION. All Rights Reserved.
#include "ElevenLabsMicrophoneCaptureComponent.h"
#include "ElevenLabsDefinitions.h"
#include "AudioCaptureCore.h"
#include "Async/Async.h"
DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsMic, Log, All);
// ─────────────────────────────────────────────────────────────────────────────
// Constructor
// ─────────────────────────────────────────────────────────────────────────────
UElevenLabsMicrophoneCaptureComponent::UElevenLabsMicrophoneCaptureComponent()
{
PrimaryComponentTick.bCanEverTick = false;
}
// ─────────────────────────────────────────────────────────────────────────────
// Lifecycle
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
{
StopCapture();
Super::EndPlay(EndPlayReason);
}
// ─────────────────────────────────────────────────────────────────────────────
// Capture control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::StartCapture()
{
if (bCapturing)
{
UE_LOG(LogElevenLabsMic, Warning, TEXT("StartCapture called while already capturing."));
return;
}
// Open the default audio capture stream.
// FOnAudioCaptureFunction uses const void* per UE 5.3+ API (cast to float* inside).
Audio::FOnAudioCaptureFunction CaptureCallback =
[this](const void* InAudio, int32 NumFrames, int32 InNumChannels,
int32 InSampleRate, double StreamTime, bool bOverflow)
{
OnAudioGenerate(InAudio, NumFrames, InNumChannels, InSampleRate, StreamTime, bOverflow);
};
if (!AudioCapture.OpenAudioCaptureStream(DeviceParams, MoveTemp(CaptureCallback), 1024))
{
UE_LOG(LogElevenLabsMic, Error, TEXT("Failed to open default audio capture stream."));
return;
}
// Retrieve the actual device parameters after opening the stream.
Audio::FCaptureDeviceInfo DeviceInfo;
if (AudioCapture.GetCaptureDeviceInfo(DeviceInfo))
{
DeviceSampleRate = DeviceInfo.PreferredSampleRate;
DeviceChannels = DeviceInfo.InputChannels;
UE_LOG(LogElevenLabsMic, Log, TEXT("Capture device: %s | Rate=%d | Channels=%d"),
*DeviceInfo.DeviceName, DeviceSampleRate, DeviceChannels);
}
AudioCapture.StartStream();
bCapturing = true;
UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture started."));
}
void UElevenLabsMicrophoneCaptureComponent::StopCapture()
{
if (!bCapturing) return;
AudioCapture.StopStream();
AudioCapture.CloseStream();
bCapturing = false;
UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture stopped."));
}
// ─────────────────────────────────────────────────────────────────────────────
// Audio callback (background thread)
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(
const void* InAudio, int32 NumFrames,
int32 InNumChannels, int32 InSampleRate,
double StreamTime, bool bOverflow)
{
if (bOverflow)
{
UE_LOG(LogElevenLabsMic, Verbose, TEXT("Audio capture buffer overflow."));
}
// Device sends float32 interleaved samples; cast from the void* API.
const float* FloatAudio = static_cast<const float*>(InAudio);
// Downmix to mono, then resample to 16000 Hz.
TArray<float> Resampled = ResampleTo16000(FloatAudio, NumFrames, InNumChannels, InSampleRate);
// Apply volume multiplier.
if (!FMath::IsNearlyEqual(VolumeMultiplier, 1.0f))
{
for (float& S : Resampled)
{
S *= VolumeMultiplier;
}
}
// Fire the delegate on the game thread so subscribers don't need to be
// thread-safe (WebSocket Send is not thread-safe in UE's implementation).
AsyncTask(ENamedThreads::GameThread, [this, Data = MoveTemp(Resampled)]()
{
if (bCapturing)
{
OnAudioCaptured.Broadcast(Data);
}
});
}
// ─────────────────────────────────────────────────────────────────────────────
// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
const float* InAudio, int32 NumSamples,
int32 InChannels, int32 InSampleRate)
{
const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000
// --- Step 1: Downmix to mono ---
TArray<float> Mono;
if (InChannels == 1)
{
Mono = TArray<float>(InAudio, NumSamples);
}
else
{
const int32 NumFrames = NumSamples / InChannels;
Mono.Reserve(NumFrames);
for (int32 i = 0; i < NumFrames; i++)
{
float Sum = 0.0f;
for (int32 c = 0; c < InChannels; c++)
{
Sum += InAudio[i * InChannels + c];
}
Mono.Add(Sum / static_cast<float>(InChannels));
}
}
// --- Step 2: Resample via linear interpolation ---
if (InSampleRate == TargetRate)
{
return Mono;
}
const float Ratio = static_cast<float>(InSampleRate) / static_cast<float>(TargetRate);
const int32 OutSamples = FMath::FloorToInt(static_cast<float>(Mono.Num()) / Ratio);
TArray<float> Out;
Out.Reserve(OutSamples);
for (int32 i = 0; i < OutSamples; i++)
{
const float SrcIndex = static_cast<float>(i) * Ratio;
const int32 SrcLow = FMath::FloorToInt(SrcIndex);
const int32 SrcHigh = FMath::Min(SrcLow + 1, Mono.Num() - 1);
const float Alpha = SrcIndex - static_cast<float>(SrcLow);
Out.Add(FMath::Lerp(Mono[SrcLow], Mono[SrcHigh], Alpha));
}
return Out;
}

View File

@@ -0,0 +1,455 @@
// Copyright ASTERION. All Rights Reserved.
#include "ElevenLabsWebSocketProxy.h"
#include "PS_AI_Agent_ElevenLabs.h"
#include "WebSocketsModule.h"
#include "IWebSocket.h"
#include "Json.h"
#include "JsonUtilities.h"
#include "Misc/Base64.h"
DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsWS, Log, All);
// ─────────────────────────────────────────────────────────────────────────────
// Helpers
// ─────────────────────────────────────────────────────────────────────────────
static void EL_LOG(bool bVerbose, const TCHAR* Format, ...)
{
if (!bVerbose) return;
va_list Args;
va_start(Args, Format);
// Forward to UE_LOG at Verbose level
TCHAR Buffer[2048];
FCString::GetVarArgs(Buffer, UE_ARRAY_COUNT(Buffer), Format, Args);
va_end(Args);
UE_LOG(LogElevenLabsWS, Verbose, TEXT("%s"), Buffer);
}
// ─────────────────────────────────────────────────────────────────────────────
// Connect / Disconnect
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FString& APIKeyOverride)
{
if (ConnectionState == EElevenLabsConnectionState::Connected ||
ConnectionState == EElevenLabsConnectionState::Connecting)
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("Connect called but already connecting/connected. Ignoring."));
return;
}
if (!FModuleManager::Get().IsModuleLoaded("WebSockets"))
{
FModuleManager::LoadModuleChecked<FWebSocketsModule>("WebSockets");
}
const FString URL = BuildWebSocketURL(AgentIDOverride, APIKeyOverride);
if (URL.IsEmpty())
{
const FString Msg = TEXT("Cannot connect: no Agent ID configured. Set it in Project Settings or pass it to Connect().");
UE_LOG(LogElevenLabsWS, Error, TEXT("%s"), *Msg);
OnError.Broadcast(Msg);
ConnectionState = EElevenLabsConnectionState::Error;
return;
}
UE_LOG(LogElevenLabsWS, Log, TEXT("Connecting to ElevenLabs: %s"), *URL);
ConnectionState = EElevenLabsConnectionState::Connecting;
// Headers: the ElevenLabs Conversational AI WS endpoint accepts the
// xi-api-key header on the initial HTTP upgrade request.
TMap<FString, FString> UpgradeHeaders;
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
const FString ResolvedKey = APIKeyOverride.IsEmpty() ? Settings->API_Key : APIKeyOverride;
if (!ResolvedKey.IsEmpty())
{
UpgradeHeaders.Add(TEXT("xi-api-key"), ResolvedKey);
}
WebSocket = FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), UpgradeHeaders);
WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
WebSocket->OnMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsMessage);
WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);
WebSocket->Connect();
}
void UElevenLabsWebSocketProxy::Disconnect()
{
if (WebSocket.IsValid() && WebSocket->IsConnected())
{
WebSocket->Close(1000, TEXT("Client disconnected"));
}
ConnectionState = EElevenLabsConnectionState::Disconnected;
}
// ─────────────────────────────────────────────────────────────────────────────
// Audio & turn control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected."));
return;
}
if (PCMData.Num() == 0) return;
// ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(ElevenLabsMessageType::AudioChunk, Base64Audio);
SendJsonMessage(Msg);
}
void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
// In client turn mode, signal that the user is active/speaking.
// API message: { "type": "user_activity" }
if (!IsConnected()) return;
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserActivity);
SendJsonMessage(Msg);
}
void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
// In client turn mode, stopping user_activity signals end of user turn.
// The API uses user_activity for ongoing speech; simply stop sending it.
// No explicit end message is required — silence is detected server-side.
// We still log for debug visibility.
UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended (client mode) — stopped sending user_activity."));
}
void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)
{
if (!IsConnected())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendTextMessage: not connected."));
return;
}
if (Text.IsEmpty()) return;
// API: { "type": "user_message", "text": "Hello agent" }
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserMessage);
Msg->SetStringField(TEXT("text"), Text);
SendJsonMessage(Msg);
}
void UElevenLabsWebSocketProxy::SendInterrupt()
{
if (!IsConnected()) return;
TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::Interrupt);
SendJsonMessage(Msg);
}
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket callbacks
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::OnWsConnected()
{
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
// State stays Connecting until we receive the initiation metadata from the server.
}
void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)
{
UE_LOG(LogElevenLabsWS, Error, TEXT("WebSocket connection error: %s"), *Error);
ConnectionState = EElevenLabsConnectionState::Error;
OnError.Broadcast(Error);
}
void UElevenLabsWebSocketProxy::OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean)
{
UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket closed. Code=%d Reason=%s Clean=%d"), StatusCode, *Reason, bWasClean);
ConnectionState = EElevenLabsConnectionState::Disconnected;
WebSocket.Reset();
OnDisconnected.Broadcast(StatusCode, Reason);
}
void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
{
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
if (Settings->bVerboseLogging)
{
UE_LOG(LogElevenLabsWS, Verbose, TEXT(">> %s"), *Message);
}
TSharedPtr<FJsonObject> Root;
TSharedRef<TJsonReader<>> Reader = TJsonReaderFactory<>::Create(Message);
if (!FJsonSerializer::Deserialize(Reader, Root) || !Root.IsValid())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to parse WebSocket message as JSON (first 80 chars): %.80s"), *Message);
return;
}
FString MsgType;
// ElevenLabs wraps the type in a "type" field
if (!Root->TryGetStringField(TEXT("type"), MsgType))
{
// Fallback: some messages use the top-level key as the type
// e.g. { "user_audio_chunk": "..." } from ourselves (shouldn't arrive)
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Message has no 'type' field, ignoring."));
return;
}
if (MsgType == ElevenLabsMessageType::ConversationInitiation)
{
HandleConversationInitiation(Root);
}
else if (MsgType == ElevenLabsMessageType::AudioResponse)
{
HandleAudioResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::UserTranscript)
{
HandleTranscript(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponse)
{
HandleAgentResponse(Root);
}
else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)
{
// Silently ignore for now — corrected text after interruption.
UE_LOG(LogElevenLabsWS, Verbose, TEXT("agent_response_correction received (ignored)."));
}
else if (MsgType == ElevenLabsMessageType::InterruptionEvent)
{
HandleInterruption(Root);
}
else if (MsgType == ElevenLabsMessageType::PingEvent)
{
HandlePing(Root);
}
else
{
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Unhandled message type: %s"), *MsgType);
}
}
void UElevenLabsWebSocketProxy::OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining)
{
// Accumulate fragments until BytesRemaining == 0.
const uint8* Bytes = static_cast<const uint8*>(Data);
BinaryFrameBuffer.Append(Bytes, Size);
if (BytesRemaining > 0)
{
// More fragments coming — wait for the rest
return;
}
const int32 TotalSize = BinaryFrameBuffer.Num();
// Peek at first byte to distinguish JSON (starts with '{') from raw binary audio.
const bool bLooksLikeJson = (TotalSize > 0 && BinaryFrameBuffer[0] == '{');
if (bLooksLikeJson)
{
// Null-terminate safely then decode as UTF-8 JSON
BinaryFrameBuffer.Add(0);
const FString JsonString = FString(UTF8_TO_TCHAR(
reinterpret_cast<const char*>(BinaryFrameBuffer.GetData())));
BinaryFrameBuffer.Reset();
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
if (Settings->bVerboseLogging)
{
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary JSON frame (%d bytes): %.120s"), TotalSize, *JsonString);
}
OnWsMessage(JsonString);
}
else
{
// Raw binary audio frame — PCM bytes sent directly without Base64/JSON wrapper.
// Log first few bytes as hex to help diagnose the format.
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
if (Settings->bVerboseLogging)
{
FString HexPreview;
const int32 PreviewBytes = FMath::Min(TotalSize, 8);
for (int32 i = 0; i < PreviewBytes; i++)
{
HexPreview += FString::Printf(TEXT("%02X "), BinaryFrameBuffer[i]);
}
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary audio frame: %d bytes | first bytes: %s"), TotalSize, *HexPreview);
}
// Broadcast raw PCM bytes directly to the audio queue.
TArray<uint8> PCMData = MoveTemp(BinaryFrameBuffer);
BinaryFrameBuffer.Reset();
OnAudioReceived.Broadcast(PCMData);
}
}
// ─────────────────────────────────────────────────────────────────────────────
// Message handlers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::HandleConversationInitiation(const TSharedPtr<FJsonObject>& Root)
{
// Expected structure:
// { "type": "conversation_initiation_metadata",
// "conversation_initiation_metadata_event": {
// "conversation_id": "...",
// "agent_output_audio_format": "pcm_16000"
// }
// }
const TSharedPtr<FJsonObject>* MetaObj = nullptr;
if (Root->TryGetObjectField(TEXT("conversation_initiation_metadata_event"), MetaObj) && MetaObj)
{
(*MetaObj)->TryGetStringField(TEXT("conversation_id"), ConversationInfo.ConversationID);
}
UE_LOG(LogElevenLabsWS, Log, TEXT("Conversation initiated. ID=%s"), *ConversationInfo.ConversationID);
ConnectionState = EElevenLabsConnectionState::Connected;
OnConnected.Broadcast(ConversationInfo);
}
void UElevenLabsWebSocketProxy::HandleAudioResponse(const TSharedPtr<FJsonObject>& Root)
{
// Expected structure:
// { "type": "audio",
// "audio_event": { "audio_base_64": "<base64 PCM>", "event_id": 1 }
// }
const TSharedPtr<FJsonObject>* AudioEvent = nullptr;
if (!Root->TryGetObjectField(TEXT("audio_event"), AudioEvent) || !AudioEvent)
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("audio message missing 'audio_event' field."));
return;
}
FString Base64Audio;
if (!(*AudioEvent)->TryGetStringField(TEXT("audio_base_64"), Base64Audio))
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("audio_event missing 'audio_base_64' field."));
return;
}
TArray<uint8> PCMData;
if (!FBase64::Decode(Base64Audio, PCMData))
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to Base64-decode audio data."));
return;
}
OnAudioReceived.Broadcast(PCMData);
}
void UElevenLabsWebSocketProxy::HandleTranscript(const TSharedPtr<FJsonObject>& Root)
{
// API structure:
// { "type": "user_transcript",
// "user_transcription_event": { "user_transcript": "Hello" }
// }
// This message only carries the user's speech-to-text — speaker is always "user".
const TSharedPtr<FJsonObject>* TranscriptEvent = nullptr;
if (!Root->TryGetObjectField(TEXT("user_transcription_event"), TranscriptEvent) || !TranscriptEvent)
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("user_transcript message missing 'user_transcription_event' field."));
return;
}
FElevenLabsTranscriptSegment Segment;
Segment.Speaker = TEXT("user");
(*TranscriptEvent)->TryGetStringField(TEXT("user_transcript"), Segment.Text);
// user_transcript messages are always final (interim results are not sent for user speech)
Segment.bIsFinal = true;
OnTranscript.Broadcast(Segment);
}
void UElevenLabsWebSocketProxy::HandleAgentResponse(const TSharedPtr<FJsonObject>& Root)
{
// { "type": "agent_response",
// "agent_response_event": { "agent_response": "..." }
// }
const TSharedPtr<FJsonObject>* ResponseEvent = nullptr;
if (!Root->TryGetObjectField(TEXT("agent_response_event"), ResponseEvent) || !ResponseEvent)
{
return;
}
FString ResponseText;
(*ResponseEvent)->TryGetStringField(TEXT("agent_response"), ResponseText);
OnAgentResponse.Broadcast(ResponseText);
}
void UElevenLabsWebSocketProxy::HandleInterruption(const TSharedPtr<FJsonObject>& Root)
{
UE_LOG(LogElevenLabsWS, Log, TEXT("Agent interrupted."));
OnInterrupted.Broadcast();
}
void UElevenLabsWebSocketProxy::HandlePing(const TSharedPtr<FJsonObject>& Root)
{
// Reply with a pong to keep the connection alive.
// Incoming: { "type": "ping", "ping_event": { "event_id": 1, "ping_ms": 150 } }
// Reply: { "type": "pong", "event_id": 1 } ← event_id is top-level, no wrapper object
int32 EventID = 0;
const TSharedPtr<FJsonObject>* PingEvent = nullptr;
if (Root->TryGetObjectField(TEXT("ping_event"), PingEvent) && PingEvent)
{
(*PingEvent)->TryGetNumberField(TEXT("event_id"), EventID);
}
TSharedPtr<FJsonObject> Pong = MakeShareable(new FJsonObject());
Pong->SetStringField(TEXT("type"), TEXT("pong"));
Pong->SetNumberField(TEXT("event_id"), EventID); // top-level, not nested
SendJsonMessage(Pong);
}
// ─────────────────────────────────────────────────────────────────────────────
// Helpers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj)
{
if (!WebSocket.IsValid() || !WebSocket->IsConnected())
{
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendJsonMessage: WebSocket not connected."));
return;
}
FString Out;
TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&Out);
FJsonSerializer::Serialize(JsonObj.ToSharedRef(), Writer);
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
if (Settings->bVerboseLogging)
{
UE_LOG(LogElevenLabsWS, Verbose, TEXT("<< %s"), *Out);
}
WebSocket->Send(Out);
}
FString UElevenLabsWebSocketProxy::BuildWebSocketURL(const FString& AgentIDOverride, const FString& APIKeyOverride) const
{
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
// Custom URL override takes full precedence
if (!Settings->CustomWebSocketURL.IsEmpty())
{
return Settings->CustomWebSocketURL;
}
const FString ResolvedAgentID = AgentIDOverride.IsEmpty() ? Settings->AgentID : AgentIDOverride;
if (ResolvedAgentID.IsEmpty())
{
return FString();
}
// Official ElevenLabs Conversational AI WebSocket endpoint
// wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>
return FString::Printf(
TEXT("wss://api.elevenlabs.io/v1/convai/conversation?agent_id=%s"),
*ResolvedAgentID);
}

View File

@@ -0,0 +1,50 @@
// Copyright ASTERION. All Rights Reserved.
#include "PS_AI_Agent_ElevenLabs.h"
#include "Developer/Settings/Public/ISettingsModule.h"
#include "UObject/UObjectGlobals.h"
#include "UObject/Package.h"
IMPLEMENT_MODULE(FPS_AI_Agent_ElevenLabsModule, PS_AI_Agent_ElevenLabs)
#define LOCTEXT_NAMESPACE "PS_AI_Agent_ElevenLabs"
void FPS_AI_Agent_ElevenLabsModule::StartupModule()
{
Settings = NewObject<UElevenLabsSettings>(GetTransientPackage(), "ElevenLabsSettings", RF_Standalone);
Settings->AddToRoot();
if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
{
SettingsModule->RegisterSettings(
"Project", "Plugins", "ElevenLabsAIAgent",
LOCTEXT("SettingsName", "ElevenLabs AI Agent"),
LOCTEXT("SettingsDescription", "Configure the ElevenLabs Conversational AI Agent plugin"),
Settings);
}
}
void FPS_AI_Agent_ElevenLabsModule::ShutdownModule()
{
if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
{
SettingsModule->UnregisterSettings("Project", "Plugins", "ElevenLabsAIAgent");
}
if (!GExitPurge)
{
Settings->RemoveFromRoot();
}
Settings = nullptr;
}
UElevenLabsSettings* FPS_AI_Agent_ElevenLabsModule::GetSettings() const
{
check(Settings);
return Settings;
}
#undef LOCTEXT_NAMESPACE


@@ -0,0 +1,233 @@
// Copyright ASTERION. All Rights Reserved.
#pragma once
#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "ElevenLabsDefinitions.h"
#include "ElevenLabsWebSocketProxy.h"
#include "Sound/SoundWaveProcedural.h"
#include "ElevenLabsConversationalAgentComponent.generated.h"
class UAudioComponent;
class UElevenLabsMicrophoneCaptureComponent;
// ─────────────────────────────────────────────────────────────────────────────
// Delegates exposed to Blueprint
// ─────────────────────────────────────────────────────────────────────────────
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentConnected,
const FElevenLabsConversationInfo&, ConversationInfo);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnAgentDisconnected,
int32, StatusCode, const FString&, Reason);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentError,
const FString&, ErrorMessage);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTranscript,
const FElevenLabsTranscriptSegment&, Segment);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTextResponse,
const FString&, ResponseText);
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStartedSpeaking);
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStoppedSpeaking);
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentInterrupted);
// ─────────────────────────────────────────────────────────────────────────────
// UElevenLabsConversationalAgentComponent
//
// Attach this to any Actor (e.g. a character NPC) to give it a voice powered by
// the ElevenLabs Conversational AI API.
//
// Workflow:
// 1. Set AgentID (or rely on project default).
// 2. Call StartConversation() to open the WebSocket.
// 3. Call StartListening() / StopListening() to control microphone capture.
// 4. React to events (OnAgentTranscript, OnAgentTextResponse, etc.) in Blueprint.
// 5. Call EndConversation() when done.
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(ClassGroup = "ElevenLabs", meta = (BlueprintSpawnableComponent),
DisplayName = "ElevenLabs Conversational Agent")
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsConversationalAgentComponent : public UActorComponent
{
GENERATED_BODY()
public:
UElevenLabsConversationalAgentComponent();
// ── Configuration ─────────────────────────────────────────────────────────
/**
* ElevenLabs Agent ID. Overrides the project-level default in Project Settings.
* Leave empty to use the project default.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
FString AgentID;
/**
* Turn mode:
* - Server VAD: ElevenLabs detects end-of-speech automatically (recommended).
* - Client Controlled: you call StartListening/StopListening manually (push-to-talk).
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;
/**
* Automatically start listening (microphone capture) once the WebSocket is
* connected and the conversation is initiated.
*/
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
bool bAutoStartListening = true;
// ── Events ────────────────────────────────────────────────────────────────
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentConnected OnAgentConnected;
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentDisconnected OnAgentDisconnected;
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentError OnAgentError;
/** Fired for each user speech transcript segment, tentative and final. The speaker is always "user". */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentTranscript OnAgentTranscript;
/** Final text response produced by the agent (mirrors the audio). */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentTextResponse OnAgentTextResponse;
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentStartedSpeaking OnAgentStartedSpeaking;
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentStoppedSpeaking OnAgentStoppedSpeaking;
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnAgentInterrupted OnAgentInterrupted;
// ── Control ───────────────────────────────────────────────────────────────
/**
* Open the WebSocket connection and start the conversation.
* If bAutoStartListening is true, microphone capture also starts once connected.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void StartConversation();
/** Close the WebSocket and stop all audio. */
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void EndConversation();
/**
* Start capturing microphone audio and streaming it to ElevenLabs.
* In Client turn mode, also sends a UserTurnStart signal.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void StartListening();
/**
* Stop capturing microphone audio.
* In Client turn mode, also sends a UserTurnEnd signal.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void StopListening();
/**
* Send a plain text message to the agent without using the microphone.
* The agent will respond with audio and text just as if it heard you speak.
* Useful for testing in the Editor or for text-based interaction.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendTextMessage(const FString& Text);
/** Interrupt the agent's current utterance. */
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void InterruptAgent();
// ── State queries ─────────────────────────────────────────────────────────
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
bool IsConnected() const;
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
bool IsListening() const { return bIsListening; }
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
bool IsAgentSpeaking() const { return bAgentSpeaking; }
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
const FElevenLabsConversationInfo& GetConversationInfo() const;
/** Access the underlying WebSocket proxy (advanced use). */
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
UElevenLabsWebSocketProxy* GetWebSocketProxy() const { return WebSocketProxy; }
// ─────────────────────────────────────────────────────────────────────────
// UActorComponent overrides
// ─────────────────────────────────────────────────────────────────────────
virtual void BeginPlay() override;
virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;
virtual void TickComponent(float DeltaTime, ELevelTick TickType,
FActorComponentTickFunction* ThisTickFunction) override;
private:
// ── Internal event handlers ───────────────────────────────────────────────
UFUNCTION()
void HandleConnected(const FElevenLabsConversationInfo& Info);
UFUNCTION()
void HandleDisconnected(int32 StatusCode, const FString& Reason);
UFUNCTION()
void HandleError(const FString& ErrorMessage);
UFUNCTION()
void HandleAudioReceived(const TArray<uint8>& PCMData);
UFUNCTION()
void HandleTranscript(const FElevenLabsTranscriptSegment& Segment);
UFUNCTION()
void HandleAgentResponse(const FString& ResponseText);
UFUNCTION()
void HandleInterrupted();
// ── Audio playback ────────────────────────────────────────────────────────
void InitAudioPlayback();
void EnqueueAgentAudio(const TArray<uint8>& PCMData);
void StopAgentAudio();
/** Called by USoundWaveProcedural when it needs more PCM data. */
void OnProceduralUnderflow(USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired);
// ── Microphone streaming ──────────────────────────────────────────────────
void OnMicrophoneDataCaptured(const TArray<float>& FloatPCM);
/** Convert float PCM to int16 little-endian bytes for ElevenLabs. */
static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatPCM);
// ── Sub-objects ───────────────────────────────────────────────────────────
UPROPERTY()
UElevenLabsWebSocketProxy* WebSocketProxy = nullptr;
UPROPERTY()
UAudioComponent* AudioPlaybackComponent = nullptr;
UPROPERTY()
USoundWaveProcedural* ProceduralSoundWave = nullptr;
// ── State ─────────────────────────────────────────────────────────────────
bool bIsListening = false;
bool bAgentSpeaking = false;
// Accumulates incoming PCM bytes until the audio component needs data.
TArray<uint8> AudioQueue;
FCriticalSection AudioQueueLock;
// Simple heuristic: if we haven't received audio data for this many ticks,
// consider the agent done speaking.
int32 SilentTickCount = 0;
static constexpr int32 SilenceThresholdTicks = 30; // ~0.5s at 60fps
};
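FloatPCMToInt16Bytes is declared above, but its body is not part of this diff. A plausible engine-free sketch, assuming the usual clamp-and-scale conversion to the 16-bit little-endian mono format ElevenLabs expects:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of FloatPCMToInt16Bytes: clamp each float sample to
// [-1, 1], scale to the int16 range, and emit little-endian byte pairs as
// required for the user_audio_chunk payload. The shipped code may differ.
std::vector<uint8_t> FloatPcmToInt16Bytes(const std::vector<float>& FloatPcm)
{
    std::vector<uint8_t> Out;
    Out.reserve(FloatPcm.size() * 2);
    for (float Sample : FloatPcm)
    {
        const float Clamped = std::clamp(Sample, -1.0f, 1.0f);
        const int16_t Value = static_cast<int16_t>(Clamped * 32767.0f);
        Out.push_back(static_cast<uint8_t>(Value & 0xFF));        // low byte first (LE)
        Out.push_back(static_cast<uint8_t>((Value >> 8) & 0xFF)); // high byte
    }
    return Out;
}
```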


@@ -0,0 +1,109 @@
// Copyright ASTERION. All Rights Reserved.
#pragma once
#include "CoreMinimal.h"
#include "ElevenLabsDefinitions.generated.h"
// ─────────────────────────────────────────────────────────────────────────────
// Connection state
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsConnectionState : uint8
{
Disconnected UMETA(DisplayName = "Disconnected"),
Connecting UMETA(DisplayName = "Connecting"),
Connected UMETA(DisplayName = "Connected"),
Error UMETA(DisplayName = "Error"),
};
// ─────────────────────────────────────────────────────────────────────────────
// Agent turn mode
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsTurnMode : uint8
{
/** ElevenLabs server decides when the user has finished speaking (default). */
Server UMETA(DisplayName = "Server VAD"),
/** Client explicitly signals turn start/end (manual push-to-talk). */
Client UMETA(DisplayName = "Client Controlled"),
};
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket message type helpers (internal, not exposed to Blueprint)
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsMessageType
{
// Client → Server
static const FString AudioChunk = TEXT("user_audio_chunk");
// Client turn mode: signal user is currently active/speaking
static const FString UserActivity = TEXT("user_activity");
// Client turn mode: send a text message without audio
static const FString UserMessage = TEXT("user_message");
static const FString Interrupt = TEXT("interrupt");
static const FString ClientToolResult = TEXT("client_tool_result");
static const FString ConversationClientData = TEXT("conversation_initiation_client_data");
// Server → Client
static const FString ConversationInitiation = TEXT("conversation_initiation_metadata");
static const FString AudioResponse = TEXT("audio");
// User speech-to-text transcript (speaker is always the user)
static const FString UserTranscript = TEXT("user_transcript");
static const FString AgentResponse = TEXT("agent_response");
static const FString AgentResponseCorrection = TEXT("agent_response_correction");
static const FString InterruptionEvent = TEXT("interruption");
static const FString PingEvent = TEXT("ping");
static const FString ClientToolCall = TEXT("client_tool_call");
static const FString InternalTentativeAgent = TEXT("internal_tentative_agent_response");
}
// ─────────────────────────────────────────────────────────────────────────────
// Audio format exchanged with ElevenLabs
// PCM 16-bit signed, 16000 Hz, mono, little-endian.
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsAudio
{
static constexpr int32 SampleRate = 16000;
static constexpr int32 Channels = 1;
static constexpr int32 BitsPerSample = 16;
// Chunk size sent per WebSocket frame: 100 ms of audio
static constexpr int32 ChunkSamples = SampleRate / 10; // 1600 samples = 3200 bytes
}
// ─────────────────────────────────────────────────────────────────────────────
// Conversation metadata received on successful connection
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsConversationInfo
{
GENERATED_BODY()
/** Unique ID of this conversation session assigned by ElevenLabs. */
UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
FString ConversationID;
/** Agent ID that is responding. */
UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
FString AgentID;
};
// ─────────────────────────────────────────────────────────────────────────────
// Transcript segment
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsTranscriptSegment
{
GENERATED_BODY()
/** Transcribed text. */
UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
FString Text;
/** "user" or "agent". */
UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
FString Speaker;
/** Whether this is a final transcript or a tentative (in-progress) one. */
UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
bool bIsFinal = false;
};
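The chunk-size comment in the ElevenLabsAudio namespace can be checked mechanically. A tiny standalone restatement of the constants with the arithmetic spelled out:

```cpp
#include <cstdint>

// Restating the ElevenLabsAudio constants: PCM 16-bit signed, 16 kHz, mono,
// little-endian, with 100 ms of audio per WebSocket frame.
constexpr int32_t SampleRate    = 16000;
constexpr int32_t Channels      = 1;
constexpr int32_t BitsPerSample = 16;
constexpr int32_t ChunkSamples  = SampleRate / 10; // 100 ms
constexpr int32_t ChunkBytes    = ChunkSamples * Channels * (BitsPerSample / 8);

static_assert(ChunkSamples == 1600, "100 ms at 16 kHz is 1600 samples");
static_assert(ChunkBytes == 3200, "1600 samples * 2 bytes = 3200 bytes per frame");
```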


@@ -0,0 +1,73 @@
// Copyright ASTERION. All Rights Reserved.
#pragma once
#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "AudioCapture.h"
#include "ElevenLabsMicrophoneCaptureComponent.generated.h"
// Delivers captured float PCM samples (16000 Hz mono, resampled from device rate).
DECLARE_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioCaptured, const TArray<float>& /*FloatPCM*/);
/**
* Lightweight microphone capture component.
* Captures from the default audio input device, resamples to 16000 Hz mono,
* and delivers chunks via FOnElevenLabsAudioCaptured.
*
* Modelled after Convai's ConvaiAudioCaptureComponent but stripped to the
* minimal functionality needed for the ElevenLabs Conversational AI API.
*/
UCLASS(ClassGroup = "ElevenLabs", meta = (BlueprintSpawnableComponent),
DisplayName = "ElevenLabs Microphone Capture")
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsMicrophoneCaptureComponent : public UActorComponent
{
GENERATED_BODY()
public:
UElevenLabsMicrophoneCaptureComponent();
/** Volume multiplier applied to captured samples before forwarding. */
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Microphone",
meta = (ClampMin = "0.0", ClampMax = "4.0"))
float VolumeMultiplier = 1.0f;
/**
* Delegate fired on the game thread each time a new chunk of PCM audio
* is captured. Samples are float32, resampled to 16000 Hz mono.
*/
FOnElevenLabsAudioCaptured OnAudioCaptured;
/** Open the default capture device and begin streaming audio. */
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void StartCapture();
/** Stop streaming and close the capture device. */
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void StopCapture();
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
bool IsCapturing() const { return bCapturing; }
// ─────────────────────────────────────────────────────────────────────────
// UActorComponent overrides
// ─────────────────────────────────────────────────────────────────────────
virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;
private:
/** Called by the audio capture callback on a background thread. Raw void* per UE 5.3+ API. */
void OnAudioGenerate(const void* InAudio, int32 NumFrames,
int32 InNumChannels, int32 InSampleRate, double StreamTime, bool bOverflow);
/** Simple linear resample from InSampleRate to 16000 Hz. Input is float32 frames. */
static TArray<float> ResampleTo16000(const float* InAudio, int32 NumFrames,
int32 InChannels, int32 InSampleRate);
Audio::FAudioCapture AudioCapture;
Audio::FAudioCaptureDeviceParams DeviceParams;
bool bCapturing = false;
// Device sample rate discovered on StartCapture
int32 DeviceSampleRate = 44100;
int32 DeviceChannels = 1;
};
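ResampleTo16000 is likewise only declared here. A minimal engine-free sketch of a linear resampler matching the description (downmix interleaved frames to mono, then linearly interpolate to 16 kHz); the shipped implementation may differ:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of ResampleTo16000: average interleaved channels into
// mono, then linearly interpolate from the device rate down (or up) to 16 kHz.
std::vector<float> ResampleTo16000(const float* In, int32_t NumFrames,
                                   int32_t Channels, int32_t InRate)
{
    // Downmix to mono by averaging channels per frame.
    std::vector<float> Mono(NumFrames);
    for (int32_t f = 0; f < NumFrames; ++f)
    {
        float Sum = 0.0f;
        for (int32_t c = 0; c < Channels; ++c)
        {
            Sum += In[f * Channels + c];
        }
        Mono[f] = Sum / Channels;
    }
    // Linear interpolation to 16000 Hz.
    const int32_t OutFrames = static_cast<int32_t>(
        static_cast<int64_t>(NumFrames) * 16000 / InRate);
    std::vector<float> Out(OutFrames);
    const double Step = static_cast<double>(InRate) / 16000.0;
    for (int32_t i = 0; i < OutFrames; ++i)
    {
        const double Pos = i * Step;
        const int32_t i0 = static_cast<int32_t>(Pos);
        const int32_t i1 = (i0 + 1 < NumFrames) ? i0 + 1 : i0;
        const float Frac = static_cast<float>(Pos - i0);
        Out[i] = Mono[i0] * (1.0f - Frac) + Mono[i1] * Frac;
    }
    return Out;
}
```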


@@ -0,0 +1,186 @@
// Copyright ASTERION. All Rights Reserved.
#pragma once
#include "CoreMinimal.h"
#include "UObject/NoExportTypes.h"
#include "ElevenLabsDefinitions.h"
#include "IWebSocket.h"
#include "ElevenLabsWebSocketProxy.generated.h"
// ─────────────────────────────────────────────────────────────────────────────
// Delegates (all Blueprint-assignable)
// ─────────────────────────────────────────────────────────────────────────────
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsConnected,
const FElevenLabsConversationInfo&, ConversationInfo);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnElevenLabsDisconnected,
int32, StatusCode, const FString&, Reason);
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsError,
const FString&, ErrorMessage);
/** Fired when a PCM audio chunk arrives from the agent. Raw bytes, 16-bit signed 16kHz mono. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioReceived,
const TArray<uint8>&, PCMData);
/** Fired for user speech transcript segments (the speaker is always "user"). */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsTranscript,
const FElevenLabsTranscriptSegment&, Segment);
/** Fired with the final text response from the agent. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAgentResponse,
const FString&, ResponseText);
/** Fired when the agent interrupts the user. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnElevenLabsInterrupted);
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Proxy
// Manages the lifecycle of a single ElevenLabs Conversational AI WebSocket session.
// Instantiate via UElevenLabsConversationalAgentComponent (the component manages
// one proxy at a time), or create manually through Blueprints.
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(BlueprintType, Blueprintable)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsWebSocketProxy : public UObject
{
GENERATED_BODY()
public:
// ── Events ────────────────────────────────────────────────────────────────
/** Called once the WebSocket handshake succeeds and the agent sends its initiation metadata. */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsConnected OnConnected;
/** Called when the WebSocket closes (graceful or remote). */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsDisconnected OnDisconnected;
/** Called on any connection or protocol error. */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsError OnError;
/** Raw PCM audio coming from the agent — feed this into your audio component. */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsAudioReceived OnAudioReceived;
/** User speech transcript (speaker is always "user"; may be tentative while speech is in progress). */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsTranscript OnTranscript;
/** Final text response from the agent (complements audio). */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsAgentResponse OnAgentResponse;
/** The agent was interrupted by new user speech. */
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
FOnElevenLabsInterrupted OnInterrupted;
// ── Lifecycle ─────────────────────────────────────────────────────────────
/**
* Open a WebSocket connection to ElevenLabs.
* Uses settings from Project Settings unless overridden by the parameters.
*
* @param AgentID ElevenLabs agent ID. Overrides the project-level default when non-empty.
* @param APIKey API key. Overrides the project-level default when non-empty.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void Connect(const FString& AgentID = TEXT(""), const FString& APIKey = TEXT(""));
/**
* Gracefully close the WebSocket connection.
* OnDisconnected will fire after the server acknowledges.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void Disconnect();
/** Current connection state. */
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
EElevenLabsConnectionState GetConnectionState() const { return ConnectionState; }
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
bool IsConnected() const { return ConnectionState == EElevenLabsConnectionState::Connected; }
// ── Audio sending ─────────────────────────────────────────────────────────
/**
* Send a chunk of raw PCM audio to ElevenLabs.
* Audio must be 16-bit signed, 16000 Hz, mono, little-endian.
* The data is Base64-encoded and sent as a JSON message.
* Call this repeatedly while the microphone is capturing.
*
* @param PCMData Raw PCM bytes (16-bit LE, 16kHz, mono).
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendAudioChunk(const TArray<uint8>& PCMData);
// ── Turn control (only relevant in Client turn mode) ──────────────────────
/**
* Signal that the user is actively speaking (Client turn mode).
* Sends a { "type": "user_activity" } message to the server.
* Call this periodically while the user is speaking (e.g. every audio chunk).
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendUserTurnStart();
/**
* Signal that the user has finished speaking (Client turn mode).
* No explicit API message is sent; the client simply stops sending user_activity.
* The server detects silence and hands the turn to the agent.
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendUserTurnEnd();
/**
* Send a text message to the agent (no microphone needed).
* Useful for testing or text-only interaction.
* Sends: { "type": "user_message", "text": "..." }
*/
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendTextMessage(const FString& Text);
/** Ask the agent to stop the current utterance. */
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
void SendInterrupt();
// ── Info ──────────────────────────────────────────────────────────────────
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
const FElevenLabsConversationInfo& GetConversationInfo() const { return ConversationInfo; }
// ─────────────────────────────────────────────────────────────────────────
// Internal
// ─────────────────────────────────────────────────────────────────────────
private:
void OnWsConnected();
void OnWsConnectionError(const FString& Error);
void OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean);
void OnWsMessage(const FString& Message);
void OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining);
void HandleConversationInitiation(const TSharedPtr<FJsonObject>& Payload);
void HandleAudioResponse(const TSharedPtr<FJsonObject>& Payload);
void HandleTranscript(const TSharedPtr<FJsonObject>& Payload);
void HandleAgentResponse(const TSharedPtr<FJsonObject>& Payload);
void HandleInterruption(const TSharedPtr<FJsonObject>& Payload);
void HandlePing(const TSharedPtr<FJsonObject>& Payload);
/** Build and send a JSON text frame to the server. */
void SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj);
/** Resolve the WebSocket URL from settings / parameters. */
FString BuildWebSocketURL(const FString& AgentID, const FString& APIKey) const;
TSharedPtr<IWebSocket> WebSocket;
EElevenLabsConnectionState ConnectionState = EElevenLabsConnectionState::Disconnected;
FElevenLabsConversationInfo ConversationInfo;
// Accumulation buffer for multi-fragment binary WebSocket frames.
// ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
TArray<uint8> BinaryFrameBuffer;
};
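For reference, the client-to-server payloads implied by the doc comments above, written out as raw JSON string constants. The exact field layout of the audio-chunk message (a single top-level user_audio_chunk key) and the literal values here are assumptions based on the message-type constants and the v1.1.0 changelog notes, not copied from this diff:

```cpp
#include <string>

// Assumed payload shapes; field names come from ElevenLabsMessageType and
// the SendAudioChunk / SendTextMessage doc comments, values are placeholders.
const std::string UserAudioChunkMsg =
    R"({"user_audio_chunk":"<base64 of 16-bit LE 16 kHz mono PCM>"})";
const std::string UserMessageMsg =
    R"({"type":"user_message","text":"Hello there"})";
const std::string UserActivityMsg =
    R"({"type":"user_activity"})";
// Pong carries the event_id of the server ping at the top level
// (the "corrected pong format" noted in the v1.1.0 changelog).
const std::string PongMsg =
    R"({"type":"pong","event_id":123})";
```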


@@ -0,0 +1,99 @@
// Copyright ASTERION. All Rights Reserved.
#pragma once
#include "CoreMinimal.h"
#include "Modules/ModuleManager.h"
#include "PS_AI_Agent_ElevenLabs.generated.h"
// ─────────────────────────────────────────────────────────────────────────────
// Settings object exposed in Project Settings → Plugins → ElevenLabs AI Agent
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(config = Engine, defaultconfig)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsSettings : public UObject
{
GENERATED_BODY()
public:
UElevenLabsSettings(const FObjectInitializer& ObjectInitializer)
: Super(ObjectInitializer)
{
API_Key = TEXT("");
AgentID = TEXT("");
bSignedURLMode = false;
}
/**
* ElevenLabs API key.
* Obtain it from https://elevenlabs.io; it is used to authenticate WebSocket connections.
* Keep it secret; do not hard-code the key in shipping builds.
*/
UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
FString API_Key;
/**
* The default ElevenLabs Conversational Agent ID to use when none is specified
* on the component. Create agents at https://elevenlabs.io/app/conversational-ai
*/
UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
FString AgentID;
/**
* When true, the plugin fetches a signed WebSocket URL from your own backend
* before connecting, so the API key is never exposed in the client.
* Set SignedURLEndpoint to point to your server that returns the signed URL.
*/
UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API | Security")
bool bSignedURLMode;
/**
* Your backend endpoint that returns a signed WebSocket URL for ElevenLabs.
* Only used when bSignedURLMode = true.
* Expected response body: { "signed_url": "wss://..." }
*/
UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API | Security",
meta = (EditCondition = "bSignedURLMode"))
FString SignedURLEndpoint;
/**
* Override the ElevenLabs WebSocket base URL. Leave empty to use the default:
* wss://api.elevenlabs.io/v1/convai/conversation
*/
UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
FString CustomWebSocketURL;
/** Log verbose WebSocket messages to the Output Log (useful during development). */
UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
bool bVerboseLogging = false;
};
// ─────────────────────────────────────────────────────────────────────────────
// Module
// ─────────────────────────────────────────────────────────────────────────────
class PS_AI_AGENT_ELEVENLABS_API FPS_AI_Agent_ElevenLabsModule : public IModuleInterface
{
public:
/** IModuleInterface implementation */
virtual void StartupModule() override;
virtual void ShutdownModule() override;
virtual bool IsGameModule() const override { return true; }
/** Singleton access */
static inline FPS_AI_Agent_ElevenLabsModule& Get()
{
return FModuleManager::LoadModuleChecked<FPS_AI_Agent_ElevenLabsModule>("PS_AI_Agent_ElevenLabs");
}
static inline bool IsAvailable()
{
return FModuleManager::Get().IsModuleLoaded("PS_AI_Agent_ElevenLabs");
}
/** Access the settings object at runtime */
UElevenLabsSettings* GetSettings() const;
private:
UElevenLabsSettings* Settings = nullptr;
};
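When bSignedURLMode is enabled, the backend is expected to return { "signed_url": "wss://..." }. A minimal engine-free sketch of pulling that field out of the response body; the real plugin would presumably use FJsonSerializer, and ExtractSignedUrl is an illustrative name:

```cpp
#include <string>

// Naive extraction of the "signed_url" value from the documented response
// body {"signed_url":"wss://..."}. Illustrative only: a production path
// should parse the JSON properly (e.g. with FJsonSerializer in UE).
std::string ExtractSignedUrl(const std::string& Body)
{
    const std::string Key = "\"signed_url\"";
    const size_t KeyPos = Body.find(Key);
    if (KeyPos == std::string::npos) return {};
    // Opening quote of the value, after the ':' separator.
    const size_t Start = Body.find('"', KeyPos + Key.size());
    if (Start == std::string::npos) return {};
    const size_t End = Body.find('"', Start + 1);
    if (End == std::string::npos) return {};
    return Body.substr(Start + 1, End - Start - 1);
}
```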