Compare commits
17 Commits
61710c9fde ... 302337b573

| SHA1 |
|---|
| 302337b573 |
| 99017f4067 |
| e464cfe288 |
| 483456728d |
| 669c503d06 |
| b489d1174c |
| ae2c9b92e8 |
| dbd61615a9 |
| bbeb4294a8 |
| 2bb503ae40 |
| 1b7202603f |
| c833ccd66d |
| 3b98edcf92 |
| bb1a857e86 |
| b114ab063d |
| 4d6ae103db |
| f0055e85ed |
75
.claude/MEMORY.md
Normal file
@@ -0,0 +1,75 @@
# Project Memory – PS_AI_Agent

> This file is committed to the repository so it is available on any machine.
> Claude Code reads it automatically at session start (via the auto-memory system)
> when the working directory is inside this repo.
> **Keep it under ~180 lines** – lines beyond 200 are truncated by the system.

---

## Project Location

- Repo root: `<repo_root>/` (wherever this is cloned)
- UE5 project: `<repo_root>/Unreal/PS_AI_Agent/`
- `.uproject`: `<repo_root>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject`
- Engine: **Unreal Engine 5.5** — Win64 primary target
- Default test map: `/Game/TestMap.TestMap`

## Plugins

| Plugin | Path | Purpose |
|--------|------|---------|
| Convai (reference) | `<repo_root>/ConvAI/Convai/` | gRPC + protobuf streaming to the Convai API. Has an ElevenLabs voice type enum in `ConvaiDefinitions.h`. Used as an architectural reference. |
| **PS_AI_Agent_ElevenLabs** | `<repo_root>/Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/` | Our ElevenLabs Conversational AI integration. See `.claude/elevenlabs_plugin.md` for full details. |

## User Preferences

- Plugin naming: `PS_AI_Agent_<Service>` (e.g. `PS_AI_Agent_ElevenLabs`)
- Save memory frequently during long sessions
- Goal: ElevenLabs Conversational AI integration — simpler than Convai, no gRPC
- Full original ask + intent: see `.claude/project_context.md`
- Git remote is a **private server** — no public exposure risk

## Key UE5 Plugin Patterns

- Settings object: `UCLASS(config=Engine, defaultconfig)` inheriting `UObject`, registered via `ISettingsModule`
- Module startup: `NewObject<USettings>(..., RF_Standalone)` + `AddToRoot()`
- WebSocket: `FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), Headers)`
- `WebSockets` is a **module** (Build.cs only) — NOT a plugin, don't put it in `.uplugin`
- Audio capture: `Audio::FAudioCapture::OpenAudioCaptureStream()` (UE 5.3+, replaces the deprecated `OpenCaptureStream`)
- `AudioCapture` IS a plugin — declare it in the `.uplugin` Plugins array
- Callback type: `FOnAudioCaptureFunction` = `TFunction<void(const void*, int32, int32, int32, double, bool)>`
- Cast `const void*` to `const float*` inside — the device sends interleaved float32
- Procedural audio playback: `USoundWaveProcedural` + `OnSoundWaveProceduralUnderflow` delegate
- Audio capture callbacks arrive on a **background thread** — always marshal to the game thread with `AsyncTask(ENamedThreads::GameThread, ...)`
- Resample mic audio to **16000 Hz mono** before sending to ElevenLabs
- `TArray::RemoveAt(idx, count, EAllowShrinking::No)` — the bool overload is deprecated in UE 5.5
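The downmix / resample / float-to-int16 patterns above can be sketched outside UE as plain standard C++. These helper names (`DownmixToMono`, `ResampleLinear`, `FloatToInt16`) are illustrative, not the plugin's actual functions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Downmix interleaved float32 frames to mono by averaging the channels.
static std::vector<float> DownmixToMono(const float* Samples, size_t NumFrames, int NumChannels)
{
    std::vector<float> Mono(NumFrames);
    for (size_t f = 0; f < NumFrames; ++f)
    {
        float Sum = 0.0f;
        for (int c = 0; c < NumChannels; ++c)
            Sum += Samples[f * NumChannels + c];
        Mono[f] = Sum / NumChannels;
    }
    return Mono;
}

// Linear-interpolation resample to a target rate (e.g. 16000 Hz).
static std::vector<float> ResampleLinear(const std::vector<float>& In, int SrcRate, int DstRate)
{
    size_t OutLen = In.size() * DstRate / SrcRate;
    std::vector<float> Out(OutLen);
    for (size_t i = 0; i < OutLen; ++i)
    {
        double SrcPos = double(i) * SrcRate / DstRate;
        size_t i0 = size_t(SrcPos);
        size_t i1 = (i0 + 1 < In.size()) ? i0 + 1 : i0;
        double Frac = SrcPos - double(i0);
        Out[i] = float(In[i0] * (1.0 - Frac) + In[i1] * Frac);
    }
    return Out;
}

// Convert float32 in [-1, 1] (clamped) to int16 PCM.
static std::vector<int16_t> FloatToInt16(const std::vector<float>& In)
{
    std::vector<int16_t> Out(In.size());
    for (size_t i = 0; i < In.size(); ++i)
    {
        float s = In[i] < -1.0f ? -1.0f : (In[i] > 1.0f ? 1.0f : In[i]);
        Out[i] = int16_t(s * 32767.0f);
    }
    return Out;
}
```

The real capture callback does the same work on the interleaved float32 buffer it receives, before marshaling the result to the game thread.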

## Plugin Status

- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- Connection confirmed working end-to-end; audio receive path functional

## ElevenLabs WebSocket Protocol Notes

- **ALL frames are binary** — `OnRawMessage` handles everything; `OnMessage` (text) never fires
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode: `{"type":"user_activity"}` to signal speaking; no explicit end message
- Text input: `{"type":"user_message","text":"..."}` — the agent replies with audio + text
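Two of the notes above — the first-byte discrimination and the top-level pong field — reduce to a few lines each. A plain-C++ sketch with illustrative names, not the plugin's actual code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class FrameKind { JsonControl, PcmAudio };

// Classify a reassembled binary WebSocket frame by its first byte:
// '{' (0x7B) means a UTF-8 JSON control message; anything else is raw PCM.
static FrameKind ClassifyFrame(const std::vector<uint8_t>& Frame)
{
    if (!Frame.empty() && Frame[0] == 0x7B)
        return FrameKind::JsonControl;
    return FrameKind::PcmAudio;
}

// Build the pong reply: event_id is a top-level field, not nested.
static std::string MakePongMessage(long long EventId)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}
```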

## API Keys / Secrets

- The ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **The key is stripped from `DefaultEngine.ini` before every commit** — do not commit it
- Each developer sets the key locally; it does not go in git

## Claude Memory Files in This Repo

| File | Contents |
|------|----------|
| `.claude/MEMORY.md` | This file — project structure, patterns, status |
| `.claude/elevenlabs_plugin.md` | Plugin file map, ElevenLabs WS protocol, design decisions |
| `.claude/elevenlabs_api_reference.md` | Full ElevenLabs API reference (WS messages, REST, signed URL, Agent ID location) |
| `.claude/project_context.md` | Original ask, intent, short/long-term goals |
| `.claude/session_log_2026-02-19.md` | Full session record: steps, commits, technical decisions, next steps |
| `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` | User-facing Markdown reference doc |
619
.claude/PS_AI_Agent_ElevenLabs_Documentation.md
Normal file
@@ -0,0 +1,619 @@
# PS_AI_Agent_ElevenLabs — Plugin Documentation

**Engine**: Unreal Engine 5.5
**Plugin version**: 1.1.0
**Status**: Beta — tested on UE 5.5 Win64, verified connection and audio pipeline
**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/eleven-agents/quickstart)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Installation](#2-installation)
3. [Project Settings](#3-project-settings)
4. [Quick Start (Blueprint)](#4-quick-start-blueprint)
5. [Quick Start (C++)](#5-quick-start-c)
6. [Components Reference](#6-components-reference)
   - [UElevenLabsConversationalAgentComponent](#uelevenlabsconversationalagentcomponent)
   - [UElevenLabsMicrophoneCaptureComponent](#uelevenlabsmicrophonecapturecomponent)
   - [UElevenLabsWebSocketProxy](#uelevenlabswebsocketproxy)
7. [Data Types Reference](#7-data-types-reference)
8. [Turn Modes](#8-turn-modes)
9. [Security — Signed URL Mode](#9-security--signed-url-mode)
10. [Audio Pipeline](#10-audio-pipeline)
11. [Common Patterns](#11-common-patterns)
12. [Troubleshooting](#12-troubleshooting)
13. [Changelog](#13-changelog)

---

## 1. Overview

This plugin integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5, enabling real-time voice conversations between a player and an NPC (or any Actor).

### How it works

```
Player microphone
        │
        ▼
UElevenLabsMicrophoneCaptureComponent
  • Captures from the default audio device
  • Resamples to 16 kHz mono float32
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Converts float32 → int16 PCM bytes
  • Base64-encodes and sends via WebSocket
        │ (wss://api.elevenlabs.io/v1/convai/conversation)
        ▼
ElevenLabs Conversational AI Agent
  • Transcribes speech
  • Runs LLM
  • Synthesizes voice (ElevenLabs TTS)
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Receives raw binary PCM audio frames
  • Feeds USoundWaveProcedural → UAudioComponent
        │
        ▼
Agent voice plays from the Actor's position in the world
```

### Key properties

- No gRPC, no third-party libraries — uses UE's built-in `WebSockets` and `AudioCapture` modules
- Blueprint-first: all events and controls are exposed to Blueprint
- Real-time bidirectional: audio streams in both directions simultaneously
- Server VAD (default) or push-to-talk
- Text input supported (no microphone needed for testing)

### Wire frame protocol notes

ElevenLabs sends **all WebSocket frames as binary** (not text frames). The plugin handles two binary frame types automatically:

- **JSON control frames** (start with `{`) — conversation init, transcripts, agent responses, ping/pong
- **Raw PCM audio frames** — agent speech audio, played directly via `USoundWaveProcedural`

---

## 2. Installation

The plugin lives inside the project, not the engine, so no separate install is needed.

### Verify it is enabled

Open `Unreal/PS_AI_Agent/PS_AI_Agent.uproject` and confirm:

```json
{
  "Name": "PS_AI_Agent_ElevenLabs",
  "Enabled": true
}
```

### First compile

Open the project in the UE 5.5 Editor. It will detect the new plugin and ask to recompile — click **Yes**. Alternatively, compile from the command line (one line):

```
"C:\Program Files\Epic Games\UE_5.5\Engine\Build\BatchFiles\Build.bat" PS_AI_AgentEditor Win64 Development "<repo>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject" -WaitMutex
```

---

## 3. Project Settings

Go to **Edit → Project Settings → Plugins → ElevenLabs AI Agent**.

| Setting | Description | Required |
|---|---|---|
| **API Key** | Your ElevenLabs API key. Find it at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | Yes (unless using Signed URL Mode or a public agent) |
| **Agent ID** | Default agent ID. Find it in the URL when editing an agent: `elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>` | Yes (unless set per-component) |
| **Signed URL Mode** | Fetch the WS URL from your own backend (keeps the key off the client). See [Section 9](#9-security--signed-url-mode) | No |
| **Signed URL Endpoint** | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode = true |
| **Custom WebSocket URL** | Override the default `wss://api.elevenlabs.io/...` endpoint (debug only) | No |
| **Verbose Logging** | Log every WebSocket frame type and its first bytes to the Output Log | No |

> **Security note**: The API key set in Project Settings is saved to `DefaultEngine.ini`. **Never commit this file with the key in it** — strip the `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]` section before committing. Use Signed URL Mode for production builds.

> **Finding your Agent ID**: Go to [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai), click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.

---

## 4. Quick Start (Blueprint)

### Step 1 — Add the component to an NPC

1. Open your NPC Blueprint (or any Actor Blueprint).
2. In the **Components** panel, click **Add** → search for **ElevenLabs Conversational Agent**.
3. Select the component. In the **Details** panel you can optionally set a specific **Agent ID** (overrides the project default).

### Step 2 — Set Turn Mode

In the component's **Details** panel:

- **Server VAD** (default): ElevenLabs automatically detects when the player stops speaking. The microphone streams continuously once connected.
- **Client Controlled**: You call `Start Listening` / `Stop Listening` manually (push-to-talk).

### Step 3 — Wire up events in the Event Graph

```
Event BeginPlay
  └─► [ElevenLabs Agent] Start Conversation

[ElevenLabs Agent] On Agent Connected
  └─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID

[ElevenLabs Agent] On Agent Text Response
  └─► Set Text (UI widget) ← Response Text

[ElevenLabs Agent] On Agent Transcript
  └─► (optional) display live subtitles ← Segment → Text

[ElevenLabs Agent] On Agent Started Speaking
  └─► Play talking animation on NPC

[ElevenLabs Agent] On Agent Stopped Speaking
  └─► Return to idle animation

[ElevenLabs Agent] On Agent Error
  └─► Print String "Error: " + Error Message

Event EndPlay
  └─► [ElevenLabs Agent] End Conversation
```

### Step 4 — Push-to-talk (Client Controlled mode only)

```
Input Action "Talk" (Pressed)
  └─► [ElevenLabs Agent] Start Listening

Input Action "Talk" (Released)
  └─► [ElevenLabs Agent] Stop Listening
```

### Step 5 — Testing without a microphone

Once connected, use **Send Text Message** instead of speaking:

```
[ElevenLabs Agent] On Agent Connected
  └─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"
```

The agent will reply with audio and text exactly as if it heard you speak.

---

## 5. Quick Start (C++)

### 1. Add the plugin to your module's Build.cs

```csharp
PrivateDependencyModuleNames.Add("PS_AI_Agent_ElevenLabs");
```

### 2. Include and use

```cpp
#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsDefinitions.h"

// In your Actor's header:
UPROPERTY(VisibleAnywhere)
UElevenLabsConversationalAgentComponent* ElevenLabsAgent;

// In the constructor:
ElevenLabsAgent = CreateDefaultSubobject<UElevenLabsConversationalAgentComponent>(
    TEXT("ElevenLabsAgent"));

// Override Agent ID at runtime (optional):
ElevenLabsAgent->AgentID = TEXT("your_agent_id_here");
ElevenLabsAgent->TurnMode = EElevenLabsTurnMode::Server;
ElevenLabsAgent->bAutoStartListening = true;

// Bind events:
ElevenLabsAgent->OnAgentConnected.AddDynamic(
    this, &AMyNPC::HandleAgentConnected);
ElevenLabsAgent->OnAgentTextResponse.AddDynamic(
    this, &AMyNPC::HandleAgentResponse);
ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
    this, &AMyNPC::PlayTalkingAnimation);

// Start the conversation:
ElevenLabsAgent->StartConversation();

// Send a text message (useful for testing without a mic):
ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));

// Later, to end:
ElevenLabsAgent->EndConversation();
```

### 3. Callback signatures

```cpp
UFUNCTION()
void HandleAgentConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogTemp, Log, TEXT("Connected, ConvID=%s"), *Info.ConversationID);
}

UFUNCTION()
void HandleAgentResponse(const FString& ResponseText)
{
    // Display in UI, drive subtitles, etc.
}

UFUNCTION()
void PlayTalkingAnimation()
{
    // Switch to the talking anim montage
}
```

---

## 6. Components Reference

### UElevenLabsConversationalAgentComponent

The **main component** — attach this to any Actor that should be able to speak.

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `AgentID` | `FString` | `""` | Agent ID for this actor. Overrides the project-level default when non-empty. |
| `TurnMode` | `EElevenLabsTurnMode` | `Server` | How speaker turns are detected. See [Section 8](#8-turn-modes). |
| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is connected and ready. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once `OnAgentConnected` fires. |
| `EndConversation()` | Callable | Closes the WebSocket, stops the mic, stops audio playback. |
| `StartListening()` | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends `user_activity`. |
| `StopListening()` | Callable | Stops microphone capture. In Client mode, stops sending `user_activity`. |
| `SendTextMessage(Text)` | Callable | Sends a text message to the agent without using the microphone. The agent replies with full audio + text. Useful for testing. |
| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| `IsConnected()` | Pure | Returns true if the WebSocket is open and the conversation is active. |
| `IsListening()` | Pure | Returns true if the microphone is currently capturing. |
| `IsAgentSpeaking()` | Pure | Returns true if agent audio is currently playing. |
| `GetConversationInfo()` | Pure | Returns `FElevenLabsConversationInfo` (ConversationID, AgentID). |
| `GetWebSocketProxy()` | Pure | Returns the underlying `UElevenLabsWebSocketProxy` for advanced use. |

#### Events

| Event | Parameters | Fired when |
|---|---|---|
| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation metadata received. Safe to call `SendTextMessage` here. |
| `OnAgentDisconnected` | `int32 StatusCode`, `FString Reason` | WebSocket closed (graceful or remote). |
| `OnAgentError` | `FString ErrorMessage` | Connection or protocol error. |
| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | User speech-to-text transcript received (the speaker is always `"user"`). |
| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (mirrors the audio). |
| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent (audio playback begins). |
| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (heuristic — agent done speaking). |
| `OnAgentInterrupted` | — | Agent speech was interrupted (by the user or by `InterruptAgent()`). |

---

### UElevenLabsMicrophoneCaptureComponent

A lightweight microphone capture component. Managed automatically by `UElevenLabsConversationalAgentComponent` — you only need to use this directly for advanced scenarios (e.g. custom audio routing).

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples before resampling. Range: 0.0 – 4.0. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartCapture()` | Callable | Opens the default audio input device and begins streaming. |
| `StopCapture()` | Callable | Stops streaming and closes the device. |
| `IsCapturing()` | Pure | True while actively capturing. |

#### Delegate

`OnAudioCaptured` — fires on the **game thread** with `TArray<float>` PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.

---

### UElevenLabsWebSocketProxy

Low-level WebSocket session manager. Used internally by `UElevenLabsConversationalAgentComponent`. Use this directly only if you need fine-grained protocol control.

**Inherits from**: `UObject`
**Instantiate via**: `NewObject<UElevenLabsWebSocketProxy>(Outer)`

#### Key functions

| Function | Description |
|---|---|
| `Connect(AgentID, APIKey)` | Open the WS connection. Parameters override project settings when non-empty. |
| `Disconnect()` | Send a close frame and tear down the connection. |
| `SendAudioChunk(PCMData)` | Send raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
| `SendTextMessage(Text)` | Send `{"type":"user_message","text":"..."}`. The agent replies as if it heard speech. |
| `SendUserTurnStart()` | Client turn mode: sends `{"type":"user_activity"}` to signal the user is speaking. |
| `SendUserTurnEnd()` | Client turn mode: stops sending `user_activity` (no explicit message — the server detects silence). |
| `SendInterrupt()` | Ask the agent to stop speaking: sends `{"type":"interrupt"}`. |
| `GetConnectionState()` | Returns `EElevenLabsConnectionState`. |
| `GetConversationInfo()` | Returns `FElevenLabsConversationInfo`. |

---

## 7. Data Types Reference

### EElevenLabsConnectionState

```
Disconnected — No active connection
Connecting   — WebSocket handshake in progress / awaiting conversation_initiation_metadata
Connected    — Conversation active and ready (fires OnAgentConnected)
Error        — Connection or protocol failure
```

> Note: The state remains `Connecting` until the server sends `conversation_initiation_metadata`. `OnAgentConnected` fires on the transition to `Connected`.

### EElevenLabsTurnMode

```
Server — ElevenLabs Voice Activity Detection decides when the user stops speaking (recommended)
Client — Your code calls StartListening/StopListening to define turns (push-to-talk)
```

### FElevenLabsConversationInfo

```
ConversationID  FString — Unique session ID assigned by ElevenLabs
AgentID         FString — The agent ID for this session
```

### FElevenLabsTranscriptSegment

```
Text      FString — Transcribed text
Speaker   FString — "user" (agent text comes via OnAgentTextResponse, not transcript)
bIsFinal  bool    — Always true for user transcripts (ElevenLabs sends final only)
```

---

## 8. Turn Modes

### Server VAD (default)

ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.

**When to use**: Casual conversation, hands-free interaction, natural dialogue.

```
StartConversation() → mic streams continuously (if bAutoStartListening = true)
ElevenLabs detects speech / silence automatically
Agent replies when it detects end-of-speech
```

### Client Controlled (push-to-talk)

Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`. The plugin sends `{"type":"user_activity"}` while the user is speaking; stopping it signals the end of the turn.

**When to use**: Noisy environments, precise control, walkie-talkie style UI.

```
Input Pressed  → StartListening() → streams audio + sends user_activity
Input Released → StopListening()  → stops audio (no explicit end message)
Server detects silence and hands the turn to the agent
```

---

## 9. Security — Signed URL Mode

By default, the API key is stored in Project Settings (`DefaultEngine.ini`). This is fine for development but **should not be shipped in packaged builds**, as the key could be extracted.

### Production setup

1. Enable **Signed URL Mode** in Project Settings.
2. Set **Signed URL Endpoint** to a URL on your own backend (e.g. `https://your-server.com/api/elevenlabs-token`).
3. Your backend authenticates the player and calls the ElevenLabs API to generate a signed WebSocket URL, returning:
   ```json
   { "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..." }
   ```
4. The plugin fetches this URL before connecting — the API key never leaves your server.

### Development workflow (API key in project settings)

- Set the key in **Project Settings → Plugins → ElevenLabs AI Agent**
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **Strip this section from `DefaultEngine.ini` before every git commit**
- Each developer sets the key locally — it does not go in version control

---

## 10. Audio Pipeline

### Input (player → agent)

```
Device (any sample rate, any channels)
 ↓ FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
 ↓ Callback: const void* → cast to float32 interleaved frames
 ↓ Downmix to mono (average all channels)
 ↓ Resample to 16000 Hz (linear interpolation)
 ↓ Apply VolumeMultiplier
 ↓ Dispatch to Game Thread (AsyncTask)
 ↓ Convert float32 → int16 signed, little-endian bytes
 ↓ Base64 encode
 ↓ Send as binary WebSocket frame: { "user_audio_chunk": "<base64>" }
```
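The last two input steps — Base64-encoding the PCM bytes and wrapping them in the `user_audio_chunk` envelope — can be sketched in plain C++. This is a minimal standard Base64 encoder; the function names are illustrative, not the plugin's own (UE code would use `FBase64::Encode` instead):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal Base64 encoder (standard alphabet, '=' padding).
static std::string Base64Encode(const std::vector<uint8_t>& Data)
{
    static const char* Table =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string Out;
    size_t i = 0;
    for (; i + 3 <= Data.size(); i += 3)          // full 3-byte groups
    {
        uint32_t v = (Data[i] << 16) | (Data[i + 1] << 8) | Data[i + 2];
        Out += Table[(v >> 18) & 63];
        Out += Table[(v >> 12) & 63];
        Out += Table[(v >> 6) & 63];
        Out += Table[v & 63];
    }
    if (i + 1 == Data.size())                     // 1 trailing byte
    {
        uint32_t v = Data[i] << 16;
        Out += Table[(v >> 18) & 63];
        Out += Table[(v >> 12) & 63];
        Out += "==";
    }
    else if (i + 2 == Data.size())                // 2 trailing bytes
    {
        uint32_t v = (Data[i] << 16) | (Data[i + 1] << 8);
        Out += Table[(v >> 18) & 63];
        Out += Table[(v >> 12) & 63];
        Out += Table[(v >> 6) & 63];
        Out += '=';
    }
    return Out;
}

// Wrap int16 LE PCM bytes in the user_audio_chunk JSON envelope.
static std::string MakeAudioChunkMessage(const std::vector<uint8_t>& PcmBytes)
{
    return std::string("{\"user_audio_chunk\":\"") + Base64Encode(PcmBytes) + "\"}";
}
```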

### Output (agent → player)

```
Binary WebSocket frame arrives
 ↓ Peek first byte:
    • '{'   → UTF-8 JSON: parse type field, dispatch to handler
    • other → raw PCM audio bytes
 ↓ [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
 ↓ Enqueue in thread-safe AudioQueue (FCriticalSection)
 ↓ USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from the queue
 ↓ UAudioComponent plays from the Actor's world position (3D spatialized)
```
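The thread-safe queue in the middle of this path can be sketched with standard C++ primitives — `std::mutex` standing in for `FCriticalSection`. The class name and interface are illustrative, not the plugin's actual types:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Thread-safe byte queue: the WebSocket thread pushes PCM chunks,
// the audio render callback pops as many bytes as it needs.
class AudioByteQueue
{
public:
    void Push(const std::vector<uint8_t>& Chunk)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        Bytes.insert(Bytes.end(), Chunk.begin(), Chunk.end());
    }

    // Pop up to MaxBytes; returns fewer on underflow (caller pads with silence).
    std::vector<uint8_t> Pop(size_t MaxBytes)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        size_t N = std::min(MaxBytes, Bytes.size());
        std::vector<uint8_t> Out(Bytes.begin(), Bytes.begin() + N);
        Bytes.erase(Bytes.begin(), Bytes.begin() + N);
        return Out;
    }

    bool IsEmpty()
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        return Bytes.empty();
    }

private:
    std::mutex Mutex;
    std::deque<uint8_t> Bytes;
};
```

A byte-granular pop matches how `OnSoundWaveProceduralUnderflow` requests an arbitrary number of samples per callback.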

**Audio format** (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.

### Silence detection heuristic

`OnAgentStoppedSpeaking` fires when the `AudioQueue` has been empty for **30 consecutive ticks** (~0.5 s at 60 fps). If the agent has natural pauses, increase `SilenceThresholdTicks` in the header:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s
```
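The heuristic itself reduces to a small per-tick counter, sketched here outside UE (names and member layout are illustrative, not the plugin's actual implementation):

```cpp
// Per-tick silence counter: fires the "stopped speaking" signal once the
// audio queue has been empty for SilenceThresholdTicks consecutive ticks.
struct SilenceDetector
{
    static constexpr int SilenceThresholdTicks = 30; // ~0.5 s at 60 fps

    int EmptyTicks = 0;
    bool bSpeaking = false;

    // Call once per tick; returns true on the tick where the event should fire.
    bool Tick(bool bQueueEmpty)
    {
        if (!bQueueEmpty)
        {
            bSpeaking = true;   // audio is flowing
            EmptyTicks = 0;
            return false;
        }
        if (bSpeaking && ++EmptyTicks >= SilenceThresholdTicks)
        {
            bSpeaking = false;  // fire once, then reset
            EmptyTicks = 0;
            return true;
        }
        return false;
    }
};
```

Resetting the counter whenever a chunk arrives is what makes short network gaps (below the threshold) invisible to the caller.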

---

## 11. Common Patterns

### Test the connection without a microphone

```
BeginPlay → StartConversation()

OnAgentConnected → SendTextMessage("Hello, introduce yourself")

OnAgentTextResponse → Print string (confirms the text pipeline works)
OnAgentStartedSpeaking → (confirms the audio pipeline works)
```

### Show subtitles in UI

```
OnAgentTranscript:
    Segment → Text → show in player subtitle widget (speaker always "user")

OnAgentTextResponse:
    ResponseText → show in NPC speech bubble
```

### Interrupt the agent when the player starts speaking

In Server VAD mode ElevenLabs handles this automatically. For manual control:

```
OnAgentStartedSpeaking → set "agent is speaking" flag
Input Action (any) → if agent is speaking → InterruptAgent()
```

### Multiple NPCs with different agents

Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. WebSocket connections are fully independent.

### Only start the conversation when the player is nearby

```
On Begin Overlap (trigger volume around NPC)
  └─► [ElevenLabs Agent] Start Conversation

On End Overlap
  └─► [ElevenLabs Agent] End Conversation
```

### Adjust microphone volume

Get the `UElevenLabsMicrophoneCaptureComponent` from the owner and set `VolumeMultiplier`:

```cpp
UElevenLabsMicrophoneCaptureComponent* Mic =
    GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (Mic) Mic->VolumeMultiplier = 2.0f;
```

---

## 12. Troubleshooting

### Plugin doesn't appear in Project Settings

Ensure the plugin is enabled in `.uproject` and the project was recompiled after adding it.

### WebSocket connection fails immediately

- Check that the **API Key** is set correctly in Project Settings.
- Check that the **Agent ID** exists in your ElevenLabs account (find it in the dashboard URL or via `GET /v1/convai/agents`).
- Enable **Verbose Logging** in Project Settings and check the Output Log for the exact WS URL and error.
- Ensure port 443 (WSS) is not blocked by your firewall.

### `OnAgentConnected` never fires

- The connection was made but `conversation_initiation_metadata` has not been received yet — check Verbose Logging.
- If you see `"Binary audio frame"` logs but no `"Conversation initiated"`, the initiation JSON frame may be arriving as a non-`{` binary frame. Check the hex prefix logged at Verbose level.

### No audio from the microphone

- Windows may require microphone permission. Check **Settings → Privacy → Microphone**.
- Try setting `VolumeMultiplier` to `2.0` on the `MicrophoneCaptureComponent`.
- Check the Output Log for `"Failed to open default audio capture stream"`.

### Agent audio is choppy or silent

- The `USoundWaveProcedural` queue may be underflowing due to network jitter. Check latency.
- Verify the audio format matches: the plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. mp3_44100), audio will sound garbled — check `agent_output_audio_format` in the `conversation_initiation_metadata` via Verbose Logging.
- Ensure no other component is using the same `UAudioComponent`.

### `OnAgentStoppedSpeaking` fires too early

Increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0s at 60fps
```

### Build error: "Plugin AudioCapture not found"

Make sure the `AudioCapture` plugin is enabled. It should be auto-enabled via the `.uplugin` dependency, but you can add it manually to `.uproject`:

```json
{ "Name": "AudioCapture", "Enabled": true }
```

### `"Received unexpected binary WebSocket frame"` in the log

This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.

---

## 13. Changelog

### v1.1.0 — 2026-02-19

**Bug fixes:**

- **Binary WebSocket frames**: ElevenLabs sends all frames as binary (not text), so all frames were previously discarded. Now correctly handled — JSON control frames are decoded as UTF-8, and raw PCM audio frames are routed directly to the audio queue.
- **Transcript message**: Wrong message type (`"transcript"` → `"user_transcript"`), wrong event key (`"transcript_event"` → `"user_transcription_event"`), wrong text field (`"message"` → `"user_transcript"`).
- **Pong format**: `event_id` was nested inside a `pong_event` object; corrected to a top-level field per the API spec.
- **Client turn mode**: `user_turn_start`/`user_turn_end` are not valid API messages; replaced with `user_activity` (start) and implicit silence (end).

**New features:**

- `SendTextMessage(Text)` on both `UElevenLabsConversationalAgentComponent` and `UElevenLabsWebSocketProxy` — send text to the agent without a microphone. Useful for testing.
- Verbose logging shows a binary frame hex preview and a JSON frame content prefix.
- The JSON parse error log now shows the first 80 characters of the failing message.

### v1.0.0 — 2026-02-19

Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.

---

*Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5*
463
.claude/elevenlabs_api_reference.md
Normal file
@@ -0,0 +1,463 @@
# ElevenLabs Conversational AI – API Reference

> Saved for Claude Code sessions. Auto-loaded via `.claude/` directory.
> Last updated: 2026-02-19

---

## 1. Agent ID — Where to Find It

### In the Dashboard (UI)
1. Go to **https://elevenlabs.io/app/conversational-ai**
2. Click on your agent to open it
3. The **Agent ID** is shown on the agent settings page — typically in the URL bar and/or in the agent's "General" settings tab
   - URL pattern: `https://elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>`
   - Also visible in the "API" or "Overview" tab of the agent editor (copy button available)

### Via API
```http
GET https://api.elevenlabs.io/v1/convai/agents
xi-api-key: YOUR_API_KEY
```
Returns a list of all agents with their `agent_id` strings.

### Via API (single agent)
```http
GET https://api.elevenlabs.io/v1/convai/agents/{agent_id}
xi-api-key: YOUR_API_KEY
```

### Agent ID Format
- Type: `string`
- Returned on agent creation via `POST /v1/convai/agents/create`
- Used as a URL path param and WebSocket query param throughout the API

---

## 2. WebSocket Conversational AI

### Connection URL
```
wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<AGENT_ID>
```

Regional alternatives:

| Region | URL |
|--------|-----|
| Default (Global) | `wss://api.elevenlabs.io/` |
| US | `wss://api.us.elevenlabs.io/` |
| EU | `wss://api.eu.residency.elevenlabs.io/` |
| India | `wss://api.in.residency.elevenlabs.io/` |

### Authentication
- **Public agents**: No key required — just the `agent_id` query param
- **Private agents**: Use a **Signed URL** (see Section 4) instead of a direct `agent_id`
- **Server-side** (backend): Pass `xi-api-key` as an HTTP upgrade header

```
Headers:
  xi-api-key: YOUR_API_KEY
```

> ⚠️ Never expose your API key client-side. For browser/mobile apps, use Signed URLs.

---

## 3. WebSocket Protocol — Message Reference

### Audio Format
- **Input (mic → server)**: PCM 16-bit signed, **16000 Hz**, mono, little-endian, Base64-encoded
- **Output (server → client)**: Base64-encoded audio (format specified in `conversation_initiation_metadata`)
---

### Messages FROM Server (Subscribe / Receive)

#### `conversation_initiation_metadata`
Sent immediately after connection. Contains the conversation ID and audio format specs.
```json
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "string",
    "agent_output_audio_format": "pcm_16000 | mp3_44100 | ...",
    "user_input_audio_format": "pcm_16000"
  }
}
```

#### `audio`
Agent speech audio chunk.
```json
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "BASE64_PCM_BYTES",
    "event_id": 42
  }
}
```

#### `user_transcript`
Transcribed text of what the user said.
```json
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "Hello, how are you?"
  }
}
```

#### `agent_response`
The text the agent is saying (arrives in parallel with the audio).
```json
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "I'm doing great, thanks!"
  }
}
```

#### `agent_response_correction`
Sent after an interruption — shows what was truncated.
```json
{
  "type": "agent_response_correction",
  "agent_response_correction_event": {
    "original_agent_response": "string",
    "corrected_agent_response": "string"
  }
}
```

#### `interruption`
Signals that a specific audio event was interrupted.
```json
{
  "type": "interruption",
  "interruption_event": {
    "event_id": 42
  }
}
```

#### `ping`
Keepalive ping from the server. The client must reply with `pong`.
```json
{
  "type": "ping",
  "ping_event": {
    "event_id": 1,
    "ping_ms": 150
  }
}
```

#### `client_tool_call`
Requests that the client execute a tool (custom tools integration).
```json
{
  "type": "client_tool_call",
  "client_tool_call": {
    "tool_name": "string",
    "tool_call_id": "string",
    "parameters": {}
  }
}
```

#### `contextual_update`
Text context added to the conversation state (non-interrupting).
```json
{
  "type": "contextual_update",
  "contextual_update_event": {
    "text": "string"
  }
}
```

#### `vad_score`
Voice Activity Detection confidence score (0.0–1.0).
```json
{
  "type": "vad_score",
  "vad_score_event": {
    "vad_score": 0.85
  }
}
```

#### `internal_tentative_agent_response`
Preliminary agent text during LLM generation (not final).
```json
{
  "type": "internal_tentative_agent_response",
  "tentative_agent_response_internal_event": {
    "tentative_agent_response": "string"
  }
}
```

---

### Messages TO Server (Publish / Send)

#### `user_audio_chunk`
Microphone audio data. Send continuously while the user speaks.
```json
{
  "user_audio_chunk": "BASE64_PCM_16BIT_16KHZ_MONO"
}
```
Audio must be: **PCM 16-bit signed, 16000 Hz, mono, little-endian**, then Base64-encoded.
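The serialize-then-encode step above can be sketched in plain standard C++ (a standalone illustration, not the plugin's UE implementation; `MakeUserAudioChunk` is a hypothetical helper name):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Standard Base64 alphabet (RFC 4648).
static const char* kB64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string Base64Encode(const std::vector<uint8_t>& Bytes)
{
    std::string Out;
    size_t i = 0;
    while (i + 3 <= Bytes.size()) {                        // full 3-byte groups
        uint32_t v = (Bytes[i] << 16) | (Bytes[i + 1] << 8) | Bytes[i + 2];
        Out += kB64[(v >> 18) & 63]; Out += kB64[(v >> 12) & 63];
        Out += kB64[(v >> 6) & 63];  Out += kB64[v & 63];
        i += 3;
    }
    size_t rem = Bytes.size() - i;                         // 0, 1 or 2 trailing bytes
    if (rem == 1) {
        uint32_t v = Bytes[i] << 16;
        Out += kB64[(v >> 18) & 63]; Out += kB64[(v >> 12) & 63]; Out += "==";
    } else if (rem == 2) {
        uint32_t v = (Bytes[i] << 16) | (Bytes[i + 1] << 8);
        Out += kB64[(v >> 18) & 63]; Out += kB64[(v >> 12) & 63];
        Out += kB64[(v >> 6) & 63];  Out += '=';
    }
    return Out;
}

// Serialize int16 samples as little-endian bytes, then build the JSON message.
std::string MakeUserAudioChunk(const std::vector<int16_t>& Samples)
{
    std::vector<uint8_t> Bytes;
    Bytes.reserve(Samples.size() * 2);
    for (int16_t s : Samples) {
        Bytes.push_back(static_cast<uint8_t>(s & 0xFF));        // low byte first
        Bytes.push_back(static_cast<uint8_t>((s >> 8) & 0xFF)); // high byte
    }
    return "{\"user_audio_chunk\": \"" + Base64Encode(Bytes) + "\"}";
}
```

Note that the message has no `"type"` field; the `user_audio_chunk` key itself identifies it.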
#### `pong`
Reply to the server's `ping` to keep the connection alive.
```json
{
  "type": "pong",
  "event_id": 1
}
```
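Building the reply is a one-liner; the point worth checking is that `event_id` sits at the top level of the message, not inside a nested event object. A minimal sketch (hypothetical helper name, not a plugin API):

```cpp
#include <string>

// Build the pong reply for a received ping's event_id.
// "event_id" is a top-level field, not nested inside a "pong_event" object.
std::string MakePong(int EventId)
{
    return "{\"type\": \"pong\", \"event_id\": " + std::to_string(EventId) + "}";
}
```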
#### `conversation_initiation_client_data`
Override the agent configuration at connection time. Send before or just after connecting.
```json
{
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "agent": {
      "prompt": { "prompt": "Custom system prompt override" },
      "first_message": "Hello! How can I help?",
      "language": "en"
    },
    "tts": {
      "voice_id": "string",
      "speed": 1.0,
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  },
  "dynamic_variables": {
    "user_name": "Alice",
    "session_id": 12345
  }
}
```

Config override ranges:
- `tts.speed`: 0.7 – 1.2
- `tts.stability`: 0.0 – 1.0
- `tts.similarity_boost`: 0.0 – 1.0
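Values outside the documented ranges are worth clamping client-side before the override is sent. A sketch under the assumption that the overrides are held in a small struct (the struct and function names are illustrative, not part of the plugin API):

```cpp
#include <algorithm>

// Hypothetical holder for the tts override fields shown above.
struct FTtsOverride
{
    float Speed           = 1.0f;
    float Stability       = 0.5f;
    float SimilarityBoost = 0.75f;
};

// Clamp each field to the range documented for conversation_initiation_client_data.
FTtsOverride ClampTtsOverride(FTtsOverride In)
{
    In.Speed           = std::clamp(In.Speed, 0.7f, 1.2f);
    In.Stability       = std::clamp(In.Stability, 0.0f, 1.0f);
    In.SimilarityBoost = std::clamp(In.SimilarityBoost, 0.0f, 1.0f);
    return In;
}
```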
#### `client_tool_result`
Response to a `client_tool_call` from the server.
```json
{
  "type": "client_tool_result",
  "tool_call_id": "string",
  "result": "tool output string",
  "is_error": false
}
```

#### `contextual_update`
Inject context without interrupting the conversation.
```json
{
  "type": "contextual_update",
  "text": "User just entered room 4B"
}
```

#### `user_message`
Send a text message (no mic audio needed).
```json
{
  "type": "user_message",
  "text": "What is the weather like?"
}
```

#### `user_activity`
Signal that the user is active (for turn detection in client mode).
```json
{
  "type": "user_activity"
}
```

---
## 4. Signed URL (Private Agents)

Used by browser/mobile clients to authenticate without exposing the API key.

### Flow
1. **Backend** calls the ElevenLabs API to get a temporary signed URL
2. Backend returns the signed URL to the client
3. **Client** opens a WebSocket to the signed URL (no API key needed)

### Get Signed URL
```http
GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=<AGENT_ID>
xi-api-key: YOUR_API_KEY
```

Optional query params:
- `include_conversation_id=true` — generates a unique conversation ID and prevents URL reuse
- `branch_id` — specific agent branch

Response:
```json
{
  "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..."
}
```

The client connects to `signed_url` directly — no headers needed.

---

## 5. Agents REST API

Base URL: `https://api.elevenlabs.io`
Auth header: `xi-api-key: YOUR_API_KEY`

### Create Agent
```http
POST /v1/convai/agents/create
Content-Type: application/json

{
  "name": "My NPC Agent",
  "conversation_config": {
    "agent": {
      "first_message": "Hello adventurer!",
      "prompt": { "prompt": "You are a wise tavern keeper in a fantasy world." },
      "language": "en"
    }
  }
}
```
The response includes `agent_id`.

### List Agents
```http
GET /v1/convai/agents?page_size=30&search=&sort_by=created_at&sort_direction=desc
```
Response:
```json
{
  "agents": [
    {
      "agent_id": "abc123xyz",
      "name": "My NPC Agent",
      "created_at_unix_secs": 1708300000,
      "last_call_time_unix_secs": null,
      "archived": false,
      "tags": []
    }
  ],
  "has_more": false,
  "next_cursor": null
}
```

### Get Agent
```http
GET /v1/convai/agents/{agent_id}
```

### Update Agent
```http
PATCH /v1/convai/agents/{agent_id}
Content-Type: application/json

{ "name": "Updated Name", "conversation_config": { ... } }
```

### Delete Agent
```http
DELETE /v1/convai/agents/{agent_id}
```

---

## 6. Turn Modes

### Server VAD (Default / Recommended)
- The ElevenLabs server detects when the user stops speaking
- The client streams audio continuously
- The server handles all turn-taking automatically

### Client Turn Mode
- The client explicitly signals turn boundaries
- Send `user_activity` to indicate the user is speaking
- Use this when you have your own VAD or a push-to-talk UI

---
## 7. Audio Pipeline (UE5 Implementation Notes)

```
Microphone (FAudioCapture)
  → float32 samples at device rate (e.g. 44100 Hz stereo)
  → Resample to 16000 Hz mono
  → Convert float32 → int16 little-endian
  → Base64-encode
  → Send as {"user_audio_chunk": "BASE64"}

Server → {"type":"audio","audio_event":{"audio_base_64":"BASE64"}}
  → Base64-decode
  → Raw PCM bytes
  → Push to USoundWaveProcedural
  → UAudioComponent plays back
```
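The "resample to 16000 Hz mono" step uses linear interpolation in the plugin. A standalone sketch of that approach in plain standard C++ (illustrative, not the plugin's exact code; it assumes stereo input has already been averaged to mono):

```cpp
#include <cstddef>
#include <vector>

// Resample mono float samples from InRate to OutRate by linear interpolation.
std::vector<float> ResampleLinear(const std::vector<float>& Mono,
                                  int InRate, int OutRate)
{
    if (Mono.empty() || InRate <= 0 || OutRate <= 0) return {};
    const std::size_t OutLen =
        static_cast<std::size_t>(Mono.size() * static_cast<double>(OutRate) / InRate);
    std::vector<float> Out(OutLen);
    const double Step = static_cast<double>(InRate) / OutRate; // input samples per output sample
    for (std::size_t i = 0; i < OutLen; ++i) {
        const double Pos = i * Step;
        const std::size_t i0 = static_cast<std::size_t>(Pos);
        const std::size_t i1 = (i0 + 1 < Mono.size()) ? i0 + 1 : i0; // clamp at the end
        const double Frac = Pos - i0;
        Out[i] = static_cast<float>(Mono[i0] * (1.0 - Frac) + Mono[i1] * Frac);
    }
    return Out;
}
```

For 44100 Hz input, 441 input samples map to exactly 160 output samples at 16000 Hz.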
### Float32 → Int16 Conversion (C++)
```cpp
static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatSamples)
{
    TArray<uint8> Bytes;
    Bytes.SetNumUninitialized(FloatSamples.Num() * 2);
    for (int32 i = 0; i < FloatSamples.Num(); i++)
    {
        float Clamped = FMath::Clamp(FloatSamples[i], -1.f, 1.f);
        int16 Sample = (int16)(Clamped * 32767.f);
        Bytes[i * 2]     = (uint8)(Sample & 0xFF);        // Low byte
        Bytes[i * 2 + 1] = (uint8)((Sample >> 8) & 0xFF); // High byte
    }
    return Bytes;
}
```
---

## 8. Quick Integration Checklist (UE5 Plugin)

- [ ] Set `AgentID` in `UElevenLabsSettings` (Project Settings → ElevenLabs AI Agent)
  - Or override per-component via `UElevenLabsConversationalAgentComponent::AgentID`
- [ ] Set `API_Key` in settings (or leave empty for public agents)
- [ ] Add `UElevenLabsConversationalAgentComponent` to your NPC actor
- [ ] Set `TurnMode` (default: `Server` — recommended)
- [ ] Bind to events: `OnAgentConnected`, `OnAgentTranscript`, `OnAgentTextResponse`, `OnAgentStartedSpeaking`, `OnAgentStoppedSpeaking`
- [ ] Call `StartConversation()` to begin
- [ ] Call `EndConversation()` when done

---

## 9. Key API URLs Reference

| Purpose | URL |
|---------|-----|
| Dashboard | https://elevenlabs.io/app/conversational-ai |
| API Keys | https://elevenlabs.io/app/settings/api-keys |
| WebSocket endpoint | wss://api.elevenlabs.io/v1/convai/conversation |
| Agents list | GET https://api.elevenlabs.io/v1/convai/agents |
| Agent by ID | GET https://api.elevenlabs.io/v1/convai/agents/{agent_id} |
| Create agent | POST https://api.elevenlabs.io/v1/convai/agents/create |
| Signed URL | GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url |
| WS protocol docs | https://elevenlabs.io/docs/eleven-agents/api-reference/eleven-agents/websocket |
| Quickstart | https://elevenlabs.io/docs/eleven-agents/quickstart |
61
.claude/elevenlabs_plugin.md
Normal file
@@ -0,0 +1,61 @@
# PS_AI_Agent_ElevenLabs Plugin

## Location
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`

## File Map
```
PS_AI_Agent_ElevenLabs.uplugin
Source/PS_AI_Agent_ElevenLabs/
  PS_AI_Agent_ElevenLabs.Build.cs
  Public/
    PS_AI_Agent_ElevenLabs.h                      – FPS_AI_Agent_ElevenLabsModule + UElevenLabsSettings
    ElevenLabsDefinitions.h                       – Enums, structs, ElevenLabsMessageType/Audio constants
    ElevenLabsWebSocketProxy.h/.cpp               – UObject managing one WS session
    ElevenLabsConversationalAgentComponent.h/.cpp – Main ActorComponent (attach to NPC)
    ElevenLabsMicrophoneCaptureComponent.h/.cpp   – Mic capture, resample, dispatch to game thread
  Private/
    (implementations of the above)
```

## ElevenLabs Conversational AI Protocol
- **WebSocket URL**: `wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>`
- **Auth**: HTTP upgrade header `xi-api-key: <key>` (set in Project Settings)
- **All frames**: delivered as *binary* WebSocket frames — both JSON control messages and raw PCM audio; peek at the first byte (`'{'` → JSON, anything else → PCM) to discriminate (corrected in plugin v1.1.0)
- **Audio format**: PCM 16-bit signed, 16000 Hz, mono, little-endian — Base64-encoded in JSON

### Client → Server messages
| Type field value | Payload |
|---|---|
| *(none – key is the type)* `user_audio_chunk` | `{ "user_audio_chunk": "<base64 PCM>" }` |
| `user_activity` | `{ "type": "user_activity" }` — signals the start of a user turn; the turn end is implicit (server detects silence) |
| `user_message` | `{ "type": "user_message", "text": "..." }` |
| `pong` | `{ "type": "pong", "event_id": N }` — `event_id` is top-level, not nested |

### Server → Client messages (field: `type`)
| type value | Key nested object | Notes |
|---|---|---|
| `conversation_initiation_metadata` | `conversation_initiation_metadata_event.conversation_id` | Marks WS ready |
| `audio` | `audio_event.audio_base_64` | Base64 PCM from agent |
| `user_transcript` | `user_transcription_event.user_transcript` | Transcript of user speech |
| `agent_response` | `agent_response_event.agent_response` | Final agent text |
| `interruption` | — | Agent stopped mid-sentence |
| `ping` | `ping_event.event_id` | Must reply with pong |

## Key Design Decisions
- **No gRPC / no ThirdParty libs** — pure UE WebSockets + HTTP, builds out of the box
- Audio resampled in-plugin: device rate → 16000 Hz mono (linear interpolation)
- `USoundWaveProcedural` for real-time agent audio playback (queue-driven)
- Silence heuristic: 30 game-thread ticks (~0.5 s at 60 fps) with no new audio → agent done speaking
- `bSignedURLMode` setting: fetch a signed WS URL from your own backend (keeps the API key off the client)
- Two turn modes: `Server VAD` (ElevenLabs detects speech end) and `Client Controlled` (push-to-talk)

## Build Dependencies (Build.cs)
Core, CoreUObject, Engine, InputCore, Json, JsonUtilities, WebSockets, HTTP,
AudioMixer, AudioCaptureCore, AudioCapture, Voice, SignalProcessing

## Status
- **Session 1** (2026-02-19): All source files written, registered in the .uproject. Not yet compiled.
- **TODO**: Open in the UE 5.5 Editor → compile → test a basic WS connection with a test agent ID.
- **Watch out**: Verify the `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature against the UE 5.5 API.
79
.claude/project_context.md
Normal file
@@ -0,0 +1,79 @@
# Project Context & Original Ask

## What the user wants to build

A **UE5 plugin** that integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5,
allowing an in-game NPC (or any Actor) to hold a real-time voice conversation with a player.

### The original request (paraphrased)
> "I want to create a plugin to use the ElevenLabs Conversational Agent in Unreal Engine 5.5.
> I previously used the Convai plugin, which does what I want, but I prefer ElevenLabs quality.
> The goal is to create a plugin in the existing Unreal project as a first step toward integration.
> The Convai AI plugin may be too big in terms of functionality for the new project, but it is the final goal.
> You can use the Convai source code to find the right way to make the ElevenLabs version —
> it should be very similar."

### Plugin name
`PS_AI_Agent_ElevenLabs`

---

## User's mental model / intent

1. **Short-term**: A working first-step plugin — minimal but functional — that can:
   - Connect to ElevenLabs Conversational AI via WebSocket
   - Capture microphone audio from the player
   - Stream it to ElevenLabs in real time
   - Play back the agent's voice response
   - Surface key events (transcript, agent text, speaking state) to Blueprint

2. **Long-term**: Match the full feature set of Convai — character IDs, session memory,
   actions/environment context, lip-sync, etc. — but powered by ElevenLabs instead.

3. **Key preference**: Simpler than Convai. No gRPC, no protobuf, no ThirdParty precompiled
   libraries. ElevenLabs' Conversational AI API uses plain WebSocket + JSON, which maps
   naturally to UE's built-in `WebSockets` module.

---

## How we used Convai as a reference

We studied the Convai plugin source (`ConvAI/Convai/`) to understand:
- **Module structure**: `UConvaiSettings` + `IModuleInterface` + `ISettingsModule` registration
- **Audio capture pattern**: `Audio::FAudioCapture`, ring buffers, thread-safe dispatch to the game thread
- **Audio playback pattern**: `USoundWaveProcedural` fed from a queue
- **Component architecture**: `UConvaiChatbotComponent` (NPC side) + `UConvaiPlayerComponent` (player side)
- **HTTP proxy pattern**: `UConvaiAPIBaseProxy` base class for async REST calls
- **Voice type enum**: Convai already had `EVoiceType::ElevenLabsVoices` — confirming ElevenLabs
  is a natural fit

We then replaced gRPC/protobuf with **WebSocket + JSON** to match the ElevenLabs API, and
simplified the architecture to the minimum needed for a first working version.

---

## What was built (Session 1 — 2026-02-19)

All source files created and registered. See `.claude/elevenlabs_plugin.md` for the full file map and protocol details.

### Components created
| Class | Role |
|---|---|
| `UElevenLabsSettings` | Project Settings UI — API key, Agent ID, security options |
| `UElevenLabsWebSocketProxy` | Manages one WS session: connect, send audio, handle all server message types |
| `UElevenLabsConversationalAgentComponent` | ActorComponent to attach to any NPC — orchestrates mic + WS + playback |
| `UElevenLabsMicrophoneCaptureComponent` | Wraps `Audio::FAudioCapture`, resamples to 16 kHz mono |

### Not yet done (next sessions)
- Compile & test in the UE 5.5 Editor
- Verify the `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature for UE 5.5
- Add lip-sync support (future)
- Add session memory / conversation history (future)
- Add environment/action context support (future, matching Convai's full feature set)

---

## Notes on the ElevenLabs API
- Docs: https://elevenlabs.io/docs/conversational-ai
- Create agents at: https://elevenlabs.io/app/conversational-ai
- API keys at: https://elevenlabs.io (dashboard)
200
.claude/session_log_2026-02-19.md
Normal file
@@ -0,0 +1,200 @@
# Session Log — 2026-02-19

**Project**: PS_AI_Agent (Unreal Engine 5.5)
**Machine**: Desktop PC (j_foucher)
**Working directory**: `E:\ASTERION\GIT\PS_AI_Agent`

---

## Conversation Summary

### 1. Initial Request
User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE 5.5.
Reference: the existing Convai plugin (gRPC-based, more complex). Goal: a simpler version using ElevenLabs.
Plugin name requested: `PS_AI_Agent_ElevenLabs`.

### 2. Codebase Exploration
Explored the Convai plugin source at `ConvAI/Convai/` to understand:
- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction

### 3. Plugin Created
All source files written from scratch under:
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`

Files created:
- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WS session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered

Commit: `f0055e8`

### 4. Memory Files Created
To allow context recovery on any machine (including the laptop):
- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- Local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`

Commit: `f0055e8` (with plugin), updated in `4d6ae10`

### 5. .gitignore Updated
Added to the existing ignores:
- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln` / `*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`

Commits: `4d6ae10`, `b114ab0`

### 6. Compile — First Attempt (Errors Found)
Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:
- `WebSockets` listed in `.uplugin` — it's a module, not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- The `FOnAudioCaptureFunction` callback uses `const void*`, not `const float*` → fixed the cast
- `TArray::RemoveAt(0, N, false)` deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in the `.uplugin` Plugins array → added

Commit: `bb1a857`

### 7. Compile — Success
Clean build, no warnings, no errors.
Output: `Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll`

Memory updated with confirmed UE 5.5 API patterns. Commit: `3b98edc`

### 8. Documentation — Markdown
Full reference doc written to `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:
- Installation, Project Settings, Quick Start (BP + C++), Components Reference,
  Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.

Commit: `c833ccd`

### 9. Documentation — PowerPoint
20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):
- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in the repo root
- Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
- Generator script `generate_pptx.py` excluded from git via .gitignore

Commit: `1b72026`

---

## Session 2 — 2026-02-19 (continued context)

### 10. API vs Implementation Cross-Check (3 bugs found and fixed)
Cross-referenced `elevenlabs_api_reference.md` against the plugin source. Found 3 protocol bugs:

**Bug 1 — Transcript fields wrong:**
- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`

**Bug 2 — Pong format wrong:**
- `event_id` was nested in `pong_event{}` → must be top-level

**Bug 3 — Client turn mode messages don't exist:**
- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)

Commit: `ae2c9b9`

### 11. SendTextMessage Added
User asked for text input to the agent for testing (without a mic).
Added `SendTextMessage(FString)` to `UElevenLabsWebSocketProxy` and `UElevenLabsConversationalAgentComponent`.
Sends `{"type":"user_message","text":"..."}` — the agent replies with audio + text.

Commit: `b489d11`

### 12. Binary WebSocket Frame Fix
User reported: `"Received unexpected binary WebSocket frame"` warnings.
Root cause: ElevenLabs sends **ALL WebSocket frames as binary**, never text.
`OnMessage` (the text handler) never fires; `OnRawMessage` must handle everything.

Fix: Implemented `OnWsBinaryMessage` with fragment reassembly (`BinaryFrameBuffer`).

Commit: `669c503`

### 13. JSON vs PCM Discrimination Fix
After the binary fix: `"Failed to parse WebSocket message as JSON"` errors.
Root cause: Binary frames contain BOTH JSON control messages AND raw PCM audio.

Fix: Peek at byte[0] of the assembled buffer:
- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`

Commit: `4834567`
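The byte[0] peek above can be sketched as a standalone classifier (illustrative only; the plugin's real code routes to `OnWsMessage` / `OnAudioReceived` rather than returning an enum):

```cpp
#include <cstdint>
#include <vector>

// Classify an assembled binary WebSocket frame by its first byte.
enum class EFrameKind { Json, Pcm, Empty };

EFrameKind ClassifyFrame(const std::vector<uint8_t>& Frame)
{
    if (Frame.empty())    return EFrameKind::Empty;
    if (Frame[0] == 0x7B) return EFrameKind::Json; // '{' — UTF-8 JSON control message
    return EFrameKind::Pcm;                        // anything else — raw PCM audio
}
```

The heuristic works because every JSON control message from the server is an object (starts with `'{'`), while a PCM chunk starting with byte 0x7B would require a specific sample value at the chunk boundary, which the protocol avoids by framing audio separately.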
### 14. Documentation Updated to v1.1.0
Full rewrite of `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:
- Added a Changelog section (v1.0.0 / v1.1.0)
- Updated the audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected the turn mode docs, transcript docs, and `OnAgentConnected` timing
- New troubleshooting entries

Commit: `e464cfe`

### 15. Test Blueprint Asset Updated
`test_AI_Actor.uasset` updated in the UE Editor.

Commit: `99017f4`

---

## Git History (this session)

| Hash | Message |
|------|---------|
| `f0055e8` | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| `4d6ae10` | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| `b114ab0` | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| `bb1a857` | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| `3b98edc` | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| `c833ccd` | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| `1b72026` | Add PowerPoint documentation and update .gitignore |
| `bbeb429` | ElevenLabs API reference doc |
| `dbd6161` | TestMap, test actor, DefaultEngine.ini, memory update |
| `ae2c9b9` | Fix 3 WebSocket protocol bugs |
| `b489d11` | Add SendTextMessage |
| `669c503` | Fix binary WebSocket frames |
| `4834567` | Fix JSON vs binary frame discrimination |
| `e464cfe` | Update documentation to v1.1.0 |
| `99017f4` | Update test_AI_Actor blueprint asset |

---

## Key Technical Decisions Made This Session

| Decision | Reason |
|----------|--------|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in the `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it to be declared |
| `WebSockets` in Build.cs only | It's a module (no `.uplugin` file); declaring it in `.uplugin` causes a build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside the callback |
| `EAllowShrinking::No` | The bool overload of `RemoveAt` is deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5 s at 60 fps heuristic to detect that the agent finished speaking |
| Binary frame handling | ElevenLabs sends ALL WS frames as binary; peek at byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in the ElevenLabs API |

---

## Next Steps (not done yet)

- [ ] Verify mic audio actually reaches ElevenLabs (enable Verbose Logging, test in the Editor)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)
- [ ] Add session memory / conversation history (future, matching Convai)
- [ ] Add environment/action context support (future)
- [ ] Consider a Signed URL Mode backend implementation
14
.gitignore
vendored
@@ -4,3 +4,17 @@ Unreal/PS_AI_Agent/Binaries/
 Unreal/PS_AI_Agent/Intermediate/
 Unreal/PS_AI_Agent/Saved/
 ConvAI/Convai/Binaries/
+
+# All plugin build artifacts (Binaries + Intermediate for any plugin)
+Unreal/PS_AI_Agent/Plugins/*/Binaries/
+Unreal/PS_AI_Agent/Plugins/*/Intermediate/
+
+# UE5 generated solution files
+Unreal/PS_AI_Agent/*.sln
+Unreal/PS_AI_Agent/*.suo
+
+# Claude Code local session settings (machine-specific, memory files in .claude/ are kept)
+.claude/settings.local.json
+
+# Documentation generator script (dev tool, output .pptx is committed instead)
+generate_pptx.py
BIN
PS_AI_Agent_ElevenLabs_Documentation.pptx
Normal file
Binary file not shown.
@@ -1,7 +1,8 @@
 [/Script/EngineSettings.GameMapsSettings]
-GameDefaultMap=/Engine/Maps/Templates/OpenWorld
+GameDefaultMap=/Game/TestMap.TestMap
+EditorStartupMap=/Game/TestMap.TestMap
 
 [/Script/Engine.RendererSettings]
 r.AllowStaticLighting=False
@@ -90,3 +91,4 @@ ConnectionType=USBOnly
 bUseManualIPAddress=False
 ManualIPAddress=
+
Binary file not shown.
BIN
Unreal/PS_AI_Agent/Content/test_AI_Actor.uasset
Normal file
Binary file not shown.
@@ -17,6 +17,10 @@
       "TargetAllowList": [
         "Editor"
       ]
     },
+    {
+      "Name": "PS_AI_Agent_ElevenLabs",
+      "Enabled": true
+    }
   ]
 }
@@ -0,0 +1,35 @@
{
  "FileVersion": 3,
  "Version": 1,
  "VersionName": "1.0.0",
  "FriendlyName": "PS AI Agent - ElevenLabs",
  "Description": "Integrates ElevenLabs Conversational AI Agent into Unreal Engine 5.5. Supports real-time voice conversation via WebSocket, microphone capture, and audio playback.",
  "Category": "AI",
  "CreatedBy": "ASTERION",
  "CreatedByURL": "",
  "DocsURL": "https://elevenlabs.io/docs/conversational-ai",
  "MarketplaceURL": "",
  "SupportURL": "",
  "CanContainContent": false,
  "IsBetaVersion": true,
  "IsExperimentalVersion": false,
  "Installed": false,
  "Modules": [
    {
      "Name": "PS_AI_Agent_ElevenLabs",
      "Type": "Runtime",
      "LoadingPhase": "PreDefault",
      "PlatformAllowList": [
        "Win64",
        "Mac",
        "Linux"
      ]
    }
  ],
  "Plugins": [
    {
      "Name": "AudioCapture",
      "Enabled": true
    }
  ]
}
@@ -0,0 +1,40 @@
// Copyright ASTERION. All Rights Reserved.

using UnrealBuildTool;

public class PS_AI_Agent_ElevenLabs : ModuleRules
{
    public PS_AI_Agent_ElevenLabs(ReadOnlyTargetRules Target) : base(Target)
    {
        DefaultBuildSettings = BuildSettingsVersion.Latest;
        PCHUsage = PCHUsageMode.UseExplicitOrSharedPCHs;

        PublicDependencyModuleNames.AddRange(new string[]
        {
            "Core",
            "CoreUObject",
            "Engine",
            "InputCore",
            // JSON serialization for WebSocket message payloads
            "Json",
            "JsonUtilities",
            // WebSocket for ElevenLabs Conversational AI real-time API
            "WebSockets",
            // HTTP for REST calls (agent metadata, auth, etc.)
            "HTTP",
            // Audio capture (microphone input)
            "AudioMixer",
            "AudioCaptureCore",
            "AudioCapture",
            "Voice",
            "SignalProcessing",
        });

        PrivateDependencyModuleNames.AddRange(new string[]
        {
            "Projects",
            // For ISettingsModule (Project Settings integration)
            "Settings",
        });
    }
}
@@ -0,0 +1,345 @@
// Copyright ASTERION. All Rights Reserved.

#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsMicrophoneCaptureComponent.h"
#include "PS_AI_Agent_ElevenLabs.h"

#include "Components/AudioComponent.h"
#include "Sound/SoundWaveProcedural.h"
#include "GameFramework/Actor.h"
#include "Engine/World.h"

DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsAgent, Log, All);

// ─────────────────────────────────────────────────────────────────────────────
// Constructor
// ─────────────────────────────────────────────────────────────────────────────
UElevenLabsConversationalAgentComponent::UElevenLabsConversationalAgentComponent()
{
    PrimaryComponentTick.bCanEverTick = true;
    // Tick is used only to detect silence (agent stopped speaking).
    // Disable if not needed for perf.
    PrimaryComponentTick.TickInterval = 1.0f / 60.0f;
}

// ─────────────────────────────────────────────────────────────────────────────
// Lifecycle
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::BeginPlay()
{
    Super::BeginPlay();
    InitAudioPlayback();
}

void UElevenLabsConversationalAgentComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
{
    EndConversation();
    Super::EndPlay(EndPlayReason);
}

void UElevenLabsConversationalAgentComponent::TickComponent(float DeltaTime, ELevelTick TickType,
    FActorComponentTickFunction* ThisTickFunction)
{
    Super::TickComponent(DeltaTime, TickType, ThisTickFunction);

    if (bAgentSpeaking)
    {
        FScopeLock Lock(&AudioQueueLock);
        if (AudioQueue.Num() == 0)
        {
            SilentTickCount++;
            if (SilentTickCount >= SilenceThresholdTicks)
            {
                bAgentSpeaking = false;
                SilentTickCount = 0;
                OnAgentStoppedSpeaking.Broadcast();
            }
        }
        else
        {
            SilentTickCount = 0;
        }
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::StartConversation()
{
    if (!WebSocketProxy)
    {
        WebSocketProxy = NewObject<UElevenLabsWebSocketProxy>(this);
        WebSocketProxy->OnConnected.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleConnected);
        WebSocketProxy->OnDisconnected.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleDisconnected);
        WebSocketProxy->OnError.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleError);
        WebSocketProxy->OnAudioReceived.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleAudioReceived);
        WebSocketProxy->OnTranscript.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleTranscript);
        WebSocketProxy->OnAgentResponse.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleAgentResponse);
        WebSocketProxy->OnInterrupted.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleInterrupted);
    }

    WebSocketProxy->Connect(AgentID);
}

void UElevenLabsConversationalAgentComponent::EndConversation()
{
    StopListening();
    StopAgentAudio();

    if (WebSocketProxy)
    {
        WebSocketProxy->Disconnect();
        WebSocketProxy = nullptr;
    }
}

void UElevenLabsConversationalAgentComponent::StartListening()
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsAgent, Warning, TEXT("StartListening: not connected."));
        return;
    }

    if (bIsListening) return;
    bIsListening = true;

    if (TurnMode == EElevenLabsTurnMode::Client)
    {
        WebSocketProxy->SendUserTurnStart();
    }

    // Find the microphone component on our owner actor, or create one.
    UElevenLabsMicrophoneCaptureComponent* Mic =
        GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();

    if (!Mic)
    {
        Mic = NewObject<UElevenLabsMicrophoneCaptureComponent>(GetOwner(),
            TEXT("ElevenLabsMicrophone"));
        Mic->RegisterComponent();
    }

    Mic->OnAudioCaptured.AddUObject(this,
        &UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
    Mic->StartCapture();

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture started."));
}

void UElevenLabsConversationalAgentComponent::StopListening()
{
    if (!bIsListening) return;
    bIsListening = false;

    if (UElevenLabsMicrophoneCaptureComponent* Mic =
        GetOwner() ? GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>() : nullptr)
    {
        Mic->StopCapture();
        Mic->OnAudioCaptured.RemoveAll(this);
    }

    if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
    {
        WebSocketProxy->SendUserTurnEnd();
    }

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture stopped."));
}

void UElevenLabsConversationalAgentComponent::SendTextMessage(const FString& Text)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsAgent, Warning, TEXT("SendTextMessage: not connected. Call StartConversation() first."));
        return;
    }
    WebSocketProxy->SendTextMessage(Text);
}

void UElevenLabsConversationalAgentComponent::InterruptAgent()
{
    if (WebSocketProxy) WebSocketProxy->SendInterrupt();
    StopAgentAudio();
}

// ─────────────────────────────────────────────────────────────────────────────
// State queries
// ─────────────────────────────────────────────────────────────────────────────
bool UElevenLabsConversationalAgentComponent::IsConnected() const
{
    return WebSocketProxy && WebSocketProxy->IsConnected();
}

const FElevenLabsConversationInfo& UElevenLabsConversationalAgentComponent::GetConversationInfo() const
{
    static FElevenLabsConversationInfo Empty;
    return WebSocketProxy ? WebSocketProxy->GetConversationInfo() : Empty;
}

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket event handlers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
    OnAgentConnected.Broadcast(Info);

    if (bAutoStartListening)
    {
        StartListening();
    }
}

void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCode, const FString& Reason)
{
    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
    bIsListening = false;
    bAgentSpeaking = false;
    OnAgentDisconnected.Broadcast(StatusCode, Reason);
}

void UElevenLabsConversationalAgentComponent::HandleError(const FString& ErrorMessage)
{
    UE_LOG(LogElevenLabsAgent, Error, TEXT("Agent error: %s"), *ErrorMessage);
    OnAgentError.Broadcast(ErrorMessage);
}

void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<uint8>& PCMData)
{
    EnqueueAgentAudio(PCMData);
}

void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
    OnAgentTranscript.Broadcast(Segment);
}

void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
    OnAgentTextResponse.Broadcast(ResponseText);
}

void UElevenLabsConversationalAgentComponent::HandleInterrupted()
{
    StopAgentAudio();
    OnAgentInterrupted.Broadcast();
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio playback
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::InitAudioPlayback()
{
    AActor* Owner = GetOwner();
    if (!Owner) return;

    // USoundWaveProcedural lets us push raw PCM data at runtime.
    ProceduralSoundWave = NewObject<USoundWaveProcedural>(this);
    ProceduralSoundWave->SetSampleRate(ElevenLabsAudio::SampleRate);
    ProceduralSoundWave->NumChannels = ElevenLabsAudio::Channels;
    ProceduralSoundWave->Duration = INDEFINITELY_LOOPING_DURATION;
    ProceduralSoundWave->SoundGroup = SOUNDGROUP_Voice;
    ProceduralSoundWave->bLooping = false;

    // Create the audio component attached to the owner.
    AudioPlaybackComponent = NewObject<UAudioComponent>(Owner, TEXT("ElevenLabsAudioPlayback"));
    AudioPlaybackComponent->RegisterComponent();
    AudioPlaybackComponent->bAutoActivate = false;
    AudioPlaybackComponent->SetSound(ProceduralSoundWave);

    // When the procedural sound wave needs more audio data, pull from our queue.
    ProceduralSoundWave->OnSoundWaveProceduralUnderflow =
        FOnSoundWaveProceduralUnderflow::CreateUObject(
            this, &UElevenLabsConversationalAgentComponent::OnProceduralUnderflow);
}

void UElevenLabsConversationalAgentComponent::OnProceduralUnderflow(
    USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired)
{
    FScopeLock Lock(&AudioQueueLock);
    if (AudioQueue.Num() == 0) return;

    const int32 BytesRequired = SamplesRequired * sizeof(int16);
    const int32 BytesToPush = FMath::Min(AudioQueue.Num(), BytesRequired);

    InProceduralWave->QueueAudio(AudioQueue.GetData(), BytesToPush);
    AudioQueue.RemoveAt(0, BytesToPush, EAllowShrinking::No);
}

void UElevenLabsConversationalAgentComponent::EnqueueAgentAudio(const TArray<uint8>& PCMData)
{
    {
        FScopeLock Lock(&AudioQueueLock);
        AudioQueue.Append(PCMData);
    }

    // Start playback if not already playing.
    if (!bAgentSpeaking)
    {
        bAgentSpeaking = true;
        SilentTickCount = 0;
        OnAgentStartedSpeaking.Broadcast();

        if (AudioPlaybackComponent && !AudioPlaybackComponent->IsPlaying())
        {
            AudioPlaybackComponent->Play();
        }
    }
}

void UElevenLabsConversationalAgentComponent::StopAgentAudio()
{
    if (AudioPlaybackComponent && AudioPlaybackComponent->IsPlaying())
    {
        AudioPlaybackComponent->Stop();
    }

    FScopeLock Lock(&AudioQueueLock);
    AudioQueue.Empty();

    if (bAgentSpeaking)
    {
        bAgentSpeaking = false;
        SilentTickCount = 0;
        OnAgentStoppedSpeaking.Broadcast();
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Microphone → WebSocket
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TArray<float>& FloatPCM)
{
    if (!IsConnected() || !bIsListening) return;

    TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
    WebSocketProxy->SendAudioChunk(PCMBytes);
}

TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)
{
    TArray<uint8> Out;
    Out.Reserve(FloatPCM.Num() * 2);

    for (float Sample : FloatPCM)
    {
        // Clamp to [-1,1] then scale to int16 range
        const float Clamped = FMath::Clamp(Sample, -1.0f, 1.0f);
        const int16 Int16Sample = static_cast<int16>(Clamped * 32767.0f);

        // Little-endian
        Out.Add(static_cast<uint8>(Int16Sample & 0xFF));
        Out.Add(static_cast<uint8>((Int16Sample >> 8) & 0xFF));
    }

    return Out;
}
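The `FloatPCMToInt16Bytes` conversion above can be restated as a self-contained sketch (standard C++ in place of UE types; the function name here is illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Clamp each float sample to [-1, 1], scale to the int16 range, and emit
// little-endian byte pairs, matching the plugin's conversion.
std::vector<uint8_t> FloatPcmToInt16LE(const std::vector<float>& In)
{
    std::vector<uint8_t> Out;
    Out.reserve(In.size() * 2);
    for (float Sample : In)
    {
        const float Clamped = std::clamp(Sample, -1.0f, 1.0f);
        const int16_t S16 = static_cast<int16_t>(Clamped * 32767.0f);
        Out.push_back(static_cast<uint8_t>(S16 & 0xFF));
        Out.push_back(static_cast<uint8_t>((S16 >> 8) & 0xFF));
    }
    return Out;
}
```

Full scale maps to ±32767, so +1.0f becomes bytes `FF 7F` and -1.0f becomes `01 80` in little-endian order.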
@@ -0,0 +1,171 @@
// Copyright ASTERION. All Rights Reserved.

#include "ElevenLabsMicrophoneCaptureComponent.h"
#include "ElevenLabsDefinitions.h"

#include "AudioCaptureCore.h"
#include "Async/Async.h"

DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsMic, Log, All);

// ─────────────────────────────────────────────────────────────────────────────
// Constructor
// ─────────────────────────────────────────────────────────────────────────────
UElevenLabsMicrophoneCaptureComponent::UElevenLabsMicrophoneCaptureComponent()
{
    PrimaryComponentTick.bCanEverTick = false;
}

// ─────────────────────────────────────────────────────────────────────────────
// Lifecycle
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
{
    StopCapture();
    Super::EndPlay(EndPlayReason);
}

// ─────────────────────────────────────────────────────────────────────────────
// Capture control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::StartCapture()
{
    if (bCapturing)
    {
        UE_LOG(LogElevenLabsMic, Warning, TEXT("StartCapture called while already capturing."));
        return;
    }

    // Open the default audio capture stream.
    // FOnAudioCaptureFunction uses const void* per UE 5.3+ API (cast to float* inside).
    Audio::FOnAudioCaptureFunction CaptureCallback =
        [this](const void* InAudio, int32 NumFrames, int32 InNumChannels,
               int32 InSampleRate, double StreamTime, bool bOverflow)
        {
            OnAudioGenerate(InAudio, NumFrames, InNumChannels, InSampleRate, StreamTime, bOverflow);
        };

    if (!AudioCapture.OpenAudioCaptureStream(DeviceParams, MoveTemp(CaptureCallback), 1024))
    {
        UE_LOG(LogElevenLabsMic, Error, TEXT("Failed to open default audio capture stream."));
        return;
    }

    // Retrieve the actual device parameters after opening the stream.
    Audio::FCaptureDeviceInfo DeviceInfo;
    if (AudioCapture.GetCaptureDeviceInfo(DeviceInfo))
    {
        DeviceSampleRate = DeviceInfo.PreferredSampleRate;
        DeviceChannels = DeviceInfo.InputChannels;
        UE_LOG(LogElevenLabsMic, Log, TEXT("Capture device: %s | Rate=%d | Channels=%d"),
            *DeviceInfo.DeviceName, DeviceSampleRate, DeviceChannels);
    }

    AudioCapture.StartStream();
    bCapturing = true;
    UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture started."));
}

void UElevenLabsMicrophoneCaptureComponent::StopCapture()
{
    if (!bCapturing) return;

    AudioCapture.StopStream();
    AudioCapture.CloseStream();
    bCapturing = false;
    UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture stopped."));
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio callback (background thread)
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(
    const void* InAudio, int32 NumFrames,
    int32 InNumChannels, int32 InSampleRate,
    double StreamTime, bool bOverflow)
{
    if (bOverflow)
    {
        UE_LOG(LogElevenLabsMic, Verbose, TEXT("Audio capture buffer overflow."));
    }

    // Device sends float32 interleaved samples; cast from the void* API.
    const float* FloatAudio = static_cast<const float*>(InAudio);

    // Resample + downmix to 16000 Hz mono.
    TArray<float> Resampled = ResampleTo16000(FloatAudio, NumFrames, InNumChannels, InSampleRate);

    // Apply volume multiplier.
    if (!FMath::IsNearlyEqual(VolumeMultiplier, 1.0f))
    {
        for (float& S : Resampled)
        {
            S *= VolumeMultiplier;
        }
    }

    // Fire the delegate on the game thread so subscribers don't need to be
    // thread-safe (WebSocket Send is not thread-safe in UE's implementation).
    AsyncTask(ENamedThreads::GameThread, [this, Data = MoveTemp(Resampled)]()
    {
        if (bCapturing)
        {
            OnAudioCaptured.Broadcast(Data);
        }
    });
}

// ─────────────────────────────────────────────────────────────────────────────
// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
    const float* InAudio, int32 NumSamples,
    int32 InChannels, int32 InSampleRate)
{
    const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000

    // --- Step 1: Downmix to mono ---
    TArray<float> Mono;
    if (InChannels == 1)
    {
        Mono = TArray<float>(InAudio, NumSamples);
    }
    else
    {
        const int32 NumFrames = NumSamples / InChannels;
        Mono.Reserve(NumFrames);
        for (int32 i = 0; i < NumFrames; i++)
        {
            float Sum = 0.0f;
            for (int32 c = 0; c < InChannels; c++)
            {
                Sum += InAudio[i * InChannels + c];
            }
            Mono.Add(Sum / static_cast<float>(InChannels));
        }
    }

    // --- Step 2: Resample via linear interpolation ---
    if (InSampleRate == TargetRate)
    {
        return Mono;
    }

    const float Ratio = static_cast<float>(InSampleRate) / static_cast<float>(TargetRate);
    const int32 OutSamples = FMath::FloorToInt(static_cast<float>(Mono.Num()) / Ratio);

    TArray<float> Out;
    Out.Reserve(OutSamples);

    for (int32 i = 0; i < OutSamples; i++)
    {
        const float SrcIndex = static_cast<float>(i) * Ratio;
        const int32 SrcLow = FMath::FloorToInt(SrcIndex);
        const int32 SrcHigh = FMath::Min(SrcLow + 1, Mono.Num() - 1);
        const float Alpha = SrcIndex - static_cast<float>(SrcLow);

        Out.Add(FMath::Lerp(Mono[SrcLow], Mono[SrcHigh], Alpha));
    }

    return Out;
}
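The linear-interpolation resampling step above can be restated as a self-contained sketch (standard C++ in place of `TArray`/`FMath`; names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Resample mono float audio from InRate to OutRate by linear interpolation,
// mirroring step 2 of the plugin's ResampleTo16000.
std::vector<float> ResampleLinear(const std::vector<float>& Mono, int InRate, int OutRate)
{
    if (InRate == OutRate || Mono.empty()) return Mono;

    const float Ratio = static_cast<float>(InRate) / static_cast<float>(OutRate);
    const int OutN = static_cast<int>(static_cast<float>(Mono.size()) / Ratio);

    std::vector<float> Out;
    Out.reserve(OutN);
    for (int i = 0; i < OutN; ++i)
    {
        const float Src = static_cast<float>(i) * Ratio;
        const int Lo = static_cast<int>(Src);
        const int Hi = std::min<int>(Lo + 1, static_cast<int>(Mono.size()) - 1);
        const float Alpha = Src - static_cast<float>(Lo);
        Out.push_back(Mono[Lo] + (Mono[Hi] - Mono[Lo]) * Alpha);
    }
    return Out;
}
```

Downsampling a 32 kHz signal to 16 kHz keeps every second sample exactly, since the source index lands on integers (alpha = 0).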
@@ -0,0 +1,455 @@
// Copyright ASTERION. All Rights Reserved.

#include "ElevenLabsWebSocketProxy.h"
#include "PS_AI_Agent_ElevenLabs.h"

#include "WebSocketsModule.h"
#include "IWebSocket.h"

#include "Json.h"
#include "JsonUtilities.h"
#include "Misc/Base64.h"

DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsWS, Log, All);

// ─────────────────────────────────────────────────────────────────────────────
// Helpers
// ─────────────────────────────────────────────────────────────────────────────
static void EL_LOG(bool bVerbose, const TCHAR* Format, ...)
{
    if (!bVerbose) return;
    va_list Args;
    va_start(Args, Format);
    // Forward to UE_LOG at Verbose level
    TCHAR Buffer[2048];
    FCString::GetVarArgs(Buffer, UE_ARRAY_COUNT(Buffer), Format, Args);
    va_end(Args);
    UE_LOG(LogElevenLabsWS, Verbose, TEXT("%s"), Buffer);
}

// ─────────────────────────────────────────────────────────────────────────────
// Connect / Disconnect
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FString& APIKeyOverride)
{
    if (ConnectionState == EElevenLabsConnectionState::Connected ||
        ConnectionState == EElevenLabsConnectionState::Connecting)
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("Connect called but already connecting/connected. Ignoring."));
        return;
    }

    if (!FModuleManager::Get().IsModuleLoaded("WebSockets"))
    {
        FModuleManager::LoadModuleChecked<FWebSocketsModule>("WebSockets");
    }

    const FString URL = BuildWebSocketURL(AgentIDOverride, APIKeyOverride);
    if (URL.IsEmpty())
    {
        const FString Msg = TEXT("Cannot connect: no Agent ID configured. Set it in Project Settings or pass it to Connect().");
        UE_LOG(LogElevenLabsWS, Error, TEXT("%s"), *Msg);
        OnError.Broadcast(Msg);
        ConnectionState = EElevenLabsConnectionState::Error;
        return;
    }

    UE_LOG(LogElevenLabsWS, Log, TEXT("Connecting to ElevenLabs: %s"), *URL);
    ConnectionState = EElevenLabsConnectionState::Connecting;

    // Headers: the ElevenLabs Conversational AI WS endpoint accepts the
    // xi-api-key header on the initial HTTP upgrade request.
    TMap<FString, FString> UpgradeHeaders;
    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
    const FString ResolvedKey = APIKeyOverride.IsEmpty() ? Settings->API_Key : APIKeyOverride;
    if (!ResolvedKey.IsEmpty())
    {
        UpgradeHeaders.Add(TEXT("xi-api-key"), ResolvedKey);
    }

    WebSocket = FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), UpgradeHeaders);

    WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
    WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
    WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
    WebSocket->OnMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsMessage);
    WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);

    WebSocket->Connect();
}

void UElevenLabsWebSocketProxy::Disconnect()
{
    if (WebSocket.IsValid() && WebSocket->IsConnected())
    {
        WebSocket->Close(1000, TEXT("Client disconnected"));
    }
    ConnectionState = EElevenLabsConnectionState::Disconnected;
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio & turn control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected."));
        return;
    }
    if (PCMData.Num() == 0) return;

    // ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
    const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());

    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(ElevenLabsMessageType::AudioChunk, Base64Audio);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
    // In client turn mode, signal that the user is active/speaking.
    // API message: { "type": "user_activity" }
    if (!IsConnected()) return;
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserActivity);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
    // In client turn mode, stopping user_activity signals end of user turn.
    // The API uses user_activity for ongoing speech; simply stop sending it.
    // No explicit end message is required — silence is detected server-side.
    // We still log for debug visibility.
    UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended (client mode) — stopped sending user_activity."));
}

void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendTextMessage: not connected."));
        return;
    }
    if (Text.IsEmpty()) return;

    // API: { "type": "user_message", "text": "Hello agent" }
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserMessage);
    Msg->SetStringField(TEXT("text"), Text);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendInterrupt()
{
    if (!IsConnected()) return;
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::Interrupt);
    SendJsonMessage(Msg);
}

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket callbacks
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::OnWsConnected()
{
    UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
    // State stays Connecting until we receive the initiation metadata from the server.
}

void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)
{
    UE_LOG(LogElevenLabsWS, Error, TEXT("WebSocket connection error: %s"), *Error);
    ConnectionState = EElevenLabsConnectionState::Error;
    OnError.Broadcast(Error);
}

void UElevenLabsWebSocketProxy::OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean)
{
    UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket closed. Code=%d Reason=%s Clean=%d"), StatusCode, *Reason, bWasClean);
    ConnectionState = EElevenLabsConnectionState::Disconnected;
    WebSocket.Reset();
    OnDisconnected.Broadcast(StatusCode, Reason);
}

void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
{
    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
    if (Settings->bVerboseLogging)
    {
        UE_LOG(LogElevenLabsWS, Verbose, TEXT(">> %s"), *Message);
    }

    TSharedPtr<FJsonObject> Root;
    TSharedRef<TJsonReader<>> Reader = TJsonReaderFactory<>::Create(Message);
    if (!FJsonSerializer::Deserialize(Reader, Root) || !Root.IsValid())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to parse WebSocket message as JSON (first 80 chars): %.80s"), *Message);
        return;
    }

    FString MsgType;
    // ElevenLabs wraps the type in a "type" field
    if (!Root->TryGetStringField(TEXT("type"), MsgType))
    {
        // Fallback: some messages use the top-level key as the type
        // e.g. { "user_audio_chunk": "..." } from ourselves (shouldn't arrive)
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("Message has no 'type' field, ignoring."));
        return;
    }

    if (MsgType == ElevenLabsMessageType::ConversationInitiation)
    {
        HandleConversationInitiation(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AudioResponse)
    {
        HandleAudioResponse(Root);
    }
    else if (MsgType == ElevenLabsMessageType::UserTranscript)
    {
        HandleTranscript(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AgentResponse)
    {
        HandleAgentResponse(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)
    {
        // Silently ignore for now — corrected text after interruption.
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("agent_response_correction received (ignored)."));
    }
    else if (MsgType == ElevenLabsMessageType::InterruptionEvent)
    {
        HandleInterruption(Root);
    }
    else if (MsgType == ElevenLabsMessageType::PingEvent)
    {
        HandlePing(Root);
    }
    else
    {
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("Unhandled message type: %s"), *MsgType);
    }
}

void UElevenLabsWebSocketProxy::OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining)
{
    // Accumulate fragments until BytesRemaining == 0.
    const uint8* Bytes = static_cast<const uint8*>(Data);
    BinaryFrameBuffer.Append(Bytes, Size);

    if (BytesRemaining > 0)
    {
        // More fragments coming — wait for the rest
        return;
    }

    const int32 TotalSize = BinaryFrameBuffer.Num();

    // Peek at first byte to distinguish JSON (starts with '{') from raw binary audio.
    const bool bLooksLikeJson = (TotalSize > 0 && BinaryFrameBuffer[0] == '{');

    if (bLooksLikeJson)
    {
        // Null-terminate safely then decode as UTF-8 JSON
        BinaryFrameBuffer.Add(0);
        const FString JsonString = FString(UTF8_TO_TCHAR(
            reinterpret_cast<const char*>(BinaryFrameBuffer.GetData())));
        BinaryFrameBuffer.Reset();

        const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
        if (Settings->bVerboseLogging)
        {
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary JSON frame (%d bytes): %.120s"), TotalSize, *JsonString);
|
||||
}
|
||||
|
||||
OnWsMessage(JsonString);
|
||||
}
|
||||
else
|
||||
{
|
||||
// Raw binary audio frame — PCM bytes sent directly without Base64/JSON wrapper.
|
||||
// Log first few bytes as hex to help diagnose the format.
|
||||
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
|
||||
if (Settings->bVerboseLogging)
|
||||
{
|
||||
FString HexPreview;
|
||||
const int32 PreviewBytes = FMath::Min(TotalSize, 8);
|
||||
for (int32 i = 0; i < PreviewBytes; i++)
|
||||
{
|
||||
HexPreview += FString::Printf(TEXT("%02X "), BinaryFrameBuffer[i]);
|
||||
}
|
||||
UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary audio frame: %d bytes | first bytes: %s"), TotalSize, *HexPreview);
|
||||
}
|
||||
|
||||
// Broadcast raw PCM bytes directly to the audio queue.
|
||||
TArray<uint8> PCMData = MoveTemp(BinaryFrameBuffer);
|
||||
BinaryFrameBuffer.Reset();
|
||||
OnAudioReceived.Broadcast(PCMData);
|
||||
}
|
||||
}
|
||||
|
||||
// ─────────────────────────────────────────────────────────────────────────────
|
||||
// Message handlers
|
||||
// ─────────────────────────────────────────────────────────────────────────────
|
||||
void UElevenLabsWebSocketProxy::HandleConversationInitiation(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
// Expected structure:
|
||||
// { "type": "conversation_initiation_metadata",
|
||||
// "conversation_initiation_metadata_event": {
|
||||
// "conversation_id": "...",
|
||||
// "agent_output_audio_format": "pcm_16000"
|
||||
// }
|
||||
// }
|
||||
const TSharedPtr<FJsonObject>* MetaObj = nullptr;
|
||||
if (Root->TryGetObjectField(TEXT("conversation_initiation_metadata_event"), MetaObj) && MetaObj)
|
||||
{
|
||||
(*MetaObj)->TryGetStringField(TEXT("conversation_id"), ConversationInfo.ConversationID);
|
||||
}
|
||||
|
||||
UE_LOG(LogElevenLabsWS, Log, TEXT("Conversation initiated. ID=%s"), *ConversationInfo.ConversationID);
|
||||
ConnectionState = EElevenLabsConnectionState::Connected;
|
||||
OnConnected.Broadcast(ConversationInfo);
|
||||
}
|
||||
|
||||
void UElevenLabsWebSocketProxy::HandleAudioResponse(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
// Expected structure:
|
||||
// { "type": "audio",
|
||||
// "audio_event": { "audio_base_64": "<base64 PCM>", "event_id": 1 }
|
||||
// }
|
||||
const TSharedPtr<FJsonObject>* AudioEvent = nullptr;
|
||||
if (!Root->TryGetObjectField(TEXT("audio_event"), AudioEvent) || !AudioEvent)
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Warning, TEXT("audio message missing 'audio_event' field."));
|
||||
return;
|
||||
}
|
||||
|
||||
FString Base64Audio;
|
||||
if (!(*AudioEvent)->TryGetStringField(TEXT("audio_base_64"), Base64Audio))
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Warning, TEXT("audio_event missing 'audio_base_64' field."));
|
||||
return;
|
||||
}
|
||||
|
||||
TArray<uint8> PCMData;
|
||||
if (!FBase64::Decode(Base64Audio, PCMData))
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to Base64-decode audio data."));
|
||||
return;
|
||||
}
|
||||
|
||||
OnAudioReceived.Broadcast(PCMData);
|
||||
}
|
||||
|
||||
void UElevenLabsWebSocketProxy::HandleTranscript(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
// API structure:
|
||||
// { "type": "user_transcript",
|
||||
// "user_transcription_event": { "user_transcript": "Hello" }
|
||||
// }
|
||||
// This message only carries the user's speech-to-text — speaker is always "user".
|
||||
const TSharedPtr<FJsonObject>* TranscriptEvent = nullptr;
|
||||
if (!Root->TryGetObjectField(TEXT("user_transcription_event"), TranscriptEvent) || !TranscriptEvent)
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Warning, TEXT("user_transcript message missing 'user_transcription_event' field."));
|
||||
return;
|
||||
}
|
||||
|
||||
FElevenLabsTranscriptSegment Segment;
|
||||
Segment.Speaker = TEXT("user");
|
||||
(*TranscriptEvent)->TryGetStringField(TEXT("user_transcript"), Segment.Text);
|
||||
// user_transcript messages are always final (interim results are not sent for user speech)
|
||||
Segment.bIsFinal = true;
|
||||
|
||||
OnTranscript.Broadcast(Segment);
|
||||
}
|
||||
|
||||
void UElevenLabsWebSocketProxy::HandleAgentResponse(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
// { "type": "agent_response",
|
||||
// "agent_response_event": { "agent_response": "..." }
|
||||
// }
|
||||
const TSharedPtr<FJsonObject>* ResponseEvent = nullptr;
|
||||
if (!Root->TryGetObjectField(TEXT("agent_response_event"), ResponseEvent) || !ResponseEvent)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
FString ResponseText;
|
||||
(*ResponseEvent)->TryGetStringField(TEXT("agent_response"), ResponseText);
|
||||
OnAgentResponse.Broadcast(ResponseText);
|
||||
}
|
||||
|
||||
void UElevenLabsWebSocketProxy::HandleInterruption(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Log, TEXT("Agent interrupted."));
|
||||
OnInterrupted.Broadcast();
|
||||
}
|
||||
|
||||
void UElevenLabsWebSocketProxy::HandlePing(const TSharedPtr<FJsonObject>& Root)
|
||||
{
|
||||
// Reply with a pong to keep the connection alive.
|
||||
// Incoming: { "type": "ping", "ping_event": { "event_id": 1, "ping_ms": 150 } }
|
||||
// Reply: { "type": "pong", "event_id": 1 } ← event_id is top-level, no wrapper object
|
||||
int32 EventID = 0;
|
||||
const TSharedPtr<FJsonObject>* PingEvent = nullptr;
|
||||
if (Root->TryGetObjectField(TEXT("ping_event"), PingEvent) && PingEvent)
|
||||
{
|
||||
(*PingEvent)->TryGetNumberField(TEXT("event_id"), EventID);
|
||||
}
|
||||
|
||||
TSharedPtr<FJsonObject> Pong = MakeShareable(new FJsonObject());
|
||||
Pong->SetStringField(TEXT("type"), TEXT("pong"));
|
||||
Pong->SetNumberField(TEXT("event_id"), EventID); // top-level, not nested
|
||||
SendJsonMessage(Pong);
|
||||
}
|
||||
|
||||
// ─────────────────────────────────────────────────────────────────────────────
|
||||
// Helpers
|
||||
// ─────────────────────────────────────────────────────────────────────────────
|
||||
void UElevenLabsWebSocketProxy::SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj)
|
||||
{
|
||||
if (!WebSocket.IsValid() || !WebSocket->IsConnected())
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Warning, TEXT("SendJsonMessage: WebSocket not connected."));
|
||||
return;
|
||||
}
|
||||
|
||||
FString Out;
|
||||
TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&Out);
|
||||
FJsonSerializer::Serialize(JsonObj.ToSharedRef(), Writer);
|
||||
|
||||
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
|
||||
if (Settings->bVerboseLogging)
|
||||
{
|
||||
UE_LOG(LogElevenLabsWS, Verbose, TEXT("<< %s"), *Out);
|
||||
}
|
||||
|
||||
WebSocket->Send(Out);
|
||||
}
|
||||
|
||||
FString UElevenLabsWebSocketProxy::BuildWebSocketURL(const FString& AgentIDOverride, const FString& APIKeyOverride) const
|
||||
{
|
||||
const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
|
||||
|
||||
// Custom URL override takes full precedence
|
||||
if (!Settings->CustomWebSocketURL.IsEmpty())
|
||||
{
|
||||
return Settings->CustomWebSocketURL;
|
||||
}
|
||||
|
||||
const FString ResolvedAgentID = AgentIDOverride.IsEmpty() ? Settings->AgentID : AgentIDOverride;
|
||||
if (ResolvedAgentID.IsEmpty())
|
||||
{
|
||||
return FString();
|
||||
}
|
||||
|
||||
// Official ElevenLabs Conversational AI WebSocket endpoint
|
||||
// wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>
|
||||
return FString::Printf(
|
||||
TEXT("wss://api.elevenlabs.io/v1/convai/conversation?agent_id=%s"),
|
||||
*ResolvedAgentID);
|
||||
}
|
||||
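The binary-frame handling in `OnWsBinaryMessage` above (accumulate WebSocket fragments, then peek at the first byte to tell a JSON control frame from raw PCM audio) can be sketched in standalone C++ without the UE types. The struct and method names here are illustrative, not part of the plugin:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Accumulates WebSocket binary fragments; on the final fragment, the caller
// classifies the whole frame as JSON (first byte '{') or raw PCM audio.
struct BinaryFrameAssembler {
    std::vector<uint8_t> buffer;

    // Returns true once the frame is complete (no bytes remaining in flight).
    bool Append(const uint8_t* data, size_t size, size_t bytesRemaining) {
        buffer.insert(buffer.end(), data, data + size);
        return bytesRemaining == 0;
    }

    // Heuristic used by the proxy: a JSON frame starts with '{'.
    bool LooksLikeJson() const {
        return !buffer.empty() && buffer.front() == '{';
    }

    // Interpret the completed frame as UTF-8 text and clear the buffer.
    std::string TakeJson() {
        std::string json(buffer.begin(), buffer.end());
        buffer.clear();
        return json;
    }
};
```

The real implementation additionally null-terminates the buffer before `UTF8_TO_TCHAR`; the `std::string` range constructor makes that unnecessary here.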
@ -0,0 +1,50 @@
// Copyright ASTERION. All Rights Reserved.

#include "PS_AI_Agent_ElevenLabs.h"
#include "Developer/Settings/Public/ISettingsModule.h"
#include "UObject/UObjectGlobals.h"
#include "UObject/Package.h"

IMPLEMENT_MODULE(FPS_AI_Agent_ElevenLabsModule, PS_AI_Agent_ElevenLabs)

#define LOCTEXT_NAMESPACE "PS_AI_Agent_ElevenLabs"

void FPS_AI_Agent_ElevenLabsModule::StartupModule()
{
    Settings = NewObject<UElevenLabsSettings>(GetTransientPackage(), "ElevenLabsSettings", RF_Standalone);
    Settings->AddToRoot();

    if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
    {
        SettingsModule->RegisterSettings(
            "Project", "Plugins", "ElevenLabsAIAgent",
            LOCTEXT("SettingsName", "ElevenLabs AI Agent"),
            LOCTEXT("SettingsDescription", "Configure the ElevenLabs Conversational AI Agent plugin"),
            Settings);
    }
}

void FPS_AI_Agent_ElevenLabsModule::ShutdownModule()
{
    if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
    {
        SettingsModule->UnregisterSettings("Project", "Plugins", "ElevenLabsAIAgent");
    }

    // On normal shutdown, un-root the settings object so GC can collect it.
    // During exit purge the GC is already tearing everything down, so touching
    // the object is unsafe. Drop the pointer in both cases.
    if (!GExitPurge)
    {
        Settings->RemoveFromRoot();
    }
    Settings = nullptr;
}

UElevenLabsSettings* FPS_AI_Agent_ElevenLabsModule::GetSettings() const
{
    check(Settings);
    return Settings;
}

#undef LOCTEXT_NAMESPACE
@ -0,0 +1,233 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "ElevenLabsDefinitions.h"
#include "ElevenLabsWebSocketProxy.h"
#include "Sound/SoundWaveProcedural.h"
#include "ElevenLabsConversationalAgentComponent.generated.h"

class UAudioComponent;
class UElevenLabsMicrophoneCaptureComponent;

// ─────────────────────────────────────────────────────────────────────────────
// Delegates exposed to Blueprint
// ─────────────────────────────────────────────────────────────────────────────
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentConnected,
    const FElevenLabsConversationInfo&, ConversationInfo);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnAgentDisconnected,
    int32, StatusCode, const FString&, Reason);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentError,
    const FString&, ErrorMessage);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTranscript,
    const FElevenLabsTranscriptSegment&, Segment);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTextResponse,
    const FString&, ResponseText);

DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStartedSpeaking);
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStoppedSpeaking);
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentInterrupted);

// ─────────────────────────────────────────────────────────────────────────────
// UElevenLabsConversationalAgentComponent
//
// Attach this to any Actor (e.g. a character NPC) to give it a voice powered by
// the ElevenLabs Conversational AI API.
//
// Workflow:
//   1. Set AgentID (or rely on the project default).
//   2. Call StartConversation() to open the WebSocket.
//   3. Call StartListening() / StopListening() to control microphone capture.
//   4. React to events (OnAgentTranscript, OnAgentTextResponse, etc.) in Blueprint.
//   5. Call EndConversation() when done.
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(ClassGroup = "ElevenLabs", meta = (BlueprintSpawnableComponent),
    DisplayName = "ElevenLabs Conversational Agent")
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsConversationalAgentComponent : public UActorComponent
{
    GENERATED_BODY()

public:
    UElevenLabsConversationalAgentComponent();

    // ── Configuration ─────────────────────────────────────────────────────────

    /**
     * ElevenLabs Agent ID. Overrides the project-level default in Project Settings.
     * Leave empty to use the project default.
     */
    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
    FString AgentID;

    /**
     * Turn mode:
     * - Server VAD: ElevenLabs detects end-of-speech automatically (recommended).
     * - Client Controlled: you call StartListening/StopListening manually (push-to-talk).
     */
    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
    EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;

    /**
     * Automatically start listening (microphone capture) once the WebSocket is
     * connected and the conversation is initiated.
     */
    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
    bool bAutoStartListening = true;

    // ── Events ────────────────────────────────────────────────────────────────

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentConnected OnAgentConnected;

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentDisconnected OnAgentDisconnected;

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentError OnAgentError;

    /** Fired for every transcript segment (user speech or agent speech, tentative and final). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentTranscript OnAgentTranscript;

    /** Final text response produced by the agent (mirrors the audio). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentTextResponse OnAgentTextResponse;

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentStartedSpeaking OnAgentStartedSpeaking;

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentStoppedSpeaking OnAgentStoppedSpeaking;

    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnAgentInterrupted OnAgentInterrupted;

    // ── Control ───────────────────────────────────────────────────────────────

    /**
     * Open the WebSocket connection and start the conversation.
     * If bAutoStartListening is true, microphone capture also starts once connected.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StartConversation();

    /** Close the WebSocket and stop all audio. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void EndConversation();

    /**
     * Start capturing microphone audio and streaming it to ElevenLabs.
     * In Client turn mode, also sends a UserTurnStart signal.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StartListening();

    /**
     * Stop capturing microphone audio.
     * In Client turn mode, also sends a UserTurnEnd signal.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StopListening();

    /**
     * Send a plain text message to the agent without using the microphone.
     * The agent will respond with audio and text just as if it heard you speak.
     * Useful for testing in the Editor or for text-based interaction.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendTextMessage(const FString& Text);

    /** Interrupt the agent's current utterance. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void InterruptAgent();

    // ── State queries ─────────────────────────────────────────────────────────

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsConnected() const;

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsListening() const { return bIsListening; }

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsAgentSpeaking() const { return bAgentSpeaking; }

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    const FElevenLabsConversationInfo& GetConversationInfo() const;

    /** Access the underlying WebSocket proxy (advanced use). */
    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    UElevenLabsWebSocketProxy* GetWebSocketProxy() const { return WebSocketProxy; }

    // ─────────────────────────────────────────────────────────────────────────
    // UActorComponent overrides
    // ─────────────────────────────────────────────────────────────────────────
    virtual void BeginPlay() override;
    virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;
    virtual void TickComponent(float DeltaTime, ELevelTick TickType,
        FActorComponentTickFunction* ThisTickFunction) override;

private:
    // ── Internal event handlers ───────────────────────────────────────────────
    UFUNCTION()
    void HandleConnected(const FElevenLabsConversationInfo& Info);

    UFUNCTION()
    void HandleDisconnected(int32 StatusCode, const FString& Reason);

    UFUNCTION()
    void HandleError(const FString& ErrorMessage);

    UFUNCTION()
    void HandleAudioReceived(const TArray<uint8>& PCMData);

    UFUNCTION()
    void HandleTranscript(const FElevenLabsTranscriptSegment& Segment);

    UFUNCTION()
    void HandleAgentResponse(const FString& ResponseText);

    UFUNCTION()
    void HandleInterrupted();

    // ── Audio playback ────────────────────────────────────────────────────────
    void InitAudioPlayback();
    void EnqueueAgentAudio(const TArray<uint8>& PCMData);
    void StopAgentAudio();
    /** Called by USoundWaveProcedural when it needs more PCM data. */
    void OnProceduralUnderflow(USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired);

    // ── Microphone streaming ──────────────────────────────────────────────────
    void OnMicrophoneDataCaptured(const TArray<float>& FloatPCM);
    /** Convert float PCM to int16 little-endian bytes for ElevenLabs. */
    static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatPCM);

    // ── Sub-objects ───────────────────────────────────────────────────────────
    UPROPERTY()
    UElevenLabsWebSocketProxy* WebSocketProxy = nullptr;

    UPROPERTY()
    UAudioComponent* AudioPlaybackComponent = nullptr;

    UPROPERTY()
    USoundWaveProcedural* ProceduralSoundWave = nullptr;

    // ── State ─────────────────────────────────────────────────────────────────
    bool bIsListening = false;
    bool bAgentSpeaking = false;

    // Accumulates incoming PCM bytes until the audio component needs data.
    TArray<uint8> AudioQueue;
    FCriticalSection AudioQueueLock;

    // Simple heuristic: if we haven't received audio data for this many ticks,
    // consider the agent done speaking.
    int32 SilentTickCount = 0;
    static constexpr int32 SilenceThresholdTicks = 30; // ~0.5s at 60fps
};
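The private helper `FloatPCMToInt16Bytes` declared above is documented as converting float PCM to int16 little-endian bytes. Its expected behavior (clamp to [-1, 1], scale to the int16 range, emit low byte first) can be sketched in standalone C++; this is an illustrative sketch of that contract, not the plugin's implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert float PCM samples in [-1.0, 1.0] to 16-bit signed little-endian bytes,
// the wire format used for user_audio_chunk payloads.
std::vector<uint8_t> FloatPCMToInt16Bytes(const std::vector<float>& floatPCM) {
    std::vector<uint8_t> bytes;
    bytes.reserve(floatPCM.size() * 2);
    for (float sample : floatPCM) {
        const float clamped = std::clamp(sample, -1.0f, 1.0f); // guard against clipping
        const int16_t s16 = static_cast<int16_t>(clamped * 32767.0f);
        bytes.push_back(static_cast<uint8_t>(s16 & 0xFF));        // low byte first
        bytes.push_back(static_cast<uint8_t>((s16 >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}
```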
@ -0,0 +1,109 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "ElevenLabsDefinitions.generated.h"

// ─────────────────────────────────────────────────────────────────────────────
// Connection state
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsConnectionState : uint8
{
    Disconnected UMETA(DisplayName = "Disconnected"),
    Connecting   UMETA(DisplayName = "Connecting"),
    Connected    UMETA(DisplayName = "Connected"),
    Error        UMETA(DisplayName = "Error"),
};

// ─────────────────────────────────────────────────────────────────────────────
// Agent turn mode
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsTurnMode : uint8
{
    /** ElevenLabs server decides when the user has finished speaking (default). */
    Server UMETA(DisplayName = "Server VAD"),
    /** Client explicitly signals turn start/end (manual push-to-talk). */
    Client UMETA(DisplayName = "Client Controlled"),
};

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket message type helpers (internal, not exposed to Blueprint)
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsMessageType
{
    // Client → Server
    static const FString AudioChunk              = TEXT("user_audio_chunk");
    // Client turn mode: signal the user is currently active/speaking
    static const FString UserActivity            = TEXT("user_activity");
    // Client turn mode: send a text message without audio
    static const FString UserMessage             = TEXT("user_message");
    static const FString Interrupt               = TEXT("interrupt");
    static const FString ClientToolResult        = TEXT("client_tool_result");
    static const FString ConversationClientData  = TEXT("conversation_initiation_client_data");

    // Server → Client
    static const FString ConversationInitiation  = TEXT("conversation_initiation_metadata");
    static const FString AudioResponse           = TEXT("audio");
    // User speech-to-text transcript (speaker is always the user)
    static const FString UserTranscript          = TEXT("user_transcript");
    static const FString AgentResponse           = TEXT("agent_response");
    static const FString AgentResponseCorrection = TEXT("agent_response_correction");
    static const FString InterruptionEvent       = TEXT("interruption");
    static const FString PingEvent               = TEXT("ping");
    static const FString ClientToolCall          = TEXT("client_tool_call");
    static const FString InternalTentativeAgent  = TEXT("internal_tentative_agent_response");
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio format exchanged with ElevenLabs
// PCM 16-bit signed, 16000 Hz, mono, little-endian.
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsAudio
{
    static constexpr int32 SampleRate    = 16000;
    static constexpr int32 Channels      = 1;
    static constexpr int32 BitsPerSample = 16;
    // Chunk size sent per WebSocket frame: 100 ms of audio
    static constexpr int32 ChunkSamples = SampleRate / 10; // 1600 samples = 3200 bytes
}

// ─────────────────────────────────────────────────────────────────────────────
// Conversation metadata received on successful connection
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsConversationInfo
{
    GENERATED_BODY()

    /** Unique ID of this conversation session assigned by ElevenLabs. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString ConversationID;

    /** Agent ID that is responding. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString AgentID;
};

// ─────────────────────────────────────────────────────────────────────────────
// Transcript segment
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsTranscriptSegment
{
    GENERATED_BODY()

    /** Transcribed text. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString Text;

    /** "user" or "agent". */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString Speaker;

    /** Whether this is a final transcript or a tentative (in-progress) one. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    bool bIsFinal = false;
};
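The `ElevenLabsAudio` constants above imply the chunk sizing arithmetic: at 16 kHz mono 16-bit PCM, 100 ms of audio is 1600 samples, or 3200 bytes per WebSocket frame. A standalone C++ sketch verifying that arithmetic (mirroring the constants, not including the UE header):

```cpp
#include <cstdint>

// Mirrors the ElevenLabsAudio namespace: PCM 16-bit signed, 16 kHz, mono.
constexpr int32_t SampleRate    = 16000;
constexpr int32_t Channels      = 1;
constexpr int32_t BitsPerSample = 16;

// One WebSocket frame carries 100 ms of audio.
constexpr int32_t ChunkSamples = SampleRate / 10;
constexpr int32_t ChunkBytes   = ChunkSamples * Channels * (BitsPerSample / 8);

static_assert(ChunkSamples == 1600, "100 ms at 16 kHz is 1600 samples");
static_assert(ChunkBytes == 3200, "1600 samples * 2 bytes per sample");
```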
@ -0,0 +1,73 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "AudioCapture.h"
#include "ElevenLabsMicrophoneCaptureComponent.generated.h"

// Delivers captured float PCM samples (16000 Hz mono, resampled from the device rate).
DECLARE_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioCaptured, const TArray<float>& /*FloatPCM*/);

/**
 * Lightweight microphone capture component.
 * Captures from the default audio input device, resamples to 16000 Hz mono,
 * and delivers chunks via FOnElevenLabsAudioCaptured.
 *
 * Modelled after Convai's ConvaiAudioCaptureComponent but stripped to the
 * minimal functionality needed for the ElevenLabs Conversational AI API.
 */
UCLASS(ClassGroup = "ElevenLabs", meta = (BlueprintSpawnableComponent),
    DisplayName = "ElevenLabs Microphone Capture")
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsMicrophoneCaptureComponent : public UActorComponent
{
    GENERATED_BODY()

public:
    UElevenLabsMicrophoneCaptureComponent();

    /** Volume multiplier applied to captured samples before forwarding. */
    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Microphone",
        meta = (ClampMin = "0.0", ClampMax = "4.0"))
    float VolumeMultiplier = 1.0f;

    /**
     * Delegate fired on the game thread each time a new chunk of PCM audio
     * is captured. Samples are float32, resampled to 16000 Hz mono.
     */
    FOnElevenLabsAudioCaptured OnAudioCaptured;

    /** Open the default capture device and begin streaming audio. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StartCapture();

    /** Stop streaming and close the capture device. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StopCapture();

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsCapturing() const { return bCapturing; }

    // ─────────────────────────────────────────────────────────────────────────
    // UActorComponent overrides
    // ─────────────────────────────────────────────────────────────────────────
    virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;

private:
    /** Called by the audio capture callback on a background thread. Raw void* per the UE 5.3+ API. */
    void OnAudioGenerate(const void* InAudio, int32 NumFrames,
        int32 InNumChannels, int32 InSampleRate, double StreamTime, bool bOverflow);

    /** Simple linear resample from InSampleRate to 16000 Hz. Input is float32 frames. */
    static TArray<float> ResampleTo16000(const float* InAudio, int32 NumFrames,
        int32 InChannels, int32 InSampleRate);

    Audio::FAudioCapture AudioCapture;
    Audio::FAudioCaptureDeviceParams DeviceParams;
    bool bCapturing = false;

    // Device sample rate discovered on StartCapture
    int32 DeviceSampleRate = 44100;
    int32 DeviceChannels = 1;
};
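`ResampleTo16000` is declared above but its body is not part of this diff. A minimal standalone sketch of what a "simple linear resample" with mono downmix could look like, assuming interleaved float32 input frames (averaging channels, then linear interpolation); this is an illustration under those assumptions, not the plugin's actual implementation:

```cpp
#include <cstdint>
#include <vector>

// Downmix interleaved float frames to mono, then linearly resample to 16000 Hz.
std::vector<float> ResampleTo16000(const float* in, int32_t numFrames,
                                   int32_t channels, int32_t inSampleRate) {
    // 1) Average the channels into a mono buffer.
    std::vector<float> mono(numFrames);
    for (int32_t f = 0; f < numFrames; ++f) {
        float sum = 0.0f;
        for (int32_t c = 0; c < channels; ++c) {
            sum += in[f * channels + c];
        }
        mono[f] = sum / channels;
    }

    // 2) Linear interpolation from inSampleRate to 16000 Hz.
    const int32_t outFrames = static_cast<int32_t>(
        static_cast<int64_t>(numFrames) * 16000 / inSampleRate);
    std::vector<float> out(outFrames);
    const double step = static_cast<double>(inSampleRate) / 16000.0;
    for (int32_t i = 0; i < outFrames; ++i) {
        const double pos = i * step;                        // position in input frames
        const int32_t i0 = static_cast<int32_t>(pos);
        const int32_t i1 = (i0 + 1 < numFrames) ? i0 + 1 : i0;
        const float frac = static_cast<float>(pos - i0);
        out[i] = mono[i0] * (1.0f - frac) + mono[i1] * frac;
    }
    return out;
}
```

For production capture paths, a windowed-sinc or polyphase resampler gives better quality; linear interpolation is the cheapest option that still avoids gross artifacts for speech.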
@ -0,0 +1,186 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "UObject/NoExportTypes.h"
#include "ElevenLabsDefinitions.h"
#include "IWebSocket.h"
#include "ElevenLabsWebSocketProxy.generated.h"

// ─────────────────────────────────────────────────────────────────────────────
// Delegates (all Blueprint-assignable)
// ─────────────────────────────────────────────────────────────────────────────

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsConnected,
    const FElevenLabsConversationInfo&, ConversationInfo);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnElevenLabsDisconnected,
    int32, StatusCode, const FString&, Reason);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsError,
    const FString&, ErrorMessage);

/** Fired when a PCM audio chunk arrives from the agent. Raw bytes, 16-bit signed 16 kHz mono. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioReceived,
    const TArray<uint8>&, PCMData);

/** Fired for user or agent transcript segments. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsTranscript,
    const FElevenLabsTranscriptSegment&, Segment);

/** Fired with the final text response from the agent. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAgentResponse,
    const FString&, ResponseText);

/** Fired when the agent is interrupted (e.g. by new user speech). */
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnElevenLabsInterrupted);

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Proxy
// Manages the lifecycle of a single ElevenLabs Conversational AI WebSocket session.
// Instantiate via UElevenLabsConversationalAgentComponent (the component manages
// one proxy at a time), or create manually through Blueprints.
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(BlueprintType, Blueprintable)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsWebSocketProxy : public UObject
{
    GENERATED_BODY()

public:
    // ── Events ────────────────────────────────────────────────────────────────

    /** Called once the WebSocket handshake succeeds and the agent sends its initiation metadata. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsConnected OnConnected;

    /** Called when the WebSocket closes (graceful or remote). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsDisconnected OnDisconnected;

    /** Called on any connection or protocol error. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsError OnError;

    /** Raw PCM audio coming from the agent; feed this into your audio component. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsAudioReceived OnAudioReceived;

    /** User or agent transcript (may be tentative while the conversation is ongoing). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsTranscript OnTranscript;

    /** Final text response from the agent (complements audio). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsAgentResponse OnAgentResponse;

    /** The agent was interrupted by new user speech. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsInterrupted OnInterrupted;

    // ── Lifecycle ─────────────────────────────────────────────────────────────

    /**
     * Open a WebSocket connection to ElevenLabs.
     * Uses settings from Project Settings unless overridden by the parameters.
     *
     * @param AgentID  ElevenLabs agent ID. Overrides the project-level default when non-empty.
     * @param APIKey   API key. Overrides the project-level default when non-empty.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void Connect(const FString& AgentID = TEXT(""), const FString& APIKey = TEXT(""));
|
||||
|
||||
/**
|
||||
* Gracefully close the WebSocket connection.
|
||||
* OnDisconnected will fire after the server acknowledges.
|
||||
*/
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void Disconnect();
|
||||
|
||||
/** Current connection state. */
|
||||
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||
EElevenLabsConnectionState GetConnectionState() const { return ConnectionState; }
|
||||
|
||||
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||
bool IsConnected() const { return ConnectionState == EElevenLabsConnectionState::Connected; }
|
||||
|
||||
// ── Audio sending ─────────────────────────────────────────────────────────
|
||||
|
||||
/**
|
||||
* Send a chunk of raw PCM audio to ElevenLabs.
|
||||
* Audio must be 16-bit signed, 16000 Hz, mono, little-endian.
|
||||
* The data is Base64-encoded and sent as a JSON message.
|
||||
* Call this repeatedly while the microphone is capturing.
|
||||
*
|
||||
* @param PCMData Raw PCM bytes (16-bit LE, 16kHz, mono).
|
||||
*/
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void SendAudioChunk(const TArray<uint8>& PCMData);
|
||||
|
||||
// ── Turn control (only relevant in Client turn mode) ──────────────────────
|
||||
|
||||
/**
|
||||
* Signal that the user is actively speaking (Client turn mode).
|
||||
* Sends a { "type": "user_activity" } message to the server.
|
||||
* Call this periodically while the user is speaking (e.g. every audio chunk).
|
||||
*/
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void SendUserTurnStart();
|
||||
|
||||
/**
|
||||
* Signal that the user has finished speaking (Client turn mode).
|
||||
* No explicit API message — simply stop sending user_activity.
|
||||
* The server detects silence and hands the turn to the agent.
|
||||
*/
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void SendUserTurnEnd();
|
||||
|
||||
/**
|
||||
* Send a text message to the agent (no microphone needed).
|
||||
* Useful for testing or text-only interaction.
|
||||
* Sends: { "type": "user_message", "text": "..." }
|
||||
*/
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void SendTextMessage(const FString& Text);
|
||||
|
||||
/** Ask the agent to stop the current utterance. */
|
||||
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||
void SendInterrupt();
|
||||
|
||||
// ── Info ──────────────────────────────────────────────────────────────────
|
||||
|
||||
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||
const FElevenLabsConversationInfo& GetConversationInfo() const { return ConversationInfo; }
|
||||
|
||||
// ─────────────────────────────────────────────────────────────────────────
|
||||
// Internal
|
||||
// ─────────────────────────────────────────────────────────────────────────
|
||||
private:
|
||||
void OnWsConnected();
|
||||
void OnWsConnectionError(const FString& Error);
|
||||
void OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean);
|
||||
void OnWsMessage(const FString& Message);
|
||||
void OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining);
|
||||
|
||||
void HandleConversationInitiation(const TSharedPtr<FJsonObject>& Payload);
|
||||
void HandleAudioResponse(const TSharedPtr<FJsonObject>& Payload);
|
||||
void HandleTranscript(const TSharedPtr<FJsonObject>& Payload);
|
||||
void HandleAgentResponse(const TSharedPtr<FJsonObject>& Payload);
|
||||
void HandleInterruption(const TSharedPtr<FJsonObject>& Payload);
|
||||
void HandlePing(const TSharedPtr<FJsonObject>& Payload);
|
||||
|
||||
/** Build and send a JSON text frame to the server. */
|
||||
void SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj);
|
||||
|
||||
/** Resolve the WebSocket URL from settings / parameters. */
|
||||
FString BuildWebSocketURL(const FString& AgentID, const FString& APIKey) const;
|
||||
|
||||
TSharedPtr<IWebSocket> WebSocket;
|
||||
EElevenLabsConnectionState ConnectionState = EElevenLabsConnectionState::Disconnected;
|
||||
FElevenLabsConversationInfo ConversationInfo;
|
||||
|
||||
// Accumulation buffer for multi-fragment binary WebSocket frames.
|
||||
// ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
|
||||
TArray<uint8> BinaryFrameBuffer;
|
||||
};
@ -0,0 +1,99 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Modules/ModuleManager.h"
#include "PS_AI_Agent_ElevenLabs.generated.h"

// ─────────────────────────────────────────────────────────────────────────────
// Settings object – exposed in Project Settings → Plugins → ElevenLabs AI Agent
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(config = Engine, defaultconfig)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsSettings : public UObject
{
	GENERATED_BODY()

public:
	UElevenLabsSettings(const FObjectInitializer& ObjectInitializer)
		: Super(ObjectInitializer)
	{
		API_Key = TEXT("");
		AgentID = TEXT("");
		bSignedURLMode = false;
	}

	/**
	 * ElevenLabs API key.
	 * Obtain from https://elevenlabs.io – used to authenticate WebSocket connections.
	 * Keep this secret; do not ship with the key hard-coded in a shipping build.
	 */
	UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
	FString API_Key;

	/**
	 * The default ElevenLabs Conversational Agent ID to use when none is specified
	 * on the component. Create agents at https://elevenlabs.io/app/conversational-ai
	 */
	UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
	FString AgentID;

	/**
	 * When true, the plugin fetches a signed WebSocket URL from your own backend
	 * before connecting, so the API key is never exposed in the client.
	 * Set SignedURLEndpoint to point to your server that returns the signed URL.
	 */
	UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API | Security")
	bool bSignedURLMode;

	/**
	 * Your backend endpoint that returns a signed WebSocket URL for ElevenLabs.
	 * Only used when bSignedURLMode = true.
	 * Expected response body: { "signed_url": "wss://..." }
	 */
	UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API | Security",
		meta = (EditCondition = "bSignedURLMode"))
	FString SignedURLEndpoint;

	/**
	 * Override the ElevenLabs WebSocket base URL. Leave empty to use the default:
	 * wss://api.elevenlabs.io/v1/convai/conversation
	 */
	UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
	FString CustomWebSocketURL;

	/** Log verbose WebSocket messages to the Output Log (useful during development). */
	UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
	bool bVerboseLogging = false;
};


// ─────────────────────────────────────────────────────────────────────────────
// Module
// ─────────────────────────────────────────────────────────────────────────────
class PS_AI_AGENT_ELEVENLABS_API FPS_AI_Agent_ElevenLabsModule : public IModuleInterface
{
public:
	/** IModuleInterface implementation */
	virtual void StartupModule() override;
	virtual void ShutdownModule() override;

	virtual bool IsGameModule() const override { return true; }

	/** Singleton access */
	static inline FPS_AI_Agent_ElevenLabsModule& Get()
	{
		return FModuleManager::LoadModuleChecked<FPS_AI_Agent_ElevenLabsModule>("PS_AI_Agent_ElevenLabs");
	}

	static inline bool IsAvailable()
	{
		return FModuleManager::Get().IsModuleLoaded("PS_AI_Agent_ElevenLabs");
	}

	/** Access the settings object at runtime */
	UElevenLabsSettings* GetSettings() const;

private:
	UElevenLabsSettings* Settings = nullptr;
};