Compare commits: 17 commits, `61710c9fde` ... `302337b573` (per-commit author/SHA/date table omitted).
**New file:** `.claude/MEMORY.md` (75 lines)
# Project Memory – PS_AI_Agent

> This file is committed to the repository so it is available on any machine.
> Claude Code reads it automatically at session start (via the auto-memory system)
> when the working directory is inside this repo.
> **Keep it under ~180 lines** – lines beyond 200 are truncated by the system.

---
## Project Location

- Repo root: `<repo_root>/` (wherever this is cloned)
- UE5 project: `<repo_root>/Unreal/PS_AI_Agent/`
- `.uproject`: `<repo_root>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject`
- Engine: **Unreal Engine 5.5** — Win64 primary target
- Default test map: `/Game/TestMap.TestMap`
## Plugins

| Plugin | Path | Purpose |
|--------|------|---------|
| Convai (reference) | `<repo_root>/ConvAI/Convai/` | gRPC + protobuf streaming to the Convai API. Has an ElevenLabs voice type enum in `ConvaiDefinitions.h`. Used as an architectural reference. |
| **PS_AI_Agent_ElevenLabs** | `<repo_root>/Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/` | Our ElevenLabs Conversational AI integration. See `.claude/elevenlabs_plugin.md` for full details. |
## User Preferences

- Plugin naming: `PS_AI_Agent_<Service>` (e.g. `PS_AI_Agent_ElevenLabs`)
- Save memory frequently during long sessions
- Goal: ElevenLabs Conversational AI integration — simpler than Convai, no gRPC
- Full original ask + intent: see `.claude/project_context.md`
- Git remote is a **private server** — no public exposure risk
## Key UE5 Plugin Patterns

- Settings object: `UCLASS(config=Engine, defaultconfig)` inheriting `UObject`, registered via `ISettingsModule`
- Module startup: `NewObject<USettings>(..., RF_Standalone)` + `AddToRoot()`
- WebSocket: `FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), Headers)`
- `WebSockets` is a **module** (Build.cs only) — NOT a plugin; don't put it in `.uplugin`
- Audio capture: `Audio::FAudioCapture::OpenAudioCaptureStream()` (UE 5.3+, replaces the deprecated `OpenCaptureStream`)
- `AudioCapture` IS a plugin — declare it in the `.uplugin` Plugins array
- Callback type: `FOnAudioCaptureFunction` = `TFunction<void(const void*, int32, int32, int32, double, bool)>`
- Cast the `const void*` to `const float*` inside — the device delivers float32 interleaved samples
- Procedural audio playback: `USoundWaveProcedural` + the `OnSoundWaveProceduralUnderflow` delegate
- Audio capture callbacks arrive on a **background thread** — always marshal to the game thread with `AsyncTask(ENamedThreads::GameThread, ...)`
- Resample mic audio to **16000 Hz mono** before sending to ElevenLabs
- `TArray::RemoveAt(idx, count, EAllowShrinking::No)` — the bool overload is deprecated in UE 5.5
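The float32 → int16 conversion step in the capture pattern above can be sketched in plain, engine-free C++ (illustrative only — the plugin's actual helper is not shown here, and out-of-range samples are clamped because gain can push mic input past ±1.0):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Convert float32 samples in [-1, 1] to signed 16-bit PCM, clamping
// out-of-range values before scaling.
std::vector<int16_t> FloatToInt16PCM(const std::vector<float>& Samples)
{
    std::vector<int16_t> Out;
    Out.reserve(Samples.size());
    for (float S : Samples)
    {
        const float Clamped = std::max(-1.0f, std::min(1.0f, S));
        Out.push_back(static_cast<int16_t>(Clamped * 32767.0f));
    }
    return Out;
}
```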
## Plugin Status

- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
- v1.1.0 — all 3 protocol bugs fixed (transcript fields, pong format, client turn mode)
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
- Connection confirmed working end-to-end; audio receive path functional
## ElevenLabs WebSocket Protocol Notes

- **ALL frames are binary** — `OnRawMessage` handles everything; `OnMessage` (text) never fires
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
- Client turn mode: `{"type":"user_activity"}` to signal speaking; no explicit end message
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
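The first two protocol rules above (first-byte discrimination, top-level `event_id` in the pong) can be sketched in standalone C++ — names like `EFrameKind`, `ClassifyFrame`, and `MakePong` are illustrative, not the plugin's actual symbols:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class EFrameKind { Json, PcmAudio };

// ElevenLabs sends every frame as binary, so peek byte[0]:
// '{' (0x7B) means a JSON control message, anything else is raw PCM.
EFrameKind ClassifyFrame(const std::vector<uint8_t>& Frame)
{
    return (!Frame.empty() && Frame[0] == '{') ? EFrameKind::Json
                                               : EFrameKind::PcmAudio;
}

// Pong reply — note event_id sits at the top level, not nested.
std::string MakePong(int64_t EventId)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}
```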
## API Keys / Secrets

- The ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **The key is stripped from `DefaultEngine.ini` before every commit** — do not commit it
- Each developer sets the key locally; it does not go in git
## Claude Memory Files in This Repo

| File | Contents |
|------|----------|
| `.claude/MEMORY.md` | This file — project structure, patterns, status |
| `.claude/elevenlabs_plugin.md` | Plugin file map, ElevenLabs WS protocol, design decisions |
| `.claude/elevenlabs_api_reference.md` | Full ElevenLabs API reference (WS messages, REST, signed URL, Agent ID location) |
| `.claude/project_context.md` | Original ask, intent, short/long-term goals |
| `.claude/session_log_2026-02-19.md` | Full session record: steps, commits, technical decisions, next steps |
| `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` | User-facing Markdown reference doc |
---

**New file:** `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` (619 lines)
# PS_AI_Agent_ElevenLabs — Plugin Documentation

**Engine**: Unreal Engine 5.5
**Plugin version**: 1.1.0
**Status**: Beta — tested on UE 5.5 Win64, verified connection and audio pipeline
**API**: [ElevenLabs Conversational AI](https://elevenlabs.io/docs/eleven-agents/quickstart)

---
## Table of Contents

1. [Overview](#1-overview)
2. [Installation](#2-installation)
3. [Project Settings](#3-project-settings)
4. [Quick Start (Blueprint)](#4-quick-start-blueprint)
5. [Quick Start (C++)](#5-quick-start-c)
6. [Components Reference](#6-components-reference)
   - [UElevenLabsConversationalAgentComponent](#uelevenlabsconversationalagentcomponent)
   - [UElevenLabsMicrophoneCaptureComponent](#uelevenlabsmicrophonecapturecomponent)
   - [UElevenLabsWebSocketProxy](#uelevenlabswebsocketproxy)
7. [Data Types Reference](#7-data-types-reference)
8. [Turn Modes](#8-turn-modes)
9. [Security — Signed URL Mode](#9-security--signed-url-mode)
10. [Audio Pipeline](#10-audio-pipeline)
11. [Common Patterns](#11-common-patterns)
12. [Troubleshooting](#12-troubleshooting)
13. [Changelog](#13-changelog)

---
## 1. Overview

This plugin integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5, enabling real-time voice conversations between a player and an NPC (or any Actor).

### How it works

```
Player microphone
        │
        ▼
UElevenLabsMicrophoneCaptureComponent
  • Captures from the default audio device
  • Resamples to 16 kHz mono float32
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Converts float32 → int16 PCM bytes
  • Base64-encodes and sends via WebSocket
        │  (wss://api.elevenlabs.io/v1/convai/conversation)
        ▼
ElevenLabs Conversational AI Agent
  • Transcribes speech
  • Runs LLM
  • Synthesizes voice (ElevenLabs TTS)
        │
        ▼
UElevenLabsConversationalAgentComponent
  • Receives raw binary PCM audio frames
  • Feeds USoundWaveProcedural → UAudioComponent
        │
        ▼
Agent voice plays from the Actor's position in the world
```

### Key properties

- No gRPC, no third-party libraries — uses UE's built-in `WebSockets` and `AudioCapture` modules
- Blueprint-first: all events and controls are exposed to Blueprint
- Real-time bidirectional: audio streams in both directions simultaneously
- Server VAD (default) or push-to-talk
- Text input supported (no microphone needed for testing)

### Wire frame protocol notes

ElevenLabs sends **all WebSocket frames as binary** (not text frames). The plugin handles two binary frame types automatically:

- **JSON control frames** (start with `{`) — conversation init, transcripts, agent responses, ping/pong
- **Raw PCM audio frames** — agent speech audio, played directly via `USoundWaveProcedural`

---
## 2. Installation

The plugin lives inside the project, not the engine, so no separate install is needed.

### Verify it is enabled

Open `Unreal/PS_AI_Agent/PS_AI_Agent.uproject` and confirm:

```json
{
  "Name": "PS_AI_Agent_ElevenLabs",
  "Enabled": true
}
```

### First compile

Open the project in the UE 5.5 Editor. It will detect the new plugin and ask to recompile — click **Yes**. Alternatively, compile from the command line (one command; `^` is the cmd.exe line continuation):

```
"C:\Program Files\Epic Games\UE_5.5\Engine\Build\BatchFiles\Build.bat" ^
  PS_AI_AgentEditor Win64 Development ^
  "<repo>/Unreal/PS_AI_Agent/PS_AI_Agent.uproject" -WaitMutex
```

---
## 3. Project Settings

Go to **Edit → Project Settings → Plugins → ElevenLabs AI Agent**.

| Setting | Description | Required |
|---|---|---|
| **API Key** | Your ElevenLabs API key. Find it at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | Yes (unless using Signed URL Mode or a public agent) |
| **Agent ID** | Default agent ID. Find it in the URL when editing an agent: `elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>` | Yes (unless set per-component) |
| **Signed URL Mode** | Fetch the WS URL from your own backend (keeps the key off the client). See [Section 9](#9-security--signed-url-mode) | No |
| **Signed URL Endpoint** | Your backend URL returning `{ "signed_url": "wss://..." }` | Only if Signed URL Mode = true |
| **Custom WebSocket URL** | Override the default `wss://api.elevenlabs.io/...` endpoint (debug only) | No |
| **Verbose Logging** | Log every WebSocket frame type and its first bytes to the Output Log | No |

> **Security note**: The API key set in Project Settings is saved to `DefaultEngine.ini`. **Never commit this file with the key in it** — strip the `[ElevenLabsSettings]` section before committing. Use Signed URL Mode for production builds.

> **Finding your Agent ID**: Go to [elevenlabs.io/app/conversational-ai](https://elevenlabs.io/app/conversational-ai), click your agent, and copy the ID from the URL bar or the agent's Overview/API tab.

---
## 4. Quick Start (Blueprint)

### Step 1 — Add the component to an NPC

1. Open your NPC Blueprint (or any Actor Blueprint).
2. In the **Components** panel, click **Add** → search for **ElevenLabs Conversational Agent**.
3. Select the component. In the **Details** panel you can optionally set a specific **Agent ID** (overrides the project default).

### Step 2 — Set Turn Mode

In the component's **Details** panel:

- **Server VAD** (default): ElevenLabs automatically detects when the player stops speaking. The microphone streams continuously once connected.
- **Client Controlled**: You call `Start Listening` / `Stop Listening` manually (push-to-talk).

### Step 3 — Wire up events in the Event Graph

```
Event BeginPlay
  └─► [ElevenLabs Agent] Start Conversation

[ElevenLabs Agent] On Agent Connected
  └─► Print String "Connected! ConvID: " + Conversation Info → Conversation ID

[ElevenLabs Agent] On Agent Text Response
  └─► Set Text (UI widget) ← Response Text

[ElevenLabs Agent] On Agent Transcript
  └─► (optional) display live subtitles ← Segment → Text

[ElevenLabs Agent] On Agent Started Speaking
  └─► Play talking animation on NPC

[ElevenLabs Agent] On Agent Stopped Speaking
  └─► Return to idle animation

[ElevenLabs Agent] On Agent Error
  └─► Print String "Error: " + Error Message

Event EndPlay
  └─► [ElevenLabs Agent] End Conversation
```

### Step 4 — Push-to-talk (Client Controlled mode only)

```
Input Action "Talk" (Pressed)
  └─► [ElevenLabs Agent] Start Listening

Input Action "Talk" (Released)
  └─► [ElevenLabs Agent] Stop Listening
```

### Step 5 — Testing without a microphone

Once connected, use **Send Text Message** instead of speaking:

```
[ElevenLabs Agent] On Agent Connected
  └─► [ElevenLabs Agent] Send Text Message ← "Hello, who are you?"
```

The agent will reply with audio and text exactly as if it had heard you speak.

---
## 5. Quick Start (C++)

### 1. Add the plugin to your module's Build.cs

```csharp
PrivateDependencyModuleNames.Add("PS_AI_Agent_ElevenLabs");
```

### 2. Include and use

```cpp
#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsDefinitions.h"

// In your Actor's header:
UPROPERTY(VisibleAnywhere)
UElevenLabsConversationalAgentComponent* ElevenLabsAgent;

// In the constructor:
ElevenLabsAgent = CreateDefaultSubobject<UElevenLabsConversationalAgentComponent>(
    TEXT("ElevenLabsAgent"));

// Override the Agent ID at runtime (optional):
ElevenLabsAgent->AgentID = TEXT("your_agent_id_here");
ElevenLabsAgent->TurnMode = EElevenLabsTurnMode::Server;
ElevenLabsAgent->bAutoStartListening = true;

// Bind events:
ElevenLabsAgent->OnAgentConnected.AddDynamic(
    this, &AMyNPC::HandleAgentConnected);
ElevenLabsAgent->OnAgentTextResponse.AddDynamic(
    this, &AMyNPC::HandleAgentResponse);
ElevenLabsAgent->OnAgentStartedSpeaking.AddDynamic(
    this, &AMyNPC::PlayTalkingAnimation);

// Start the conversation:
ElevenLabsAgent->StartConversation();

// Send a text message (useful for testing without a mic):
ElevenLabsAgent->SendTextMessage(TEXT("Hello, who are you?"));

// Later, to end:
ElevenLabsAgent->EndConversation();
```

### 3. Callback signatures

```cpp
UFUNCTION()
void HandleAgentConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogTemp, Log, TEXT("Connected, ConvID=%s"), *Info.ConversationID);
}

UFUNCTION()
void HandleAgentResponse(const FString& ResponseText)
{
    // Display in UI, drive subtitles, etc.
}

UFUNCTION()
void PlayTalkingAnimation()
{
    // Switch to the talking anim montage
}
```

---
## 6. Components Reference

### UElevenLabsConversationalAgentComponent

The **main component** — attach this to any Actor that should be able to speak.

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `AgentID` | `FString` | `""` | Agent ID for this actor. Overrides the project-level default when non-empty. |
| `TurnMode` | `EElevenLabsTurnMode` | `Server` | How speaker turns are detected. See [Section 8](#8-turn-modes). |
| `bAutoStartListening` | `bool` | `true` | If true, starts mic capture automatically once the WebSocket is connected and ready. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartConversation()` | Callable | Opens the WebSocket connection. If `bAutoStartListening` is true, mic capture starts once `OnAgentConnected` fires. |
| `EndConversation()` | Callable | Closes the WebSocket, stops the mic, stops audio playback. |
| `StartListening()` | Callable | Starts microphone capture and streams to ElevenLabs. In Client mode, also sends `user_activity`. |
| `StopListening()` | Callable | Stops microphone capture. In Client mode, stops sending `user_activity`. |
| `SendTextMessage(Text)` | Callable | Sends a text message to the agent without using the microphone. The agent replies with full audio + text. Useful for testing. |
| `InterruptAgent()` | Callable | Stops the agent's current utterance immediately and clears the audio queue. |
| `IsConnected()` | Pure | Returns true if the WebSocket is open and the conversation is active. |
| `IsListening()` | Pure | Returns true if the microphone is currently capturing. |
| `IsAgentSpeaking()` | Pure | Returns true if agent audio is currently playing. |
| `GetConversationInfo()` | Pure | Returns `FElevenLabsConversationInfo` (ConversationID, AgentID). |
| `GetWebSocketProxy()` | Pure | Returns the underlying `UElevenLabsWebSocketProxy` for advanced use. |

#### Events

| Event | Parameters | Fired when |
|---|---|---|
| `OnAgentConnected` | `FElevenLabsConversationInfo` | WebSocket handshake + agent initiation metadata received. Safe to call `SendTextMessage` here. |
| `OnAgentDisconnected` | `int32 StatusCode`, `FString Reason` | WebSocket closed (graceful or remote). |
| `OnAgentError` | `FString ErrorMessage` | Connection or protocol error. |
| `OnAgentTranscript` | `FElevenLabsTranscriptSegment` | User speech-to-text transcript received (speaker is always `"user"`). |
| `OnAgentTextResponse` | `FString ResponseText` | Final text response from the agent (mirrors the audio). |
| `OnAgentStartedSpeaking` | — | First audio chunk received from the agent (audio playback begins). |
| `OnAgentStoppedSpeaking` | — | Audio queue empty for ~0.5 s (heuristic — agent done speaking). |
| `OnAgentInterrupted` | — | Agent speech was interrupted (by the user or by `InterruptAgent()`). |

---
### UElevenLabsMicrophoneCaptureComponent

A lightweight microphone capture component. Managed automatically by `UElevenLabsConversationalAgentComponent` — you only need to use it directly for advanced scenarios (e.g. custom audio routing).

**Category**: ElevenLabs
**Inherits from**: `UActorComponent`

#### Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `VolumeMultiplier` | `float` | `1.0` | Gain applied to captured samples before resampling. Range: 0.0 – 4.0. |

#### Functions

| Function | Blueprint | Description |
|---|---|---|
| `StartCapture()` | Callable | Opens the default audio input device and begins streaming. |
| `StopCapture()` | Callable | Stops streaming and closes the device. |
| `IsCapturing()` | Pure | True while actively capturing. |

#### Delegate

`OnAudioCaptured` — fires on the **game thread** with `TArray<float>` PCM samples at 16 kHz mono. Bind to this if you want to process or forward audio manually.

---
### UElevenLabsWebSocketProxy

Low-level WebSocket session manager. Used internally by `UElevenLabsConversationalAgentComponent`. Use it directly only if you need fine-grained protocol control.

**Inherits from**: `UObject`
**Instantiate via**: `NewObject<UElevenLabsWebSocketProxy>(Outer)`

#### Key functions

| Function | Description |
|---|---|
| `Connect(AgentID, APIKey)` | Opens the WS connection. Parameters override project settings when non-empty. |
| `Disconnect()` | Sends a close frame and tears down the connection. |
| `SendAudioChunk(PCMData)` | Sends raw int16 LE PCM bytes as a Base64 JSON frame. Called automatically by the agent component. |
| `SendTextMessage(Text)` | Sends `{"type":"user_message","text":"..."}`. The agent replies as if it had heard speech. |
| `SendUserTurnStart()` | Client turn mode: sends `{"type":"user_activity"}` to signal the user is speaking. |
| `SendUserTurnEnd()` | Client turn mode: stops sending `user_activity` (no explicit message — the server detects silence). |
| `SendInterrupt()` | Asks the agent to stop speaking: sends `{"type":"interrupt"}`. |
| `GetConnectionState()` | Returns `EElevenLabsConnectionState`. |
| `GetConversationInfo()` | Returns `FElevenLabsConversationInfo`. |

---
## 7. Data Types Reference

### EElevenLabsConnectionState

```
Disconnected — No active connection
Connecting   — WebSocket handshake in progress / awaiting conversation_initiation_metadata
Connected    — Conversation active and ready (fires OnAgentConnected)
Error        — Connection or protocol failure
```

> Note: The state remains `Connecting` until the server sends `conversation_initiation_metadata`. `OnAgentConnected` fires on the transition to `Connected`.

### EElevenLabsTurnMode

```
Server — ElevenLabs Voice Activity Detection decides when the user stops speaking (recommended)
Client — Your code calls StartListening/StopListening to define turns (push-to-talk)
```

### FElevenLabsConversationInfo

```
ConversationID  FString — Unique session ID assigned by ElevenLabs
AgentID         FString — The agent ID for this session
```

### FElevenLabsTranscriptSegment

```
Text      FString — Transcribed text
Speaker   FString — "user" (agent text comes via OnAgentTextResponse, not transcript)
bIsFinal  bool    — Always true for user transcripts (ElevenLabs sends final only)
```

---
## 8. Turn Modes

### Server VAD (default)

ElevenLabs runs Voice Activity Detection on the server. The plugin streams microphone audio continuously and ElevenLabs decides when the user has finished speaking.

**When to use**: Casual conversation, hands-free interaction, natural dialogue.

```
StartConversation() → mic streams continuously (if bAutoStartListening = true)
ElevenLabs detects speech / silence automatically
Agent replies when it detects end-of-speech
```

### Client Controlled (push-to-talk)

Your code explicitly signals turn boundaries with `StartListening()` / `StopListening()`. The plugin sends `{"type":"user_activity"}` while the user is speaking; stopping it signals the end of the turn.

**When to use**: Noisy environments, precise control, walkie-talkie style UI.

```
Input Pressed  → StartListening() → streams audio + sends user_activity
Input Released → StopListening()  → stops audio (no explicit end message)
Server detects silence and hands the turn to the agent
```

---
## 9. Security — Signed URL Mode

By default, the API key is stored in Project Settings (`DefaultEngine.ini`). This is fine for development but **should not be shipped in packaged builds**, as the key could be extracted.

### Production setup

1. Enable **Signed URL Mode** in Project Settings.
2. Set **Signed URL Endpoint** to a URL on your own backend (e.g. `https://your-server.com/api/elevenlabs-token`).
3. Your backend authenticates the player and calls the ElevenLabs API to generate a signed WebSocket URL, returning:
   ```json
   { "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..." }
   ```
4. The plugin fetches this URL before connecting — the API key never leaves your server.

### Development workflow (API key in project settings)

- Set the key in **Project Settings → Plugins → ElevenLabs AI Agent**
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
- **Strip this section from `DefaultEngine.ini` before every git commit**
- Each developer sets the key locally — it does not go in version control

---
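For illustration, extracting `signed_url` from the backend response can be sketched in standalone C++. This is a deliberately minimal string scan — real code should use a proper JSON parser (in UE, `FJsonSerializer`), and it assumes the value contains no escaped quotes:

```cpp
#include <cassert>
#include <string>

// Pulls the string value of "signed_url" out of a small JSON response
// like: { "signed_url": "wss://..." }. Returns "" if not found.
std::string ExtractSignedUrl(const std::string& Json)
{
    const size_t KeyPos = Json.find("\"signed_url\"");
    if (KeyPos == std::string::npos) return "";
    const size_t Colon = Json.find(':', KeyPos);
    if (Colon == std::string::npos) return "";
    const size_t Open = Json.find('"', Colon + 1);   // opening quote of the value
    if (Open == std::string::npos) return "";
    const size_t Close = Json.find('"', Open + 1);   // closing quote
    if (Close == std::string::npos) return "";
    return Json.substr(Open + 1, Close - Open - 1);
}
```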
## 10. Audio Pipeline

### Input (player → agent)

```
Device (any sample rate, any channels)
  ↓ FAudioCapture — UE built-in (UE 5.3+ API: OpenAudioCaptureStream)
  ↓ Callback: const void* → cast to float32 interleaved frames
  ↓ Downmix to mono (average all channels)
  ↓ Resample to 16000 Hz (linear interpolation)
  ↓ Apply VolumeMultiplier
  ↓ Dispatch to the Game Thread (AsyncTask)
  ↓ Convert float32 → int16 signed, little-endian bytes
  ↓ Base64 encode
  ↓ Send as a binary WebSocket frame: { "user_audio_chunk": "<base64>" }
```
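The downmix and linear-interpolation resample steps above can be sketched in plain C++ (a standalone illustration under the stated assumptions — interleaved float input, integer sample rates — not the plugin's exact code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Downmix interleaved multi-channel float audio to mono by averaging channels.
std::vector<float> DownmixToMono(const std::vector<float>& Interleaved, int NumChannels)
{
    std::vector<float> Mono;
    if (NumChannels <= 0) return Mono;
    const size_t Frames = Interleaved.size() / NumChannels;
    Mono.reserve(Frames);
    for (size_t F = 0; F < Frames; ++F)
    {
        float Sum = 0.0f;
        for (int C = 0; C < NumChannels; ++C)
            Sum += Interleaved[F * NumChannels + C];
        Mono.push_back(Sum / NumChannels);
    }
    return Mono;
}

// Linear-interpolation resampler, e.g. a 48000 Hz device rate → 16000 Hz.
std::vector<float> ResampleLinear(const std::vector<float>& In, int SrcRate, int DstRate)
{
    if (In.empty() || SrcRate == DstRate) return In;
    const size_t OutLen =
        In.size() * static_cast<size_t>(DstRate) / static_cast<size_t>(SrcRate);
    std::vector<float> Out(OutLen);
    const double Step = static_cast<double>(SrcRate) / DstRate;
    for (size_t I = 0; I < OutLen; ++I)
    {
        const double Pos = I * Step;                       // position in source samples
        const size_t I0 = static_cast<size_t>(Pos);
        const size_t I1 = std::min(I0 + 1, In.size() - 1); // clamp at the last sample
        const double Frac = Pos - static_cast<double>(I0);
        Out[I] = static_cast<float>(In[I0] * (1.0 - Frac) + In[I1] * Frac);
    }
    return Out;
}
```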
### Output (agent → player)

```
Binary WebSocket frame arrives
  ↓ Peek the first byte:
      • '{'   → UTF-8 JSON: parse the type field, dispatch to a handler
      • other → raw PCM audio bytes
  ↓ [Audio path] Raw int16 LE PCM bytes at 16000 Hz mono
  ↓ Enqueue in a thread-safe AudioQueue (FCriticalSection)
  ↓ USoundWaveProcedural::OnSoundWaveProceduralUnderflow pulls from the queue
  ↓ UAudioComponent plays from the Actor's world position (3D spatialized)
```

**Audio format** (both directions): PCM 16-bit signed, 16000 Hz, mono, little-endian.
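The thread-safe queue in the output path can be sketched with standard C++ primitives (`std::mutex` standing in for UE's `FCriticalSection`; `FPcmByteQueue` is an illustrative name, not the plugin's class):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// A byte queue in the spirit of the plugin's AudioQueue: the WebSocket
// thread pushes decoded PCM, and the audio render callback (the
// OnSoundWaveProceduralUnderflow handler) pops up to N bytes per call.
class FPcmByteQueue
{
public:
    void Push(const uint8_t* Data, size_t NumBytes)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        Buffer.insert(Buffer.end(), Data, Data + NumBytes);
    }

    // Pops at most MaxBytes into Out; returns the number actually popped.
    size_t Pop(uint8_t* Out, size_t MaxBytes)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        const size_t N = std::min(MaxBytes, Buffer.size());
        std::memcpy(Out, Buffer.data(), N);
        Buffer.erase(Buffer.begin(), Buffer.begin() + N);
        return N;
    }

    bool IsEmpty()
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        return Buffer.empty();
    }

private:
    std::mutex Mutex;
    std::vector<uint8_t> Buffer;
};
```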
### Silence detection heuristic

`OnAgentStoppedSpeaking` fires when the `AudioQueue` has been empty for **30 consecutive ticks** (~0.5 s at 60 fps). If the agent has natural pauses, increase `SilenceThresholdTicks` in the header:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0 s
```

---
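The tick-counting heuristic can be sketched as a small standalone class (`FSilenceDetector` is an illustrative name; the plugin tracks this state inside the component rather than in a separate class):

```cpp
#include <cassert>

// Fires once after the audio queue has been empty for ThresholdTicks
// consecutive ticks, then re-arms when audio resumes.
class FSilenceDetector
{
public:
    explicit FSilenceDetector(int InThresholdTicks) : ThresholdTicks(InThresholdTicks) {}

    // Call once per tick; returns true exactly when "stopped speaking" should fire.
    bool Tick(bool bQueueEmpty)
    {
        if (!bQueueEmpty)
        {
            EmptyTicks = 0;   // audio is flowing — reset and re-arm
            bFired = false;
            return false;
        }
        ++EmptyTicks;
        if (!bFired && EmptyTicks >= ThresholdTicks)
        {
            bFired = true;    // fire only once per silence stretch
            return true;
        }
        return false;
    }

private:
    int ThresholdTicks = 30;
    int EmptyTicks = 0;
    bool bFired = false;
};
```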
## 11. Common Patterns

### Test the connection without a microphone

```
BeginPlay → StartConversation()

OnAgentConnected → SendTextMessage("Hello, introduce yourself")

OnAgentTextResponse → Print string (confirms the text pipeline works)
OnAgentStartedSpeaking → (confirms the audio pipeline works)
```

### Show subtitles in UI

```
OnAgentTranscript:
    Segment → Text → show in the player subtitle widget (speaker is always "user")

OnAgentTextResponse:
    ResponseText → show in the NPC speech bubble
```

### Interrupt the agent when the player starts speaking

In Server VAD mode ElevenLabs handles this automatically. For manual control:

```
OnAgentStartedSpeaking → set an "agent is speaking" flag
Input Action (any) → if the agent is speaking → InterruptAgent()
```

### Multiple NPCs with different agents

Each NPC Blueprint has its own `UElevenLabsConversationalAgentComponent`. Set a different `AgentID` on each component. WebSocket connections are fully independent.

### Only start the conversation when the player is nearby

```
On Begin Overlap (trigger volume around NPC)
  └─► [ElevenLabs Agent] Start Conversation

On End Overlap
  └─► [ElevenLabs Agent] End Conversation
```

### Adjust microphone volume

Get the `UElevenLabsMicrophoneCaptureComponent` from the owner and set `VolumeMultiplier`:

```cpp
UElevenLabsMicrophoneCaptureComponent* Mic =
    GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();
if (Mic)
{
    Mic->VolumeMultiplier = 2.0f;
}
```

---
## 12. Troubleshooting
|
||||||
|
|
||||||
|
### Plugin doesn't appear in Project Settings
|
||||||
|
|
||||||
|
Ensure the plugin is enabled in `.uproject` and the project was recompiled after adding it.
|
||||||
|
|
||||||
|
### WebSocket connection fails immediately

- Check the **API Key** is set correctly in Project Settings.
- Check the **Agent ID** exists in your ElevenLabs account (find it in the dashboard URL or via `GET /v1/convai/agents`).
- Enable **Verbose Logging** in Project Settings and check the Output Log for the exact WS URL and error.
- Ensure outbound port 443 (WSS) is not blocked by your firewall.

### `OnAgentConnected` never fires

- The connection was made but `conversation_initiation_metadata` has not been received yet — check Verbose Logging.
- If you see `"Binary audio frame"` logs but no `"Conversation initiated"`, the initiation JSON frame may be arriving as a non-`{` binary frame. Check the hex prefix logged at Verbose level.

### No audio from the microphone

- Windows may require microphone permission. Check **Settings → Privacy → Microphone**.
- Try setting `VolumeMultiplier` to `2.0` on the `MicrophoneCaptureComponent`.
- Check the Output Log for `"Failed to open default audio capture stream"`.

### Agent audio is choppy or silent

- The `USoundWaveProcedural` queue may be underflowing due to network jitter. Check latency.
- Verify the audio format matches: the plugin expects raw PCM 16-bit 16 kHz mono from the server. If ElevenLabs sends a different format (e.g. `mp3_44100`), audio will sound garbled — check `agent_output_audio_format` in the `conversation_initiation_metadata` via Verbose Logging.
- Ensure no other component is using the same `UAudioComponent`.

### `OnAgentStoppedSpeaking` fires too early

Increase `SilenceThresholdTicks` in `ElevenLabsConversationalAgentComponent.h`:

```cpp
static constexpr int32 SilenceThresholdTicks = 60; // ~1.0 s at 60 fps
```

### Build error: "Plugin AudioCapture not found"

Make sure the `AudioCapture` plugin is enabled. It should be auto-enabled via the `.uplugin` dependency, but you can add it manually to the `.uproject`:

```json
{ "Name": "AudioCapture", "Enabled": true }
```

### `"Received unexpected binary WebSocket frame"` in the log

This warning no longer appears in v1.1.0+. If you see it, you are running an older build — recompile the plugin.

---

## 13. Changelog

### v1.1.0 — 2026-02-19

**Bug fixes:**

- **Binary WebSocket frames**: ElevenLabs sends all frames as binary (not text). All frames were previously discarded. Now correctly handled — JSON control frames decoded as UTF-8, raw PCM audio frames routed directly to the audio queue.
- **Transcript message**: wrong message type (`"transcript"` → `"user_transcript"`), wrong event key (`"transcript_event"` → `"user_transcription_event"`), wrong text field (`"message"` → `"user_transcript"`).
- **Pong format**: `event_id` was nested inside a `pong_event` object; corrected to a top-level field per the API spec.
- **Client turn mode**: `user_turn_start`/`user_turn_end` are not valid API messages; replaced with `user_activity` (start) and implicit silence (end).

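The binary-frame fix described in the bug list above can be sketched language-agnostically. This Python fragment (the helper name `route_frame` is illustrative, not plugin API) shows the dispatch rule: a frame whose first byte is `{` is a JSON control message, anything else is raw PCM audio.

```python
import json

def route_frame(frame: bytes):
    """Classify one raw WebSocket frame from ElevenLabs.

    JSON control frames start with '{' and are decoded as UTF-8;
    everything else is treated as raw PCM audio bytes.
    """
    if frame[:1] == b"{":
        return ("json", json.loads(frame.decode("utf-8")))
    return ("audio", frame)
```

In the plugin itself the equivalent check lives in the `OnRawMessage` handler; this sketch only captures the decision rule.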
**New features:**

- `SendTextMessage(Text)` on both `UElevenLabsConversationalAgentComponent` and `UElevenLabsWebSocketProxy` — send text to the agent without a microphone. Useful for testing.
- Verbose logging shows a binary-frame hex preview and a JSON-frame content prefix.
- The JSON parse error log now shows the first 80 characters of the failing message.

### v1.0.0 — 2026-02-19

Initial implementation. Plugin compiles cleanly on UE 5.5 Win64.

---

*Documentation updated 2026-02-19 — Plugin v1.1.0 — UE 5.5*
463  .claude/elevenlabs_api_reference.md  (new file)
@ -0,0 +1,463 @@

# ElevenLabs Conversational AI – API Reference

> Saved for Claude Code sessions. Auto-loaded via `.claude/` directory.
> Last updated: 2026-02-19

---

## 1. Agent ID — Where to Find It

### In the Dashboard (UI)
1. Go to **https://elevenlabs.io/app/conversational-ai**
2. Click on your agent to open it
3. The **Agent ID** is shown on the agent settings page — typically in the URL bar and/or in the agent's "General" settings tab
   - URL pattern: `https://elevenlabs.io/app/conversational-ai/agents/<AGENT_ID>`
   - Also visible in the "API" or "Overview" tab of the agent editor (copy button available)

### Via API
```http
GET https://api.elevenlabs.io/v1/convai/agents
xi-api-key: YOUR_API_KEY
```
Returns a list of all agents with their `agent_id` strings.

### Via API (single agent)
```http
GET https://api.elevenlabs.io/v1/convai/agents/{agent_id}
xi-api-key: YOUR_API_KEY
```

### Agent ID Format
- Type: `string`
- Returned on agent creation via `POST /v1/convai/agents/create`
- Used as a URL path param and WebSocket query param throughout the API

---

## 2. WebSocket Conversational AI

### Connection URL
```
wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<AGENT_ID>
```

Regional alternatives:

| Region | URL |
|--------|-----|
| Default (Global) | `wss://api.elevenlabs.io/` |
| US | `wss://api.us.elevenlabs.io/` |
| EU | `wss://api.eu.residency.elevenlabs.io/` |
| India | `wss://api.in.residency.elevenlabs.io/` |

### Authentication
- **Public agents**: no key required, just the `agent_id` query param
- **Private agents**: use a **Signed URL** (see Section 4) instead of a direct `agent_id`
- **Server-side** (backend): pass `xi-api-key` as an HTTP upgrade header

```
Headers:
  xi-api-key: YOUR_API_KEY
```

> ⚠️ Never expose your API key client-side. For browser/mobile apps, use Signed URLs.

---

## 3. WebSocket Protocol — Message Reference

### Audio Format
- **Input (mic → server)**: PCM 16-bit signed, **16000 Hz**, mono, little-endian, Base64-encoded
- **Output (server → client)**: Base64-encoded audio (format specified in `conversation_initiation_metadata`)

---

### Messages FROM Server (Subscribe / Receive)

#### `conversation_initiation_metadata`
Sent immediately after connection. Contains the conversation ID and audio format specs.
```json
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "string",
    "agent_output_audio_format": "pcm_16000 | mp3_44100 | ...",
    "user_input_audio_format": "pcm_16000"
  }
}
```

#### `audio`
Agent speech audio chunk.
```json
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "BASE64_PCM_BYTES",
    "event_id": 42
  }
}
```

#### `user_transcript`
Transcribed text of what the user said.
```json
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "Hello, how are you?"
  }
}
```

#### `agent_response`
The text the agent is saying (arrives in parallel with the audio).
```json
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "I'm doing great, thanks!"
  }
}
```

#### `agent_response_correction`
Sent after an interruption — shows what was truncated.
```json
{
  "type": "agent_response_correction",
  "agent_response_correction_event": {
    "original_agent_response": "string",
    "corrected_agent_response": "string"
  }
}
```

#### `interruption`
Signals that a specific audio event was interrupted.
```json
{
  "type": "interruption",
  "interruption_event": {
    "event_id": 42
  }
}
```

#### `ping`
Keepalive ping from the server. The client must reply with `pong`.
```json
{
  "type": "ping",
  "ping_event": {
    "event_id": 1,
    "ping_ms": 150
  }
}
```

#### `client_tool_call`
Requests that the client execute a tool (custom tools integration).
```json
{
  "type": "client_tool_call",
  "client_tool_call": {
    "tool_name": "string",
    "tool_call_id": "string",
    "parameters": {}
  }
}
```

#### `contextual_update`
Text context added to the conversation state (non-interrupting).
```json
{
  "type": "contextual_update",
  "contextual_update_event": {
    "text": "string"
  }
}
```

#### `vad_score`
Voice Activity Detection confidence score (0.0–1.0).
```json
{
  "type": "vad_score",
  "vad_score_event": {
    "vad_score": 0.85
  }
}
```

#### `internal_tentative_agent_response`
Preliminary agent text during LLM generation (not final).
```json
{
  "type": "internal_tentative_agent_response",
  "tentative_agent_response_internal_event": {
    "tentative_agent_response": "string"
  }
}
```

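A client-side dispatcher for the server messages above can be sketched as follows. This is a minimal Python illustration, not plugin code; the `EVENT_KEYS` table simply mirrors the nested-object names listed in this section (note the irregular ones: `user_transcript` nests under `user_transcription_event`, and the tentative response nests under `tentative_agent_response_internal_event`).

```python
import json

# Server message type → key of its nested event object (from the sections above)
EVENT_KEYS = {
    "conversation_initiation_metadata": "conversation_initiation_metadata_event",
    "audio": "audio_event",
    "user_transcript": "user_transcription_event",
    "agent_response": "agent_response_event",
    "agent_response_correction": "agent_response_correction_event",
    "interruption": "interruption_event",
    "ping": "ping_event",
    "client_tool_call": "client_tool_call",
    "contextual_update": "contextual_update_event",
    "vad_score": "vad_score_event",
    "internal_tentative_agent_response": "tentative_agent_response_internal_event",
}

def parse_server_message(raw: str):
    """Return (message type, nested event dict) for one server JSON frame."""
    msg = json.loads(raw)
    mtype = msg.get("type")
    # Fall back to the whole message for unknown types
    event = msg.get(EVENT_KEYS.get(mtype, ""), msg)
    return mtype, event
```

A real client would switch on the returned type (queue audio, surface transcripts, reply to pings); this sketch only normalizes the envelope.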
---
### Messages TO Server (Publish / Send)
#### `user_audio_chunk`
Microphone audio data. Send continuously during user speech.
```json
{
  "user_audio_chunk": "BASE64_PCM_16BIT_16KHZ_MONO"
}
```
Audio must be **PCM 16-bit signed, 16000 Hz, mono, little-endian**, then Base64-encoded.

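As an illustration of that encoding, the following Python sketch (the helper name `make_user_audio_chunk` is hypothetical) packs float samples into little-endian int16 PCM and wraps them in the message:

```python
import base64
import json
import struct

def make_user_audio_chunk(samples):
    """samples: iterable of floats in [-1, 1] at 16 kHz mono.

    Returns the JSON message string to send over the WebSocket.
    """
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    # '<h' = little-endian signed 16-bit, as required by the API
    pcm = struct.pack("<%dh" % len(clamped), *(int(s * 32767) for s in clamped))
    return json.dumps({"user_audio_chunk": base64.b64encode(pcm).decode("ascii")})
```

Resampling to 16 kHz mono must happen before this step; see Section 7 for where that sits in the pipeline.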
#### `pong`
Reply to a server `ping` to keep the connection alive.
```json
{
  "type": "pong",
  "event_id": 1
}
```

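Building the reply is mechanical; a minimal Python sketch (note that `event_id` sits at the top level of the pong, unlike the server's nested `ping_event`):

```python
import json

def make_pong(ping_message: dict) -> str:
    """Build the pong reply for a parsed server ping message."""
    # event_id is copied from ping_event but emitted as a top-level field
    return json.dumps({"type": "pong",
                       "event_id": ping_message["ping_event"]["event_id"]})
```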
#### `conversation_initiation_client_data`
Override the agent configuration at connection time. Send as the first message after connecting.
```json
{
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "agent": {
      "prompt": { "prompt": "Custom system prompt override" },
      "first_message": "Hello! How can I help?",
      "language": "en"
    },
    "tts": {
      "voice_id": "string",
      "speed": 1.0,
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  },
  "dynamic_variables": {
    "user_name": "Alice",
    "session_id": 12345
  }
}
```

Config override ranges:
- `tts.speed`: 0.7 – 1.2
- `tts.stability`: 0.0 – 1.0
- `tts.similarity_boost`: 0.0 – 1.0

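A small builder that enforces the documented ranges by clamping can be sketched in Python (a hypothetical helper for illustration, not part of the plugin or the ElevenLabs SDK):

```python
def make_config_override(prompt=None, first_message=None, voice_id=None,
                         speed=1.0, stability=0.5, similarity_boost=0.75):
    """Build a conversation_initiation_client_data message, clamping TTS ranges."""
    def clamp(v, lo, hi):
        return max(lo, min(hi, v))

    agent = {}
    if prompt is not None:
        agent["prompt"] = {"prompt": prompt}
    if first_message is not None:
        agent["first_message"] = first_message

    tts = {
        "speed": clamp(speed, 0.7, 1.2),
        "stability": clamp(stability, 0.0, 1.0),
        "similarity_boost": clamp(similarity_boost, 0.0, 1.0),
    }
    if voice_id is not None:
        tts["voice_id"] = voice_id

    return {
        "type": "conversation_initiation_client_data",
        "conversation_config_override": {"agent": agent, "tts": tts},
    }
```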
#### `client_tool_result`
Response to a `client_tool_call` from the server.
```json
{
  "type": "client_tool_result",
  "tool_call_id": "string",
  "result": "tool output string",
  "is_error": false
}
```

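One possible client-side round trip, sketched in Python (the `tools` registry and `handle_tool_call` helper are illustrative assumptions, not API surface): look up the named tool, run it with the supplied parameters, and answer with a matching `tool_call_id`.

```python
import json

def handle_tool_call(message: dict, tools: dict) -> str:
    """Execute a parsed client_tool_call and build the client_tool_result reply."""
    call = message["client_tool_call"]
    try:
        result = tools[call["tool_name"]](**call.get("parameters", {}))
        return json.dumps({"type": "client_tool_result",
                           "tool_call_id": call["tool_call_id"],
                           "result": str(result),
                           "is_error": False})
    except Exception as exc:
        # Report failures back to the agent instead of dropping the call
        return json.dumps({"type": "client_tool_result",
                           "tool_call_id": call["tool_call_id"],
                           "result": str(exc),
                           "is_error": True})
```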
#### `contextual_update`
Inject context without interrupting the conversation.
```json
{
  "type": "contextual_update",
  "text": "User just entered room 4B"
}
```

#### `user_message`
Send a text message (no mic audio needed).
```json
{
  "type": "user_message",
  "text": "What is the weather like?"
}
```

#### `user_activity`
Signal that the user is active (for turn detection in client mode).
```json
{
  "type": "user_activity"
}
```

---

## 4. Signed URL (Private Agents)

Used by browser/mobile clients to authenticate without exposing the API key.

### Flow
1. The **backend** calls the ElevenLabs API to get a temporary signed URL
2. The backend returns the signed URL to the client
3. The **client** opens a WebSocket to the signed URL (no API key needed)

### Get Signed URL
```http
GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=<AGENT_ID>
xi-api-key: YOUR_API_KEY
```

Optional query params:
- `include_conversation_id=true` — generates a unique conversation ID, prevents URL reuse
- `branch_id` — specific agent branch

Response:
```json
{
  "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=...&token=..."
}
```

Client connects to `signed_url` directly — no headers needed.

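The backend half of the flow reduces to one authenticated GET. This Python sketch only builds the request (a hypothetical helper with no network call; in production the backend would perform the GET and forward `signed_url` to the game client):

```python
from urllib.parse import urlencode

def signed_url_request(agent_id: str, api_key: str,
                       include_conversation_id: bool = False):
    """Return (url, headers) for the backend's signed-URL fetch."""
    params = {"agent_id": agent_id}
    if include_conversation_id:
        params["include_conversation_id"] = "true"
    url = ("https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?"
           + urlencode(params))
    # The API key stays on the backend; only the resulting signed_url
    # is handed to the client.
    return url, {"xi-api-key": api_key}
```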
---

## 5. Agents REST API

Base URL: `https://api.elevenlabs.io`
Auth header: `xi-api-key: YOUR_API_KEY`

### Create Agent
```http
POST /v1/convai/agents/create
Content-Type: application/json

{
  "name": "My NPC Agent",
  "conversation_config": {
    "agent": {
      "first_message": "Hello adventurer!",
      "prompt": { "prompt": "You are a wise tavern keeper in a fantasy world." },
      "language": "en"
    }
  }
}
```
Response includes `agent_id`.

### List Agents
```http
GET /v1/convai/agents?page_size=30&search=&sort_by=created_at&sort_direction=desc
```
Response:
```json
{
  "agents": [
    {
      "agent_id": "abc123xyz",
      "name": "My NPC Agent",
      "created_at_unix_secs": 1708300000,
      "last_call_time_unix_secs": null,
      "archived": false,
      "tags": []
    }
  ],
  "has_more": false,
  "next_cursor": null
}
```

### Get Agent
```http
GET /v1/convai/agents/{agent_id}
```

### Update Agent
```http
PATCH /v1/convai/agents/{agent_id}
Content-Type: application/json

{ "name": "Updated Name", "conversation_config": { ... } }
```

### Delete Agent
```http
DELETE /v1/convai/agents/{agent_id}
```

---

## 6. Turn Modes

### Server VAD (Default / Recommended)
- The ElevenLabs server detects when the user stops speaking
- The client streams audio continuously
- The server handles all turn-taking automatically

### Client Turn Mode
- The client explicitly signals turn boundaries
- Send `user_activity` to indicate the user is speaking
- Use when you have your own VAD or a push-to-talk UI

---

## 7. Audio Pipeline (UE5 Implementation Notes)

```
Microphone (FAudioCapture)
  → float32 samples at device rate (e.g. 44100 Hz stereo)
  → Resample to 16000 Hz mono
  → Convert float32 → int16 little-endian
  → Base64-encode
  → Send as {"user_audio_chunk": "BASE64"}

Server → {"type":"audio","audio_event":{"audio_base_64":"BASE64"}}
  → Base64-decode
  → Raw PCM bytes
  → Push to USoundWaveProcedural
  → UAudioComponent plays back
```

### Float32 → Int16 Conversion (C++)
```cpp
static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatSamples)
{
    TArray<uint8> Bytes;
    Bytes.SetNumUninitialized(FloatSamples.Num() * 2);
    for (int32 i = 0; i < FloatSamples.Num(); i++)
    {
        const float Clamped = FMath::Clamp(FloatSamples[i], -1.f, 1.f);
        const int16 Sample = (int16)(Clamped * 32767.f);
        Bytes[i * 2]     = (uint8)(Sample & 0xFF);        // Low byte
        Bytes[i * 2 + 1] = (uint8)((Sample >> 8) & 0xFF); // High byte
    }
    return Bytes;
}
```

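The "Resample to 16000 Hz mono" step of the pipeline is not shown in C++. A linear-interpolation resampler can be sketched in Python for reference (illustrative only; the plugin's in-engine implementation may differ in detail):

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Linear-interpolation resample of mono float samples to dst_rate."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    ratio = src_rate / dst_rate  # input samples consumed per output sample
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

Stereo input would first be mixed down to mono (e.g. averaging channel pairs) before this step.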
---

## 8. Quick Integration Checklist (UE5 Plugin)

- [ ] Set `AgentID` in `UElevenLabsSettings` (Project Settings → ElevenLabs AI Agent)
  - Or override per-component via `UElevenLabsConversationalAgentComponent::AgentID`
- [ ] Set `API_Key` in settings (or leave empty for public agents)
- [ ] Add `UElevenLabsConversationalAgentComponent` to your NPC actor
- [ ] Set `TurnMode` (default: `Server` — recommended)
- [ ] Bind to events: `OnAgentConnected`, `OnAgentTranscript`, `OnAgentTextResponse`, `OnAgentStartedSpeaking`, `OnAgentStoppedSpeaking`
- [ ] Call `StartConversation()` to begin
- [ ] Call `EndConversation()` when done

---

## 9. Key API URLs Reference

| Purpose | URL |
|---------|-----|
| Dashboard | https://elevenlabs.io/app/conversational-ai |
| API Keys | https://elevenlabs.io/app/settings/api-keys |
| WebSocket endpoint | wss://api.elevenlabs.io/v1/convai/conversation |
| Agents list | GET https://api.elevenlabs.io/v1/convai/agents |
| Agent by ID | GET https://api.elevenlabs.io/v1/convai/agents/{agent_id} |
| Create agent | POST https://api.elevenlabs.io/v1/convai/agents/create |
| Signed URL | GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url |
| WS protocol docs | https://elevenlabs.io/docs/eleven-agents/api-reference/eleven-agents/websocket |
| Quickstart | https://elevenlabs.io/docs/eleven-agents/quickstart |

61  .claude/elevenlabs_plugin.md  (new file)
@ -0,0 +1,61 @@

# PS_AI_Agent_ElevenLabs Plugin

## Location
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`

## File Map
```
PS_AI_Agent_ElevenLabs.uplugin
Source/PS_AI_Agent_ElevenLabs/
  PS_AI_Agent_ElevenLabs.Build.cs
  Public/
    PS_AI_Agent_ElevenLabs.h                       – FPS_AI_Agent_ElevenLabsModule + UElevenLabsSettings
    ElevenLabsDefinitions.h                        – Enums, structs, ElevenLabsMessageType/Audio constants
    ElevenLabsWebSocketProxy.h/.cpp                – UObject managing one WS session
    ElevenLabsConversationalAgentComponent.h/.cpp  – Main ActorComponent (attach to NPC)
    ElevenLabsMicrophoneCaptureComponent.h/.cpp    – Mic capture, resample, dispatch to game thread
  Private/
    (implementations of the above)
```

## ElevenLabs Conversational AI Protocol
- **WebSocket URL**: `wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>`
- **Auth**: HTTP upgrade header `xi-api-key: <key>` (set in Project Settings)
- **All frames**: JSON text (no binary frames used by the API)
- **Audio format**: PCM 16-bit signed, 16000 Hz, mono, little-endian — Base64-encoded in JSON

### Client → Server messages
| Type field value | Payload |
|---|---|
| *(none – key is the type)* `user_audio_chunk` | `{ "user_audio_chunk": "<base64 PCM>" }` |
| `user_turn_start` | `{ "type": "user_turn_start" }` |
| `user_turn_end` | `{ "type": "user_turn_end" }` |
| `interrupt` | `{ "type": "interrupt" }` |
| `pong` | `{ "type": "pong", "pong_event": { "event_id": N } }` |

### Server → Client messages (field: `type`)
| type value | Key nested object | Notes |
|---|---|---|
| `conversation_initiation_metadata` | `conversation_initiation_metadata_event.conversation_id` | Marks WS ready |
| `audio` | `audio_event.audio_base_64` | Base64 PCM from agent |
| `transcript` | `transcript_event.{speaker, message, is_final}` | User or agent speech |
| `agent_response` | `agent_response_event.agent_response` | Final agent text |
| `interruption` | — | Agent stopped mid-sentence |
| `ping` | `ping_event.event_id` | Must reply with pong |

## Key Design Decisions
- **No gRPC / no ThirdParty libs** — pure UE WebSockets + HTTP, builds out of the box
- Audio resampled in-plugin: device rate → 16000 Hz mono (linear interpolation)
- `USoundWaveProcedural` for real-time agent audio playback (queue-driven)
- Silence heuristic: 30 game-thread ticks (~0.5 s at 60 fps) with no new audio → agent done speaking
- `bSignedURLMode` setting: fetch a signed WS URL from your own backend (keeps the API key off the client)
- Two turn modes: `Server VAD` (ElevenLabs detects speech end) and `Client Controlled` (push-to-talk)

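The silence heuristic above can be sketched as a simple tick counter (illustrative Python mirroring the description; the class and method names are hypothetical, and the plugin's actual component logic may differ):

```python
class SilenceDetector:
    """Tick-based heuristic: N frames with no new audio → speech is over."""

    def __init__(self, threshold_ticks=30):  # ~0.5 s at 60 fps
        self.threshold_ticks = threshold_ticks
        self.ticks_since_audio = 0
        self.speaking = False

    def on_audio_chunk(self):
        """Call whenever a new agent audio chunk arrives."""
        self.ticks_since_audio = 0
        self.speaking = True

    def tick(self):
        """Call once per game-thread tick; True on the frame speech ends."""
        if not self.speaking:
            return False
        self.ticks_since_audio += 1
        if self.ticks_since_audio >= self.threshold_ticks:
            self.speaking = False
            return True  # would fire OnAgentStoppedSpeaking here
        return False
```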
## Build Dependencies (Build.cs)
Core, CoreUObject, Engine, InputCore, Json, JsonUtilities, WebSockets, HTTP,
AudioMixer, AudioCaptureCore, AudioCapture, Voice, SignalProcessing

## Status
- **Session 1** (2026-02-19): All source files written, registered in the .uproject. Not yet compiled.
- **TODO**: Open in the UE 5.5 Editor → compile → test a basic WS connection with a test agent ID.
- **Watch out**: Verify the `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature against the UE 5.5 API.

79  .claude/project_context.md  (new file)
@ -0,0 +1,79 @@

# Project Context & Original Ask

## What the user wants to build

A **UE5 plugin** that integrates the **ElevenLabs Conversational AI Agent** API into Unreal Engine 5.5, allowing an in-game NPC (or any Actor) to hold a real-time voice conversation with a player.

### The original request (paraphrased)
> "I want to create a plugin to use the ElevenLabs Conversational Agent in Unreal Engine 5.5.
> I previously used the Convai plugin, which does what I want, but I prefer ElevenLabs' quality.
> The goal is to create a plugin in the existing Unreal project as a first step for integration.
> The Convai AI plugin may be too big in terms of functionality for the new project, but it is the final goal.
> You can use the Convai source code to find the right way to make the ElevenLabs version —
> it should be very similar."

### Plugin name
`PS_AI_Agent_ElevenLabs`

---

## User's mental model / intent

1. **Short-term**: A working first-step plugin — minimal but functional — that can:
   - Connect to ElevenLabs Conversational AI via WebSocket
   - Capture microphone audio from the player
   - Stream it to ElevenLabs in real time
   - Play back the agent's voice response
   - Surface key events (transcript, agent text, speaking state) to Blueprint

2. **Long-term**: Match the full feature set of Convai — character IDs, session memory, actions/environment context, lip-sync, etc. — but powered by ElevenLabs instead.

3. **Key preference**: Simpler than Convai. No gRPC, no protobuf, no ThirdParty precompiled libraries. ElevenLabs' Conversational AI API uses plain WebSocket + JSON, which maps naturally to UE's built-in `WebSockets` module.

---

## How we used Convai as a reference

We studied the Convai plugin source (`ConvAI/Convai/`) to understand:
- **Module structure**: `UConvaiSettings` + `IModuleInterface` + `ISettingsModule` registration
- **Audio capture pattern**: `Audio::FAudioCapture`, ring buffers, thread-safe dispatch to the game thread
- **Audio playback pattern**: `USoundWaveProcedural` fed from a queue
- **Component architecture**: `UConvaiChatbotComponent` (NPC side) + `UConvaiPlayerComponent` (player side)
- **HTTP proxy pattern**: `UConvaiAPIBaseProxy` base class for async REST calls
- **Voice type enum**: Convai already had `EVoiceType::ElevenLabsVoices` — confirming ElevenLabs is a natural fit

We then replaced gRPC/protobuf with **WebSocket + JSON** to match the ElevenLabs API, and simplified the architecture to the minimum needed for a first working version.

---

## What was built (Session 1 — 2026-02-19)

All source files created and registered. See `.claude/elevenlabs_plugin.md` for the full file map and protocol details.

### Components created
| Class | Role |
|---|---|
| `UElevenLabsSettings` | Project Settings UI — API key, Agent ID, security options |
| `UElevenLabsWebSocketProxy` | Manages one WS session: connect, send audio, handle all server message types |
| `UElevenLabsConversationalAgentComponent` | ActorComponent to attach to any NPC — orchestrates mic + WS + playback |
| `UElevenLabsMicrophoneCaptureComponent` | Wraps `Audio::FAudioCapture`, resamples to 16 kHz mono |

### Not yet done (next sessions)
- Compile & test in the UE 5.5 Editor
- Verify the `USoundWaveProcedural::OnSoundWaveProceduralUnderflow` delegate signature for UE 5.5
- Add lip-sync support (future)
- Add session memory / conversation history (future)
- Add environment/action context support (future, matching Convai's full feature set)

---

## Notes on the ElevenLabs API
- Docs: https://elevenlabs.io/docs/conversational-ai
- Create agents at: https://elevenlabs.io/app/conversational-ai
- API keys at: https://elevenlabs.io (dashboard)

200  .claude/session_log_2026-02-19.md  (new file)
@ -0,0 +1,200 @@

# Session Log — 2026-02-19

**Project**: PS_AI_Agent (Unreal Engine 5.5)
**Machine**: Desktop PC (j_foucher)
**Working directory**: `E:\ASTERION\GIT\PS_AI_Agent`

---

## Conversation Summary

### 1. Initial Request
User asked to create a plugin to use the ElevenLabs Conversational AI Agent in UE 5.5.
Reference: the existing Convai plugin (gRPC-based, more complex). Goal: a simpler version using ElevenLabs.
Plugin name requested: `PS_AI_Agent_ElevenLabs`.

### 2. Codebase Exploration
Explored the Convai plugin source at `ConvAI/Convai/` to understand:
- Module/settings structure
- AudioCapture patterns
- HTTP proxy pattern
- gRPC streaming architecture (to know what to replace with WebSocket)
- Convai already had `EVoiceType::ElevenLabsVoices` — confirming the direction

### 3. Plugin Created
All source files written from scratch under:
`Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/`

Files created:
- `PS_AI_Agent_ElevenLabs.uplugin`
- `PS_AI_Agent_ElevenLabs.Build.cs`
- `Public/PS_AI_Agent_ElevenLabs.h` — Module + `UElevenLabsSettings`
- `Public/ElevenLabsDefinitions.h` — Enums, structs, protocol constants
- `Public/ElevenLabsWebSocketProxy.h` + `.cpp` — WS session manager
- `Public/ElevenLabsConversationalAgentComponent.h` + `.cpp` — Main NPC component
- `Public/ElevenLabsMicrophoneCaptureComponent.h` + `.cpp` — Mic capture
- `PS_AI_Agent.uproject` — Plugin registered

Commit: `f0055e8`

### 4. Memory Files Created
To allow context recovery on any machine (including the laptop):
- `.claude/MEMORY.md` — project structure + patterns (auto-loaded by Claude Code)
- `.claude/elevenlabs_plugin.md` — plugin file map + API protocol details
- `.claude/project_context.md` — original ask, intent, short/long-term goals
- A local copy also at `C:\Users\j_foucher\.claude\projects\...\memory\`

Commit: `f0055e8` (with plugin), updated in `4d6ae10`

### 5. .gitignore Updated
Added to the existing ignores:
- `Unreal/PS_AI_Agent/Plugins/*/Binaries/`
- `Unreal/PS_AI_Agent/Plugins/*/Intermediate/`
- `Unreal/PS_AI_Agent/*.sln` / `*.suo`
- `.claude/settings.local.json`
- `generate_pptx.py`

Commits: `4d6ae10`, `b114ab0`

### 6. Compile — First Attempt (Errors Found)
Ran `Build.bat PS_AI_AgentEditor Win64 Development`. Errors:
- `WebSockets` was listed in the `.uplugin` — it's a module, not a plugin → removed
- `OpenDefaultCaptureStream` doesn't exist in UE 5.5 → use `OpenAudioCaptureStream`
- The `FOnAudioCaptureFunction` callback uses `const void*`, not `const float*` → fixed the cast
- `TArray::RemoveAt(0, N, false)` is deprecated → use `EAllowShrinking::No`
- `AudioCapture` is a plugin and must be in the `.uplugin` Plugins array → added

Commit: `bb1a857`

### 7. Compile — Success
Clean build, no warnings, no errors.
Output: `Plugins/PS_AI_Agent_ElevenLabs/Binaries/Win64/UnrealEditor-PS_AI_Agent_ElevenLabs.dll`

Memory updated with confirmed UE 5.5 API patterns. Commit: `3b98edc`

### 8. Documentation — Markdown
Full reference doc written to `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:
- Installation, Project Settings, Quick Start (BP + C++), Components Reference, Data Types, Turn Modes, Security/Signed URL, Audio Pipeline, Common Patterns, Troubleshooting.

Commit: `c833ccd`

### 9. Documentation — PowerPoint

20-slide dark-themed PowerPoint generated via Python (python-pptx 1.0.2):

- File: `PS_AI_Agent_ElevenLabs_Documentation.pptx` in repo root
- Covers all sections with visual layout, code blocks, flow diagrams, colour-coded elements
- Generator script `generate_pptx.py` excluded from git via .gitignore

Commit: `1b72026`

---

## Session 2 — 2026-02-19 (continued context)

### 10. API vs Implementation Cross-Check (3 bugs found and fixed)

Cross-referenced `elevenlabs_api_reference.md` against the plugin source. Found 3 protocol bugs:

**Bug 1 — Transcript fields wrong:**

- Type: `"transcript"` → `"user_transcript"`
- Event key: `"transcript_event"` → `"user_transcription_event"`
- Field: `"message"` → `"user_transcript"`

**Bug 2 — Pong format wrong:**

- `event_id` was nested in `pong_event{}` → must be top-level

**Bug 3 — Client turn mode messages don't exist:**

- `"user_turn_start"` / `"user_turn_end"` are not valid API types
- Replaced: start → `"user_activity"`, end → no-op (server detects silence)

Commit: `ae2c9b9`

### 11. SendTextMessage Added

User asked for text input to the agent for testing (without a mic).
Added `SendTextMessage(FString)` to `UElevenLabsWebSocketProxy` and `UElevenLabsConversationalAgentComponent`.
Sends `{"type":"user_message","text":"..."}` — the agent replies with audio + text.

Commit: `b489d11`

### 12. Binary WebSocket Frame Fix

User reported `"Received unexpected binary WebSocket frame"` warnings.
Root cause: ElevenLabs sends **all** WebSocket frames as binary, never text.
`OnMessage` (the text handler) never fires; `OnRawMessage` must handle everything.

Fix: implemented `OnWsBinaryMessage` with fragment reassembly (`BinaryFrameBuffer`).

Commit: `669c503`

### 13. JSON vs PCM Discrimination Fix

After the binary fix: `"Failed to parse WebSocket message as JSON"` errors.
Root cause: binary frames carry BOTH JSON control messages AND raw PCM audio.

Fix: peek at byte[0] of the assembled buffer:

- `'{'` (0x7B) → UTF-8 JSON → route to `OnWsMessage()`
- anything else → raw PCM audio → broadcast to `OnAudioReceived`

Commit: `4834567`

### 14. Documentation Updated to v1.1.0

Full rewrite of `.claude/PS_AI_Agent_ElevenLabs_Documentation.md`:

- Added Changelog section (v1.0.0 / v1.1.0)
- Updated audio pipeline (binary PCM path, not Base64 JSON)
- Added `SendTextMessage` to all function tables and examples
- Corrected turn mode docs, transcript docs, `OnAgentConnected` timing
- New troubleshooting entries

Commit: `e464cfe`

### 15. Test Blueprint Asset Updated

`test_AI_Actor.uasset` updated in the UE Editor.

Commit: `99017f4`

---

## Git History (this session)

| Hash | Message |
|------|---------|
| `f0055e8` | Add PS_AI_Agent_ElevenLabs plugin (initial implementation) |
| `4d6ae10` | Update .gitignore: exclude plugin build artifacts and local Claude settings |
| `b114ab0` | Broaden .gitignore: use glob for all plugin Binaries/Intermediate |
| `bb1a857` | Fix compile errors in PS_AI_Agent_ElevenLabs plugin |
| `3b98edc` | Update memory: document confirmed UE 5.5 API patterns and plugin compile status |
| `c833ccd` | Add plugin documentation for PS_AI_Agent_ElevenLabs |
| `1b72026` | Add PowerPoint documentation and update .gitignore |
| `bbeb429` | ElevenLabs API reference doc |
| `dbd6161` | TestMap, test actor, DefaultEngine.ini, memory update |
| `ae2c9b9` | Fix 3 WebSocket protocol bugs |
| `b489d11` | Add SendTextMessage |
| `669c503` | Fix binary WebSocket frames |
| `4834567` | Fix JSON vs binary frame discrimination |
| `e464cfe` | Update documentation to v1.1.0 |
| `99017f4` | Update test_AI_Actor blueprint asset |

---

## Key Technical Decisions Made This Session

| Decision | Reason |
|----------|--------|
| WebSocket instead of gRPC | ElevenLabs Conversational AI uses WS/JSON; no ThirdParty libs needed |
| `AudioCapture` in `.uplugin` Plugins array | It's an engine plugin, not a module — UBT requires it declared |
| `WebSockets` in Build.cs only | It's a module (no `.uplugin` file); declaring it in `.uplugin` causes a build error |
| `FOnAudioCaptureFunction` uses `const void*` | UE 5.3+ API change — must cast to `float*` inside the callback |
| `EAllowShrinking::No` | Bool overload of `RemoveAt` deprecated in UE 5.5 |
| `USoundWaveProcedural` for playback | Allows pushing raw PCM bytes at runtime without file I/O |
| Silence threshold = 30 ticks | ~0.5 s at 60 fps heuristic to detect when the agent finished speaking |
| Binary frame handling | ElevenLabs sends all WS frames as binary; peek byte[0] to discriminate JSON vs PCM |
| `user_activity` for client turn | `user_turn_start`/`user_turn_end` don't exist in the ElevenLabs API |

---

## Next Steps (not done yet)

- [ ] Verify mic audio actually reaches ElevenLabs (enable Verbose Logging, test in Editor)
- [ ] Test `USoundWaveProcedural` underflow behaviour in practice (check for audio glitches)
- [ ] Test `SendTextMessage` end-to-end in Blueprint
- [ ] Add lip-sync support (future)
- [ ] Add session memory / conversation history (future, matching Convai)
- [ ] Add environment/action context support (future)
- [ ] Consider Signed URL Mode backend implementation
### `.gitignore` (vendored, +14)

```diff
@@ -4,3 +4,17 @@ Unreal/PS_AI_Agent/Binaries/
 Unreal/PS_AI_Agent/Intermediate/
 Unreal/PS_AI_Agent/Saved/
 ConvAI/Convai/Binaries/
+
+# All plugin build artifacts (Binaries + Intermediate for any plugin)
+Unreal/PS_AI_Agent/Plugins/*/Binaries/
+Unreal/PS_AI_Agent/Plugins/*/Intermediate/
+
+# UE5 generated solution files
+Unreal/PS_AI_Agent/*.sln
+Unreal/PS_AI_Agent/*.suo
+
+# Claude Code local session settings (machine-specific, memory files in .claude/ are kept)
+.claude/settings.local.json
+
+# Documentation generator script (dev tool, output .pptx is committed instead)
+generate_pptx.py
```
### `PS_AI_Agent_ElevenLabs_Documentation.pptx` (new file)

Binary file not shown.
### `DefaultEngine.ini`

```diff
@@ -1,7 +1,8 @@
 [/Script/EngineSettings.GameMapsSettings]
-GameDefaultMap=/Engine/Maps/Templates/OpenWorld
+GameDefaultMap=/Game/TestMap.TestMap
+EditorStartupMap=/Game/TestMap.TestMap
 
 [/Script/Engine.RendererSettings]
 r.AllowStaticLighting=False
@@ -90,3 +91,4 @@ ConnectionType=USBOnly
 bUseManualIPAddress=False
 ManualIPAddress=
 
+
```
Binary file not shown.

### `Unreal/PS_AI_Agent/Content/test_AI_Actor.uasset` (new file)

Binary file not shown.
### `PS_AI_Agent.uproject`

```diff
@@ -17,6 +17,10 @@
       "TargetAllowList": [
         "Editor"
       ]
+    },
+    {
+      "Name": "PS_AI_Agent_ElevenLabs",
+      "Enabled": true
     }
   ]
 }
```
### `PS_AI_Agent_ElevenLabs.uplugin` (new file, +35)

```json
{
  "FileVersion": 3,
  "Version": 1,
  "VersionName": "1.0.0",
  "FriendlyName": "PS AI Agent - ElevenLabs",
  "Description": "Integrates ElevenLabs Conversational AI Agent into Unreal Engine 5.5. Supports real-time voice conversation via WebSocket, microphone capture, and audio playback.",
  "Category": "AI",
  "CreatedBy": "ASTERION",
  "CreatedByURL": "",
  "DocsURL": "https://elevenlabs.io/docs/conversational-ai",
  "MarketplaceURL": "",
  "SupportURL": "",
  "CanContainContent": false,
  "IsBetaVersion": true,
  "IsExperimentalVersion": false,
  "Installed": false,
  "Modules": [
    {
      "Name": "PS_AI_Agent_ElevenLabs",
      "Type": "Runtime",
      "LoadingPhase": "PreDefault",
      "PlatformAllowList": [
        "Win64",
        "Mac",
        "Linux"
      ]
    }
  ],
  "Plugins": [
    {
      "Name": "AudioCapture",
      "Enabled": true
    }
  ]
}
```
### `PS_AI_Agent_ElevenLabs.Build.cs` (new file, +40)

```csharp
// Copyright ASTERION. All Rights Reserved.

using UnrealBuildTool;

public class PS_AI_Agent_ElevenLabs : ModuleRules
{
    public PS_AI_Agent_ElevenLabs(ReadOnlyTargetRules Target) : base(Target)
    {
        DefaultBuildSettings = BuildSettingsVersion.Latest;
        PCHUsage = PCHUsageMode.UseExplicitOrSharedPCHs;

        PublicDependencyModuleNames.AddRange(new string[]
        {
            "Core",
            "CoreUObject",
            "Engine",
            "InputCore",
            // JSON serialization for WebSocket message payloads
            "Json",
            "JsonUtilities",
            // WebSocket for ElevenLabs Conversational AI real-time API
            "WebSockets",
            // HTTP for REST calls (agent metadata, auth, etc.)
            "HTTP",
            // Audio capture (microphone input)
            "AudioMixer",
            "AudioCaptureCore",
            "AudioCapture",
            "Voice",
            "SignalProcessing",
        });

        PrivateDependencyModuleNames.AddRange(new string[]
        {
            "Projects",
            // For ISettingsModule (Project Settings integration)
            "Settings",
        });
    }
}
```
### `ElevenLabsConversationalAgentComponent.cpp` (new file, +345)

```cpp
// Copyright ASTERION. All Rights Reserved.

#include "ElevenLabsConversationalAgentComponent.h"
#include "ElevenLabsMicrophoneCaptureComponent.h"
#include "PS_AI_Agent_ElevenLabs.h"

#include "Components/AudioComponent.h"
#include "Sound/SoundWaveProcedural.h"
#include "GameFramework/Actor.h"
#include "Engine/World.h"

DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsAgent, Log, All);

// ── Constructor ──────────────────────────────────────────────────────────────
UElevenLabsConversationalAgentComponent::UElevenLabsConversationalAgentComponent()
{
    PrimaryComponentTick.bCanEverTick = true;
    // Tick is used only to detect silence (agent stopped speaking).
    // Disable if not needed for perf.
    PrimaryComponentTick.TickInterval = 1.0f / 60.0f;
}

// ── Lifecycle ────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::BeginPlay()
{
    Super::BeginPlay();
    InitAudioPlayback();
}

void UElevenLabsConversationalAgentComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
{
    EndConversation();
    Super::EndPlay(EndPlayReason);
}

void UElevenLabsConversationalAgentComponent::TickComponent(float DeltaTime, ELevelTick TickType,
    FActorComponentTickFunction* ThisTickFunction)
{
    Super::TickComponent(DeltaTime, TickType, ThisTickFunction);

    if (bAgentSpeaking)
    {
        FScopeLock Lock(&AudioQueueLock);
        if (AudioQueue.Num() == 0)
        {
            SilentTickCount++;
            if (SilentTickCount >= SilenceThresholdTicks)
            {
                bAgentSpeaking = false;
                SilentTickCount = 0;
                OnAgentStoppedSpeaking.Broadcast();
            }
        }
        else
        {
            SilentTickCount = 0;
        }
    }
}

// ── Control ──────────────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::StartConversation()
{
    if (!WebSocketProxy)
    {
        WebSocketProxy = NewObject<UElevenLabsWebSocketProxy>(this);
        WebSocketProxy->OnConnected.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleConnected);
        WebSocketProxy->OnDisconnected.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleDisconnected);
        WebSocketProxy->OnError.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleError);
        WebSocketProxy->OnAudioReceived.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleAudioReceived);
        WebSocketProxy->OnTranscript.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleTranscript);
        WebSocketProxy->OnAgentResponse.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleAgentResponse);
        WebSocketProxy->OnInterrupted.AddDynamic(this,
            &UElevenLabsConversationalAgentComponent::HandleInterrupted);
    }

    WebSocketProxy->Connect(AgentID);
}

void UElevenLabsConversationalAgentComponent::EndConversation()
{
    StopListening();
    StopAgentAudio();

    if (WebSocketProxy)
    {
        WebSocketProxy->Disconnect();
        WebSocketProxy = nullptr;
    }
}

void UElevenLabsConversationalAgentComponent::StartListening()
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsAgent, Warning, TEXT("StartListening: not connected."));
        return;
    }

    if (bIsListening) return;
    bIsListening = true;

    if (TurnMode == EElevenLabsTurnMode::Client)
    {
        WebSocketProxy->SendUserTurnStart();
    }

    // Find the microphone component on our owner actor, or create one.
    UElevenLabsMicrophoneCaptureComponent* Mic =
        GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>();

    if (!Mic)
    {
        Mic = NewObject<UElevenLabsMicrophoneCaptureComponent>(GetOwner(),
            TEXT("ElevenLabsMicrophone"));
        Mic->RegisterComponent();
    }

    Mic->OnAudioCaptured.AddUObject(this,
        &UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured);
    Mic->StartCapture();

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture started."));
}

void UElevenLabsConversationalAgentComponent::StopListening()
{
    if (!bIsListening) return;
    bIsListening = false;

    if (UElevenLabsMicrophoneCaptureComponent* Mic =
        GetOwner() ? GetOwner()->FindComponentByClass<UElevenLabsMicrophoneCaptureComponent>() : nullptr)
    {
        Mic->StopCapture();
        Mic->OnAudioCaptured.RemoveAll(this);
    }

    if (WebSocketProxy && TurnMode == EElevenLabsTurnMode::Client)
    {
        WebSocketProxy->SendUserTurnEnd();
    }

    UE_LOG(LogElevenLabsAgent, Log, TEXT("Microphone capture stopped."));
}

void UElevenLabsConversationalAgentComponent::SendTextMessage(const FString& Text)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsAgent, Warning, TEXT("SendTextMessage: not connected. Call StartConversation() first."));
        return;
    }
    WebSocketProxy->SendTextMessage(Text);
}

void UElevenLabsConversationalAgentComponent::InterruptAgent()
{
    if (WebSocketProxy) WebSocketProxy->SendInterrupt();
    StopAgentAudio();
}

// ── State queries ────────────────────────────────────────────────────────────
bool UElevenLabsConversationalAgentComponent::IsConnected() const
{
    return WebSocketProxy && WebSocketProxy->IsConnected();
}

const FElevenLabsConversationInfo& UElevenLabsConversationalAgentComponent::GetConversationInfo() const
{
    static FElevenLabsConversationInfo Empty;
    return WebSocketProxy ? WebSocketProxy->GetConversationInfo() : Empty;
}

// ── WebSocket event handlers ─────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::HandleConnected(const FElevenLabsConversationInfo& Info)
{
    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent connected. ConversationID=%s"), *Info.ConversationID);
    OnAgentConnected.Broadcast(Info);

    if (bAutoStartListening)
    {
        StartListening();
    }
}

void UElevenLabsConversationalAgentComponent::HandleDisconnected(int32 StatusCode, const FString& Reason)
{
    UE_LOG(LogElevenLabsAgent, Log, TEXT("Agent disconnected. Code=%d Reason=%s"), StatusCode, *Reason);
    bIsListening = false;
    bAgentSpeaking = false;
    OnAgentDisconnected.Broadcast(StatusCode, Reason);
}

void UElevenLabsConversationalAgentComponent::HandleError(const FString& ErrorMessage)
{
    UE_LOG(LogElevenLabsAgent, Error, TEXT("Agent error: %s"), *ErrorMessage);
    OnAgentError.Broadcast(ErrorMessage);
}

void UElevenLabsConversationalAgentComponent::HandleAudioReceived(const TArray<uint8>& PCMData)
{
    EnqueueAgentAudio(PCMData);
}

void UElevenLabsConversationalAgentComponent::HandleTranscript(const FElevenLabsTranscriptSegment& Segment)
{
    OnAgentTranscript.Broadcast(Segment);
}

void UElevenLabsConversationalAgentComponent::HandleAgentResponse(const FString& ResponseText)
{
    OnAgentTextResponse.Broadcast(ResponseText);
}

void UElevenLabsConversationalAgentComponent::HandleInterrupted()
{
    StopAgentAudio();
    OnAgentInterrupted.Broadcast();
}

// ── Audio playback ───────────────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::InitAudioPlayback()
{
    AActor* Owner = GetOwner();
    if (!Owner) return;

    // USoundWaveProcedural lets us push raw PCM data at runtime.
    ProceduralSoundWave = NewObject<USoundWaveProcedural>(this);
    ProceduralSoundWave->SetSampleRate(ElevenLabsAudio::SampleRate);
    ProceduralSoundWave->NumChannels = ElevenLabsAudio::Channels;
    ProceduralSoundWave->Duration = INDEFINITELY_LOOPING_DURATION;
    ProceduralSoundWave->SoundGroup = SOUNDGROUP_Voice;
    ProceduralSoundWave->bLooping = false;

    // Create the audio component attached to the owner.
    AudioPlaybackComponent = NewObject<UAudioComponent>(Owner, TEXT("ElevenLabsAudioPlayback"));
    AudioPlaybackComponent->RegisterComponent();
    AudioPlaybackComponent->bAutoActivate = false;
    AudioPlaybackComponent->SetSound(ProceduralSoundWave);

    // When the procedural sound wave needs more audio data, pull from our queue.
    ProceduralSoundWave->OnSoundWaveProceduralUnderflow =
        FOnSoundWaveProceduralUnderflow::CreateUObject(
            this, &UElevenLabsConversationalAgentComponent::OnProceduralUnderflow);
}

void UElevenLabsConversationalAgentComponent::OnProceduralUnderflow(
    USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired)
{
    FScopeLock Lock(&AudioQueueLock);
    if (AudioQueue.Num() == 0) return;

    const int32 BytesRequired = SamplesRequired * sizeof(int16);
    const int32 BytesToPush = FMath::Min(AudioQueue.Num(), BytesRequired);

    InProceduralWave->QueueAudio(AudioQueue.GetData(), BytesToPush);
    AudioQueue.RemoveAt(0, BytesToPush, EAllowShrinking::No);
}

void UElevenLabsConversationalAgentComponent::EnqueueAgentAudio(const TArray<uint8>& PCMData)
{
    {
        FScopeLock Lock(&AudioQueueLock);
        AudioQueue.Append(PCMData);
    }

    // Start playback if not already playing.
    if (!bAgentSpeaking)
    {
        bAgentSpeaking = true;
        SilentTickCount = 0;
        OnAgentStartedSpeaking.Broadcast();

        if (AudioPlaybackComponent && !AudioPlaybackComponent->IsPlaying())
        {
            AudioPlaybackComponent->Play();
        }
    }
}

void UElevenLabsConversationalAgentComponent::StopAgentAudio()
{
    if (AudioPlaybackComponent && AudioPlaybackComponent->IsPlaying())
    {
        AudioPlaybackComponent->Stop();
    }

    FScopeLock Lock(&AudioQueueLock);
    AudioQueue.Empty();

    if (bAgentSpeaking)
    {
        bAgentSpeaking = false;
        SilentTickCount = 0;
        OnAgentStoppedSpeaking.Broadcast();
    }
}

// ── Microphone → WebSocket ───────────────────────────────────────────────────
void UElevenLabsConversationalAgentComponent::OnMicrophoneDataCaptured(const TArray<float>& FloatPCM)
{
    if (!IsConnected() || !bIsListening) return;

    TArray<uint8> PCMBytes = FloatPCMToInt16Bytes(FloatPCM);
    WebSocketProxy->SendAudioChunk(PCMBytes);
}

TArray<uint8> UElevenLabsConversationalAgentComponent::FloatPCMToInt16Bytes(const TArray<float>& FloatPCM)
{
    TArray<uint8> Out;
    Out.Reserve(FloatPCM.Num() * 2);

    for (float Sample : FloatPCM)
    {
        // Clamp to [-1,1] then scale to int16 range
        const float Clamped = FMath::Clamp(Sample, -1.0f, 1.0f);
        const int16 Int16Sample = static_cast<int16>(Clamped * 32767.0f);

        // Little-endian
        Out.Add(static_cast<uint8>(Int16Sample & 0xFF));
        Out.Add(static_cast<uint8>((Int16Sample >> 8) & 0xFF));
    }

    return Out;
}
```
@ -0,0 +1,171 @@
|
|||||||
|
// Copyright ASTERION. All Rights Reserved.
|
||||||
|
|
||||||
|
#include "ElevenLabsMicrophoneCaptureComponent.h"
|
||||||
|
#include "ElevenLabsDefinitions.h"
|
||||||
|
|
||||||
|
#include "AudioCaptureCore.h"
|
||||||
|
#include "Async/Async.h"
|
||||||
|
|
||||||
|
DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsMic, Log, All);
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Constructor
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
UElevenLabsMicrophoneCaptureComponent::UElevenLabsMicrophoneCaptureComponent()
|
||||||
|
{
|
||||||
|
PrimaryComponentTick.bCanEverTick = false;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Lifecycle
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
void UElevenLabsMicrophoneCaptureComponent::EndPlay(const EEndPlayReason::Type EndPlayReason)
|
||||||
|
{
|
||||||
|
StopCapture();
|
||||||
|
Super::EndPlay(EndPlayReason);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Capture control
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
void UElevenLabsMicrophoneCaptureComponent::StartCapture()
|
||||||
|
{
|
||||||
|
if (bCapturing)
|
||||||
|
{
|
||||||
|
UE_LOG(LogElevenLabsMic, Warning, TEXT("StartCapture called while already capturing."));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Open the default audio capture stream.
|
||||||
|
// FOnAudioCaptureFunction uses const void* per UE 5.3+ API (cast to float* inside).
|
||||||
|
Audio::FOnAudioCaptureFunction CaptureCallback =
|
||||||
|
[this](const void* InAudio, int32 NumFrames, int32 InNumChannels,
|
||||||
|
int32 InSampleRate, double StreamTime, bool bOverflow)
|
||||||
|
{
|
||||||
|
OnAudioGenerate(InAudio, NumFrames, InNumChannels, InSampleRate, StreamTime, bOverflow);
|
||||||
|
};
|
||||||
|
|
||||||
|
if (!AudioCapture.OpenAudioCaptureStream(DeviceParams, MoveTemp(CaptureCallback), 1024))
|
||||||
|
{
|
||||||
|
UE_LOG(LogElevenLabsMic, Error, TEXT("Failed to open default audio capture stream."));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Retrieve the actual device parameters after opening the stream.
|
||||||
|
Audio::FCaptureDeviceInfo DeviceInfo;
|
||||||
|
if (AudioCapture.GetCaptureDeviceInfo(DeviceInfo))
|
||||||
|
{
|
||||||
|
DeviceSampleRate = DeviceInfo.PreferredSampleRate;
|
||||||
|
DeviceChannels = DeviceInfo.InputChannels;
|
||||||
|
UE_LOG(LogElevenLabsMic, Log, TEXT("Capture device: %s | Rate=%d | Channels=%d"),
|
||||||
|
*DeviceInfo.DeviceName, DeviceSampleRate, DeviceChannels);
|
||||||
|
}
|
||||||
|
|
||||||
|
AudioCapture.StartStream();
|
||||||
|
bCapturing = true;
|
||||||
|
UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture started."));
|
||||||
|
}
|
||||||
|
|
||||||
|
void UElevenLabsMicrophoneCaptureComponent::StopCapture()
|
||||||
|
{
|
||||||
|
if (!bCapturing) return;
|
||||||
|
|
||||||
|
AudioCapture.StopStream();
|
||||||
|
AudioCapture.CloseStream();
|
||||||
|
bCapturing = false;
|
||||||
|
UE_LOG(LogElevenLabsMic, Log, TEXT("Audio capture stopped."));
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Audio callback (background thread)
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
void UElevenLabsMicrophoneCaptureComponent::OnAudioGenerate(
|
||||||
|
const void* InAudio, int32 NumFrames,
|
||||||
|
int32 InNumChannels, int32 InSampleRate,
|
||||||
|
double StreamTime, bool bOverflow)
|
||||||
|
{
|
||||||
|
if (bOverflow)
|
||||||
|
{
|
||||||
|
UE_LOG(LogElevenLabsMic, Verbose, TEXT("Audio capture buffer overflow."));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Device sends float32 interleaved samples; cast from the void* API.
|
||||||
|
const float* FloatAudio = static_cast<const float*>(InAudio);
|
||||||
|
|
||||||
|
// Resample + downmix to 16000 Hz mono.
|
||||||
|
TArray<float> Resampled = ResampleTo16000(FloatAudio, NumFrames, InNumChannels, InSampleRate);
|
||||||
|
|
||||||
|
// Apply volume multiplier.
|
||||||
|
if (!FMath::IsNearlyEqual(VolumeMultiplier, 1.0f))
|
||||||
|
{
|
||||||
|
for (float& S : Resampled)
|
||||||
|
{
|
||||||
|
S *= VolumeMultiplier;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Fire the delegate on the game thread so subscribers don't need to be
|
||||||
|
// thread-safe (WebSocket Send is not thread-safe in UE's implementation).
|
||||||
|
AsyncTask(ENamedThreads::GameThread, [this, Data = MoveTemp(Resampled)]()
|
||||||
|
{
|
||||||
|
if (bCapturing)
|
||||||
|
{
|
||||||
|
OnAudioCaptured.Broadcast(Data);
|
||||||
|
}
|
||||||
|
});
}

// ─────────────────────────────────────────────────────────────────────────────
// Resampling
// ─────────────────────────────────────────────────────────────────────────────
TArray<float> UElevenLabsMicrophoneCaptureComponent::ResampleTo16000(
    const float* InAudio, int32 NumSamples,
    int32 InChannels, int32 InSampleRate)
{
    const int32 TargetRate = ElevenLabsAudio::SampleRate; // 16000

    // --- Step 1: Downmix to mono ---
    TArray<float> Mono;
    if (InChannels == 1)
    {
        Mono = TArray<float>(InAudio, NumSamples);
    }
    else
    {
        const int32 NumFrames = NumSamples / InChannels;
        Mono.Reserve(NumFrames);
        for (int32 i = 0; i < NumFrames; i++)
        {
            float Sum = 0.0f;
            for (int32 c = 0; c < InChannels; c++)
            {
                Sum += InAudio[i * InChannels + c];
            }
            Mono.Add(Sum / static_cast<float>(InChannels));
        }
    }

    // --- Step 2: Resample via linear interpolation ---
    if (InSampleRate == TargetRate)
    {
        return Mono;
    }

    const float Ratio = static_cast<float>(InSampleRate) / static_cast<float>(TargetRate);
    const int32 OutSamples = FMath::FloorToInt(static_cast<float>(Mono.Num()) / Ratio);

    TArray<float> Out;
    Out.Reserve(OutSamples);

    for (int32 i = 0; i < OutSamples; i++)
    {
        const float SrcIndex = static_cast<float>(i) * Ratio;
        const int32 SrcLow = FMath::FloorToInt(SrcIndex);
        const int32 SrcHigh = FMath::Min(SrcLow + 1, Mono.Num() - 1);
        const float Alpha = SrcIndex - static_cast<float>(SrcLow);

        Out.Add(FMath::Lerp(Mono[SrcLow], Mono[SrcHigh], Alpha));
    }

    return Out;
}
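The downmix-then-interpolate logic above can be sketched as a standalone function in plain standard C++ (a `std::vector` stand-in for the `TArray`/`FMath` version, names illustrative only):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cassert>

// Standalone sketch of the resampler above: average interleaved channels to
// mono, then linearly interpolate between the two nearest source samples.
std::vector<float> ResampleTo16000(const float* InAudio, int NumSamples,
                                   int InChannels, int InSampleRate)
{
    const int TargetRate = 16000;

    // Step 1: downmix interleaved channels to mono by averaging each frame.
    std::vector<float> Mono;
    if (InChannels == 1)
    {
        Mono.assign(InAudio, InAudio + NumSamples);
    }
    else
    {
        const int NumFrames = NumSamples / InChannels;
        Mono.reserve(NumFrames);
        for (int i = 0; i < NumFrames; i++)
        {
            float Sum = 0.0f;
            for (int c = 0; c < InChannels; c++)
                Sum += InAudio[i * InChannels + c];
            Mono.push_back(Sum / static_cast<float>(InChannels));
        }
    }

    // Step 2: linear interpolation; pass through when rates already match.
    if (InSampleRate == TargetRate)
        return Mono;

    const float Ratio = static_cast<float>(InSampleRate) / TargetRate;
    const int OutSamples = static_cast<int>(std::floor(Mono.size() / Ratio));

    std::vector<float> Out;
    Out.reserve(OutSamples);
    for (int i = 0; i < OutSamples; i++)
    {
        const float SrcIndex = i * Ratio;
        const int SrcLow = static_cast<int>(std::floor(SrcIndex));
        const int SrcHigh = std::min(SrcLow + 1, static_cast<int>(Mono.size()) - 1);
        const float Alpha = SrcIndex - SrcLow;
        Out.push_back(Mono[SrcLow] + Alpha * (Mono[SrcHigh] - Mono[SrcLow]));
    }
    return Out;
}
```

For example, 4 stereo frames at 32 kHz become 4 mono samples and then 2 output samples at 16 kHz (every second mono sample, since the ratio is exactly 2).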
@ -0,0 +1,455 @@
// Copyright ASTERION. All Rights Reserved.

#include "ElevenLabsWebSocketProxy.h"
#include "PS_AI_Agent_ElevenLabs.h"

#include "WebSocketsModule.h"
#include "IWebSocket.h"

#include "Json.h"
#include "JsonUtilities.h"
#include "Misc/Base64.h"

DEFINE_LOG_CATEGORY_STATIC(LogElevenLabsWS, Log, All);

// ─────────────────────────────────────────────────────────────────────────────
// Helpers
// ─────────────────────────────────────────────────────────────────────────────
static void EL_LOG(bool bVerbose, const TCHAR* Format, ...)
{
    if (!bVerbose) return;
    va_list Args;
    va_start(Args, Format);
    // Forward to UE_LOG at Verbose level
    TCHAR Buffer[2048];
    FCString::GetVarArgs(Buffer, UE_ARRAY_COUNT(Buffer), Format, Args);
    va_end(Args);
    UE_LOG(LogElevenLabsWS, Verbose, TEXT("%s"), Buffer);
}

// ─────────────────────────────────────────────────────────────────────────────
// Connect / Disconnect
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::Connect(const FString& AgentIDOverride, const FString& APIKeyOverride)
{
    if (ConnectionState == EElevenLabsConnectionState::Connected ||
        ConnectionState == EElevenLabsConnectionState::Connecting)
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("Connect called but already connecting/connected. Ignoring."));
        return;
    }

    if (!FModuleManager::Get().IsModuleLoaded("WebSockets"))
    {
        FModuleManager::LoadModuleChecked<FWebSocketsModule>("WebSockets");
    }

    const FString URL = BuildWebSocketURL(AgentIDOverride, APIKeyOverride);
    if (URL.IsEmpty())
    {
        const FString Msg = TEXT("Cannot connect: no Agent ID configured. Set it in Project Settings or pass it to Connect().");
        UE_LOG(LogElevenLabsWS, Error, TEXT("%s"), *Msg);
        OnError.Broadcast(Msg);
        ConnectionState = EElevenLabsConnectionState::Error;
        return;
    }

    UE_LOG(LogElevenLabsWS, Log, TEXT("Connecting to ElevenLabs: %s"), *URL);
    ConnectionState = EElevenLabsConnectionState::Connecting;

    // Headers: the ElevenLabs Conversational AI WS endpoint accepts the
    // xi-api-key header on the initial HTTP upgrade request.
    TMap<FString, FString> UpgradeHeaders;
    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
    const FString ResolvedKey = APIKeyOverride.IsEmpty() ? Settings->API_Key : APIKeyOverride;
    if (!ResolvedKey.IsEmpty())
    {
        UpgradeHeaders.Add(TEXT("xi-api-key"), ResolvedKey);
    }

    WebSocket = FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), UpgradeHeaders);

    WebSocket->OnConnected().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnected);
    WebSocket->OnConnectionError().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsConnectionError);
    WebSocket->OnClosed().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsClosed);
    WebSocket->OnMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsMessage);
    WebSocket->OnRawMessage().AddUObject(this, &UElevenLabsWebSocketProxy::OnWsBinaryMessage);

    WebSocket->Connect();
}

void UElevenLabsWebSocketProxy::Disconnect()
{
    if (WebSocket.IsValid() && WebSocket->IsConnected())
    {
        WebSocket->Close(1000, TEXT("Client disconnected"));
    }
    ConnectionState = EElevenLabsConnectionState::Disconnected;
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio & turn control
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::SendAudioChunk(const TArray<uint8>& PCMData)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendAudioChunk: not connected."));
        return;
    }
    if (PCMData.Num() == 0) return;

    // ElevenLabs expects: { "user_audio_chunk": "<base64 PCM>" }
    const FString Base64Audio = FBase64::Encode(PCMData.GetData(), PCMData.Num());

    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(ElevenLabsMessageType::AudioChunk, Base64Audio);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendUserTurnStart()
{
    // In client turn mode, signal that the user is active/speaking.
    // API message: { "type": "user_activity" }
    if (!IsConnected()) return;
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserActivity);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendUserTurnEnd()
{
    // In client turn mode, stopping user_activity signals end of user turn.
    // The API uses user_activity for ongoing speech; simply stop sending it.
    // No explicit end message is required — silence is detected server-side.
    // We still log for debug visibility.
    UE_LOG(LogElevenLabsWS, Log, TEXT("User turn ended (client mode) — stopped sending user_activity."));
}

void UElevenLabsWebSocketProxy::SendTextMessage(const FString& Text)
{
    if (!IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendTextMessage: not connected."));
        return;
    }
    if (Text.IsEmpty()) return;

    // API: { "type": "user_message", "text": "Hello agent" }
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::UserMessage);
    Msg->SetStringField(TEXT("text"), Text);
    SendJsonMessage(Msg);
}

void UElevenLabsWebSocketProxy::SendInterrupt()
{
    if (!IsConnected()) return;
    TSharedPtr<FJsonObject> Msg = MakeShareable(new FJsonObject());
    Msg->SetStringField(TEXT("type"), ElevenLabsMessageType::Interrupt);
    SendJsonMessage(Msg);
}

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket callbacks
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::OnWsConnected()
{
    UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket connected. Waiting for conversation_initiation_metadata..."));
    // State stays Connecting until we receive the initiation metadata from the server.
}

void UElevenLabsWebSocketProxy::OnWsConnectionError(const FString& Error)
{
    UE_LOG(LogElevenLabsWS, Error, TEXT("WebSocket connection error: %s"), *Error);
    ConnectionState = EElevenLabsConnectionState::Error;
    OnError.Broadcast(Error);
}

void UElevenLabsWebSocketProxy::OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean)
{
    UE_LOG(LogElevenLabsWS, Log, TEXT("WebSocket closed. Code=%d Reason=%s Clean=%d"), StatusCode, *Reason, bWasClean);
    ConnectionState = EElevenLabsConnectionState::Disconnected;
    WebSocket.Reset();
    OnDisconnected.Broadcast(StatusCode, Reason);
}

void UElevenLabsWebSocketProxy::OnWsMessage(const FString& Message)
{
    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
    if (Settings->bVerboseLogging)
    {
        UE_LOG(LogElevenLabsWS, Verbose, TEXT(">> %s"), *Message);
    }

    TSharedPtr<FJsonObject> Root;
    TSharedRef<TJsonReader<>> Reader = TJsonReaderFactory<>::Create(Message);
    if (!FJsonSerializer::Deserialize(Reader, Root) || !Root.IsValid())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to parse WebSocket message as JSON (first 80 chars): %.80s"), *Message);
        return;
    }

    FString MsgType;
    // ElevenLabs wraps the type in a "type" field
    if (!Root->TryGetStringField(TEXT("type"), MsgType))
    {
        // Fallback: some messages use the top-level key as the type
        // e.g. { "user_audio_chunk": "..." } from ourselves (shouldn't arrive)
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("Message has no 'type' field, ignoring."));
        return;
    }

    if (MsgType == ElevenLabsMessageType::ConversationInitiation)
    {
        HandleConversationInitiation(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AudioResponse)
    {
        HandleAudioResponse(Root);
    }
    else if (MsgType == ElevenLabsMessageType::UserTranscript)
    {
        HandleTranscript(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AgentResponse)
    {
        HandleAgentResponse(Root);
    }
    else if (MsgType == ElevenLabsMessageType::AgentResponseCorrection)
    {
        // Silently ignore for now — corrected text after interruption.
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("agent_response_correction received (ignored)."));
    }
    else if (MsgType == ElevenLabsMessageType::InterruptionEvent)
    {
        HandleInterruption(Root);
    }
    else if (MsgType == ElevenLabsMessageType::PingEvent)
    {
        HandlePing(Root);
    }
    else
    {
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("Unhandled message type: %s"), *MsgType);
    }
}

void UElevenLabsWebSocketProxy::OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining)
{
    // Accumulate fragments until BytesRemaining == 0.
    const uint8* Bytes = static_cast<const uint8*>(Data);
    BinaryFrameBuffer.Append(Bytes, Size);

    if (BytesRemaining > 0)
    {
        // More fragments coming — wait for the rest
        return;
    }

    const int32 TotalSize = BinaryFrameBuffer.Num();

    // Peek at first byte to distinguish JSON (starts with '{') from raw binary audio.
    const bool bLooksLikeJson = (TotalSize > 0 && BinaryFrameBuffer[0] == '{');

    if (bLooksLikeJson)
    {
        // Null-terminate safely then decode as UTF-8 JSON
        BinaryFrameBuffer.Add(0);
        const FString JsonString = FString(UTF8_TO_TCHAR(
            reinterpret_cast<const char*>(BinaryFrameBuffer.GetData())));
        BinaryFrameBuffer.Reset();

        const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
        if (Settings->bVerboseLogging)
        {
            UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary JSON frame (%d bytes): %.120s"), TotalSize, *JsonString);
        }

        OnWsMessage(JsonString);
    }
    else
    {
        // Raw binary audio frame — PCM bytes sent directly without Base64/JSON wrapper.
        // Log first few bytes as hex to help diagnose the format.
        const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
        if (Settings->bVerboseLogging)
        {
            FString HexPreview;
            const int32 PreviewBytes = FMath::Min(TotalSize, 8);
            for (int32 i = 0; i < PreviewBytes; i++)
            {
                HexPreview += FString::Printf(TEXT("%02X "), BinaryFrameBuffer[i]);
            }
            UE_LOG(LogElevenLabsWS, Verbose, TEXT("Binary audio frame: %d bytes | first bytes: %s"), TotalSize, *HexPreview);
        }

        // Broadcast raw PCM bytes directly to the audio queue.
        TArray<uint8> PCMData = MoveTemp(BinaryFrameBuffer);
        BinaryFrameBuffer.Reset();
        OnAudioReceived.Broadcast(PCMData);
    }
}
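The frame-classification heuristic in `OnWsBinaryMessage` (a complete frame starting with `{` is treated as UTF-8 JSON, anything else as raw PCM bytes) can be sketched in plain C++ (illustrative names only):

```cpp
#include <vector>
#include <cstdint>
#include <cassert>

// Minimal sketch of the first-byte peek used above: JSON objects always
// begin with '{', so anything else is assumed to be raw PCM sample bytes.
enum class FrameKind { Json, RawPcm, Empty };

FrameKind ClassifyFrame(const std::vector<uint8_t>& Frame)
{
    if (Frame.empty())    return FrameKind::Empty;   // nothing accumulated
    if (Frame[0] == '{')  return FrameKind::Json;    // looks like a JSON object
    return FrameKind::RawPcm;                        // assume PCM audio bytes
}
```

Note this is only a heuristic: a PCM frame whose first sample byte happens to equal `0x7B` (`'{'`) would be misclassified, which is acceptable here because the server sends audio either Base64-wrapped in JSON or as raw frames, not both interleaved ambiguously.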
// ─────────────────────────────────────────────────────────────────────────────
// Message handlers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::HandleConversationInitiation(const TSharedPtr<FJsonObject>& Root)
{
    // Expected structure:
    // { "type": "conversation_initiation_metadata",
    //   "conversation_initiation_metadata_event": {
    //     "conversation_id": "...",
    //     "agent_output_audio_format": "pcm_16000"
    //   }
    // }
    const TSharedPtr<FJsonObject>* MetaObj = nullptr;
    if (Root->TryGetObjectField(TEXT("conversation_initiation_metadata_event"), MetaObj) && MetaObj)
    {
        (*MetaObj)->TryGetStringField(TEXT("conversation_id"), ConversationInfo.ConversationID);
    }

    UE_LOG(LogElevenLabsWS, Log, TEXT("Conversation initiated. ID=%s"), *ConversationInfo.ConversationID);
    ConnectionState = EElevenLabsConnectionState::Connected;
    OnConnected.Broadcast(ConversationInfo);
}

void UElevenLabsWebSocketProxy::HandleAudioResponse(const TSharedPtr<FJsonObject>& Root)
{
    // Expected structure:
    // { "type": "audio",
    //   "audio_event": { "audio_base_64": "<base64 PCM>", "event_id": 1 }
    // }
    const TSharedPtr<FJsonObject>* AudioEvent = nullptr;
    if (!Root->TryGetObjectField(TEXT("audio_event"), AudioEvent) || !AudioEvent)
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("audio message missing 'audio_event' field."));
        return;
    }

    FString Base64Audio;
    if (!(*AudioEvent)->TryGetStringField(TEXT("audio_base_64"), Base64Audio))
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("audio_event missing 'audio_base_64' field."));
        return;
    }

    TArray<uint8> PCMData;
    if (!FBase64::Decode(Base64Audio, PCMData))
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("Failed to Base64-decode audio data."));
        return;
    }

    OnAudioReceived.Broadcast(PCMData);
}

void UElevenLabsWebSocketProxy::HandleTranscript(const TSharedPtr<FJsonObject>& Root)
{
    // API structure:
    // { "type": "user_transcript",
    //   "user_transcription_event": { "user_transcript": "Hello" }
    // }
    // This message only carries the user's speech-to-text — speaker is always "user".
    const TSharedPtr<FJsonObject>* TranscriptEvent = nullptr;
    if (!Root->TryGetObjectField(TEXT("user_transcription_event"), TranscriptEvent) || !TranscriptEvent)
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("user_transcript message missing 'user_transcription_event' field."));
        return;
    }

    FElevenLabsTranscriptSegment Segment;
    Segment.Speaker = TEXT("user");
    (*TranscriptEvent)->TryGetStringField(TEXT("user_transcript"), Segment.Text);
    // user_transcript messages are always final (interim results are not sent for user speech)
    Segment.bIsFinal = true;

    OnTranscript.Broadcast(Segment);
}

void UElevenLabsWebSocketProxy::HandleAgentResponse(const TSharedPtr<FJsonObject>& Root)
{
    // { "type": "agent_response",
    //   "agent_response_event": { "agent_response": "..." }
    // }
    const TSharedPtr<FJsonObject>* ResponseEvent = nullptr;
    if (!Root->TryGetObjectField(TEXT("agent_response_event"), ResponseEvent) || !ResponseEvent)
    {
        return;
    }

    FString ResponseText;
    (*ResponseEvent)->TryGetStringField(TEXT("agent_response"), ResponseText);
    OnAgentResponse.Broadcast(ResponseText);
}

void UElevenLabsWebSocketProxy::HandleInterruption(const TSharedPtr<FJsonObject>& Root)
{
    UE_LOG(LogElevenLabsWS, Log, TEXT("Agent interrupted."));
    OnInterrupted.Broadcast();
}

void UElevenLabsWebSocketProxy::HandlePing(const TSharedPtr<FJsonObject>& Root)
{
    // Reply with a pong to keep the connection alive.
    // Incoming: { "type": "ping", "ping_event": { "event_id": 1, "ping_ms": 150 } }
    // Reply:    { "type": "pong", "event_id": 1 }  ← event_id is top-level, no wrapper object
    int32 EventID = 0;
    const TSharedPtr<FJsonObject>* PingEvent = nullptr;
    if (Root->TryGetObjectField(TEXT("ping_event"), PingEvent) && PingEvent)
    {
        (*PingEvent)->TryGetNumberField(TEXT("event_id"), EventID);
    }

    TSharedPtr<FJsonObject> Pong = MakeShareable(new FJsonObject());
    Pong->SetStringField(TEXT("type"), TEXT("pong"));
    Pong->SetNumberField(TEXT("event_id"), EventID); // top-level, not nested
    SendJsonMessage(Pong);
}
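The asymmetry the ping handler's comments call out (the incoming `event_id` is nested inside `ping_event`, the reply carries it at top level) is easy to get wrong; a minimal standalone sketch of just the reply shape, without the UE JSON types:

```cpp
#include <cstdio>
#include <string>
#include <cassert>

// Sketch of the pong reply shape: the echoed event_id sits at the top level
// of the reply object, not inside a "pong_event" wrapper. (Illustrative
// helper; the plugin builds the same shape with FJsonObject.)
std::string MakePongReply(int EventID)
{
    char Buf[64];
    std::snprintf(Buf, sizeof(Buf), "{\"type\":\"pong\",\"event_id\":%d}", EventID);
    return std::string(Buf);
}
```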
// ─────────────────────────────────────────────────────────────────────────────
// Helpers
// ─────────────────────────────────────────────────────────────────────────────
void UElevenLabsWebSocketProxy::SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj)
{
    if (!WebSocket.IsValid() || !WebSocket->IsConnected())
    {
        UE_LOG(LogElevenLabsWS, Warning, TEXT("SendJsonMessage: WebSocket not connected."));
        return;
    }

    FString Out;
    TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&Out);
    FJsonSerializer::Serialize(JsonObj.ToSharedRef(), Writer);

    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();
    if (Settings->bVerboseLogging)
    {
        UE_LOG(LogElevenLabsWS, Verbose, TEXT("<< %s"), *Out);
    }

    WebSocket->Send(Out);
}

FString UElevenLabsWebSocketProxy::BuildWebSocketURL(const FString& AgentIDOverride, const FString& APIKeyOverride) const
{
    const UElevenLabsSettings* Settings = FPS_AI_Agent_ElevenLabsModule::Get().GetSettings();

    // Custom URL override takes full precedence
    if (!Settings->CustomWebSocketURL.IsEmpty())
    {
        return Settings->CustomWebSocketURL;
    }

    const FString ResolvedAgentID = AgentIDOverride.IsEmpty() ? Settings->AgentID : AgentIDOverride;
    if (ResolvedAgentID.IsEmpty())
    {
        return FString();
    }

    // Official ElevenLabs Conversational AI WebSocket endpoint
    // wss://api.elevenlabs.io/v1/convai/conversation?agent_id=<ID>
    return FString::Printf(
        TEXT("wss://api.elevenlabs.io/v1/convai/conversation?agent_id=%s"),
        *ResolvedAgentID);
}
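The precedence chain in `BuildWebSocketURL` (custom URL wins outright, then the per-call agent ID override, then the project-default agent ID; an empty result means "nothing configured") can be sketched standalone (parameter names are illustrative, not the plugin's actual signature):

```cpp
#include <string>
#include <cassert>

// Sketch of the URL-resolution precedence used by BuildWebSocketURL.
std::string BuildWebSocketURL(const std::string& CustomURL,
                              const std::string& DefaultAgentID,
                              const std::string& AgentIDOverride)
{
    // 1) A custom WebSocket URL takes full precedence.
    if (!CustomURL.empty())
        return CustomURL;

    // 2) Per-call override beats the project default.
    const std::string Agent = AgentIDOverride.empty() ? DefaultAgentID : AgentIDOverride;

    // 3) No agent configured anywhere: signal failure with an empty string.
    if (Agent.empty())
        return {};

    return "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=" + Agent;
}
```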
@ -0,0 +1,50 @@
// Copyright ASTERION. All Rights Reserved.

#include "PS_AI_Agent_ElevenLabs.h"
#include "Developer/Settings/Public/ISettingsModule.h"
#include "UObject/UObjectGlobals.h"
#include "UObject/Package.h"

IMPLEMENT_MODULE(FPS_AI_Agent_ElevenLabsModule, PS_AI_Agent_ElevenLabs)

#define LOCTEXT_NAMESPACE "PS_AI_Agent_ElevenLabs"

void FPS_AI_Agent_ElevenLabsModule::StartupModule()
{
    Settings = NewObject<UElevenLabsSettings>(GetTransientPackage(), "ElevenLabsSettings", RF_Standalone);
    Settings->AddToRoot();

    if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
    {
        SettingsModule->RegisterSettings(
            "Project", "Plugins", "ElevenLabsAIAgent",
            LOCTEXT("SettingsName", "ElevenLabs AI Agent"),
            LOCTEXT("SettingsDescription", "Configure the ElevenLabs Conversational AI Agent plugin"),
            Settings);
    }
}

void FPS_AI_Agent_ElevenLabsModule::ShutdownModule()
{
    if (ISettingsModule* SettingsModule = FModuleManager::GetModulePtr<ISettingsModule>("Settings"))
    {
        SettingsModule->UnregisterSettings("Project", "Plugins", "ElevenLabsAIAgent");
    }

    if (!GExitPurge)
    {
        Settings->RemoveFromRoot();
    }
    else
    {
        Settings = nullptr;
    }
}

UElevenLabsSettings* FPS_AI_Agent_ElevenLabsModule::GetSettings() const
{
    check(Settings);
    return Settings;
}

#undef LOCTEXT_NAMESPACE
@ -0,0 +1,233 @@
|
|||||||
|
// Copyright ASTERION. All Rights Reserved.
|
||||||
|
|
||||||
|
#pragma once
|
||||||
|
|
||||||
|
#include "CoreMinimal.h"
|
||||||
|
#include "Components/ActorComponent.h"
|
||||||
|
#include "ElevenLabsDefinitions.h"
|
||||||
|
#include "ElevenLabsWebSocketProxy.h"
|
||||||
|
#include "Sound/SoundWaveProcedural.h"
|
||||||
|
#include "ElevenLabsConversationalAgentComponent.generated.h"
|
||||||
|
|
||||||
|
class UAudioComponent;
|
||||||
|
class UElevenLabsMicrophoneCaptureComponent;
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Delegates exposed to Blueprint
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentConnected,
|
||||||
|
const FElevenLabsConversationInfo&, ConversationInfo);
|
||||||
|
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnAgentDisconnected,
|
||||||
|
int32, StatusCode, const FString&, Reason);
|
||||||
|
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentError,
|
||||||
|
const FString&, ErrorMessage);
|
||||||
|
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTranscript,
|
||||||
|
const FElevenLabsTranscriptSegment&, Segment);
|
||||||
|
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnAgentTextResponse,
|
||||||
|
const FString&, ResponseText);
|
||||||
|
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStartedSpeaking);
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentStoppedSpeaking);
|
||||||
|
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnAgentInterrupted);
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// UElevenLabsConversationalAgentComponent
|
||||||
|
//
|
||||||
|
// Attach this to any Actor (e.g. a character NPC) to give it a voice powered by
|
||||||
|
// the ElevenLabs Conversational AI API.
|
||||||
|
//
|
||||||
|
// Workflow:
|
||||||
|
// 1. Set AgentID (or rely on project default).
|
||||||
|
// 2. Call StartConversation() to open the WebSocket.
|
||||||
|
// 3. Call StartListening() / StopListening() to control microphone capture.
|
||||||
|
// 4. React to events (OnAgentTranscript, OnAgentTextResponse, etc.) in Blueprint.
|
||||||
|
// 5. Call EndConversation() when done.
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
UCLASS(ClassGroup = "ElevenLabs", meta = (BlueprintSpawnableComponent),
|
||||||
|
DisplayName = "ElevenLabs Conversational Agent")
|
||||||
|
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsConversationalAgentComponent : public UActorComponent
|
||||||
|
{
|
||||||
|
GENERATED_BODY()
|
||||||
|
|
||||||
|
public:
|
||||||
|
UElevenLabsConversationalAgentComponent();
|
||||||
|
|
||||||
|
// ── Configuration ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/**
|
||||||
|
* ElevenLabs Agent ID. Overrides the project-level default in Project Settings.
|
||||||
|
* Leave empty to use the project default.
|
||||||
|
*/
|
||||||
|
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
|
||||||
|
FString AgentID;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Turn mode:
|
||||||
|
* - Server VAD: ElevenLabs detects end-of-speech automatically (recommended).
|
||||||
|
* - Client Controlled: you call StartListening/StopListening manually (push-to-talk).
|
||||||
|
*/
|
||||||
|
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
|
||||||
|
EElevenLabsTurnMode TurnMode = EElevenLabsTurnMode::Server;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Automatically start listening (microphone capture) once the WebSocket is
|
||||||
|
* connected and the conversation is initiated.
|
||||||
|
*/
|
||||||
|
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs")
|
||||||
|
bool bAutoStartListening = true;
|
||||||
|
|
||||||
|
// ── Events ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentConnected OnAgentConnected;
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentDisconnected OnAgentDisconnected;
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentError OnAgentError;
|
||||||
|
|
||||||
|
/** Fired for every transcript segment (user speech or agent speech, tentative and final). */
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentTranscript OnAgentTranscript;
|
||||||
|
|
||||||
|
/** Final text response produced by the agent (mirrors the audio). */
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentTextResponse OnAgentTextResponse;
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentStartedSpeaking OnAgentStartedSpeaking;
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentStoppedSpeaking OnAgentStoppedSpeaking;
|
||||||
|
|
||||||
|
UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
|
||||||
|
FOnAgentInterrupted OnAgentInterrupted;
|
||||||
|
|
||||||
|
// ── Control ───────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Open the WebSocket connection and start the conversation.
|
||||||
|
* If bAutoStartListening is true, microphone capture also starts once connected.
|
||||||
|
*/
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void StartConversation();
|
||||||
|
|
||||||
|
/** Close the WebSocket and stop all audio. */
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void EndConversation();
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Start capturing microphone audio and streaming it to ElevenLabs.
|
||||||
|
* In Client turn mode, also sends a UserTurnStart signal.
|
||||||
|
*/
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void StartListening();
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Stop capturing microphone audio.
|
||||||
|
* In Client turn mode, also sends a UserTurnEnd signal.
|
||||||
|
*/
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void StopListening();
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Send a plain text message to the agent without using the microphone.
|
||||||
|
* The agent will respond with audio and text just as if it heard you speak.
|
||||||
|
* Useful for testing in the Editor or for text-based interaction.
|
||||||
|
*/
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void SendTextMessage(const FString& Text);
|
||||||
|
|
||||||
|
/** Interrupt the agent's current utterance. */
|
||||||
|
UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
|
||||||
|
void InterruptAgent();
|
||||||
|
|
||||||
|
// ── State queries ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||||
|
bool IsConnected() const;
|
||||||
|
|
||||||
|
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||||
|
bool IsListening() const { return bIsListening; }
|
||||||
|
|
||||||
|
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||||
|
bool IsAgentSpeaking() const { return bAgentSpeaking; }
|
||||||
|
|
||||||
|
UFUNCTION(BlueprintPure, Category = "ElevenLabs")
|
||||||
|
const FElevenLabsConversationInfo& GetConversationInfo() const;
|
||||||
|
|
||||||
|
    /** Access the underlying WebSocket proxy (advanced use). */
    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    UElevenLabsWebSocketProxy* GetWebSocketProxy() const { return WebSocketProxy; }

    // ─────────────────────────────────────────────────────────────────────────
    // UActorComponent overrides
    // ─────────────────────────────────────────────────────────────────────────
    virtual void BeginPlay() override;
    virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;
    virtual void TickComponent(float DeltaTime, ELevelTick TickType,
        FActorComponentTickFunction* ThisTickFunction) override;

private:
    // ── Internal event handlers ───────────────────────────────────────────────
    UFUNCTION()
    void HandleConnected(const FElevenLabsConversationInfo& Info);

    UFUNCTION()
    void HandleDisconnected(int32 StatusCode, const FString& Reason);

    UFUNCTION()
    void HandleError(const FString& ErrorMessage);

    UFUNCTION()
    void HandleAudioReceived(const TArray<uint8>& PCMData);

    UFUNCTION()
    void HandleTranscript(const FElevenLabsTranscriptSegment& Segment);

    UFUNCTION()
    void HandleAgentResponse(const FString& ResponseText);

    UFUNCTION()
    void HandleInterrupted();

    // ── Audio playback ────────────────────────────────────────────────────────
    void InitAudioPlayback();
    void EnqueueAgentAudio(const TArray<uint8>& PCMData);
    void StopAgentAudio();

    /** Called by USoundWaveProcedural when it needs more PCM data. */
    void OnProceduralUnderflow(USoundWaveProcedural* InProceduralWave, const int32 SamplesRequired);

    // ── Microphone streaming ──────────────────────────────────────────────────
    void OnMicrophoneDataCaptured(const TArray<float>& FloatPCM);

    /** Convert float PCM to int16 little-endian bytes for ElevenLabs. */
    static TArray<uint8> FloatPCMToInt16Bytes(const TArray<float>& FloatPCM);

    // ── Sub-objects ───────────────────────────────────────────────────────────
    UPROPERTY()
    UElevenLabsWebSocketProxy* WebSocketProxy = nullptr;

    UPROPERTY()
    UAudioComponent* AudioPlaybackComponent = nullptr;

    UPROPERTY()
    USoundWaveProcedural* ProceduralSoundWave = nullptr;

    // ── State ─────────────────────────────────────────────────────────────────
    bool bIsListening = false;
    bool bAgentSpeaking = false;

    // Accumulates incoming PCM bytes until the audio component needs data.
    TArray<uint8> AudioQueue;
    FCriticalSection AudioQueueLock;

    // Simple heuristic: if we haven't received audio data for this many ticks,
    // consider the agent done speaking.
    int32 SilentTickCount = 0;
    static constexpr int32 SilenceThresholdTicks = 30; // ~0.5 s at 60 fps
};
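The `FloatPCMToInt16Bytes` helper declared above performs the one format conversion the component needs before audio can go out over the wire. A minimal engine-free sketch of that conversion (a hypothetical standalone function using `std` containers in place of `TArray`; the real helper is a static member of the component):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert float32 samples in [-1, 1] to int16 little-endian bytes.
// Out-of-range values are clamped to avoid wrap-around distortion.
std::vector<uint8_t> FloatPCMToInt16Bytes(const std::vector<float>& FloatPCM)
{
    std::vector<uint8_t> Bytes;
    Bytes.reserve(FloatPCM.size() * 2);
    for (float Sample : FloatPCM)
    {
        const float Clamped = std::clamp(Sample, -1.0f, 1.0f);
        const int16_t Value = static_cast<int16_t>(Clamped * 32767.0f);
        Bytes.push_back(static_cast<uint8_t>(Value & 0xFF));        // low byte first (LE)
        Bytes.push_back(static_cast<uint8_t>((Value >> 8) & 0xFF)); // high byte
    }
    return Bytes;
}
```

Emitting the low byte first matches the little-endian layout the API expects, so no byte swapping is needed on x86/ARM targets.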
@ -0,0 +1,109 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "ElevenLabsDefinitions.generated.h"

// ─────────────────────────────────────────────────────────────────────────────
// Connection state
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsConnectionState : uint8
{
    Disconnected UMETA(DisplayName = "Disconnected"),
    Connecting   UMETA(DisplayName = "Connecting"),
    Connected    UMETA(DisplayName = "Connected"),
    Error        UMETA(DisplayName = "Error"),
};

// ─────────────────────────────────────────────────────────────────────────────
// Agent turn mode
// ─────────────────────────────────────────────────────────────────────────────
UENUM(BlueprintType)
enum class EElevenLabsTurnMode : uint8
{
    /** ElevenLabs server decides when the user has finished speaking (default). */
    Server UMETA(DisplayName = "Server VAD"),

    /** Client explicitly signals turn start/end (manual push-to-talk). */
    Client UMETA(DisplayName = "Client Controlled"),
};

// ─────────────────────────────────────────────────────────────────────────────
// WebSocket message type helpers (internal, not exposed to Blueprint)
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsMessageType
{
    // Client → Server
    static const FString AudioChunk              = TEXT("user_audio_chunk");
    // Client turn mode: signal the user is currently active/speaking
    static const FString UserActivity            = TEXT("user_activity");
    // Client turn mode: send a text message without audio
    static const FString UserMessage             = TEXT("user_message");
    static const FString Interrupt               = TEXT("interrupt");
    static const FString ClientToolResult        = TEXT("client_tool_result");
    static const FString ConversationClientData  = TEXT("conversation_initiation_client_data");

    // Server → Client
    static const FString ConversationInitiation  = TEXT("conversation_initiation_metadata");
    static const FString AudioResponse           = TEXT("audio");
    // User speech-to-text transcript (speaker is always the user)
    static const FString UserTranscript          = TEXT("user_transcript");
    static const FString AgentResponse           = TEXT("agent_response");
    static const FString AgentResponseCorrection = TEXT("agent_response_correction");
    static const FString InterruptionEvent       = TEXT("interruption");
    static const FString PingEvent               = TEXT("ping");
    static const FString ClientToolCall          = TEXT("client_tool_call");
    static const FString InternalTentativeAgent  = TEXT("internal_tentative_agent_response");
}

// ─────────────────────────────────────────────────────────────────────────────
// Audio format exchanged with ElevenLabs
// PCM 16-bit signed, 16000 Hz, mono, little-endian.
// ─────────────────────────────────────────────────────────────────────────────
namespace ElevenLabsAudio
{
    static constexpr int32 SampleRate    = 16000;
    static constexpr int32 Channels      = 1;
    static constexpr int32 BitsPerSample = 16;
    // Chunk size sent per WebSocket frame: 100 ms of audio
    static constexpr int32 ChunkSamples  = SampleRate / 10; // 1600 samples = 3200 bytes
}

// ─────────────────────────────────────────────────────────────────────────────
// Conversation metadata received on successful connection
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsConversationInfo
{
    GENERATED_BODY()

    /** Unique ID of this conversation session, assigned by ElevenLabs. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString ConversationID;

    /** ID of the agent that is responding. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString AgentID;
};

// ─────────────────────────────────────────────────────────────────────────────
// Transcript segment
// ─────────────────────────────────────────────────────────────────────────────
USTRUCT(BlueprintType)
struct PS_AI_AGENT_ELEVENLABS_API FElevenLabsTranscriptSegment
{
    GENERATED_BODY()

    /** Transcribed text. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString Text;

    /** "user" or "agent". */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    FString Speaker;

    /** Whether this is a final transcript or a tentative (in-progress) one. */
    UPROPERTY(BlueprintReadOnly, Category = "ElevenLabs")
    bool bIsFinal = false;
};
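The `ElevenLabsAudio` constants above pin down the wire format, and the arithmetic in the `ChunkSamples` comment checks out: at 16000 Hz, a 100 ms chunk is 1600 samples, and at 16 bits (2 bytes) per mono sample that is 3200 bytes per WebSocket frame. A standalone restatement of that bookkeeping, including the Base64 expansion the proxy applies before sending (plain C++ mirror of the namespace, not engine code):

```cpp
#include <cstdint>

namespace ElevenLabsAudio
{
    constexpr int32_t SampleRate    = 16000; // Hz
    constexpr int32_t Channels      = 1;     // mono
    constexpr int32_t BitsPerSample = 16;
    constexpr int32_t ChunkSamples  = SampleRate / 10; // 100 ms of audio

    // Bytes occupied by one chunk on the wire (before Base64 encoding).
    constexpr int32_t ChunkBytes = ChunkSamples * Channels * (BitsPerSample / 8);

    // Base64 expands each 3 input bytes into 4 output characters (with '=' padding).
    constexpr int32_t ChunkBase64Chars = ((ChunkBytes + 2) / 3) * 4;
}
```

So each JSON audio message carries roughly 4.3 KB of Base64 text per 100 ms of speech, which is comfortably small for a WebSocket text frame.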
@ -0,0 +1,73 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "AudioCapture.h"
#include "ElevenLabsMicrophoneCaptureComponent.generated.h"

// Delivers captured float PCM samples (16000 Hz mono, resampled from the device rate).
DECLARE_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioCaptured, const TArray<float>& /*FloatPCM*/);

/**
 * Lightweight microphone capture component.
 * Captures from the default audio input device, resamples to 16000 Hz mono,
 * and delivers chunks via FOnElevenLabsAudioCaptured.
 *
 * Modelled after Convai's ConvaiAudioCaptureComponent but stripped to the
 * minimal functionality needed for the ElevenLabs Conversational AI API.
 */
UCLASS(ClassGroup = "ElevenLabs",
    meta = (BlueprintSpawnableComponent, DisplayName = "ElevenLabs Microphone Capture"))
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsMicrophoneCaptureComponent : public UActorComponent
{
    GENERATED_BODY()

public:
    UElevenLabsMicrophoneCaptureComponent();

    /** Volume multiplier applied to captured samples before forwarding. */
    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "ElevenLabs|Microphone",
        meta = (ClampMin = "0.0", ClampMax = "4.0"))
    float VolumeMultiplier = 1.0f;

    /**
     * Delegate fired on the game thread each time a new chunk of PCM audio
     * is captured. Samples are float32, resampled to 16000 Hz mono.
     */
    FOnElevenLabsAudioCaptured OnAudioCaptured;

    /** Open the default capture device and begin streaming audio. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StartCapture();

    /** Stop streaming and close the capture device. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void StopCapture();

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsCapturing() const { return bCapturing; }

    // ─────────────────────────────────────────────────────────────────────────
    // UActorComponent overrides
    // ─────────────────────────────────────────────────────────────────────────
    virtual void EndPlay(const EEndPlayReason::Type EndPlayReason) override;

private:
    /** Called by the audio capture callback on a background thread. Raw void* per the UE 5.3+ API. */
    void OnAudioGenerate(const void* InAudio, int32 NumFrames,
        int32 InNumChannels, int32 InSampleRate, double StreamTime, bool bOverflow);

    /** Simple linear resample from InSampleRate to 16000 Hz. Input is float32 frames. */
    static TArray<float> ResampleTo16000(const float* InAudio, int32 NumFrames,
        int32 InChannels, int32 InSampleRate);

    Audio::FAudioCapture AudioCapture;
    Audio::FAudioCaptureDeviceParams DeviceParams;
    bool bCapturing = false;

    // Device sample rate discovered on StartCapture
    int32 DeviceSampleRate = 44100;
    int32 DeviceChannels = 1;
};
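`ResampleTo16000` above is declared but its body is not part of this diff. A linear-interpolation resampler of the kind the doc comment describes can be sketched without any engine types (a hypothetical mono-only standalone version; the real member function also downmixes multi-channel input):

```cpp
#include <cstddef>
#include <vector>

// Linearly resample mono float32 PCM from InSampleRate to 16000 Hz.
std::vector<float> ResampleTo16000(const std::vector<float>& In, int InSampleRate)
{
    const int OutRate = 16000;
    if (InSampleRate == OutRate || In.empty())
    {
        return In; // already at the target rate, nothing to do
    }
    // How far the read cursor advances in the input per output sample.
    const double Step = static_cast<double>(InSampleRate) / OutRate;
    const std::size_t OutCount =
        static_cast<std::size_t>(In.size() * static_cast<double>(OutRate) / InSampleRate);

    std::vector<float> Out;
    Out.reserve(OutCount);
    for (std::size_t i = 0; i < OutCount; ++i)
    {
        const double Pos      = i * Step;
        const std::size_t Idx = static_cast<std::size_t>(Pos);
        const double Frac     = Pos - Idx;
        const float A = In[Idx];
        const float B = (Idx + 1 < In.size()) ? In[Idx + 1] : A; // clamp at the end
        Out.push_back(static_cast<float>(A + (B - A) * Frac));   // linear interpolation
    }
    return Out;
}
```

Linear interpolation is audibly adequate for 16 kHz speech input; a production path could swap in a windowed-sinc resampler without changing the interface.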
@ -0,0 +1,186 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "UObject/NoExportTypes.h"
#include "ElevenLabsDefinitions.h"
#include "IWebSocket.h"
#include "ElevenLabsWebSocketProxy.generated.h"

class FJsonObject;

// ─────────────────────────────────────────────────────────────────────────────
// Delegates (all Blueprint-assignable)
// ─────────────────────────────────────────────────────────────────────────────

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsConnected,
    const FElevenLabsConversationInfo&, ConversationInfo);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_TwoParams(FOnElevenLabsDisconnected,
    int32, StatusCode, const FString&, Reason);

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsError,
    const FString&, ErrorMessage);

/** Fired when a PCM audio chunk arrives from the agent. Raw bytes, 16-bit signed, 16 kHz, mono. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAudioReceived,
    const TArray<uint8>&, PCMData);

/** Fired for user or agent transcript segments. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsTranscript,
    const FElevenLabsTranscriptSegment&, Segment);

/** Fired with the final text response from the agent. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(FOnElevenLabsAgentResponse,
    const FString&, ResponseText);

/** Fired when the agent is interrupted by the user. */
DECLARE_DYNAMIC_MULTICAST_DELEGATE(FOnElevenLabsInterrupted);


// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Proxy
// Manages the lifecycle of a single ElevenLabs Conversational AI WebSocket session.
// Instantiate via UElevenLabsConversationalAgentComponent (the component manages
// one proxy at a time), or create manually through Blueprints.
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(BlueprintType, Blueprintable)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsWebSocketProxy : public UObject
{
    GENERATED_BODY()

public:
    // ── Events ────────────────────────────────────────────────────────────────

    /** Called once the WebSocket handshake succeeds and the agent sends its initiation metadata. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsConnected OnConnected;

    /** Called when the WebSocket closes (gracefully or by the remote end). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsDisconnected OnDisconnected;

    /** Called on any connection or protocol error. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsError OnError;

    /** Raw PCM audio coming from the agent — feed this into your audio component. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsAudioReceived OnAudioReceived;

    /** User or agent transcript (may be tentative while the conversation is ongoing). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsTranscript OnTranscript;

    /** Final text response from the agent (complements the audio). */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsAgentResponse OnAgentResponse;

    /** The agent was interrupted by new user speech. */
    UPROPERTY(BlueprintAssignable, Category = "ElevenLabs|Events")
    FOnElevenLabsInterrupted OnInterrupted;

    // ── Lifecycle ─────────────────────────────────────────────────────────────

    /**
     * Open a WebSocket connection to ElevenLabs.
     * Uses settings from Project Settings unless overridden by the parameters.
     *
     * @param AgentID ElevenLabs agent ID. Overrides the project-level default when non-empty.
     * @param APIKey  API key. Overrides the project-level default when non-empty.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void Connect(const FString& AgentID = TEXT(""), const FString& APIKey = TEXT(""));

    /**
     * Gracefully close the WebSocket connection.
     * OnDisconnected fires after the server acknowledges the close.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void Disconnect();

    /** Current connection state. */
    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    EElevenLabsConnectionState GetConnectionState() const { return ConnectionState; }

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    bool IsConnected() const { return ConnectionState == EElevenLabsConnectionState::Connected; }

    // ── Audio sending ─────────────────────────────────────────────────────────

    /**
     * Send a chunk of raw PCM audio to ElevenLabs.
     * Audio must be 16-bit signed, 16000 Hz, mono, little-endian.
     * The data is Base64-encoded and sent as a JSON message.
     * Call this repeatedly while the microphone is capturing.
     *
     * @param PCMData Raw PCM bytes (16-bit LE, 16 kHz, mono).
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendAudioChunk(const TArray<uint8>& PCMData);

    // ── Turn control (only relevant in Client turn mode) ──────────────────────

    /**
     * Signal that the user is actively speaking (Client turn mode).
     * Sends a { "type": "user_activity" } message to the server.
     * Call this periodically while the user is speaking (e.g. on every audio chunk).
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendUserTurnStart();

    /**
     * Signal that the user has finished speaking (Client turn mode).
     * No explicit API message — simply stop sending user_activity;
     * the server detects silence and hands the turn to the agent.
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendUserTurnEnd();

    /**
     * Send a text message to the agent (no microphone needed).
     * Useful for testing or text-only interaction.
     * Sends: { "type": "user_message", "text": "..." }
     */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendTextMessage(const FString& Text);

    /** Ask the agent to stop the current utterance. */
    UFUNCTION(BlueprintCallable, Category = "ElevenLabs")
    void SendInterrupt();

    // ── Info ──────────────────────────────────────────────────────────────────

    UFUNCTION(BlueprintPure, Category = "ElevenLabs")
    FElevenLabsConversationInfo GetConversationInfo() const { return ConversationInfo; }

    // ─────────────────────────────────────────────────────────────────────────
    // Internal
    // ─────────────────────────────────────────────────────────────────────────
private:
    void OnWsConnected();
    void OnWsConnectionError(const FString& Error);
    void OnWsClosed(int32 StatusCode, const FString& Reason, bool bWasClean);
    void OnWsMessage(const FString& Message);
    void OnWsBinaryMessage(const void* Data, SIZE_T Size, SIZE_T BytesRemaining);

    void HandleConversationInitiation(const TSharedPtr<FJsonObject>& Payload);
    void HandleAudioResponse(const TSharedPtr<FJsonObject>& Payload);
    void HandleTranscript(const TSharedPtr<FJsonObject>& Payload);
    void HandleAgentResponse(const TSharedPtr<FJsonObject>& Payload);
    void HandleInterruption(const TSharedPtr<FJsonObject>& Payload);
    void HandlePing(const TSharedPtr<FJsonObject>& Payload);

    /** Build and send a JSON text frame to the server. */
    void SendJsonMessage(const TSharedPtr<FJsonObject>& JsonObj);

    /** Resolve the WebSocket URL from settings / parameters. */
    FString BuildWebSocketURL(const FString& AgentID, const FString& APIKey) const;

    TSharedPtr<IWebSocket> WebSocket;
    EElevenLabsConnectionState ConnectionState = EElevenLabsConnectionState::Disconnected;
    FElevenLabsConversationInfo ConversationInfo;

    // Accumulation buffer for multi-fragment binary WebSocket frames.
    // ElevenLabs sends JSON as binary frames; large messages arrive in fragments.
    TArray<uint8> BinaryFrameBuffer;
};
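`SendAudioChunk` above Base64-encodes the PCM bytes and wraps them in a JSON text frame keyed by the `AudioChunk` constant (`user_audio_chunk`) from `ElevenLabsDefinitions.h`. The shape of that frame can be sketched with a minimal standalone encoder (the field layout is an assumption drawn from the message-type constants; the real proxy builds its JSON with FJsonObject):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard Base64 encoding (RFC 4648 alphabet, '=' padding).
std::string Base64Encode(const std::vector<uint8_t>& Data)
{
    static const char* Table =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string Out;
    Out.reserve(((Data.size() + 2) / 3) * 4);
    for (std::size_t i = 0; i < Data.size(); i += 3)
    {
        // Pack up to 3 bytes into a 24-bit block, then emit four 6-bit symbols.
        uint32_t Block = static_cast<uint32_t>(Data[i]) << 16;
        if (i + 1 < Data.size()) Block |= static_cast<uint32_t>(Data[i + 1]) << 8;
        if (i + 2 < Data.size()) Block |= Data[i + 2];
        Out += Table[(Block >> 18) & 0x3F];
        Out += Table[(Block >> 12) & 0x3F];
        Out += (i + 1 < Data.size()) ? Table[(Block >> 6) & 0x3F] : '=';
        Out += (i + 2 < Data.size()) ? Table[Block & 0x3F] : '=';
    }
    return Out;
}

// Build the JSON text frame carrying one audio chunk.
std::string BuildAudioChunkMessage(const std::vector<uint8_t>& PCMData)
{
    return "{\"user_audio_chunk\":\"" + Base64Encode(PCMData) + "\"}";
}
```

Because Base64 output is plain ASCII, the resulting message needs no further escaping before being handed to `IWebSocket::Send`.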
@ -0,0 +1,99 @@
// Copyright ASTERION. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Modules/ModuleManager.h"
#include "PS_AI_Agent_ElevenLabs.generated.h"

// ─────────────────────────────────────────────────────────────────────────────
// Settings object – exposed in Project Settings → Plugins → ElevenLabs AI Agent
// ─────────────────────────────────────────────────────────────────────────────
UCLASS(config = Engine, defaultconfig)
class PS_AI_AGENT_ELEVENLABS_API UElevenLabsSettings : public UObject
{
    GENERATED_BODY()

public:
    UElevenLabsSettings(const FObjectInitializer& ObjectInitializer)
        : Super(ObjectInitializer)
    {
        API_Key = TEXT("");
        AgentID = TEXT("");
        bSignedURLMode = false;
    }

    /**
     * ElevenLabs API key.
     * Obtain it from https://elevenlabs.io – used to authenticate WebSocket connections.
     * Keep this secret; do not hard-code the key into a shipping build.
     */
    UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
    FString API_Key;

    /**
     * The default ElevenLabs Conversational Agent ID to use when none is specified
     * on the component. Create agents at https://elevenlabs.io/app/conversational-ai
     */
    UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API")
    FString AgentID;

    /**
     * When true, the plugin fetches a signed WebSocket URL from your own backend
     * before connecting, so the API key is never exposed in the client.
     * Set SignedURLEndpoint to the server that returns the signed URL.
     */
    UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API|Security")
    bool bSignedURLMode;

    /**
     * Your backend endpoint that returns a signed WebSocket URL for ElevenLabs.
     * Only used when bSignedURLMode = true.
     * Expected response body: { "signed_url": "wss://..." }
     */
    UPROPERTY(Config, EditAnywhere, Category = "ElevenLabs API|Security",
        meta = (EditCondition = "bSignedURLMode"))
    FString SignedURLEndpoint;

    /**
     * Override the ElevenLabs WebSocket base URL. Leave empty to use the default:
     * wss://api.elevenlabs.io/v1/convai/conversation
     */
    UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
    FString CustomWebSocketURL;

    /** Log verbose WebSocket messages to the Output Log (useful during development). */
    UPROPERTY(Config, EditAnywhere, AdvancedDisplay, Category = "ElevenLabs API")
    bool bVerboseLogging = false;
};


// ─────────────────────────────────────────────────────────────────────────────
// Module
// ─────────────────────────────────────────────────────────────────────────────
class PS_AI_AGENT_ELEVENLABS_API FPS_AI_Agent_ElevenLabsModule : public IModuleInterface
{
public:
    /** IModuleInterface implementation */
    virtual void StartupModule() override;
    virtual void ShutdownModule() override;

    virtual bool IsGameModule() const override { return true; }

    /** Singleton access */
    static inline FPS_AI_Agent_ElevenLabsModule& Get()
    {
        return FModuleManager::LoadModuleChecked<FPS_AI_Agent_ElevenLabsModule>("PS_AI_Agent_ElevenLabs");
    }

    static inline bool IsAvailable()
    {
        return FModuleManager::Get().IsModuleLoaded("PS_AI_Agent_ElevenLabs");
    }

    /** Access the settings object at runtime */
    UElevenLabsSettings* GetSettings() const;

private:
    UElevenLabsSettings* Settings = nullptr;
};
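The proxy's `BuildWebSocketURL` resolves these settings into the final connection URL. Assuming the conventional query-parameter form for the agent ID (`?agent_id=...`), the resolution order can be sketched as a standalone function (hypothetical; the API key would travel via a request header or the signed-URL flow described above, never in the query string):

```cpp
#include <string>

// Resolve the WebSocket URL from the plugin settings.
// An empty CustomBaseURL falls back to the default documented on CustomWebSocketURL.
std::string BuildWebSocketURL(const std::string& CustomBaseURL, const std::string& AgentID)
{
    const std::string Base = CustomBaseURL.empty()
        ? "wss://api.elevenlabs.io/v1/convai/conversation"
        : CustomBaseURL;
    return Base + "?agent_id=" + AgentID;
}
```

When `bSignedURLMode` is enabled, this step is skipped entirely: the backend's `signed_url` response is used verbatim as the connection target.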