# ElevenLabs Conversational AI - C++ Implementation
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![C++17](https://img.shields.io/badge/C%2B%2B-17-blue.svg)](https://en.wikipedia.org/wiki/C%2B%2B17)
[![CMake](https://img.shields.io/badge/CMake-3.14+-green.svg)](https://cmake.org/)
A C++17 client for the ElevenLabs Conversational AI platform, supporting real-time voice conversations with AI agents.
## Features
- **Real-time Audio Processing**: Full-duplex audio streaming with low-latency playback
- **WebSocket Integration**: Secure WSS connection to ElevenLabs Conversational AI platform
- **Cross-platform Audio**: PortAudio-based implementation supporting Windows, macOS, and Linux
- **Echo Suppression**: Built-in acoustic feedback prevention
- **Modern C++**: Clean, maintainable C++17 codebase with proper RAII and exception handling
- **Flexible Architecture**: Modular design allowing easy customization and extension
## Architecture
```mermaid
graph TB
subgraph "User Interface"
A[main.cpp] --> B[Conversation]
end
subgraph "Core Components"
B --> C[DefaultAudioInterface]
B --> D[WebSocket Client]
C --> E[PortAudio]
D --> F[Boost.Beast + OpenSSL]
end
subgraph "ElevenLabs Platform"
F --> G[WSS API Endpoint]
G --> H[Conversational AI Agent]
end
subgraph "Audio Flow"
I[Microphone] --> C
C --> J[Base64 Encoding]
J --> D
D --> K[Audio Events]
K --> L[Base64 Decoding]
L --> C
C --> M[Speakers]
end
subgraph "Message Types"
N[user_audio_chunk]
O[agent_response]
P[user_transcript]
Q[audio_event]
R[ping/pong]
end
style B fill:#e1f5fe
style C fill:#f3e5f5
style D fill:#e8f5e8
style H fill:#fff3e0
```
## Quick Start
### Prerequisites
- **C++17 compatible compiler**: GCC 11+, Clang 14+, or MSVC 2022+
- **CMake** 3.14 or higher
- **Dependencies** (install via package manager):
#### macOS (Homebrew)
```bash
brew install boost openssl portaudio nlohmann-json cmake pkg-config
```
#### Ubuntu/Debian
```bash
sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libboost-system-dev libboost-thread-dev
sudo apt install libssl-dev portaudio19-dev nlohmann-json3-dev
```
#### Windows (vcpkg)
```bash
vcpkg install boost-system boost-thread openssl portaudio nlohmann-json
```
### Building
```bash
# Clone the repository
git clone https://github.com/Jitendra2603/elevenlabs-convai-cpp.git
cd elevenlabs-convai-cpp
# Build the project
mkdir build && cd build
cmake ..
cmake --build . --config Release
```
### Running
```bash
# Set your agent ID (get this from ElevenLabs dashboard)
export AGENT_ID="your-agent-id-here"
# Run the demo
./convai_cpp
```
The application will:
1. Connect to your ElevenLabs Conversational AI agent
2. Start capturing audio from your default microphone
3. Stream audio to the agent and play responses through speakers
4. Display conversation transcripts in the terminal
5. Continue until you press Enter to quit
## Usage Examples
### Basic Conversation
```bash
export AGENT_ID="agent_"
./convai_cpp
# Speak into your microphone and hear the AI agent respond
```
## Configuration
### Audio Settings
The audio interface is configured for optimal real-time performance:
- **Sample Rate**: 16 kHz
- **Format**: 16-bit PCM mono
- **Input Buffer**: 250ms (4000 frames)
- **Output Buffer**: 62.5ms (1000 frames)
### WebSocket Connection
- **Endpoint**: `wss://api.elevenlabs.io/v1/convai/conversation`
- **Protocol**: WebSocket Secure (WSS) with TLS 1.2+
- **Authentication**: Optional (required for private agents)
## Project Structure
```
elevenlabs-convai-cpp/
├── CMakeLists.txt # Build configuration
├── README.md # This file
├── LICENSE # MIT license
├── CONTRIBUTING.md # Contribution guidelines
├── .gitignore # Git ignore rules
├── include/ # Header files
│ ├── AudioInterface.hpp # Abstract audio interface
│ ├── DefaultAudioInterface.hpp # PortAudio implementation
│ └── Conversation.hpp # Main conversation handler
└── src/ # Source files
├── main.cpp # Demo application
├── Conversation.cpp # WebSocket and message handling
└── DefaultAudioInterface.cpp # Audio I/O implementation
```
## Technical Details
### Audio Processing Pipeline
1. **Capture**: PortAudio captures 16-bit PCM audio at 16kHz
2. **Encoding**: Raw audio is base64-encoded for WebSocket transmission
3. **Streaming**: Audio chunks sent as `user_audio_chunk` messages
4. **Reception**: Server sends `audio_event` messages with agent responses
5. **Decoding**: Base64 audio data decoded back to PCM
6. **Playback**: Audio queued and played through PortAudio output stream
### Echo Suppression
The implementation includes a simple, effective echo suppression mechanism:
- Microphone input is suppressed during agent speech playback
- Prevents acoustic feedback loops that cause the agent to respond to itself
- Uses atomic flags for thread-safe coordination between input/output
### WebSocket Message Handling
Supported message types:
- `conversation_initiation_client_data` - Session initialization
- `user_audio_chunk` - Microphone audio data
- `audio_event` - Agent speech audio
- `agent_response` - Agent text responses
- `user_transcript` - Speech-to-text results
- `ping`/`pong` - Connection keepalive
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.