# ElevenLabs Conversational AI - C++ Implementation

[MIT License](https://opensource.org/licenses/MIT) [C++17](https://en.wikipedia.org/wiki/C%2B%2B17) [CMake](https://cmake.org/)

A C++ implementation of an ElevenLabs Conversational AI client.

## Features

- **Real-time Audio Processing**: Full-duplex audio streaming with low-latency playback
- **WebSocket Integration**: Secure WSS connection to ElevenLabs Conversational AI platform
- **Cross-platform Audio**: PortAudio-based implementation supporting Windows, macOS, and Linux
- **Echo Suppression**: Built-in acoustic feedback prevention
- **Modern C++**: Clean, maintainable C++17 codebase with proper RAII and exception handling
- **Flexible Architecture**: Modular design allowing easy customization and extension

## Architecture

```mermaid
graph TB
    subgraph "User Interface"
        A[main.cpp] --> B[Conversation]
    end

    subgraph "Core Components"
        B --> C[DefaultAudioInterface]
        B --> D[WebSocket Client]
        C --> E[PortAudio]
        D --> F[Boost.Beast + OpenSSL]
    end

    subgraph "ElevenLabs Platform"
        F --> G[WSS API Endpoint]
        G --> H[Conversational AI Agent]
    end

    subgraph "Audio Flow"
        I[Microphone] --> C
        C --> J[Base64 Encoding]
        J --> D
        D --> K[Audio Events]
        K --> L[Base64 Decoding]
        L --> C
        C --> M[Speakers]
    end

    subgraph "Message Types"
        N[user_audio_chunk]
        O[agent_response]
        P[user_transcript]
        Q[audio_event]
        R[ping/pong]
    end

    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style H fill:#fff3e0
```

## Quick Start

### Prerequisites

- **C++17 compatible compiler**: GCC 11+, Clang 14+, or MSVC 2022+
- **CMake** 3.14 or higher
- **Dependencies** (install via package manager):

#### macOS (Homebrew)
```bash
brew install boost openssl portaudio nlohmann-json cmake pkg-config
```

#### Ubuntu/Debian
```bash
sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libboost-system-dev libboost-thread-dev
sudo apt install libssl-dev portaudio19-dev nlohmann-json3-dev
```

#### Windows (vcpkg)
```bash
vcpkg install boost-system boost-thread openssl portaudio nlohmann-json
```

### Building

```bash
# Clone the repository
git clone https://github.com/Jitendra2603/elevenlabs-convai-cpp.git
cd elevenlabs-convai-cpp

# Build the project
mkdir build && cd build
cmake ..
cmake --build . --config Release
```

### Running

```bash
# Set your agent ID (available in the ElevenLabs dashboard)
export AGENT_ID="your-agent-id-here"

# Run the demo
./convai_cpp
```

The application will:
1. Connect to your ElevenLabs Conversational AI agent
2. Start capturing audio from your default microphone
3. Stream audio to the agent and play its responses through your speakers
4. Display conversation transcripts in the terminal
5. Keep running until you press Enter to quit

## 📋 Usage Examples

### Basic Conversation
```bash
export AGENT_ID="agent_"
./convai_cpp
# Speak into your microphone and hear the AI agent respond
```

## Configuration

### Audio Settings

The audio interface uses fixed settings chosen for low-latency, real-time operation:

- **Sample Rate**: 16 kHz
- **Format**: 16-bit PCM mono
- **Input Buffer**: 250 ms (4000 frames)
- **Output Buffer**: 62.5 ms (1000 frames)

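
The buffer durations above follow directly from the frame counts and the sample rate. The following is a minimal sketch of that arithmetic; the constant and function names are hypothetical and are not taken from the project's headers.

```cpp
#include <cstddef>
#include <cstdint>

// Values from the settings list above. All names here are illustrative only.
constexpr int kSampleRateHz = 16000;          // 16 kHz
constexpr int kChannels = 1;                  // mono
constexpr int kInputFramesPerBuffer = 4000;   // microphone side
constexpr int kOutputFramesPerBuffer = 1000;  // speaker side
using Sample = std::int16_t;                  // 16-bit PCM

// Duration of a buffer in milliseconds: frames / rate * 1000.
constexpr double bufferMs(int frames) {
    return 1000.0 * frames / kSampleRateHz;
}

// Size of a buffer in bytes: frames * channels * bytes per sample.
constexpr std::size_t bufferBytes(int frames) {
    return static_cast<std::size_t>(frames) * kChannels * sizeof(Sample);
}

// Compile-time checks that the documented figures are consistent.
static_assert(bufferMs(kInputFramesPerBuffer) == 250.0, "input buffer is 250 ms");
static_assert(bufferMs(kOutputFramesPerBuffer) == 62.5, "output buffer is 62.5 ms");
static_assert(bufferBytes(kInputFramesPerBuffer) == 8000, "4000 mono 16-bit frames");
static_assert(bufferBytes(kOutputFramesPerBuffer) == 2000, "1000 mono 16-bit frames");
```

The larger input buffer means fewer, bigger microphone chunks are sent upstream, while the smaller output buffer keeps playback latency low.
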
### WebSocket Connection

- **Endpoint**: `wss://api.elevenlabs.io/v1/convai/conversation`
- **Protocol**: WebSocket Secure (WSS) with TLS 1.2+
- **Authentication**: Optional for public agents; required for private agents

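
As a rough illustration of how the endpoint and the `AGENT_ID` environment variable might be combined, here is a small sketch. It assumes the agent is selected with an `agent_id` query parameter, which this README does not state, and the function names are hypothetical rather than part of the project.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Endpoint as listed above.
const std::string kConvaiEndpoint =
    "wss://api.elevenlabs.io/v1/convai/conversation";

// Reads AGENT_ID from the environment; returns an empty string if it is unset.
std::string agentIdFromEnv() {
    const char* value = std::getenv("AGENT_ID");
    return value != nullptr ? std::string(value) : std::string();
}

// Builds the connection URL.
// ASSUMPTION: the agent is passed as an `agent_id` query parameter.
// The ID is appended verbatim, so it must not contain characters that
// would need URL encoding.
std::string buildConversationUrl(const std::string& agentId) {
    assert(!agentId.empty() && "an agent ID is required");
    return kConvaiEndpoint + "?agent_id=" + agentId;
}
```

A caller would typically check that `agentIdFromEnv()` is non-empty and print a helpful message before attempting to connect.
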
## Project Structure

```
elevenlabs-convai-cpp/
├── CMakeLists.txt                  # Build configuration
├── README.md                       # This file
├── LICENSE                         # MIT license
├── CONTRIBUTING.md                 # Contribution guidelines
├── .gitignore                      # Git ignore rules
├── include/                        # Header files
│   ├── AudioInterface.hpp          # Abstract audio interface
│   ├── DefaultAudioInterface.hpp   # PortAudio implementation
│   └── Conversation.hpp            # Main conversation handler
└── src/                            # Source files
    ├── main.cpp                    # Demo application
    ├── Conversation.cpp            # WebSocket and message handling
    └── DefaultAudioInterface.cpp   # Audio I/O implementation
```

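
The split between an abstract audio interface and a PortAudio implementation is what makes the audio backend replaceable. The sketch below shows one way such an abstraction could look, together with an in-memory backend that might be useful in tests. Every method name here is hypothetical; the actual declarations live in `include/AudioInterface.hpp` and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical shape of an abstract audio backend (not the project's real API).
class AudioInterface {
public:
    using InputCallback = std::function<void(const std::vector<std::int16_t>&)>;

    virtual ~AudioInterface() = default;
    virtual void start(InputCallback onInput) = 0;                // begin capture
    virtual void stop() = 0;                                      // end capture
    virtual void output(const std::vector<std::int16_t>& pcm) = 0;  // queue playback
    virtual void interrupt() = 0;                                 // drop queued playback
};

// In-memory backend: records "played" audio instead of touching hardware.
class RecordingAudioInterface : public AudioInterface {
public:
    void start(InputCallback onInput) override {
        assert(onInput && "input callback must be callable");
        onInput_ = std::move(onInput);
    }
    void stop() override { onInput_ = nullptr; }
    void output(const std::vector<std::int16_t>& pcm) override {
        played_.insert(played_.end(), pcm.begin(), pcm.end());
    }
    void interrupt() override { played_.clear(); }

    // Test helper: simulate a chunk arriving from the microphone.
    void feedInput(const std::vector<std::int16_t>& chunk) {
        if (onInput_) onInput_(chunk);
    }
    const std::vector<std::int16_t>& played() const { return played_; }

private:
    InputCallback onInput_;
    std::vector<std::int16_t> played_;
};
```

With a design along these lines, a conversation object can hold a reference to `AudioInterface` and stay unaware of whether PortAudio or a test double sits behind it.
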
## Technical Details

### Audio Processing Pipeline

1. **Capture**: PortAudio captures 16-bit PCM audio at 16 kHz
2. **Encoding**: Raw audio is base64-encoded for WebSocket transmission
3. **Streaming**: Audio chunks are sent as `user_audio_chunk` messages
4. **Reception**: The server sends `audio_event` messages containing agent responses
5. **Decoding**: Base64 audio data is decoded back to PCM
6. **Playback**: Audio is queued and played through the PortAudio output stream

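
The encoding and decoding steps above can be sketched without any external dependencies. This is an illustrative version, not the project's code: it assumes samples are serialized as little-endian bytes and that an outgoing frame has the shape `{"user_audio_chunk":"<base64>"}`. Neither detail is confirmed by this README, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Step 2: standard base64 with '=' padding.
std::string base64Encode(const std::vector<std::uint8_t>& data) {
    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < data.size(); i += 3) {
        const std::size_t n = data.size() - i;  // bytes left, at least 1
        std::uint32_t group = static_cast<std::uint32_t>(data[i]) << 16;
        if (n > 1) group |= static_cast<std::uint32_t>(data[i + 1]) << 8;
        if (n > 2) group |= data[i + 2];
        out += kAlphabet[(group >> 18) & 63];
        out += kAlphabet[(group >> 12) & 63];
        out += n > 1 ? kAlphabet[(group >> 6) & 63] : '=';
        out += n > 2 ? kAlphabet[group & 63] : '=';
    }
    return out;
}

// Step 5: inverse of base64Encode; stops at the first '=' padding character.
std::vector<std::uint8_t> base64Decode(const std::string& text) {
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;
    int bits = 0;
    for (char c : text) {
        if (c == '=') break;
        const char* pos = (c == '\0') ? nullptr : std::strchr(kAlphabet, c);
        if (pos == nullptr) throw std::invalid_argument("invalid base64 input");
        buffer = (buffer << 6) | static_cast<std::uint32_t>(pos - kAlphabet);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// Step 1 -> 2: 16-bit samples to bytes. ASSUMPTION: little-endian byte order.
std::vector<std::uint8_t> pcmToBytes(const std::vector<std::int16_t>& samples) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(samples.size() * 2);
    for (std::int16_t s : samples) {
        const auto u = static_cast<std::uint16_t>(s);
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return bytes;
}

// Step 5 -> 6: bytes back to 16-bit samples, ready to queue for playback.
std::vector<std::int16_t> bytesToPcm(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0 && "16-bit PCM needs an even byte count");
    std::vector<std::int16_t> samples;
    samples.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto u = static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        samples.push_back(static_cast<std::int16_t>(u));
    }
    return samples;
}

// Step 3: wrap a chunk in a JSON frame. ASSUMPTION: this exact frame shape.
// Base64 output never contains characters that need JSON escaping.
std::string makeUserAudioChunk(const std::vector<std::int16_t>& samples) {
    return "{\"user_audio_chunk\":\"" + base64Encode(pcmToBytes(samples)) + "\"}";
}
```

Base64 inflates the payload by roughly one third, so a 4000-frame input buffer (8000 bytes of PCM) becomes about 10.7 KB of text per message.
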
### Echo Suppression

The implementation includes a simple echo suppression mechanism:

- Microphone input is suppressed while agent speech is playing
- This prevents acoustic feedback loops that would cause the agent to respond to itself
- Atomic flags provide thread-safe coordination between the input and output paths

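
The gating idea described above can be sketched with a single `std::atomic<bool>`. This is a simplified illustration under the assumption that one flag is enough; the class and method names are hypothetical, and the project's actual implementation may coordinate playback state differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Drops microphone chunks while the agent is speaking (names are illustrative).
class EchoGate {
public:
    using Sink = std::function<void(const std::vector<std::int16_t>&)>;

    // Called from the playback side when agent audio starts or finishes.
    void setAgentSpeaking(bool speaking) {
        agentSpeaking_.store(speaking, std::memory_order_release);
    }

    // Called from the capture side for every microphone chunk.
    // Returns true if the chunk was forwarded, false if it was suppressed.
    bool forwardIfAllowed(const std::vector<std::int16_t>& chunk, const Sink& send) {
        assert(send && "sink must be callable");
        if (agentSpeaking_.load(std::memory_order_acquire)) {
            return false;  // agent audio is playing: discard to avoid feedback
        }
        send(chunk);
        return true;
    }

private:
    std::atomic<bool> agentSpeaking_{false};
};
```

Because the flag is atomic, the capture and playback threads can read and write it without a mutex. The trade-off of this approach is that the user cannot interrupt the agent by speaking while its audio is playing.
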
### WebSocket Message Handling

Supported message types:
- `conversation_initiation_client_data` - Session initialization
- `user_audio_chunk` - Microphone audio data
- `audio_event` - Agent speech audio
- `agent_response` - Agent text responses
- `user_transcript` - Speech-to-text results
- `ping`/`pong` - Connection keepalive

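
One common way to handle a set of message types like the one above is a small dispatcher that maps each message kind to a handler. The sketch below is illustrative only: it assumes the message kind has already been extracted from the incoming JSON, the class is hypothetical, and the pong frame shape `{"type":"pong","event_id":N}` is an assumption that this README does not confirm.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Routes incoming messages to handlers by kind (hypothetical helper class).
class MessageDispatcher {
public:
    using Handler = std::function<void(const std::string& payload)>;

    // Registers (or replaces) the handler for one message kind.
    void on(const std::string& kind, Handler handler) {
        assert(handler && "handler must be callable");
        handlers_[kind] = std::move(handler);
    }

    // Invokes the handler for `kind`; returns false if none is registered,
    // so unknown message types can be logged and ignored.
    bool dispatch(const std::string& kind, const std::string& payload) const {
        const auto it = handlers_.find(kind);
        if (it == handlers_.end()) return false;
        it->second(payload);
        return true;
    }

private:
    std::unordered_map<std::string, Handler> handlers_;
};

// Builds a keepalive reply for a ping with the given event ID.
// ASSUMPTION: the pong frame echoes the ping's event ID in this shape.
std::string makePong(long eventId) {
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(eventId) + "}";
}
```

Handlers for `agent_response` and `user_transcript` would typically print text to the terminal, while the `audio_event` handler would decode the audio and pass it to the audio interface.
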
## 📝 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.