# ElevenLabs Conversational AI - C++ Implementation

A C++ implementation of an ElevenLabs Conversational AI client.
## Features

- **Real-time Audio Processing**: Full-duplex audio streaming with low-latency playback
- **WebSocket Integration**: Secure WSS connection to the ElevenLabs Conversational AI platform
- **Cross-platform Audio**: PortAudio-based implementation supporting Windows, macOS, and Linux
- **Echo Suppression**: Built-in acoustic feedback prevention
- **Modern C++**: Clean, maintainable C++17 codebase with proper RAII and exception handling
- **Flexible Architecture**: Modular design allowing easy customization and extension
## Architecture

```mermaid
graph TB
    subgraph "User Interface"
        A[main.cpp] --> B[Conversation]
    end

    subgraph "Core Components"
        B --> C[DefaultAudioInterface]
        B --> D[WebSocket Client]
        C --> E[PortAudio]
        D --> F[Boost.Beast + OpenSSL]
    end

    subgraph "ElevenLabs Platform"
        F --> G[WSS API Endpoint]
        G --> H[Conversational AI Agent]
    end

    subgraph "Audio Flow"
        I[Microphone] --> C
        C --> J[Base64 Encoding]
        J --> D
        D --> K[Audio Events]
        K --> L[Base64 Decoding]
        L --> C
        C --> M[Speakers]
    end

    subgraph "Message Types"
        N[user_audio_chunk]
        O[agent_response]
        P[user_transcript]
        Q[audio_event]
        R[ping/pong]
    end

    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style H fill:#fff3e0
```
## Quick Start

### Prerequisites

- **C++17 compatible compiler**: GCC 11+, Clang 14+, or MSVC 2022+
- **CMake** 3.14 or higher
- **Dependencies** (install via package manager):

```bash
# macOS (Homebrew)
brew install boost openssl portaudio nlohmann-json cmake pkg-config

# Ubuntu/Debian
sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libboost-system-dev libboost-thread-dev
sudo apt install libssl-dev portaudio19-dev nlohmann-json3-dev

# Windows (vcpkg)
vcpkg install boost-system boost-thread openssl portaudio nlohmann-json
```
### Building

```bash
# Clone the repository
git clone https://github.com/Jitendra2603/elevenlabs-convai-cpp.git
cd elevenlabs-convai-cpp

# Build the project
mkdir build && cd build
cmake ..
cmake --build . --config Release
```
### Running

```bash
# Set your agent ID (get this from the ElevenLabs dashboard)
export AGENT_ID="your-agent-id-here"

# Run the demo
./convai_cpp
```

The application will:

- Connect to your ElevenLabs Conversational AI agent
- Start capturing audio from your default microphone
- Stream audio to the agent and play responses through your speakers
- Display conversation transcripts in the terminal
- Continue until you press Enter to quit
## 📋 Usage Examples

### Basic Conversation

```bash
export AGENT_ID="agent_"
./convai_cpp
# Speak into your microphone and hear the AI agent respond
```
## Configuration

### Audio Settings

The audio interface is configured for low-latency real-time performance:

- **Sample Rate**: 16 kHz
- **Format**: 16-bit PCM, mono
- **Input Buffer**: 250 ms (4000 frames)
- **Output Buffer**: 62.5 ms (1000 frames)
### WebSocket Connection

- **Endpoint**: `wss://api.elevenlabs.io/v1/convai/conversation`
- **Protocol**: WebSocket Secure (WSS) with TLS 1.2+
- **Authentication**: Optional (required for private agents)
## Project Structure

```
elevenlabs-convai-cpp/
├── CMakeLists.txt                 # Build configuration
├── README.md                      # This file
├── LICENSE                        # MIT license
├── CONTRIBUTING.md                # Contribution guidelines
├── .gitignore                     # Git ignore rules
├── include/                       # Header files
│   ├── AudioInterface.hpp         # Abstract audio interface
│   ├── DefaultAudioInterface.hpp  # PortAudio implementation
│   └── Conversation.hpp           # Main conversation handler
└── src/                           # Source files
    ├── main.cpp                   # Demo application
    ├── Conversation.cpp           # WebSocket and message handling
    └── DefaultAudioInterface.cpp  # Audio I/O implementation
```
## Technical Details

### Audio Processing Pipeline

- **Capture**: PortAudio captures 16-bit PCM audio at 16 kHz
- **Encoding**: Raw audio is base64-encoded for WebSocket transmission
- **Streaming**: Audio chunks are sent as `user_audio_chunk` messages
- **Reception**: The server sends `audio_event` messages with agent responses
- **Decoding**: Base64 audio data is decoded back to PCM
- **Playback**: Audio is queued and played through the PortAudio output stream
### Echo Suppression

The implementation includes a simple but effective echo suppression mechanism:

- Microphone input is suppressed while agent speech is playing
- This prevents acoustic feedback loops that would cause the agent to respond to itself
- Atomic flags provide thread-safe coordination between the input and output streams
### WebSocket Message Handling

Supported message types:

- `conversation_initiation_client_data` - Session initialization
- `user_audio_chunk` - Microphone audio data
- `audio_event` - Agent speech audio
- `agent_response` - Agent text responses
- `user_transcript` - Speech-to-text results
- `ping`/`pong` - Connection keepalive
## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.