# ElevenLabs Conversational AI - C++ Implementation

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![C++17](https://img.shields.io/badge/C%2B%2B-17-blue.svg)](https://en.wikipedia.org/wiki/C%2B%2B17) [![CMake](https://img.shields.io/badge/CMake-3.14+-green.svg)](https://cmake.org/)

A C++ implementation of the ElevenLabs Conversational AI client.

## Features

- **Real-time Audio Processing**: Full-duplex audio streaming with low-latency playback
- **WebSocket Integration**: Secure WSS connection to the ElevenLabs Conversational AI platform
- **Cross-platform Audio**: PortAudio-based implementation supporting Windows, macOS, and Linux
- **Echo Suppression**: Built-in acoustic feedback prevention
- **Modern C++**: Clean, maintainable C++17 codebase with proper RAII and exception handling
- **Flexible Architecture**: Modular design allowing easy customization and extension

## Architecture

```mermaid
graph TB
    subgraph "User Interface"
        A[main.cpp] --> B[Conversation]
    end
    subgraph "Core Components"
        B --> C[DefaultAudioInterface]
        B --> D[WebSocket Client]
        C --> E[PortAudio]
        D --> F[Boost.Beast + OpenSSL]
    end
    subgraph "ElevenLabs Platform"
        F --> G[WSS API Endpoint]
        G --> H[Conversational AI Agent]
    end
    subgraph "Audio Flow"
        I[Microphone] --> C
        C --> J[Base64 Encoding]
        J --> D
        D --> K[Audio Events]
        K --> L[Base64 Decoding]
        L --> C
        C --> M[Speakers]
    end
    subgraph "Message Types"
        N[user_audio_chunk]
        O[agent_response]
        P[user_transcript]
        Q[audio_event]
        R[ping/pong]
    end
    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style H fill:#fff3e0
```

## Quick Start

### Prerequisites

- **C++17 compatible compiler**: GCC 11+, Clang 14+, or MSVC 2022+
- **CMake** 3.14 or higher
- **Dependencies** (install via package manager):

#### macOS (Homebrew)

```bash
brew install boost openssl portaudio nlohmann-json cmake pkg-config
```

#### Ubuntu/Debian

```bash
sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libboost-system-dev libboost-thread-dev
sudo apt install libssl-dev portaudio19-dev nlohmann-json3-dev
```

#### Windows (vcpkg)

```bash
vcpkg install boost-system boost-thread openssl portaudio nlohmann-json
```

### Building

```bash
# Clone the repository
git clone https://github.com/Jitendra2603/elevenlabs-convai-cpp.git
cd elevenlabs-convai-cpp

# Build the project
mkdir build && cd build
cmake ..
cmake --build . --config Release
```

### Running

```bash
# Set your agent ID (get this from the ElevenLabs dashboard)
export AGENT_ID="your-agent-id-here"

# Run the demo
./convai_cpp
```

The application will:

1. Connect to your ElevenLabs Conversational AI agent
2. Start capturing audio from your default microphone
3. Stream audio to the agent and play responses through speakers
4. Display conversation transcripts in the terminal
5. Continue until you press Enter to quit

## 📋 Usage Examples

### Basic Conversation

```bash
export AGENT_ID="agent_"
./convai_cpp
# Speak into your microphone and hear the AI agent respond
```

## Configuration

### Audio Settings

The audio interface is configured for optimal real-time performance:

- **Sample Rate**: 16 kHz
- **Format**: 16-bit PCM mono
- **Input Buffer**: 250 ms (4000 frames)
- **Output Buffer**: 62.5 ms (1000 frames)

### WebSocket Connection

- **Endpoint**: `wss://api.elevenlabs.io/v1/convai/conversation`
- **Protocol**: WebSocket Secure (WSS) with TLS 1.2+
- **Authentication**: Optional (required for private agents)

## Project Structure

```
elevenlabs-convai-cpp/
├── CMakeLists.txt                # Build configuration
├── README.md                     # This file
├── LICENSE                       # MIT license
├── CONTRIBUTING.md               # Contribution guidelines
├── .gitignore                    # Git ignore rules
├── include/                      # Header files
│   ├── AudioInterface.hpp        # Abstract audio interface
│   ├── DefaultAudioInterface.hpp # PortAudio implementation
│   └── Conversation.hpp          # Main conversation handler
└── src/                          # Source files
    ├── main.cpp                  # Demo application
    ├── Conversation.cpp          # WebSocket and message handling
    └── DefaultAudioInterface.cpp # Audio I/O implementation
```

## Technical Details

### Audio Processing Pipeline

1. **Capture**: PortAudio captures 16-bit PCM audio at 16 kHz
2. **Encoding**: Raw audio is base64-encoded for WebSocket transmission
3. **Streaming**: Audio chunks are sent as `user_audio_chunk` messages
4. **Reception**: The server sends `audio_event` messages with agent responses
5. **Decoding**: Base64 audio data is decoded back to PCM
6. **Playback**: Audio is queued and played through the PortAudio output stream

### Echo Suppression

The implementation includes a simple but effective echo-suppression mechanism:

- Microphone input is suppressed during agent speech playback
- Prevents acoustic feedback loops that cause the agent to respond to itself
- Uses atomic flags for thread-safe coordination between input/output

### WebSocket Message Handling

Supported message types:

- `conversation_initiation_client_data` - Session initialization
- `user_audio_chunk` - Microphone audio data
- `audio_event` - Agent speech audio
- `agent_response` - Agent text responses
- `user_transcript` - Speech-to-text results
- `ping`/`pong` - Connection keepalive

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.