2026-02-21 20:48:10 +01:00

ElevenLabs Conversational AI - C++ Implementation

License: MIT | C++17 | CMake

A C++ client implementation for the ElevenLabs Conversational AI platform.

Features

  • Real-time Audio Processing: Full-duplex audio streaming with low-latency playback
  • WebSocket Integration: Secure WSS connection to ElevenLabs Conversational AI platform
  • Cross-platform Audio: PortAudio-based implementation supporting Windows, macOS, and Linux
  • Echo Suppression: Built-in acoustic feedback prevention
  • Modern C++: Clean, maintainable C++17 codebase with proper RAII and exception handling
  • Flexible Architecture: Modular design allowing easy customization and extension

Architecture

graph TB
    subgraph "User Interface"
        A[main.cpp] --> B[Conversation]
    end
    
    subgraph "Core Components"
        B --> C[DefaultAudioInterface]
        B --> D[WebSocket Client]
        C --> E[PortAudio]
        D --> F[Boost.Beast + OpenSSL]
    end
    
    subgraph "ElevenLabs Platform"
        F --> G[WSS API Endpoint]
        G --> H[Conversational AI Agent]
    end
    
    subgraph "Audio Flow"
        I[Microphone] --> C
        C --> J[Base64 Encoding]
        J --> D
        D --> K[Audio Events]
        K --> L[Base64 Decoding]
        L --> C
        C --> M[Speakers]
    end
    
    subgraph "Message Types"
        N[user_audio_chunk]
        O[agent_response]
        P[user_transcript]
        Q[audio_event]
        R[ping/pong]
    end
    
    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style H fill:#fff3e0

Quick Start

Prerequisites

  • C++17 compatible compiler: GCC 11+, Clang 14+, or MSVC 2022+
  • CMake 3.14 or higher
  • Dependencies (install via package manager):

macOS (Homebrew)

brew install boost openssl portaudio nlohmann-json cmake pkg-config

Ubuntu/Debian

sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libboost-system-dev libboost-thread-dev
sudo apt install libssl-dev portaudio19-dev nlohmann-json3-dev

Windows (vcpkg)

vcpkg install boost-system boost-thread openssl portaudio nlohmann-json

Building

# Clone the repository
git clone https://github.com/Jitendra2603/elevenlabs-convai-cpp.git
cd elevenlabs-convai-cpp

# Build the project
mkdir build && cd build
cmake ..
cmake --build . --config Release

Running

# Set your agent ID (get this from ElevenLabs dashboard)
export AGENT_ID="your-agent-id-here"

# Run the demo
./convai_cpp

The application will:

  1. Connect to your ElevenLabs Conversational AI agent
  2. Start capturing audio from your default microphone
  3. Stream audio to the agent and play responses through speakers
  4. Display conversation transcripts in the terminal
  5. Continue until you press Enter to quit

Usage Examples

Basic Conversation

export AGENT_ID="agent_"
./convai_cpp
# Speak into your microphone and hear the AI agent respond

Configuration

Audio Settings

The audio interface uses fixed settings chosen for low-latency, real-time streaming:

  • Sample Rate: 16 kHz
  • Format: 16-bit PCM mono
  • Input Buffer: 250ms (4000 frames)
  • Output Buffer: 62.5ms (1000 frames)
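
The buffer durations above follow directly from the frame counts and the sample rate. A minimal sketch of that arithmetic is shown here; the constant and function names are illustrative and are not taken from the project source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values copied from the settings listed above.
constexpr int kSampleRateHz = 16000;
constexpr int kInputFrames = 4000;
constexpr int kOutputFrames = 1000;

// Duration of a buffer in milliseconds: frames / sample_rate * 1000.
constexpr double buffer_ms(int frames, int sample_rate_hz) {
    return 1000.0 * frames / sample_rate_hz;
}

// Size in bytes of a mono 16-bit PCM buffer (one int16 sample per frame).
inline std::size_t buffer_bytes(int frames) {
    assert(frames >= 0);
    return static_cast<std::size_t>(frames) * sizeof(std::int16_t);
}

static_assert(buffer_ms(kInputFrames, kSampleRateHz) == 250.0, "input buffer is 250 ms");
static_assert(buffer_ms(kOutputFrames, kSampleRateHz) == 62.5, "output buffer is 62.5 ms");
```

With these settings each captured input buffer holds 8000 bytes of raw PCM before base64 encoding.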

WebSocket Connection

  • Endpoint: wss://api.elevenlabs.io/v1/convai/conversation
  • Protocol: WebSocket Secure (WSS) with TLS 1.2+
  • Authentication: Optional (required for private agents)
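
As a rough illustration of how the endpoint and the AGENT_ID environment variable fit together, the sketch below builds a connection URL. It assumes the agent is selected with an `agent_id` query parameter, which this README does not state; the function names are hypothetical and the ID is not URL-encoded.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Endpoint from the list above.
static const char* const kEndpoint = "wss://api.elevenlabs.io/v1/convai/conversation";

// Assumption: the agent is chosen via an `agent_id` query parameter.
std::string conversation_url(const std::string& agent_id) {
    assert(!agent_id.empty());
    return std::string(kEndpoint) + "?agent_id=" + agent_id;
}

// Reads AGENT_ID from the environment; returns an empty string if it is unset.
std::string agent_id_from_env() {
    const char* value = std::getenv("AGENT_ID");
    return value != nullptr ? std::string(value) : std::string();
}
```

A caller would check that `agent_id_from_env()` is non-empty before passing it to `conversation_url`.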

Project Structure

elevenlabs-convai-cpp/
├── CMakeLists.txt              # Build configuration
├── README.md                   # This file
├── LICENSE                     # MIT license
├── CONTRIBUTING.md             # Contribution guidelines
├── .gitignore                  # Git ignore rules
├── include/                    # Header files
│   ├── AudioInterface.hpp      # Abstract audio interface
│   ├── DefaultAudioInterface.hpp # PortAudio implementation
│   └── Conversation.hpp        # Main conversation handler
└── src/                        # Source files
    ├── main.cpp                # Demo application
    ├── Conversation.cpp        # WebSocket and message handling
    └── DefaultAudioInterface.cpp # Audio I/O implementation

Technical Details

Audio Processing Pipeline

  1. Capture: PortAudio captures 16-bit PCM audio at 16 kHz
  2. Encoding: Raw audio is base64-encoded for WebSocket transmission
  3. Streaming: Audio chunks are sent as user_audio_chunk messages
  4. Reception: The server sends audio_event messages containing agent speech
  5. Decoding: Base64 audio data is decoded back to PCM
  6. Playback: Audio is queued and played through the PortAudio output stream
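
The encoding and decoding steps of the pipeline can be sketched as follows. This is a self-contained illustration of standard base64 (RFC 4648) applied to 16-bit PCM, not the project's actual code; the function names are hypothetical, and the little-endian byte order is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 alphabet (RFC 4648).
static const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes raw bytes as base64 with '=' padding.
std::string base64_encode(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve((bytes.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        const std::size_t rest = bytes.size() - i;  // bytes left, at least 1
        std::uint32_t n = static_cast<std::uint32_t>(bytes[i]) << 16;
        if (rest > 1) n |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
        if (rest > 2) n |= bytes[i + 2];
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += rest > 1 ? kAlphabet[(n >> 6) & 63] : '=';
        out += rest > 2 ? kAlphabet[n & 63] : '=';
    }
    return out;
}

// Decodes base64; stops at the first '=' and skips characters outside the alphabet.
std::vector<std::uint8_t> base64_decode(const std::string& text) {
    std::vector<std::uint8_t> out;
    std::uint32_t acc = 0;
    int bits = 0;
    for (char c : text) {
        if (c == '=') break;
        const std::size_t pos = kAlphabet.find(c);
        if (pos == std::string::npos) continue;
        acc = (acc << 6) | static_cast<std::uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((acc >> bits) & 0xFF));
        }
    }
    return out;
}

// Assumption: samples are serialized as little-endian 16-bit values.
std::vector<std::uint8_t> pcm_to_bytes(const std::vector<std::int16_t>& samples) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(samples.size() * 2);
    for (std::int16_t s : samples) {
        const std::uint16_t u = static_cast<std::uint16_t>(s);
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return bytes;
}

// Inverse of pcm_to_bytes; expects an even number of bytes.
std::vector<std::int16_t> bytes_to_pcm(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);
    std::vector<std::int16_t> samples;
    samples.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t u = static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        samples.push_back(static_cast<std::int16_t>(u));
    }
    return samples;
}
```

Capture-side code would call `base64_encode(pcm_to_bytes(samples))`, and playback-side code would call `bytes_to_pcm(base64_decode(text))`.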

Echo Suppression

The implementation includes a simple, effective echo suppression mechanism:

  • Microphone input is suppressed during agent speech playback
  • Prevents acoustic feedback loops that cause the agent to respond to itself
  • Uses atomic flags for thread-safe coordination between input/output
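
One way to picture the coordination described above is a gate built around a single `std::atomic<bool>`. The class below is a hypothetical sketch, not the project's implementation; in particular, dropping the chunk entirely (rather than sending silence) is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical gate shared by the playback thread (writer) and the capture thread (reader).
class EchoGate {
public:
    // Called by the output side when agent playback starts or stops.
    void set_agent_speaking(bool speaking) {
        agent_speaking_.store(speaking, std::memory_order_release);
    }

    // True when microphone audio may be forwarded to the agent.
    bool mic_enabled() const {
        return !agent_speaking_.load(std::memory_order_acquire);
    }

    // Returns the chunk unchanged while the mic is open, or an empty chunk while
    // the agent is speaking (assumption: suppressed audio is simply dropped).
    std::vector<std::int16_t> filter(std::vector<std::int16_t> chunk) const {
        if (!mic_enabled()) chunk.clear();
        return chunk;
    }

private:
    std::atomic<bool> agent_speaking_{false};
};
```

The capture callback would pass each buffer through `filter` and skip sending when the result is empty.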

WebSocket Message Handling

Supported message types:

  • conversation_initiation_client_data - Session initialization
  • user_audio_chunk - Microphone audio data
  • audio_event - Agent speech audio
  • agent_response - Agent text responses
  • user_transcript - Speech-to-text results
  • ping/pong - Connection keepalive
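
To make the message list more concrete, here is a small sketch of dispatching on a message name and building two outgoing messages. The JSON shapes (a single `user_audio_chunk` key, and a `pong` carrying an `event_id`) are assumptions about the wire format that this README does not spell out, and the helper names are hypothetical. The project lists nlohmann-json as a dependency; plain strings are used here only to keep the example self-contained.

```cpp
#include <cassert>
#include <string>

enum class MessageKind { AudioEvent, AgentResponse, UserTranscript, Ping, Unknown };

// Maps a message name from the list above to an enum value for dispatch.
MessageKind classify(const std::string& name) {
    if (name == "audio_event") return MessageKind::AudioEvent;
    if (name == "agent_response") return MessageKind::AgentResponse;
    if (name == "user_transcript") return MessageKind::UserTranscript;
    if (name == "ping") return MessageKind::Ping;
    return MessageKind::Unknown;
}

// Assumption: microphone audio is sent as {"user_audio_chunk": "<base64>"}.
std::string make_user_audio_chunk(const std::string& audio_base64) {
    // Base64 text never contains '"', so no JSON escaping is needed.
    assert(audio_base64.find('"') == std::string::npos);
    return "{\"user_audio_chunk\":\"" + audio_base64 + "\"}";
}

// Assumption: a ping is answered with a pong that echoes the ping's event id.
std::string make_pong(long event_id) {
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(event_id) + "}";
}
```

A receive loop would parse each incoming frame, call `classify` on its message name, and route audio to playback, text to the terminal, and pings to `make_pong`.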

License

This project is licensed under the MIT License - see the LICENSE file for details.