- Fix event_id filtering bug: reset LastInterruptEventId when a new generation
starts, preventing all audio from being silently dropped after an interruption
- Match C++ sample API config: remove optimize_streaming_latency and
custom_llm_extra_body overrides, send empty conversation_config_override
in Server VAD mode (only send turn_timeout in Client mode)
- Instant audio stop on interruption: call ResetAudio() before Stop() to
flush USoundWaveProcedural's internal ring buffer
- Lip sync reset on interruption/stop: bind OnAgentInterrupted (snap to
neutral) and OnAgentStoppedSpeaking (clear queues) events
- Revert jitter buffer (replaced by pre-buffer approach, default 2000ms)
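The event_id fix above can be sketched in plain C++ (struct and member names are assumed for illustration, not the plugin's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the event_id filter. After an interruption we record
// the interrupting event_id; audio events at or below it belong to the
// cancelled generation and are dropped. Without the reset on a new generation,
// the stale threshold silently discards all subsequent audio as well.
struct AudioEventFilter {
    int64_t LastInterruptEventId = 0;

    void OnInterrupted(int64_t EventId) { LastInterruptEventId = EventId; }

    // Called when the server starts a new agent generation.
    void OnNewGeneration() { LastInterruptEventId = 0; }

    // True if the audio event should be played, false if it is stale.
    bool ShouldPlay(int64_t EventId) const {
        return EventId > LastInterruptEventId;
    }
};
```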
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix silence padding accumulation bug: silence queued via QueueAudio() was
accumulating in USoundWaveProcedural's internal buffer during TTS gaps,
delaying real audio by ~800ms. USoundWaveProcedural with
INDEFINITELY_LOOPING_DURATION generates silence internally instead.
- Fix pre-buffer bypass: guard OnProceduralUnderflow with bPreBuffering
check — the audio component never stops (INDEFINITELY_LOOPING_DURATION)
so it was draining AudioQueue during pre-buffering, defeating it entirely.
- Audio pre-buffer default 2000ms (max 4000ms) to absorb ElevenLabs
server-side TTS inter-chunk gaps (~2s between chunks confirmed).
- Add diagnostic timestamps [T+Xs] in HandleAudioReceived and
AudioQueue DRY/recovered logs for debugging audio pipeline timing.
- Fix lip sync not returning to neutral: add snap-to-zero (< 0.01)
in blendshape smoothing pass and clean up PreviousBlendshapes to
prevent asymptotic Lerp residuals keeping mouth slightly open.
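The snap-to-zero fix can be sketched as follows (threshold matches the commit; the Lerp shape and names are assumed):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of the blendshape smoothing pass. A plain Lerp toward 0
// only approaches zero asymptotically, leaving a small residual weight that
// keeps the mouth slightly open; snapping values below a small epsilon to
// exactly 0 lets the face return fully to neutral.
float SmoothBlendshape(float Previous, float Target, float Alpha) {
    float Value = Previous + (Target - Previous) * Alpha;  // Lerp step
    if (std::fabs(Value) < 0.01f) {
        Value = 0.0f;  // snap-to-zero: kill the asymptotic residual
    }
    return Value;
}
```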
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix text erasure between TTS audio chunks (bFullTextReceived guard):
partial text now persists across all chunks of the same utterance instead
of being erased after chunk 1's queue empties
- Add audio pre-buffering (AudioPreBufferMs, default 250ms) to absorb TTS
inter-chunk gaps and eliminate mid-sentence audio pauses
- Lip sync pauses viseme queue consumption during pre-buffer to stay in sync
- Inter-frame interpolation (lerp between consumed and next queued frame)
for smoother mouth transitions instead of 32ms step-wise jumps
- Reduce double-smoothing (blendshape smooth 0.8→0.4, release 0.5→0.65)
- Adjust duration weights (vowels 2.0/1.7, plosives 0.8, silence 1.0)
- UI range refinement (AmplitudeScale 0.5-1.0, SmoothingSpeed 35-65)
- Silence padding capped at 512 samples (32ms) to prevent buffer accumulation
- Audio playback restart on buffer underrun during speech
- Optimized log levels (most debug→Verbose, kept key diagnostics at Log)
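The inter-frame interpolation above can be sketched like this (function name and parameters are assumptions; the 32ms frame duration comes from the commit):

```cpp
#include <cassert>

// Hypothetical sketch of inter-frame viseme interpolation. Instead of holding
// each 32ms viseme frame until the next is consumed (step-wise jumps), the
// weight shown each render tick is a Lerp between the current frame and the
// next queued frame, keyed on progress through the current frame's duration.
float InterpolateVisemeWeight(float CurrentWeight, float NextWeight,
                              float ElapsedMs, float FrameDurationMs) {
    float T = ElapsedMs / FrameDurationMs;
    if (T < 0.0f) T = 0.0f;
    if (T > 1.0f) T = 1.0f;  // clamp: never extrapolate past the next frame
    return CurrentWeight + (NextWeight - CurrentWeight) * T;
}
```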
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Real-time lip sync component that performs client-side spectral analysis
on the agent's PCM audio stream (ElevenLabs doesn't provide viseme data).
Pipeline: 512-point FFT (16kHz) → 5 frequency bands → 15 OVR visemes
→ ARKit blendshapes (MetaHuman compatible) → auto-apply morph targets.
Currently uses SetMorphTarget() which may be overridden by MetaHuman's
Face AnimBP — face animation not yet working. Debug logs added to
diagnose: audio flow, spectrum energy, morph target name matching.
Next steps: verify debug output, fix MetaHuman morph target override
(likely needs AnimBP integration like Convai approach).
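The band-energy stage of the pipeline can be sketched as below. The band edges are assumptions for illustration, not the plugin's exact values; the bin math follows from the stated 512-point FFT at 16kHz.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the 5-band binning stage. A 512-point FFT at 16kHz
// yields 256 usable bins of 16000/512 = 31.25 Hz each (up to the 8kHz Nyquist
// limit); summing bin magnitudes between frequency edges gives the 5 band
// energies that drive viseme selection.
std::array<float, 5> BandEnergies(const std::array<float, 256>& Magnitudes) {
    // Assumed band edges in Hz (roughly formant-oriented splits).
    const float EdgesHz[6] = {0.f, 500.f, 1000.f, 2000.f, 4000.f, 8000.f};
    const float BinHz = 16000.0f / 512.0f;  // 31.25 Hz per bin
    std::array<float, 5> Bands{};
    for (std::size_t Bin = 0; Bin < Magnitudes.size(); ++Bin) {
        const float Freq = Bin * BinHz;
        for (int B = 0; B < 5; ++B) {
            if (Freq >= EdgesHz[B] && Freq < EdgesHz[B + 1]) {
                Bands[B] += Magnitudes[Bin];
                break;
            }
        }
    }
    return Bands;
}
```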
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
UE was converting the raw ms value to seconds in the Details panel,
showing "0.1 s" instead of "100". Removing Units="ms" lets the slider
display the integer value directly.
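The change amounts to a one-meta-key edit on the property declaration; the property name and clamp values below are assumed for illustration:

```cpp
// Before (display bug): Units="ms" makes the Details panel reinterpret the
// raw value as a duration, so 100 is shown as "0.1 s".
UPROPERTY(EditAnywhere, Category = "Audio",
          meta = (ClampMin = "20", ClampMax = "500", Units = "ms"))
int32 MicChunkDurationMs = 100;

// After: drop Units so the slider shows the integer milliseconds directly;
// the unit is stated in the tooltip instead.
UPROPERTY(EditAnywhere, Category = "Audio",
          meta = (ClampMin = "20", ClampMax = "500",
                  ToolTip = "Mic chunk duration in milliseconds"))
int32 MicChunkDurationMs = 100;
```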
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All configuration and event properties in ConversationalAgentComponent and
MicrophoneCaptureComponent now have explicit ToolTip meta for clear descriptions
in the Unreal Editor Details panel.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Server VAD + interruption: mic stays open while agent speaks, server
detects user voice and triggers interruption automatically. Echo
suppression disabled in this mode so audio reaches the server.
- Fix agent_chat_response_part parsing: ElevenLabs API now uses
text_response_part.text instead of agent_chat_response_part_event.
Added fallback for legacy format.
- Expose MicChunkDurationMs as UPROPERTY (20-500ms, default 100ms)
instead of compile-time constant.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix regression from v1.7.0 where agent couldn't hear user speech:
- Restore AsyncTask game-thread dispatch for delegate broadcast (AddUObject
weak pointer checks are not thread-safe from the WASAPI thread)
- Keep early echo suppression in WASAPI callback (before resampling)
- Keep MicChunkMinBytes at 3200 (100ms) for lower latency
- Add thread safety: std::atomic<bool> for bIsListening/bAgentSpeaking/bCapturing,
FCriticalSection for MicSendLock and WebSocketSendLock
- Add EchoSuppressFlag pointer from agent to mic component
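The thread-safety pattern above, sketched in plain C++ with std equivalents of the UE types (FCriticalSection -> std::mutex; struct and member names assumed):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical sketch: flags written on the game thread and read on the
// WASAPI capture thread are atomics; the send buffer shared between the two
// threads is guarded by a lock (the MicSendLock role from the commit).
struct MicShared {
    std::atomic<bool> bIsListening{false};
    std::atomic<bool> bAgentSpeaking{false};  // echo suppression flag
    std::mutex MicSendLock;
    std::vector<uint8_t> SendBuffer;

    // Called from the capture thread for each mic callback.
    void OnCaptured(const uint8_t* Data, std::size_t Len) {
        if (!bIsListening.load() || bAgentSpeaking.load())
            return;  // not listening, or suppressing echo: drop the audio
        std::lock_guard<std::mutex> Lock(MicSendLock);
        SendBuffer.insert(SendBuffer.end(), Data, Data + Len);
    }
};
```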
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip mic buffer flush during collision avoidance (bAgentGenerating guard
in StopListening) to prevent sending audio to a mid-generation server
which caused both sides to stall permanently
- Add OnAgentPartialResponse event: streams LLM text fragments from
agent_chat_response_part in real-time (opt-in via bEnableAgentPartialResponse),
separate from the existing OnAgentTextResponse (full text at end)
- French agent server drop after 3 turns is a server-side issue, not a client bug
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented voice input from working:
1. ResampleTo16000() treated NumFrames as total samples, dividing by
channel count again — losing half the audio data with stereo input.
The corrupted audio was unrecognizable to ElevenLabs VAD/STT.
2. Sent nonexistent "client_vad" turn mode in session init. The API has
no turn.mode field; replaced with turn_timeout parameter.
3. Sent user_activity with every audio chunk, which resets the turn
timeout timer and prevents the server from taking its turn.
Also: send audio chunks as compact JSON, add message type debug logging,
send conversation_initiation_client_data on connect.
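Bug 1 comes down to frames vs. samples, sketched below (function name and signature assumed). A frame holds one sample per channel, so the total sample count is NumFrames * NumChannels; treating NumFrames as a sample count and dividing by the channel count again halves the audio.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the corrected downmix step that precedes resampling:
// consume NumChannels interleaved samples per frame and average them, yielding
// exactly one mono sample per frame.
std::vector<float> DownmixToMono(const float* Interleaved,
                                 int32_t NumFrames, int32_t NumChannels) {
    std::vector<float> Mono;
    Mono.reserve(NumFrames);
    for (int32_t Frame = 0; Frame < NumFrames; ++Frame) {
        float Sum = 0.0f;
        for (int32_t Ch = 0; Ch < NumChannels; ++Ch)
            Sum += Interleaved[Frame * NumChannels + Ch];
        Mono.push_back(Sum / NumChannels);
    }
    return Mono;  // one sample per frame; resample to 16kHz afterwards
}
```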
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WASAPI fires mic callbacks every ~5ms (158 bytes at 16kHz 16-bit mono).
ElevenLabs VAD/STT requires a minimum of ~100ms (3200 bytes) per chunk.
Tiny fragments arrived at the server but were never processed, so the
agent never transcribed or responded to user speech.
Fix: OnMicrophoneDataCaptured now appends to MicAccumulationBuffer and
only calls SendAudioChunk once >= 3200 bytes are accumulated. StopListening
flushes any remaining bytes before sending UserTurnEnd so the final words
of an utterance are never discarded. HandleDisconnected also clears the
buffer to prevent stale data on reconnect.
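The accumulation logic can be sketched as follows (struct name and return-a-chunk shape are assumptions; the 3200-byte threshold is from the commit):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: capture callbacks append into an accumulation buffer,
// and a chunk is emitted only once at least MinChunkBytes (3200 bytes = 100ms
// of 16kHz 16-bit mono) are available, so the server receives chunks large
// enough for VAD/STT to process.
struct MicAccumulator {
    static constexpr std::size_t MinChunkBytes = 3200;
    std::vector<uint8_t> Buffer;

    // Returns a full chunk once enough audio has accumulated, else empty.
    std::vector<uint8_t> Append(const uint8_t* Data, std::size_t Len) {
        Buffer.insert(Buffer.end(), Data, Data + Len);
        if (Buffer.size() < MinChunkBytes)
            return {};
        std::vector<uint8_t> Chunk(std::move(Buffer));
        Buffer.clear();
        return Chunk;
    }

    // Called from StopListening: flush any tail so final words aren't lost.
    std::vector<uint8_t> Flush() {
        std::vector<uint8_t> Tail(std::move(Buffer));
        Buffer.clear();
        return Tail;
    }
};
```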
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reflects all bug fixes and new features added since initial release:
- Binary WS frame handling (JSON vs raw PCM discrimination)
- Corrected transcript message type and field names
- Corrected pong format (top-level event_id)
- Corrected client turn mode (user_activity, no explicit end message)
- New SendTextMessage feature documented with Blueprint + C++ examples
- Added Section 13: Changelog (v1.0.0 / v1.1.0)
- Updated audio pipeline diagram for raw binary PCM output path
- Added OnAgentConnected timing note (fires after initiation_metadata)
- Added FTranscriptSegment clarification (speaker always "user")
- Added API key / git workflow note in Security section
- New troubleshooting entries for binary frames and OnAgentConnected
- New "Test without microphone" common pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ElevenLabs sends two kinds of binary WebSocket frames:
1. JSON control messages (start with '{') — decode as UTF-8, route to OnWsMessage
2. Raw PCM audio (binary, does not start with '{') — broadcast directly as audio
Previously all binary frames were decoded as UTF-8 JSON, causing
"Failed to parse WebSocket message as JSON" for every audio frame.
Fix: peek at first byte of assembled frame buffer:
- '{' → UTF-8 JSON path (null-terminated, routed to existing message handler)
- anything else → raw PCM path (broadcast directly to OnAudioReceived)
Also: improved "Failed to parse JSON" log to show first 80 chars of message,
and added verbose hex dump of binary audio frame prefix for diagnostics.
Compiles cleanly on UE 5.5 Win64.
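The first-byte peek can be sketched like this (enum and function names assumed):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the frame discrimination: ElevenLabs JSON control
// messages always begin with '{' (0x7B), while raw PCM frames begin with
// arbitrary sample bytes, so peeking at the first byte of the reassembled
// frame buffer is enough to route it.
enum class FrameKind { Json, RawPcm, Empty };

FrameKind ClassifyFrame(const std::vector<uint8_t>& Frame) {
    if (Frame.empty())
        return FrameKind::Empty;
    return Frame[0] == '{' ? FrameKind::Json : FrameKind::RawPcm;
}
```

Note this is a heuristic: a PCM frame could in principle begin with 0x7B, but in practice valid JSON following the brace disambiguates, and the existing JSON parse failure path would catch the rare collision.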
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ElevenLabs sends all JSON messages as binary WS frames, not text frames.
The OnRawMessage callback receives them; we were logging them as warnings
and discarding the data entirely — causing no events to fire at all.
Fix: accumulate binary frame fragments (BytesRemaining > 0 = more coming),
reassemble into a complete buffer, decode as UTF-8 JSON string, then route
through the existing OnWsMessage text handler unchanged.
Added BinaryFrameBuffer (TArray<uint8>) to proxy header for accumulation.
Compiles cleanly on UE 5.5 Win64.
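The reassembly logic can be sketched as below (BinaryFrameBuffer follows the commit; the callback signature mirrors the (Data, Size, BytesRemaining) shape of UE's raw-message event, and the return-a-string shape is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: binary frames arrive in fragments with a BytesRemaining
// count; fragments are appended to BinaryFrameBuffer and the complete frame is
// decoded as a UTF-8 string only when BytesRemaining reaches 0.
struct FrameAssembler {
    std::vector<uint8_t> BinaryFrameBuffer;

    // Returns the decoded message when the frame is complete, else empty.
    std::string OnRawMessage(const void* Data, std::size_t Size,
                             std::size_t BytesRemaining) {
        const uint8_t* Bytes = static_cast<const uint8_t*>(Data);
        BinaryFrameBuffer.insert(BinaryFrameBuffer.end(), Bytes, Bytes + Size);
        if (BytesRemaining > 0)
            return {};  // more fragments of this frame are coming
        std::string Message(BinaryFrameBuffer.begin(), BinaryFrameBuffer.end());
        BinaryFrameBuffer.clear();
        return Message;  // route through the existing JSON text handler
    }
};
```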
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sends {"type":"user_message","text":"..."} to the ElevenLabs API.
Agent responds with audio + text exactly as if it heard spoken input.
Useful for testing without a microphone and for text-only NPC interactions.
Available in Blueprint on UElevenLabsConversationalAgentComponent.
Compiles cleanly on UE 5.5 Win64.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug 1 — Transcript handler: wrong type string + wrong JSON fields
- type was "transcript", API sends "user_transcript"
- event key was "transcript_event", API uses "user_transcription_event"
- field was "message", API uses "user_transcript"
- removed non-existent "speaker"/"is_final" fields; speaker is always "user"
Bug 2 — Pong format: event_id must be top-level, not nested in pong_event
- Was: {"type":"pong","pong_event":{"event_id":1}}
- Fixed: {"type":"pong","event_id":1}
Bug 3 — Client turn mode: user_turn_start/end don't exist in the API
- SendUserTurnStart now sends {"type":"user_activity"} (correct API message)
- SendUserTurnEnd now a no-op with log (no explicit end message in API)
- Renamed constants in ElevenLabsDefinitions.h accordingly
Also added AgentResponseCorrection and ConversationClientData constants.
Compiles cleanly on UE 5.5 Win64.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- DefaultEngine.ini: set GameDefaultMap + EditorStartupMap to TestMap
(API key stripped — set locally via Project Settings, not committed)
- Content/TestMap.umap: initial test level
- Content/test_AI_Actor.uasset: initial test actor
- .claude/MEMORY.md: document API key handling, add memory file index,
note private git server and TestMap as default map
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers: WebSocket protocol (all message types), Agent ID location,
Signed URL auth, REST agents API, audio format, UE5 integration notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full record of everything done in today's session: plugin creation,
compile fixes, documentation (Markdown + PowerPoint), git history,
technical decisions made, and next steps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove WebSockets from .uplugin (it is a module, not a plugin)
- Add AudioCapture plugin dependency to .uplugin
- Fix FOnAudioCaptureFunction: use OpenAudioCaptureStream (not deprecated
OpenDefaultCaptureStream) and correct callback signature (const void* per UE 5.3+)
- Cast void* to float* inside OnAudioGenerate for float sample processing
- Fix TArray::RemoveAt: use EAllowShrinking::No instead of deprecated bool overload
Plugin now compiles cleanly with no errors or warnings on UE 5.5 / Win64.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the plugin-specific path with a wildcard pattern so any
future plugin under Plugins/ is automatically excluded from version control.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new UE5.5 plugin integrating the ElevenLabs Conversational AI Agent
via WebSocket. No gRPC or third-party libs required.
Plugin components:
- UElevenLabsSettings: API key + Agent ID in Project Settings
- UElevenLabsWebSocketProxy: full WS session lifecycle, JSON message handling,
ping/pong keepalive, Base64 PCM audio send/receive
- UElevenLabsConversationalAgentComponent: ActorComponent for NPC voice
conversation, orchestrates mic capture -> WS -> procedural audio playback
- UElevenLabsMicrophoneCaptureComponent: wraps Audio::FAudioCapture,
resamples to 16kHz mono, dispatches on game thread
Also adds .claude/ memory files (project context, plugin notes, patterns)
so Claude Code can restore full context on any machine after a git pull.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>