Compare commits
6 Commits
d60f8d8484...2169c58cd7
| Author | SHA1 | Date | |
|---|---|---|---|
| | 2169c58cd7 | | |
| | 5fad6376bc | | |
| | 0b190c3149 | | |
| | 56b072c45e | | |
| | a8dd5a022f | | |
| | 955c97e0dd | | |
@ -17,69 +17,70 @@
|
|||||||
## Plugins
|
## Plugins
|
||||||
| Plugin | Path | Purpose |
|
| Plugin | Path | Purpose |
|
||||||
|--------|------|---------|
|
|--------|------|---------|
|
||||||
| Convai (reference) | `<repo_root>/ConvAI/Convai/` | gRPC + protobuf streaming to Convai API. Has ElevenLabs voice type enum in `ConvaiDefinitions.h`. Used as architectural reference. |
|
| Convai (reference) | `<repo_root>/ConvAI/Convai/` | gRPC + protobuf streaming to Convai API. Used as architectural reference. |
|
||||||
| **PS_AI_Agent_ElevenLabs** | `<repo_root>/Unreal/PS_AI_Agent/Plugins/PS_AI_Agent_ElevenLabs/` | Our ElevenLabs Conversational AI integration. See `.claude/elevenlabs_plugin.md` for full details. |
|
| **PS_AI_ConvAgent** | `<repo_root>/Unreal/PS_AI_Agent/Plugins/PS_AI_ConvAgent/` | Main plugin — ElevenLabs Conversational AI, posture, gaze, lip sync, facial expressions. |
|
||||||
|
|
||||||
## User Preferences
|
## User Preferences
|
||||||
- Plugin naming: `PS_AI_Agent_<Service>` (e.g. `PS_AI_Agent_ElevenLabs`)
|
- Plugin naming: `PS_AI_ConvAgent` (renamed from PS_AI_Agent_ElevenLabs)
|
||||||
- Save memory frequently during long sessions
|
- Save memory frequently during long sessions
|
||||||
- Goal: ElevenLabs Conversational AI integration — simpler than Convai, no gRPC
|
|
||||||
- Full original ask + intent: see `.claude/project_context.md`
|
|
||||||
- Git remote is a **private server** — no public exposure risk
|
- Git remote is a **private server** — no public exposure risk
|
||||||
|
- Full original ask + intent: see `.claude/project_context.md`
|
||||||
|
|
||||||
|
## Current Branch & Work
|
||||||
|
- **Branch**: `main`
|
||||||
|
- **Recent merges**: `feature/multi-player-shared-agent` merged to main
|
||||||
|
|
||||||
|
### Latency Debug HUD (just implemented)
|
||||||
|
- Separate `bDebugLatency` property + CVar `ps.ai.ConvAgent.Debug.Latency`
|
||||||
|
- All metrics anchored to `GenerationStartTime` (`agent_response_started` event)
|
||||||
|
- Metrics: Gen>Audio (LLM+TTS), Pre-buffer, Gen>Ear (user-perceived)
|
||||||
|
- Reset per turn in `HandleAgentResponseStarted()`
|
||||||
|
- `DrawLatencyHUD()` separate from `DrawDebugHUD()`
|
||||||
|
|
||||||
|
### Future: Server-Side Latency from ElevenLabs API
|
||||||
|
**TODO — high-value improvement parked for later:**
|
||||||
|
- `GET /v1/convai/conversations/{conversation_id}` returns:
|
||||||
|
- `conversation_turn_metrics` with `elapsed_time` per metric (STT, LLM, TTS breakdown!)
|
||||||
|
- `tool_latency_secs`, `step_latency_secs`, `rag_latency_secs`
|
||||||
|
- `time_in_call_secs` per message
|
||||||
|
- `ping` WS event has `ping_ms` (network round-trip) — could display on HUD
|
||||||
|
- `vad_score` WS event (0.0-1.0) — could detect real speech start client-side
|
||||||
|
- Docs: https://elevenlabs.io/docs/api-reference/conversations/get
|
||||||
|
|
||||||
|
### Multi-Player Shared Agent — Key Design
|
||||||
|
- **Old model**: exclusive lock (one player per agent via `NetConversatingPawn`)
|
||||||
|
- **New model**: shared array (`NetConnectedPawns`) + active speaker (`NetActiveSpeakerPawn`)
|
||||||
|
- Speaker arbitration: server-side with `SpeakerSwitchHysteresis` (0.3s) + `SpeakerIdleTimeout` (3.0s)
|
||||||
|
- In standalone (≤1 player): speaker arbitration bypassed, audio sent directly to WebSocket
|
||||||
|
- Internal mic (WASAPI thread): direct WebSocket send, no game-thread state access
|
||||||
|
- `GetCurrentBlendshapes()` thread-safe via `ThreadSafeBlendshapes` snapshot + `BlendshapeLock`
|
||||||
|
|
||||||
## Key UE5 Plugin Patterns
|
## Key UE5 Plugin Patterns
|
||||||
- Settings object: `UCLASS(config=Engine, defaultconfig)` inheriting `UObject`, registered via `ISettingsModule`
|
- Settings object: `UCLASS(config=Engine, defaultconfig)` inheriting `UObject`, registered via `ISettingsModule`
|
||||||
- Module startup: `NewObject<USettings>(..., RF_Standalone)` + `AddToRoot()`
|
|
||||||
- WebSocket: `FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), Headers)`
|
- WebSocket: `FWebSocketsModule::Get().CreateWebSocket(URL, TEXT(""), Headers)`
|
||||||
- `WebSockets` is a **module** (Build.cs only) — NOT a plugin, don't put it in `.uplugin`
|
- Audio capture: `Audio::FAudioCapture::OpenAudioCaptureStream()` (UE 5.3+)
|
||||||
- Audio capture: `Audio::FAudioCapture::OpenAudioCaptureStream()` (UE 5.3+, replaces deprecated `OpenCaptureStream`)
|
- Callback arrives on **background thread** — marshal to game thread
|
||||||
- `AudioCapture` IS a plugin — declare it in `.uplugin` Plugins array
|
- Procedural audio playback: `USoundWaveProcedural` + `OnSoundWaveProceduralUnderflow`
|
||||||
- Callback type: `FOnAudioCaptureFunction` = `TFunction<void(const void*, int32, int32, int32, double, bool)>`
|
|
||||||
- Cast `const void*` to `const float*` inside — device sends float32 interleaved
|
|
||||||
- Procedural audio playback: `USoundWaveProcedural` + `OnSoundWaveProceduralUnderflow` delegate
|
|
||||||
- Audio capture callbacks arrive on a **background thread** — always marshal to game thread with `AsyncTask(ENamedThreads::GameThread, ...)`
|
|
||||||
- Resample mic audio to **16000 Hz mono** before sending to ElevenLabs
|
- Resample mic audio to **16000 Hz mono** before sending to ElevenLabs
|
||||||
- `TArray::RemoveAt(idx, count, EAllowShrinking::No)` — bool overload deprecated in UE 5.5
|
- `TArray::RemoveAt(idx, count, EAllowShrinking::No)` — bool overload deprecated in UE 5.5
|
||||||
|
|
||||||
## Plugin Status
|
|
||||||
- **PS_AI_Agent_ElevenLabs**: compiles cleanly on UE 5.5 Win64 (verified 2026-02-19)
|
|
||||||
- v1.5.0 — mic audio chunk size fixed: WASAPI 5ms callbacks accumulated to 100ms before sending
|
|
||||||
- v1.4.0 — push-to-talk fully fixed: bAutoStartListening now ignored in Client turn mode
|
|
||||||
- Binary WS frame handling implemented (ElevenLabs sends ALL frames as binary, not text)
|
|
||||||
- First-byte discrimination: `{` = JSON control message, else = raw PCM audio
|
|
||||||
- `SendTextMessage()` added to both WebSocketProxy and ConversationalAgentComponent
|
|
||||||
- `conversation_initiation_client_data` now sent immediately on WS connect (required for mic + latency)
|
|
||||||
|
|
||||||
## Audio Chunk Size — CRITICAL
|
|
||||||
- WASAPI fires mic callbacks every ~5ms → **158 bytes** at 16kHz 16-bit mono
|
|
||||||
- ElevenLabs VAD/STT requires **≥3200 bytes (100ms)** per chunk; smaller chunks are silently ignored
|
|
||||||
- Fix: `MicAccumulationBuffer` in component accumulates chunks; sends only when `>= MicChunkMinBytes` (3200)
|
|
||||||
- `StopListening()` flushes remainder so final partial chunk is never dropped before end-of-turn
|
|
||||||
|
|
||||||
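The accumulation rule described in the "Audio Chunk Size" notes can be sketched in plain C++ without any Unreal types. This is only an illustration of the idea (buffer small capture callbacks, emit once the threshold is reached, flush the remainder on stop). The class name `MicAccumulator` and its methods are hypothetical and are not taken from the plugin; the 3200-byte threshold comes from the notes, and the callback size used in the usage note is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the component's MicAccumulationBuffer logic.
class MicAccumulator
{
public:
    explicit MicAccumulator(std::size_t MinChunkBytes) : MinBytes(MinChunkBytes)
    {
        assert(MinBytes > 0);
    }

    // Appends the bytes of one capture callback. Returns a chunk to send once
    // the threshold is reached, otherwise an empty vector.
    std::vector<std::uint8_t> Push(const std::vector<std::uint8_t>& Pcm)
    {
        Buffer.insert(Buffer.end(), Pcm.begin(), Pcm.end());
        if (Buffer.size() < MinBytes)
        {
            return {};
        }
        return Flush();
    }

    // Returns whatever is buffered (possibly a short final chunk) and clears it.
    // Intended to be called when listening stops.
    std::vector<std::uint8_t> Flush()
    {
        std::vector<std::uint8_t> Out;
        Out.swap(Buffer);
        return Out;
    }

private:
    std::size_t MinBytes;
    std::vector<std::uint8_t> Buffer;
};
```

With a 3200-byte threshold and 160-byte callbacks, `Push` returns an empty vector nineteen times and a 3200-byte chunk on the twentieth call. A later `Flush` hands back any partial remainder so it can be sent before end-of-turn.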
## ElevenLabs WebSocket Protocol Notes
|
## ElevenLabs WebSocket Protocol Notes
|
||||||
- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text) — UE fires both for same frame → double audio bug
|
- **ALL frames are binary** — bind ONLY `OnRawMessage`; NEVER bind `OnMessage` (text)
|
||||||
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
|
- Binary frame discrimination: peek byte[0] → `'{'` (0x7B) = JSON, else = raw PCM audio
|
||||||
- Fragment reassembly: accumulate into `BinaryFrameBuffer` until `BytesRemaining == 0`
|
|
||||||
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
|
- Pong: `{"type":"pong","event_id":N}` — `event_id` is **top-level**, NOT nested
|
||||||
- Transcript: type=`user_transcript`, key=`user_transcription_event`, field=`user_transcript`
|
- `user_transcript` arrives AFTER `agent_response_started` in Server VAD mode
|
||||||
- Client turn mode (`client_vad`): send `user_activity` **with every audio chunk** (not just once) — server needs continuous signal to know user is speaking; stopping chunks = silence detected = agent responds
|
- **MUST send `conversation_initiation_client_data` immediately after WS connect**
|
||||||
- Text input: `{"type":"user_message","text":"..."}` — agent replies with audio + text
|
|
||||||
- **MUST send `conversation_initiation_client_data` immediately after WS connect** — without it, server won't process client audio (mic appears dead)
|
|
||||||
- `conversation_initiation_client_data` payload: `conversation_config_override.agent.turn.mode`, `conversation_config_override.tts.optimize_streaming_latency`, `custom_llm_extra_body.enable_intermediate_response`
|
|
||||||
- `enable_intermediate_response: true` in `custom_llm_extra_body` reduces time-to-first-audio latency
|
|
||||||
|
|
||||||
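The first-byte discrimination and the pong shape from the protocol notes can be illustrated with a small standalone sketch. The names `EFrameKind`, `ClassifyFrame`, and `MakePong` are hypothetical helpers written for this example, not the plugin's API. Only the `0x7B` check and the top-level `event_id` field are taken from the notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

enum class EFrameKind { Empty, Json, PcmAudio };

// Peeks at byte[0] of a fully reassembled binary frame:
// '{' (0x7B) is treated as a JSON control message, anything else as raw PCM.
inline EFrameKind ClassifyFrame(const std::uint8_t* Data, std::size_t Size)
{
    if (Data == nullptr || Size == 0)
    {
        return EFrameKind::Empty;
    }
    return (Data[0] == 0x7B) ? EFrameKind::Json : EFrameKind::PcmAudio;
}

// Builds the pong reply with event_id at the top level (not nested).
inline std::string MakePong(long long EventId)
{
    return "{\"type\":\"pong\",\"event_id\":" + std::to_string(EventId) + "}";
}
```

Because the check looks at one byte only, it is a heuristic: a PCM frame that happens to start with `0x7B` would be classified as JSON under this sketch. A real handler would presumably fall back to treating the frame as audio if JSON parsing fails.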
## API Keys / Secrets
|
## API Keys / Secrets
|
||||||
- ElevenLabs API key is set in **Project Settings → Plugins → ElevenLabs AI Agent** in the Editor
|
- ElevenLabs API key: **Project Settings → Plugins → ElevenLabs AI Agent**
|
||||||
- UE saves it to `DefaultEngine.ini` under `[/Script/PS_AI_Agent_ElevenLabs.ElevenLabsSettings]`
|
- Saved to `DefaultEngine.ini` — **stripped before every commit**
|
||||||
- **The key is stripped from `DefaultEngine.ini` before every commit** — do not commit it
|
|
||||||
- Each developer sets the key locally; it does not go in git
|
|
||||||
|
|
||||||
## Claude Memory Files in This Repo
|
## Claude Memory Files in This Repo
|
||||||
| File | Contents |
|
| File | Contents |
|
||||||
|------|----------|
|
|------|----------|
|
||||||
| `.claude/MEMORY.md` | This file — project structure, patterns, status |
|
| `.claude/MEMORY.md` | This file — project structure, patterns, status |
|
||||||
| `.claude/elevenlabs_plugin.md` | Plugin file map, ElevenLabs WS protocol, design decisions |
|
| `.claude/elevenlabs_plugin.md` | Plugin file map, ElevenLabs WS protocol, design decisions |
|
||||||
| `.claude/elevenlabs_api_reference.md` | Full ElevenLabs API reference (WS messages, REST, signed URL, Agent ID location) |
|
| `.claude/elevenlabs_api_reference.md` | Full ElevenLabs API reference (WS messages, REST, signed URL) |
|
||||||
| `.claude/project_context.md` | Original ask, intent, short/long-term goals |
|
| `.claude/project_context.md` | Original ask, intent, short/long-term goals |
|
||||||
| `.claude/session_log_2026-02-19.md` | Full session record: steps, commits, technical decisions, next steps |
|
| `.claude/session_log_2026-02-19.md` | Session record: steps, commits, technical decisions |
|
||||||
| `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` | User-facing Markdown reference doc |
|
| `.claude/PS_AI_Agent_ElevenLabs_Documentation.md` | User-facing Markdown reference doc |
|
||||||
|
|||||||
@ -26,6 +26,12 @@ static TAutoConsoleVariable<int32> CVarDebugElevenLabs(
|
|||||||
TEXT("Debug HUD for ElevenLabs. -1=use property, 0=off, 1-3=verbosity."),
|
TEXT("Debug HUD for ElevenLabs. -1=use property, 0=off, 1-3=verbosity."),
|
||||||
ECVF_Default);
|
ECVF_Default);
|
||||||
|
|
||||||
|
static TAutoConsoleVariable<int32> CVarDebugLatency(
|
||||||
|
TEXT("ps.ai.ConvAgent.Debug.Latency"),
|
||||||
|
-1,
|
||||||
|
TEXT("Latency debug HUD. -1=use property, 0=off, 1=on."),
|
||||||
|
ECVF_Default);
|
||||||
|
|
||||||
// ─────────────────────────────────────────────────────────────────────────────
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
// Constructor
|
// Constructor
|
||||||
// ─────────────────────────────────────────────────────────────────────────────
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
@ -160,9 +166,13 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::TickComponent(float DeltaTime, ELevel
|
|||||||
AudioPlaybackComponent->Play();
|
AudioPlaybackComponent->Play();
|
||||||
}
|
}
|
||||||
PlaybackStartTime = FPlatformTime::Seconds();
|
PlaybackStartTime = FPlatformTime::Seconds();
|
||||||
if (bDebug && TurnEndTime > 0.0)
|
if (GenerationStartTime > 0.0)
|
||||||
{
|
{
|
||||||
LastLatencies.EndToEarMs = static_cast<float>((PlaybackStartTime - TurnEndTime) * 1000.0);
|
CurrentLatencies.GenToEarMs = static_cast<float>((PlaybackStartTime - GenerationStartTime) * 1000.0);
|
||||||
|
}
|
||||||
|
if (PreBufferStartTime > 0.0)
|
||||||
|
{
|
||||||
|
CurrentLatencies.PreBufferMs = static_cast<float>((PlaybackStartTime - PreBufferStartTime) * 1000.0);
|
||||||
}
|
}
|
||||||
OnAudioPlaybackStarted.Broadcast();
|
OnAudioPlaybackStarted.Broadcast();
|
||||||
}
|
}
|
||||||
@ -308,6 +318,14 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::TickComponent(float DeltaTime, ELevel
|
|||||||
DrawDebugHUD();
|
DrawDebugHUD();
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
{
|
||||||
|
const int32 CVarVal = CVarDebugLatency.GetValueOnGameThread();
|
||||||
|
const bool bShowLatency = (CVarVal >= 0) ? (CVarVal > 0) : bDebugLatency;
|
||||||
|
if (bShowLatency)
|
||||||
|
{
|
||||||
|
DrawLatencyHUD();
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ─────────────────────────────────────────────────────────────────────────────
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
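The CVar-versus-property resolution added in this hunk follows a three-state convention: `-1` defers to the component property, `0` forces the HUD off, and any positive value forces it on. A minimal standalone restatement of that rule is shown below; `ResolveDebugFlag` is a hypothetical name used only for this sketch.

```cpp
#include <cassert>

// Mirrors the expression used for bShowLatency in the hunk above:
// a non-negative console value overrides the property, a negative one defers to it.
inline bool ResolveDebugFlag(int CVarValue, bool bPropertyValue)
{
    return (CVarValue >= 0) ? (CVarValue > 0) : bPropertyValue;
}
```

Under this rule, setting the console variable to `0` hides the latency HUD even when `bDebugLatency` is enabled on the component, which makes the CVar usable as a global kill switch.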
@ -576,6 +594,7 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::StopListening()
|
|||||||
}
|
}
|
||||||
|
|
||||||
TurnEndTime = FPlatformTime::Seconds();
|
TurnEndTime = FPlatformTime::Seconds();
|
||||||
|
|
||||||
// Start the response timeout clock — but only when the server hasn't already started
|
// Start the response timeout clock — but only when the server hasn't already started
|
||||||
// generating. When StopListening() is called from HandleAgentResponseStarted() as part
|
// generating. When StopListening() is called from HandleAgentResponseStarted() as part
|
||||||
// of collision avoidance, bAgentGenerating is already true, meaning the server IS already
|
// of collision avoidance, bAgentGenerating is already true, meaning the server IS already
|
||||||
@ -1057,14 +1076,26 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::HandleAgentResponseStarted()
|
|||||||
}
|
}
|
||||||
|
|
||||||
const double Now = FPlatformTime::Seconds();
|
const double Now = FPlatformTime::Seconds();
|
||||||
GenerationStartTime = Now;
|
|
||||||
if (bDebug && TurnEndTime > 0.0)
|
// --- Latency reset for this new response cycle ---
|
||||||
|
// In Server VAD mode, StopListening() is not called — the server detects
|
||||||
|
// end of user speech and immediately starts generating. If TurnEndTime was
|
||||||
|
// not set by StopListening since the last generation (i.e. it's stale or 0),
|
||||||
|
// use Now as the best client-side approximation.
|
||||||
|
const bool bFreshTurnEnd = (TurnEndTime > GenerationStartTime) && (GenerationStartTime > 0.0);
|
||||||
|
if (!bFreshTurnEnd)
|
||||||
{
|
{
|
||||||
LastLatencies.STTToGenMs = static_cast<float>((Now - TurnEndTime) * 1000.0);
|
TurnEndTime = Now;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Reset all latency measurements — new response cycle starts here.
|
||||||
|
// All metrics are anchored to GenerationStartTime (= now), which is the closest
|
||||||
|
// client-side proxy for "user stopped speaking" in Server VAD mode.
|
||||||
|
CurrentLatencies = FDebugLatencies();
|
||||||
|
GenerationStartTime = Now;
|
||||||
|
|
||||||
const double T = Now - SessionStartTime;
|
const double T = Now - SessionStartTime;
|
||||||
const double LatencyFromTurnEnd = TurnEndTime > 0.0 ? Now - TurnEndTime : 0.0;
|
const double LatencyFromTurnEnd = Now - TurnEndTime;
|
||||||
if (bIsListening)
|
if (bIsListening)
|
||||||
{
|
{
|
||||||
// In Server VAD + interruption mode, keep the mic open so the server can
|
// In Server VAD + interruption mode, keep the mic open so the server can
|
||||||
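The `bFreshTurnEnd` check in this hunk decides whether the stored `TurnEndTime` belongs to the current turn or should be replaced by `Now`. The sketch below restates that decision as a pure function so its cases are easier to see. `ResolveTurnEnd` is a hypothetical helper, and `PrevGenerationStart` stands for the value of `GenerationStartTime` before it is overwritten for the new response.

```cpp
#include <cassert>

// Returns the turn-end timestamp to use for the new response cycle.
// TurnEndTime is considered fresh only if it was set after the previous
// generation start, and a previous generation start exists.
inline double ResolveTurnEnd(double TurnEndTime, double PrevGenerationStart, double Now)
{
    const bool bFreshTurnEnd =
        (TurnEndTime > PrevGenerationStart) && (PrevGenerationStart > 0.0);
    return bFreshTurnEnd ? TurnEndTime : Now;
}
```

One consequence of the condition as written in the hunk: on the very first response of a session the previous generation start is still `0.0`, so the function falls back to `Now` even if `StopListening()` had already recorded a turn end. Whether that is intended is not stated in the diff.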
@ -1341,6 +1372,10 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::EnqueueAgentAudio(const TArray<uint8>
|
|||||||
bQueueWasDry = false;
|
bQueueWasDry = false;
|
||||||
SilentTickCount = 0;
|
SilentTickCount = 0;
|
||||||
|
|
||||||
|
// Latency capture (always, for HUD display).
|
||||||
|
if (GenerationStartTime > 0.0)
|
||||||
|
CurrentLatencies.GenToAudioMs = static_cast<float>((AgentSpeakStart - GenerationStartTime) * 1000.0);
|
||||||
|
|
||||||
if (bDebug)
|
if (bDebug)
|
||||||
{
|
{
|
||||||
const double T = AgentSpeakStart - SessionStartTime;
|
const double T = AgentSpeakStart - SessionStartTime;
|
||||||
@ -1348,12 +1383,6 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::EnqueueAgentAudio(const TArray<uint8>
|
|||||||
UE_LOG(LogPS_AI_ConvAgent_ElevenLabs, Log,
|
UE_LOG(LogPS_AI_ConvAgent_ElevenLabs, Log,
|
||||||
TEXT("[T+%.2fs] [Turn %d] Agent speaking — first audio chunk. (%.2fs after turn end)"),
|
TEXT("[T+%.2fs] [Turn %d] Agent speaking — first audio chunk. (%.2fs after turn end)"),
|
||||||
T, LastClosedTurnIndex, LatencyFromTurnEnd);
|
T, LastClosedTurnIndex, LatencyFromTurnEnd);
|
||||||
|
|
||||||
// Update latency snapshot for HUD display.
|
|
||||||
if (TurnEndTime > 0.0)
|
|
||||||
LastLatencies.TotalMs = static_cast<float>((AgentSpeakStart - TurnEndTime) * 1000.0);
|
|
||||||
if (GenerationStartTime > 0.0)
|
|
||||||
LastLatencies.GenToAudioMs = static_cast<float>((AgentSpeakStart - GenerationStartTime) * 1000.0);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
OnAgentStartedSpeaking.Broadcast();
|
OnAgentStartedSpeaking.Broadcast();
|
||||||
@ -1386,10 +1415,11 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::EnqueueAgentAudio(const TArray<uint8>
|
|||||||
AudioPlaybackComponent->Play();
|
AudioPlaybackComponent->Play();
|
||||||
}
|
}
|
||||||
PlaybackStartTime = FPlatformTime::Seconds();
|
PlaybackStartTime = FPlatformTime::Seconds();
|
||||||
if (bDebug && TurnEndTime > 0.0)
|
if (GenerationStartTime > 0.0)
|
||||||
{
|
{
|
||||||
LastLatencies.EndToEarMs = static_cast<float>((PlaybackStartTime - TurnEndTime) * 1000.0);
|
CurrentLatencies.GenToEarMs = static_cast<float>((PlaybackStartTime - GenerationStartTime) * 1000.0);
|
||||||
}
|
}
|
||||||
|
// No pre-buffer in this path (immediate playback).
|
||||||
OnAudioPlaybackStarted.Broadcast();
|
OnAudioPlaybackStarted.Broadcast();
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -1417,9 +1447,13 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::EnqueueAgentAudio(const TArray<uint8>
|
|||||||
AudioPlaybackComponent->Play();
|
AudioPlaybackComponent->Play();
|
||||||
}
|
}
|
||||||
PlaybackStartTime = FPlatformTime::Seconds();
|
PlaybackStartTime = FPlatformTime::Seconds();
|
||||||
if (bDebug && TurnEndTime > 0.0)
|
if (GenerationStartTime > 0.0)
|
||||||
{
|
{
|
||||||
LastLatencies.EndToEarMs = static_cast<float>((PlaybackStartTime - TurnEndTime) * 1000.0);
|
CurrentLatencies.GenToEarMs = static_cast<float>((PlaybackStartTime - GenerationStartTime) * 1000.0);
|
||||||
|
}
|
||||||
|
if (PreBufferStartTime > 0.0)
|
||||||
|
{
|
||||||
|
CurrentLatencies.PreBufferMs = static_cast<float>((PlaybackStartTime - PreBufferStartTime) * 1000.0);
|
||||||
}
|
}
|
||||||
OnAudioPlaybackStarted.Broadcast();
|
OnAudioPlaybackStarted.Broadcast();
|
||||||
}
|
}
|
||||||
@ -2362,19 +2396,45 @@ void UPS_AI_ConvAgent_ElevenLabsComponent::DrawDebugHUD() const
|
|||||||
NetConnectedPawns.Num(), *SpeakerName));
|
NetConnectedPawns.Num(), *SpeakerName));
|
||||||
}
|
}
|
||||||
|
|
||||||
// Latencies (from last completed turn)
|
|
||||||
if (LastLatencies.TotalMs > 0.0f)
|
|
||||||
{
|
|
||||||
GEngine->AddOnScreenDebugMessage(BaseKey + 8, DisplayTime, MainColor,
|
|
||||||
FString::Printf(TEXT(" Latency: total=%.0fms (stt>gen=%.0fms gen>audio=%.0fms) ear=%.0fms"),
|
|
||||||
LastLatencies.TotalMs, LastLatencies.STTToGenMs,
|
|
||||||
LastLatencies.GenToAudioMs, LastLatencies.EndToEarMs));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Reconnection
|
// Reconnection
|
||||||
GEngine->AddOnScreenDebugMessage(BaseKey + 9, DisplayTime,
|
GEngine->AddOnScreenDebugMessage(BaseKey + 8, DisplayTime,
|
||||||
bWantsReconnect ? FColor::Red : MainColor,
|
bWantsReconnect ? FColor::Red : MainColor,
|
||||||
FString::Printf(TEXT(" Reconnect: %d/%d attempts%s"),
|
FString::Printf(TEXT(" Reconnect: %d/%d attempts%s"),
|
||||||
ReconnectAttemptCount, MaxReconnectAttempts,
|
ReconnectAttemptCount, MaxReconnectAttempts,
|
||||||
bWantsReconnect ? TEXT(" (ACTIVE)") : TEXT("")));
|
bWantsReconnect ? TEXT(" (ACTIVE)") : TEXT("")));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
void UPS_AI_ConvAgent_ElevenLabsComponent::DrawLatencyHUD() const
|
||||||
|
{
|
||||||
|
if (!GEngine) return;
|
||||||
|
|
||||||
|
// Separate BaseKey range so it never collides with DrawDebugHUD
|
||||||
|
const int32 BaseKey = 93700;
|
||||||
|
const float DisplayTime = 1.0f; // long enough to avoid flicker between ticks
|
||||||
|
|
||||||
|
const FColor TitleColor = FColor::Cyan;
|
||||||
|
const FColor ValueColor = FColor::White;
|
||||||
|
const FColor HighlightColor = FColor::Yellow;
|
||||||
|
|
||||||
|
// Helper: format a single metric — shows "---" when not yet captured this turn
|
||||||
|
auto Fmt = [](float Ms) -> FString
|
||||||
|
{
|
||||||
|
return (Ms > 0.0f) ? FString::Printf(TEXT("%.0f ms"), Ms) : FString(TEXT("---"));
|
||||||
|
};
|
||||||
|
|
||||||
|
// Title — all times measured from agent_response_started
|
||||||
|
GEngine->AddOnScreenDebugMessage(BaseKey, DisplayTime, TitleColor,
|
||||||
|
TEXT("=== Latency (from gen start) ==="));
|
||||||
|
|
||||||
|
// 1. Gen → Audio: generation start → first audio chunk (LLM + TTS)
|
||||||
|
GEngine->AddOnScreenDebugMessage(BaseKey + 1, DisplayTime, ValueColor,
|
||||||
|
FString::Printf(TEXT(" Gen>Audio: %s"), *Fmt(CurrentLatencies.GenToAudioMs)));
|
||||||
|
|
||||||
|
// 2. Pre-buffer wait before playback
|
||||||
|
GEngine->AddOnScreenDebugMessage(BaseKey + 2, DisplayTime, ValueColor,
|
||||||
|
FString::Printf(TEXT(" Pre-buffer: %s"), *Fmt(CurrentLatencies.PreBufferMs)));
|
||||||
|
|
||||||
|
// 3. Gen → Ear: generation start → playback starts (user-perceived total)
|
||||||
|
GEngine->AddOnScreenDebugMessage(BaseKey + 3, DisplayTime, HighlightColor,
|
||||||
|
FString::Printf(TEXT(" Gen>Ear: %s"), *Fmt(CurrentLatencies.GenToEarMs)));
|
||||||
|
}
|
||||||
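The `Fmt` lambda in `DrawLatencyHUD()` uses zero as a "not yet measured this turn" sentinel. A standalone equivalent using only the C++ standard library might look like the following; `FormatLatencyMs` is a hypothetical name, and `std::snprintf` stands in for `FString::Printf`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats a latency in milliseconds, or "---" when the metric has not been
// captured yet (zero or negative value).
inline std::string FormatLatencyMs(float Ms)
{
    if (Ms <= 0.0f)
    {
        return "---";
    }
    char Buf[32];
    std::snprintf(Buf, sizeof(Buf), "%.0f ms", static_cast<double>(Ms));
    return std::string(Buf);
}
```

Using zero as the sentinel keeps the struct trivially resettable, at the cost that a genuine sub-millisecond measurement rounded to `0.0f` would also display as `---`.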
|
|||||||
@ -231,6 +231,11 @@ public:
|
|||||||
meta = (ClampMin = "0", ClampMax = "3", EditCondition = "bDebug"))
|
meta = (ClampMin = "0", ClampMax = "3", EditCondition = "bDebug"))
|
||||||
int32 DebugVerbosity = 1;
|
int32 DebugVerbosity = 1;
|
||||||
|
|
||||||
|
/** Show a separate latency debug HUD with detailed per-turn timing breakdown.
|
||||||
|
* Independent from bDebug — can be enabled alone via CVar ps.ai.ConvAgent.Debug.Latency. */
|
||||||
|
UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "PS AI ConvAgent|Debug")
|
||||||
|
bool bDebugLatency = false;
|
||||||
|
|
||||||
// ── Events ────────────────────────────────────────────────────────────────
|
// ── Events ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/** Fired when the WebSocket connection is established and the conversation session is ready. Provides the ConversationID and AgentID. */
|
/** Fired when the WebSocket connection is established and the conversation session is ready. Provides the ConversationID and AgentID. */
|
||||||
@ -635,16 +640,17 @@ private:
|
|||||||
double GenerationStartTime = 0.0; // Set in HandleAgentResponseStarted — server starts generating.
|
double GenerationStartTime = 0.0; // Set in HandleAgentResponseStarted — server starts generating.
|
||||||
double PlaybackStartTime = 0.0; // Set when audio playback actually starts (post pre-buffer).
|
double PlaybackStartTime = 0.0; // Set when audio playback actually starts (post pre-buffer).
|
||||||
|
|
||||||
// Last-turn latency snapshot (ms) — updated per turn, displayed on debug HUD.
|
// Current-turn latency measurements (ms). Reset in HandleAgentResponseStarted.
|
||||||
// Persists between turns so the HUD always shows the most recent measurement.
|
// All anchored to GenerationStartTime (agent_response_started event), which is
|
||||||
|
// the closest client-side proxy for "user stopped speaking" in Server VAD mode.
|
||||||
|
// Zero means "not yet measured this turn".
|
||||||
struct FDebugLatencies
|
struct FDebugLatencies
|
||||||
{
|
{
|
||||||
float STTToGenMs = 0.0f; // TurnEnd → server starts generating
|
float GenToAudioMs = 0.0f; // agent_response_started → first audio chunk (LLM + TTS)
|
||||||
float GenToAudioMs = 0.0f; // Server generating → first audio chunk
|
float PreBufferMs = 0.0f; // Pre-buffer wait before playback starts
|
||||||
float TotalMs = 0.0f; // TurnEnd → first audio chunk
|
float GenToEarMs = 0.0f; // agent_response_started → playback starts (user-perceived)
|
||||||
float EndToEarMs = 0.0f; // TurnEnd → audio playback starts (user-perceived)
|
|
||||||
};
|
};
|
||||||
FDebugLatencies LastLatencies;
|
FDebugLatencies CurrentLatencies;
|
||||||
|
|
||||||
// Accumulates incoming PCM bytes until the audio component needs data.
|
// Accumulates incoming PCM bytes until the audio component needs data.
|
||||||
// Uses a read offset instead of RemoveAt(0,N) to avoid O(n) memmove every
|
// Uses a read offset instead of RemoveAt(0,N) to avoid O(n) memmove every
|
||||||
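To show how the three metrics in the revised `FDebugLatencies` relate to one another, here is a small engine-free sketch of per-turn tracking anchored to the generation start. `FLatencySketch` and `TurnLatencyTracker` are hypothetical names, timestamps are passed in as seconds rather than read from `FPlatformTime::Seconds()`, and resetting the pre-buffer timestamp at generation start is an assumption of this sketch rather than something the diff shows.

```cpp
#include <cassert>

struct FLatencySketch
{
    float GenToAudioMs = 0.0f; // generation start -> first audio chunk
    float PreBufferMs  = 0.0f; // pre-buffer start -> playback start
    float GenToEarMs   = 0.0f; // generation start -> playback start
};

class TurnLatencyTracker
{
public:
    // New response cycle: clear all metrics and re-anchor.
    void OnGenerationStarted(double Now)
    {
        Current = FLatencySketch();
        GenerationStart = Now;
        PreBufferStart = 0.0; // assumption: stale pre-buffer time is discarded
    }

    void OnFirstAudioChunk(double Now)
    {
        if (GenerationStart > 0.0)
        {
            Current.GenToAudioMs = ToMs(Now - GenerationStart);
        }
    }

    void OnPreBufferStarted(double Now) { PreBufferStart = Now; }

    void OnPlaybackStarted(double Now)
    {
        if (GenerationStart > 0.0)
        {
            Current.GenToEarMs = ToMs(Now - GenerationStart);
        }
        if (PreBufferStart > 0.0)
        {
            Current.PreBufferMs = ToMs(Now - PreBufferStart);
        }
    }

    const FLatencySketch& Get() const { return Current; }

private:
    static float ToMs(double Seconds) { return static_cast<float>(Seconds * 1000.0); }

    FLatencySketch Current;
    double GenerationStart = 0.0;
    double PreBufferStart = 0.0;
};
```

When pre-buffering starts as soon as the first chunk arrives, `GenToEarMs` works out to roughly `GenToAudioMs + PreBufferMs`, which is why the HUD can highlight `Gen>Ear` as the user-perceived total.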
@ -747,4 +753,5 @@ private:
|
|||||||
|
|
||||||
/** Draw on-screen debug info (called from TickComponent when bDebug). */
|
/** Draw on-screen debug info (called from TickComponent when bDebug). */
|
||||||
void DrawDebugHUD() const;
|
void DrawDebugHUD() const;
|
||||||
|
void DrawLatencyHUD() const;
|
||||||
};
|
};
|
||||||
|
|||||||