user_transcript arrives AFTER agent_response_started in Server VAD mode
(the server detects end of speech via VAD, starts generating immediately,
and STT completes later). This caused Transcript>Gen to show stale values
(19s) and Total < Gen>Audio (impossible).
Now all metrics are anchored to GenerationStartTime (agent_response_started),
which is the closest client-side proxy for "user stopped speaking":
- Gen>Audio: generation start → first audio chunk (LLM + TTS)
- Pre-buffer: wait before playback
- Gen>Ear: generation start → playback starts (user-perceived)
Removed STTToGenMs, TotalMs, EndToEarMs, UserSpeechMs (all depended on
unreliable timestamps). Simpler, always correct, 3 clear metrics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>