# Plan: Analyze & Fix `llm-metrics` Extension Timing Bug

## Problem Statement

The extension reports generation speed as ~2,400–8,000 tok/s (physically impossible) while prefill speed is ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).

## Phase 1: Locate & Map the Extension

1. **Find the source code**
   - Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
   - Check `~/.pi/config` or project `.pi/config` for extension/tool registration
   - Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
2. **Identify the provider integration**
   - The log shows `"provider":"llama.cpp"` — find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
   - Map the data flow: raw llama.cpp output → extension parsing → JSON log writing

## Phase 2: Diagnose the Timing Bug

3. **Trace phase boundary detection**
   - Find how the extension defines "prefill" vs "generation" start/end times
   - Check whether it uses:
     - `timeToFirstToken` (TTFT) as the split point
     - llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
     - Wall-clock timestamps around token streaming
4. **Verify the calculation**
   - Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)`
   - Check whether `totalDuration` covers only generation or the full call
   - Look for race conditions: async callbacks firing out of order, or the generation-end timestamp captured before all tokens are flushed
5. **Reproduce the anomaly**
   - Run the same model with an identical prompt/output length
   - Compare TTFT, totalDuration, and per-phase timestamps
   - Check whether the bug appears only with large prompts, speculative decoding, or certain sampling configs

## Phase 3: Fix the Implementation

6. **Correct phase boundaries**
   - If using callbacks: ensure generation start = TTFT timestamp, generation end = last token callback or explicit `done` event
   - If using wall-clock: add a small buffer after the last token to account for async flush
   - Add validation: reject generation speeds > 500 tok/s (sanity check)
7. **Fix label assignment** (a sketch of the corrected calculation follows this phase)
   - Ensure `prefillTokensPerSec` = `inputTokens / TTFT`
   - Ensure `generationTokensPerSec` = `outputTokens / (totalDuration - TTFT)`
   - Add explicit phase logging to debug output
8. **Add telemetry** (a logging sketch also follows this phase)
   - Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
   - Log per-phase token counts to catch mismatches
   - Write to `.pi/llm-metrics.log` with a consistent schema
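As a concrete reference for items 6–7, here is a minimal Python sketch of the corrected phase split and per-phase throughput calculation, including the 500 tok/s sanity ceiling from item 6. The `RawTiming` type, the `compute_phase_metrics` name, and the field names are illustrative assumptions, not the extension's actual API.

```python
from dataclasses import dataclass


@dataclass
class RawTiming:
    """Hypothetical raw timing captured around a single llama.cpp call (all times in seconds)."""
    request_start: float      # wall-clock time the request was sent
    first_token_time: float   # wall-clock time the first generated token arrived
    last_token_time: float    # wall-clock time of the last token or explicit done event
    input_tokens: int
    output_tokens: int


MAX_PLAUSIBLE_GEN_TOKS_PER_SEC = 500.0  # sanity ceiling from item 6


def compute_phase_metrics(t: RawTiming) -> dict:
    """Split the call into prefill (request start -> first token) and
    generation (first token -> last token), then compute per-phase throughput."""
    ttft = t.first_token_time - t.request_start
    generation_duration = t.last_token_time - t.first_token_time
    total_duration = t.last_token_time - t.request_start

    if ttft <= 0 or generation_duration < 0:
        raise ValueError(f"negative phase duration: ttft={ttft}, gen={generation_duration}")

    prefill_tps = t.input_tokens / ttft
    # Equivalent to outputTokens / (totalDuration - TTFT) from item 4.
    generation_tps = t.output_tokens / generation_duration if generation_duration > 0 else 0.0

    if generation_tps > MAX_PLAUSIBLE_GEN_TOKS_PER_SEC:
        raise ValueError(f"implausible generation speed {generation_tps:.0f} tok/s; "
                         "phase boundaries are probably misassigned")

    return {
        "ttftSec": ttft,
        "totalDurationSec": total_duration,
        "prefillTokensPerSec": prefill_tps,
        "generationTokensPerSec": generation_tps,
    }
```

If the 2,400–8,000 tok/s figure reappears even with these boundaries, the raw timestamps themselves are wrong rather than the formula.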
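And a sketch of the raw-timestamp telemetry from item 8, appended as JSON lines to `.pi/llm-metrics.log`. The exact schema and the `log_raw_timestamps` helper are assumptions to be mapped onto the extension's real log writer.

```python
import json
import time
from pathlib import Path

METRICS_LOG = Path(".pi/llm-metrics.log")  # log path from item 8


def log_raw_timestamps(timestamps: dict, token_counts: dict, metrics: dict) -> None:
    """Append one JSON line with raw phase timestamps, per-phase token counts,
    and derived metrics so boundary mismatches can be diagnosed after the fact."""
    entry = {
        "loggedAt": time.time(),
        "provider": "llama.cpp",
        # e.g. {"prefill_start": ..., "prefill_end": ..., "gen_start": ...,
        #       "gen_end": ..., "total_start": ..., "total_end": ...}
        "rawTimestamps": timestamps,
        # e.g. {"inputTokens": ..., "outputTokens": ...}
        "tokenCounts": token_counts,
        "metrics": metrics,
    }
    METRICS_LOG.parent.mkdir(parents=True, exist_ok=True)
    with METRICS_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```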
## Phase 4: Verify & Deploy

9. **Test cases**
   - Small prompt + short output (baseline)
   - Large prompt + long output (original failure case)
   - Speculative decoding run (if supported)
   - Early termination / stop token edge case
10. **Validate output** (a validation sketch appears at the end of this plan)
    - Generation speed should be 40–100 tok/s for this model/hardware
    - Prefill speed should be 50–200 tok/s (parallel compute)
    - TTFT should match prefill duration
    - No negative phase durations
11. **Update schema & docs**
    - Add `rawTimestamps` field to log entries for debugging
    - Document phase definitions in the extension README
    - Add unit tests for the metric calculation logic

## Deliverables

- [ ] Extension source located & data flow mapped
- [ ] Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- [ ] Fix implemented with sanity checks
- [ ] Test suite covering edge cases
- [ ] Log schema updated with raw timestamps
- [ ] PR or patch ready for review

## Questions to Answer During Analysis

- Does the extension intercept llama.cpp at the C++ level, via the CLI, or through a Python wrapper?
- Are callbacks synchronous or asynchronous?
- Is there a `done`/`end` event, or does it rely on an empty token stream?
- Could speculative decoding cause the draft model's batched verification to be misclassified as "generation"?
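To make item 10's sanity bounds mechanical, here is a hypothetical checker for a single logged entry. The thresholds come straight from the plan, the field names assume the `rawTimestamps` schema sketched after Phase 3, and the 50 ms tolerance on TTFT versus prefill duration is an arbitrary assumption.

```python
def validate_metrics_entry(entry: dict) -> list[str]:
    """Return human-readable problems with one llm-metrics log entry,
    using the sanity bounds from item 10 (empty list means the entry looks sane)."""
    problems = []
    m = entry["metrics"]
    ts = entry["rawTimestamps"]

    gen_tps = m["generationTokensPerSec"]
    prefill_tps = m["prefillTokensPerSec"]
    ttft = m["ttftSec"]

    # Bounds are specific to this model/hardware (item 10).
    if not 40 <= gen_tps <= 100:
        problems.append(f"generation speed {gen_tps:.0f} tok/s outside 40-100")
    if not 50 <= prefill_tps <= 200:
        problems.append(f"prefill speed {prefill_tps:.0f} tok/s outside 50-200")

    # TTFT should match the prefill phase duration (tolerance is an assumption).
    prefill_duration = ts["prefill_end"] - ts["prefill_start"]
    if abs(prefill_duration - ttft) > 0.05:
        problems.append(f"TTFT {ttft:.3f}s does not match prefill duration {prefill_duration:.3f}s")

    # No phase may end before it starts.
    for start, end in [("prefill_start", "prefill_end"),
                       ("gen_start", "gen_end"),
                       ("total_start", "total_end")]:
        if ts[end] < ts[start]:
            problems.append(f"negative duration: {end} < {start}")

    return problems
```

Running this over the test matrix in item 9 should flag inverted-phase entries immediately, including the original large-prompt failure case.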