# Plan: Analyze & Fix `llm-metrics` Extension Timing Bug
## Problem Statement
The extension reports generation speed as ~8,000-2,400 tok/s (physically impossible) while prefill speed is ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53-70 tok/s (confirmed by earlier runs).
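For scale, a rough sanity check under assumed numbers (a ~500-token response at the confirmed ~60 tok/s): real generation should take about 500 / 60 ≈ 8.3 s, whereas reporting ~8,000 tok/s implies a divisor of 500 / 8,000 ≈ 0.06 s. A window that small is on the order of a single token interval, which is what you would see if the "generation" duration were bounded by two adjacent timestamps (or a misassigned phase boundary) instead of spanning TTFT to the final token.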
## Phase 1: Locate & Map the Extension
1. **Find the source code**
- Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
- Check `~/.pi/config` or project `.pi/config` for extension/tool registration
- Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
2. **Identify the provider integration**
- The log shows `"provider":"llama.cpp"` — find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
- Map the data flow: raw llama.cpp output → extension parsing → JSON log writing
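
To make the mapping concrete, here is a rough sketch of the log-entry shape implied by the field names referenced elsewhere in this plan; the actual schema must be confirmed against the extension source.

```typescript
// Assumed shape of a .pi/llm-metrics.log entry, inferred from the field names
// used in this plan (provider, token counts, TTFT, per-phase speeds).
// Confirm against the real extension before relying on it.
interface MetricsLogEntry {
  provider: string;               // e.g. "llama.cpp", as seen in the observed log
  inputTokens: number;
  outputTokens: number;
  timeToFirstToken: number;       // seconds; should equal the prefill duration
  totalDuration: number;          // seconds; full call (prefill + generation)
  prefillTokensPerSec: number;    // inputTokens / timeToFirstToken
  generationTokensPerSec: number; // outputTokens / (totalDuration - timeToFirstToken)
}
```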
## Phase 2: Diagnose the Timing Bug
3. **Trace phase boundary detection**
- Find how the extension defines "prefill" vs "generation" start/end times
- Check if it uses:
- `timeToFirstToken` (TTFT) as the split point
- llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
- Wall-clock timestamps around token streaming
4. **Verify the calculation**
- Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)` (a diagnostic sketch that recomputes both speeds follows this list)
- Check if `totalDuration` includes only generation, or the full call
- Look for race conditions: async callbacks firing out of order, or generation end timestamp captured before all tokens are flushed
5. **Reproduce the anomaly**
- Run the same model with identical prompt/output length
- Compare TTFT, totalDuration, and per-phase timestamps
- Check if the bug appears only with large prompts, speculative decoding, or certain sampling configs
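
Once the hook point is known, a small diagnostic helper like the sketch below (names illustrative, timestamps assumed to be captured in milliseconds around the streaming callbacks) can recompute both speeds independently and compare them with what the extension logs, narrowing down which window the buggy divisor actually covers.

```typescript
// Diagnostic sketch: recompute per-phase speeds from independently captured
// timestamps and compare against the values the extension logs.
// All names are illustrative; capture the timestamps around the real callbacks.
interface RawTiming {
  requestStart: number; // ms: completion request issued
  firstTokenAt: number; // ms: first streamed token (end of prefill / TTFT)
  lastTokenAt: number;  // ms: last streamed token or explicit done event
  inputTokens: number;
  outputTokens: number;
}

function recomputeSpeeds(t: RawTiming) {
  const ttftSec = (t.firstTokenAt - t.requestStart) / 1000;
  const genSec = (t.lastTokenAt - t.firstTokenAt) / 1000;
  return {
    prefillTokPerSec: t.inputTokens / ttftSec,    // expected: roughly the logged prefill value
    generationTokPerSec: t.outputTokens / genSec, // expected: the real ~53-70 tok/s
    // One candidate mislabel to compare against the logged generation value:
    swappedWindow: t.outputTokens / ttftSec,      // output tokens divided by the prefill window
  };
}
```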
## Phase 3: Fix the Implementation
6. **Correct phase boundaries**
- If using callbacks: ensure generation start = TTFT timestamp, generation end = last token callback or explicit `done` event
- If using wall-clock: add a small buffer after last token to account for async flush
- Add validation: reject generation speeds > 500 tok/s (sanity check; see the sketch after this list)
7. **Fix label assignment**
- Ensure `prefillTokensPerSec` = `inputTokens / TTFT`
- Ensure `generationTokensPerSec` = `outputTokens / (totalDuration - TTFT)`
- Add explicit phase logging to debug output
8. **Add telemetry**
- Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
- Log per-phase token counts to catch mismatches
- Write to `.pi/llm-metrics.log` with consistent schema
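
A minimal sketch of what the corrected boundary handling, label assignment, and sanity checks from steps 6-8 could look like; `buildMetrics` and the field names are placeholders, not the extension's actual API.

```typescript
// Sketch of corrected phase boundaries, label assignment, and telemetry.
// Placeholder names; wire into wherever the extension builds its log entry.

const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500; // sanity ceiling from step 6

interface PhaseTimestamps {
  totalStart: number; // ms: request issued
  prefillEnd: number; // ms: first token callback (TTFT)
  genEnd: number;     // ms: last token callback or explicit `done` event
}

export function buildMetrics(
  ts: PhaseTimestamps,
  inputTokens: number,
  outputTokens: number,
) {
  const ttftSec = (ts.prefillEnd - ts.totalStart) / 1000;
  const genSec = (ts.genEnd - ts.prefillEnd) / 1000;

  // Reject impossible values instead of logging garbage:
  // no negative phases, no generation speeds above the sanity ceiling.
  if (ttftSec <= 0 || genSec <= 0) {
    throw new Error(`Non-positive phase duration: ttft=${ttftSec}s gen=${genSec}s`);
  }
  const generationTokensPerSec = outputTokens / genSec;
  if (generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    throw new Error(`Implausible generation speed: ${generationTokensPerSec.toFixed(1)} tok/s`);
  }

  return {
    prefillTokensPerSec: inputTokens / ttftSec,
    generationTokensPerSec,
    rawTimestamps: { ...ts }, // steps 8 and 11: keep raw timestamps for debugging
  };
}
```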
## Phase 4: Verify & Deploy
9. **Test cases**
- Small prompt + short output (baseline)
- Large prompt + long output (original failure case)
- Speculative decoding run (if supported)
- Early termination / stop token edge case
10. **Validate output**
- Generation speed should be 40-100 tok/s for this model/hardware
- Prefill speed should be 50-200 tok/s (parallel compute)
- TTFT should match prefill duration
- No negative phase durations
11. **Update schema & docs**
- Add `rawTimestamps` field to log entries for debugging
- Document phase definitions in extension README
- Add unit tests for metric calculation logic
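
A matching unit-test sketch for the calculation logic, using plain `node:assert` to stay framework-neutral; the `./metrics` import path and `buildMetrics` name are hypothetical and should point at wherever the fixed calculation ends up.

```typescript
// Unit-test sketch: 2,048-token prompt, 512-token response,
// 4 s prefill and 10 s generation should yield 512 and 51.2 tok/s.
import assert from "node:assert/strict";
import { buildMetrics } from "./metrics"; // hypothetical module path

const m = buildMetrics({ totalStart: 0, prefillEnd: 4_000, genEnd: 14_000 }, 2048, 512);

assert.ok(Math.abs(m.prefillTokensPerSec - 512) < 1e-9);     // 2048 tokens / 4 s
assert.ok(Math.abs(m.generationTokensPerSec - 51.2) < 1e-9); // 512 tokens / 10 s

// Implausible speeds and negative phase durations must be rejected outright.
assert.throws(() => buildMetrics({ totalStart: 0, prefillEnd: 10, genEnd: 20 }, 2048, 512));
assert.throws(() => buildMetrics({ totalStart: 0, prefillEnd: 4_000, genEnd: 3_000 }, 2048, 512));
```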
## Deliverables
- [ ] Extension source located & data flow mapped
- [ ] Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- [ ] Fix implemented with sanity checks
- [ ] Test suite covering edge cases
- [ ] Log schema updated with raw timestamps
- [ ] PR or patch ready for review
## Questions to Answer During Analysis
- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?