Plan: Analyze & Fix llm-metrics Extension Timing Bug
Problem Statement
The extension reports generation speed as ~2,400–8,000 tok/s (physically impossible for this setup) while prefill speed is reported as ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).
Phase 1: Locate & Map the Extension
- Find the source code
  - Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
  - Check `~/.pi/config` or the project `.pi/config` for extension/tool registration
  - Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
- Identify the provider integration
  - The log shows `"provider":"llama.cpp"`; find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
  - Map the data flow: raw llama.cpp output → extension parsing → JSON log writing (a hypothetical schema sketch follows this list)
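To make the data-flow mapping concrete, the sketch below names the two shapes worth locating in the source: the raw timing data parsed from llama.cpp output, and the entry written to the JSON log. Every identifier here is a placeholder for illustration; only `provider`, `prefillTokensPerSec`, and `generationTokensPerSec` are taken from this plan, and the extension's real names must be read from its code.

```ts
// Hypothetical shapes only; the real field names are unknown until Phase 1
// locates the source. The goal is to mark the two places where a phase
// boundary can be misassigned: parse time and log-write time.

/** Raw timing data parsed from llama.cpp output (placeholder names). */
interface RawProviderTimings {
  requestStartMs: number; // wall-clock time the request was sent
  firstTokenMs: number;   // wall-clock time of the first streamed token (TTFT boundary)
  lastTokenMs: number;    // wall-clock time of the last streamed token
  inputTokens: number;    // prompt tokens processed during prefill
  outputTokens: number;   // tokens produced during generation
}

/** Entry the extension writes to .pi/llm-metrics.log (placeholder shape). */
interface MetricsLogEntry {
  provider: "llama.cpp";
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  totalDurationMs: number;
}
```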
Phase 2: Diagnose the Timing Bug
- Trace phase boundary detection
  - Find how the extension defines "prefill" vs. "generation" start/end times
  - Check if it uses:
    - `timeToFirstToken` (TTFT) as the split point
    - llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
    - Wall-clock timestamps around token streaming
- Verify the calculation (see the sketch after this list)
  - Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)`
  - Check whether `totalDuration` covers only generation or the full call
  - Look for race conditions: async callbacks firing out of order, or the generation end timestamp captured before all tokens are flushed
- Reproduce the anomaly
  - Run the same model with identical prompt/output length
  - Compare TTFT, `totalDuration`, and per-phase timestamps
  - Check whether the bug appears only with large prompts, speculative decoding, or certain sampling configs
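As a diagnosis aid, here is a minimal sketch of how the formula above behaves when the phase boundary is misassigned. The token counts and durations are made up for illustration (chosen so prefill lands near the observed ~70 tok/s); none of them come from the actual log, and the helper name is a placeholder.

```ts
// Illustrative only: shows how a collapsed or misplaced generation window turns a
// plausible run into an "impossible" tok/s figure. All numbers are invented.

function throughput(tokens: number, durationMs: number): number {
  return tokens / (durationMs / 1000); // tok/s
}

const inputTokens = 1000;    // prompt tokens
const outputTokens = 400;    // generated tokens
const ttftMs = 14_300;       // prefill ends here: 1000 / 14.3 s ≈ 70 tok/s
const totalDurationMs = 20_400;

// Intended formulas:
const prefillTokPerSec = throughput(inputTokens, ttftMs);                       // ≈ 70 tok/s
const generationTokPerSec = throughput(outputTokens, totalDurationMs - ttftMs); // ≈ 66 tok/s

// Misaligned boundary: if the generation end timestamp is captured ~50 ms after
// the generation start (before the stream flushes), the same formula explodes.
const buggyGenerationTokPerSec = throughput(outputTokens, 50);                  // 8,000 tok/s

console.log({ prefillTokPerSec, generationTokPerSec, buggyGenerationTokPerSec });
```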
Phase 3: Fix the Implementation
- Correct phase boundaries
  - If using callbacks: ensure generation start = TTFT timestamp, and generation end = last token callback or an explicit `done` event
  - If using wall-clock timestamps: add a small buffer after the last token to account for async flush
  - Add validation: reject generation speeds > 500 tok/s (sanity check)
- Fix label assignment
  - Ensure `prefillTokensPerSec = inputTokens / TTFT`
  - Ensure `generationTokensPerSec = outputTokens / (totalDuration - TTFT)`
  - Add explicit phase logging to debug output
- Add telemetry (a sketch of the corrected calculation follows this list)
  - Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
  - Log per-phase token counts to catch mismatches
  - Write to `.pi/llm-metrics.log` with a consistent schema
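A minimal sketch of the corrected calculation, assuming the extension can capture the six raw timestamps listed above. The function name and error handling are assumptions; the timestamp fields, the `prefillTokensPerSec`/`generationTokensPerSec` names, the 500 tok/s bound, and the `rawTimestamps` field come from this plan.

```ts
// Sketch of the corrected metric calculation with the sanity checks from Phase 3.
// computeMetrics and its error handling are placeholders, not the extension's API.

interface RawTimestamps {
  total_start: number;   // ms, request sent
  prefill_start: number; // ms, typically equals total_start
  prefill_end: number;   // ms, first token received (TTFT boundary)
  gen_start: number;     // ms, typically equals prefill_end
  gen_end: number;       // ms, last token callback or explicit done event
  total_end: number;     // ms, stream fully flushed and closed
}

interface PhaseMetrics {
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  rawTimestamps: RawTimestamps;
}

const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500; // sanity bound from Phase 3

export function computeMetrics(
  inputTokens: number,
  outputTokens: number,
  t: RawTimestamps,
): PhaseMetrics {
  const prefillMs = t.prefill_end - t.prefill_start;
  const genMs = t.gen_end - t.gen_start;
  if (prefillMs <= 0 || genMs <= 0) {
    throw new Error(`negative or zero phase duration: prefill=${prefillMs}ms gen=${genMs}ms`);
  }
  const prefillTokensPerSec = inputTokens / (prefillMs / 1000);
  const generationTokensPerSec = outputTokens / (genMs / 1000);
  if (generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    throw new Error(
      `implausible generation speed ${generationTokensPerSec.toFixed(0)} tok/s; check phase boundaries`,
    );
  }
  return { prefillTokensPerSec, generationTokensPerSec, rawTimestamps: t };
}
```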
Phase 4: Verify & Deploy
- Test cases (a unit-test sketch follows this list)
  - Small prompt + short output (baseline)
  - Large prompt + long output (original failure case)
  - Speculative decoding run (if supported)
  - Early termination / stop-token edge case
- Validate output
  - Generation speed should be 40–100 tok/s for this model/hardware
  - Prefill speed should be 50–200 tok/s (parallel compute)
  - TTFT should match prefill duration
  - No negative phase durations
- Update schema & docs
  - Add a `rawTimestamps` field to log entries for debugging
  - Document phase definitions in the extension README
  - Add unit tests for metric calculation logic
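A unit-test sketch for the metric calculation, written against the hypothetical `computeMetrics` helper from the Phase 3 sketch; the import path is a placeholder, and the expected ranges come from the validation targets above.

```ts
// Unit-test sketch using node:test. The "./metrics" path is a placeholder for
// wherever the extension's metric calculation ends up living.
import test from "node:test";
import assert from "node:assert/strict";
import { computeMetrics } from "./metrics"; // placeholder path

test("generation speed uses the post-TTFT window", () => {
  const m = computeMetrics(1000, 400, {
    total_start: 0,
    prefill_start: 0,
    prefill_end: 14_300, // TTFT: ~70 tok/s prefill
    gen_start: 14_300,
    gen_end: 20_400,     // ~66 tok/s generation
    total_end: 20_450,
  });
  assert.ok(m.prefillTokensPerSec > 50 && m.prefillTokensPerSec < 200);
  assert.ok(m.generationTokensPerSec > 40 && m.generationTokensPerSec < 100);
});

test("implausible generation speed is rejected", () => {
  // generation window collapsed to 50 ms -> 8,000 tok/s, which must be rejected
  assert.throws(() =>
    computeMetrics(1000, 400, {
      total_start: 0,
      prefill_start: 0,
      prefill_end: 14_300,
      gen_start: 14_300,
      gen_end: 14_350,
      total_end: 14_400,
    }),
  );
});

test("negative phase durations are rejected", () => {
  assert.throws(() =>
    computeMetrics(1000, 400, {
      total_start: 0,
      prefill_start: 0,
      prefill_end: 14_300,
      gen_start: 14_300,
      gen_end: 13_000, // end before start
      total_end: 14_400,
    }),
  );
});
```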
Deliverables
- Extension source located & data flow mapped
- Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- Fix implemented with sanity checks
- Test suite covering edge cases
- Log schema updated with raw timestamps
- PR or patch ready for review
Questions to Answer During Analysis
- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?