Plan: Analyze & Fix llm-metrics Extension Timing Bug
Problem Statement
The extension reports generation speed as ~2,400–8,000 tok/s (physically impossible for this setup) while prefill speed is reported as ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).
Phase 1: Locate & Map the Extension
- Find the source code
  - Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
  - Check `~/.pi/config` or the project `.pi/config` for extension/tool registration
  - Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
- Identify the provider integration
  - The log shows `"provider":"llama.cpp"`; find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
  - Map the data flow: raw llama.cpp output → extension parsing → JSON log writing (a hypothetical schema sketch follows this list)
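To make the data-flow mapping concrete, the sketch below names the two shapes worth locating in the source: the raw timing data parsed from llama.cpp output, and the entry written to the JSON log. Every identifier here is a placeholder for illustration; only `provider`, `prefillTokensPerSec`, and `generationTokensPerSec` are taken from this plan, and the extension's real names must be read from its code.

```ts
// Hypothetical shapes only; the real field names are unknown until Phase 1
// locates the source. The goal is to mark the two places where a phase
// boundary can be misassigned: parse time and log-write time.

/** Raw timing data parsed from llama.cpp output (placeholder names). */
interface RawProviderTimings {
  requestStartMs: number; // wall-clock time the request was sent
  firstTokenMs: number;   // wall-clock time of the first streamed token (TTFT boundary)
  lastTokenMs: number;    // wall-clock time of the last streamed token
  inputTokens: number;    // prompt tokens processed during prefill
  outputTokens: number;   // tokens produced during generation
}

/** Entry the extension writes to .pi/llm-metrics.log (placeholder shape). */
interface MetricsLogEntry {
  provider: "llama.cpp";
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  totalDurationMs: number;
}
```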
Phase 2: Diagnose the Timing Bug
- Trace phase boundary detection
  - Find how the extension defines "prefill" vs. "generation" start/end times
  - Check if it uses:
    - `timeToFirstToken` (TTFT) as the split point
    - llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
    - Wall-clock timestamps around token streaming
- Verify the calculation (see the sketch after this list)
  - Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)`
  - Check whether `totalDuration` covers only generation or the full call
  - Look for race conditions: async callbacks firing out of order, or the generation end timestamp captured before all tokens are flushed
- Reproduce the anomaly
  - Run the same model with identical prompt/output length
  - Compare TTFT, `totalDuration`, and per-phase timestamps
  - Check whether the bug appears only with large prompts, speculative decoding, or certain sampling configs
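As a diagnosis aid, here is a minimal sketch of how the formula above behaves when the phase boundary is misassigned. The token counts and durations are made up for illustration (chosen so prefill lands near the observed ~70 tok/s); none of them come from the actual log, and the helper name is a placeholder.

```ts
// Illustrative only: shows how a collapsed or misplaced generation window turns a
// plausible run into an "impossible" tok/s figure. All numbers are invented.

function throughput(tokens: number, durationMs: number): number {
  return tokens / (durationMs / 1000); // tok/s
}

const inputTokens = 1000;    // prompt tokens
const outputTokens = 400;    // generated tokens
const ttftMs = 14_300;       // prefill ends here: 1000 / 14.3 s ≈ 70 tok/s
const totalDurationMs = 20_400;

// Intended formulas:
const prefillTokPerSec = throughput(inputTokens, ttftMs);                       // ≈ 70 tok/s
const generationTokPerSec = throughput(outputTokens, totalDurationMs - ttftMs); // ≈ 66 tok/s

// Misaligned boundary: if the generation end timestamp is captured ~50 ms after
// the generation start (before the stream flushes), the same formula explodes.
const buggyGenerationTokPerSec = throughput(outputTokens, 50);                  // 8,000 tok/s

console.log({ prefillTokPerSec, generationTokPerSec, buggyGenerationTokPerSec });
```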
Phase 3: Fix the Implementation
- Correct phase boundaries
  - If using callbacks: ensure generation start = TTFT timestamp, and generation end = last token callback or an explicit `done` event
  - If using wall-clock timestamps: add a small buffer after the last token to account for async flush
  - Add validation: reject generation speeds > 500 tok/s (sanity check)
- Fix label assignment
  - Ensure `prefillTokensPerSec = inputTokens / TTFT`
  - Ensure `generationTokensPerSec = outputTokens / (totalDuration - TTFT)`
  - Add explicit phase logging to debug output
- Add telemetry (a sketch of the corrected calculation follows this list)
  - Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
  - Log per-phase token counts to catch mismatches
  - Write to `.pi/llm-metrics.log` with a consistent schema
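A minimal sketch of the corrected calculation, assuming the extension can capture the six raw timestamps listed above. The function name and error handling are assumptions; the timestamp fields, the `prefillTokensPerSec`/`generationTokensPerSec` names, the 500 tok/s bound, and the `rawTimestamps` field come from this plan.

```ts
// Sketch of the corrected metric calculation with the sanity checks from Phase 3.
// computeMetrics and its error handling are placeholders, not the extension's API.

interface RawTimestamps {
  total_start: number;   // ms, request sent
  prefill_start: number; // ms, typically equals total_start
  prefill_end: number;   // ms, first token received (TTFT boundary)
  gen_start: number;     // ms, typically equals prefill_end
  gen_end: number;       // ms, last token callback or explicit done event
  total_end: number;     // ms, stream fully flushed and closed
}

interface PhaseMetrics {
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  rawTimestamps: RawTimestamps;
}

const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500; // sanity bound from Phase 3

export function computeMetrics(
  inputTokens: number,
  outputTokens: number,
  t: RawTimestamps,
): PhaseMetrics {
  const prefillMs = t.prefill_end - t.prefill_start;
  const genMs = t.gen_end - t.gen_start;
  if (prefillMs <= 0 || genMs <= 0) {
    throw new Error(`negative or zero phase duration: prefill=${prefillMs}ms gen=${genMs}ms`);
  }
  const prefillTokensPerSec = inputTokens / (prefillMs / 1000);
  const generationTokensPerSec = outputTokens / (genMs / 1000);
  if (generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    throw new Error(
      `implausible generation speed ${generationTokensPerSec.toFixed(0)} tok/s; check phase boundaries`,
    );
  }
  return { prefillTokensPerSec, generationTokensPerSec, rawTimestamps: t };
}
```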
Phase 4: Verify & Deploy
- Test cases (a unit-test sketch follows this list)
  - Small prompt + short output (baseline)
  - Large prompt + long output (original failure case)
  - Speculative decoding run (if supported)
  - Early termination / stop-token edge case
- Validate output
  - Generation speed should be 40–100 tok/s for this model/hardware
  - Prefill speed should be 50–200 tok/s (parallel compute)
  - TTFT should match prefill duration
  - No negative phase durations
- Update schema & docs
  - Add a `rawTimestamps` field to log entries for debugging
  - Document phase definitions in the extension README
  - Add unit tests for metric calculation logic
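A unit-test sketch for the metric calculation, written against the hypothetical `computeMetrics` helper from the Phase 3 sketch; the import path is a placeholder, and the expected ranges come from the validation targets above.

```ts
// Unit-test sketch using node:test. The "./metrics" path is a placeholder for
// wherever the extension's metric calculation ends up living.
import test from "node:test";
import assert from "node:assert/strict";
import { computeMetrics } from "./metrics"; // placeholder path

test("generation speed uses the post-TTFT window", () => {
  const m = computeMetrics(1000, 400, {
    total_start: 0,
    prefill_start: 0,
    prefill_end: 14_300, // TTFT: ~70 tok/s prefill
    gen_start: 14_300,
    gen_end: 20_400,     // ~66 tok/s generation
    total_end: 20_450,
  });
  assert.ok(m.prefillTokensPerSec > 50 && m.prefillTokensPerSec < 200);
  assert.ok(m.generationTokensPerSec > 40 && m.generationTokensPerSec < 100);
});

test("implausible generation speed is rejected", () => {
  // generation window collapsed to 50 ms -> 8,000 tok/s, which must be rejected
  assert.throws(() =>
    computeMetrics(1000, 400, {
      total_start: 0,
      prefill_start: 0,
      prefill_end: 14_300,
      gen_start: 14_300,
      gen_end: 14_350,
      total_end: 14_400,
    }),
  );
});

test("negative phase durations are rejected", () => {
  assert.throws(() =>
    computeMetrics(1000, 400, {
      total_start: 0,
      prefill_start: 0,
      prefill_end: 14_300,
      gen_start: 14_300,
      gen_end: 13_000, // end before start
      total_end: 14_400,
    }),
  );
});
```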
Deliverables
- Extension source located & data flow mapped
- Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- Fix implemented with sanity checks
- Test suite covering edge cases
- Log schema updated with raw timestamps
- PR or patch ready for review
Questions to Answer During Analysis
- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?