# Plan: Analyze & Fix `llm-metrics` Extension Timing Bug
## Problem Statement
The extension reports generation speed as ~8,000-2,400 tok/s (physically impossible) while prefill speed is ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53-70 tok/s (confirmed by earlier runs).
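For scale, a rough sanity check under assumed numbers (a ~500-token response at the confirmed ~60 tok/s): real generation should take about 500 / 60 ≈ 8.3 s, whereas reporting ~8,000 tok/s implies a divisor of 500 / 8,000 ≈ 0.06 s. A window that small is on the order of a single token interval, which is what you would see if the "generation" duration were bounded by two adjacent timestamps (or a misassigned phase boundary) instead of spanning TTFT to the final token.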
## Phase 1: Locate & Map the Extension
1. **Find the source code**
- Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
- Check `~/.pi/config` or project `.pi/config` for extension/tool registration
- Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
2. **Identify the provider integration**
- The log shows `"provider":"llama.cpp"` — find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
- Map the data flow: raw llama.cpp output → extension parsing → JSON log writing
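
To make the mapping concrete, here is a rough sketch of the log-entry shape implied by the field names referenced elsewhere in this plan; the actual schema must be confirmed against the extension source.

```typescript
// Assumed shape of a .pi/llm-metrics.log entry, inferred from the field names
// used in this plan (provider, token counts, TTFT, per-phase speeds).
// Confirm against the real extension before relying on it.
interface MetricsLogEntry {
  provider: string;               // e.g. "llama.cpp", as seen in the observed log
  inputTokens: number;
  outputTokens: number;
  timeToFirstToken: number;       // seconds; should equal the prefill duration
  totalDuration: number;          // seconds; full call (prefill + generation)
  prefillTokensPerSec: number;    // inputTokens / timeToFirstToken
  generationTokensPerSec: number; // outputTokens / (totalDuration - timeToFirstToken)
}
```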
## Phase 2: Diagnose the Timing Bug
3. **Trace phase boundary detection**
- Find how the extension defines "prefill" vs "generation" start/end times
- Check if it uses:
- `timeToFirstToken` (TTFT) as the split point
- llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
- Wall-clock timestamps around token streaming
4. **Verify the calculation**
- Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)` (a diagnostic sketch that recomputes both speeds follows this list)
- Check if `totalDuration` includes only generation, or the full call
- Look for race conditions: async callbacks firing out of order, or generation end timestamp captured before all tokens are flushed
5. **Reproduce the anomaly**
- Run the same model with identical prompt/output length
- Compare TTFT, totalDuration, and per-phase timestamps
- Check if the bug appears only with large prompts, speculative decoding, or certain sampling configs
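
Once the hook point is known, a small diagnostic helper like the sketch below (names illustrative, timestamps assumed to be captured in milliseconds around the streaming callbacks) can recompute both speeds independently and compare them with what the extension logs, narrowing down which window the buggy divisor actually covers.

```typescript
// Diagnostic sketch: recompute per-phase speeds from independently captured
// timestamps and compare against the values the extension logs.
// All names are illustrative; capture the timestamps around the real callbacks.
interface RawTiming {
  requestStart: number; // ms: completion request issued
  firstTokenAt: number; // ms: first streamed token (end of prefill / TTFT)
  lastTokenAt: number;  // ms: last streamed token or explicit done event
  inputTokens: number;
  outputTokens: number;
}

function recomputeSpeeds(t: RawTiming) {
  const ttftSec = (t.firstTokenAt - t.requestStart) / 1000;
  const genSec = (t.lastTokenAt - t.firstTokenAt) / 1000;
  return {
    prefillTokPerSec: t.inputTokens / ttftSec,    // expected: roughly the logged prefill value
    generationTokPerSec: t.outputTokens / genSec, // expected: the real ~53-70 tok/s
    // One candidate mislabel to compare against the logged generation value:
    swappedWindow: t.outputTokens / ttftSec,      // output tokens divided by the prefill window
  };
}
```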
## Phase 3: Fix the Implementation
6. **Correct phase boundaries**
- If using callbacks: ensure generation start = TTFT timestamp, generation end = last token callback or explicit `done` event
- If using wall-clock: add a small buffer after last token to account for async flush
- Add validation: reject generation speeds > 500 tok/s (sanity check; see the sketch after this list)
7. **Fix label assignment**
- Ensure `prefillTokensPerSec` = `inputTokens / TTFT`
- Ensure `generationTokensPerSec` = `outputTokens / (totalDuration - TTFT)`
- Add explicit phase logging to debug output
8. **Add telemetry**
- Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
- Log per-phase token counts to catch mismatches
- Write to `.pi/llm-metrics.log` with consistent schema
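
A minimal sketch of what the corrected boundary handling, label assignment, and sanity checks from steps 6-8 could look like; `buildMetrics` and the field names are placeholders, not the extension's actual API.

```typescript
// Sketch of corrected phase boundaries, label assignment, and telemetry.
// Placeholder names; wire into wherever the extension builds its log entry.

const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500; // sanity ceiling from step 6

interface PhaseTimestamps {
  totalStart: number; // ms: request issued
  prefillEnd: number; // ms: first token callback (TTFT)
  genEnd: number;     // ms: last token callback or explicit `done` event
}

export function buildMetrics(
  ts: PhaseTimestamps,
  inputTokens: number,
  outputTokens: number,
) {
  const ttftSec = (ts.prefillEnd - ts.totalStart) / 1000;
  const genSec = (ts.genEnd - ts.prefillEnd) / 1000;

  // Reject impossible values instead of logging garbage:
  // no negative phases, no generation speeds above the sanity ceiling.
  if (ttftSec <= 0 || genSec <= 0) {
    throw new Error(`Non-positive phase duration: ttft=${ttftSec}s gen=${genSec}s`);
  }
  const generationTokensPerSec = outputTokens / genSec;
  if (generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    throw new Error(`Implausible generation speed: ${generationTokensPerSec.toFixed(1)} tok/s`);
  }

  return {
    prefillTokensPerSec: inputTokens / ttftSec,
    generationTokensPerSec,
    rawTimestamps: { ...ts }, // steps 8 and 11: keep raw timestamps for debugging
  };
}
```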
## Phase 4: Verify & Deploy
9. **Test cases**
- Small prompt + short output (baseline)
- Large prompt + long output (original failure case)
- Speculative decoding run (if supported)
- Early termination / stop token edge case
10. **Validate output**
- Generation speed should be 40-100 tok/s for this model/hardware
- Prefill speed should be 50-200 tok/s (parallel compute)
- TTFT should match prefill duration
- No negative phase durations
11. **Update schema & docs**
- Add `rawTimestamps` field to log entries for debugging
- Document phase definitions in extension README
- Add unit tests for metric calculation logic
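
A matching unit-test sketch for the calculation logic, using plain `node:assert` to stay framework-neutral; the `./metrics` import path and `buildMetrics` name are hypothetical and should point at wherever the fixed calculation ends up.

```typescript
// Unit-test sketch: 2,048-token prompt, 512-token response,
// 4 s prefill and 10 s generation should yield 512 and 51.2 tok/s.
import assert from "node:assert/strict";
import { buildMetrics } from "./metrics"; // hypothetical module path

const m = buildMetrics({ totalStart: 0, prefillEnd: 4_000, genEnd: 14_000 }, 2048, 512);

assert.ok(Math.abs(m.prefillTokensPerSec - 512) < 1e-9);     // 2048 tokens / 4 s
assert.ok(Math.abs(m.generationTokensPerSec - 51.2) < 1e-9); // 512 tokens / 10 s

// Implausible speeds and negative phase durations must be rejected outright.
assert.throws(() => buildMetrics({ totalStart: 0, prefillEnd: 10, genEnd: 20 }, 2048, 512));
assert.throws(() => buildMetrics({ totalStart: 0, prefillEnd: 4_000, genEnd: 3_000 }, 2048, 512));
```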
## Deliverables
- [ ] Extension source located & data flow mapped
- [ ] Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- [ ] Fix implemented with sanity checks
- [ ] Test suite covering edge cases
- [ ] Log schema updated with raw timestamps
- [ ] PR or patch ready for review
## Questions to Answer During Analysis
- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?