# Plan: Analyze & Fix `llm-metrics` Extension Timing Bug
## Problem Statement

The extension reports generation speed as ~2,400–8,000 tok/s, which is physically impossible, while prefill speed is ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).

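
For intuition, here is a minimal worked example with hypothetical figures (not taken from the actual logs) showing how a misplaced phase split can reproduce both symptoms at once: a prefill figure around 70 tok/s and a generation figure in the thousands.

```python
# All numbers below are illustrative, not measured.
input_tokens, output_tokens = 800, 350
total_s = 11.0         # whole call duration
true_split_s = 5.0     # real end of prefill (time to first token)
late_split_s = 10.9    # split point mistakenly captured near the end of the call

def speeds(split_s):
    prefill = input_tokens / split_s
    generation = output_tokens / (total_s - split_s)
    return prefill, generation

for label, split in (("correct split", true_split_s), ("late split   ", late_split_s)):
    prefill, generation = speeds(split)
    print(f"{label}: prefill {prefill:6.1f} tok/s, generation {generation:7.1f} tok/s")
```

If the split timestamp lands near the end of the call instead of at the first token, almost the whole duration is charged to prefill (800 / 10.9 ≈ 73 tok/s) and the generated tokens are divided by a sliver of time (350 / 0.1 ≈ 3,500 tok/s). This is only one plausible mechanism; Phase 2 is where the real one gets pinned down.
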
## Phase 1: Locate & Map the Extension

1. **Find the source code**
   - Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, or `benchmark` (a search sketch follows at the end of this phase)
   - Check `~/.pi/config` or the project's `.pi/config` for extension/tool registration
   - Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
2. **Identify the provider integration**
   - The log shows `"provider":"llama.cpp"`; find where the extension hooks into llama.cpp (likely via a subprocess, a WebSocket, or callback interception)
   - Map the data flow: raw llama.cpp output → extension parsing → JSON log writing
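
A quick way to run the search in step 1, assuming the directories named above exist on this machine (adjust the roots and keywords to the actual install layout, and add the pi-coding-agent package directory once its path is known):

```python
from pathlib import Path

# Candidate locations named in this plan; adjust as needed.
roots = [Path.home() / ".pi" / "extensions", Path.home() / ".pi" / "tools", Path(".pi")]
keywords = ("llm", "metric", "performance", "benchmark")

for root in roots:
    if not root.exists():
        continue
    for path in root.rglob("*"):
        if path.is_file() and any(k in path.name.lower() for k in keywords):
            print(path)
```
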
## Phase 2: Diagnose the Timing Bug
3. **Trace phase boundary detection**
   - Find how the extension defines "prefill" vs "generation" start/end times
   - Check whether it uses:
     - `timeToFirstToken` (TTFT) as the split point
     - llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
     - wall-clock timestamps around token streaming
4. **Verify the calculation**
   - Confirm the formula: `generationTokensPerSec = outputTokens / (totalDuration - TTFT)` (a recomputation sketch follows at the end of this phase)
   - Check whether `totalDuration` covers only the generation phase or the full call
   - Look for race conditions: async callbacks firing out of order, or the generation-end timestamp being captured before all tokens are flushed
5. **Reproduce the anomaly**
   - Run the same model with an identical prompt/output length
   - Compare TTFT, `totalDuration`, and per-phase timestamps
   - Check whether the bug appears only with large prompts, speculative decoding, or certain sampling configs
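
For steps 4 and 5, a small audit sketch that recomputes both phase speeds from a logged entry and flags implausible values. The field names, the JSON-lines format, the log path, and seconds as the time unit are all assumptions to be replaced once the real schema is known:

```python
import json

def audit(entry: dict) -> None:
    """Recompute per-phase speeds from one log entry and flag implausible values."""
    ttft = entry["timeToFirstToken"]    # assumed field name, assumed seconds
    total = entry["totalDuration"]      # assumed field name, assumed seconds
    gen_window = total - ttft
    prefill_tps = entry["inputTokens"] / ttft
    print(f"recomputed prefill:    {prefill_tps:8.1f} tok/s  (logged: {entry.get('prefillTokensPerSec')})")
    if gen_window <= 0:
        print("  -> suspect: non-positive generation window")
        return
    generation_tps = entry["outputTokens"] / gen_window
    print(f"recomputed generation: {generation_tps:8.1f} tok/s  (logged: {entry.get('generationTokensPerSec')})")
    if generation_tps > 500:
        print("  -> suspect: generation speed above any plausible value for this hardware")

# Assumed log location and one-JSON-object-per-line format.
with open(".pi/llm-metrics.log") as f:
    for line in f:
        if line.strip():
            audit(json.loads(line))
```
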
## Phase 3: Fix the Implementation
6. **Correct phase boundaries**
   - If using callbacks: ensure generation start = the TTFT timestamp and generation end = the last-token callback or an explicit `done` event
   - If using wall-clock timestamps: add a small buffer after the last token to account for async flushes
   - Add validation: reject generation speeds above 500 tok/s as a sanity check
7. **Fix label assignment**
   - Ensure `prefillTokensPerSec` = `inputTokens / TTFT`
   - Ensure `generationTokensPerSec` = `outputTokens / (totalDuration - TTFT)`
   - Add explicit phase logging to debug output
8. **Add telemetry**
   - Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
   - Log per-phase token counts to catch mismatches
   - Write to `.pi/llm-metrics.log` with a consistent schema (see the sketch after this list)
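
One way steps 6–8 could fit together, sketched under the assumption that the extension can observe a first-token event and a last-token (or `done`) event, with timestamps as monotonic-clock seconds. The helper name `build_metrics` and the exact structure are illustrative, not the extension's existing API:

```python
import json

MAX_PLAUSIBLE_GENERATION_TPS = 500  # sanity ceiling from step 6

def build_metrics(input_tokens, output_tokens,
                  total_start, first_token_ts, last_token_ts, total_end):
    """Derive per-phase metrics from explicit raw timestamps (monotonic seconds)."""
    prefill_s = first_token_ts - total_start       # prefill ends at the first token
    generation_s = last_token_ts - first_token_ts  # generation ends at the last token / done event
    metrics = {
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "rawTimestamps": {
            "prefill_start": total_start, "prefill_end": first_token_ts,
            "gen_start": first_token_ts, "gen_end": last_token_ts,
            "total_start": total_start, "total_end": total_end,
        },
    }
    # Sanity checks: never silently report negative phases or implausible speeds.
    if prefill_s <= 0 or generation_s <= 0:
        metrics["suspect"] = "non-positive phase duration"
        return metrics
    metrics["prefillTokensPerSec"] = input_tokens / prefill_s
    metrics["generationTokensPerSec"] = output_tokens / generation_s
    if metrics["generationTokensPerSec"] > MAX_PLAUSIBLE_GENERATION_TPS:
        metrics["suspect"] = "generation speed above sanity ceiling"
    return metrics

def append_log(metrics, path=".pi/llm-metrics.log"):
    """Append one JSON object per line so entries stay machine-parseable."""
    with open(path, "a") as f:
        f.write(json.dumps(metrics) + "\n")
```
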
## Phase 4: Verify & Deploy
9. **Test cases**
   - Small prompt + short output (baseline)
   - Large prompt + long output (original failure case)
   - Speculative decoding run (if supported)
   - Early termination / stop-token edge case
10. **Validate output**
   - Generation speed should be 40–100 tok/s for this model/hardware
   - Prefill speed should be 50–200 tok/s (prefill is compute-parallel)
   - TTFT should closely match the prefill duration
   - No negative phase durations
11. **Update schema & docs**
   - Add a `rawTimestamps` field to log entries for debugging
   - Document the phase definitions in the extension README
   - Add unit tests for the metric calculation logic (see the test sketch after this list)
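
A test sketch for the calculation logic and the validation ranges in steps 10–11, written against the hypothetical `build_metrics` helper from the Phase 3 sketch; the import path `llm_metrics` is a placeholder for wherever the extension's calculation code actually lives:

```python
import unittest

from llm_metrics import build_metrics  # placeholder module name

class MetricCalculationTests(unittest.TestCase):
    def test_phases_split_at_first_token(self):
        m = build_metrics(input_tokens=800, output_tokens=350,
                          total_start=0.0, first_token_ts=5.0,
                          last_token_ts=11.0, total_end=11.0)
        self.assertAlmostEqual(m["prefillTokensPerSec"], 160.0)        # 800 / 5.0
        self.assertAlmostEqual(m["generationTokensPerSec"], 350 / 6.0)

    def test_implausible_generation_speed_is_flagged(self):
        # Split captured near the end of the call: the generation window is a sliver.
        m = build_metrics(input_tokens=800, output_tokens=350,
                          total_start=0.0, first_token_ts=10.9,
                          last_token_ts=11.0, total_end=11.0)
        self.assertIn("suspect", m)

    def test_negative_phase_duration_is_flagged(self):
        # A first-token timestamp after the last-token timestamp must never yield a speed.
        m = build_metrics(input_tokens=800, output_tokens=350,
                          total_start=0.0, first_token_ts=11.5,
                          last_token_ts=11.0, total_end=11.0)
        self.assertIn("suspect", m)

if __name__ == "__main__":
    unittest.main()
```
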
## Deliverables

- [ ] Extension source located & data flow mapped
- [ ] Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- [ ] Fix implemented with sanity checks
- [ ] Test suite covering edge cases
- [ ] Log schema updated with raw timestamps
- [ ] PR or patch ready for review

## Questions to Answer During Analysis

- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?