monotonic-pi-extensions/plans/metrics-check.md


Plan: Analyze & Fix llm-metrics Extension Timing Bug

Problem Statement

The extension reports generation speed as ~2,400–8,000 tok/s (physically impossible) while the reported prefill speed is ~70 tok/s. The math is internally consistent, but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).

Phase 1: Locate & Map the Extension

  1. Find the source code
    • Search ~/.pi/extensions/, ~/.pi/tools/, and the pi-coding-agent package for files matching llm, metric, performance, benchmark
    • Check ~/.pi/config or project .pi/config for extension/tool registration
    • Look for custom tool definitions in extensions/, tools/, or skills/ directories
  2. Identify the provider integration
    • The log shows "provider":"llama.cpp" — find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
    • Map the data flow: raw llama.cpp output → extension parsing → JSON log writing
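
To sanity-check the parsing end of that data flow, the llama.cpp server's own timings block is a useful ground truth: it reports prompt (prefill) and predicted (generation) token counts and durations per request. A minimal sketch, assuming the extension talks to a llama.cpp HTTP server on localhost and that the build in use exposes the standard timings fields; the URL, n_predict value, and field names are assumptions to verify against the actual integration.

```typescript
// Ground-truth check: pull llama.cpp server-side timings for one request and
// compare them with the numbers the extension derives on its own.
interface LlamaTimings {
  prompt_n: number;             // prompt tokens processed (prefill)
  prompt_ms: number;            // prefill duration in milliseconds
  prompt_per_second: number;    // server-computed prefill tok/s
  predicted_n: number;          // generated tokens
  predicted_ms: number;         // generation duration in milliseconds
  predicted_per_second: number; // server-computed generation tok/s
}

async function fetchServerTimings(prompt: string): Promise<LlamaTimings> {
  // Assumed endpoint; adjust host/port to the actual llama.cpp server config.
  const res = await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 128, stream: false }),
  });
  const body = (await res.json()) as { timings: LlamaTimings };
  return body.timings;
}
```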

Phase 2: Diagnose the Timing Bug

  1. Trace phase boundary detection
    • Find how the extension defines "prefill" vs "generation" start/end times
    • Check if it uses:
      • timeToFirstToken (TTFT) as the split point
      • llama.cpp callback hooks (completion_token_callback, prompt_token_callback)
      • Wall-clock timestamps around token streaming
  2. Verify the calculation (see the worked example after this list)
    • Confirm the formula: generationTok/s = outputTokens / (totalDuration - TTFT)
    • Check if totalDuration includes only generation, or the full call
    • Look for race conditions: async callbacks firing out of order, or generation end timestamp captured before all tokens are flushed
  3. Reproduce the anomaly
    • Run the same model with identical prompt/output length
    • Compare TTFT, totalDuration, and per-phase timestamps
    • Check if the bug appears only with large prompts, speculative decoding, or certain sampling configs
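
Worked example for item 2 above, with made-up numbers in the observed ballpark: if the generation window collapses to the async-flush gap before the done event, the reported speed explodes while every other figure in the log stays self-consistent.

```typescript
// Hypothetical numbers, chosen only to illustrate the suspected bug.
const outputTokens = 400;
const ttftSec = 5.0;   // time to first token (prefill ends here)
const totalSec = 12.0; // full request duration

// Correct: generation window = total duration minus TTFT.
const genCorrect = outputTokens / (totalSec - ttftSec); // 400 / 7.0 ≈ 57 tok/s

// Buggy: generation "start" is captured at the last token callback, so the
// window collapses to the async flush gap before the done event fires.
const flushGapSec = 0.05;
const genBuggy = outputTokens / flushGapSec;            // 400 / 0.05 = 8,000 tok/s

console.log({ genCorrect, genBuggy });
```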

Phase 3: Fix the Implementation

  1. Correct phase boundaries
    • If using callbacks: ensure generation start = TTFT timestamp, generation end = last token callback or explicit done event
    • If using wall-clock timestamps: add a small buffer after the last token to account for async flush
    • Add validation: reject generation speeds > 500 tok/s (sanity check)
  2. Fix label assignment (see the sketch after this list)
    • Ensure prefillTokensPerSec = inputTokens / TTFT
    • Ensure generationTokensPerSec = outputTokens / (totalDuration - TTFT)
    • Add explicit phase logging to debug output
  3. Add telemetry
    • Log raw timestamps: prefill_start, prefill_end, gen_start, gen_end, total_start, total_end
    • Log per-phase token counts to catch mismatches
    • Write to .pi/llm-metrics.log with consistent schema
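
A minimal sketch of the corrected calculation and raw-timestamp logging from items 1–3. The RawTimestamps field names, the 500 tok/s ceiling, and the .pi/llm-metrics.log path come from this plan; the function and type names are placeholders until the real extension code is located.

```typescript
import { appendFileSync } from "node:fs";

// Raw per-phase timestamps in milliseconds, named as in item 3 above.
export interface RawTimestamps {
  total_start: number;
  prefill_start: number;
  prefill_end: number; // TTFT point; generation starts here
  gen_start: number;
  gen_end: number;     // last token callback or explicit done event
  total_end: number;
}

export interface MetricsEntry {
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  suspect: boolean; // true when a sanity check trips
  rawTimestamps: RawTimestamps;
}

const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500; // sanity ceiling from item 1

export function computeMetrics(
  inputTokens: number,
  outputTokens: number,
  t: RawTimestamps,
): MetricsEntry {
  const prefillSec = (t.prefill_end - t.prefill_start) / 1000;
  const genSec = (t.gen_end - t.gen_start) / 1000;

  const prefillTokensPerSec = inputTokens / prefillSec;
  const generationTokensPerSec = outputTokens / genSec;

  // Sanity checks: no non-positive phase durations, no impossible speeds.
  const suspect =
    prefillSec <= 0 ||
    genSec <= 0 ||
    generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC;

  return { prefillTokensPerSec, generationTokensPerSec, suspect, rawTimestamps: t };
}

// Append one JSON line per request so raw timestamps survive for debugging.
export function logMetrics(entry: MetricsEntry): void {
  appendFileSync(".pi/llm-metrics.log", JSON.stringify(entry) + "\n");
}
```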

Phase 4: Verify & Deploy

  1. Test cases
    • Small prompt + short output (baseline)
    • Large prompt + long output (original failure case)
    • Speculative decoding run (if supported)
    • Early termination / stop token edge case
  2. Validate output
    • Generation speed should be 40–100 tok/s for this model/hardware
    • Prefill speed should be 50–200 tok/s (parallel compute)
    • TTFT should match prefill duration
    • No negative phase durations
  3. Update schema & docs
    • Add rawTimestamps field to log entries for debugging
    • Document phase definitions in extension README
    • Add unit tests for metric calculation logic
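
A sketch of the unit tests from the last bullet, written against the hypothetical computeMetrics above using Node's built-in test runner; adapt the import path and framework to whatever the extension already uses.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";
// Hypothetical module path; point this at wherever computeMetrics actually lives.
import { computeMetrics } from "./metrics";

test("generation speed uses the post-TTFT window", () => {
  const t = {
    total_start: 0, prefill_start: 0, prefill_end: 5_000,
    gen_start: 5_000, gen_end: 12_000, total_end: 12_000,
  };
  const m = computeMetrics(500, 400, t);
  assert.ok(Math.abs(m.generationTokensPerSec - 400 / 7) < 1e-6);
  assert.equal(m.suspect, false);
});

test("a collapsed generation window trips the sanity check", () => {
  const t = {
    total_start: 0, prefill_start: 0, prefill_end: 5_000,
    gen_start: 11_950, gen_end: 12_000, total_end: 12_000, // 50 ms window
  };
  const m = computeMetrics(500, 400, t);
  assert.ok(m.generationTokensPerSec > 500);
  assert.equal(m.suspect, true);
});
```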

Deliverables

  • Extension source located & data flow mapped
  • Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
  • Fix implemented with sanity checks
  • Test suite covering edge cases
  • Log schema updated with raw timestamps
  • PR or patch ready for review

Questions to Answer During Analysis

  • Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
  • Are callbacks synchronous or async?
  • Is there an explicit done/end event, or does the extension infer completion when the token stream ends?
  • Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?