# pi-llm-performance

Pi coding agent extension that captures and displays LLM inference performance metrics.

## Why

Understanding model performance helps you:

- **Compare models** — measure throughput differences between providers and model sizes
- **Debug slowdowns** — spot when prefill or generation degrades unexpectedly
- **Validate hardware** — confirm your setup delivers expected token throughput
- **Tune parameters** — evaluate the impact of speculative decoding, context window size, etc.

## What it measures

| Metric | Description |
|--------|-------------|
| **TTFT** | Time to first token (ms) — how long before you see output |
| **Prefill speed** | Input tokens processed per second during the prefill phase |
| **Generation speed** | Output tokens generated per second during the generation phase |
| **Combined speed** | Total tokens (input + output) per second across the full prompt |

## How it works

The extension hooks into pi's agent lifecycle events:

| Event | Behavior |
|-------|----------|
| `agent_start` | Records provider/model, resets counters |
| `turn_start` | Marks turn boundary |
| `message_update` | Captures TTFT on first token delta |
| `turn_end` | Records token counts and turn duration |
| `agent_end` | Aggregates metrics, displays in TUI, logs to `.pi/llm-metrics.log` |

## Output

### TUI notification

After each prompt completes, a notification shows:

```
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill:    1,240 tokens @ 68.3 tok/s
   Generation: 312 tokens @ 89.9 tok/s
   Combined:   1,552 tokens @ 78.4 tok/s (19.8s total)
   TTFT:       1250ms
```

### Status bar

The footer status shows combined throughput: `📊 78.4 tok/s`

### Log file

Each prompt writes a JSONL entry to `.pi/llm-metrics.log`:

```json
{
  "timestamp": "2026-04-28T10:05:00.000Z",
  "provider": "llama.cpp",
  "model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
  "turnCount": 1,
  "inputTokens": 1240,
  "outputTokens": 312,
  "totalTokens": 1552,
  "prefillTokensPerSec": 68.3,
  "generationTokensPerSec": 89.9,
  "combinedTokensPerSec": 78.4,
  "totalDurationMs": 19800,
  "timeToFirstTokenMs": 1250,
  "rawTimestamps": {
    "ttftMs": 1250,
    "generationDurationMs": 18550,
    "turns": [{ "turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250 }]
  }
}
```

## Sanity checks

The extension logs a console warning if generation speed exceeds 500 tok/s — implausibly high for typical local inference setups and usually a sign of a timing bug. This helps catch measurement errors early.

## Development

This package lives in the `pi-extensions` monorepo.

```bash
pnpm install   # workspace setup
deno test     # run tests
```

## License

MIT
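As a usage example, entries in `.pi/llm-metrics.log` can be summarized offline. A minimal TypeScript sketch, assuming only the JSONL field names documented above; `summarizeMetrics` is a hypothetical helper, not part of the extension:

```typescript
// Subset of the logged fields this sketch actually reads.
interface MetricsEntry {
  totalTokens: number;
  combinedTokensPerSec: number;
}

// Parse JSONL text (one log entry per line) and report prompt count,
// total tokens, and the mean combined throughput across prompts.
function summarizeMetrics(
  jsonl: string,
): { prompts: number; totalTokens: number; avgCombinedTokPerSec: number } {
  const entries: MetricsEntry[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));

  const prompts = entries.length;
  const totalTokens = entries.reduce((sum, e) => sum + e.totalTokens, 0);
  const avgCombinedTokPerSec = prompts === 0
    ? 0
    : entries.reduce((sum, e) => sum + e.combinedTokensPerSec, 0) / prompts;

  return { prompts, totalTokens, avgCombinedTokPerSec };
}
```

Averaging per-prompt throughput weights every prompt equally; for a token-weighted figure you would instead divide total tokens by total duration.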