Summary of changes:
┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
│ Step │ Change │ Result │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 1 │ Removed duplicate llm-performance-metrics.test.ts │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 2 │ Added rawTimestamps assertions to toLogEntry test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 3 │ Added rawTimestamps assertions to single-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 4 │ Added rawTimestamps assertions to multi-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 5 │ Added negative TTFT filtering test │ 15 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 6 │ Added "first turn missing TTFT, later turns have it" test │ 16 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 7 │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
└──────┴──────────────────────────────────────────────────────────────────┴──────────┘
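For illustration, the negative-TTFT filtering test added in step 5 looks roughly like this. The module path, the `aggregate()` signature, and the summary field are assumptions for the sketch, not the actual test code; the turn shape mirrors the `rawTimestamps.turns` entries in the log format documented below.

```typescript
// Sketch of the negative-TTFT filtering test (step 5). aggregate() and the
// module path are hypothetical; the turn shape mirrors rawTimestamps.turns.
import { assertEquals } from "jsr:@std/assert";
import { aggregate } from "./llm-performance-metrics.ts";

Deno.test("negative TTFT values are filtered out of aggregation", () => {
  const summary = aggregate([
    { turnId: "turn-0", durationMs: 1000, ttftMs: -5 }, // clock skew: dropped
    { turnId: "turn-1", durationMs: 1000, ttftMs: 120 },
  ]);
  assertEquals(summary.timeToFirstTokenMs, 120);
});
```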
This is what it looks like now when I run `pi`:
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 15,460 tokens @ 20104.0 tok/s
Generation: 12,179 tokens @ 52.6 tok/s
Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
TTFT: 769ms
Turns: 36
pi-llm-performance
A pi coding agent extension that captures and displays LLM inference performance metrics.
Why
Understanding model performance helps you:
- Compare models — measure throughput differences between providers and model sizes
- Debug slowdowns — spot when prefill or generation degrades unexpectedly
- Validate hardware — confirm your setup delivers expected token throughput
- Tune parameters — evaluate the impact of speculative decoding, context window size, etc.
What it measures
| Metric | Description |
|---|---|
| TTFT | Time to first token (ms) — how long before you see output |
| Prefill speed | Input tokens processed per second during the prefill phase |
| Generation speed | Output tokens generated per second during the generation phase |
| Combined speed | Total tokens (input + output) per second across the full prompt |
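In code, a single prompt's metrics can be modeled roughly as the interface below. The field names mirror the JSONL entry shown under "Log file" further down; the interface itself is an illustrative sketch, not the extension's exported type.

```typescript
// Illustrative shape of one metrics entry; names taken from the JSONL
// example in the "Log file" section. Not the extension's actual type.
interface TurnTimings {
  turnId: string;
  durationMs: number;
  ttftMs?: number; // may be absent if a turn produced no token delta
}

interface LlmMetricsEntry {
  timestamp: string; // ISO 8601
  provider: string;  // e.g. "llama.cpp"
  model: string;
  turnCount: number;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number; // inputTokens + outputTokens
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  combinedTokensPerSec: number; // totalTokens / (totalDurationMs / 1000)
  totalDurationMs: number;
  timeToFirstTokenMs: number;
  rawTimestamps: {
    ttftMs: number;
    generationDurationMs: number;
    turns: TurnTimings[];
  };
}
```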
How it works
The extension hooks into pi's agent lifecycle events:
| Event | Behavior |
|---|---|
| agent_start | Records provider/model, resets counters |
| turn_start | Marks turn boundary |
| message_update | Captures TTFT on first token delta |
| turn_end | Records token counts and turn duration |
| agent_end | Aggregates metrics, displays in TUI, logs to .pi/llm-metrics.log |
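In sketch form, the flow looks like this. `on(event, handler)` is a hypothetical subscription interface standing in for pi's real extension API, and the per-turn bookkeeping (turn_start/turn_end) is omitted for brevity.

```typescript
// Minimal sketch of the event flow above. on() is a hypothetical stand-in
// for pi's extension API; turn-level bookkeeping is omitted.
declare function on(event: string, handler: (payload: any) => void): void;

let promptStartedAt = 0;
let firstTokenAt: number | undefined;

on("agent_start", () => {
  // New prompt: reset counters.
  promptStartedAt = Date.now();
  firstTokenAt = undefined;
});

on("message_update", () => {
  // TTFT: time until the first token delta of the prompt.
  if (firstTokenAt === undefined) firstTokenAt = Date.now();
});

on("agent_end", ({ inputTokens, outputTokens }) => {
  const totalDurationMs = Date.now() - promptStartedAt;
  const ttftMs = (firstTokenAt ?? promptStartedAt) - promptStartedAt;
  const combinedTokensPerSec =
    (inputTokens + outputTokens) / (totalDurationMs / 1000);
  // ...then aggregate, show the TUI notification, and append a JSONL line
  // to .pi/llm-metrics.log.
  console.log({ ttftMs, combinedTokensPerSec });
});
```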
Output
TUI notification
After each prompt completes, a notification shows:
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 1,240 tokens @ 68.3 tok/s
Generation: 312 tokens @ 89.9 tok/s
Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
TTFT: 1250ms
Status bar
The footer status shows combined throughput: 📊 78.4 tok/s
Log file
Each completed prompt appends a JSONL entry to .pi/llm-metrics.log:
{
"timestamp": "2026-04-28T10:05:00.000Z",
"provider": "llama.cpp",
"model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
"turnCount": 1,
"inputTokens": 1240,
"outputTokens": 312,
"totalTokens": 1552,
"prefillTokensPerSec": 68.3,
"generationTokensPerSec": 89.9,
"combinedTokensPerSec": 78.4,
"totalDurationMs": 19800,
"timeToFirstTokenMs": 1250,
"rawTimestamps": {
"ttftMs": 1250,
"generationDurationMs": 18550,
"turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
}
}
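Because each line is standalone JSON, the log is easy to post-process. Here is a quick sketch that averages combined throughput across all logged prompts, using node:fs (which also works under recent Deno); the log path is assumed to be relative to the project root.

```typescript
// Average combined throughput across all entries in .pi/llm-metrics.log.
import { readFileSync } from "node:fs";

const entries = readFileSync(".pi/llm-metrics.log", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

if (entries.length > 0) {
  const avg =
    entries.reduce((sum, e) => sum + e.combinedTokensPerSec, 0) /
    entries.length;
  console.log(`${entries.length} prompts, avg ${avg.toFixed(1)} tok/s`);
}
```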
Sanity checks
The extension logs a warning to the console if measured generation speed exceeds 500 tok/s, a rate implausibly high for typical local inference and almost always a sign of a timing bug rather than real throughput. This helps catch instrumentation errors early.
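The check itself is simple; a sketch is below. The 500 tok/s threshold comes from this README, while the function name is illustrative.

```typescript
// Warn when measured generation speed is implausibly high, which almost
// always means the timestamps are wrong rather than the model being fast.
const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500;

function checkGenerationSpeed(tokensPerSec: number): void {
  if (tokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    console.warn(
      `llm-performance: generation speed ${tokensPerSec.toFixed(1)} tok/s ` +
        `exceeds ${MAX_PLAUSIBLE_GEN_TOK_PER_SEC} tok/s; check timing logic`,
    );
  }
}
```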
Development
This package lives in the pi-extensions monorepo.
pnpm install # workspace setup
deno test # run tests
License
MIT