Summary of changes:
┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
│ Step │ Change │ Result │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 1 │ Removed duplicate llm-performance-metrics.test.ts │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 2 │ Added rawTimestamps assertions to toLogEntry test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 3 │ Added rawTimestamps assertions to single-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 4 │ Added rawTimestamps assertions to multi-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 5 │ Added negative TTFT filtering test │ 15 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 6 │ Added "first turn missing TTFT, later turns have it" test │ 16 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 7 │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
└──────┴──────────────────────────────────────────────────────────────────┴──────────┘
This is what it looks like now when I run `pi`:
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 15,460 tokens @ 20104.0 tok/s
Generation: 12,179 tokens @ 52.6 tok/s
Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
TTFT: 769ms
Turns: 36
# pi-llm-performance

Pi coding agent extension that captures and displays LLM inference performance metrics.

## Why

Understanding model performance helps you:

- **Compare models** — measure throughput differences between providers and model sizes
- **Debug slowdowns** — spot when prefill or generation degrades unexpectedly
- **Validate hardware** — confirm your setup delivers expected token throughput
- **Tune parameters** — evaluate the impact of speculative decoding, context window size, etc.

## What it measures

| Metric | Description |
|--------|-------------|
| **TTFT** | Time to first token (ms) — how long before you see output |
| **Prefill speed** | Input tokens processed per second during the prefill phase |
| **Generation speed** | Output tokens generated per second during the generation phase |
| **Combined speed** | Total tokens (input + output) per second across the full prompt |
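
The relationships between these metrics can be made concrete with a small calculation. This is an illustrative sketch only — the field and function names are assumptions, not the extension's actual internals:

```typescript
// Illustrative sketch of how the four metrics relate to raw timings.
// All names here are hypothetical, not the extension's real API.
interface RawTimings {
  inputTokens: number;      // tokens in the prompt (prefill phase)
  outputTokens: number;     // tokens generated
  ttftMs: number;           // time to first token
  totalDurationMs: number;  // full prompt wall-clock time
}

function computeMetrics(t: RawTimings) {
  // Generation covers the wall-clock time after the first token appears.
  const generationMs = t.totalDurationMs - t.ttftMs;
  return {
    ttftMs: t.ttftMs,
    // Prefill runs until the first token appears.
    prefillTokensPerSec: t.inputTokens / (t.ttftMs / 1000),
    generationTokensPerSec: t.outputTokens / (generationMs / 1000),
    // Combined throughput: all tokens over the full duration.
    combinedTokensPerSec:
      (t.inputTokens + t.outputTokens) / (t.totalDurationMs / 1000),
  };
}

const m = computeMetrics({
  inputTokens: 1240, outputTokens: 312, ttftMs: 1250, totalDurationMs: 19800,
});
console.log(m.combinedTokensPerSec.toFixed(1)); // "78.4"
```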
## How it works

The extension hooks into pi's agent lifecycle events:

| Event | Behavior |
|-------|----------|
| `agent_start` | Records provider/model, resets counters |
| `turn_start` | Marks turn boundary |
| `message_update` | Captures TTFT on first token delta |
| `turn_end` | Records token counts and turn duration |
| `agent_end` | Aggregates metrics, displays in TUI, logs to `.pi/llm-metrics.log` |
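
pi's actual extension API is not reproduced here; the sketch below models the same per-turn bookkeeping with a plain class so the event flow is visible (all names are assumptions):

```typescript
// Minimal sketch of the per-turn lifecycle bookkeeping described above.
// pi's real extension API is not shown; every name here is an assumption.
class MetricsTracker {
  private turnStartedAt = 0;
  private firstTokenAt: number | null = null;
  turns: { durationMs: number; ttftMs: number | null }[] = [];

  onTurnStart(now: number) {      // turn_start: mark the boundary
    this.turnStartedAt = now;
    this.firstTokenAt = null;
  }
  onMessageUpdate(now: number) {  // message_update: first delta sets TTFT
    if (this.firstTokenAt === null) this.firstTokenAt = now;
  }
  onTurnEnd(now: number) {        // turn_end: record duration and TTFT
    this.turns.push({
      durationMs: now - this.turnStartedAt,
      ttftMs:
        this.firstTokenAt === null ? null : this.firstTokenAt - this.turnStartedAt,
    });
  }
}

const tracker = new MetricsTracker();
tracker.onTurnStart(1000);
tracker.onMessageUpdate(1769); // first token arrives 769ms into the turn
tracker.onMessageUpdate(1800); // later deltas do not move TTFT
tracker.onTurnEnd(5000);
console.log(tracker.turns[0]); // { durationMs: 4000, ttftMs: 769 }
```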

## Output

### TUI notification

After each prompt completes, a notification shows:

```
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill:    1,240 tokens @ 68.3 tok/s
   Generation: 312 tokens @ 89.9 tok/s
   Combined:   1,552 tokens @ 78.4 tok/s (19.8s total)
   TTFT:       1250ms
```

### Status bar

The footer status shows combined throughput: `📊 78.4 tok/s`

### Log file

Each prompt appends a JSONL entry to `.pi/llm-metrics.log` (shown pretty-printed here for readability):

```json
{
  "timestamp": "2026-04-28T10:05:00.000Z",
  "provider": "llama.cpp",
  "model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
  "turnCount": 1,
  "inputTokens": 1240,
  "outputTokens": 312,
  "totalTokens": 1552,
  "prefillTokensPerSec": 68.3,
  "generationTokensPerSec": 89.9,
  "combinedTokensPerSec": 78.4,
  "totalDurationMs": 19800,
  "timeToFirstTokenMs": 1250,
  "rawTimestamps": {
    "ttftMs": 1250,
    "generationDurationMs": 18550,
    "turns": [{ "turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250 }]
  }
}
```
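
Because the log is JSONL (one entry per line), it is easy to post-process. A sketch, using field names from the entry above:

```typescript
// Sketch: aggregate generation speed across log entries read from
// .pi/llm-metrics.log. Field names match the example entry above.
interface LogEntry {
  model: string;
  generationTokensPerSec: number;
}

function averageGenerationSpeed(jsonl: string): number {
  const entries: LogEntry[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line));
  const total = entries.reduce((sum, e) => sum + e.generationTokensPerSec, 0);
  return entries.length ? total / entries.length : 0;
}

const log = [
  '{"model":"a.gguf","generationTokensPerSec":89.9}',
  '{"model":"a.gguf","generationTokensPerSec":52.6}',
].join("\n");
console.log(averageGenerationSpeed(log).toFixed(2)); // "71.25"
```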

## Sanity checks

The extension warns to the console if the computed generation speed exceeds 500 tok/s, a rate far beyond what a local inference setup sustains and almost always a sign of a timing bug (such as a zero or truncated duration) rather than real throughput. This surfaces instrumentation errors early.
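
The check itself amounts to a one-line guard. Sketched here with assumed names (only the 500 tok/s threshold comes from the extension):

```typescript
// Sketch of the sanity check. The 500 tok/s threshold is the extension's;
// the function and constant names are assumptions.
const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500;

function checkGenerationSpeed(tokPerSec: number): string | null {
  if (tokPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    return (
      `suspicious generation speed: ${tokPerSec.toFixed(1)} tok/s ` +
      `(> ${MAX_PLAUSIBLE_GEN_TOK_PER_SEC}); check timing instrumentation`
    );
  }
  return null; // plausible speed, no warning
}

console.log(checkGenerationSpeed(89.9)); // null
console.log(checkGenerationSpeed(1200)); // warning string
```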

## Development

This package lives in the `pi-extensions` monorepo.

```bash
pnpm install   # workspace setup
deno test      # run tests
```

## License

MIT