Willem van den Ende 98e18643c5 pi-performance: Make Time to first token more accurate.
Summary of changes:

 ┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
 │ Step │ Change                                                           │ Result   │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 1    │ Removed duplicate llm-performance-metrics.test.ts                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 2    │ Added rawTimestamps assertions to toLogEntry test                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 3    │ Added rawTimestamps assertions to single-turn aggregate test     │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 4    │ Added rawTimestamps assertions to multi-turn aggregate test      │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 5    │ Added negative TTFT filtering test                               │ 15 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 6    │ Added "first turn missing TTFT, later turns have it" test        │ 16 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 7    │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
 └──────┴──────────────────────────────────────────────────────────────────┴──────────┘

This is what it looks like now when I run `pi`:
 📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill: 15,460 tokens @ 20104.0 tok/s
   Generation: 12,179 tokens @ 52.6 tok/s
   Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
   TTFT: 769ms
   Turns: 36
2026-04-28 10:52:00 +01:00


# pi-llm-performance
Pi coding agent extension that captures and displays LLM inference performance metrics.
## Why
Understanding model performance helps you:
- **Compare models** — measure throughput differences between providers and model sizes
- **Debug slowdowns** — spot when prefill or generation degrades unexpectedly
- **Validate hardware** — confirm your setup delivers expected token throughput
- **Tune parameters** — evaluate the impact of speculative decoding, context window size, etc.
## What it measures
| Metric | Description |
|--------|-------------|
| **TTFT** | Time to first token (ms) — how long before you see output |
| **Prefill speed** | Input tokens processed per second during the prefill phase |
| **Generation speed** | Output tokens generated per second during the generation phase |
| **Combined speed** | Total tokens (input + output) per second across the full prompt |
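These figures can be derived from token counts and wall-clock durations. A minimal sketch of that arithmetic, assuming TTFT approximates the prefill phase — the type and function names below are illustrative, not the extension's actual API:

```typescript
// Raw inputs: token counts from the provider's usage report and
// wall-clock durations measured by the extension. Illustrative only.
interface RawTiming {
  inputTokens: number;
  outputTokens: number;
  ttftMs: number;          // request start to first token delta
  totalDurationMs: number; // request start to final token
}

function computeMetrics(t: RawTiming) {
  // Everything after the first token is treated as generation time.
  const generationMs = t.totalDurationMs - t.ttftMs;
  return {
    // Prefill: input tokens processed before the first output token.
    prefillTokensPerSec: t.inputTokens / (t.ttftMs / 1000),
    // Generation: output tokens over the remaining time.
    generationTokensPerSec: t.outputTokens / (generationMs / 1000),
    // Combined: all tokens over the full request duration.
    combinedTokensPerSec:
      (t.inputTokens + t.outputTokens) / (t.totalDurationMs / 1000),
  };
}
```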
## How it works
The extension hooks into pi's agent lifecycle events:
| Event | Behavior |
|-------|----------|
| `agent_start` | Records provider/model, resets counters |
| `turn_start` | Marks turn boundary |
| `message_update` | Captures TTFT on first token delta |
| `turn_end` | Records token counts and turn duration |
| `agent_end` | Aggregates metrics, displays in TUI, logs to `.pi/llm-metrics.log` |
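The TTFT capture implied by the table — only the first `message_update` after `agent_start` sets the value — can be sketched as a small state machine. Nothing here is pi's real extension API; the class and method names are hypothetical:

```typescript
// Hypothetical sketch of the "first token delta wins" rule.
class TtftTracker {
  private promptStartMs: number | null = null;
  private ttftMs: number | null = null;

  onAgentStart(nowMs: number): void {
    // Reset counters for the new prompt.
    this.promptStartMs = nowMs;
    this.ttftMs = null;
  }

  onMessageUpdate(nowMs: number): void {
    // Only the first delta after agent_start records TTFT;
    // later deltas are ignored.
    if (this.promptStartMs !== null && this.ttftMs === null) {
      this.ttftMs = nowMs - this.promptStartMs;
    }
  }

  result(): number | null {
    return this.ttftMs;
  }
}
```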
## Output
### TUI notification
After each prompt completes, a notification shows:
```
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 1,240 tokens @ 68.3 tok/s
Generation: 312 tokens @ 89.9 tok/s
Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
TTFT: 1250ms
```
### Status bar
The footer status shows combined throughput: `📊 78.4 tok/s`
### Log file
Each prompt appends a JSONL entry to `.pi/llm-metrics.log` (pretty-printed here for readability; the log stores one object per line):
```json
{
  "timestamp": "2026-04-28T10:05:00.000Z",
  "provider": "llama.cpp",
  "model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
  "turnCount": 1,
  "inputTokens": 1240,
  "outputTokens": 312,
  "totalTokens": 1552,
  "prefillTokensPerSec": 68.3,
  "generationTokensPerSec": 89.9,
  "combinedTokensPerSec": 78.4,
  "totalDurationMs": 19800,
  "timeToFirstTokenMs": 1250,
  "rawTimestamps": {
    "ttftMs": 1250,
    "generationDurationMs": 18550,
    "turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
  }
}
```
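Because the log is line-delimited JSON, it is easy to post-process. A sketch of reading it back and averaging combined throughput — the field name follows the entry above, but the helper itself is illustrative:

```typescript
// Minimal JSONL consumer: parse each non-empty line and average
// the combinedTokensPerSec field across all entries.
interface MetricsEntry {
  combinedTokensPerSec: number;
  [key: string]: unknown;
}

function averageCombined(jsonl: string): number {
  const entries: MetricsEntry[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const sum = entries.reduce((acc, e) => acc + e.combinedTokensPerSec, 0);
  return sum / entries.length;
}
```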
## Sanity checks
The extension logs a console warning if generation speed exceeds 500 tok/s — implausibly fast for local inference and almost always a sign of a timing bug rather than a real measurement. This helps catch such bugs early.
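As a sketch, the rule reduces to a simple threshold check (the constant and function name are taken from the prose above, not from the extension's source):

```typescript
// Generation speeds above this are treated as measurement errors.
const MAX_PLAUSIBLE_TOK_PER_SEC = 500;

function shouldWarn(generationTokensPerSec: number): boolean {
  return generationTokensPerSec > MAX_PLAUSIBLE_TOK_PER_SEC;
}
```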
## Development
This package lives in the `pi-extensions` monorepo.
```bash
pnpm install # workspace setup
deno test # run tests
```
## License
MIT