Willem van den Ende 98e18643c5 pi-performance: Make Time to first token more accurate.
Summary of changes:

 ┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
 │ Step │ Change                                                           │ Result   │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 1    │ Removed duplicate llm-performance-metrics.test.ts                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 2    │ Added rawTimestamps assertions to toLogEntry test                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 3    │ Added rawTimestamps assertions to single-turn aggregate test     │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 4    │ Added rawTimestamps assertions to multi-turn aggregate test      │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 5    │ Added negative TTFT filtering test                               │ 15 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 6    │ Added "first turn missing TTFT, later turns have it" test        │ 16 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 7    │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
 └──────┴──────────────────────────────────────────────────────────────────┴──────────┘

This is what it looks like now when I run `pi`:
 📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill: 15,460 tokens @ 20104.0 tok/s
   Generation: 12,179 tokens @ 52.6 tok/s
   Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
   TTFT: 769ms
   Turns: 36
2026-04-28 10:52:00 +01:00
..

pi-llm-performance

Pi coding agent extension that captures and displays LLM inference performance metrics.

Why

Understanding model performance helps you:

  • Compare models — measure throughput differences between providers and model sizes
  • Debug slowdowns — spot when prefill or generation degrades unexpectedly
  • Validate hardware — confirm your setup delivers expected token throughput
  • Tune parameters — evaluate the impact of speculative decoding, context window size, etc.

What it measures

 ┌──────────────────┬─────────────────────────────────────────────────────────────────┐
 │ Metric           │ Description                                                     │
 ├──────────────────┼─────────────────────────────────────────────────────────────────┤
 │ TTFT             │ Time to first token (ms) — how long before you see output       │
 │ Prefill speed    │ Input tokens processed per second during the prefill phase      │
 │ Generation speed │ Output tokens generated per second during the generation phase  │
 │ Combined speed   │ Total tokens (input + output) per second across the full prompt │
 └──────────────────┴─────────────────────────────────────────────────────────────────┘
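Each speed is just a token count divided by the duration of its phase. A minimal sketch of that relationship (the function name is illustrative, not the extension's API; the real extension derives phase durations from its lifecycle events):

```typescript
// Illustrative only: how each throughput figure relates a token count to
// an elapsed time in milliseconds.
function tokensPerSec(tokens: number, durationMs: number): number {
  return tokens / (durationMs / 1000);
}

// e.g. the combined figure for 1,552 total tokens over 19.8 s:
const combined = tokensPerSec(1552, 19800); // ≈ 78.4 tok/s
```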

How it works

The extension hooks into pi's agent lifecycle events:

 ┌────────────────┬──────────────────────────────────────────────────────────────────┐
 │ Event          │ Behavior                                                         │
 ├────────────────┼──────────────────────────────────────────────────────────────────┤
 │ agent_start    │ Records provider/model, resets counters                          │
 │ turn_start     │ Marks turn boundary                                              │
 │ message_update │ Captures TTFT on first token delta                               │
 │ turn_end       │ Records token counts and turn duration                           │
 │ agent_end      │ Aggregates metrics, displays in TUI, logs to .pi/llm-metrics.log │
 └────────────────┴──────────────────────────────────────────────────────────────────┘
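A sketch of the bookkeeping this event sequence implies. Class and method names here are illustrative (the real extension registers handlers through pi's extension API), and timestamps are passed in explicitly to keep the example deterministic:

```typescript
type Turn = { durationMs: number; ttftMs: number | null };

class MetricsCollector {
  private promptStart = 0;
  private turnStart = 0;
  private turnFirstToken: number | null = null;
  private turns: Turn[] = [];

  agentStart(now: number): void {      // agent_start: reset counters
    this.promptStart = now;
    this.turns = [];
  }
  turnStarted(now: number): void {     // turn_start: mark turn boundary
    this.turnStart = now;
    this.turnFirstToken = null;
  }
  messageUpdate(now: number): void {   // message_update: only the first delta counts
    if (this.turnFirstToken === null) this.turnFirstToken = now;
  }
  turnEnd(now: number): void {         // turn_end: record turn duration and per-turn TTFT
    this.turns.push({
      durationMs: now - this.turnStart,
      ttftMs:
        this.turnFirstToken === null ? null : this.turnFirstToken - this.turnStart,
    });
  }
  agentEnd(now: number): { totalDurationMs: number; ttftMs: number | null } {
    // agent_end: aggregate. TTFT is taken from the first turn that actually
    // saw a token, so a first turn with no delta does not produce a bogus TTFT.
    const first = this.turns.find((t) => t.ttftMs !== null);
    return {
      totalDurationMs: now - this.promptStart,
      ttftMs: first ? first.ttftMs : null,
    };
  }
}

// One single-turn prompt: first delta arrives at 1250 ms, done at 19.8 s.
const c = new MetricsCollector();
c.agentStart(0);
c.turnStarted(0);
c.messageUpdate(1250);
c.messageUpdate(1300); // ignored: TTFT already captured for this turn
c.turnEnd(19800);
const agg = c.agentEnd(19800); // { totalDurationMs: 19800, ttftMs: 1250 }
```

Taking TTFT from the first turn that actually saw a token also covers the "first turn missing TTFT, later turns have it" case from the test list above.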

Output

TUI notification

After each prompt completes, a notification shows:

📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
  Prefill: 1,240 tokens @ 68.3 tok/s
  Generation: 312 tokens @ 89.9 tok/s
  Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
  TTFT: 1250ms

Status bar

The footer status shows combined throughput: 📊 78.4 tok/s

Log file

Each completed prompt appends a JSONL entry to .pi/llm-metrics.log:

{
  "timestamp": "2026-04-28T10:05:00.000Z",
  "provider": "llama.cpp",
  "model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
  "turnCount": 1,
  "inputTokens": 1240,
  "outputTokens": 312,
  "totalTokens": 1552,
  "prefillTokensPerSec": 68.3,
  "generationTokensPerSec": 89.9,
  "combinedTokensPerSec": 78.4,
  "totalDurationMs": 19800,
  "timeToFirstTokenMs": 1250,
  "rawTimestamps": {
    "ttftMs": 1250,
    "generationDurationMs": 18550,
    "turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
  }
}
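One JSON object per line is what keeps the log greppable and parseable a line at a time. A sketch of the append, assuming Node's fs module is available to the extension (function name illustrative, error handling omitted):

```typescript
import { appendFileSync } from "node:fs";

// Append one metrics entry as a single JSONL line.
function appendMetrics(path: string, entry: Record<string, unknown>): void {
  appendFileSync(path, JSON.stringify(entry) + "\n", "utf8");
}
```

Each line can then be read back independently with one `JSON.parse` per line.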

Sanity checks

The extension warns to the console if the measured generation speed exceeds 500 tok/s, a rate well beyond what local inference setups like the ones shown above deliver. Such a reading almost always indicates a timing bug, so the warning helps catch those bugs early.
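A sketch of such a check (the 500 tok/s threshold is from the text; the constant and function names are illustrative):

```typescript
// Flag generation speeds that are too fast to be real measurements.
const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500;

function sanityWarning(generationTokensPerSec: number): string | null {
  if (generationTokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    return (
      `generation speed ${generationTokensPerSec.toFixed(1)} tok/s exceeds ` +
      `${MAX_PLAUSIBLE_GEN_TOK_PER_SEC} tok/s; timing capture is likely wrong`
    );
  }
  return null; // plausible speed: stay quiet
}
```

Returning a message instead of calling `console.warn` directly keeps the check easy to unit-test, which matches the "warn on >500 tok/s, no warn otherwise" tests listed above.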

Development

This package lives in the pi-extensions monorepo.

pnpm install          # workspace setup
deno test             # run tests

License

MIT