Summary of changes:
┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
│ Step │ Change │ Result │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 1 │ Removed duplicate llm-performance-metrics.test.ts │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 2 │ Added rawTimestamps assertions to toLogEntry test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 3 │ Added rawTimestamps assertions to single-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 4 │ Added rawTimestamps assertions to multi-turn aggregate test │ 14 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 5 │ Added negative TTFT filtering test │ 15 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 6 │ Added "first turn missing TTFT, later turns have it" test │ 16 tests │
├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
│ 7 │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
└──────┴──────────────────────────────────────────────────────────────────┴──────────┘
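For illustration, the negative-TTFT filtering test added in step 5 looks roughly like this. The module path, the `aggregate()` signature, and the summary field are assumptions for the sketch, not the actual test code; the turn shape mirrors the `rawTimestamps.turns` entries in the log format documented below.

```typescript
// Sketch of the negative-TTFT filtering test (step 5). aggregate() and the
// module path are hypothetical; the turn shape mirrors rawTimestamps.turns.
import { assertEquals } from "jsr:@std/assert";
import { aggregate } from "./llm-performance-metrics.ts";

Deno.test("negative TTFT values are filtered out of aggregation", () => {
  const summary = aggregate([
    { turnId: "turn-0", durationMs: 1000, ttftMs: -5 }, // clock skew: dropped
    { turnId: "turn-1", durationMs: 1000, ttftMs: 120 },
  ]);
  assertEquals(summary.timeToFirstTokenMs, 120);
});
```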
This is what it looks like now when I run `pi`:
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 15,460 tokens @ 20104.0 tok/s
Generation: 12,179 tokens @ 52.6 tok/s
Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
TTFT: 769ms
Turns: 36
pi-llm-performance
A pi coding agent extension that captures and displays LLM inference performance metrics.
Why
Understanding model performance helps you:
- Compare models — measure throughput differences between providers and model sizes
- Debug slowdowns — spot when prefill or generation degrades unexpectedly
- Validate hardware — confirm your setup delivers expected token throughput
- Tune parameters — evaluate the impact of speculative decoding, context window size, etc.
What it measures
| Metric | Description |
|---|---|
| TTFT | Time to first token (ms) — how long before you see output |
| Prefill speed | Input tokens processed per second during the prefill phase |
| Generation speed | Output tokens generated per second during the generation phase |
| Combined speed | Total tokens (input + output) per second across the full prompt |
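In code, a single prompt's metrics can be modeled roughly as the interface below. The field names mirror the JSONL entry shown under "Log file" further down; the interface itself is an illustrative sketch, not the extension's exported type.

```typescript
// Illustrative shape of one metrics entry; names taken from the JSONL
// example in the "Log file" section. Not the extension's actual type.
interface TurnTimings {
  turnId: string;
  durationMs: number;
  ttftMs?: number; // may be absent if a turn produced no token delta
}

interface LlmMetricsEntry {
  timestamp: string; // ISO 8601
  provider: string;  // e.g. "llama.cpp"
  model: string;
  turnCount: number;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number; // inputTokens + outputTokens
  prefillTokensPerSec: number;
  generationTokensPerSec: number;
  combinedTokensPerSec: number; // totalTokens / (totalDurationMs / 1000)
  totalDurationMs: number;
  timeToFirstTokenMs: number;
  rawTimestamps: {
    ttftMs: number;
    generationDurationMs: number;
    turns: TurnTimings[];
  };
}
```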
How it works
The extension hooks into pi's agent lifecycle events:
| Event | Behavior |
|---|---|
| agent_start | Records provider/model, resets counters |
| turn_start | Marks turn boundary |
| message_update | Captures TTFT on first token delta |
| turn_end | Records token counts and turn duration |
| agent_end | Aggregates metrics, displays in TUI, logs to .pi/llm-metrics.log |
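In sketch form, the flow looks like this. `on(event, handler)` is a hypothetical subscription interface standing in for pi's real extension API, and the per-turn bookkeeping (turn_start/turn_end) is omitted for brevity.

```typescript
// Minimal sketch of the event flow above. on() is a hypothetical stand-in
// for pi's extension API; turn-level bookkeeping is omitted.
declare function on(event: string, handler: (payload: any) => void): void;

let promptStartedAt = 0;
let firstTokenAt: number | undefined;

on("agent_start", () => {
  // New prompt: reset counters.
  promptStartedAt = Date.now();
  firstTokenAt = undefined;
});

on("message_update", () => {
  // TTFT: time until the first token delta of the prompt.
  if (firstTokenAt === undefined) firstTokenAt = Date.now();
});

on("agent_end", ({ inputTokens, outputTokens }) => {
  const totalDurationMs = Date.now() - promptStartedAt;
  const ttftMs = (firstTokenAt ?? promptStartedAt) - promptStartedAt;
  const combinedTokensPerSec =
    (inputTokens + outputTokens) / (totalDurationMs / 1000);
  // ...then aggregate, show the TUI notification, and append a JSONL line
  // to .pi/llm-metrics.log.
  console.log({ ttftMs, combinedTokensPerSec });
});
```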
Output
TUI notification
After each prompt completes, a notification shows:
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 1,240 tokens @ 68.3 tok/s
Generation: 312 tokens @ 89.9 tok/s
Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
TTFT: 1250ms
Status bar
The footer status shows combined throughput: 📊 78.4 tok/s
Log file
Each completed prompt appends a JSONL entry to .pi/llm-metrics.log:
{
"timestamp": "2026-04-28T10:05:00.000Z",
"provider": "llama.cpp",
"model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
"turnCount": 1,
"inputTokens": 1240,
"outputTokens": 312,
"totalTokens": 1552,
"prefillTokensPerSec": 68.3,
"generationTokensPerSec": 89.9,
"combinedTokensPerSec": 78.4,
"totalDurationMs": 19800,
"timeToFirstTokenMs": 1250,
"rawTimestamps": {
"ttftMs": 1250,
"generationDurationMs": 18550,
"turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
}
}
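Because each line is standalone JSON, the log is easy to post-process. Here is a quick sketch that averages combined throughput across all logged prompts, using node:fs (which also works under recent Deno); the log path is assumed to be relative to the project root.

```typescript
// Average combined throughput across all entries in .pi/llm-metrics.log.
import { readFileSync } from "node:fs";

const entries = readFileSync(".pi/llm-metrics.log", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

if (entries.length > 0) {
  const avg =
    entries.reduce((sum, e) => sum + e.combinedTokensPerSec, 0) /
    entries.length;
  console.log(`${entries.length} prompts, avg ${avg.toFixed(1)} tok/s`);
}
```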
Sanity checks
The extension logs a warning to the console if measured generation speed exceeds 500 tok/s, a rate implausibly high for typical local inference and almost always a sign of a timing bug rather than real throughput. This helps catch instrumentation errors early.
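The check itself is simple; a sketch is below. The 500 tok/s threshold comes from this README, while the function name is illustrative.

```typescript
// Warn when measured generation speed is implausibly high, which almost
// always means the timestamps are wrong rather than the model being fast.
const MAX_PLAUSIBLE_GEN_TOK_PER_SEC = 500;

function checkGenerationSpeed(tokensPerSec: number): void {
  if (tokensPerSec > MAX_PLAUSIBLE_GEN_TOK_PER_SEC) {
    console.warn(
      `llm-performance: generation speed ${tokensPerSec.toFixed(1)} tok/s ` +
        `exceeds ${MAX_PLAUSIBLE_GEN_TOK_PER_SEC} tok/s; check timing logic`,
    );
  }
}
```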
Development
This package lives in the pi-extensions monorepo.
pnpm install # workspace setup
deno test # run tests
License
MIT