Willem van den Ende 98e18643c5 pi-performance: Make Time to first token more accurate.
Summary of changes:

 ┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
 │ Step │ Change                                                           │ Result   │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 1    │ Removed duplicate llm-performance-metrics.test.ts                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 2    │ Added rawTimestamps assertions to toLogEntry test                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 3    │ Added rawTimestamps assertions to single-turn aggregate test     │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 4    │ Added rawTimestamps assertions to multi-turn aggregate test      │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 5    │ Added negative TTFT filtering test                               │ 15 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 6    │ Added "first turn missing TTFT, later turns have it" test        │ 16 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 7    │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
 └──────┴──────────────────────────────────────────────────────────────────┴──────────┘

This is what it looks like now when I run `pi`:
 📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill: 15,460 tokens @ 20104.0 tok/s
   Generation: 12,179 tokens @ 52.6 tok/s
   Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
   TTFT: 769ms
   Turns: 36
2026-04-28 10:52:00 +01:00


# pi-llm-performance
Pi coding agent extension that captures and displays LLM inference performance metrics.
## Why
Understanding model performance helps you:
- **Compare models** — measure throughput differences between providers and model sizes
- **Debug slowdowns** — spot when prefill or generation degrades unexpectedly
- **Validate hardware** — confirm your setup delivers expected token throughput
- **Tune parameters** — evaluate the impact of speculative decoding, context window size, etc.
## What it measures
| Metric | Description |
|--------|-------------|
| **TTFT** | Time to first token (ms) — how long before you see output |
| **Prefill speed** | Input tokens processed per second during the prefill phase |
| **Generation speed** | Output tokens generated per second during the generation phase |
| **Combined speed** | Total tokens (input + output) per second across the full prompt |
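These figures can be derived from token counts and wall-clock durations. A minimal sketch of that arithmetic, assuming TTFT approximates the prefill phase — the type and function names below are illustrative, not the extension's actual API:

```typescript
// Raw inputs: token counts from the provider's usage report and
// wall-clock durations measured by the extension. Illustrative only.
interface RawTiming {
  inputTokens: number;
  outputTokens: number;
  ttftMs: number;          // request start to first token delta
  totalDurationMs: number; // request start to final token
}

function computeMetrics(t: RawTiming) {
  // Everything after the first token is treated as generation time.
  const generationMs = t.totalDurationMs - t.ttftMs;
  return {
    // Prefill: input tokens processed before the first output token.
    prefillTokensPerSec: t.inputTokens / (t.ttftMs / 1000),
    // Generation: output tokens over the remaining time.
    generationTokensPerSec: t.outputTokens / (generationMs / 1000),
    // Combined: all tokens over the full request duration.
    combinedTokensPerSec:
      (t.inputTokens + t.outputTokens) / (t.totalDurationMs / 1000),
  };
}
```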
## How it works
The extension hooks into pi's agent lifecycle events:
| Event | Behavior |
|-------|----------|
| `agent_start` | Records provider/model, resets counters |
| `turn_start` | Marks turn boundary |
| `message_update` | Captures TTFT on first token delta |
| `turn_end` | Records token counts and turn duration |
| `agent_end` | Aggregates metrics, displays in TUI, logs to `.pi/llm-metrics.log` |
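The TTFT capture implied by the table — only the first `message_update` after `agent_start` sets the value — can be sketched as a small state machine. Nothing here is pi's real extension API; the class and method names are hypothetical:

```typescript
// Hypothetical sketch of the "first token delta wins" rule.
class TtftTracker {
  private promptStartMs: number | null = null;
  private ttftMs: number | null = null;

  onAgentStart(nowMs: number): void {
    // Reset counters for the new prompt.
    this.promptStartMs = nowMs;
    this.ttftMs = null;
  }

  onMessageUpdate(nowMs: number): void {
    // Only the first delta after agent_start records TTFT;
    // later deltas are ignored.
    if (this.promptStartMs !== null && this.ttftMs === null) {
      this.ttftMs = nowMs - this.promptStartMs;
    }
  }

  result(): number | null {
    return this.ttftMs;
  }
}
```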
## Output
### TUI notification
After each prompt completes, a notification shows:
```
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 1,240 tokens @ 68.3 tok/s
Generation: 312 tokens @ 89.9 tok/s
Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
TTFT: 1250ms
```
### Status bar
The footer status shows combined throughput: `📊 78.4 tok/s`
### Log file
Each prompt appends a JSONL entry to `.pi/llm-metrics.log` (pretty-printed here for readability; the log stores one object per line):
```json
{
  "timestamp": "2026-04-28T10:05:00.000Z",
  "provider": "llama.cpp",
  "model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
  "turnCount": 1,
  "inputTokens": 1240,
  "outputTokens": 312,
  "totalTokens": 1552,
  "prefillTokensPerSec": 68.3,
  "generationTokensPerSec": 89.9,
  "combinedTokensPerSec": 78.4,
  "totalDurationMs": 19800,
  "timeToFirstTokenMs": 1250,
  "rawTimestamps": {
    "ttftMs": 1250,
    "generationDurationMs": 18550,
    "turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
  }
}
```
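Because the log is line-delimited JSON, it is easy to post-process. A sketch of reading it back and averaging combined throughput — the field name follows the entry above, but the helper itself is illustrative:

```typescript
// Minimal JSONL consumer: parse each non-empty line and average
// the combinedTokensPerSec field across all entries.
interface MetricsEntry {
  combinedTokensPerSec: number;
  [key: string]: unknown;
}

function averageCombined(jsonl: string): number {
  const entries: MetricsEntry[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const sum = entries.reduce((acc, e) => acc + e.combinedTokensPerSec, 0);
  return sum / entries.length;
}
```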
## Sanity checks
The extension logs a console warning if generation speed exceeds 500 tok/s — implausibly fast for local inference and almost always a sign of a timing bug rather than a real measurement. This helps catch such bugs early.
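As a sketch, the rule reduces to a simple threshold check (the constant and function name are taken from the prose above, not from the extension's source):

```typescript
// Generation speeds above this are treated as measurement errors.
const MAX_PLAUSIBLE_TOK_PER_SEC = 500;

function shouldWarn(generationTokensPerSec: number): boolean {
  return generationTokensPerSec > MAX_PLAUSIBLE_TOK_PER_SEC;
}
```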
## Development
This package lives in the `pi-extensions` monorepo.
```bash
pnpm install # workspace setup
deno test # run tests
```
## License
MIT