Compare commits


13 Commits

Author SHA1 Message Date
f6d7416b00 Publish pi-notifications and pi-llm-performance 2026-04-28 13:13:52 +01:00
93a5675f06 chore(pi-notifications): remove remaining console.log 2026-04-28 13:06:44 +01:00
113878e83f chore(pi-notifications): remove debug mode and console.log noise
- Remove PI_NOTIFICATION_DEBUG env var and steer signal logic
- Remove console.log on extension load
- Keep only: session_start notification + agent_end beep
- Clean README without debug references
2026-04-28 13:04:33 +01:00
7faabcb038 chore: update llm-metrics log 2026-04-28 13:03:38 +01:00
d7eabfffbb docs(pi-notifications): update README for audio-based alerts 2026-04-28 13:03:11 +01:00
823af3c486 feat(pi-notifications): switch to afplay audio instead of desktop notifications
- Uses afplay to play an audio file (default: Glass.aiff)
- Configurable via PI_NOTIFICATION_AUDIO env var
- Works from sandboxed context — no osascript needed
- test-notify.ts verifies audio playback standalone
- Synced to auto-discovery extension path
2026-04-28 13:02:18 +01:00
040513e1d6 feat(pi-notifications): use 'tell application' for notifications to suppress Show button
- Tell target app (default: Ghostty) to display notification instead of raw osascript
- This attributes notification to the app, avoiding the 'Show' button that opens Script Editor
- Configurable via PI_NOTIFICATION_APP env var
- test-notify.ts falls back to plain display notification if target app isn't running
- Synced to auto-discovery extension path
2026-04-28 12:56:02 +01:00
383cb46fe7 feat(pi-notifications): add standalone test-notify.ts and fix AppleScript sound bug
- Add packages/pi-notifications/src/test-notify.ts for isolated testing
  - Run with: node --input-type=module -e "import {createJiti} from ..." ./packages/pi-notifications/src/test-notify.ts
  - Decoupled from agent loop — verifies osascript in extension context
- Fix: 'default' is a reserved word in AppleScript, skip sound param when sound='default'
- Synced fix to auto-discovery extension path
2026-04-28 12:10:30 +01:00
45a13fd08c feat(pi-notifications): add PI_NOTIFICATION_DEBUG mode with visible steer signal
- Add PI_NOTIFICATION_DEBUG=true env var
- When enabled, calls ctx.ui.steer() instead of desktop notification
- Lets you verify trigger logic in the agent loop without actual notifications
- Synced to both monorepo and auto-discovery extension paths
2026-04-28 12:09:06 +01:00
ce4d6c5971 ignore llm-metrics log 2026-04-28 10:53:44 +01:00
98e18643c5 pi-performance: Make Time to first token more accurate.
Summary of changes:

 ┌──────┬──────────────────────────────────────────────────────────────────┬──────────┐
 │ Step │ Change                                                           │ Result   │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 1    │ Removed duplicate llm-performance-metrics.test.ts                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 2    │ Added rawTimestamps assertions to toLogEntry test                │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 3    │ Added rawTimestamps assertions to single-turn aggregate test     │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 4    │ Added rawTimestamps assertions to multi-turn aggregate test      │ 14 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 5    │ Added negative TTFT filtering test                               │ 15 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 6    │ Added "first turn missing TTFT, later turns have it" test        │ 16 tests │
 ├──────┼──────────────────────────────────────────────────────────────────┼──────────┤
 │ 7    │ Added sanity check tests (warn on >500 tok/s, no warn otherwise) │ 18 tests │
 └──────┴──────────────────────────────────────────────────────────────────┴──────────┘

This is what it looks like now when I run `pi`
 📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
   Prefill: 15,460 tokens @ 20104.0 tok/s
   Generation: 12,179 tokens @ 52.6 tok/s
   Combined: 27,639 tokens @ 118.9 tok/s (3.9m total)
   TTFT: 769ms
   Turns: 36
2026-04-28 10:52:00 +01:00
a38c76c65e move pi-llm-performance to monorepo, update README and add deno.json 2026-04-28 10:06:03 +01:00
0cf13ed54e move pi-llm-performance to this repo 2026-04-28 10:00:45 +01:00
19 changed files with 1660 additions and 1 deletion

.gitignore vendored
View File

@@ -1,3 +1,4 @@
node_modules/
.pnpm-store/
pnpm-lock.yaml
.pi/llm-metrics.log

.nvimlog Normal file
View File

README.md
View File

@@ -6,7 +6,7 @@ Experimental monorepo for [Pi coding agent](https://github.com/mariozechner/pi-c
### `pi-turn-limit`
Limits the number of turns (agent round-trips) in a Pi session. When the limit is reached, the user is prompted to continue or abort.
Limits the number of turns (agent round-trips) in a Pi session. When the limit is reached, the user is prompted to continue or abort. Use when you want to be in-the-loop, or when a model misbehaves and does too many tool calls. It is a good way to control the *Time To Next Interaction*.
- **Default limit:** 25 turns
- **Override:** set `PI_MAX_TURNS` environment variable to a positive integer
@ -15,6 +15,26 @@ Limits the number of turns (agent round-trips) in a Pi session. When the limit i
See [packages/pi-turn-limit/README.md](packages/pi-turn-limit/README.md) for details and the [Allium spec](packages/pi-turn-limit/turn-limit.allium).
### `pi-notifications`
Audio alerts via `afplay` when the agent finishes a turn. Run the agent and step away — you'll hear when input is needed. No multi-tasking required, but it gives you the breathing room to stretch, grab a coffee, or write something down without staring at the screen.
- **Config:** `PI_NOTIFICATIONS_ENABLED`, `PI_NOTIFICATION_AGENT_END`, `PI_NOTIFICATION_AUDIO` (defaults to macOS Glass sound)
- **Platform:** macOS (uses `afplay`)
See [packages/pi-notifications/README.md](packages/pi-notifications/README.md) for details.
### `pi-llm-performance`
Captures and displays LLM inference performance metrics (TTFT, prefill/generation throughput, combined speed) after each prompt. Lets you benchmark shiny new local inference server optimizations at a glance — no need to dig through different server logs.
- **Output:** TUI notification + status bar (`📊 tok/s`) + JSONL log at `.pi/llm-metrics.log`
- **Sanity checks:** Warns when generation speed exceeds 500 tok/s (physically impossible)
See [packages/pi-llm-performance/README.md](packages/pi-llm-performance/README.md) for details.
## Installation
This is an early release — not on npm yet. To install from source:

View File

@@ -1,5 +1,6 @@
[tools]
bun = "latest"
deno = "latest"
elixir = "latest"
erlang = "latest"
node = "24"

packages/pi-llm-performance/README.md Normal file
View File

@@ -0,0 +1,94 @@
# pi-llm-performance
Pi coding agent extension that captures and displays LLM inference performance metrics.
## Why
Understanding model performance helps you:
- **Compare models** — measure throughput differences between providers and model sizes
- **Debug slowdowns** — spot when prefill or generation degrades unexpectedly
- **Validate hardware** — confirm your setup delivers expected token throughput
- **Tune parameters** — evaluate the impact of speculative decoding, context window size, etc.
## What it measures
| Metric | Description |
|--------|-------------|
| **TTFT** | Time to first token (ms) — how long before you see output |
| **Prefill speed** | Input tokens processed per second during the prefill phase |
| **Generation speed** | Output tokens generated per second during the generation phase |
| **Combined speed** | Total tokens (input + output) per second across the full prompt |
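In code terms, the three rates relate like this (a condensed sketch of what `aggregatePromptMetrics` in `llm-metrics-core.ts`, shown later in this diff, computes; the shipped function also guards missing or zero TTFT):
```typescript
// Condensed rate definitions; the real core reports 0 for prefill and
// generation when TTFT is unavailable (combined is always computed).
function rates(inputTokens: number, outputTokens: number, totalDurationMs: number, ttftMs: number) {
  return {
    prefillTokPerSec: inputTokens / (ttftMs / 1000),
    generationTokPerSec: outputTokens / ((totalDurationMs - ttftMs) / 1000),
    combinedTokPerSec: (inputTokens + outputTokens) / (totalDurationMs / 1000),
  };
}
```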
## How it works
The extension hooks into pi's agent lifecycle events:
| Event | Behavior |
|-------|----------|
| `agent_start` | Records provider/model, resets counters |
| `turn_start` | Marks turn boundary |
| `message_update` | Captures TTFT on first token delta |
| `turn_end` | Records token counts and turn duration |
| `agent_end` | Aggregates metrics, displays in TUI, logs to `.pi/llm-metrics.log` |
## Output
### TUI notification
After each prompt completes, a notification shows:
```
📊 Performance: llama.cpp/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
Prefill: 1,240 tokens @ 68.3 tok/s
Generation: 312 tokens @ 89.9 tok/s
Combined: 1,552 tokens @ 78.4 tok/s (19.8s total)
TTFT: 1250ms
```
### Status bar
The footer status shows combined throughput: `📊 78.4 tok/s`
### Log file
Each prompt writes a JSONL entry to `.pi/llm-metrics.log`:
```json
{
"timestamp": "2026-04-28T10:05:00.000Z",
"provider": "llama.cpp",
"model": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
"turnCount": 1,
"inputTokens": 1240,
"outputTokens": 312,
"totalTokens": 1552,
"prefillTokensPerSec": 68.3,
"generationTokensPerSec": 89.9,
"combinedTokensPerSec": 78.4,
"totalDurationMs": 19800,
"timeToFirstTokenMs": 1250,
"rawTimestamps": {
"ttftMs": 1250,
"generationDurationMs": 18550,
"turns": [{"turnId": "turn-0", "durationMs": 19800, "ttftMs": 1250}]
}
}
```
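Because each line is standalone JSON, the log is easy to post-process. A minimal sketch (not part of the extension) that assumes the schema shown above:
```typescript
// Summarize .pi/llm-metrics.log; assumes the JSONL schema documented above.
import { readFileSync } from "node:fs";

const entries = readFileSync(".pi/llm-metrics.log", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const avg =
  entries.reduce((sum, e) => sum + e.combinedTokensPerSec, 0) / (entries.length || 1);
console.log(`${entries.length} prompts, avg combined: ${avg.toFixed(1)} tok/s`);
```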
## Sanity checks
The extension prints a console warning if generation speed exceeds 500 tok/s (implausibly fast for the local llama.cpp setups this extension targets). This helps catch timing bugs early.
## Development
This package lives in the `pi-extensions` monorepo.
```bash
pnpm install # workspace setup
deno test # run tests
```
## License
MIT

packages/pi-llm-performance/deno.json Normal file
View File

@@ -0,0 +1,8 @@
{
"imports": {
"@std/assert": "jsr:@std/assert@^1.0.0"
},
"tasks": {
"test": "deno test src/"
}
}

packages/pi-llm-performance/deno.lock generated Normal file
View File

@@ -0,0 +1,31 @@
{
"version": "5",
"specifiers": {
"jsr:@std/assert@*": "1.0.19",
"jsr:@std/assert@^1.0.19": "1.0.19",
"jsr:@std/internal@^1.0.12": "1.0.12",
"jsr:@std/testing@*": "1.0.18"
},
"jsr": {
"@std/assert@1.0.19": {
"integrity": "eaada96ee120cb980bc47e040f82814d786fe8162ecc53c91d8df60b8755991e",
"dependencies": [
"jsr:@std/internal"
]
},
"@std/internal@1.0.12": {
"integrity": "972a634fd5bc34b242024402972cd5143eac68d8dffaca5eaa4dba30ce17b027"
},
"@std/testing@1.0.18": {
"integrity": "d3152f57b11666bf6358d0e127c7e3488e91178b0c2d8fbf0793e1c53cd13cb1",
"dependencies": [
"jsr:@std/assert@^1.0.19"
]
}
},
"workspace": {
"dependencies": [
"jsr:@std/assert@1"
]
}
}

packages/pi-llm-performance/package.json Normal file
View File

@@ -0,0 +1,17 @@
{
"name": "pi-llm-performance",
"version": "0.1.0",
"description": "LLM performance metrics extension",
"type": "module",
"exports": {
".": "./src/llm-performance-metrics.ts"
},
"keywords": ["pi-package"],
"pi": {
"extensions": ["src/llm-performance-metrics.ts"]
},
"peerDependencies": {
"@mariozechner/pi-coding-agent": "*"
},
"license": "MIT"
}

View File

@@ -0,0 +1,558 @@
import {
calculateTurnMetrics,
aggregatePromptMetrics,
formatMetricsForDisplay,
toLogEntry,
type TurnMetrics,
type PromptMetrics,
} from "./llm-metrics-core.ts";
import { assertEquals, assertGreaterOrEqual, assertLessOrEqual } from "jsr:@std/assert";
Deno.test("calculateTurnMetrics - creates turn metrics object", () => {
const result = calculateTurnMetrics({
turnId: "turn-1",
inputTokens: 100,
outputTokens: 50,
durationMs: 2000,
timeToFirstTokenMs: 500,
});
assertEquals(result.turnId, "turn-1");
assertEquals(result.inputTokens, 100);
assertEquals(result.outputTokens, 50);
assertEquals(result.durationMs, 2000);
assertEquals(result.timeToFirstTokenMs, 500);
});
Deno.test("calculateTurnMetrics - handles missing timeToFirstToken", () => {
const result = calculateTurnMetrics({
turnId: "turn-1",
inputTokens: 100,
outputTokens: 50,
durationMs: 2000,
});
assertEquals(result.timeToFirstTokenMs, undefined);
});
Deno.test("aggregatePromptMetrics - aggregates single turn", () => {
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-1",
inputTokens: 1000,
outputTokens: 200,
durationMs: 5000,
timeToFirstTokenMs: 800,
},
];
const result = aggregatePromptMetrics({
provider: "anthropic",
model: "claude-sonnet-4",
turnMetrics,
});
assertEquals(result.provider, "anthropic");
assertEquals(result.model, "claude-sonnet-4");
assertEquals(result.turnCount, 1);
assertEquals(result.inputTokens, 1000);
assertEquals(result.outputTokens, 200);
assertEquals(result.totalTokens, 1200);
assertEquals(result.totalDurationMs, 5000);
assertEquals(result.timeToFirstTokenMs, 800);
// Tokens per second calculations
// prefill: 1000 input tokens / 0.8s TTFT = 1250 tok/s
assertEquals(result.prefillTokensPerSec, 1250);
// generation: 200 output tokens / 4.2s (5s - 0.8s) = 47.62 tok/s
assertGreaterOrEqual(result.generationTokensPerSec, 47.6);
assertLessOrEqual(result.generationTokensPerSec, 47.7);
// combined: 1200 total tokens / 5s = 240 tok/s
assertEquals(result.combinedTokensPerSec, 240);
// rawTimestamps
assertEquals(result.rawTimestamps?.ttftMs, 800);
assertEquals(result.rawTimestamps?.allTtftMs, [800]);
assertEquals(result.rawTimestamps?.generationDurationMs, 4200);
});
Deno.test("aggregatePromptMetrics - aggregates multiple turns", () => {
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-1",
inputTokens: 1000,
outputTokens: 200,
durationMs: 3000,
timeToFirstTokenMs: 800,
},
{
turnId: "turn-2",
inputTokens: 500,
outputTokens: 150,
durationMs: 2000,
},
{
turnId: "turn-3",
inputTokens: 300,
outputTokens: 100,
durationMs: 1500,
},
];
const result = aggregatePromptMetrics({
provider: "openai",
model: "gpt-4o",
turnMetrics,
});
assertEquals(result.turnCount, 3);
assertEquals(result.inputTokens, 1800); // 1000 + 500 + 300
assertEquals(result.outputTokens, 450); // 200 + 150 + 100
assertEquals(result.totalTokens, 2250);
assertEquals(result.totalDurationMs, 6500); // 3000 + 2000 + 1500
assertEquals(result.timeToFirstTokenMs, 800); // From first turn only
// Tokens per second: prefill uses TTFT (0.8s), generation uses (total - TTFT) = 5.7s
// prefill: 1800 / 0.8 = 2250 tok/s
assertEquals(result.prefillTokensPerSec, 2250);
// generation: 450 / 5.7 = 78.95 tok/s
assertGreaterOrEqual(result.generationTokensPerSec, 78.9);
assertLessOrEqual(result.generationTokensPerSec, 79.0);
// combined: 2250 / 6.5 = 346.15 tok/s
assertGreaterOrEqual(result.combinedTokensPerSec, 346.1);
assertLessOrEqual(result.combinedTokensPerSec, 346.2);
// rawTimestamps: only turn-1 has valid TTFT, turns 2+ have none
assertEquals(result.rawTimestamps?.ttftMs, 800);
assertEquals(result.rawTimestamps?.allTtftMs, [800]);
assertEquals(result.rawTimestamps?.generationDurationMs, 5700);
assertEquals(result.rawTimestamps?.turns.length, 3);
assertEquals(result.rawTimestamps?.turns[0].ttftMs, 800);
assertEquals(result.rawTimestamps?.turns[1].ttftMs, undefined);
assertEquals(result.rawTimestamps?.turns[2].ttftMs, undefined);
});
Deno.test("aggregatePromptMetrics - handles empty turn list", () => {
const result = aggregatePromptMetrics({
provider: "anthropic",
model: "claude-sonnet-4",
turnMetrics: [],
});
assertEquals(result.turnCount, 0);
assertEquals(result.inputTokens, 0);
assertEquals(result.outputTokens, 0);
assertEquals(result.totalTokens, 0);
assertEquals(result.prefillTokensPerSec, 0);
assertEquals(result.generationTokensPerSec, 0);
assertEquals(result.combinedTokensPerSec, 0);
assertEquals(result.totalDurationMs, 0);
assertEquals(result.timeToFirstTokenMs, undefined);
});
Deno.test("formatMetricsForDisplay - formats single turn metrics", () => {
const metrics: PromptMetrics = {
provider: "anthropic",
model: "claude-sonnet-4",
turnCount: 1,
inputTokens: 1250,
outputTokens: 342,
totalTokens: 1592,
prefillTokensPerSec: 482.1,
generationTokensPerSec: 18.3,
combinedTokensPerSec: 38.0,
totalDurationMs: 21600,
timeToFirstTokenMs: 850,
turns: [],
};
const display = formatMetricsForDisplay(metrics);
assertEquals(display.includes("anthropic/claude-sonnet-4"), true);
assertEquals(display.includes("1,250 tokens"), true);
assertEquals(display.includes("482.1 tok/s"), true);
assertEquals(display.includes("342 tokens"), true);
assertEquals(display.includes("18.3 tok/s"), true);
assertEquals(display.includes("1,592 tokens"), true);
assertEquals(display.includes("38.0 tok/s"), true);
assertEquals(display.includes("21.6s"), true);
assertEquals(display.includes("TTFT: 850ms"), true);
});
Deno.test("formatMetricsForDisplay - formats duration as minutes when over 60s", () => {
const metrics: PromptMetrics = {
provider: "openai",
model: "gpt-4o",
turnCount: 1,
inputTokens: 5000,
outputTokens: 1000,
totalTokens: 6000,
prefillTokensPerSec: 50,
generationTokensPerSec: 10,
combinedTokensPerSec: 60,
totalDurationMs: 120000, // 2 minutes
timeToFirstTokenMs: 1500,
turns: [],
};
const display = formatMetricsForDisplay(metrics);
assertEquals(display.includes("2.0m"), true);
});
Deno.test("formatMetricsForDisplay - omits turn count when single turn", () => {
const metrics: PromptMetrics = {
provider: "anthropic",
model: "claude-sonnet-4",
turnCount: 1,
inputTokens: 100,
outputTokens: 50,
totalTokens: 150,
prefillTokensPerSec: 20,
generationTokensPerSec: 10,
combinedTokensPerSec: 30,
totalDurationMs: 5000,
timeToFirstTokenMs: 500,
turns: [],
};
const display = formatMetricsForDisplay(metrics);
assertEquals(display.includes("Turns: 1"), false);
});
Deno.test("formatMetricsForDisplay - omits prefill/generation when TTFT is unavailable", () => {
const metrics: PromptMetrics = {
provider: "openai",
model: "gpt-4o",
turnCount: 1,
inputTokens: 1000,
outputTokens: 200,
totalTokens: 1200,
prefillTokensPerSec: 0,
generationTokensPerSec: 0,
combinedTokensPerSec: 240,
totalDurationMs: 5000,
timeToFirstTokenMs: undefined,
turns: [],
};
const display = formatMetricsForDisplay(metrics);
assertEquals(display.includes("Prefill:"), false);
assertEquals(display.includes("Generation:"), false);
assertEquals(display.includes("1,200 tokens"), true);
assertEquals(display.includes("240.0 tok/s"), true);
});
Deno.test("formatMetricsForDisplay - shows turn count when multiple turns", () => {
const metrics: PromptMetrics = {
provider: "anthropic",
model: "claude-sonnet-4",
turnCount: 3,
inputTokens: 100,
outputTokens: 50,
totalTokens: 150,
prefillTokensPerSec: 20,
generationTokensPerSec: 10,
combinedTokensPerSec: 30,
totalDurationMs: 5000,
timeToFirstTokenMs: 500,
turns: [],
};
const display = formatMetricsForDisplay(metrics);
assertEquals(display.includes("Turns: 3"), true);
});
Deno.test("aggregatePromptMetrics - uses first valid TTFT when turn-0 has none", () => {
// Edge case: turn-0 has no TTFT, turn-1 does. Should use turn-1's TTFT.
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-0",
inputTokens: 1000,
outputTokens: 200,
durationMs: 3000,
// No timeToFirstTokenMs
},
{
turnId: "turn-1",
inputTokens: 500,
outputTokens: 150,
durationMs: 2000,
timeToFirstTokenMs: 600,
},
];
const result = aggregatePromptMetrics({
provider: "llama.cpp",
model: "Qwen3.6-35B",
turnMetrics,
});
// First valid TTFT is from turn-1 (600ms)
assertEquals(result.rawTimestamps?.allTtftMs, [600]);
assertEquals(result.rawTimestamps?.ttftMs, 600);
// Generation duration = totalDuration - firstValidTTFT = 5000 - 600 = 4400
assertEquals(result.rawTimestamps?.generationDurationMs, 4400);
// prefill: 1500 / 0.6 = 2500
assertEquals(result.prefillTokensPerSec, 2500);
// generation: 350 / 4.4 = 79.55
assertGreaterOrEqual(result.generationTokensPerSec, 79.5);
assertLessOrEqual(result.generationTokensPerSec, 79.6);
});
Deno.test("aggregatePromptMetrics - filters out negative TTFT values", () => {
// Simulates the bug where turn-2 got TTFT=-20390 from the old global-firstToken code
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-0",
inputTokens: 1000,
outputTokens: 200,
durationMs: 3000,
timeToFirstTokenMs: 800,
},
{
turnId: "turn-1",
inputTokens: 500,
outputTokens: 150,
durationMs: 2000,
timeToFirstTokenMs: -5000, // Invalid: negative
},
];
const result = aggregatePromptMetrics({
provider: "llama.cpp",
model: "Qwen3.6-35B",
turnMetrics,
});
// Only turn-0's TTFT (800) should be used; turn-1's negative value is filtered
assertEquals(result.rawTimestamps?.allTtftMs, [800]);
assertEquals(result.rawTimestamps?.ttftMs, 800);
// Generation duration = totalDuration - firstTurnTTFT = 5000 - 800 = 4200
assertEquals(result.rawTimestamps?.generationDurationMs, 4200);
// prefill: 1500 / 0.8 = 1875
assertEquals(result.prefillTokensPerSec, 1875);
// generation: 350 / 4.2 = 83.33
assertGreaterOrEqual(result.generationTokensPerSec, 83.3);
assertLessOrEqual(result.generationTokensPerSec, 83.4);
});
Deno.test("toLogEntry - creates JSON-serializable log entry", () => {
const metrics: PromptMetrics = {
provider: "anthropic",
model: "claude-sonnet-4",
turnCount: 2,
inputTokens: 1250,
outputTokens: 342,
totalTokens: 1592,
prefillTokensPerSec: 482.12345,
generationTokensPerSec: 18.34567,
combinedTokensPerSec: 38.09876,
totalDurationMs: 21600,
timeToFirstTokenMs: 850,
rawTimestamps: {
ttftMs: 850,
allTtftMs: [850],
generationDurationMs: 20750,
turns: [],
},
turns: [],
};
const logEntry = toLogEntry(metrics);
assertEquals(logEntry.provider, "anthropic");
assertEquals(logEntry.model, "claude-sonnet-4");
assertEquals(logEntry.turnCount, 2);
assertEquals(logEntry.inputTokens, 1250);
assertEquals(logEntry.outputTokens, 342);
assertEquals(logEntry.totalTokens, 1592);
// Rounded to 2 decimal places
assertEquals(logEntry.prefillTokensPerSec, 482.12);
assertEquals(logEntry.generationTokensPerSec, 18.35);
assertEquals(logEntry.combinedTokensPerSec, 38.1);
assertEquals(logEntry.totalDurationMs, 21600);
assertEquals(logEntry.timeToFirstTokenMs, 850);
// Should have ISO timestamp
assertEquals(logEntry.timestamp.includes("T"), true);
assertEquals(logEntry.timestamp.includes("Z"), true);
// Should be JSON serializable
const json = JSON.stringify(logEntry);
assertEquals(json.length > 0, true);
const parsed = JSON.parse(json);
assertEquals(parsed.provider, "anthropic");
// rawTimestamps should be included
assertEquals(logEntry.rawTimestamps?.ttftMs, 850);
assertEquals(logEntry.rawTimestamps?.allTtftMs, [850]);
assertEquals(logEntry.rawTimestamps?.generationDurationMs, 20750);
assertEquals(logEntry.rawTimestamps?.turns.length, 0);
});
Deno.test("aggregatePromptMetrics - warns when generation speed is physically impossible", () => {
const originalWarn = console.warn;
let warnCall: string | undefined;
console.warn = (msg: string) => { warnCall = msg; };
try {
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-0",
inputTokens: 100,
outputTokens: 1000,
durationMs: 1000,
timeToFirstTokenMs: 100,
},
];
aggregatePromptMetrics({
provider: "llama.cpp",
model: "Qwen3.6-35B",
turnMetrics,
});
// generation: 1000 / 0.9 = 1111.11 tok/s > 500
assertGreaterOrEqual(warnCall, "");
assertEquals(warnCall!.includes("Suspicious generation speed"), true);
assertEquals(warnCall!.includes("1111.1 tok/s"), true);
assertEquals(warnCall!.includes("output=1000"), true);
} finally {
console.warn = originalWarn;
}
});
Deno.test("aggregatePromptMetrics - does not warn for normal speeds", () => {
const originalWarn = console.warn;
let warnCall: string | undefined;
console.warn = (msg: string) => { warnCall = msg; };
try {
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-0",
inputTokens: 1000,
outputTokens: 200,
durationMs: 5000,
timeToFirstTokenMs: 800,
},
];
aggregatePromptMetrics({
provider: "llama.cpp",
model: "Qwen3.6-35B",
turnMetrics,
});
assertEquals(warnCall, undefined);
} finally {
console.warn = originalWarn;
}
});
Deno.test("aggregatePromptMetrics - uses full duration when TTFT is undefined", () => {
const turnMetrics: TurnMetrics[] = [
{
turnId: "turn-1",
inputTokens: 1000,
outputTokens: 200,
durationMs: 5000,
// No timeToFirstTokenMs
},
];
const result = aggregatePromptMetrics({
provider: "openai",
model: "gpt-4o",
turnMetrics,
});
assertEquals(result.turnCount, 1);
assertEquals(result.inputTokens, 1000);
assertEquals(result.outputTokens, 200);
// Without TTFT, prefill and generation rates are 0 (cannot separate phases)
// Only combined rate is meaningful
assertEquals(result.prefillTokensPerSec, 0);
assertEquals(result.generationTokensPerSec, 0);
assertEquals(result.combinedTokensPerSec, 240);
});
Deno.test("toLogEntry - handles missing timeToFirstToken", () => {
const metrics: PromptMetrics = {
provider: "anthropic",
model: "claude-sonnet-4",
turnCount: 1,
inputTokens: 100,
outputTokens: 50,
totalTokens: 150,
prefillTokensPerSec: 20,
generationTokensPerSec: 10,
combinedTokensPerSec: 30,
totalDurationMs: 5000,
timeToFirstTokenMs: undefined,
turns: [],
};
const logEntry = toLogEntry(metrics);
assertEquals(logEntry.timeToFirstTokenMs, undefined);
});
Deno.test("Integration - full flow from turns to log entry", () => {
// Simulate a real scenario with multiple turns
const turn1 = calculateTurnMetrics({
turnId: "turn-1",
inputTokens: 2000,
outputTokens: 500,
durationMs: 8000,
timeToFirstTokenMs: 1200,
});
const turn2 = calculateTurnMetrics({
turnId: "turn-2",
inputTokens: 800,
outputTokens: 200,
durationMs: 3000,
});
const promptMetrics = aggregatePromptMetrics({
provider: "groq",
model: "llama-3.1-70b",
turnMetrics: [turn1, turn2],
});
const display = formatMetricsForDisplay(promptMetrics);
const logEntry = toLogEntry(promptMetrics);
// Verify aggregation
assertEquals(promptMetrics.turnCount, 2);
assertEquals(promptMetrics.inputTokens, 2800);
assertEquals(promptMetrics.outputTokens, 700);
assertEquals(promptMetrics.totalTokens, 3500);
assertEquals(promptMetrics.totalDurationMs, 11000);
assertEquals(promptMetrics.timeToFirstTokenMs, 1200);
// Verify corrected rate calculations
// prefill: 2800 / 1.2 = 2333.33 tok/s
assertGreaterOrEqual(promptMetrics.prefillTokensPerSec, 2333.3);
assertLessOrEqual(promptMetrics.prefillTokensPerSec, 2333.4);
// generation: 700 / 9.8 = 71.43 tok/s
assertGreaterOrEqual(promptMetrics.generationTokensPerSec, 71.4);
assertLessOrEqual(promptMetrics.generationTokensPerSec, 71.5);
// combined: 3500 / 11 = 318.18 tok/s
assertGreaterOrEqual(promptMetrics.combinedTokensPerSec, 318.1);
assertLessOrEqual(promptMetrics.combinedTokensPerSec, 318.2);
// Verify display contains key info
assertEquals(display.includes("groq/llama-3.1-70b"), true);
assertEquals(display.includes("TTFT: 1200ms"), true);
// Verify log entry
assertEquals(logEntry.provider, "groq");
assertEquals(logEntry.model, "llama-3.1-70b");
assertEquals(logEntry.turnCount, 2);
});

packages/pi-llm-performance/src/llm-metrics-core.ts Normal file
View File

@@ -0,0 +1,234 @@
// Functional core for LLM performance metrics calculation
// Extracted warning function so tests can mock it without touching console
export function warn(msg: string): void {
console.warn(msg);
}
export interface TurnMetrics {
turnId: string;
inputTokens: number;
outputTokens: number;
durationMs: number;
timeToFirstTokenMs?: number;
}
export interface PromptMetrics {
provider: string;
model: string;
turnCount: number;
inputTokens: number;
outputTokens: number;
totalTokens: number;
prefillTokensPerSec: number;
generationTokensPerSec: number;
combinedTokensPerSec: number;
totalDurationMs: number;
timeToFirstTokenMs?: number;
rawTimestamps?: {
ttftMs?: number;
allTtftMs?: number[];
generationDurationMs?: number;
turns: Array<{ turnId: string; durationMs: number; ttftMs?: number }>;
};
turns: TurnMetrics[];
}
export interface MetricLogEntry {
timestamp: string;
provider: string;
model: string;
turnCount: number;
inputTokens: number;
outputTokens: number;
totalTokens: number;
prefillTokensPerSec: number;
generationTokensPerSec: number;
combinedTokensPerSec: number;
totalDurationMs: number;
timeToFirstTokenMs?: number;
rawTimestamps?: {
ttftMs?: number;
allTtftMs?: number[];
generationDurationMs?: number;
turns: Array<{ turnId: string; durationMs: number; ttftMs?: number }>;
};
}
/**
* Calculate metrics for a single turn
*/
export function calculateTurnMetrics(params: {
turnId: string;
inputTokens: number;
outputTokens: number;
durationMs: number;
timeToFirstTokenMs?: number;
}): TurnMetrics {
return {
turnId: params.turnId,
inputTokens: params.inputTokens,
outputTokens: params.outputTokens,
durationMs: params.durationMs,
timeToFirstTokenMs: params.timeToFirstTokenMs,
};
}
/**
* Aggregate multiple turn metrics into prompt-level metrics
*/
export function aggregatePromptMetrics(params: {
provider: string;
model: string;
turnMetrics: TurnMetrics[];
}): PromptMetrics {
const { provider, model, turnMetrics } = params;
if (turnMetrics.length === 0) {
return {
provider,
model,
turnCount: 0,
inputTokens: 0,
outputTokens: 0,
totalTokens: 0,
prefillTokensPerSec: 0,
generationTokensPerSec: 0,
combinedTokensPerSec: 0,
totalDurationMs: 0,
rawTimestamps: { turns: [] },
turns: [],
};
}
// Sum tokens across all turns
const inputTokens = turnMetrics.reduce((sum, t) => sum + t.inputTokens, 0);
const outputTokens = turnMetrics.reduce((sum, t) => sum + t.outputTokens, 0);
const totalTokens = inputTokens + outputTokens;
// Sum duration across all turns
const totalDurationMs = turnMetrics.reduce((sum, t) => sum + t.durationMs, 0);
const totalDurationSec = totalDurationMs / 1000;
// Collect per-turn TTFTs; prefill boundary is the first turn's TTFT
const ttftValues = turnMetrics.map(t => t.timeToFirstTokenMs).filter((t): t is number => t !== undefined && t >= 0);
const firstTurnTtftMs = ttftValues.length > 0 ? ttftValues[0] : undefined;
// Calculate tokens per second
// Prefill: input tokens / first-turn TTFT (prefill happens once at the start)
// Generation: output tokens / (totalDuration - firstTurnTTFT) (generation phase)
// Combined: total tokens / total duration
// When first-turn TTFT is unavailable, prefill and generation phases cannot be separated,
// so we set them to 0 and only report combined.
const ttftSec = firstTurnTtftMs !== undefined ? firstTurnTtftMs / 1000 : undefined;
const generationDurationSec = firstTurnTtftMs !== undefined
? (totalDurationMs - firstTurnTtftMs) / 1000
: undefined;
const prefillTokensPerSec = (ttftSec && ttftSec > 0) ? inputTokens / ttftSec : 0;
const generationTokensPerSec = (generationDurationSec !== undefined && generationDurationSec > 0)
? outputTokens / generationDurationSec
: 0;
const combinedTokensPerSec = totalDurationSec > 0 ? totalTokens / totalDurationSec : 0;
// Sanity check: flag physically impossible generation speeds
if (generationTokensPerSec > 500) {
warn(
`[metrics] Suspicious generation speed: ${generationTokensPerSec.toFixed(1)} tok/s (input=${inputTokens}, output=${outputTokens}, totalDuration=${totalDurationMs}ms, TTFT=${firstTurnTtftMs}ms)`
);
}
return {
provider,
model,
turnCount: turnMetrics.length,
inputTokens,
outputTokens,
totalTokens,
prefillTokensPerSec,
generationTokensPerSec,
combinedTokensPerSec,
totalDurationMs,
timeToFirstTokenMs: firstTurnTtftMs,
rawTimestamps: {
ttftMs: firstTurnTtftMs,
allTtftMs: ttftValues,
generationDurationMs: generationDurationSec !== undefined ? generationDurationSec * 1000 : undefined,
turns: turnMetrics.map(t => ({ turnId: t.turnId, durationMs: t.durationMs, ttftMs: t.timeToFirstTokenMs })),
},
turns: turnMetrics,
};
}
/**
* Format metrics for TUI display
*/
export function formatMetricsForDisplay(metrics: PromptMetrics): string {
const lines: string[] = [];
// Header with provider/model
lines.push(`📊 Performance: ${metrics.provider}/${metrics.model}`);
if (metrics.turnCount === 0) {
lines.push(" No turns recorded");
return lines.join("\n");
}
// Format duration display
const durationSec = metrics.totalDurationMs / 1000;
const durationDisplay = durationSec >= 60
? `${(durationSec / 60).toFixed(1)}m`
: `${durationSec.toFixed(1)}s`;
// Prefill metrics (only when TTFT was available)
if (metrics.prefillTokensPerSec > 0) {
lines.push(
` Prefill: ${metrics.inputTokens.toLocaleString()} tokens @ ${metrics.prefillTokensPerSec.toFixed(1)} tok/s`
);
}
// Generation metrics (only when TTFT was available)
if (metrics.generationTokensPerSec > 0) {
lines.push(
` Generation: ${metrics.outputTokens.toLocaleString()} tokens @ ${metrics.generationTokensPerSec.toFixed(1)} tok/s`
);
}
// Combined metrics
lines.push(
` Combined: ${metrics.totalTokens.toLocaleString()} tokens @ ${metrics.combinedTokensPerSec.toFixed(1)} tok/s (${durationDisplay} total)`
);
// Time to first token
if (metrics.timeToFirstTokenMs !== undefined) {
lines.push(` TTFT: ${metrics.timeToFirstTokenMs.toFixed(0)}ms`);
}
// Turn count
if (metrics.turnCount > 1) {
lines.push(` Turns: ${metrics.turnCount}`);
}
return lines.join("\n");
}
/**
* Convert PromptMetrics to JSONL log entry
*/
export function toLogEntry(metrics: PromptMetrics): MetricLogEntry {
return {
timestamp: new Date().toISOString(),
provider: metrics.provider,
model: metrics.model,
turnCount: metrics.turnCount,
inputTokens: metrics.inputTokens,
outputTokens: metrics.outputTokens,
totalTokens: metrics.totalTokens,
prefillTokensPerSec: Math.round(metrics.prefillTokensPerSec * 100) / 100,
generationTokensPerSec: Math.round(metrics.generationTokensPerSec * 100) / 100,
combinedTokensPerSec: Math.round(metrics.combinedTokensPerSec * 100) / 100,
totalDurationMs: metrics.totalDurationMs,
timeToFirstTokenMs: metrics.timeToFirstTokenMs,
rawTimestamps: metrics.rawTimestamps,
};
}

packages/pi-llm-performance/src/llm-performance-metrics.ts Normal file
View File

@@ -0,0 +1,101 @@
// LLM Performance Metrics Extension
// Captures and displays LLM inference performance metrics
import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
import { appendFileSync, mkdirSync } from "node:fs";
import { dirname, join } from "node:path";
// Re-export core functions from the shared metrics module
import {
calculateTurnMetrics,
aggregatePromptMetrics,
formatMetricsForDisplay,
toLogEntry,
type TurnMetrics,
type PromptMetrics,
type MetricLogEntry,
} from "./llm-metrics-core.ts";
// ============================================================================
// Extension Event Handlers (imperative shell)
// ============================================================================
// State tracking
let promptStartMs: number | undefined;
let currentTurnStartMs: number | undefined;
let currentTurnId: string | undefined;
let turnMetrics: TurnMetrics[] = [];
let currentTurnFirstTokenMs: number | undefined; // Per-turn TTFT
let provider: string | undefined;
let model: string | undefined;
export default function (pi: ExtensionAPI) {
const logFile = join(process.cwd(), ".pi", "llm-metrics.log");
pi.on("agent_start", async (_event, ctx) => {
if (!ctx.model) return;
promptStartMs = Date.now();
turnMetrics = [];
currentTurnFirstTokenMs = undefined;
provider = ctx.model.provider;
model = ctx.model.id;
});
pi.on("turn_start", async (event, _ctx) => {
currentTurnStartMs = Date.now();
currentTurnId = `turn-${event.turnIndex}`;
currentTurnFirstTokenMs = undefined; // Reset TTFT for this turn
});
pi.on("message_update", async (event, _ctx) => {
// Capture per-turn TTFT on first token
if (currentTurnFirstTokenMs === undefined && event.assistantMessageEvent?.type === "text_delta") {
currentTurnFirstTokenMs = Date.now();
}
});
pi.on("turn_end", async (event, _ctx) => {
if (event.message.role !== "assistant") return;
const inputTokens = event.message.usage?.input ?? 0;
const outputTokens = event.message.usage?.output ?? 0;
const durationMs = currentTurnStartMs ? Date.now() - currentTurnStartMs : 0;
const ttftMs = currentTurnFirstTokenMs && currentTurnStartMs
? currentTurnFirstTokenMs - currentTurnStartMs
: undefined;
const turnMetric = calculateTurnMetrics({
turnId: currentTurnId!,
inputTokens,
outputTokens,
durationMs,
timeToFirstTokenMs: ttftMs,
});
turnMetrics.push(turnMetric);
});
pi.on("agent_end", async (_event, ctx) => {
if (!provider || !model || promptStartMs === undefined) return;
const promptMetrics = aggregatePromptMetrics({
provider,
model,
turnMetrics,
});
// Display in TUI
const display = formatMetricsForDisplay(promptMetrics);
ctx.ui.notify(display, "info");
ctx.ui.setStatus("metrics", `📊 ${promptMetrics.combinedTokensPerSec.toFixed(1)} tok/s`);
// Log to JSONL file
const logEntry = toLogEntry(promptMetrics);
mkdirSync(dirname(logFile), { recursive: true });
appendFileSync(logFile, JSON.stringify(logEntry) + "\n", "utf8");
// Reset state
promptStartMs = undefined;
turnMetrics = [];
currentTurnFirstTokenMs = undefined;
});
}

packages/pi-notifications/README.md Normal file
View File

@@ -0,0 +1,62 @@
# pi-notifications
Audio alerts for pi agent events via `afplay`.
## What it does
Plays a sound when the agent finishes a turn, so you can step away and get alerted when input is needed.
## Configuration
| Env var | Default | Description |
|---------|---------|-------------|
| `PI_NOTIFICATIONS_ENABLED` | `true` | Set to `false` to disable all notifications |
| `PI_NOTIFICATION_AGENT_END` | `true` | Play sound when agent finishes |
| `PI_NOTIFICATION_AUDIO` | `/System/Library/Sounds/Glass.aiff` | Path to audio file (.aiff/.wav/.mp3) |
## Standalone tester
Verify audio playback:
```bash
node --input-type=module -e "import {createJiti} from './node_modules/.pnpm/@mariozechner+jiti@2.6.5/node_modules/@mariozechner/jiti/lib/jiti.mjs'; const jiti = createJiti(); await jiti.import('./packages/pi-notifications/src/test-notify.ts');"
```
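If `jiti` is available, the shorter invocation from the script's header comment, `npx jiti packages/pi-notifications/src/test-notify.ts`, should be equivalent.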
## Available macOS sounds
```
/System/Library/Sounds/Bottle.aiff
/System/Library/Sounds/Cork.aiff
/System/Library/Sounds/Frog.aiff
/System/Library/Sounds/Glass.aiff ← default
/System/Library/Sounds/Hero.aiff
/System/Library/Sounds/Morse.aiff
/System/Library/Sounds/Ping.aiff
/System/Library/Sounds/Pop.aiff
/System/Library/Sounds/Submarine.aiff
/System/Library/Sounds/Sosumi.aiff
/System/Library/Sounds/Tink.aiff
```
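For example, launching with `PI_NOTIFICATION_AUDIO=/System/Library/Sounds/Submarine.aiff pi` switches the alert to the Submarine sound for that session.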
## Usage
Add to `~/.pi/agent/settings.json`:
```json
{
"packages": [
"/path/to/packages/pi-notifications"
]
}
```
Then reload pi:
```bash
/reload
```
## License
MIT

packages/pi-notifications/package.json Normal file
View File

@@ -0,0 +1,17 @@
{
"name": "pi-notifications",
"version": "0.1.0",
"description": "Desktop notifications for pi agent events",
"type": "module",
"exports": {
".": "./src/index.ts"
},
"keywords": ["pi-package"],
"pi": {
"extensions": ["src/index.ts"]
},
"peerDependencies": {
"@mariozechner/pi-coding-agent": "*"
},
"license": "MIT"
}

packages/pi-notifications/src/index.ts Normal file
View File

@@ -0,0 +1,36 @@
// Audio notifications for pi agent events
// Plays an audio file to alert the user
import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
import { execSync } from "node:child_process";
import { existsSync } from "node:fs";
// Configuration via environment variables
const enabled = process.env.PI_NOTIFICATIONS_ENABLED !== "false";
const agentEndEnabled = process.env.PI_NOTIFICATION_AGENT_END !== "false";
const audioPath = process.env.PI_NOTIFICATION_AUDIO || "/System/Library/Sounds/Glass.aiff";
function notify(body: string, subtitle?: string): void {
if (!enabled) return;
try {
if (existsSync(audioPath)) {
execSync(`afplay "${audioPath}"`, { stdio: "ignore" });
}
} catch {
// audio playback failed — silently fail
}
}
export default function (pi: ExtensionAPI) {
pi.on("session_start", async (_event, _ctx) => {
if (enabled) {
notify("pi-notifications active", "Listening for agent_end");
}
});
pi.on("agent_end", async (event, _ctx) => {
if (!agentEndEnabled) return;
notify("Agent finished", `${event.messages?.length ?? 0} turns`);
});
}

packages/pi-notifications/src/test-notify.ts Normal file
View File

@@ -0,0 +1,24 @@
// Standalone audio tester — run from bash to verify audio playback works
// Usage: npx jiti packages/pi-notifications/src/test-notify.ts
//
// This is completely decoupled from the agent loop.
// Use it to verify that audio playback works before debugging event handler wiring.
import { execSync } from "node:child_process";
import { existsSync } from "node:fs";
const audioPath = process.env.PI_NOTIFICATION_AUDIO || "/System/Library/Sounds/Glass.aiff";
try {
if (!existsSync(audioPath)) {
console.error("[test-audio] ❌ Audio file not found:", audioPath);
console.error("[test-audio] Set PI_NOTIFICATION_AUDIO to a valid .aiff/.wav/.mp3 path");
process.exit(1);
}
console.log("[test-audio] playing:", audioPath);
execSync(`afplay "${audioPath}"`, { stdio: ["ignore", "pipe", "pipe"] });
console.log("[test-audio] ✅ Audio played");
} catch (e: any) {
console.error("[test-audio] ❌ Failed:", e.message);
process.exit(1);
}

plans/metrics-check.md Normal file
View File

@@ -0,0 +1,73 @@
# Plan: Analyze & Fix `llm-metrics` Extension Timing Bug
## Problem Statement
The extension reports generation speed as ~8,000–2,400 tok/s (physically impossible) while prefill speed is ~70 tok/s. The math is internally consistent but the underlying phase boundaries are inverted or misaligned. Real generation speed is ~53–70 tok/s (confirmed by earlier runs).
## Phase 1: Locate & Map the Extension
1. **Find the source code**
- Search `~/.pi/extensions/`, `~/.pi/tools/`, and the pi-coding-agent package for files matching `llm`, `metric`, `performance`, `benchmark`
- Check `~/.pi/config` or project `.pi/config` for extension/tool registration
- Look for custom tool definitions in `extensions/`, `tools/`, or `skills/` directories
2. **Identify the provider integration**
- The log shows `"provider":"llama.cpp"` — find where the extension hooks into llama.cpp (likely via subprocess, WebSocket, or callback interception)
- Map the data flow: raw llama.cpp output → extension parsing → JSON log writing
## Phase 2: Diagnose the Timing Bug
3. **Trace phase boundary detection**
- Find how the extension defines "prefill" vs "generation" start/end times
- Check if it uses:
- `timeToFirstToken` (TTFT) as the split point
- llama.cpp callback hooks (`completion_token_callback`, `prompt_token_callback`)
- Wall-clock timestamps around token streaming
4. **Verify the calculation**
- Confirm the formula: `generationTok/s = outputTokens / (totalDuration - TTFT)` (a worked check follows this list)
- Check if `totalDuration` includes only generation, or the full call
- Look for race conditions: async callbacks firing out of order, or generation end timestamp captured before all tokens are flushed
5. **Reproduce the anomaly**
- Run the same model with identical prompt/output length
- Compare TTFT, totalDuration, and per-phase timestamps
- Check if the bug appears only with large prompts, speculative decoding, or certain sampling configs
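As a quick check of the step-4 formulas, plugging in the numbers from the `pi` run quoted in commit 98e18643c5 reproduces its reported rates (a standalone sketch; the 3.9m duration is rounded, so generation lands slightly under the reported 52.6 tok/s):
```typescript
// Worked check of the step-4 formulas with the commit-98e18643c5 numbers.
const inputTokens = 15_460;
const outputTokens = 12_179;
const ttftMs = 769;
const totalDurationMs = 3.9 * 60 * 1000; // ≈ 234,000 ms (rounded in the quoted output)

const prefill = inputTokens / (ttftMs / 1000); // ≈ 20,104 tok/s
const generation = outputTokens / ((totalDurationMs - ttftMs) / 1000); // ≈ 52.2 tok/s
console.log(prefill.toFixed(1), generation.toFixed(1));
```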
## Phase 3: Fix the Implementation
6. **Correct phase boundaries**
- If using callbacks: ensure generation start = TTFT timestamp, generation end = last token callback or explicit `done` event
- If using wall-clock: add a small buffer after last token to account for async flush
- Add validation: reject generation speeds > 500 tok/s (sanity check)
7. **Fix label assignment**
- Ensure `prefillTokensPerSec` = `inputTokens / TTFT`
- Ensure `generationTokensPerSec` = `outputTokens / (totalDuration - TTFT)`
- Add explicit phase logging to debug output
8. **Add telemetry**
- Log raw timestamps: `prefill_start`, `prefill_end`, `gen_start`, `gen_end`, `total_start`, `total_end`
- Log per-phase token counts to catch mismatches
- Write to `.pi/llm-metrics.log` with consistent schema
## Phase 4: Verify & Deploy
9. **Test cases**
- Small prompt + short output (baseline)
- Large prompt + long output (original failure case)
- Speculative decoding run (if supported)
- Early termination / stop token edge case
10. **Validate output**
- Generation speed should be 40–100 tok/s for this model/hardware
- Prefill speed should be 50–200 tok/s (parallel compute)
- TTFT should match prefill duration
- No negative phase durations
11. **Update schema & docs**
- Add `rawTimestamps` field to log entries for debugging
- Document phase definitions in extension README
- Add unit tests for metric calculation logic
## Deliverables
- [ ] Extension source located & data flow mapped
- [ ] Root cause identified (callback timing gap, phase boundary misassignment, or async flush race)
- [ ] Fix implemented with sanity checks
- [ ] Test suite covering edge cases
- [ ] Log schema updated with raw timestamps
- [ ] PR or patch ready for review
## Questions to Answer During Analysis
- Does the extension intercept llama.cpp at the C++ level, via CLI, or through a Python wrapper?
- Are callbacks synchronous or async?
- Is there a `done`/`end` event, or does it rely on empty token streams?
- Could speculative decoding be causing the draft model's batched verification to be misclassified as "generation"?

View File

@@ -0,0 +1,154 @@
# Plan: pi-notifications v0 — Desktop Notifications for Agent Events
## Goal
Make the `pi-notifications` extension reliably show macOS Notification Center alerts when the agent finishes a turn, so the user gets alerted without needing to watch the screen.
## Current State
- Extension exists at `packages/pi-notifications/src/index.ts` (monorepo) and `~/.pi/agent/extensions/pi-notifications.ts` (auto-discovery)
- Extension loads correctly (appears in `/reload` extension list)
- `console.log` from extensions is NOT visible in `/reload` output
- `osascript` works when run directly in bash, but notification doesn't appear when called from the extension
- The `session_start` handler fires on reload, `agent_end` fires when prompts complete
## Debugging Strategy (split into two orthogonal problems)
### Problem A: "Does the trigger fire?" — visible debug signal
`console.log` from extensions is invisible in pi's TUI output. To debug the trigger logic in a fast loop, add a **debug mode** (`PI_NOTIFICATION_DEBUG=true`) that emits a visible signal via `ctx.ui.steer()` (or similar) right before calling `notify()`. This surfaces in the chat/TUI so you can verify the handler fires without needing actual desktop notifications.
### Problem B: "Does `osascript` actually deliver?" — isolated tester
Create a standalone script (`test-notify.ts`) that you run from bash independently of the agent loop. This verifies `osascript` works in the extension's import context, decoupled from event handlers.
### 1. Verify `osascript` works in extension context
The extension uses `execSync` from `node:child_process`. Test that it works inside the extension:
```typescript
// In the extension, add this to session_start handler:
try {
const output = execSync('osascript -e "display notification \\"test\\" with title \\"test\\""').toString();
console.log("[pi-notifications] osascript output:", output);
} catch (e: any) {
console.log("[pi-notifications] osascript error:", e.message, e.stderr?.toString());
}
```
If `execSync` fails silently, try:
- Using `{ stdio: ["pipe", "pipe", "pipe"] }` to capture stderr
- Checking if `node:child_process` is available in the extension sandbox
### 2. Check macOS notification settings
Notifications may be delivered but not shown as banners:
- **System Settings → Notifications → Ghostty → Notification Style** — must be "Banners" or "Alerts", not "None" (osascript fires from the Ghostty process, so macOS attributes notifications to Ghostty, not "pi")
- **System Settings → Focus → [active focus] → Apps** — ensure "Ghostty" is not excluded
- **System Settings → Notifications → Show Notifications on Lock Screen** — enable if needed
**Known symptom:** Notifications appear in Notification Center when pulled down, but never pop up as banners. This is a macOS style setting, not a code issue.
### 3. Ghostty suppresses banners when focused
Ghostty intentionally silences banner notifications (no pop-up, no sound) when the Ghostty window is **active/focused**. The notification is still delivered to Notification Center. Banners only appear when Ghostty is **not** the active window.
**Workarounds:**
- **System Settings → Notifications → Ghostty → Alert Style → "Persistent"** — macOS shows these as banners regardless of Ghostty's silencing
- **Switch to another app** (e.g. leave your browser open) when you want to see the banner
### 4. Try alternative notification methods
If `osascript` doesn't work from the extension, try:
- `notify-send` (Linux-only, not relevant for macOS)
- A custom TUI widget that shows a persistent banner
- Using `ctx.ui.notify()` (but this only shows in pi's TUI, not system notification)
### 5. Verify event handlers fire
Add a `session_start` handler that definitely fires:
```typescript
pi.on("session_start", async (_event, ctx) => {
console.log("[pi-notifications] session_start fired");
ctx.ui.notify("pi-notifications active", "info"); // Shows in TUI
});
```
If `ctx.ui.notify()` works but `osascript` doesn't, the issue is macOS notification permissions, not the extension.
## Implementation Plan
### Step 0A: Add debug mode with visible signal (PI_NOTIFICATION_DEBUG)
Add a `PI_NOTIFICATION_DEBUG=true` env var. When enabled, the extension calls `ctx.ui.steer()` (or a visible TUI signal) right before each notification, so you see "notification triggered" in the chat output during the agent loop. This lets you verify trigger logic without needing actual desktop notifications.
- In `agent_end` handler: if `PI_NOTIFICATION_DEBUG=true`, call `ctx.ui.steer("[pi-notifications] notification triggered")` before `notify()`
- In `session_start` handler: same pattern
- This is purely for debugging — no desktop notification shown when debug is on (or both are shown)
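A minimal sketch of the gate, meant to sit inside the extension's existing `export default function (pi) { ... }` body and reuse its `notify()` helper (assumes `ctx.ui.steer()` accepts a plain string):
```typescript
// Step 0A sketch — assumes ctx.ui.steer() takes a plain string message.
const debug = process.env.PI_NOTIFICATION_DEBUG === "true";

pi.on("agent_end", async (_event, ctx) => {
  if (debug) ctx.ui.steer("[pi-notifications] agent_end notification triggered");
  notify("Agent finished");
});
```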
### Step 0B: Create isolated notification tester
Create `packages/pi-notifications/src/test-notify.ts` — a standalone script runnable via `npx jiti` that fires a test notification. Run it from bash to verify `osascript` works in the extension's context, completely separate from the agent loop.
### Step 1: Fix notification delivery (priority)
Once the root cause is identified:
**If `osascript` works but notification is suppressed:**
- Add a `PI_NOTIFICATION_SOUND` env var (already in design)
- Add `PI_NOTIFICATIONS_ENABLED` toggle (already in design)
- Consider adding a "first-run" notification that asks user to enable notifications
**If `osascript` doesn't work from extension:**
- Fall back to `ctx.ui.notify()` which shows in pi's TUI
- Or use a different approach (e.g., write to a file that a separate process monitors)
### Step 2: Add turn-limit notification
In `packages/pi-turn-limit/src/turn-limit.ts`, add notification when the limit is reached:
```typescript
// In the turn-limit extension, when the limit fires:
if (shouldNotify) {
execSync('osascript -e \'display notification "Turn limit reached" with title "pi" subtitle "Turns: ' + turnCount + '/' + maxTurns + '"\'');
}
```
Configuration via env var:
- `PI_NOTIFICATION_TURN_LIMIT` — default `true`, set to `false` to disable
### Step 3: Add sound option
Already designed in the extension:
- `PI_NOTIFICATION_SOUND` env var (default: `default`)
- macOS sounds: `Bottle`, `Ping`, `Pop`, `Submarine`, `Sosumi`, `Tink`
- Set to `""` for silent
### Step 4: Update README
Document the extension with:
- What it does
- Configuration options
- How to enable macOS notifications
- Troubleshooting tips
## Files to Modify
| File | Action |
|------|--------|
| `~/.pi/agent/extensions/pi-notifications.ts` | Debug and fix notification delivery |
| `packages/pi-notifications/src/index.ts` | Sync fixes from auto-discovery version |
| `packages/pi-turn-limit/src/turn-limit.ts` | Add turn-limit notification |
| `packages/pi-notifications/README.md` | Update with notification docs |
## Success Criteria
1. ✅ Extension loads and appears in `/reload` output
2. ✅ macOS Notification Center shows "pi-notifications active" on reload
3. ✅ macOS Notification Center shows "Agent finished — N turns" when agent completes a prompt
4. ✅ Turn-limit notification shows when turn limit is exceeded
5. ✅ `PI_NOTIFICATIONS_ENABLED=false` disables all notifications
6. ✅ README documents all configuration options
7. ✅ `PI_NOTIFICATION_DEBUG=true` shows visible signal in TUI when handlers fire
8. ✅ `test-notify.ts` fires a notification when run standalone

scoped-packages.md Normal file
View File

@@ -0,0 +1,76 @@
# Scoped Packages
## Step 1: Create the npm org
Create the organization in the npm web UI on npmjs.com (the `npm org` CLI only manages members of an existing org; it doesn't create one). Creating the org claims the `@mostalive` scope on npm. You'll need to pay the [org fee](https://docs.npmjs.com/about-organizations) (currently ~$7/month for the basic tier).
Alternatively, if you already have an account, you can use your username directly — scoped packages can use your personal account too:
```bash
# No separate org creation needed if @mostalive is your npm username
```
Check who's in the org:
```bash
npm org ls mostalive
```
## Step 2: Rename the package
In `packages/pi-turn-limit/package.json`:
```json
{
"name": "@mostalive/pi-turn-limit",
"version": "0.1.0",
...
}
```
## Step 3: Publish
Scoped packages require `--access public` on first publish, since npm defaults scoped packages to private:
```bash
cd packages/pi-turn-limit
npm publish --access public
```
## Step 4: Users install
```bash
pi install npm:@mostalive/pi-turn-limit
```
---
## Cheaper Alternative: Git Package (No Scope)
If you don't want to pay for an npm org, you can ship via git without scoping:
```bash
pi install git:github.com/mostalive/pi-turn-limit
```
No npm org needed. Users install directly from your GitHub repo. You'd still need to publish to npm for the `npm:` install path, but the git path is free.
---
## Summary
| Approach | Cost | User installs via |
|----------|------|-------------------|
| `npm org create` + scoped npm | ~$7/mo | `pi install npm:@mostalive/pi-turn-limit` |
| GitHub repo (no scope) | Free | `pi install git:github.com/user/repo` |
| Unscoped npm (`pi-turn-limit`) | Free | `pi install npm:pi-turn-limit` |
If you already have a personal npm account named `mostalive`, the scope is free — scoped packages just use your existing account. The org fee only applies if you create a separate organization entity.

working-with-extensions.md Normal file
View File

@@ -0,0 +1,152 @@
# Working with Pi Extensions
## Installation Options
### Option 1: Publish to npm + `pi install` (Recommended)
The cleanest path that replicates the official pi experience.
**You (publishing):**
```bash
cd packages/pi-turn-limit
npm publish
```
**Users (installing globally):**
```bash
pi install npm:pi-turn-limit
```
This writes to `~/.pi/agent/settings.json` under `packages`. Pi handles the install, runs `npm install`, and auto-discovers the extension from the `pi.extensions` manifest.
### Option 2: npm global install + settings.json
**You (publishing):**
```bash
npm publish
```
**Users:** Two steps — install the npm package globally, then tell pi about it:
```bash
npm install -g pi-turn-limit
```
Then in `~/.pi/agent/settings.json`:
```json
{
"packages": [
"npm:pi-turn-limit"
]
}
```
Or use the same command as Option 1 — `pi install npm:pi-turn-limit` does both steps.
### Option 3: Local directory (for development)
For local testing without publishing:
```bash
pi install /Users/willem/dev/spikes/llm/monotonic-pi-extensions/packages/pi-turn-limit
```
Or in `~/.pi/agent/settings.json`:
```json
{
"packages": [
"/Users/willem/dev/spikes/llm/monotonic-pi-extensions/packages/pi-turn-limit"
]
}
```
Or as a single-file extension in `~/.pi/agent/extensions/`:
```bash
cp packages/pi-turn-limit/src/turn-limit.ts ~/.pi/agent/extensions/turn-limit.ts
```
### Option 4: Per-repo project-local install
Users can install an extension only for a specific project:
```bash
pi install -l npm:pi-turn-limit # -l = project-local
```
This writes to `.pi/settings.json` in the project root. Pi auto-installs missing packages on startup per-project.
---
## Disabling Extensions Per-Repo
Three approaches:
### A. `pi config` (simplest)
```bash
pi config turn-limit:off # Disable by extension name
pi config turn-limit:on # Re-enable
```
Works for both global and project scope. Per-repo:
```bash
pi config -l turn-limit:off
```
### B. Package filtering in project `settings.json`
In `.pi/settings.json` (project-local):
```json
{
"packages": [
{
"source": "npm:pi-turn-limit",
"extensions": [] // Load none
}
]
}
```
Or filter specific files:
```json
{
"packages": [
{
"source": "npm:pi-turn-limit",
"extensions": ["!src/turn-limit.ts"] // Exclude this one
}
]
}
```
### C. Remove from settings entirely
```bash
pi remove npm:pi-turn-limit
```
Or manually edit `~/.pi/agent/settings.json` and remove the package entry.
---
## Summary Table
| Method | Scope | User Command |
|--------|-------|--------------|
| `pi install npm:pkg` | Global | One command, handles everything |
| `npm i -g` + settings.json | Global | Two steps |
| `pi install ./path` | Global (symlink-style) | Local dev |
| `pi install -l npm:pkg` | Project-local | Per-repo |
| `pi config name:off` | Toggle | Enable/disable without uninstalling |
| `pi config -l name:off` | Project-local toggle | Per-repo disable |
**Recommendation:** Publish to npm, then users run `pi install npm:pi-turn-limit`. For disabling per-repo, `pi config -l turn-limit:off` is the simplest approach — a one-liner that doesn't require editing JSON files.