firehose/sequence-diagram-skill/autoresearch.md

# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix

## Objective

Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams
from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B
model running on CPU. The primary failure mode is **sidetracking** — the model
abandons the diagram task and starts reviewing/critiquing the code instead.

## Primary Metric

**score** — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).
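
The arithmetic behind the 0-18 scale can be sketched in shell. This is an illustration only, not the contents of `scripts/score.sh`: the function name `total_score` and the input format (one line per test input, six space-separated 0/1 results) are assumptions made for the example.

```shell
# Sketch only: sum binary eval results into the overall score.
# Assumed (hypothetical) input on stdin: one line per test input,
# each holding six space-separated 0/1 values, one per eval.
total_score() {
  local total line bit
  total=0
  while IFS= read -r line; do
    # Unquoted on purpose: split the line into its 0/1 fields.
    for bit in $line; do
      total=$((total + bit))
    done
  done
  echo "$total"
}

# Usage (5 + 5 + 3 passes across the three inputs, prints 13):
#   printf '%s\n' '1 1 1 1 0 1' '1 1 1 0 1 1' '1 0 0 0 1 1' | total_score
```

A perfect run is three lines of six ones, which gives the maximum of 18.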

## Secondary Metrics

- **sidetrack_count** — number of test runs containing review/critique language (lower is better)
- **parse_count** — number of outputs that contain a parseable sequenceDiagram (higher is better)
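
As a rough illustration of how `sidetrack_count` could be computed, the sketch below counts the output files that contain at least one blocklisted phrase. It assumes the blocklist holds one phrase per line and that matching is case-insensitive on fixed strings; both are guesses, and the function names `is_sidetracked` and `sidetrack_count` are hypothetical rather than taken from the real scripts. `parse_count` is not covered because it needs a Mermaid parser.

```shell
# Sketch only: count outputs that contain review/critique language.

# Return 0 if the output file ($2) contains any phrase from the
# blocklist file ($1), else 1. Assumes one phrase per line.
is_sidetracked() {
  local phrase
  while IFS= read -r phrase; do
    # Skip blank lines; an empty pattern would match everything.
    [ -n "$phrase" ] || continue
    if grep -qiF -- "$phrase" "$2"; then
      return 0
    fi
  done < "$1"
  return 1
}

# Usage: sidetrack_count BLOCKLIST OUTPUT_FILE...
# Prints how many of the output files are sidetracked.
sidetrack_count() {
  local blocklist n f
  blocklist=$1
  n=0
  shift
  for f in "$@"; do
    if is_sidetracked "$blocklist" "$f"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

Matching whole phrases as fixed strings keeps the check deterministic, at the cost of missing paraphrased critique that the blocklist does not anticipate.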

## Architecture

Pi runs the skill against the Firehose codebase (mounted in the workspace) using
the target model. Scoring is done by bash scripts — no judge model needed.

## The Codebase Under Test

**Firehose** — a Phoenix blogging platform with a monorepo structure:

- `app/` — Phoenix web app (OTP app: `:firehose`)
  - `lib/firehose_web/router.ex` — routes
  - `lib/firehose_web/controllers/blog_controller.ex` — blog actions
  - `lib/firehose_web/controllers/page_controller.ex` — homepage
  - `lib/firehose/blogs/` — blog context modules (EngineeringBlog, ReleaseNotes)
- `blogex/` — sibling library for compile-time blog engine
  - `lib/blogex/blog.ex` — `use Blogex.Blog` macro (NimblePublisher)
  - `lib/blogex/components.ex` — Phoenix function components (post_meta, tag_list, etc.)
  - `lib/blogex/router.ex` — API/feed routes

**Key architectural fact:** Blogex uses NimblePublisher, so all blog posts are
compiled into the BEAM modules at build time. There is NO runtime file I/O for
reading posts: functions like `all_posts/0`, `get_post!/1`, and `posts_by_tag/1`
return data captured from the `@posts` module attribute at compile time. This is
the #1 thing models get wrong.

## Test Inputs (3 scenarios)

### 1. Click tag on post (small)
"Generate a sequence diagram for: a user on a blog post page clicks a tag link
(e.g., 'elixir'). Trace the full request from browser through to rendered response."

### 2. Show homepage (small)
"Generate a sequence diagram for: a user visits the homepage (GET /).
Trace from browser through to rendered HTML."

### 3. Add blog post on disk (larger, crosses compile/runtime boundary)
"Generate a sequence diagram for: a developer creates a new markdown file in
priv/blog/engineering/. Trace what happens from file creation through to the
post being visible on the blog. Include the compile-time and runtime phases."

## Eval Criteria (6 binary checks)

1. **has_diagram** — output contains `` ```mermaid `` and `sequenceDiagram`
2. **diagram_parseable** — the mermaid block is syntactically valid
3. **uses_real_modules** — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
4. **uses_real_functions** — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
5. **no_sidetracking** — output does NOT contain code review language (see `scripts/sidetrack_blocklist.txt`)
6. **concise** — total output is under 3000 characters
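
Four of these checks (1, 3, 4, and 6) need nothing beyond `grep`, `awk`, and `wc`, so they can be sketched as small shell predicates. This is a hedged illustration, not the real `scripts/score.sh`: the helper names (`mermaid_block`, `count_mentions`, and so on) are hypothetical, matching is plain substring search, and "characters" is taken to mean whatever `wc -m` reports in the current locale.

```shell
# Sketch only. Each check takes the path to one saved model output
# and returns 0 (pass) or 1 (fail).

# Three backticks, spelled in octal so this snippet can itself sit
# inside a Markdown code fence.
FENCE=$(printf '\140\140\140')

# Print the body of the first mermaid fence in the file.
mermaid_block() {
  awk -v fence="$FENCE" '
    index($0, fence "mermaid") == 1 { inside = 1; next }
    inside && index($0, fence) == 1 { exit }
    inside
  ' "$1"
}

# Usage: count_mentions FILE NAME...
# Prints how many of the names occur inside the mermaid block.
count_mentions() {
  local file block n name
  file=$1
  n=0
  shift
  block=$(mermaid_block "$file")
  for name in "$@"; do
    if printf '%s\n' "$block" | grep -qF -- "$name"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# 1. has_diagram: a mermaid fence plus the sequenceDiagram keyword.
has_diagram() {
  grep -qF -- "${FENCE}mermaid" "$1" && grep -qF 'sequenceDiagram' "$1"
}

# 3. uses_real_modules: at least 2 of the listed module names.
uses_real_modules() {
  [ "$(count_mentions "$1" BlogController EngineeringBlog Blogex Router PageController)" -ge 2 ]
}

# 4. uses_real_functions: at least 1 of the listed function names.
uses_real_functions() {
  [ "$(count_mentions "$1" posts_by_tag 'get_post!' all_posts paginate resolve_blog render)" -ge 1 ]
}

# 6. concise: total output under 3000 characters.
concise() {
  [ "$(wc -m < "$1" | tr -d '[:space:]')" -lt 3000 ]
}
```

Restricting checks 3 and 4 to the mermaid block follows the wording "diagram mentions": a module name that appears only in surrounding prose does not count.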

## Files in Scope

| File | Agent may edit? |
|------|-----------------|
| `skill/SKILL.md` | ✅ YES — the only file the agent modifies |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |

## Constraints

- `skill/SKILL.md` must stay under 1500 tokens.
- `skill/SKILL.md` must NOT contain any code from the Firehose codebase (no overfitting).
- `skill/SKILL.md` must remain generic: it should work for any Elixir/Phoenix codebase,
  not just Firehose.
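
A pre-flight check for these constraints might look like the sketch below. It rests on loud assumptions: the tokenizer behind the 1500-token limit is not specified here, so tokens are approximated as bytes divided by 4 (a rough heuristic for English text), and "no Firehose code" is reduced to a crude search for project-specific names supplied by the caller. The function names are hypothetical and do not describe what `autoresearch.checks.sh` actually does.

```shell
# Sketch only: approximate guards for the SKILL.md constraints.

# Rough token estimate: bytes / 4, rounded up. The real count
# depends on the model's tokenizer, which is not specified here.
approx_tokens() {
  local bytes
  bytes=$(wc -c < "$1" | tr -d '[:space:]')
  echo $(( (bytes + 3) / 4 ))
}

# Usage: within_token_budget FILE [BUDGET]
# Passes when the estimate is strictly under the budget (default 1500).
within_token_budget() {
  [ "$(approx_tokens "$1")" -lt "${2:-1500}" ]
}

# Usage: mentions_project_names FILE NAME...
# Returns 0 if any project-specific name appears in the file.
mentions_project_names() {
  local file name
  file=$1
  shift
  for name in "$@"; do
    if grep -qF -- "$name" "$file"; then
      return 0
    fi
  done
  return 1
}

# Example:
#   within_token_budget skill/SKILL.md || echo "over budget"
#   mentions_project_names skill/SKILL.md Firehose Blogex && echo "overfit?"
```

Because the estimate is only a heuristic, it is safer to leave some headroom below 1500 than to rely on it near the limit.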

## What Has Been Tried

(autoresearch fills this in)

## Dead Ends

(autoresearch fills this in)

## Key Wins

(autoresearch fills this in)