Autoresearch: Sequence Diagram Skill for Elixir/Phoenix

Objective

Optimize a pi skill (skill/SKILL.md) that generates Mermaid sequence diagrams from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B model running on CPU. The primary failure mode is sidetracking — the model abandons the diagram task and starts reviewing/critiquing the code instead.

Primary Metric

score: higher is better. Each of the 3 test inputs is scored on 6 binary evals, and the score is the total number of passes, so it ranges from 0 to 18.
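
As a rough illustration of how the score adds up, the sketch below sums a list of binary results. The function name total_score is hypothetical and is not taken from scripts/score.sh, which may aggregate differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum binary eval results (each 0 or 1).
# With 6 evals x 3 test inputs there are 18 results, so the maximum is 18.
total_score() {
  local sum=0 r
  for r in "$@"; do
    case $r in
      0|1) sum=$((sum + r)) ;;
      *) echo "non-binary result: $r" >&2; return 1 ;;
    esac
  done
  echo "$sum"
}

# Three runs with 5, 5, and 4 passing evals respectively.
total_score 1 1 1 1 1 0  1 1 0 1 1 1  1 0 1 1 0 1   # prints 14
```

Rejecting anything other than 0 or 1 keeps a malformed eval result from silently inflating the score.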

Secondary Metrics

sidetrack_count — number of test runs containing review/critique language (lower is better)
parse_count — number of outputs that contain a parseable sequenceDiagram (higher is better)
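
A minimal sketch of how sidetrack_count could be computed, assuming the blocklist holds one phrase per line and that phrases are matched case-insensitively as fixed strings. Both the function name count_sidetracked and that file format are assumptions; the real scripts may work differently. parse_count is left out here because it needs a real Mermaid parser.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count_sidetracked BLOCKLIST OUTPUT_FILE...
# Prints how many output files contain at least one blocklisted phrase.
# Assumes the blocklist has no blank lines (a blank line would match everything).
count_sidetracked() {
  local blocklist=$1 n=0 f
  shift
  for f in "$@"; do
    # -q: quiet, -i: ignore case, -F: fixed strings, -f: read patterns from file
    if grep -qiFf "$blocklist" -- "$f"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

Each file counts at most once, which matches the definition above: the metric counts runs, not individual phrase hits.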

Architecture

Pi runs the skill against the Firehose codebase (mounted in the workspace) using the target model. Scoring is done by bash scripts — no judge model needed.

The Codebase Under Test

Firehose — a Phoenix blogging platform with a monorepo structure:

app/ — Phoenix web app (OTP app: :firehose)
- lib/firehose_web/router.ex — routes
- lib/firehose_web/controllers/blog_controller.ex — blog actions
- lib/firehose_web/controllers/page_controller.ex — homepage
- lib/firehose/blogs/ — blog context modules (EngineeringBlog, ReleaseNotes)
blogex/ — sibling library for compile-time blog engine
- lib/blogex/blog.ex — use Blogex.Blog macro (NimblePublisher)
- lib/blogex/components.ex — Phoenix function components (post_meta, tag_list, etc.)
- lib/blogex/router.ex — API/feed routes

Key architectural fact: Blogex uses NimblePublisher. All blog posts are compiled into BEAM module attributes at build time. There is NO runtime file I/O for reading posts. Functions like all_posts/0, get_post!/1, posts_by_tag/1 read from @posts module attributes. This is the #1 thing models get wrong.

Test Inputs (3 scenarios)

1. Click tag on post (small)

"Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full request from browser through to rendered response."

2. Show homepage (small)

"Generate a sequence diagram for: a user visits the homepage (GET /). Trace from browser through to rendered HTML."

3. Add blog post on disk (larger, crosses compile/runtime boundary)

"Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/. Trace what happens from file creation through to the post being visible on the blog. Include the compile-time and runtime phases."

Eval Criteria (6 binary checks)

has_diagram — output contains ```mermaid and sequenceDiagram
diagram_parseable — the mermaid block is syntactically valid
uses_real_modules — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
uses_real_functions — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
no_sidetracking — output does NOT contain code review language (see blocklist)
concise — total output is under 3000 characters
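
Four of these checks reduce to plain string tests, which is why no judge model is needed. The sketch below shows one way they might look in bash. All function names are hypothetical, and it assumes the checks run against the whole output rather than only the extracted diagram; scripts/score.sh may differ on both points. diagram_parseable and no_sidetracking are omitted because they depend on a Mermaid parser and on the blocklist file.

```bash
#!/usr/bin/env bash
# Hypothetical sketches. Each check takes the model output as $1
# and prints 1 (pass) or 0 (fail).

# Count how many of the given names occur in the text (fixed-string match).
count_mentions() {
  local text=$1 n=0 name
  shift
  for name in "$@"; do
    if grep -qF -- "$name" <<<"$text"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

has_diagram() {
  # \140 is the octal code for a backtick; this builds the mermaid fence
  # marker without writing a literal fence inside this example.
  local fence
  fence=$(printf '\140\140\140mermaid')
  if grep -qF -- "$fence" <<<"$1" && grep -qF -- 'sequenceDiagram' <<<"$1"; then
    echo 1
  else
    echo 0
  fi
}

uses_real_modules() {
  local n
  n=$(count_mentions "$1" BlogController EngineeringBlog Blogex Router PageController)
  if [ "$n" -ge 2 ]; then echo 1; else echo 0; fi
}

uses_real_functions() {
  local n
  n=$(count_mentions "$1" posts_by_tag 'get_post!' all_posts paginate resolve_blog render)
  if [ "$n" -ge 1 ]; then echo 1; else echo 0; fi
}

concise() {
  # ${#1} is the length of the output in characters (locale-dependent).
  if [ "${#1}" -lt 3000 ]; then echo 1; else echo 0; fi
}
```

Fixed-string matching (grep -F) is used so that names such as get_post! are compared literally rather than as regular expressions.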

Files in Scope

| File | Agent may edit? |
| --- | --- |
| `skill/SKILL.md` | ✅ YES (the only file the agent modifies) |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |

Constraints

SKILL.md must stay under 1500 tokens.
SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase, not just Firehose.

What Has Been Tried

(autoresearch fills this in)

Dead Ends

(autoresearch fills this in)

Key Wins

(autoresearch fills this in)
