Autoresearch: Sequence Diagram Skill for Elixir/Phoenix
Objective
Optimize a pi skill (skill/SKILL.md) that generates Mermaid sequence diagrams
from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B
model running on CPU. The primary failure mode is sidetracking — the model
abandons the diagram task and starts reviewing/critiquing the code instead.
Primary Metric
score — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).
Secondary Metrics
- sidetrack_count — number of test runs containing review/critique language (lower is better)
- parse_count — number of outputs that contain a parseable sequenceDiagram (higher is better)
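As a rough illustration of how these metrics relate, the sketch below aggregates per-run eval results in Python. It is not the project's scoring code (that lives in the bash scripts). The RunResult structure and summarize function are hypothetical, and it assumes that sidetrack_count counts runs failing no_sidetracking and that parse_count counts runs passing diagram_parseable.

```python
from dataclasses import dataclass

# The six binary evals named in this document.
EVAL_NAMES = (
    "has_diagram",
    "diagram_parseable",
    "uses_real_modules",
    "uses_real_functions",
    "no_sidetracking",
    "concise",
)


@dataclass
class RunResult:
    """Binary eval outcomes for one test input (hypothetical structure)."""
    evals: dict  # eval name -> bool


def summarize(runs):
    """Aggregate per-run results into the primary and secondary metrics."""
    # Primary metric: one point per passed eval, summed over all runs.
    score = sum(int(run.evals[name]) for run in runs for name in EVAL_NAMES)
    # Assumption: a run counts as sidetracked when it fails no_sidetracking.
    sidetrack_count = sum(1 for run in runs if not run.evals["no_sidetracking"])
    # Assumption: a run counts as parsed when it passes diagram_parseable.
    parse_count = sum(1 for run in runs if run.evals["diagram_parseable"])
    return {
        "score": score,
        "sidetrack_count": sidetrack_count,
        "parse_count": parse_count,
    }
```

With three runs that pass every check, summarize returns a score of 18, a sidetrack_count of 0, and a parse_count of 3.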
Architecture
Pi runs the skill against the Firehose codebase (mounted in the workspace) using the target model. Scoring is done by bash scripts — no judge model needed.
The Codebase Under Test
Firehose — a Phoenix blogging platform with a monorepo structure:
- app/ — Phoenix web app (OTP app: :firehose)
  - lib/firehose_web/router.ex — routes
  - lib/firehose_web/controllers/blog_controller.ex — blog actions
  - lib/firehose_web/controllers/page_controller.ex — homepage
  - lib/firehose/blogs/ — blog context modules (EngineeringBlog, ReleaseNotes)
- blogex/ — sibling library for compile-time blog engine
  - lib/blogex/blog.ex — use Blogex.Blog macro (NimblePublisher)
  - lib/blogex/components.ex — Phoenix function components (post_meta, tag_list, etc.)
  - lib/blogex/router.ex — API/feed routes
Key architectural fact: Blogex uses NimblePublisher. All blog posts are compiled
into BEAM module attributes at build time. There is NO runtime file I/O for reading
posts. Functions like all_posts/0, get_post!/1, posts_by_tag/1 read from
@posts module attributes. This is the #1 thing models get wrong.
Test Inputs (3 scenarios)
1. Click tag on post (small)
"Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full request from browser through to rendered response."
2. Show homepage (small)
"Generate a sequence diagram for: a user visits the homepage (GET /). Trace from browser through to rendered HTML."
3. Add blog post on disk (larger, crosses compile/runtime boundary)
"Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/. Trace what happens from file creation through to the post being visible on the blog. Include the compile-time and runtime phases."
Eval Criteria (6 binary checks)
- has_diagram — output contains ```mermaid and sequenceDiagram
- diagram_parseable — the mermaid block is syntactically valid
- uses_real_modules — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
- uses_real_functions — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
- no_sidetracking — output does NOT contain code review language (see blocklist)
- concise — total output is under 3000 characters
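The text-based checks above can be sketched in Python as follows. This is an illustration only, not the contents of scripts/score.sh. It assumes plain substring matching, uses placeholder blocklist phrases (the real ones live in scripts/sidetrack_blocklist.txt), and leaves out diagram_parseable because that check needs an actual Mermaid parser. The names run_checks and extract_diagram are hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example

REAL_MODULES = ("BlogController", "EngineeringBlog", "Blogex", "Router", "PageController")
REAL_FUNCTIONS = ("posts_by_tag", "get_post!", "all_posts", "paginate", "resolve_blog", "render")
# Placeholder phrases only; not the project's actual blocklist.
EXAMPLE_BLOCKLIST = ("code review", "i recommend refactoring")
MAX_CHARS = 3000

MERMAID_BLOCK = re.compile(
    re.escape(FENCE) + r"mermaid\s*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_diagram(output):
    """Return the body of the first mermaid block, or "" if there is none."""
    match = MERMAID_BLOCK.search(output)
    return match.group(1) if match else ""


def run_checks(output, blocklist=EXAMPLE_BLOCKLIST):
    """Evaluate five of the six binary checks on one model output."""
    diagram = extract_diagram(output)
    lowered = output.lower()
    modules_hit = sum(1 for name in REAL_MODULES if name in diagram)
    functions_hit = sum(1 for name in REAL_FUNCTIONS if name in diagram)
    return {
        "has_diagram": (FENCE + "mermaid") in output and "sequenceDiagram" in output,
        "uses_real_modules": modules_hit >= 2,
        "uses_real_functions": functions_hit >= 1,
        "no_sidetracking": not any(phrase in lowered for phrase in blocklist),
        "concise": len(output) < MAX_CHARS,
    }
```

Substring matching is deliberately loose: "render" also matches "rendered", and "Blogex.Router" counts toward both Blogex and Router. A stricter version would match on word boundaries.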
Files in Scope
| File | Agent may edit? |
|---|---|
| skill/SKILL.md | ✅ YES — the only file the agent modifies |
| benchmark/tasks.jsonl | ❌ NO |
| scripts/score.sh | ❌ NO |
| scripts/run_one.sh | ❌ NO |
| scripts/sidetrack_blocklist.txt | ❌ NO |
| autoresearch.sh | ❌ NO |
| autoresearch.checks.sh | ❌ NO |
Constraints
- SKILL.md must stay under 1500 tokens.
- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
- SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase, not just Firehose.
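A pre-flight check for the first two constraints might look like the Python sketch below. It rests on loud assumptions: tokens are estimated at roughly four characters each because no tokenizer is specified here, and "code from the Firehose codebase" is approximated by a short list of names taken from this document. The function names are hypothetical, and this is not what autoresearch.checks.sh does. The third constraint (staying generic) still needs human judgment.

```python
MAX_TOKENS = 1500
CHARS_PER_TOKEN = 4  # assumption: crude average, not a real tokenizer

# Assumption: names from this document used as a proxy for Firehose-specific content.
FIREHOSE_NAMES = ("Firehose", "Blogex", "EngineeringBlog", "ReleaseNotes")


def estimate_tokens(text):
    """Rough token estimate, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def check_constraints(skill_text):
    """Return a list of human-readable violations (empty if none were found)."""
    violations = []
    tokens = estimate_tokens(skill_text)
    if tokens >= MAX_TOKENS:
        violations.append(f"estimated {tokens} tokens, limit is under {MAX_TOKENS}")
    lowered = skill_text.lower()
    for name in FIREHOSE_NAMES:
        if name.lower() in lowered:
            violations.append(f"mentions codebase-specific name: {name}")
    return violations
```

Because the token estimate is approximate, a result near the limit should be confirmed with the target model's own tokenizer.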
What Has Been Tried
(autoresearch fills this in)
Dead Ends
(autoresearch fills this in)
Key Wins
(autoresearch fills this in)