# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix

## Objective

Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B model running on CPU. The primary failure mode is **sidetracking**: the model abandons the diagram task and starts reviewing/critiquing the code instead.

## Primary Metric

**score**: higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).

## Secondary Metrics

- **sidetrack_count**: number of test runs containing review/critique language (lower is better)
- **parse_count**: number of outputs that contain a parseable sequenceDiagram (higher is better)

## Architecture

Pi runs the skill against the Firehose codebase (mounted in the workspace) using the target model. Scoring is done by bash scripts; no judge model is needed.

## The Codebase Under Test

**Firehose** is a Phoenix blogging platform with a monorepo structure:

- `app/`: Phoenix web app (OTP app: `:firehose`)
  - `lib/firehose_web/router.ex`: routes
  - `lib/firehose_web/controllers/blog_controller.ex`: blog actions
  - `lib/firehose_web/controllers/page_controller.ex`: homepage
  - `lib/firehose/blogs/`: blog context modules (EngineeringBlog, ReleaseNotes)
- `blogex/`: sibling library for compile-time blog engine
  - `lib/blogex/blog.ex`: `use Blogex.Blog` macro (NimblePublisher)
  - `lib/blogex/components.ex`: Phoenix function components (post_meta, tag_list, etc.)
  - `lib/blogex/router.ex`: API/feed routes

**Key architectural fact:** Blogex uses NimblePublisher. All blog posts are compiled into BEAM module attributes at build time. There is NO runtime file I/O for reading posts. Functions like `all_posts/0`, `get_post!/1`, `posts_by_tag/1` read from `@posts` module attributes. This is the #1 thing models get wrong.

## Test Inputs (3 scenarios)

### 1. Click tag on post (small)

"Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full request from browser through to rendered response."

### 2. Show homepage (small)

"Generate a sequence diagram for: a user visits the homepage (GET /). Trace from browser through to rendered HTML."

### 3. Add blog post on disk (larger, crosses compile/runtime boundary)

"Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/. Trace what happens from file creation through to the post being visible on the blog. Include the compile-time and runtime phases."

## Eval Criteria (6 binary checks)

1. **has_diagram**: output contains `` ```mermaid `` and `sequenceDiagram`
2. **diagram_parseable**: the mermaid block is syntactically valid
3. **uses_real_modules**: diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
4. **uses_real_functions**: diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
5. **no_sidetracking**: output does NOT contain code review language (see blocklist)
6. **concise**: total output is under 3000 characters

## Files in Scope

| File | Agent may edit? |
|------|-----------------|
| `skill/SKILL.md` | ✅ YES (the only file the agent modifies) |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |

## Constraints

- SKILL.md must stay under 1500 tokens.
- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
- SKILL.md must remain generic: it should work for any Elixir/Phoenix codebase, not just Firehose.

## What Has Been Tried

(autoresearch fills this in)

## Dead Ends

(autoresearch fills this in)

## Key Wins

(autoresearch fills this in)
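
## Appendix: Sketch of the Eval Checks

The binary checks listed under Eval Criteria can be illustrated with a short sketch. This is not the contents of `scripts/score.sh` (which is a bash script and is not shown here); it is a minimal Python approximation under stated assumptions. The names `extract_diagram` and `score_output` are hypothetical, "mentions" is assumed to mean plain substring matching inside the first mermaid fence, and the blocklist phrases are supplied by the caller because the contents of `scripts/sidetrack_blocklist.txt` are not given. The `diagram_parseable` check is omitted since it needs a real Mermaid parser, so this sketch covers 5 of the 6 checks.

```python
import re

# Module and function names copied from the Eval Criteria list.
REAL_MODULES = ["BlogController", "EngineeringBlog", "Blogex", "Router", "PageController"]
REAL_FUNCTIONS = ["posts_by_tag", "get_post!", "all_posts", "paginate", "resolve_blog", "render"]
MAX_CHARS = 3000

# Built at runtime so this example contains no literal triple backticks.
FENCE = "`" * 3


def extract_diagram(output: str) -> str:
    """Return the body of the first closed mermaid fence, or "" if none exists."""
    pattern = re.escape(FENCE) + r"mermaid\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, output, re.DOTALL)
    return match.group(1) if match else ""


def score_output(output: str, blocklist: list[str]) -> dict[str, bool]:
    """Approximate five of the six binary checks for one model output.

    Assumption: `blocklist` holds review/critique phrases, matched
    case-insensitively against the whole output.
    """
    diagram = extract_diagram(output)
    lowered = output.lower()
    return {
        "has_diagram": "sequenceDiagram" in diagram,
        "uses_real_modules": sum(m in diagram for m in REAL_MODULES) >= 2,
        "uses_real_functions": any(f in diagram for f in REAL_FUNCTIONS),
        "no_sidetracking": not any(p.lower() in lowered for p in blocklist),
        "concise": len(output) < MAX_CHARS,
    }
```

Summing the boolean values for one output gives that run's partial score. Note that this sketch requires a closing fence for `has_diagram`, which is slightly stricter than the criterion as written.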