Autoresearch: Sequence Diagram Skill for Elixir/Phoenix

Objective

Optimize a pi skill (skill/SKILL.md) that generates Mermaid sequence diagrams from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B model running on CPU. The primary failure mode is sidetracking — the model abandons the diagram task and starts reviewing/critiquing the code instead.

Primary Metric

score — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).
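As a rough illustration of how the 0–18 total comes together, the sketch below sums binary check results across runs. The function name total_score and the input layout (one line per test run, six space-separated 0/1 values) are assumptions made for this example; the real scripts/score.sh may organize its output differently.

```shell
# Hypothetical helper: add up binary eval results into a single score.
# Reads stdin, one line per test run, each line holding six 0/1 values.
total_score() {
  local total=0 line bit
  while IFS= read -r line; do
    # Unquoted on purpose: split the line into its individual 0/1 fields.
    for bit in $line; do
      total=$((total + bit))
    done
  done
  echo "$total"
}

# Usage (three runs, six checks each):
#   printf '%s\n' "1 1 1 1 0 1" "1 1 0 1 1 1" "1 0 0 0 1 1" | total_score
```

A perfect result is three lines of six 1s, which sums to 18.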

Secondary Metrics

  • sidetrack_count — number of test runs containing review/critique language (lower is better)
  • parse_count — number of outputs that contain a parseable sequenceDiagram (higher is better)
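One way sidetrack_count could be computed is sketched below. It assumes the blocklist holds one phrase per line and that phrases are matched case-insensitively as fixed strings; neither detail is confirmed by this document, and the function names are hypothetical.

```shell
# Hypothetical: succeed if the output file contains any blocklisted phrase.
# Assumption: one phrase per line, blank lines ignored, case-insensitive match.
is_sidetracked() {
  local blocklist="$1" output="$2" phrase
  while IFS= read -r phrase || [ -n "$phrase" ]; do
    [ -z "$phrase" ] && continue
    if grep -qiF -- "$phrase" "$output"; then
      return 0
    fi
  done < "$blocklist"
  return 1
}

# Count how many of the given output files are sidetracked.
sidetrack_count() {
  local blocklist="$1" count=0 f
  shift
  for f in "$@"; do
    if is_sidetracked "$blocklist" "$f"; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Usage (output file names are placeholders):
#   sidetrack_count scripts/sidetrack_blocklist.txt run1.md run2.md run3.md
```

Skipping blank lines matters here because an empty fixed-string pattern matches every line, which would flag every run as sidetracked.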

Architecture

Pi runs the skill against the Firehose codebase (mounted in the workspace) using the target model. Scoring is done by bash scripts — no judge model needed.

The Codebase Under Test

Firehose — a Phoenix blogging platform with a monorepo structure:

  • app/ — Phoenix web app (OTP app: :firehose)
    • lib/firehose_web/router.ex — routes
    • lib/firehose_web/controllers/blog_controller.ex — blog actions
    • lib/firehose_web/controllers/page_controller.ex — homepage
    • lib/firehose/blogs/ — blog context modules (EngineeringBlog, ReleaseNotes)
  • blogex/ — sibling library for compile-time blog engine
    • lib/blogex/blog.ex — use Blogex.Blog macro (NimblePublisher)
    • lib/blogex/components.ex — Phoenix function components (post_meta, tag_list, etc.)
    • lib/blogex/router.ex — API/feed routes

Key architectural fact: Blogex uses NimblePublisher. All blog posts are compiled into BEAM module attributes at build time. There is NO runtime file I/O for reading posts. Functions like all_posts/0, get_post!/1, posts_by_tag/1 read from @posts module attributes. This is the #1 thing models get wrong.

Test Inputs (3 scenarios)

1. Click tag on post (small)

"Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full request from browser through to rendered response."

2. Show homepage (small)

"Generate a sequence diagram for: a user visits the homepage (GET /). Trace from browser through to rendered HTML."

3. Add blog post on disk (larger, crosses compile/runtime boundary)

"Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/. Trace what happens from file creation through to the post being visible on the blog. Include the compile-time and runtime phases."

Eval Criteria (6 binary checks)

  1. has_diagram — output contains ```mermaid and sequenceDiagram
  2. diagram_parseable — the mermaid block is syntactically valid
  3. uses_real_modules — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
  4. uses_real_functions — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
  5. no_sidetracking — output does NOT contain code review language (see blocklist)
  6. concise — total output is under 3000 characters
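Four of these checks reduce to simple text matching, sketched below as shell functions that take an output file path. This is an illustration under stated assumptions, not the contents of scripts/score.sh: the function bodies and the helpers extract_diagram and count_mentions are hypothetical, and diagram_parseable and no_sidetracking are left out because they depend on a Mermaid parser and the blocklist respectively.

```shell
# Three backticks, built from octal escapes so this snippet holds no literal fence.
FENCE=$(printf '\140\140\140')

# Check 1: output contains a mermaid fence and the sequenceDiagram keyword.
has_diagram() {
  grep -qF -- "${FENCE}mermaid" "$1" && grep -qF -- 'sequenceDiagram' "$1"
}

# Print the lines between the mermaid fence and the next closing fence.
extract_diagram() {
  awk -v fence="$FENCE" '
    index($0, fence "mermaid") { inside = 1; next }
    inside && index($0, fence) == 1 { inside = 0 }
    inside
  ' "$1"
}

# Count how many of the given names occur in the text (substring match).
count_mentions() {
  local text="$1" n=0 name
  shift
  for name in "$@"; do
    if printf '%s\n' "$text" | grep -qF -- "$name"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Check 3: diagram mentions at least 2 of the listed modules.
uses_real_modules() {
  local d
  d=$(extract_diagram "$1")
  [ "$(count_mentions "$d" BlogController EngineeringBlog Blogex Router PageController)" -ge 2 ]
}

# Check 4: diagram mentions at least 1 of the listed functions.
uses_real_functions() {
  local d
  d=$(extract_diagram "$1")
  [ "$(count_mentions "$d" posts_by_tag 'get_post!' all_posts paginate resolve_blog render)" -ge 1 ]
}

# Check 6: total output is under 3000 characters.
concise() {
  local chars
  chars=$(wc -m < "$1" | tr -d '[:space:]')
  [ "$chars" -lt 3000 ]
}
```

Because matching is by substring, a label such as "rendered" would satisfy the render entry, and Blogex.Router would count toward both Blogex and Router. Whether the real scorer is that lenient is not stated here.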

Files in Scope

File                               Agent may edit?
skill/SKILL.md                     YES — the only file the agent modifies
benchmark/tasks.jsonl              NO
scripts/score.sh                   NO
scripts/run_one.sh                 NO
scripts/sidetrack_blocklist.txt    NO
autoresearch.sh                    NO
autoresearch.checks.sh             NO

Constraints

  • SKILL.md must stay under 1500 tokens.
  • SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
  • SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase, not just Firehose.
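The token limit can be sanity-checked before a run with a rough estimate like the one below. The 4-bytes-per-token ratio is only a common rule of thumb and an assumption here, not the target model's tokenizer, and the function names are hypothetical; an exact check would need the model's own tokenizer.

```shell
# Rough token estimate: byte count divided by 4, rounded up.
# ASSUMPTION: about 4 bytes per token; the real tokenizer will differ.
approx_tokens() {
  local bytes
  bytes=$(wc -c < "$1" | tr -d '[:space:]')
  echo $(( (bytes + 3) / 4 ))
}

# Succeed if the estimate is under the budget (default 1500 tokens).
within_token_budget() {
  local limit="${2:-1500}"
  [ "$(approx_tokens "$1")" -lt "$limit" ]
}

# Usage:
#   within_token_budget skill/SKILL.md || echo "SKILL.md may exceed 1500 tokens"
```

Since the estimate is approximate, it makes sense to leave some headroom below the limit rather than treat a pass as proof of compliance.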

What Has Been Tried

(autoresearch fills this in)

Dead Ends

(autoresearch fills this in)

Key Wins

(autoresearch fills this in)