# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix

## Objective

Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams
from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B
model running on CPU. The primary failure mode is **sidetracking**: the model
abandons the diagram task and starts reviewing or critiquing the code instead.

## Primary Metric

**score**: higher is better. The scale is 0–18: the sum of 6 binary evals across 3 test inputs (6 × 3 = 18).
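
The aggregation can be sketched in a few lines of Python. This is only an illustration: the actual scoring is done by the project's bash scripts, which are not shown here, and the function name `total_score` and the per-run dict layout are assumptions. The eval names are taken from the Eval Criteria section.

```python
# Illustrative sketch only; the real scoring lives in bash scripts.
# Eval names come from the "Eval Criteria" section of this document.
EVALS = (
    "has_diagram",
    "diagram_parseable",
    "uses_real_modules",
    "uses_real_functions",
    "no_sidetracking",
    "concise",
)


def total_score(results):
    """Sum binary eval results over all test runs.

    `results` is a list with one dict per test input, mapping each
    eval name to True/False (a hypothetical layout). A missing eval
    counts as a failure.
    """
    return sum(int(bool(run.get(name, False))) for run in results for name in EVALS)


# Three test inputs, one failed check in the second run.
all_pass = {name: True for name in EVALS}
one_fail = dict(all_pass, concise=False)
print(total_score([all_pass, one_fail, all_pass]))  # 17
```

With 6 evals and 3 inputs, a perfect run yields 18 and every failed check subtracts exactly one point.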

## Secondary Metrics

- **sidetrack_count**: number of test runs containing review/critique language (lower is better)
- **parse_count**: number of outputs that contain a parseable `sequenceDiagram` (higher is better)
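
One plausible way to derive both counts from per-run eval results is sketched below. It assumes that a run "contains review/critique language" exactly when its `no_sidetracking` check fails, and that an output "contains a parseable `sequenceDiagram`" exactly when `diagram_parseable` passes. The function name and dict layout are hypothetical; the real scripts may compute these differently.

```python
def secondary_metrics(results):
    """Derive the secondary metrics from per-run binary eval results.

    `results` is a list with one dict per test run (hypothetical layout)
    holding at least the keys "no_sidetracking" and "diagram_parseable".
    """
    # A run counts as sidetracked when the no_sidetracking check failed.
    sidetrack_count = sum(1 for run in results if not run["no_sidetracking"])
    # A run counts as parseable when the diagram_parseable check passed.
    parse_count = sum(1 for run in results if run["diagram_parseable"])
    return {"sidetrack_count": sidetrack_count, "parse_count": parse_count}
```

Tracking these separately from the total score makes it easier to tell whether a change to the skill reduced sidetracking or only improved diagram syntax.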

## Architecture

Pi runs the skill against the Firehose codebase (mounted in the workspace) using
the target model. Scoring is done by bash scripts; no judge model is needed.

## The Codebase Under Test

**Firehose** is a Phoenix blogging platform with a monorepo structure:

- `app/`: Phoenix web app (OTP app: `:firehose`)
  - `lib/firehose_web/router.ex`: routes
  - `lib/firehose_web/controllers/blog_controller.ex`: blog actions
  - `lib/firehose_web/controllers/page_controller.ex`: homepage
  - `lib/firehose/blogs/`: blog context modules (EngineeringBlog, ReleaseNotes)
- `blogex/`: sibling library providing a compile-time blog engine
  - `lib/blogex/blog.ex`: `use Blogex.Blog` macro (NimblePublisher)
  - `lib/blogex/components.ex`: Phoenix function components (`post_meta`, `tag_list`, etc.)
  - `lib/blogex/router.ex`: API/feed routes

**Key architectural fact:** Blogex uses NimblePublisher. All blog posts are compiled
into BEAM module attributes at build time. There is NO runtime file I/O for reading
posts. Functions like `all_posts/0`, `get_post!/1`, and `posts_by_tag/1` read from
`@posts` module attributes. This is the #1 thing models get wrong.

## Test Inputs (3 scenarios)

### 1. Click tag on post (small)
"Generate a sequence diagram for: a user on a blog post page clicks a tag link
(e.g., 'elixir'). Trace the full request from browser through to rendered response."

### 2. Show homepage (small)
"Generate a sequence diagram for: a user visits the homepage (GET /).
Trace from browser through to rendered HTML."

### 3. Add blog post on disk (larger, crosses compile/runtime boundary)
"Generate a sequence diagram for: a developer creates a new markdown file in
priv/blog/engineering/. Trace what happens from file creation through to the
post being visible on the blog. Include the compile-time and runtime phases."

## Eval Criteria (6 binary checks)

1. **has_diagram**: output contains `` ```mermaid `` and `sequenceDiagram`
2. **diagram_parseable**: the mermaid block is syntactically valid
3. **uses_real_modules**: diagram mentions at least 2 of `BlogController`, `EngineeringBlog`, `Blogex`, `Router`, `PageController`
4. **uses_real_functions**: diagram mentions at least 1 of `posts_by_tag`, `get_post!`, `all_posts`, `paginate`, `resolve_blog`, `render`
5. **no_sidetracking**: output does NOT contain code review language (see blocklist)
6. **concise**: total output is under 3000 characters
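
Five of the six checks reduce to simple string tests, sketched below in Python for illustration. The real checks live in `scripts/score.sh`, whose contents are not shown here, so the helper names, the plain substring matching, and the case-insensitive blocklist comparison are all assumptions. `diagram_parseable` is left out because it needs an actual Mermaid parser.

```python
import re

# Names taken from the eval criteria above.
MODULES = ("BlogController", "EngineeringBlog", "Blogex", "Router", "PageController")
FUNCTIONS = ("posts_by_tag", "get_post!", "all_posts", "paginate", "resolve_blog", "render")
FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


def extract_diagram(output):
    """Return the body of the first mermaid fence, or '' if there is none."""
    match = re.search(FENCE + r"mermaid\s*\n(.*?)" + FENCE, output, re.DOTALL)
    return match.group(1) if match else ""


def run_checks(output, blocklist):
    """Run the string-based checks on one model output.

    `blocklist` is a list of review/critique phrases; the contents of the
    real blocklist file are not known here, so callers must supply it.
    Matching is by substring, so "Blogex.Router" counts as two modules.
    """
    diagram = extract_diagram(output)
    lowered = output.lower()
    return {
        "has_diagram": (FENCE + "mermaid") in output and "sequenceDiagram" in output,
        "uses_real_modules": sum(name in diagram for name in MODULES) >= 2,
        "uses_real_functions": any(name in diagram for name in FUNCTIONS),
        "no_sidetracking": not any(phrase.lower() in lowered for phrase in blocklist),
        "concise": len(output) < 3000,
    }
```

Restricting the module and function checks to the extracted diagram body, rather than the whole output, follows the wording "diagram mentions"; a stricter variant could match whole identifiers instead of substrings.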

## Files in Scope

| File | Agent may edit? |
|------|-----------------|
| `skill/SKILL.md` | ✅ YES (the only file the agent modifies) |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |

## Constraints

- SKILL.md must stay under 1500 tokens.
- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
- SKILL.md must remain generic: it should work for any Elixir/Phoenix codebase,
  not just Firehose.

## What Has Been Tried

(autoresearch fills this in)

## Dead Ends

(autoresearch fills this in)

## Key Wins

(autoresearch fills this in)