# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix
## Objective
Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams
from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B
model running on CPU. The primary failure mode is **sidetracking** — the model
abandons the diagram task and starts reviewing/critiquing the code instead.
## Primary Metric
**score** — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).
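
The arithmetic behind the score can be sketched as follows. This is a minimal illustration, not the contents of `scripts/score.sh`: the function name `total_score` and the input layout (one line per test input, six space-separated 0/1 values per line) are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: add up binary eval results into one score.
# Assumed input on stdin: one line per test input, six 0/1 values per line.
total_score() {
  local sum=0 line v
  while read -r line; do
    # Word-splitting on $line is intentional: each word is a 0 or a 1.
    for v in $line; do
      sum=$((sum + v))
    done
  done
  echo "$sum"
}

# Three test inputs with made-up results: 5 + 5 + 3 passing checks.
printf '%s\n' "1 1 1 1 0 1" "1 1 1 0 1 1" "1 0 0 0 1 1" | total_score   # prints 13
```

A perfect run passes all six checks on all three inputs, giving the maximum of 18.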
## Secondary Metrics
- **sidetrack_count** — number of test runs containing review/critique language (lower is better)
- **parse_count** — number of outputs that contain a parseable sequenceDiagram (higher is better)
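
As a rough illustration of how `sidetrack_count` could be computed, the sketch below counts output files that contain at least one blocklist phrase. It assumes the blocklist holds one plain phrase per line and that matching is case-insensitive; the real format of `scripts/sidetrack_blocklist.txt` is not described here, and the helper name, file names, and phrases are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count output files containing any blocklist phrase.
# Usage: sidetrack_count BLOCKLIST OUTPUT_FILE...
sidetrack_count() {
  local blocklist=$1
  shift
  # -l lists each matching file once, -i ignores case,
  # -F treats every blocklist line as a fixed string, not a regex.
  grep -l -i -F -f "$blocklist" -- "$@" 2>/dev/null | wc -l | tr -d ' '
}

# Demo with made-up phrases and outputs.
demo_dir=$(mktemp -d)
printf '%s\n' "code smell" "I recommend" > "$demo_dir/blocklist.txt"
printf 'sequenceDiagram\n    Browser->>Router: GET /\n' > "$demo_dir/run1.md"
printf 'This controller has a Code Smell.\n' > "$demo_dir/run2.md"
sidetrack_count "$demo_dir/blocklist.txt" "$demo_dir/run1.md" "$demo_dir/run2.md"   # prints 1
rm -rf "$demo_dir"
```

Note that a blank line in the blocklist would match every file under `-F`, so a real blocklist should not contain empty lines.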
## Architecture
Pi runs the skill against the Firehose codebase (mounted in the workspace) using
the target model. Scoring is done by bash scripts — no judge model needed.
## The Codebase Under Test
**Firehose** — a Phoenix blogging platform with a monorepo structure:
- `app/` — Phoenix web app (OTP app: `:firehose`)
  - `lib/firehose_web/router.ex` — routes
  - `lib/firehose_web/controllers/blog_controller.ex` — blog actions
  - `lib/firehose_web/controllers/page_controller.ex` — homepage
  - `lib/firehose/blogs/` — blog context modules (EngineeringBlog, ReleaseNotes)
- `blogex/` — sibling library for compile-time blog engine
  - `lib/blogex/blog.ex` — `use Blogex.Blog` macro (NimblePublisher)
  - `lib/blogex/components.ex` — Phoenix function components (post_meta, tag_list, etc.)
  - `lib/blogex/router.ex` — API/feed routes
**Key architectural fact:** Blogex uses NimblePublisher. All blog posts are compiled
into BEAM module attributes at build time. There is NO runtime file I/O for reading
posts. Functions like `all_posts/0`, `get_post!/1`, `posts_by_tag/1` read from
`@posts` module attributes. This is the #1 thing models get wrong.
## Test Inputs (3 scenarios)
### 1. Click tag on post (small)
"Generate a sequence diagram for: a user on a blog post page clicks a tag link
(e.g., 'elixir'). Trace the full request from browser through to rendered response."
### 2. Show homepage (small)
"Generate a sequence diagram for: a user visits the homepage (GET /).
Trace from browser through to rendered HTML."
### 3. Add blog post on disk (larger, crosses compile/runtime boundary)
"Generate a sequence diagram for: a developer creates a new markdown file in
priv/blog/engineering/. Trace what happens from file creation through to the
post being visible on the blog. Include the compile-time and runtime phases."
## Eval Criteria (6 binary checks)
1. **has_diagram** — output contains `` ```mermaid `` and `sequenceDiagram`
2. **diagram_parseable** — the mermaid block is syntactically valid
3. **uses_real_modules** — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
4. **uses_real_functions** — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
5. **no_sidetracking** — output does NOT contain code review language (see blocklist)
6. **concise** — total output is under 3000 characters
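
Three of the checks above (3, 4, and 6) reduce to substring counting and a length test. The sketch below shows one way they might look in shell. It is an assumption-laden illustration rather than the actual `scripts/score.sh`: the function names are hypothetical, each function reads the text to check on stdin, and the sample diagram is made up.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of checks 3, 4 and 6. Each check reads text on stdin
# and returns 0 (pass) or non-zero (fail).

# Print how many of the given names occur at least once in the text on stdin.
count_mentions() {
  local text n=0 name
  text=$(cat)
  for name in "$@"; do
    if printf '%s\n' "$text" | grep -q -F -- "$name"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Check 3: at least 2 of the listed module names appear.
uses_real_modules() {
  [ "$(count_mentions BlogController EngineeringBlog Blogex Router PageController)" -ge 2 ]
}

# Check 4: at least 1 of the listed function names appears.
uses_real_functions() {
  [ "$(count_mentions posts_by_tag 'get_post!' all_posts paginate resolve_blog render)" -ge 1 ]
}

# Check 6: fewer than 3000 characters (trailing newlines are not counted).
concise() {
  local text
  text=$(cat)
  [ "${#text}" -lt 3000 ]
}

# Made-up diagram used only to exercise the checks.
sample='sequenceDiagram
    Browser->>Router: GET /
    Router->>PageController: dispatch
    PageController-->>Browser: render HTML'
printf '%s\n' "$sample" | uses_real_modules   && echo "uses_real_modules: pass"
printf '%s\n' "$sample" | uses_real_functions && echo "uses_real_functions: pass"
printf '%s\n' "$sample" | concise             && echo "concise: pass"
```

Plain substring matching is deliberately loose: `render` also matches `rendered`, and `Blogex.Router` counts as a mention of both `Blogex` and `Router`. Whether the real scorer is stricter is not specified here.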
## Files in Scope
| File | Agent may edit? |
|------|-----------------|
| `skill/SKILL.md` | ✅ YES — the only file the agent modifies |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |
## Constraints
- SKILL.md must stay under 1500 tokens.
- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
- SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase,
not just Firehose.
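
A quick pre-flight guard for the token limit might look like the sketch below. It does not use a real tokenizer: it estimates tokens as bytes divided by 4, which is only a rough heuristic for English text and may differ noticeably from the target model's actual token count. The function names and the default budget argument are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical guard: estimate the token count of a file as ceil(bytes / 4).
# This is a heuristic, not the tokenizer of the target model.
approx_tokens() {
  local bytes
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo $(( (bytes + 3) / 4 ))
}

# Succeeds when the estimate is below the budget (default 1500).
within_budget() {
  [ "$(approx_tokens "$1")" -lt "${2:-1500}" ]
}

# Demo on a throwaway file with made-up content.
demo_file=$(mktemp)
printf 'abcdefgh' > "$demo_file"
approx_tokens "$demo_file"   # prints 2
within_budget "$demo_file" && echo "within budget"
rm -f "$demo_file"
```

Because the estimate is coarse, it is safer to treat it as an early warning and leave headroom below 1500 rather than rely on it as the final check.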
## What Has Been Tried
(autoresearch fills this in)
## Dead Ends
(autoresearch fills this in)
## Key Wins
(autoresearch fills this in)