initial setup for autoresearch of sequence diagram prompt

This commit is contained in:
Firehose Bot 2026-03-21 15:39:15 +00:00
parent 419e5dd5bd
commit f4d992f0d6
11 changed files with 601 additions and 0 deletions

sequence-diagram-skill/.gitignore

@@ -0,0 +1,10 @@
# autoresearch session
autoresearch.jsonl
autoresearch.ideas.md
# temp
.tmp_*
*.tmp
# OS
.DS_Store

sequence-diagram-skill/README.md

@@ -0,0 +1,80 @@
# Sequence Diagram Skill — Autoresearch
Optimizes a pi skill for generating Mermaid sequence diagrams from
Elixir/Phoenix codebases, using [pi-autoresearch](https://github.com/davebcn87/pi-autoresearch).
## The Problem
Small local models (Qwen3.5-35B-A3B) produce great sequence diagrams for
well-represented languages (C#, Java) but go off the rails with Elixir/Phoenix —
sidetracking into imaginary code reviews instead of finishing the diagram.
## How It Works
The autoresearch loop mutates `skill/SKILL.md`, runs it against 3 scenarios
from a real Phoenix codebase (Firehose), and scores with **zero-judge-model
bash evals**:
| Eval | Check | Tool |
|------|-------|------|
| has_diagram | Output has `` ```mermaid `` + `sequenceDiagram` | grep |
| diagram_parseable | Valid mermaid syntax (participants + messages) | grep / mmdc |
| uses_real_modules | ≥2 actual module names from codebase | grep |
| uses_real_functions | ≥1 actual function name | grep |
| no_sidetracking | No review/critique language | grep against blocklist |
| concise | Under 3000 chars | wc |
3 tasks × 6 evals = 18 max score.
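The roll-up from per-task scores to the 18-point headline metric can be sketched without jq (the real pipeline uses `scripts/score.sh` plus jq); the three JSON lines below are illustrative, not real runs:

```bash
# Illustrative roll-up: three per-task score JSONs, as emitted by score.sh,
# summed into the 0-18 headline metric. All values here are made up.
scores='{"score":6,"has_diagram":1}
{"score":5,"has_diagram":1}
{"score":4,"has_diagram":1}'

# awk-only extraction of the "score" field, so this sketch needs no jq
total=$(printf '%s\n' "$scores" | awk -F'"score":' '{split($2, a, /[,}]/); s += a[1]} END {print s}')
echo "METRIC score=${total}"
```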
## Setup
1. Clone the Firehose repo into `workspace/`:
   ```bash
   git clone https://gitea.apps.sustainabledelivery.com/mostalive/firehose workspace
   ```
2. Make scripts executable:
   ```bash
   chmod +x autoresearch.sh autoresearch.checks.sh scripts/*.sh
   ```
3. Configure model access in `scripts/config.env`:
   - Local: leave `SSH_TARGET` empty, have pi configured with your model
   - Remote: set `SSH_TARGET=analyst@your-host` and `SSH_PORT=2222`
4. Init git and start:
   ```bash
   git init && git add -A && git commit -m "initial"
   pi
   # then: /autoresearch
   ```
## Project Structure
```
sequence-diagram-skill/
├── autoresearch.md            # Session doc (pi reads this)
├── autoresearch.sh            # Benchmark runner
├── autoresearch.checks.sh     # Sanity checks on SKILL.md
├── skill/
│   └── SKILL.md               # THE FILE BEING OPTIMIZED
├── benchmark/
│   └── tasks.jsonl            # 3 test scenarios
├── scripts/
│   ├── config.env             # Endpoint config
│   ├── run_one.sh             # Run pi with skill + single task
│   ├── score.sh               # Score a single output (6 binary evals)
│   └── sidetrack_blocklist.txt  # Phrases that indicate off-task behavior
└── workspace/                 # Clone of Firehose repo (mounted/symlinked)
```
## Mutation Ideas for the Agent
The autoresearch agent only edits `skill/SKILL.md`. Good mutations include:
- Stronger "do not review" constraints
- Explicit Elixir/Phoenix vocabulary hints (NimblePublisher, module attributes)
- Output format enforcement (ONLY the mermaid block, nothing else)
- Step-by-step process instructions (read router first, then controller, etc.)
- Short generic example of a good sequence diagram
- Negative examples ("do NOT include suggestions or improvements")

sequence-diagram-skill/autoresearch.checks.sh

@@ -0,0 +1,57 @@
#!/usr/bin/env bash
set -euo pipefail
# ─── autoresearch.checks.sh ─────────────────────────────────────────────────
# Backpressure checks for the sequence diagram skill.
# ─────────────────────────────────────────────────────────────────────────────
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SKILL_FILE="${SCRIPT_DIR}/skill/SKILL.md"
ERRORS=0
# 1. Skill exists and is non-empty
if [[ ! -s "$SKILL_FILE" ]]; then
  echo "FAIL: skill/SKILL.md is missing or empty"
  ERRORS=$((ERRORS + 1))
fi

# 2. Skill is not trivially short
CHAR_COUNT=$(wc -c < "$SKILL_FILE" 2>/dev/null || echo "0")
if (( CHAR_COUNT < 200 )); then
  echo "FAIL: skill/SKILL.md is only ${CHAR_COUNT} chars (min: 200)"
  ERRORS=$((ERRORS + 1))
fi

# 3. Skill is not too long (rough token proxy: 1500 tokens ≈ 6000 chars)
if (( CHAR_COUNT > 6000 )); then
  echo "FAIL: skill/SKILL.md is ${CHAR_COUNT} chars (max: ~6000)"
  ERRORS=$((ERRORS + 1))
fi

# 4. Skill must contain "sequenceDiagram" or "sequence diagram" (it's a diagram skill)
if ! grep -qi 'sequence.diagram' "$SKILL_FILE" 2>/dev/null; then
  echo "FAIL: skill/SKILL.md doesn't mention sequence diagrams"
  ERRORS=$((ERRORS + 1))
fi

# 5. Skill must NOT contain Firehose-specific code (no overfitting)
for term in "BlogController" "EngineeringBlog" "Firehose" "blogex" "priv/blog"; do
  if grep -q "$term" "$SKILL_FILE" 2>/dev/null; then
    echo "FAIL: skill/SKILL.md contains codebase-specific term '${term}'"
    ERRORS=$((ERRORS + 1))
  fi
done

# 6. Valid UTF-8
if ! iconv -f utf-8 -t utf-8 "$SKILL_FILE" > /dev/null 2>&1; then
  echo "FAIL: skill/SKILL.md contains invalid UTF-8"
  ERRORS=$((ERRORS + 1))
fi

if (( ERRORS > 0 )); then
  echo "Checks FAILED with ${ERRORS} error(s)"
  exit 1
else
  echo "All checks passed. Skill: ${CHAR_COUNT} chars."
  exit 0
fi

sequence-diagram-skill/autoresearch.md

@@ -0,0 +1,96 @@
# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix
## Objective
Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams
from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B
model running on CPU. The primary failure mode is **sidetracking** — the model
abandons the diagram task and starts reviewing/critiquing the code instead.
## Primary Metric
**score** — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs).
## Secondary Metrics
- **sidetrack_count** — number of test runs containing review/critique language (lower is better)
- **parse_count** — number of outputs that contain a parseable sequenceDiagram (higher is better)
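Any harness consuming this run only needs to scrape `METRIC` lines from the benchmark's stdout; a minimal sketch (the log text below is fabricated):

```bash
# Minimal METRIC-line scraper. The log content here is illustrative only.
log='Skill: 2400 chars
  [1/3] click-tag...
METRIC score=14
METRIC sidetrack_count=1
METRIC parse_count=3'

# keep only "name value" pairs from well-formed METRIC lines
metrics=$(printf '%s\n' "$log" | sed -n 's/^METRIC \([a-z_]*\)=\([0-9]*\)$/\1 \2/p')
printf '%s\n' "$metrics"
```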
## Architecture
Pi runs the skill against the Firehose codebase (mounted in the workspace) using
the target model. Scoring is done by bash scripts — no judge model needed.
## The Codebase Under Test
**Firehose** — a Phoenix blogging platform with a monorepo structure:
- `app/` — Phoenix web app (OTP app: `:firehose`)
- `lib/firehose_web/router.ex` — routes
- `lib/firehose_web/controllers/blog_controller.ex` — blog actions
- `lib/firehose_web/controllers/page_controller.ex` — homepage
- `lib/firehose/blogs/` — blog context modules (EngineeringBlog, ReleaseNotes)
- `blogex/` — sibling library for compile-time blog engine
- `lib/blogex/blog.ex` — `use Blogex.Blog` macro (NimblePublisher)
- `lib/blogex/components.ex` — Phoenix function components (post_meta, tag_list, etc.)
- `lib/blogex/router.ex` — API/feed routes
**Key architectural fact:** Blogex uses NimblePublisher. All blog posts are compiled
into BEAM module attributes at build time. There is NO runtime file I/O for reading
posts. Functions like `all_posts/0`, `get_post!/1`, `posts_by_tag/1` read from
`@posts` module attributes. This is the #1 thing models get wrong.
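A quick way to convince yourself (or the model) of this fact is to grep the blog module for NimblePublisher and for runtime file reads. The module below is a stand-in written for this sketch, not actual Firehose code:

```bash
# Hypothetical blog module illustrating the compile-time pattern.
demo=$(mktemp -d)
cat > "${demo}/blog.ex" <<'EOF'
defmodule Demo.Blog do
  use NimblePublisher, from: "priv/blog/**/*.md", as: :posts
  def all_posts, do: @posts
end
EOF

# NimblePublisher present and no File.read at runtime => posts live in @posts
if grep -q 'NimblePublisher' "${demo}/blog.ex" && ! grep -q 'File\.read' "${demo}/blog.ex"; then
  verdict="compile-time"
else
  verdict="runtime"
fi
echo "post loading: ${verdict}"
rm -rf "$demo"
```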
## Test Inputs (3 scenarios)
### 1. Click tag on post (small)
"Generate a sequence diagram for: a user on a blog post page clicks a tag link
(e.g., 'elixir'). Trace the full request from browser through to rendered response."
### 2. Show homepage (small)
"Generate a sequence diagram for: a user visits the homepage (GET /).
Trace from browser through to rendered HTML."
### 3. Add blog post on disk (larger, crosses compile/runtime boundary)
"Generate a sequence diagram for: a developer creates a new markdown file in
priv/blog/engineering/. Trace what happens from file creation through to the
post being visible on the blog. Include the compile-time and runtime phases."
## Eval Criteria (6 binary checks)
1. **has_diagram** — output contains `` ```mermaid `` and `sequenceDiagram`
2. **diagram_parseable** — the mermaid block is syntactically valid
3. **uses_real_modules** — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController
4. **uses_real_functions** — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render
5. **no_sidetracking** — output does NOT contain code review language (see blocklist)
6. **concise** — total output is under 3000 characters
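Evals 1 and 6 are cheap enough to show inline; the toy output below is fabricated, and the backtick fence is built in a variable only so this snippet can live inside a markdown document:

```bash
fence='```'
# Fabricated model output: a fenced mermaid block
out="${fence}mermaid
sequenceDiagram
    participant Browser
    participant Router
    Browser->>Router: GET /blog/tags/elixir
${fence}"

# Eval 1: both the mermaid fence and the sequenceDiagram keyword must appear
has_diagram=0
if printf '%s' "$out" | grep -q "${fence}mermaid" && printf '%s' "$out" | grep -q 'sequenceDiagram'; then
  has_diagram=1
fi
# Eval 6: total output under 3000 characters
concise=$(( ${#out} < 3000 ? 1 : 0 ))
echo "has_diagram=${has_diagram} concise=${concise}"
```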
## Files in Scope
| File | Agent may edit? |
|------|-----------------|
| `skill/SKILL.md` | ✅ YES — the only file the agent modifies |
| `benchmark/tasks.jsonl` | ❌ NO |
| `scripts/score.sh` | ❌ NO |
| `scripts/run_one.sh` | ❌ NO |
| `scripts/sidetrack_blocklist.txt` | ❌ NO |
| `autoresearch.sh` | ❌ NO |
| `autoresearch.checks.sh` | ❌ NO |
## Constraints
- SKILL.md must stay under 1500 tokens.
- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting).
- SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase,
not just Firehose.
## What Has Been Tried
(autoresearch fills this in)
## Dead Ends
(autoresearch fills this in)
## Key Wins
(autoresearch fills this in)

sequence-diagram-skill/autoresearch.sh

@@ -0,0 +1,101 @@
#!/usr/bin/env bash
set -euo pipefail
# ─── autoresearch.sh ─────────────────────────────────────────────────────────
# Benchmark script for sequence diagram skill optimization.
# Runs all 3 test inputs, scores each, outputs METRIC lines.
# ─────────────────────────────────────────────────────────────────────────────
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/scripts/config.env" 2>/dev/null || true
# Defaults
SSH_TARGET="${SSH_TARGET:-}"
SSH_PORT="${SSH_PORT:-2222}"
export TASK_TIMEOUT="${TASK_TIMEOUT:-180}"
# ─── Pre-checks ──────────────────────────────────────────────────────────────
SKILL_FILE="${SCRIPT_DIR}/skill/SKILL.md"
if [[ ! -s "$SKILL_FILE" ]]; then
  echo "ERROR: skill/SKILL.md is missing or empty"
  exit 1
fi
SKILL_CHARS=$(wc -c < "$SKILL_FILE")
echo "Skill: ${SKILL_CHARS} chars"

TASKS_FILE="${SCRIPT_DIR}/benchmark/tasks.jsonl"
if [[ ! -f "$TASKS_FILE" ]]; then
  echo "ERROR: benchmark/tasks.jsonl not found"
  exit 1
fi
echo "────────────────────────────────────────────────────"
# ─── Run all tasks ───────────────────────────────────────────────────────────
TMPDIR=$(mktemp -d)
TOTAL_SCORE=0
SIDETRACK_COUNT=0
PARSE_COUNT=0
TASK_COUNT=0
START_TIME=$(date +%s)
while IFS= read -r line; do
  TASK_ID=$(echo "$line" | jq -r '.id')
  TASK_PROMPT=$(echo "$line" | jq -r '.prompt')
  TASK_COUNT=$((TASK_COUNT + 1))
  OUTPUT_FILE="${TMPDIR}/${TASK_ID}.txt"
  SCORE_FILE="${TMPDIR}/${TASK_ID}.json"
  echo " [${TASK_COUNT}/3] ${TASK_ID}..."

  # Run the task (stdin detached so child tools can't consume the task list)
  bash "${SCRIPT_DIR}/scripts/run_one.sh" \
    "$TASK_PROMPT" \
    "$OUTPUT_FILE" \
    "$SSH_TARGET" \
    "$SSH_PORT" < /dev/null

  # Score it
  SCORE_JSON=$(bash "${SCRIPT_DIR}/scripts/score.sh" "$OUTPUT_FILE")
  echo "$SCORE_JSON" > "$SCORE_FILE"

  # Extract scores
  TASK_SCORE=$(echo "$SCORE_JSON" | jq -r '.score')
  TASK_SIDETRACK=$(echo "$SCORE_JSON" | jq -r '.no_sidetracking')
  TASK_PARSE=$(echo "$SCORE_JSON" | jq -r '.diagram_parseable')
  TASK_CHARS=$(echo "$SCORE_JSON" | jq -r '.char_count')
  TOTAL_SCORE=$((TOTAL_SCORE + TASK_SCORE))
  if (( TASK_SIDETRACK == 0 )); then
    SIDETRACK_COUNT=$((SIDETRACK_COUNT + 1))
  fi
  if (( TASK_PARSE == 1 )); then
    PARSE_COUNT=$((PARSE_COUNT + 1))
  fi
  echo " score=${TASK_SCORE}/6 sidetrack=$(( 1 - TASK_SIDETRACK )) parseable=${TASK_PARSE} chars=${TASK_CHARS}"
done < "$TASKS_FILE"
END_TIME=$(date +%s)
TOTAL_SECONDS=$((END_TIME - START_TIME))
# ─── Cleanup ─────────────────────────────────────────────────────────────────
rm -rf "$TMPDIR"
# ─── Output METRIC lines ────────────────────────────────────────────────────
echo ""
echo "METRIC score=${TOTAL_SCORE}"
echo "METRIC sidetrack_count=${SIDETRACK_COUNT}"
echo "METRIC parse_count=${PARSE_COUNT}"
echo "METRIC total_seconds=${TOTAL_SECONDS}"
echo "METRIC skill_chars=${SKILL_CHARS}"

sequence-diagram-skill/benchmark/tasks.jsonl

@@ -0,0 +1,3 @@
{"id": "click-tag", "prompt": "Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full HTTP request from browser through the Phoenix router, controller, domain modules, templates, and back to the browser. The codebase is in /home/analyst/workspace/. Read the relevant source files first."}
{"id": "show-homepage", "prompt": "Generate a sequence diagram for: a user visits the homepage (GET /). Trace from the browser's HTTP request through the Phoenix router, controller, template rendering, layout wrapping, and back to the browser. The codebase is in /home/analyst/workspace/. Read the relevant source files first."}
{"id": "add-post", "prompt": "Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/ and the post becomes visible on the blog. Trace what happens including the compile-time phase (NimblePublisher, module recompilation) and the runtime request phase. The codebase is in /home/analyst/workspace/. Read the relevant source files first."}
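Since `autoresearch.sh` trusts these lines blindly (`jq -r '.id'` on each), a pre-flight sanity check can catch a malformed task early. The fixture line below is illustrative; a stricter version would parse each line with jq, which the runner already requires:

```bash
# Pre-flight: every tasks.jsonl line should at least carry "id" and "prompt"
# keys. Fixture file stands in for benchmark/tasks.jsonl.
tasks=$(mktemp)
printf '%s\n' '{"id": "demo", "prompt": "diagram the homepage request"}' > "$tasks"

bad=0
while IFS= read -r line; do
  printf '%s' "$line" | grep -q '"id"' && printf '%s' "$line" | grep -q '"prompt"' || bad=$((bad + 1))
done < "$tasks"
rm -f "$tasks"
echo "malformed lines: ${bad}"
```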

sequence-diagram-skill/scripts/config.env

@@ -0,0 +1,10 @@
# ─── config.env ──────────────────────────────────────────────────────────────
# Leave SSH_TARGET empty to run pi locally (e.g., on your Mac).
# Set it to use the remote pi container.
# Remote pi container (leave empty for local)
SSH_TARGET=""
SSH_PORT=2222
# Timeout per task (seconds)
TASK_TIMEOUT=180

sequence-diagram-skill/scripts/run_one.sh

@@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail
# ─── run_one.sh ──────────────────────────────────────────────────────────────
# Run pi with the sequence-diagram skill on a single task.
# Usage: ./scripts/run_one.sh <task_prompt> <output_file> [ssh_target] [ssh_port]
#
# If ssh_target is provided, runs remotely via SSH into the pi container.
# Otherwise runs pi locally.
# ─────────────────────────────────────────────────────────────────────────────
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
TASK_PROMPT="$1"
OUTPUT_FILE="$2"
SSH_TARGET="${3:-}"
SSH_PORT="${4:-2222}"
TIMEOUT="${TASK_TIMEOUT:-180}"
SKILL_FILE="${PROJECT_DIR}/skill/SKILL.md"
if [[ ! -f "$SKILL_FILE" ]]; then
  echo "ERROR: skill/SKILL.md not found" >&2
  exit 1
fi
SKILL_CONTENT=$(cat "$SKILL_FILE")
# Build the full prompt: skill instructions + task
FULL_PROMPT="## Skill Instructions
${SKILL_CONTENT}
## Task
${TASK_PROMPT}"
if [[ -n "$SSH_TARGET" ]]; then
  # ─── Remote: SSH into pi container ───────────────────────────────────
  PAYLOAD=$(jq -n --arg prompt "$FULL_PROMPT" '{"prompt": $prompt}')
  # "|| true" mirrors the local branch: a failed run scores 0 instead of
  # aborting the whole benchmark under set -e
  ssh -p "$SSH_PORT" \
    -o StrictHostKeyChecking=no \
    -o ConnectTimeout=10 \
    -o BatchMode=yes \
    "$SSH_TARGET" \
    "run-task --stdin --mode print --thinking off --timeout $TIMEOUT" \
    <<< "$PAYLOAD" > "$OUTPUT_FILE" 2>/dev/null || true
else
  # ─── Local: run pi directly ──────────────────────────────────────────
  timeout "${TIMEOUT}s" pi \
    --mode print \
    --no-session \
    --no-extensions \
    --thinking none \
    -p "$FULL_PROMPT" > "$OUTPUT_FILE" 2>/dev/null || true
fi

sequence-diagram-skill/scripts/score.sh

@@ -0,0 +1,109 @@
#!/usr/bin/env bash
set -euo pipefail
# ─── score.sh ────────────────────────────────────────────────────────────────
# Score a single diagram output against 6 binary evals.
# Usage: ./scripts/score.sh <output_file>
# Prints a JSON line with pass/fail for each eval and total score.
# ─────────────────────────────────────────────────────────────────────────────
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
OUTPUT_FILE="$1"
if [[ ! -f "$OUTPUT_FILE" ]]; then
  echo '{"error": "file not found", "score": 0}'
  exit 0
fi
CONTENT=$(cat "$OUTPUT_FILE")
CHAR_COUNT=${#CONTENT}
# ─── Eval 1: has_diagram ─────────────────────────────────────────────────────
# Output contains a mermaid fenced block with sequenceDiagram
has_diagram=0
if echo "$CONTENT" | grep -q '```mermaid' && echo "$CONTENT" | grep -q 'sequenceDiagram'; then
  has_diagram=1
fi
# ─── Eval 2: diagram_parseable ───────────────────────────────────────────────
# Extract the mermaid block and check basic syntax
diagram_parseable=0
if (( has_diagram == 1 )); then
  # Extract mermaid block
  MERMAID_BLOCK=$(echo "$CONTENT" | sed -n '/```mermaid/,/```/p' | sed '1d;$d')
  if [[ -n "$MERMAID_BLOCK" ]]; then
    # Basic syntax checks:
    # - Has "sequenceDiagram" keyword
    # - Has at least one "participant" line
    # - Has at least one "->>", "-->>", or "->" message line
    has_keyword=$(echo "$MERMAID_BLOCK" | grep -c 'sequenceDiagram' || true)
    has_participant=$(echo "$MERMAID_BLOCK" | grep -c 'participant' || true)
    has_message=$(echo "$MERMAID_BLOCK" | grep -cE '\->>|-->>|\->' || true)
    if (( has_keyword > 0 && has_participant > 0 && has_message > 0 )); then
      diagram_parseable=1
    fi
  fi

  # If mmdc (mermaid CLI) is available, use it for real validation.
  # mmdc requires an output path with an image extension, so render to a
  # throwaway .svg instead of /dev/null.
  if command -v mmdc &> /dev/null && (( diagram_parseable == 1 )); then
    TMPFILE=$(mktemp /tmp/mermaid_XXXXXX.mmd)
    OUT_SVG="${TMPFILE%.mmd}.svg"
    echo "$MERMAID_BLOCK" > "$TMPFILE"
    if mmdc -i "$TMPFILE" -o "$OUT_SVG" 2>/dev/null; then
      diagram_parseable=1
    else
      diagram_parseable=0
    fi
    rm -f "$TMPFILE" "$OUT_SVG"
  fi
fi
# ─── Eval 3: uses_real_modules ───────────────────────────────────────────────
# Diagram mentions at least 2 real modules from the Firehose codebase
uses_real_modules=0
module_count=0
for module in BlogController EngineeringBlog ReleaseNotes Blogex Router PageController Layouts; do
  if echo "$CONTENT" | grep -qi "$module"; then
    module_count=$((module_count + 1))
  fi
done
if (( module_count >= 2 )); then
  uses_real_modules=1
fi
# ─── Eval 4: uses_real_functions ─────────────────────────────────────────────
# Diagram mentions at least 1 real function from the codebase
uses_real_functions=0
for func in posts_by_tag get_post all_posts paginate resolve_blog render recent_posts; do
  if echo "$CONTENT" | grep -qi "$func"; then
    uses_real_functions=1
    break
  fi
done
# ─── Eval 5: no_sidetracking ────────────────────────────────────────────────
# Output does NOT contain code review / critique language
no_sidetracking=1
BLOCKLIST="${SCRIPT_DIR}/sidetrack_blocklist.txt"
if [[ -f "$BLOCKLIST" ]]; then
  while IFS= read -r phrase; do
    phrase=$(echo "$phrase" | xargs) # trim whitespace
    if [[ -n "$phrase" ]] && echo "$CONTENT" | grep -qi "$phrase"; then
      no_sidetracking=0
      break
    fi
  done < "$BLOCKLIST"
fi
# ─── Eval 6: concise ────────────────────────────────────────────────────────
# Total output under 3000 characters
concise=0
if (( CHAR_COUNT < 3000 )); then
  concise=1
fi
# ─── Total ───────────────────────────────────────────────────────────────────
score=$((has_diagram + diagram_parseable + uses_real_modules + uses_real_functions + no_sidetracking + concise))
echo "{\"score\":${score},\"has_diagram\":${has_diagram},\"diagram_parseable\":${diagram_parseable},\"uses_real_modules\":${uses_real_modules},\"uses_real_functions\":${uses_real_functions},\"no_sidetracking\":${no_sidetracking},\"concise\":${concise},\"char_count\":${CHAR_COUNT}}"

sequence-diagram-skill/scripts/sidetrack_blocklist.txt

@@ -0,0 +1,23 @@
potential issue
consider using
should be
could be improved
recommend
suggestion
improvement
code review
refactor
best practice
security concern
vulnerability
error handling could
missing error
you might want
it would be better
note that this
be aware that
one concern
problematic
anti-pattern
smell
technical debt

sequence-diagram-skill/skill/SKILL.md

@@ -0,0 +1,54 @@
---
name: sequence-diagram
description: Generate a Mermaid sequence diagram showing message flow across module boundaries for an Elixir/Phoenix interaction. Use when asked to diagram, trace, or visualize a user interaction, request flow, or feature path through the codebase.
---
# Sequence Diagram Skill
Generate a Mermaid `sequenceDiagram` that traces a specific user interaction
across module boundaries in an Elixir/Phoenix codebase.
## Your Task
Given a description of an interaction (e.g., "user clicks a tag on a blog post")
and access to the source files, produce a Mermaid sequence diagram that accurately
shows the message flow between modules.
## Process
1. **Identify the entry point.** What triggers this interaction? (HTTP request,
LiveView event, PubSub message, etc.)
2. **Read the router** to find which controller/live module handles the route.
3. **Read the controller/live module** to find which functions are called and
which domain modules they delegate to.
4. **Read the domain modules** to understand what they return and how.
5. **Read templates/components** if the rendering path matters.
6. **Emit the diagram.** Use `sequenceDiagram` with participants named after
actual modules. Show function calls as messages.
## Output Format
Respond with ONLY a fenced Mermaid code block. No preamble, no explanation,
no code review, no suggestions. Just the diagram.
```mermaid
sequenceDiagram
    participant Browser
    participant Router as FirehoseWeb.Router
    ...
```
## Rules
- **Participants must be real modules** from the codebase. Never invent modules.
- **Messages must be real function calls** or HTTP verbs. Use the actual function
names you found in the source (e.g., `blog.posts_by_tag(tag)`, not "get posts").
- **Show the return path.** Responses flow back: module returns data, controller
renders, browser receives HTML.
- **Distinguish compile-time from runtime.** If a module uses NimblePublisher
or module attributes, the data is compiled into the BEAM — there is no runtime
file I/O. Show this as a note, not as a message to the filesystem.
- **Stay on task.** Do NOT review the code. Do NOT suggest improvements. Do NOT
mention potential issues. Your only job is the diagram.
- **Keep it readable.** Use `Note over` for context. Use short aliases for
long module names in the participant declaration.