
Sequence Diagram Skill — Autoresearch

Optimizes a pi skill for generating Mermaid sequence diagrams from Elixir/Phoenix codebases, using pi-autoresearch.

The Problem

Small local models (e.g. Qwen3.5-35B-A3B) produce good sequence diagrams for well-represented languages (C#, Java) but go off the rails with Elixir/Phoenix: they sidetrack into imaginary code reviews instead of finishing the diagram.

How It Works

The autoresearch loop mutates skill/SKILL.md, runs it against 3 scenarios from a real Phoenix codebase (Firehose), and scores each output with bash evals that need no judge model:

| Eval                | Check                                          | Tool                   |
|---------------------|------------------------------------------------|------------------------|
| has_diagram         | Output has ```mermaid + sequenceDiagram        | grep                   |
| diagram_parseable   | Valid mermaid syntax (participants + messages) | grep / mmdc            |
| uses_real_modules   | ≥2 actual module names from codebase           | grep                   |
| uses_real_functions | ≥1 actual function name                        | grep                   |
| no_sidetracking     | No review/critique language                    | grep against blocklist |
| concise             | Under 3000 chars                               | wc                     |
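
As an illustration of what "binary bash evals" can look like, here is a minimal sketch of three of the checks above. The function bodies and argument order are assumptions made for this example, not the contents of scripts/score.sh; each function takes the path of a model output file and returns 0 (pass) or 1 (fail).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of three binary evals; the real scripts/score.sh may differ.

# has_diagram: the output opens a mermaid fence and declares a sequenceDiagram.
# The pattern spells the fence as "backtick repeated 3 times" at line start.
has_diagram() {
  grep -q '^`\{3\}mermaid' "$1" && grep -q 'sequenceDiagram' "$1"
}

# no_sidetracking: no blocklist phrase appears in the output.
# $2 is a blocklist file with one fixed phrase per line (no empty lines,
# since an empty pattern would match every line).
no_sidetracking() {
  ! grep -qiFf "$2" "$1"
}

# concise: the output is under 3000 characters.
concise() {
  local chars
  chars=$(wc -m < "$1")
  [ "$((chars))" -lt 3000 ]
}
```

Because every check is just an exit status, a scorer can count passes with plain shell conditionals, for example `has_diagram out.md && score=$((score + 1))`.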

3 tasks × 6 evals = 18 max score.
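
The total is a plain sum of pass/fail flags. A minimal sketch of that aggregation follows, assuming (hypothetically) that each task's six results are available as a string of 0/1 values; `total_score` is a name invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: sum binary eval results across tasks.
# Each argument is one task's results as space-separated 0/1 flags.
total_score() {
  local total=0 flags flag
  for flags in "$@"; do
    for flag in $flags; do   # intentionally unquoted: split into single flags
      total=$((total + flag))
    done
  done
  echo "$total"
}

# Three tasks scoring 6, 5 and 3 out of 6:
total_score "1 1 1 1 1 1" "1 1 0 1 1 1" "1 0 0 1 1 0"   # prints 14
```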

Setup

  1. Clone the Firehose repo into workspace/:

    git clone https://gitea.apps.sustainabledelivery.com/mostalive/firehose workspace
    
  2. Make scripts executable:

    chmod +x autoresearch.sh autoresearch.checks.sh scripts/*.sh
    
  3. Configure model access in scripts/config.env:

    • Local: leave SSH_TARGET empty, have pi configured with your model
    • Remote: set SSH_TARGET=analyst@your-host and SSH_PORT=2222
  4. Init git and start:

    git init && git add -A && git commit -m "initial"
    pi
    # then: /autoresearch
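
For the local/remote switch in step 3, a runner could branch on whether SSH_TARGET is empty. The sketch below shows one way to do that; `remote_prefix` and the default port of 22 are assumptions made for this example, and scripts/run_one.sh may handle it differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the actual contents of scripts/run_one.sh.
# Prints the command prefix used to reach the model:
# empty for local runs, an ssh invocation when SSH_TARGET is set.
remote_prefix() {
  local target="${SSH_TARGET:-}" port="${SSH_PORT:-22}"
  if [ -z "$target" ]; then
    echo ""
  else
    echo "ssh -p $port $target"
  fi
}
```

With SSH_TARGET=analyst@your-host and SSH_PORT=2222 this prints `ssh -p 2222 analyst@your-host`; with SSH_TARGET empty it prints an empty line, so the same script can serve both setups.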
    

Project Structure

sequence-diagram-skill/
├── autoresearch.md           # Session doc (pi reads this)
├── autoresearch.sh           # Benchmark runner
├── autoresearch.checks.sh    # Sanity checks on SKILL.md
├── skill/
│   └── SKILL.md              # THE FILE BEING OPTIMIZED
├── benchmark/
│   └── tasks.jsonl           # 3 test scenarios
├── scripts/
│   ├── config.env            # Endpoint config
│   ├── run_one.sh            # Run pi with skill + single task
│   ├── score.sh              # Score a single output (6 binary evals)
│   └── sidetrack_blocklist.txt  # Phrases that indicate off-task behavior
└── workspace/                # Clone of Firehose repo (mounted/symlinked)

Mutation Ideas for the Agent

The autoresearch agent only edits skill/SKILL.md. Good mutations include:

  • Stronger "do not review" constraints
  • Explicit Elixir/Phoenix vocabulary hints (NimblePublisher, module attributes)
  • Output format enforcement (ONLY the mermaid block, nothing else)
  • Step-by-step process instructions (read router first, then controller, etc.)
  • Short generic example of a good sequence diagram
  • Negative examples ("do NOT include suggestions or improvements")