# Sequence Diagram Skill — Autoresearch

Optimizes a pi skill for generating Mermaid sequence diagrams from
Elixir/Phoenix codebases, using [pi-autoresearch](https://github.com/davebcn87/pi-autoresearch).

## The Problem

Small local models (Qwen3.5-35B-A3B) produce good sequence diagrams for
well-represented languages (C#, Java) but go off the rails with Elixir/Phoenix:
they sidetrack into imaginary code reviews instead of finishing the diagram.

## How It Works

The autoresearch loop mutates `skill/SKILL.md`, runs it against 3 scenarios
from a real Phoenix codebase (Firehose), and scores each output with six
binary **bash evals that need no judge model**:

| Eval | Check | Tool |
|------|-------|------|
| has_diagram | Output has `` ```mermaid `` + `sequenceDiagram` | grep |
| diagram_parseable | Valid Mermaid syntax (participants + messages) | grep / mmdc |
| uses_real_modules | ≥2 actual module names from codebase | grep |
| uses_real_functions | ≥1 actual function name | grep |
| no_sidetracking | No review/critique language | grep against blocklist |
| concise | Under 3000 chars | wc |

Each eval is pass/fail (1 or 0), so the maximum score is 3 tasks × 6 evals = 18.
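
As a rough illustration of what such judge-free evals can look like, the sketch below implements two of the six checks (`has_diagram` and `concise`) as bash functions and sums their results. The function names and the `score_output` helper are assumptions made for this example; the actual `scripts/score.sh` may be structured differently.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual contents of scripts/score.sh.

# has_diagram: pass if the file has a mermaid code fence and the
# sequenceDiagram keyword. The pattern '`{3}' matches three backticks.
has_diagram() {
  grep -Eq '^`{3}mermaid' "$1" && grep -q 'sequenceDiagram' "$1"
}

# concise: pass if the file is under 3000 bytes (wc -c counts bytes,
# which equals characters for plain ASCII output).
concise() {
  local size
  size=$(wc -c < "$1")
  [ "$size" -lt 3000 ]
}

# score_output: print how many of the checks above pass for one file.
score_output() {
  local file="$1" score=0 check
  for check in has_diagram concise; do
    if "$check" "$file"; then
      score=$((score + 1))
    fi
  done
  echo "$score"
}
```

Calling `score_output path/to/output.md` prints a number between 0 and 2 here; with all six checks written in the same style, one task would score between 0 and 6.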

## Setup

1. Clone the Firehose repo into `workspace/`:
   ```bash
   git clone https://gitea.apps.sustainabledelivery.com/mostalive/firehose workspace
   ```

2. Make scripts executable:
   ```bash
   chmod +x autoresearch.sh autoresearch.checks.sh scripts/*.sh
   ```

3. Configure model access in `scripts/config.env`:
   - Local: leave `SSH_TARGET` empty and configure pi with your model
   - Remote: set `SSH_TARGET=analyst@your-host` and `SSH_PORT=2222`

4. Initialize a git repository and start pi:
   ```bash
   git init && git add -A && git commit -m "initial"
   pi
   # then: /autoresearch
   ```
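
To make the local/remote choice in step 3 concrete, here is a minimal sketch of how a script might interpret the two variables from `scripts/config.env`. The `run_mode` function is hypothetical, and the fallback to port 22 when `SSH_PORT` is unset is an assumption; the real `run_one.sh` may behave differently.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; run_one.sh may handle this differently.
# Example scripts/config.env values for a remote setup:
#   SSH_TARGET=analyst@your-host
#   SSH_PORT=2222

# run_mode: print "local" when SSH_TARGET is empty or unset,
# otherwise print the ssh command prefix for the remote host.
run_mode() {
  if [ -z "${SSH_TARGET:-}" ]; then
    echo "local"
  else
    # Port 22 as a fallback is an assumption, not documented behavior.
    echo "ssh -p ${SSH_PORT:-22} ${SSH_TARGET}"
  fi
}
```

A script could source `scripts/config.env` first and then call `run_mode` to decide whether to invoke pi directly or through SSH.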

## Project Structure

```
sequence-diagram-skill/
├── autoresearch.md           # Session doc (pi reads this)
├── autoresearch.sh           # Benchmark runner
├── autoresearch.checks.sh    # Sanity checks on SKILL.md
├── skill/
│   └── SKILL.md              # THE FILE BEING OPTIMIZED
├── benchmark/
│   └── tasks.jsonl           # 3 test scenarios
├── scripts/
│   ├── config.env            # Endpoint config
│   ├── run_one.sh            # Run pi with skill + single task
│   ├── score.sh              # Score a single output (6 binary evals)
│   └── sidetrack_blocklist.txt  # Phrases that indicate off-task behavior
└── workspace/                # Clone of Firehose repo (mounted/symlinked)
```
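
The `no_sidetracking` eval and `scripts/sidetrack_blocklist.txt` could be wired together roughly as follows. This is a sketch under the assumption that the blocklist holds one plain-text phrase per line; the function name and argument order are made up for illustration.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual contents of scripts/score.sh.

# no_sidetracking: pass (exit 0) when no blocklist phrase occurs in the output.
#   -q  quiet, only the exit status matters
#   -i  ignore case
#   -F  treat each phrase as a fixed string, not a regex
#   -f  read phrases from the blocklist file, one per line
# Caveat: a blank line in the blocklist matches every line of the output,
# so the blocklist file must not contain empty lines.
no_sidetracking() {
  local output="$1" blocklist="$2"
  ! grep -qiFf "$blocklist" "$output"
}
```

For example, `no_sidetracking out.md scripts/sidetrack_blocklist.txt && echo pass` prints `pass` only when none of the listed phrases appear in `out.md`.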

## Mutation Ideas for the Agent

The autoresearch agent only edits `skill/SKILL.md`. Good mutations include:

- Stronger "do not review" constraints
- Explicit Elixir/Phoenix vocabulary hints (NimblePublisher, module attributes)
- Output format enforcement (ONLY the mermaid block, nothing else)
- Step-by-step process instructions (read router first, then controller, etc.)
- A short, generic example of a good sequence diagram
- Negative examples ("do NOT include suggestions or improvements")