diff --git a/sequence-diagram-skill/.gitignore b/sequence-diagram-skill/.gitignore deleted file mode 100644 index 7b40247..0000000 --- a/sequence-diagram-skill/.gitignore +++ /dev/null @@ -1,10 +0,0 @@ -# autoresearch session -autoresearch.jsonl -autoresearch.ideas.md - -# temp -.tmp_* -*.tmp - -# OS -.DS_Store diff --git a/sequence-diagram-skill/autoresearch.checks.sh b/sequence-diagram-skill/autoresearch.checks.sh deleted file mode 100755 index ec3e2ba..0000000 --- a/sequence-diagram-skill/autoresearch.checks.sh +++ /dev/null @@ -1,57 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# ─── autoresearch.checks.sh ───────────────────────────────────────────────── -# Backpressure checks for the sequence diagram skill. -# ───────────────────────────────────────────────────────────────────────────── - -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -SKILL_FILE="${SCRIPT_DIR}/skill/SKILL.md" -ERRORS=0 - -# 1. Skill exists and is non-empty -if [[ ! -s "$SKILL_FILE" ]]; then - echo "FAIL: skill/SKILL.md is missing or empty" - ERRORS=$((ERRORS + 1)) -fi - -# 2. Skill is not trivially short -CHAR_COUNT=$(wc -c < "$SKILL_FILE" 2>/dev/null || echo "0") -if (( CHAR_COUNT < 200 )); then - echo "FAIL: skill/SKILL.md is only ${CHAR_COUNT} chars (min: 200)" - ERRORS=$((ERRORS + 1)) -fi - -# 3. Skill is not too long (rough token proxy: 1500 tokens ≈ 6000 chars) -if (( CHAR_COUNT > 6000 )); then - echo "FAIL: skill/SKILL.md is ${CHAR_COUNT} chars (max: ~6000)" - ERRORS=$((ERRORS + 1)) -fi - -# 4. Skill must contain "sequenceDiagram" or "sequence diagram" (it's a diagram skill) -if ! grep -qi 'sequence.diagram' "$SKILL_FILE" 2>/dev/null; then - echo "FAIL: skill/SKILL.md doesn't mention sequence diagrams" - ERRORS=$((ERRORS + 1)) -fi - -# 5. 
Skill must NOT contain Firehose-specific code (no overfitting) -for term in "BlogController" "EngineeringBlog" "Firehose" "blogex" "priv/blog"; do - if grep -q "$term" "$SKILL_FILE" 2>/dev/null; then - echo "FAIL: skill/SKILL.md contains codebase-specific term '${term}'" - ERRORS=$((ERRORS + 1)) - fi -done - -# 6. Valid UTF-8 -if ! iconv -f utf-8 -t utf-8 "$SKILL_FILE" > /dev/null 2>&1; then - echo "FAIL: skill/SKILL.md contains invalid UTF-8" - ERRORS=$((ERRORS + 1)) -fi - -if (( ERRORS > 0 )); then - echo "Checks FAILED with ${ERRORS} error(s)" - exit 1 -else - echo "All checks passed. Skill: ${CHAR_COUNT} chars." - exit 0 -fi diff --git a/sequence-diagram-skill/autoresearch.md b/sequence-diagram-skill/autoresearch.md deleted file mode 100644 index e60214f..0000000 --- a/sequence-diagram-skill/autoresearch.md +++ /dev/null @@ -1,96 +0,0 @@ -# Autoresearch: Sequence Diagram Skill for Elixir/Phoenix - -## Objective - -Optimize a pi skill (`skill/SKILL.md`) that generates Mermaid sequence diagrams -from Elixir/Phoenix codebases. The skill is used with a local Qwen3.5-35B-A3B -model running on CPU. The primary failure mode is **sidetracking** — the model -abandons the diagram task and starts reviewing/critiquing the code instead. - -## Primary Metric - -**score** — higher is better (0–18 scale, sum of 6 binary evals × 3 test inputs). - -## Secondary Metrics - -- **sidetrack_count** — number of test runs containing review/critique language (lower is better) -- **parse_count** — number of outputs that contain a parseable sequenceDiagram (higher is better) - -## Architecture - -Pi runs the skill against the Firehose codebase (mounted in the workspace) using -the target model. Scoring is done by bash scripts — no judge model needed. 
- -## The Codebase Under Test - -**Firehose** — a Phoenix blogging platform with a monorepo structure: - -- `app/` — Phoenix web app (OTP app: `:firehose`) - - `lib/firehose_web/router.ex` — routes - - `lib/firehose_web/controllers/blog_controller.ex` — blog actions - - `lib/firehose_web/controllers/page_controller.ex` — homepage - - `lib/firehose/blogs/` — blog context modules (EngineeringBlog, ReleaseNotes) -- `blogex/` — sibling library for compile-time blog engine - - `lib/blogex/blog.ex` — `use Blogex.Blog` macro (NimblePublisher) - - `lib/blogex/components.ex` — Phoenix function components (post_meta, tag_list, etc.) - - `lib/blogex/router.ex` — API/feed routes - -**Key architectural fact:** Blogex uses NimblePublisher. All blog posts are compiled -into BEAM module attributes at build time. There is NO runtime file I/O for reading -posts. Functions like `all_posts/0`, `get_post!/1`, `posts_by_tag/1` read from -`@posts` module attributes. This is the #1 thing models get wrong. - -## Test Inputs (3 scenarios) - -### 1. Click tag on post (small) -"Generate a sequence diagram for: a user on a blog post page clicks a tag link -(e.g., 'elixir'). Trace the full request from browser through to rendered response." - -### 2. Show homepage (small) -"Generate a sequence diagram for: a user visits the homepage (GET /). -Trace from browser through to rendered HTML." - -### 3. Add blog post on disk (larger, crosses compile/runtime boundary) -"Generate a sequence diagram for: a developer creates a new markdown file in -priv/blog/engineering/. Trace what happens from file creation through to the -post being visible on the blog. Include the compile-time and runtime phases." - -## Eval Criteria (6 binary checks) - -1. **has_diagram** — output contains `` ```mermaid `` and `sequenceDiagram` -2. **diagram_parseable** — the mermaid block is syntactically valid -3. 
**uses_real_modules** — diagram mentions at least 2 of: BlogController, EngineeringBlog, Blogex, Router, PageController -4. **uses_real_functions** — diagram mentions at least 1 of: posts_by_tag, get_post!, all_posts, paginate, resolve_blog, render -5. **no_sidetracking** — output does NOT contain code review language (see blocklist) -6. **concise** — total output is under 3000 characters - -## Files in Scope - -| File | Agent may edit? | -|------|-----------------| -| `skill/SKILL.md` | ✅ YES — the only file the agent modifies | -| `benchmark/tasks.jsonl` | ❌ NO | -| `scripts/score.sh` | ❌ NO | -| `scripts/run_one.sh` | ❌ NO | -| `scripts/sidetrack_blocklist.txt` | ❌ NO | -| `autoresearch.sh` | ❌ NO | -| `autoresearch.checks.sh` | ❌ NO | - -## Constraints - -- SKILL.md must stay under 1500 tokens. -- SKILL.md must NOT contain any code from the Firehose codebase (no overfitting). -- SKILL.md must remain generic — it should work for any Elixir/Phoenix codebase, - not just Firehose. - -## What Has Been Tried - -(autoresearch fills this in) - -## Dead Ends - -(autoresearch fills this in) - -## Key Wins - -(autoresearch fills this in) diff --git a/sequence-diagram-skill/autoresearch.sh b/sequence-diagram-skill/autoresearch.sh deleted file mode 100755 index d5b7846..0000000 --- a/sequence-diagram-skill/autoresearch.sh +++ /dev/null @@ -1,101 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# ─── autoresearch.sh ───────────────────────────────────────────────────────── -# Benchmark script for sequence diagram skill optimization. -# Runs all 3 test inputs, scores each, outputs METRIC lines. 
-# ───────────────────────────────────────────────────────────────────────────── - -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -source "${SCRIPT_DIR}/scripts/config.env" 2>/dev/null || true - -# Defaults -SSH_TARGET="${SSH_TARGET:-}" -SSH_PORT="${SSH_PORT:-2222}" -export TASK_TIMEOUT="${TASK_TIMEOUT:-180}" - -# ─── Pre-checks ────────────────────────────────────────────────────────────── - -SKILL_FILE="${SCRIPT_DIR}/skill/SKILL.md" -if [[ ! -s "$SKILL_FILE" ]]; then - echo "ERROR: skill/SKILL.md is missing or empty" - exit 1 -fi - -SKILL_CHARS=$(wc -c < "$SKILL_FILE") -echo "Skill: ${SKILL_CHARS} chars" - -TASKS_FILE="${SCRIPT_DIR}/benchmark/tasks.jsonl" -if [[ ! -f "$TASKS_FILE" ]]; then - echo "ERROR: benchmark/tasks.jsonl not found" - exit 1 -fi - -echo "────────────────────────────────────────────────────" - -# ─── Run all tasks ─────────────────────────────────────────────────────────── - -TMPDIR=$(mktemp -d) -TOTAL_SCORE=0 -SIDETRACK_COUNT=0 -PARSE_COUNT=0 -TASK_COUNT=0 - -START_TIME=$(date +%s) - -while IFS= read -r line; do - TASK_ID=$(echo "$line" | jq -r '.id') - TASK_PROMPT=$(echo "$line" | jq -r '.prompt') - TASK_COUNT=$((TASK_COUNT + 1)) - - OUTPUT_FILE="${TMPDIR}/${TASK_ID}.txt" - SCORE_FILE="${TMPDIR}/${TASK_ID}.json" - - echo " [${TASK_COUNT}/3] ${TASK_ID}..." 
- - # Run the task - bash "${SCRIPT_DIR}/scripts/run_one.sh" \ - "$TASK_PROMPT" \ - "$OUTPUT_FILE" \ - "$SSH_TARGET" \ - "$SSH_PORT" - - # Score it - SCORE_JSON=$(bash "${SCRIPT_DIR}/scripts/score.sh" "$OUTPUT_FILE") - echo "$SCORE_JSON" > "$SCORE_FILE" - - # Extract scores - TASK_SCORE=$(echo "$SCORE_JSON" | jq -r '.score') - TASK_SIDETRACK=$(echo "$SCORE_JSON" | jq -r '.no_sidetracking') - TASK_PARSE=$(echo "$SCORE_JSON" | jq -r '.diagram_parseable') - TASK_CHARS=$(echo "$SCORE_JSON" | jq -r '.char_count') - - TOTAL_SCORE=$((TOTAL_SCORE + TASK_SCORE)) - - if (( TASK_SIDETRACK == 0 )); then - SIDETRACK_COUNT=$((SIDETRACK_COUNT + 1)) - fi - - if (( TASK_PARSE == 1 )); then - PARSE_COUNT=$((PARSE_COUNT + 1)) - fi - - echo " score=${TASK_SCORE}/6 sidetrack=$(( 1 - TASK_SIDETRACK )) parseable=${TASK_PARSE} chars=${TASK_CHARS}" - -done < "$TASKS_FILE" - -END_TIME=$(date +%s) -TOTAL_SECONDS=$((END_TIME - START_TIME)) - -# ─── Cleanup ───────────────────────────────────────────────────────────────── - -rm -rf "$TMPDIR" - -# ─── Output METRIC lines ──────────────────────────────────────────────────── - -echo "" -echo "METRIC score=${TOTAL_SCORE}" -echo "METRIC sidetrack_count=${SIDETRACK_COUNT}" -echo "METRIC parse_count=${PARSE_COUNT}" -echo "METRIC total_seconds=${TOTAL_SECONDS}" -echo "METRIC skill_chars=${SKILL_CHARS}" diff --git a/sequence-diagram-skill/benchmark/tasks.jsonl b/sequence-diagram-skill/benchmark/tasks.jsonl deleted file mode 100644 index abfd16c..0000000 --- a/sequence-diagram-skill/benchmark/tasks.jsonl +++ /dev/null @@ -1,3 +0,0 @@ -{"id": "click-tag", "prompt": "Generate a sequence diagram for: a user on a blog post page clicks a tag link (e.g., 'elixir'). Trace the full HTTP request from browser through the Phoenix router, controller, domain modules, templates, and back to the browser. The codebase is in /home/analyst/workspace/. 
Read the relevant source files first."} -{"id": "show-homepage", "prompt": "Generate a sequence diagram for: a user visits the homepage (GET /). Trace from the browser's HTTP request through the Phoenix router, controller, template rendering, layout wrapping, and back to the browser. The codebase is in /home/analyst/workspace/. Read the relevant source files first."} -{"id": "add-post", "prompt": "Generate a sequence diagram for: a developer creates a new markdown file in priv/blog/engineering/ and the post becomes visible on the blog. Trace what happens including the compile-time phase (NimblePublisher, module recompilation) and the runtime request phase. The codebase is in /home/analyst/workspace/. Read the relevant source files first."} diff --git a/sequence-diagram-skill/scripts/config.env b/sequence-diagram-skill/scripts/config.env deleted file mode 100644 index 6b60d19..0000000 --- a/sequence-diagram-skill/scripts/config.env +++ /dev/null @@ -1,10 +0,0 @@ -# ─── config.env ────────────────────────────────────────────────────────────── -# Leave SSH_TARGET empty to run pi locally (e.g., on your Mac). -# Set it to use the remote pi container. - -# Remote pi container (leave empty for local) -SSH_TARGET="" -SSH_PORT=2222 - -# Timeout per task (seconds) -TASK_TIMEOUT=180 diff --git a/sequence-diagram-skill/scripts/run_one.sh b/sequence-diagram-skill/scripts/run_one.sh deleted file mode 100755 index b34b95c..0000000 --- a/sequence-diagram-skill/scripts/run_one.sh +++ /dev/null @@ -1,58 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# ─── run_one.sh ────────────────────────────────────────────────────────────── -# Run pi with the sequence-diagram skill on a single task. -# Usage: ./scripts/run_one.sh TASK_PROMPT OUTPUT_FILE [ssh_target] [ssh_port] -# -# If ssh_target is provided, runs remotely via SSH into the pi container. -# Otherwise runs pi locally. 
-# ───────────────────────────────────────────────────────────────────────────── - -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -PROJECT_DIR="$(dirname "$SCRIPT_DIR")" - -TASK_PROMPT="$1" -OUTPUT_FILE="$2" -SSH_TARGET="${3:-}" -SSH_PORT="${4:-2222}" -TIMEOUT="${TASK_TIMEOUT:-180}" - -SKILL_FILE="${PROJECT_DIR}/skill/SKILL.md" - -if [[ ! -f "$SKILL_FILE" ]]; then - echo "ERROR: skill/SKILL.md not found" >&2 - exit 1 -fi - -SKILL_CONTENT=$(cat "$SKILL_FILE") - -# Build the full prompt: skill instructions + task -FULL_PROMPT="## Skill Instructions - -${SKILL_CONTENT} - -## Task - -${TASK_PROMPT}" - -if [[ -n "$SSH_TARGET" ]]; then - # ─── Remote: SSH into pi container ─────────────────────────────────── - PAYLOAD=$(jq -n --arg prompt "$FULL_PROMPT" '{"prompt": $prompt}') - - ssh -p "$SSH_PORT" \ - -o StrictHostKeyChecking=no \ - -o ConnectTimeout=10 \ - -o BatchMode=yes \ - "$SSH_TARGET" \ - "run-task --stdin --mode print --thinking off --timeout $TIMEOUT" \ - <<< "$PAYLOAD" > "$OUTPUT_FILE" 2>/dev/null -else - # ─── Local: run pi directly ────────────────────────────────────────── - timeout "${TIMEOUT}s" pi \ - --mode print \ - --no-session \ - --no-extensions \ - --thinking none \ - -p "$FULL_PROMPT" > "$OUTPUT_FILE" 2>/dev/null || true -fi diff --git a/sequence-diagram-skill/scripts/score.sh b/sequence-diagram-skill/scripts/score.sh deleted file mode 100755 index 3fd4865..0000000 --- a/sequence-diagram-skill/scripts/score.sh +++ /dev/null @@ -1,109 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# ─── score.sh ──────────────────────────────────────────────────────────────── -# Score a single diagram output against 6 binary evals. -# Usage: ./scripts/score.sh OUTPUT_FILE -# Prints a JSON line with pass/fail for each eval and total score. -# ───────────────────────────────────────────────────────────────────────────── - -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -OUTPUT_FILE="$1" - -if [[ ! 
-f "$OUTPUT_FILE" ]]; then - echo '{"error": "file not found", "score": 0}' - exit 0 -fi - -CONTENT=$(cat "$OUTPUT_FILE") -CHAR_COUNT=${#CONTENT} - -# ─── Eval 1: has_diagram ───────────────────────────────────────────────────── -# Output contains a mermaid fenced block with sequenceDiagram -has_diagram=0 -if echo "$CONTENT" | grep -q '```mermaid' && echo "$CONTENT" | grep -q 'sequenceDiagram'; then - has_diagram=1 -fi - -# ─── Eval 2: diagram_parseable ─────────────────────────────────────────────── -# Extract the mermaid block and check basic syntax -diagram_parseable=0 -if (( has_diagram == 1 )); then - # Extract mermaid block - MERMAID_BLOCK=$(echo "$CONTENT" | awk '/^```mermaid/{found=1;next} found && /^```$/{exit} found{print}') - - if [[ -n "$MERMAID_BLOCK" ]]; then - # Basic syntax checks: - # - Has "sequenceDiagram" keyword - # - Has at least one "participant" line - # - Has at least one "->>", "-->>", or "->" message line - has_keyword=$(echo "$MERMAID_BLOCK" | grep -c 'sequenceDiagram' || true) - has_participant=$(echo "$MERMAID_BLOCK" | grep -c 'participant' || true) - has_message=$(echo "$MERMAID_BLOCK" | grep -cE '\->>|-->>|\->' || true) - - if (( has_keyword > 0 && has_participant > 0 && has_message > 0 )); then - diagram_parseable=1 - fi - fi - - # If mmdc (mermaid CLI) is available, use it for real validation - if command -v mmdc &> /dev/null && (( diagram_parseable == 1 )); then - TMPFILE=$(mktemp /tmp/mermaid_XXXXXX.mmd) - echo "$MERMAID_BLOCK" > "$TMPFILE" - if mmdc -i "$TMPFILE" -o /dev/null 2>/dev/null; then - diagram_parseable=1 - else - diagram_parseable=0 - fi - rm -f "$TMPFILE" - fi -fi - -# ─── Eval 3: uses_real_modules ─────────────────────────────────────────────── -# Diagram mentions at least 2 real modules from the Firehose codebase -uses_real_modules=0 -module_count=0 -for module in BlogController EngineeringBlog ReleaseNotes Blogex Router PageController Layouts; do - if echo "$CONTENT" | grep -qi "$module"; then - 
module_count=$((module_count + 1)) - fi -done -if (( module_count >= 2 )); then - uses_real_modules=1 -fi - -# ─── Eval 4: uses_real_functions ───────────────────────────────────────────── -# Diagram mentions at least 1 real function from the codebase -uses_real_functions=0 -for func in posts_by_tag get_post all_posts paginate resolve_blog render recent_posts; do - if echo "$CONTENT" | grep -qi "$func"; then - uses_real_functions=1 - break - fi -done - -# ─── Eval 5: no_sidetracking ──────────────────────────────────────────────── -# Output does NOT contain code review / critique language -no_sidetracking=1 -BLOCKLIST="${SCRIPT_DIR}/sidetrack_blocklist.txt" -if [[ -f "$BLOCKLIST" ]]; then - while IFS= read -r phrase; do - phrase=$(echo "$phrase" | xargs) # trim whitespace - if [[ -n "$phrase" ]] && echo "$CONTENT" | grep -qi "$phrase"; then - no_sidetracking=0 - break - fi - done < "$BLOCKLIST" -fi - -# ─── Eval 6: concise ──────────────────────────────────────────────────────── -# Total output under 3000 characters -concise=0 -if (( CHAR_COUNT < 3000 )); then - concise=1 -fi - -# ─── Total ─────────────────────────────────────────────────────────────────── -score=$((has_diagram + diagram_parseable + uses_real_modules + uses_real_functions + no_sidetracking + concise)) - -echo "{\"score\":${score},\"has_diagram\":${has_diagram},\"diagram_parseable\":${diagram_parseable},\"uses_real_modules\":${uses_real_modules},\"uses_real_functions\":${uses_real_functions},\"no_sidetracking\":${no_sidetracking},\"concise\":${concise},\"char_count\":${CHAR_COUNT}}" diff --git a/sequence-diagram-skill/scripts/sidetrack_blocklist.txt b/sequence-diagram-skill/scripts/sidetrack_blocklist.txt deleted file mode 100644 index 58b233c..0000000 --- a/sequence-diagram-skill/scripts/sidetrack_blocklist.txt +++ /dev/null @@ -1,23 +0,0 @@ -potential issue -consider using -should be -could be improved -recommend -suggestion -improvement -code review -refactor -best practice -security concern 
-vulnerability -error handling could -missing error -you might want -it would be better -note that this -be aware that -one concern -problematic -anti-pattern -smell -technical debt diff --git a/sequence-diagram-skill/skill/SKILL.md b/sequence-diagram-skill/skill/SKILL.md deleted file mode 100644 index 39f9962..0000000 --- a/sequence-diagram-skill/skill/SKILL.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -name: sequence-diagram -description: Generate a Mermaid sequence diagram showing message flow across module boundaries for an Elixir/Phoenix interaction. Use when asked to diagram, trace, or visualize a user interaction, request flow, or feature path through the codebase. ---- - -# Sequence Diagram Skill - -Generate a Mermaid `sequenceDiagram` that traces a specific user interaction -across module boundaries in an Elixir/Phoenix codebase. - -## Your Task - -Given a description of an interaction (e.g., "user clicks a tag on a blog post") -and access to the source files, produce a Mermaid sequence diagram that accurately -shows the message flow between modules. - -## Process - -1. **Identify the entry point.** What triggers this interaction? (HTTP request, - LiveView event, PubSub message, etc.) -2. **Read the router** to find which controller/live module handles the route. -3. **Read the controller/live module** to find which functions are called and - which domain modules they delegate to. -4. **Read the domain modules** to understand what they return and how. -5. **Read templates/components** if the rendering path matters. -6. **Emit the diagram.** Use `sequenceDiagram` with participants named after - actual modules. Show function calls as messages. - -## Output Format - -Respond with ONLY a fenced Mermaid code block. No preamble, no explanation, -no code review, no suggestions. Just the diagram. - -```mermaid -sequenceDiagram - participant Browser - participant Router as FirehoseWeb.Router - ... 
-``` - -## Rules - -- **Participants must be real modules** from the codebase. Never invent modules. -- **Messages must be real function calls** or HTTP verbs. Use the actual function - names you found in the source (e.g., `blog.posts_by_tag(tag)`, not "get posts"). -- **Show the return path.** Responses flow back: module returns data, controller - renders, browser receives HTML. -- **Distinguish compile-time from runtime.** If a module uses NimblePublisher - or module attributes, the data is compiled into the BEAM — there is no runtime - file I/O. Show this as a note, not as a message to the filesystem. -- **Stay on task.** Do NOT review the code. Do NOT suggest improvements. Do NOT - mention potential issues. Your only job is the diagram. -- **Keep it readable.** Use `Note over` for context. Use short aliases for - long module names in the participant declaration.
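Reviewer note on the removed files: the "Output Format" and "Rules" sections of `skill/SKILL.md` line up with the structural check in `scripts/score.sh` (Eval 2). The sketch below shows that check in isolation. It is a minimal illustration, not the removed script itself: the function names `extract_mermaid` and `looks_like_sequence_diagram` and the sample text are hypothetical, and the arrow test is simplified to the substring `->`, which every arrow form named in the original comment contains.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the body of the first ```mermaid fenced block
# read from stdin (same awk program as in the removed score.sh).
extract_mermaid() {
  awk '/^```mermaid/{found=1;next} found && /^```$/{exit} found{print}'
}

# Hypothetical helper: succeed when the block has the sequenceDiagram keyword,
# at least one participant, and at least one message arrow.
looks_like_sequence_diagram() {
  local block="$1"
  grep -q 'sequenceDiagram' <<< "$block" &&
    grep -q 'participant' <<< "$block" &&
    grep -q -e '->' <<< "$block"   # -e is needed because the pattern starts with '-'
}

# Made-up model output, used only for illustration.
sample=$'Here is the diagram:\n```mermaid\nsequenceDiagram\n  participant Browser\n  participant Router\n  Browser->>Router: GET /\n```\nDone.'

block="$(extract_mermaid <<< "$sample")"

if looks_like_sequence_diagram "$block"; then
  echo "parseable=1"
else
  echo "parseable=0"
fi
```

Like the removed script, this only confirms that the expected keywords and an arrow are present; it does not confirm that Mermaid would actually render the block.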