blog post about explaining image models

2026-06-20 22:05:03 +01:00 · 2026-06-20 22:05:03 +01:00 · eb3af61884
commit eb3af61884
parent ef51fdcfff
2 changed files with 141 additions and 0 deletions
--- a/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md
+++ b/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md
@ -0,0 +1,141 @@
+%{
+  title: "How to let a local model interpret images in Pi.dev with llama.cpp",
+  author: "Willem van den Ende",
+  tags: ~w(),
+  description: "QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.",
+  published: false
+}
+---
+
+I have used visual language models occasionally, but since I switched from Claude Code to [Pi](https://pi.dev) I had not gotten around to setting it up. Jon Jagger mentioned he had just tried it out in a programming session with Claude Code, and then I decided that maybe I shoult take a bit of time to figure it out. It was not that difficult, sort of in the category: "Why did I not do that sooner?". It is quite handy to be able to draw a state diagram and use that as a starting point to draw out a workflow. 
+
+In this case, we were discussing giving developers feedback on how they are running the TDD cycle in CyberDojo (e.g. many 'red' cycles may mean the step is too big), and I was also interested in making a deterministic extension for Pi to slow down coding agents and people, and make sure that the *Refactoring* step actually happens. The state diagram was a good starting point, it got more complicated when building a Pi extension. More about that in a next post.
+
+QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.
+
+In addition to the configuration I described in [My local ageentic dev setup two months ago](/blog/engineering/my-local-agentic-dev-setup-today), I had to download  an 'mmproj' file and specify it in my llama.cpp run script. And then it was vision capable. I put the mmproj file in the directory with my scripts, as the name is somewhat generic, and I do not yet know if it is the same for other models, or specific for this one.
+
+There are some more changes to the script since the last time, most notable `--spec-default` for 'speculative decoding' which gave a nice speed bump.
+
+``` bash
+#!/usr/bin/env bash
+set -euo pipefail
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+LLAMA_DIR="${ROOT_DIR}/llama.cpp"
+
+# GGUF_DIR="$(llm gguf models-dir)"
+GGUF_DIR="/Users/willem/Library/Application Support/io.datasette.llm/gguf/models/"
+MAIN="${GGUF_DIR}Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
+
+ls "${MAIN}"
+
+exec "${LLAMA_DIR}/build/bin/llama-server" \
+  -m "$MAIN" \
+  --mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
+  --spec-default \
+  -c 262144 \
+  --temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \
+  --presence-penalty 0.0 \
+  --chat-template-kwargs '{"preserve_thinking": true}' \
+  --parallel 1 \
+  --jinja \
+  --host 0.0.0.0 --port 8000
+```
+
+For Pi.dev, I needed to add "image" to the part of `models.json` that has the model configuration:
+
+``` json
+      "models": [
+        {
+          "id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
+          "name": "Qwen3.6-35B",
+          "reasoning": true,
+          "input": [
+              "text", "image"
+          ],
+          "compat": {
+            "thinkingFormat": "qwen-chat-template"
+          },
+          "contextWindow": 262144,
+          "maxTokens": 32768,
+          "cost": {
+            "input": 0,
+            "output": 0,
+            "cacheRead": 0,
+            "cacheWrite": 0
+          }
+         }
+       ]
+```
+
+That was all the configuration changes. Two lines and a copy.
+
+I then fed a state diagram of the red-green-refactor cycle that Jon Jagger made, by tab-completing the filename in the prompt, and the model figured out it was a state diagram (this was, admittedly also in the name of the file). It also found out the states and the transitions - much faster than trying to type it out. The red green refactor cycle from tdd was recognised. TDD was in the name of the file, but the connection with red-green-refactor was not. the only thing slightly off is 'rip' instead of *ship*. Which leads to some more fun interpretations further down.
+
+The results I found surprisingly good, as Jon sketched this during a conversation, with no intent to convert it later. Here is the photo, you can see the shadow of my phone.
+
+![Hand drawn state diagram, transcribed below in 'Inline session transcript' as tables and part mermaid diagram.](/images/blog/2026/tdd-state-diagram-john-jagger.jpeg)
+
+Below you can see the part of the session where QWEN interpreted the image. I asked after that to make a mermaid diagram. The mermaid syntax was broken, so either a larger or more cohesive model might be needed, or feedback from running `mmdc` - the mer pppmaid command line.
+
+# Inline session transcript
+
+ I have chosen a vision capable model now
+ Great! Now I should be able to see the image. Let me read it again:
+ read tdd-state-diagram-john-jagger.jpeg
+
+Now I can see it! This is a TDD State Diagram drawn on a yellow sticky note. Here's the breakdown:
+
+## States
+
+ | State            | Description                                              |
+ |------------------|----------------------------------------------------------|
+ | **RED**          | The "fail first" state — write a failing test            |
+ | **GREEN**        | The "make it pass" state — minimal code to pass the test |
+ | **●** (terminal) | Accept/exit state — done, ship it                        |
+
+## Transitions
+
+ | From → To                         | Label                      | Meaning                                                      |
+ |-----------------------------------|----------------------------|--------------------------------------------------------------|
+ | **RED** → **RED** (self-loop)     | *"improve the diagnostic"* | Refine the failing test / understand what's wrong            |
+ | **RED** → **GREEN**               | *"pass"*                   | Write minimal code to make the test pass                     |
+ | **GREEN** → **RED**               | *"change"*                 | Refactor or add new requirements (break the test on purpose) |
+ | **GREEN** → **GREEN** (self-loop) | *"relax"*                  | Refactor while tests stay green                              |
+ | **GREEN** → **●** (terminal)      | *"rip"*                    | Ship it / done (rip the test away)                           |
+
+## Flow
+
+ 1. Start in **RED** — write a failing test
+ 2. **Improve the diagnostic** (self-loop) — debug the failure
+ 3. Make it **pass** → move to **GREEN**
+ 4. **Change** the requirements → back to **RED**
+ 5. **Relax** (refactor) while staying **GREEN**
+ 6. When done, **rip** → terminal state
+
+ A classic, hand-drawn representation of the Red-Green-Refactor cycle!
+
+## Mermaid Source
+
+ ```mermaid
+ ---
+ title: TDD State Diagram — John Jagger
+ ---
+ stateDiagram
+     [*] --> RED
+
+     RED --> RED : improve the diagnostic
+     RED --> GREEN : pass
+     GREEN --> RED : change
+     GREEN --> GREEN : relax
+     GREEN --> [*] : rip
+ ```
+                                                                                                                                                                                                                                 
+1. Start in RED — write a failing test                                                                                                                                                                                          
+2. Improve the diagnostic (self-loop) — debug the failure
+3. Make it pass → move to GREEN
+4. Change the requirements → back to RED
+5. Relax (refactor) while staying GREEN
+6. When done, rip → terminal state
+
+ 
--- a/app/priv/static/images/blog/2026/tdd-state-diagram-john-jagger.jpeg
+++ b/app/priv/static/images/blog/2026/tdd-state-diagram-john-jagger.jpeg