diff --git a/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md b/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md new file mode 100644 index 0000000..24e34a1 --- /dev/null +++ b/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md @@ -0,0 +1,141 @@ +%{ + title: "How to let a local model interpret images in Pi.dev with llama.cpp", + author: "Willem van den Ende", + tags: ~w(), + description: "QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.", + published: false +} +--- + +I have used visual language models occasionally, but since I switched from Claude Code to [Pi](https://pi.dev) I had not gotten around to setting it up. Jon Jagger mentioned he had just tried it out in a programming session with Claude Code, and then I decided that maybe I shoult take a bit of time to figure it out. It was not that difficult, sort of in the category: "Why did I not do that sooner?". It is quite handy to be able to draw a state diagram and use that as a starting point to draw out a workflow. + +In this case, we were discussing giving developers feedback on how they are running the TDD cycle in CyberDojo (e.g. many 'red' cycles may mean the step is too big), and I was also interested in making a deterministic extension for Pi to slow down coding agents and people, and make sure that the *Refactoring* step actually happens. The state diagram was a good starting point, it got more complicated when building a Pi extension. More about that in a next post. + +QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration. + +In addition to the configuration I described in [My local ageentic dev setup two months ago](/blog/engineering/my-local-agentic-dev-setup-today), I had to download an 'mmproj' file and specify it in my llama.cpp run script. And then it was vision capable. I put the mmproj file in the directory with my scripts, as the name is somewhat generic, and I do not yet know if it is the same for other models, or specific for this one. + +There are some more changes to the script since the last time, most notable `--spec-default` for 'speculative decoding' which gave a nice speed bump. + +``` bash +#!/usr/bin/env bash +set -euo pipefail +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +LLAMA_DIR="${ROOT_DIR}/llama.cpp" + +# GGUF_DIR="$(llm gguf models-dir)" +GGUF_DIR="/Users/willem/Library/Application Support/io.datasette.llm/gguf/models/" +MAIN="${GGUF_DIR}Qwen3.6-35B-A3B-MXFP4_MOE.gguf" + +ls "${MAIN}" + +exec "${LLAMA_DIR}/build/bin/llama-server" \ + -m "$MAIN" \ + --mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \ + --spec-default \ + -c 262144 \ + --temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \ + --presence-penalty 0.0 \ + --chat-template-kwargs '{"preserve_thinking": true}' \ + --parallel 1 \ + --jinja \ + --host 0.0.0.0 --port 8000 +``` + +For Pi.dev, I needed to add "image" to the part of `models.json` that has the model configuration: + +``` json + "models": [ + { + "id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf", + "name": "Qwen3.6-35B", + "reasoning": true, + "input": [ + "text", "image" + ], + "compat": { + "thinkingFormat": "qwen-chat-template" + }, + "contextWindow": 262144, + "maxTokens": 32768, + "cost": { + "input": 0, + "output": 0, + "cacheRead": 0, + "cacheWrite": 0 + } + } + ] +``` + +That was all the configuration changes. Two lines and a copy. + +I then fed a state diagram of the red-green-refactor cycle that Jon Jagger made, by tab-completing the filename in the prompt, and the model figured out it was a state diagram (this was, admittedly also in the name of the file). It also found out the states and the transitions - much faster than trying to type it out. The red green refactor cycle from tdd was recognised. TDD was in the name of the file, but the connection with red-green-refactor was not. the only thing slightly off is 'rip' instead of *ship*. Which leads to some more fun interpretations further down. + +The results I found surprisingly good, as Jon sketched this during a conversation, with no intent to convert it later. Here is the photo, you can see the shadow of my phone. + +![Hand drawn state diagram, transcribed below in 'Inline session transcript' as tables and part mermaid diagram.](/images/blog/2026/tdd-state-diagram-john-jagger.jpeg) + +Below you can see the part of the session where QWEN interpreted the image. I asked after that to make a mermaid diagram. The mermaid syntax was broken, so either a larger or more cohesive model might be needed, or feedback from running `mmdc` - the mer pppmaid command line. + +# Inline session transcript + + I have chosen a vision capable model now + Great! Now I should be able to see the image. Let me read it again: + read tdd-state-diagram-john-jagger.jpeg + +Now I can see it! This is a TDD State Diagram drawn on a yellow sticky note. Here's the breakdown: + +## States + + | State | Description | + |------------------|----------------------------------------------------------| + | **RED** | The "fail first" state — write a failing test | + | **GREEN** | The "make it pass" state — minimal code to pass the test | + | **●** (terminal) | Accept/exit state — done, ship it | + +## Transitions + + | From → To | Label | Meaning | + |-----------------------------------|----------------------------|--------------------------------------------------------------| + | **RED** → **RED** (self-loop) | *"improve the diagnostic"* | Refine the failing test / understand what's wrong | + | **RED** → **GREEN** | *"pass"* | Write minimal code to make the test pass | + | **GREEN** → **RED** | *"change"* | Refactor or add new requirements (break the test on purpose) | + | **GREEN** → **GREEN** (self-loop) | *"relax"* | Refactor while tests stay green | + | **GREEN** → **●** (terminal) | *"rip"* | Ship it / done (rip the test away) | + +## Flow + + 1. Start in **RED** — write a failing test + 2. **Improve the diagnostic** (self-loop) — debug the failure + 3. Make it **pass** → move to **GREEN** + 4. **Change** the requirements → back to **RED** + 5. **Relax** (refactor) while staying **GREEN** + 6. When done, **rip** → terminal state + + A classic, hand-drawn representation of the Red-Green-Refactor cycle! + +## Mermaid Source + + ```mermaid + --- + title: TDD State Diagram — John Jagger + --- + stateDiagram + [*] --> RED + + RED --> RED : improve the diagnostic + RED --> GREEN : pass + GREEN --> RED : change + GREEN --> GREEN : relax + GREEN --> [*] : rip + ``` + +1. Start in RED — write a failing test +2. Improve the diagnostic (self-loop) — debug the failure +3. Make it pass → move to GREEN +4. Change the requirements → back to RED +5. Relax (refactor) while staying GREEN +6. When done, rip → terminal state + + diff --git a/app/priv/static/images/blog/2026/tdd-state-diagram-john-jagger.jpeg b/app/priv/static/images/blog/2026/tdd-state-diagram-john-jagger.jpeg new file mode 100644 index 0000000..93cd21c Binary files /dev/null and b/app/priv/static/images/blog/2026/tdd-state-diagram-john-jagger.jpeg differ