firehose/app/priv/blog/engineering/2026/06-20-enabling-a-local-model-to-explain-images-in-pidev.md
2026-06-20 22:05:03 +01:00

142 lines
7.4 KiB
Markdown

%{
title: "How to let a local model interpret images in Pi.dev with llama.cpp",
author: "Willem van den Ende",
tags: ~w(),
description: "QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.",
published: false
}
---
I have used visual language models occasionally, but since I switched from Claude Code to [Pi](https://pi.dev) I had not gotten around to setting it up. Jon Jagger mentioned he had just tried it out in a programming session with Claude Code, and then I decided that maybe I shoult take a bit of time to figure it out. It was not that difficult, sort of in the category: "Why did I not do that sooner?". It is quite handy to be able to draw a state diagram and use that as a starting point to draw out a workflow.
In this case, we were discussing giving developers feedback on how they are running the TDD cycle in CyberDojo (e.g. many 'red' cycles may mean the step is too big), and I was also interested in making a deterministic extension for Pi to slow down coding agents and people, and make sure that the *Refactoring* step actually happens. The state diagram was a good starting point, it got more complicated when building a Pi extension. More about that in a next post.
QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.
In addition to the configuration I described in [My local ageentic dev setup two months ago](/blog/engineering/my-local-agentic-dev-setup-today), I had to download an 'mmproj' file and specify it in my llama.cpp run script. And then it was vision capable. I put the mmproj file in the directory with my scripts, as the name is somewhat generic, and I do not yet know if it is the same for other models, or specific for this one.
There are some more changes to the script since the last time, most notable `--spec-default` for 'speculative decoding' which gave a nice speed bump.
``` bash
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LLAMA_DIR="${ROOT_DIR}/llama.cpp"
# GGUF_DIR="$(llm gguf models-dir)"
GGUF_DIR="/Users/willem/Library/Application Support/io.datasette.llm/gguf/models/"
MAIN="${GGUF_DIR}Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
ls "${MAIN}"
exec "${LLAMA_DIR}/build/bin/llama-server" \
-m "$MAIN" \
--mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
--spec-default \
-c 262144 \
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \
--presence-penalty 0.0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--parallel 1 \
--jinja \
--host 0.0.0.0 --port 8000
```
For Pi.dev, I needed to add "image" to the part of `models.json` that has the model configuration:
``` json
"models": [
{
"id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
"name": "Qwen3.6-35B",
"reasoning": true,
"input": [
"text", "image"
],
"compat": {
"thinkingFormat": "qwen-chat-template"
},
"contextWindow": 262144,
"maxTokens": 32768,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
```
That was all the configuration changes. Two lines and a copy.
I then fed a state diagram of the red-green-refactor cycle that Jon Jagger made, by tab-completing the filename in the prompt, and the model figured out it was a state diagram (this was, admittedly also in the name of the file). It also found out the states and the transitions - much faster than trying to type it out. The red green refactor cycle from tdd was recognised. TDD was in the name of the file, but the connection with red-green-refactor was not. the only thing slightly off is 'rip' instead of *ship*. Which leads to some more fun interpretations further down.
The results I found surprisingly good, as Jon sketched this during a conversation, with no intent to convert it later. Here is the photo, you can see the shadow of my phone.
![Hand drawn state diagram, transcribed below in 'Inline session transcript' as tables and part mermaid diagram.](/images/blog/2026/tdd-state-diagram-john-jagger.jpeg)
Below you can see the part of the session where QWEN interpreted the image. I asked after that to make a mermaid diagram. The mermaid syntax was broken, so either a larger or more cohesive model might be needed, or feedback from running `mmdc` - the mer pppmaid command line.
# Inline session transcript
I have chosen a vision capable model now
Great! Now I should be able to see the image. Let me read it again:
read tdd-state-diagram-john-jagger.jpeg
Now I can see it! This is a TDD State Diagram drawn on a yellow sticky note. Here's the breakdown:
## States
| State | Description |
|------------------|----------------------------------------------------------|
| **RED** | The "fail first" state — write a failing test |
| **GREEN** | The "make it pass" state — minimal code to pass the test |
| **●** (terminal) | Accept/exit state — done, ship it |
## Transitions
| From → To | Label | Meaning |
|-----------------------------------|----------------------------|--------------------------------------------------------------|
| **RED** → **RED** (self-loop) | *"improve the diagnostic"* | Refine the failing test / understand what's wrong |
| **RED** → **GREEN** | *"pass"* | Write minimal code to make the test pass |
| **GREEN** → **RED** | *"change"* | Refactor or add new requirements (break the test on purpose) |
| **GREEN** → **GREEN** (self-loop) | *"relax"* | Refactor while tests stay green |
| **GREEN** → **●** (terminal) | *"rip"* | Ship it / done (rip the test away) |
## Flow
1. Start in **RED** — write a failing test
2. **Improve the diagnostic** (self-loop) — debug the failure
3. Make it **pass** → move to **GREEN**
4. **Change** the requirements → back to **RED**
5. **Relax** (refactor) while staying **GREEN**
6. When done, **rip** → terminal state
A classic, hand-drawn representation of the Red-Green-Refactor cycle!
## Mermaid Source
```mermaid
---
title: TDD State Diagram — John Jagger
---
stateDiagram
[*] --> RED
RED --> RED : improve the diagnostic
RED --> GREEN : pass
GREEN --> RED : change
GREEN --> GREEN : relax
GREEN --> [*] : rip
```
1. Start in RED — write a failing test
2. Improve the diagnostic (self-loop) — debug the failure
3. Make it pass → move to GREEN
4. Change the requirements → back to RED
5. Relax (refactor) while staying GREEN
6. When done, rip → terminal state