144 lines
7.5 KiB
Markdown
144 lines
7.5 KiB
Markdown
%{
|
|
title: "How to let a local model interpret images in Pi.dev with llama.cpp",
|
|
author: "Willem van den Ende",
|
|
tags: ~w(),
|
|
description: "QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.",
|
|
published: true
|
|
}
|
|
---
|
|
|
|
I have used visual language models occasionally, but since I switched from Claude Code to [Pi](https://pi.dev) I had not gotten around to setting it up. Jon Jagger mentioned he had just tried it out in a programming session with Claude Code, and then I decided that maybe I shoult take a bit of time to figure it out. It was not that difficult, sort of in the category: "Why did I not do that sooner?". It is quite handy to be able to draw a state diagram and use that as a starting point to draw out a workflow.
|
|
|
|
In this case, we were discussing giving developers feedback on how they are running the TDD cycle in CyberDojo (e.g. many 'red' cycles may mean the step is too big), and I was also interested in making a deterministic extension for Pi to slow down coding agents and people, and make sure that the *Refactoring* step actually happens. The state diagram was a good starting point, it got more complicated when building a Pi extension. More about that in a next post.
|
|
|
|
QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.
|
|
|
|
In addition to the configuration I described in [My local agentic dev setup two months ago](/blog/engineering/my-local-agentic-dev-setup-today), I had to download an 'mmproj' file and specify it in my llama.cpp run script. And then it was vision capable. I put the mmproj file in the directory with my scripts, as the name is somewhat generic, and I do not yet know if it is the same for other models, or specific for this one.
|
|
|
|
There are some more changes to the script since the last time, most notable `--spec-default` for 'speculative decoding' which gave a nice speed bump.
|
|
|
|
``` bash
|
|
#!/usr/bin/env bash
|
|
set -euo pipefail
|
|
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
LLAMA_DIR="${ROOT_DIR}/llama.cpp"
|
|
|
|
# GGUF_DIR="$(llm gguf models-dir)"
|
|
GGUF_DIR="/Users/willem/Library/Application Support/io.datasette.llm/gguf/models/"
|
|
MAIN="${GGUF_DIR}Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
|
|
|
|
ls "${MAIN}"
|
|
|
|
exec "${LLAMA_DIR}/build/bin/llama-server" \
|
|
-m "$MAIN" \
|
|
--mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
|
|
--spec-default \
|
|
-c 262144 \
|
|
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \
|
|
--presence-penalty 0.0 \
|
|
--chat-template-kwargs '{"preserve_thinking": true}' \
|
|
--parallel 1 \
|
|
--jinja \
|
|
--host 0.0.0.0 --port 8000
|
|
```
|
|
|
|
For Pi.dev, I needed to add "image" to the part of `models.json` that has the model configuration:
|
|
|
|
``` json
|
|
"models": [
|
|
{
|
|
"id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
|
|
"name": "Qwen3.6-35B",
|
|
"reasoning": true,
|
|
"input": [
|
|
"text", "image"
|
|
],
|
|
"compat": {
|
|
"thinkingFormat": "qwen-chat-template"
|
|
},
|
|
"contextWindow": 262144,
|
|
"maxTokens": 32768,
|
|
"cost": {
|
|
"input": 0,
|
|
"output": 0,
|
|
"cacheRead": 0,
|
|
"cacheWrite": 0
|
|
}
|
|
}
|
|
]
|
|
```
|
|
|
|
That was all the configuration changes. Two lines and a copy.
|
|
|
|
I then fed a state diagram of the red-green-refactor cycle that Jon Jagger made, by tab-completing the filename in the prompt, and the model figured out it was a state diagram (this was, admittedly also in the name of the file). It also found out the states and the transitions - much faster than trying to type it out. The red green refactor cycle from tdd was recognised. TDD was in the name of the file, but the connection with red-green-refactor was not. the only thing slightly off is 'rip' instead of *ship*. Which leads to some more fun interpretations further down.
|
|
|
|
The results I found surprisingly good, as Jon sketched this during a conversation, with no intent to convert it later. Here is the photo, you can see the shadow of my phone.
|
|
|
|

|
|
|
|

|
|
|
|
Below you can see the part of the session where QWEN interpreted the image. I asked after that to make a mermaid diagram. The mermaid syntax was broken, so either a larger or more cohesive model might be needed, or feedback from running `mmdc` - the mermaid command line.
|
|
|
|
# Inline session transcript
|
|
|
|
I have chosen a vision capable model now
|
|
Great! Now I should be able to see the image. Let me read it again:
|
|
read tdd-state-diagram-john-jagger.jpeg
|
|
|
|
Now I can see it! This is a TDD State Diagram drawn on a yellow sticky note. Here's the breakdown:
|
|
|
|
## States
|
|
|
|
| State | Description |
|
|
|------------------|----------------------------------------------------------|
|
|
| **RED** | The "fail first" state — write a failing test |
|
|
| **GREEN** | The "make it pass" state — minimal code to pass the test |
|
|
| **●** (terminal) | Accept/exit state — done, ship it |
|
|
|
|
## Transitions
|
|
|
|
| From → To | Label | Meaning |
|
|
|-----------------------------------|----------------------------|--------------------------------------------------------------|
|
|
| **RED** → **RED** (self-loop) | *"improve the diagnostic"* | Refine the failing test / understand what's wrong |
|
|
| **RED** → **GREEN** | *"pass"* | Write minimal code to make the test pass |
|
|
| **GREEN** → **RED** | *"change"* | Refactor or add new requirements (break the test on purpose) |
|
|
| **GREEN** → **GREEN** (self-loop) | *"relax"* | Refactor while tests stay green |
|
|
| **GREEN** → **●** (terminal) | *"rip"* | Ship it / done (rip the test away) |
|
|
|
|
## Flow
|
|
|
|
1. Start in **RED** — write a failing test
|
|
2. **Improve the diagnostic** (self-loop) — debug the failure
|
|
3. Make it **pass** → move to **GREEN**
|
|
4. **Change** the requirements → back to **RED**
|
|
5. **Relax** (refactor) while staying **GREEN**
|
|
6. When done, **rip** → terminal state
|
|
|
|
A classic, hand-drawn representation of the Red-Green-Refactor cycle!
|
|
|
|
## Mermaid Source
|
|
|
|
```mermaid
|
|
---
|
|
title: TDD State Diagram — John Jagger
|
|
---
|
|
stateDiagram
|
|
[*] --> RED
|
|
|
|
RED --> RED : improve the diagnostic
|
|
RED --> GREEN : pass
|
|
GREEN --> RED : change
|
|
GREEN --> GREEN : relax
|
|
GREEN --> [*] : rip
|
|
```
|
|
|
|
1. Start in RED — write a failing test
|
|
2. Improve the diagnostic (self-loop) — debug the failure
|
|
3. Make it pass → move to GREEN
|
|
4. Change the requirements → back to RED
|
|
5. Relax (refactor) while staying GREEN
|
|
6. When done, rip → terminal state
|
|
|
|
|