blog post about explaining image models
This commit is contained in:
parent
ef51fdcfff
commit
eb3af61884
@ -0,0 +1,141 @@
|
||||
%{
|
||||
title: "How to let a local model interpret images in Pi.dev with llama.cpp",
|
||||
author: "Willem van den Ende",
|
||||
tags: ~w(),
|
||||
description: "QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.",
|
||||
published: false
|
||||
}
|
||||
---
|
||||
|
||||
I have used visual language models occasionally, but since I switched from Claude Code to [Pi](https://pi.dev) I had not gotten around to setting it up. Jon Jagger mentioned he had just tried it out in a programming session with Claude Code, and then I decided that maybe I shoult take a bit of time to figure it out. It was not that difficult, sort of in the category: "Why did I not do that sooner?". It is quite handy to be able to draw a state diagram and use that as a starting point to draw out a workflow.
|
||||
|
||||
In this case, we were discussing giving developers feedback on how they are running the TDD cycle in CyberDojo (e.g. many 'red' cycles may mean the step is too big), and I was also interested in making a deterministic extension for Pi to slow down coding agents and people, and make sure that the *Refactoring* step actually happens. The state diagram was a good starting point, it got more complicated when building a Pi extension. More about that in a next post.
|
||||
|
||||
QWEN 3.6 models are multi modal, but getting that to work requires an additional file and configuration in llama.cpp and enabling images in the Pi.dev models configuration.
|
||||
|
||||
In addition to the configuration I described in [My local ageentic dev setup two months ago](/blog/engineering/my-local-agentic-dev-setup-today), I had to download an 'mmproj' file and specify it in my llama.cpp run script. And then it was vision capable. I put the mmproj file in the directory with my scripts, as the name is somewhat generic, and I do not yet know if it is the same for other models, or specific for this one.
|
||||
|
||||
There are some more changes to the script since the last time, most notable `--spec-default` for 'speculative decoding' which gave a nice speed bump.
|
||||
|
||||
``` bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
LLAMA_DIR="${ROOT_DIR}/llama.cpp"
|
||||
|
||||
# GGUF_DIR="$(llm gguf models-dir)"
|
||||
GGUF_DIR="/Users/willem/Library/Application Support/io.datasette.llm/gguf/models/"
|
||||
MAIN="${GGUF_DIR}Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
|
||||
|
||||
ls "${MAIN}"
|
||||
|
||||
exec "${LLAMA_DIR}/build/bin/llama-server" \
|
||||
-m "$MAIN" \
|
||||
--mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
|
||||
--spec-default \
|
||||
-c 262144 \
|
||||
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \
|
||||
--presence-penalty 0.0 \
|
||||
--chat-template-kwargs '{"preserve_thinking": true}' \
|
||||
--parallel 1 \
|
||||
--jinja \
|
||||
--host 0.0.0.0 --port 8000
|
||||
```
|
||||
|
||||
For Pi.dev, I needed to add "image" to the part of `models.json` that has the model configuration:
|
||||
|
||||
``` json
|
||||
"models": [
|
||||
{
|
||||
"id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
|
||||
"name": "Qwen3.6-35B",
|
||||
"reasoning": true,
|
||||
"input": [
|
||||
"text", "image"
|
||||
],
|
||||
"compat": {
|
||||
"thinkingFormat": "qwen-chat-template"
|
||||
},
|
||||
"contextWindow": 262144,
|
||||
"maxTokens": 32768,
|
||||
"cost": {
|
||||
"input": 0,
|
||||
"output": 0,
|
||||
"cacheRead": 0,
|
||||
"cacheWrite": 0
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
That was all the configuration changes. Two lines and a copy.
|
||||
|
||||
I then fed a state diagram of the red-green-refactor cycle that Jon Jagger made, by tab-completing the filename in the prompt, and the model figured out it was a state diagram (this was, admittedly also in the name of the file). It also found out the states and the transitions - much faster than trying to type it out. The red green refactor cycle from tdd was recognised. TDD was in the name of the file, but the connection with red-green-refactor was not. the only thing slightly off is 'rip' instead of *ship*. Which leads to some more fun interpretations further down.
|
||||
|
||||
The results I found surprisingly good, as Jon sketched this during a conversation, with no intent to convert it later. Here is the photo, you can see the shadow of my phone.
|
||||
|
||||

|
||||
|
||||
Below you can see the part of the session where QWEN interpreted the image. I asked after that to make a mermaid diagram. The mermaid syntax was broken, so either a larger or more cohesive model might be needed, or feedback from running `mmdc` - the mer pppmaid command line.
|
||||
|
||||
# Inline session transcript
|
||||
|
||||
I have chosen a vision capable model now
|
||||
Great! Now I should be able to see the image. Let me read it again:
|
||||
read tdd-state-diagram-john-jagger.jpeg
|
||||
|
||||
Now I can see it! This is a TDD State Diagram drawn on a yellow sticky note. Here's the breakdown:
|
||||
|
||||
## States
|
||||
|
||||
| State | Description |
|
||||
|------------------|----------------------------------------------------------|
|
||||
| **RED** | The "fail first" state — write a failing test |
|
||||
| **GREEN** | The "make it pass" state — minimal code to pass the test |
|
||||
| **●** (terminal) | Accept/exit state — done, ship it |
|
||||
|
||||
## Transitions
|
||||
|
||||
| From → To | Label | Meaning |
|
||||
|-----------------------------------|----------------------------|--------------------------------------------------------------|
|
||||
| **RED** → **RED** (self-loop) | *"improve the diagnostic"* | Refine the failing test / understand what's wrong |
|
||||
| **RED** → **GREEN** | *"pass"* | Write minimal code to make the test pass |
|
||||
| **GREEN** → **RED** | *"change"* | Refactor or add new requirements (break the test on purpose) |
|
||||
| **GREEN** → **GREEN** (self-loop) | *"relax"* | Refactor while tests stay green |
|
||||
| **GREEN** → **●** (terminal) | *"rip"* | Ship it / done (rip the test away) |
|
||||
|
||||
## Flow
|
||||
|
||||
1. Start in **RED** — write a failing test
|
||||
2. **Improve the diagnostic** (self-loop) — debug the failure
|
||||
3. Make it **pass** → move to **GREEN**
|
||||
4. **Change** the requirements → back to **RED**
|
||||
5. **Relax** (refactor) while staying **GREEN**
|
||||
6. When done, **rip** → terminal state
|
||||
|
||||
A classic, hand-drawn representation of the Red-Green-Refactor cycle!
|
||||
|
||||
## Mermaid Source
|
||||
|
||||
```mermaid
|
||||
---
|
||||
title: TDD State Diagram — John Jagger
|
||||
---
|
||||
stateDiagram
|
||||
[*] --> RED
|
||||
|
||||
RED --> RED : improve the diagnostic
|
||||
RED --> GREEN : pass
|
||||
GREEN --> RED : change
|
||||
GREEN --> GREEN : relax
|
||||
GREEN --> [*] : rip
|
||||
```
|
||||
|
||||
1. Start in RED — write a failing test
|
||||
2. Improve the diagnostic (self-loop) — debug the failure
|
||||
3. Make it pass → move to GREEN
|
||||
4. Change the requirements → back to RED
|
||||
5. Relax (refactor) while staying GREEN
|
||||
6. When done, rip → terminal state
|
||||
|
||||
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 1010 KiB |
Loading…
x
Reference in New Issue
Block a user