title: "My local agentic dev setup today", title: "My local agentic dev setup today",
author: "Willem van den Ende" author: "Willem van den Ende"
tags: ~w(pi.dev llamacpp mlx ai), tags: ~w(pi.dev llamacpp mlx ai),
description: "Yesterday my LinkedIn post about cancelling my Claude Code Max setup went viral. People asked me about my local LLM setup. A quick post about what I am using today", description: "Yesterday my LinkedIn post about cancelling my Claude Code Max plan went viral. People asked me about my local LLM setup. A quick post about what I am using today",
published: false published: false
} }
--- ---
I was planning to write about my local development setup at my leisure. I am moving this forward because my post on LinkedIn yesterday about cancelling my Claude Max $100 plan and going local raised a lot more interest and questions than I expected. This post attempts to answer the question: what hardware do you run, what software do you use (inference server and coding agent), and which models do you use? I have put links to blog posts that may answer some of the other questions in the Further Reading section.
TLDR: I run models with [llama.cpp](https://github.com/ggml-org/llama.cpp). I have a script that pulls and builds the latest llama.cpp, because I want to try the latest open weights and open source models; the last couple of weeks have also seen almost daily performance improvements, and I like fast feedback. As a coding agent I use [Pi.dev](https://pi.dev), and for chat, questions, and brainstorming about writing I use [GPTEL](https://github.com/karthink/gptel) in Emacs. This setup works for me; I am running it on a refurbished MacBook Pro M3 Max with 64GB of RAM. Note that LLMs have become more performant per unit of hardware and per watt by orders of magnitude over the last couple of years, and there is no end in sight yet. Over the last month my local models have gotten about 2x as fast, while the same or better capability uses less RAM. I can keep a browser open now while running a coding agent ;-). I can keep either of the models described below running as I go about my day (one model at a time).
You may note the absence of an IDE in the above. I was an early adopter of eXtreme Programming. If I can write tests first, run them fast, and refactor, I am happy. I rarely need a debugger. I still have a JetBrains Ultimate subscription, but that is more for technical coaching work than day-to-day work. LLMs allow me to do refactorings and make refactoring tools on the fly for languages like Elixir that are generally not supported by IDEs anyway.
Assumptions
- We are all figuring this out.
- Quality of a harness (coding agent + "skills" + extensions) can matter at least as much as the model
- Running open models and an open coding agent + custom extensions takes time, but pays off in understanding and a stable base where engineering effort compounds
- Open, local models have (for me) crossed the point where they are good enough for daily work with a coding agent.
This is a power user's setup. There are other ways to achieve similar goals (some interesting ones are in the comments on the LinkedIn post).
# Inference: llama.cpp
I got started with `ollama`, and Unsloth Studio looks promising (but has no Mac hardware acceleration yet as far as I know; it is coming). I run llama.cpp because it is quite fast and, more importantly, stable on macOS. Model makers often help llama.cpp land the changes needed to get their models out - the way inference works is still evolving, quite rapidly. It doesn't matter as much for chat, but coding agents rely on 'tool calls' (XML or JSON the model emits to request e.g. an `ls` invocation in bash, or an 'edit' with parameters in JSON), and those are not easy to make reliable.
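To make that concrete, here is a rough sketch of what such a tool-call round trip looks like against an OpenAI-compatible endpoint like the one llama-server exposes. The port, model name and tool definition are illustrative assumptions, not my exact configuration.

```bash
# Hedged illustration: a chat request that offers the model a single shell tool.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "List the files in the current directory."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "bash",
        "description": "Run a shell command and return its output",
        "parameters": {
          "type": "object",
          "properties": {"command": {"type": "string"}},
          "required": ["command"]
        }
      }
    }]
  }'
# A well-behaved model answers with a structured tool call, e.g.
# {"name": "bash", "arguments": "{\"command\": \"ls\"}"} instead of prose.
# Getting that structure right, every single time, is the hard part.
```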
I try `mlx` - Mac-native inference - occasionally, because sometimes it is faster than llama.cpp. It often works for chat, but less so for agentic coding.
I have used Claude Code to get me set up in the past. The [llama.cpp](https://github.com/ggml-org/llama.cpp) repository has good instructions on how to install and download models. If you want to ground yourself, that is probably a better way to start than a prompt.
I have cloned the `llama.cpp` repository inside my `llama-server-scripts` directory, which is itself a git repository; I have just put `llama.cpp` in `.gitignore`. I then had Claude make a script to pull and build llama.cpp. This generally works :-). I normally install releases of everything, but I find it hard to wait when promising new models or performance optimisations come out.
`build_llama.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail
LLAMA_DIR="llama.cpp"
if [ ! -d "$LLAMA_DIR" ]; then
  echo "Error: $LLAMA_DIR directory not found. Clone it first." >&2
  exit 1
fi
cd "$LLAMA_DIR"
echo "Pulling latest llama.cpp..."
git pull
echo "Configuring CMake build..."
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL_EMBED_LIBRARY=ON
echo "Building..."
cmake --build build --config Release -j"$(sysctl -n hw.ncpu)"
echo ""
echo "Build complete."
./build/bin/llama-server --version
```
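Usage is deliberately boring: rebuild whenever a promising model or performance change lands upstream, then restart the server. The chaining below is just an illustration using the script names from this post.

```bash
# Rebuild llama.cpp, then start the 27B server again (run27b.sh is shown further down).
./build_llama.sh && ./run27b.sh
```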
# Models: Qwen3.6, 35B-A3B and 27B
These came out in the last two weeks. I ran the 3.5 models before, and they were good enough to tinker with over the Easter holiday. 35B is a Mixture of Experts (MoE) model. While MoE models cost more memory, they run much faster on a Mac, or on a machine like my Framework laptop where the whole model does not fit in the GPU - for each token only 3B parameters are active, against 27B parameters for the 'dense' model. The dense model can be more cohesive. In 3.5 the difference was notable in planning and summarisation; 27B is more detailed. But here too, "good enough" counts - the 35B model is often good enough for what I do, and runs much faster: between 30 and 80 tokens per second as far as I can tell, while 27B peaks out at 19 tokens per second at the moment. This makes a big difference when I'm having a chat, less so when I run it in the background while doing something else. Note that you don't need a script to start: there is a 'router' script that will start llama.cpp, and via the web UI (which is quite nice now) you can choose which model(s) to load. Often good enough.
Unsloth has good documentation and set-up scripts for both models: [qwen3.6 at unsloth](https://unsloth.ai/docs/models/qwen3.6). The parameters in llama.cpp scripts may look intimidating at first, but I got used to them by starting somewhere and modifying as I saw things come in on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/).
# 27B
I will start with the 27B script, because it is mostly copy-paste from Unsloth, modified to taste. Easier to follow along, hopefully.
Today's special is `--spec-default`. I couldn't even find documentation for it, so here is a [deepwiki query](https://deepwiki.com/search/what-does-specdefault-do_93cffa03-5266-4c9e-ba6a-331d12efd6db?mode=fast). It sets some default parameters for speculative decoding. With it, the 27B model now sometimes goes over 20 tokens per second. Not blazing fast, but more comfortable for planning, and it can do work in the background and finish inside Pi's timeout limits.
`-c 65536 \` sets the context size to 65K tokens. Since I mostly use this model for planning, asking questions and so on, I don't need 256K tokens. I am of the 'reset early, reset often' school: small, focused context for focused results. Contexts have become much less RAM-hungry over the last two months, but a server started with a larger context still reserves more RAM up front.
`--chat-template-kwargs '{"preserve_thinking": true}' \` keeps the reasoning traces. This means that the context in a multi-turn conversation is much easier to cache (all the same tokens come back), and some models perform better when they see reasoning tokens from previous turns (or so I heard). Since inference is relatively slow on a Mac, effective caching makes a _big_ difference.
`-np 1` - only one parallel request (slot) at a time. The GPU is already maxed out when a coding agent runs, I can only single-task anyway, additional slots mean additional contexts, and I don't have that much RAM.
`--jinja` enables the model's Jinja chat template. The remaining flags are generation parameters that I probably copied from the Unsloth Hugging Face page.
`-hf` will download the model from Hugging Face. `unsloth/Qwen3.6-27B-GGUF:Q4_K_M` is the name and the quantization of the model. I haven't done an extensive study yet as to which would be the best one for my machine; this roughly matches the default for `rapid-mlx`, so I have some comparison. Q4 means roughly '4-bit integers': values are approximated through a quantization process. You may lose accuracy, but the Qwen 3.6 models seem to be less sensitive to that. Smaller is generally faster to run and costs less memory.
The other parameters are mostly general inference parameters, see the model page for options.
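For a rough sense of why the quantization matters on a 64GB machine, here is a hedged back-of-envelope, assuming roughly 4.5 bits per weight for Q4_K_M (an approximation; real GGUF files vary, and the KV cache comes on top).

```bash
# Weights only, ignoring KV cache and runtime overhead.
# 27B parameters at 16-bit floats:     27e9 * 2 bytes       ~= 54 GB
# 27B parameters at ~4.5 bits/weight:  27e9 * 4.5 / 8 bytes ~= 15 GB
echo "scale=1; 27 * 4.5 / 8" | bc   # ~= 15.1 (GB)
```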
`run27b.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail

exec ./llama.cpp/build/bin/llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
--spec-default \
--no-mmproj \
--fit on \
-np 1 \
-c 65536 \
--cache-ram 4096 -ctxcp 2 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}' \
--host 0.0.0.0 \
--port 8000
```
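Once it is up, a quick smoke test from another terminal confirms the OpenAI-compatible endpoint answers. The prompt is arbitrary, and as far as I can tell the single-model server ignores the model field.

```bash
# Minimal smoke test against the server started by run27b.sh.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with one short sentence."}]}' \
  | jq -r '.choices[0].message.content'
```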
# 35B
This has been my daily driver since the second half of last week. I downloaded it by hand; the full model name is probably `unsloth/Qwen3.6-35B-A3B-MXFP4_MOE.gguf`.
It follows an older pattern: I used Simon Willison's 'llm' tool to get started, and that saves models in a different place than `-hf` (which came later).
Here too I am running a smaller quantisation, to see if it works. Apparently not the best one, but it has worked for me over the last couple of days; benchmarks came out after I started using it. There is a time for tinkering and a time for making small tools; tinkering will come back at some point.
Note that this runs with a much larger context: `-c` indicates 256K tokens. This is an area where small open models are following the frontier quite closely. Cohesion is a different matter, but this model seems quite happy above 100K tokens, which makes improvising more relaxed. Here too I have used `--spec-default` since yesterday. Last month the 35B model was running at about 30 tokens per second; now I often see well above 60 with this and other optimizations. Raw speed isn't everything (if you have to run the same prompt 3 times to get a result, for instance), but iterating on a prompt is more enjoyable this way.
```bash
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LLAMA_DIR="${ROOT_DIR}/llama.cpp"
MAIN="$HOME/Library/Application Support/io.datasette.llm/gguf/models/Qwen3.6-35B-A3B-MXFP4_MOE.gguf"
ls "${MAIN}"
exec "${LLAMA_DIR}/build/bin/llama-server" \
-m "$MAIN" \
--spec-default \
-c 262144 \
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.0 \
--presence-penalty 0.0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--parallel 1 \
--jinja \
--host 0.0.0.0 --port 8000
```
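Both scripts bind the same port, and only one of these models fits at a time, so switching is just "stop one, start the other". A hedged convenience sketch; `run35b.sh` is a placeholder name, since this post only names `run27b.sh` explicitly.

```bash
# Hedged sketch: free port 8000, then start whichever model I want next.
PID="$(lsof -ti tcp:8000 || true)"
[ -n "$PID" ] && kill $PID
./run35b.sh   # or ./run27b.sh
```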
Configuration in `~/.pi/agent/models.json`:

```json
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
}] }]
} }
}
} }
``` ```
This is an excerpt. Notice that the contextWindow for 27B does not match the setting in the run script. There is duplication here. The OpenAI API that I use here does not export the context window as far as I have found, so you specify this twice, which is annoying. It would be nice to be able to specify this per request, as not every request needs a massive context. Suggestions welcome!
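One way out that I might try: treat the run script as the source of truth and patch `models.json` from it. A hedged sketch with `jq`; the field name `contextWindow` comes from the excerpt above, but the surrounding structure is an assumption, so adjust to the real Pi schema.

```bash
# Hedged sketch only: patches every "contextWindow" field in the file; a real version
# would select the specific model entry. The models.json structure is assumed, not verified.
CTX=65536
MODELS="$HOME/.pi/agent/models.json"
jq --argjson ctx "$CTX" \
  'walk(if type == "object" and has("contextWindow") then .contextWindow = $ctx else . end)' \
  "$MODELS" > "${MODELS}.tmp" && mv "${MODELS}.tmp" "$MODELS"
```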
# Cutting room floor
If you want to know more about how I got here, I wrote a longer piece a few days ago about how [Smaller LLMs now work for open agents](/blog/engineering/smaller-open-llms-now-work-for-open-agents). After that, the credit card that Anthropic held for our Claude subscription expired, and I had to make an effort to continue (entering the new credit card). Cancelling was less effort than continuing, so I decided to go all in and see where it goes. As I said in the LinkedIn post, I had been using open weight LLMs more, and Claude Code less. I am also fairly efficient with tokens.
My setup is more work to get going, and doesn't do everything that Claude Code + frontier models did, but I don't need to build large prototypes on a whim for a couple of weeks. This is kind of a 'local first' setup: I have un-metered LLMs locally and can opt hosted ones in as needed, by the token. It is good for iterating and incrementing an existing application in small slices, developing extensions, etc. Assumption: every large system consists of many small parts; if we can work on the parts efficiently, we can work with smaller models.
Having said that, the turning point for me was a month or so ago when qwen-coder-next managed to make a sequence diagram of a long trace in a 500 KLoC C# codebase, with a fairly lazy prompt. I could not use hosted models for that, because of our NDA. The assumption is there to keep me grounded, but I am prepared to be pleasantly surprised.
Why am I doing this?
My clients and I care about digital autonomy. I also like to keep my environmental footprint minimal. I have done a bunch of larger prototypes (mostly Phoenix LiveView) and am now working on consolidating my workflow to work in small slices on shipped products. I have been using hosted LLMs for code since spring 2024, and local ones since summer 2024, starting with analysis, then functions, then scripts, then whole web apps.
Until a year or so back, I resisted tinkering too much with my local dev setup, but LLMs make e.g. fixing annoyances in Emacs or making extensions for a coding agent like Pi.dev relatively easy and fluent.
Further reading
---
- [Coding agent generates its own extensions](04-19-coding-agent-generates-its-own-extensions.md) — Engineer solutions in the moment for the agent you're in a session with.
- [A pair pomodoro with Pi](04-20-a-pair-pomodoro-with-pi.md) — A brief experiment in working in short cycles with a coding agent.
- [How to get started with the Pi coding agent (on a VPS)](04-24-how-to-get-started-with-the-pi-coding-agent-on-a-vps.md) — Setting up Pi on a VPS is easier than I thought.
- [Smaller open LLMs now work for open agents](04-24-smaller-open-llms-now-work-for-open-agents.md) — A phase shift in quality and speed of open weight models and inference.
- Nate Jones had a good [podcast / video](https://podcasts.apple.com/gb/podcast/ai-news-strategy-daily-with-nate-b-jones/id1877109372?i=1000763732500) this week on Apple's play with local models, for e.g. legal offices who cannot get their work certified if their data leaves the office, no matter how encrypted it is.
- [How does the human stay in the loop, while developing on their phone?](https://www.qwan.eu/2026/02/02/liveview_microprints.html) — me on the QWAN blog, about microprints built into the application, so I can keep an eye on the code as it evolves.

Full disclosure: I am long AAPL, NVDA and Alibaba (makers of Qwen). I use other hardware and models (and don't have much Nvidia hardware).