final editing on my local agentic dev setup

This commit is contained in:
Firehose Bot 2026-04-30 09:10:51 +01:00
parent 33da823668
commit cfc26cf377
2 changed files with 30 additions and 5 deletions


@ -7,7 +7,7 @@
}
---
I was planning to write about my local development setup at my leisure. I am moving it forward because my [post on LinkedIn](https://www.linkedin.com/posts/willemvandenende_cancelled-my-claude-code-plan-was-using-activity-7454883528661676032-KdZp) the other day about cancelling my Claude Max $100 plan and going local raised a lot more interest and questions than I expected. This post attempts to answer the question: what hardware do you run, what software do you use (inference server and coding agent), and which models do you use? I have put links to blog posts that may answer some of the other questions in the Further Reading section.
My setup works for me: I am running this on a refurbished MacBook Pro M3 Max with 64GB of RAM. Note that LLMs have become more performant per unit of hardware and per watt by orders of magnitude over the last couple of years, and there is no end in sight yet. Over the last month my local models have become about 2x as fast, while the same or better capability uses less RAM. I can keep a browser open now while running a coding agent ;-). I can just keep either of the models explained below running as I go about my day (one model at a time).
@ -22,7 +22,15 @@ Assumptions
- Running open models and an open coding agent + custom extensions takes time, but pays off in understanding and a stable base where engineering effort compounds
- Open, local models have (for me) crossed the point where they are good enough for daily work with a coding agent.
As Patrick Debois noted, mine is a power user's setup. There are other ways to achieve similar goals. Some interesting ones are in the comments on the LinkedIn post, and a surprising one in the Afterword below.
In general it comes down to this: a more out-of-the-box experience with something like Claude Code, Codex or OpenCode, versus more control, personalisation, digital autonomy and data privacy with more of your own harness and local LLMs (or hosted ones with strong privacy guarantees).
I made the comparison table below in conversation with the 27B model discussed later in this post.

![comparison table between Claude and Pi, some more explanation is in the further reading section](/images/blog/2026/claude-pi-comparison.png)
# Inference: llama.cpp
@ -35,6 +43,9 @@ I have used claude code to get me set up in the past. The [llama.cpp](https://gi
I have cloned the `llama.cpp` repository inside my `llama-server-scripts` directory, which is also a git repository; I have just put `llama.cpp` in `.gitignore`. I then had Claude make a script to pull and build llama.cpp. This generally works :-). I normally install releases of everything, but I find it hard to wait when promising new models or performance optimisations come out.
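For illustration, the one-time layout described above comes down to something like this (the paths are my own choices here, adjust to taste):

```bash
# One-time layout: a scripts repo with llama.cpp cloned inside it but ignored by git.
mkdir -p ~/llama-server-scripts && cd ~/llama-server-scripts
git init
git clone https://github.com/ggml-org/llama.cpp.git
echo "llama.cpp/" >> .gitignore
```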
I have skipped the instructions for how to install a compiler etc. This [Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM](https://www.reddit.com/r/LocalLLaMA/comments/1svdep5/field_report_coding_with_qwen_36_35ba3b_on_an_m2) also has instructions on how to set up the Xcode build tools etc., as well as some tasks the 35B didn't do so well on initially.
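If you just want the short version, the prerequisites on a fresh Mac boil down to roughly the following (assuming you already use Homebrew):

```bash
# Compiler and build tooling needed before building llama.cpp from source.
xcode-select --install   # Apple command line tools (clang, git)
brew install cmake       # llama.cpp builds with CMake
```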
```bash
➜ llama-server-scripts git:(main) ✗ cat build_llama.sh
#!/usr/bin/env bash
@ -66,6 +77,7 @@ echo "Build complete."
```
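The script is only shown in part above; a minimal equivalent, based on llama.cpp's standard CMake build (the job count and explicit Metal flag are my assumptions), would be roughly:

```bash
#!/usr/bin/env bash
# Minimal sketch of a pull-and-build script for llama.cpp on an Apple Silicon Mac.
set -euo pipefail

cd "$(dirname "$0")/llama.cpp"
git pull --ff-only

# Metal is on by default on macOS; stating it explicitly does no harm.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j "$(sysctl -n hw.ncpu)"

echo "Build complete."
```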
# Models: Qwen3.6 35B-A3B and 27B
These came out in the last two weeks. I ran the 3.5 models before, and they were good enough to tinker with over the Easter holiday. 35B is a Mixture of Experts (MoE) model. While these cost more memory, they run much faster on a Mac, or on a machine like my Framework laptop where the whole model does not fit in the GPU - for each token only 3B parameters are active, against 27B parameters for the 'dense' model. The dense model can be more cohesive. In 3.5 the difference was notable in planning and summarisation, where the 27B is more detailed. But here too, "good enough" counts - the 35B model is often good enough for what I do, and runs much faster: between 30 and 80 tokens per second as far as I can tell, while the 27B peaks out at 19 tokens per second at the moment. This makes a big difference when I'm having a chat, less so when I run it in the background while doing something else.
@ -74,7 +86,9 @@ Unsloth has good documentation and set up scripts for both models [qwen3.6 at un
Note that you don't need a script to start: there is a 'router' script that will start llama.cpp, and then via the web UI (which is quite nice now) you can choose which model(s) to load. Often good enough.
Once you have downloaded a model, go to port 8000 and you can play with it. I quite like the chat UI, as it also has conversation forking built in and shows performance metrics as it runs. This gives me a feel for how a particular model is doing without going into detailed evals.
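The same server also exposes an OpenAI-compatible API, which is what a coding agent can point at. A quick sanity check from the command line (port 8000 matches the setup here; the model name is just a label in a single-model setup) could be:

```bash
# Quick sanity check against llama-server's OpenAI-compatible endpoint on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```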
# 27B
I will start with the 27B script, because that is more copy-paste from Unsloth, and then modify to taste. Easier to follow along, hopefully.
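As a rough idea of what such a launch script looks like, here is a minimal sketch; the model file name, context size and flags below are my assumptions, not the exact Unsloth values:

```bash
#!/usr/bin/env bash
# Hypothetical launch script for the dense 27B model with llama-server.
# Model path, context size and flags are illustrative assumptions.
set -euo pipefail

./llama.cpp/build/bin/llama-server \
  --model ./models/Qwen3.6-27B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --jinja \
  --port 8000
```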
@ -255,6 +269,17 @@ Further reading
- [How to get started with the Pi coding agent (on a VPS)](04-24-how-to-get-started-with-the-pi-coding-agent-on-a-vps.md) — Setting up Pi on a VPS is easier than I thought.
- [Smaller open LLMs now work for open agents](04-24-smaller-open-llms-now-work-for-open-agents.md) — A phase shift in quality and speed of open weight models and inference.
- Nate B Jones had a good podcast / video this week on Apple's play with local models, e.g. for legal offices who cannot get their work certified if their data leaves the office, no matter how encrypted it is: [Nate B Jones on Apple and the next trillion dollars](https://podcasts.apple.com/gb/podcast/ai-news-strategy-daily-with-nate-b-jones/id1877109372?i=1000763732500).
Full disclosure: I am long AAPL, NVDA and BABA (Alibaba, makers of Qwen). I use other hardware and models too, and don't have much Nvidia hardware.
This [Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM](https://www.reddit.com/r/LocalLLaMA/comments/1svdep5/field_report_coding_with_qwen_36_35ba3b_on_an_m2), which I have mentioned before, has detailed instructions, as well as some development tasks explained in detail.
Afterword
----
What was remarkable to me in the Field Report above is that the writer chose OpenCode and Qwen3.6 35B over Claude:
> Why don't I just use [..] Claude Code? I had problems due to a lack of optimization for small context windows. Long-running tasks that complete large projects independently matter for me, so no Claude Code.
So maybe my intuition is right. I am not _yet_ doing long-running tasks with Pi and Qwen like I was doing with Claude Code, but this shows it is possible, and might even be better. Detailed prompts and deterministic harnesses shrink the gap between frontier models and smaller local ones. More about that maybe later. I hope this helps you; let me know if you have any questions or remarks.

Binary image added: claude-pi-comparison.png (92 KiB).