%{
  title: "My local agentic dev setup today",
  author: "Willem van den Ende",
  tags: ~w(pi.dev llamacpp mlx ai),
  description: "Yesterday my LinkedIn post about cancelling my Claude Code Max setup went viral. People asked me about my local LLM setup. A quick post about what I am using today.",
  published: false
}
---

I was planning to write about my local development setup at my leisure. I am moving it forward because my LinkedIn post yesterday, about cancelling my Claude Max $100 plan and going local, raised a lot more interest and questions than I expected.

TLDR: I run models with [llamacpp](https://github.com/ggml-org/llama.cpp). I have a script that pulls and builds the latest llamacpp, because I want to try the latest open weights and open source models. The last couple of weeks have also seen almost daily performance improvements, and I like fast feedback. As a coding agent I use [Pi.dev](https://pi.dev), and for chat, questions, and brainstorming about writing I use [GPTEL](https://github.com/karthink/gptel) in Emacs.

You may note the absence of an IDE in the above. I was an early adopter of eXtreme Programming. If I can write tests first, run them fast, and refactor, I am happy. I rarely need a debugger. I still have a JetBrains Ultimate subscription, but that is more for technical coaching work than day-to-day work. LLMs allow me to do refactorings and make refactoring tools on the fly for languages like Elixir, which are generally not supported by IDE refactoring tools anyway.

# Inference: llama.cpp

See also: [how to get started with the pi coding agent on a vps](/blog/engineering/how-to-get-started-with-the-pi-coding-agent-on-a-vps).
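
As mentioned in the TLDR, a script pulls and rebuilds llama.cpp so I can pick up the near-daily improvements. A minimal sketch of what such a script can look like, assuming a checkout in `~/src/llama.cpp`, a GGUF file already downloaded, and an Apple Silicon machine (the paths and the Metal flag are my assumptions, adjust for your setup):

```bash
# Assumed paths; pull and rebuild llama.cpp from source.
cd ~/src/llama.cpp
git pull
cmake -B build -DGGML_METAL=ON    # Metal backend for Apple Silicon
cmake --build build --config Release -j

# Serve the model on the port that the Pi models.json below points at.
./build/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  --port 8000 \
  -c 262144
```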

# Models: Qwen3.6, 35B-A3B and 27B

These came out last week. I ran the 3.5 models before, and they were good enough to tinker with over the Easter holiday. 35B is a Mixture of Experts (MoE) model. While MoE models use more memory, they run much faster: for each token only 3B parameters are active, against all 27B parameters for the 'dense' model. The dense model can be more cohesive. In 3.5 the difference was notable in planning and summarisation, where 27B is more detailed. But here too, "good enough" counts: the 35B model is often good enough for what I do, and runs much faster, between 30 and 80 tokens per second as far as I can tell, while 27B tops out at 19 tokens per second at the moment. This makes a big difference when I'm having a chat, less so when I run it in the background while doing something else.

# Sandbox: Nono

# Coding agent: Pi.dev

I covered the general setup with a hosted model in [how to get started with the pi coding agent on a vps](/blog/engineering/how-to-get-started-with-the-pi-coding-agent-on-a-vps) the other day. Pi will point you to the installation documentation as soon as you start it.

`~/.pi/agent/models.json`

```json
{
  "providers": {
    "llama.cpp": {
      "baseUrl": "http://127.0.0.1:8000/v1",
      "api": "openai-completions",
      "apiKey": "dummy",
      "models": [
        {
          "id": "Qwen3.6-35B-A3B-MXFP4_MOE.gguf",
          "name": "Qwen3.6-35B",
          "reasoning": true,
          "input": ["text"],
          "compat": {
            "thinkingFormat": "qwen-chat-template"
          },
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "unsloth/Qwen3.6-27B-GGUF:Q4_K_M",
          "name": "Qwen3.6-27B",
          "reasoning": true,
          "input": ["text"],
          "compat": {
            "thinkingFormat": "qwen-chat-template"
          },
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
```
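
Before starting Pi, a quick sanity check that llama-server is up and that this config points at the right endpoint (assuming the server runs on port 8000, as configured above):

```bash
# The OpenAI-compatible server lists its loaded models;
# any response at all means the baseUrl is reachable.
curl http://127.0.0.1:8000/v1/models
```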

# Cutting room floor

If you want to know more about how I got here, I wrote a longer piece a few days ago about how [Smaller LLMs now work for open agents](/blog/engineering/smaller-open-llms-now-work-for-open-agents). After that, the credit card that Anthropic held for our Claude subscription expired, and I had to make an effort to continue (entering the new credit card). Cancelling was less effort than continuing, so I decided to go all in and see where it goes. As I said in the LinkedIn post, I had been using open weight LLMs more, and Claude Code less. I am also fairly efficient with tokens.

My setup is more work to get going, and doesn't do everything that Claude Code plus frontier models did, but I don't need to build large prototypes on a whim for a couple of weeks. This is kind of a 'local first' setup: I have un-metered LLMs locally and can opt in to hosted ones as needed, by the token. It is good for iterating on and incrementing an existing application in small slices, developing extensions, and so on. Assumption: every large system consists of many small parts; if we can work on the parts efficiently, we can work with smaller models.

Having said that, the turning point for me was a month or so ago, when qwen-coder-next managed to make a sequence diagram of a long trace in a 500 kLOC C# codebase, with a fairly lazy prompt. I could not use hosted models for that, because of our NDA. The assumption is there to keep me grounded, but I am prepared to be pleasantly surprised.

# Why am I doing this?

My clients and I care about digital autonomy. I also like to keep my environmental footprint minimal. I have done a bunch of larger prototypes (mostly Phoenix LiveView) and am now working on consolidating my workflow to work in small slices on shipped products. I have been using hosted LLMs for code since spring 2024, and local ones since summer 2024, starting with analysis, then functions, then scripts, then whole web apps.

Until a year or so back, I resisted tinkering too much with my local dev setup, but LLMs make things like fixing annoyances in Emacs or building extensions for a coding agent like Pi.dev relatively easy and fluent.

Further reading
---

- Rob Bowley on the unit economics of frontier labs
- New tool to help you select an LLM based on your needs and codebase (not tried yet)
- [How does the human stay in the loop, while developing on their phone?](https://www.qwan.eu/2026/02/02/liveview_microprints.html) - me on the QWAN blog, about microprints built into the application, so I can keep an eye on the code as it evolves.