Watch out for the tokens! – Simone Giordano

When I first started working, we had one of the earliest internet connections via a 64 Kbps “CDN” line. The connection was slow, but it offered a flat-rate billing model. A few years later, HDSL arrived, offering a symmetric connection with speeds up to 2 Mbps, but billed on a pay-per-use basis. We counted every single byte to avoid excessive costs, while others were less careful and often received exorbitant bills. Today, no one really remembers pay-per-use internet, at least not in Italy.

The story repeats itself…

ChatGPT was initially free, and even Sora, which generates improbable videos, was offered at no cost. Then came monthly subscriptions, with ever-increasing prices, and now we’re in the phase of “token consumption.” You already know how this will end!

It’s increasingly clear that LLMs are becoming a pure “commodity.” On Hugging Face alone, a new and more powerful model is published every hour. Some studies have shown that in the most popular AI products, the model itself accounts for less than 5% of the value. The rest consists of automation, tools, and “old-style” Python code.

It’s equally evident that continuously making requests to a remote AI system is unsustainable when the same answers can be obtained locally, using a smaller and optimized model. This is why Apple Silicon made a difference, and why NVIDIA recently unveiled its RTX Spark chip. The prediction that most AI models will run locally by the end of 2026 is coming true.

https://nvidianews.nvidia.com/news/nvidia-microsoft-windows-pcs-agents-rtx-spark

Token-based billing is accelerating this process.

Which server to choose

In a previous article, I discussed GGUF, the format that bundles everything needed to run a local LLM into a single file. The server (e.g., llama.cpp) then takes care of the appropriate hardware acceleration (most likely NVIDIA CUDA or Apple Silicon).

For Apple Silicon, there is the option to run further optimized models based on MLX. The difference is significant: for the same model, you can achieve up to double the tokens per second. For my setup, I chose MLX and had to make a few decisions regarding the server:

llama.cpp remains by far the best option, but it does not yet support MLX.
Ollama, on the other hand, supports MLX, but I consider it too “simple” for advanced use.
oMLX and vMLX are extremely promising but still unstable.
LM Studio is currently the only solution that offers everything I need.

56 tokens per second for a battery-powered notebook is not bad at all 😎

Which model should I choose?

Currently, I am using Qwen 3.6, which offers the best compromise between speed, training efficiency, tool usage capabilities, and knowledge of programming languages. Yes, it also understands AL for NAV and Business Central.

However, as mentioned earlier, new models are released all the time. My advice is to download the promising ones and test each by having it perform the same benchmark task. This way, you can evaluate which model works best for your specific environment, taking into account:

Available memory
Type of task to be performed
Tool utilization

Regarding the last point, there is no guarantee that a model will actually call the tools you have provided. For this reason, there is a specific training approach known as “Instruction Tuned” (indicated by the “it” suffix in the model’s filename).
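To make the "same benchmark task" idea concrete, here is a minimal Python sketch of a comparison harness. It assumes the local server returns OpenAI-style chat-completion JSON (with "choices", "tool_calls" and "usage" fields). The names benchmark, tool_was_called and the ask callable are hypothetical: ask is whatever function you write to send a prompt to your own server and return the decoded response. The speed figure is a rough end-to-end number, since the elapsed time also includes prompt processing.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    model: str
    tokens_per_second: float
    called_tool: bool


def tool_was_called(response: dict, tool_name: str) -> bool:
    """Return True if an OpenAI-style response contains a call to tool_name."""
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # "tool_calls" may be missing or None when the model answered in plain text.
        for call in message.get("tool_calls") or []:
            if call.get("function", {}).get("name") == tool_name:
                return True
    return False


def benchmark(
    models: List[str],
    ask: Callable[[str, str], dict],  # hypothetical: (model, prompt) -> response dict
    prompt: str,
    tool_name: str,
    clock: Callable[[], float] = time.perf_counter,
) -> List[RunResult]:
    """Send the same prompt to every model and record speed and tool usage."""
    results = []
    for model in models:
        start = clock()
        response = ask(model, prompt)
        elapsed = clock() - start
        generated = response.get("usage", {}).get("completion_tokens", 0)
        speed = generated / elapsed if elapsed > 0 else 0.0
        results.append(RunResult(model, speed, tool_was_called(response, tool_name)))
    return results
```

Running the same prompt several times per model and averaging the results gives a fairer picture, because both speed and tool usage can vary from one run to the next.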

Additionally, consider the number of parameters in a model. For example, “35B” stands for 35 billion parameters.

Some models, optimized for local use, activate only a subset of their trained parameters for each token, resulting in lighter and faster execution. For instance, “A3B” means “only 3 billion parameters are active.”

Finally, there is quantization, the “magical” process that enables LLMs to run on laptops. As discussed in a previous article, 4-bit quantization is a good balance that slightly reduces precision while making full-sized models more manageable.
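The effect of quantization on memory is simple arithmetic, sketched below in Python. The function name weight_memory_gb is hypothetical, and the estimate covers the weights only: it ignores the context cache, runtime overhead and the extra metadata that quantized files carry, so real usage will be somewhat higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9
```

For a 35B model, weight_memory_gb(35, 16) gives 70.0, while weight_memory_gb(35, 4) gives 17.5. Note that the total parameter count is what determines memory: a model with only 3 billion active parameters still has to keep all of its weights loaded, and the smaller active set mainly helps with speed.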

All these details can be found in the model’s filename. Here’s how to choose the right model!

https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-4bit
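As an illustration, the name in the link above can be decoded with a few lines of Python. This is only a sketch: naming is an informal convention rather than a standard, so the parser below (parse_model_name and ModelName are hypothetical names) handles just the patterns described in this article and will miss other schemes.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelName:
    total_params_b: Optional[float]   # e.g. 35.0 for "35B"
    active_params_b: Optional[float]  # e.g. 3.0 for "A3B"
    quant_bits: Optional[int]         # e.g. 4 for "4bit"
    instruction_tuned: bool           # True if an "it" part is present


def parse_model_name(name: str) -> ModelName:
    """Extract size, active parameters and quantization from a model name."""
    total = active = bits = None
    instruction_tuned = False
    for part in re.split(r"[-_/]", name):
        if m := re.fullmatch(r"(\d+(?:\.\d+)?)B", part, re.IGNORECASE):
            total = float(m.group(1))
        elif m := re.fullmatch(r"A(\d+(?:\.\d+)?)B", part, re.IGNORECASE):
            active = float(m.group(1))
        elif m := re.fullmatch(r"(\d+)bit", part, re.IGNORECASE):
            bits = int(m.group(1))
        elif part.lower() == "it":
            instruction_tuned = True
    return ModelName(total, active, bits, instruction_tuned)
```

Calling parse_model_name("Qwen3.6-35B-A3B-4bit") yields 35 billion total parameters, 3 billion active and 4-bit quantization, which is exactly the information needed to judge whether the model fits the available memory.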

Code Development Use

Okay, but how do you use Claude Code? My cousin said that with a single prompt, they rewrote their company’s CRM from scratch, and it now works better than before!

There are several open-source alternatives, and new versions are emerging every day. I first tried extensions for Visual Studio Code, such as “Cline” and “Continue.”

However, for stability and parallel task management, I chose OpenCode.

Conclusion

Yes. AI is a powerful and fascinating tool. We can use it locally, with complete privacy and at zero cost (aside from the hardware expenses for the chip and memory). In software development, it can help us generate a foundational model or identify hidden vulnerabilities. More simply, it can also serve as a quick way to consult documentation, or act as a virtual companion with whom to share an action plan.

No. AI will not steal our jobs, nor can it rewrite your company’s ERP system in just half a day. What is rapidly growing, at an exponential rate, are the idiots who use AI to justify their utterances, and the poor-quality code that now permeates even many commercial software products.