I was wrong: Small is the new big for local AI agents

TL;DR: We’ve been obsessed with parameter counts for too long, but running massive models locally is a pain. The real unlock isn't a smarter God-model; it's a swarm of small, specialized agents running locally. My M1 Mac is suddenly a powerhouse. Combined with frameworks like CrewAI, Small Language Models (SLMs) are proving that coordination beats raw intelligence.
I admit it. I fell for the "bigger is better" trap.
When I first started trying to run LLMs locally on my Apple Silicon M1, I went straight for the heavyweights. I wanted the smartest model possible, so I pulled down a 13B-parameter model (the largest my 16 GB of RAM could handle), fired it up, and waited.
And waited.
My GPU was maxed out, and the token generation speed was glacial. It was unusable for any real workflow. I had assumed that the bigger the model, the better the output, which is broadly true, but for the kind of tasks I actually do, the trade-off was a disaster.
So, I pivoted. I started looking into Small Language Models (SLMs) and running them through Ollama. And that’s when things got interesting.
The agentic unlock
Here is the thing: a single small model often hallucinates or misses the nuance. It feels "cheap." But when you take that same small model and wrap it in an agentic system, using tools like CrewAI, the game changes completely.
I’ve been building these systems locally recently, and the difference is striking.
The concept is simple. Instead of asking one massive brain to write code, document it, and test it, you spin up three small, fast agents: one writes, one reviews, one documents.
Because they are small, they run fast. Because they are specialized via system prompts, they don't need to be geniuses at everything; they just need to be competent at one thing.
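To make the idea concrete, here is a toy sketch of that three-agent hand-off in plain Python. The model call is stubbed out (a real setup would hit a local SLM through Ollama), and the agent roles and prompts are hypothetical, but the structure is the point: the same small model, specialized three ways via system prompts, chained together.

```python
# Toy sketch of the writer -> reviewer -> documenter pipeline.
# call_model() is a stub standing in for a real local SLM inference call.

def call_model(system_prompt: str, user_input: str) -> str:
    # Placeholder: a real version would POST to a local Ollama server.
    # Here it just tags the input so the hand-off chain is visible.
    role = system_prompt.split(",")[0]
    return f"[{role}] processed: {user_input}"

def make_agent(system_prompt: str):
    # "Specialization" is nothing more than a fixed system prompt
    # wrapped around the same small model.
    def agent(task: str) -> str:
        return call_model(system_prompt, task)
    return agent

writer = make_agent("You are a coder, write exactly what is requested")
reviewer = make_agent("You are a reviewer, critique the code for bugs")
documenter = make_agent("You are a tech writer, document the final code")

# Chain the hand-offs: each agent's output feeds the next one.
draft = writer("a function that parses ISO dates")
review = reviewer(draft)
docs = documenter(review)
```

Swap the stub for a real inference call and this is, structurally, what an orchestrator like CrewAI manages for you.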
Why local still wins
There is a distinct feeling of power when you disconnect the internet and this thing still works.
Having this whole agentic system running fully locally is honestly the best. It’s fast, and more importantly, it is secure. I’m not sending my private code or sensitive data to an API endpoint. It lives on my Mac.
When I was trying to force the 13b model to work, I was fighting the hardware. With SLMs, I’m working with the hardware. The M1 handles these quantized smaller models without breaking a sweat.
The setup
Here is what my current stack looks like for this:
Hardware: Apple Silicon M1 (proving you don’t need an H100 cluster for this).
Inference: Ollama. It just works. It abstracts away the messiness of model weights and serving.
Orchestration: CrewAI. This is where the magic happens. It manages the "team" of agents, handling the hand-offs and task delegation.
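If you want to see what the Ollama layer of this stack looks like without any framework on top, here is a minimal sketch of talking to a local Ollama server over its REST API using only the standard library. It assumes Ollama is running on its default port (11434); the model name is just an example of a small model you might have pulled.

```python
# Minimal sketch: calling a local Ollama server directly over REST.
# Assumes `ollama serve` is running on the default port.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str, system: str = "") -> dict:
    # stream=False asks Ollama for one JSON object instead of a token stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    if system:
        payload["system"] = system  # this is the "specialization" knob
    return payload

def generate(model: str, prompt: str, system: str = "") -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt, system)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires a running Ollama server and a pulled model):
# print(generate("llama3.2:3b", "Why do SLMs suit agent swarms?"))
```

CrewAI abstracts this away, but it's worth knowing the whole "serving layer" is just one HTTP endpoint on localhost.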
It turns out I don't need a model as large as I thought I did. When you chain these components together, the aggregate intelligence of the system shoots up, even if the individual nodes are "dumber" than a GPT-4-class model.
Performance vs. latency
Speed is a feature.
When I’m coding or iterating on a product idea, I need flow. Waiting 30 seconds for a chunk of text breaks that flow. With these smaller models running locally, the response is snappy.
Yes, they are small, but together they are insanely capable.
I tested a workflow where one agent generates a Python script and another agent critiques it for security flaws. On a massive model, this would be a single, long-context prompt that takes time and costs money (if using an API). Locally, with SLMs, it happens in seconds.
If one agent messes up, the critique agent catches it. The error rate drops significantly, not because the model is smarter, but because the system is self-correcting.
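The self-correcting loop is the interesting part, so here is a toy sketch of its control flow. The generator and critic are stubs (a real version would call the SLM via Ollama, and the "bad pattern" the critic flags here is just an illustrative stand-in), but the retry-with-feedback structure is exactly what makes the system catch its own errors.

```python
# Toy sketch of the generate -> critique -> retry loop.
# Both agents are stubbed so the self-correction logic is visible.

def generator(task: str, feedback: str = "") -> str:
    # Stub: first attempt contains a flaw; with critic feedback, it's fixed.
    return "safe_script" if feedback else "script_with_eval"

def critic(code: str) -> str:
    # Stub security review: flag one illustrative bad pattern, else approve.
    return "uses eval, rewrite without it" if "eval" in code else "OK"

def run_with_review(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        code = generator(task, feedback)
        verdict = critic(code)
        if verdict == "OK":
            return code          # critic approved, we're done
        feedback = verdict       # loop back with the critique attached
    return code                  # best effort after max_rounds

result = run_with_review("write a python script")
```

Note that neither agent got smarter between rounds; the error rate drops purely because the critique is routed back into the next attempt.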
Bottom line
We are entering a phase where architecture matters more than model size.
If you are sitting on an Apple Silicon machine (M series) and ignoring local AI because you think you can't run the "good" models, you are missing the point. The "good" models are the ones that run fast enough to be useful.
Download Ollama. Spin up a CrewAI script. Use the small models. You’ll be surprised at how much you can get done when you stop waiting for a GPU to finish melting and start letting agents do the work.
I think local is the new cloud. And small is the new big.




