The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.
That clip up top isn’t a render. That’s Qwen 3 Coder — 30 billion parameters, 8-bit MLX — running on this MacBook with the Wi-Fi off. Around 55 tokens per second. Total cost to keep running it: zero.
The thing that matters more than the spec sheet is what it actually unlocks.
Why the M5 Max changes the math
Until now, running a 30B+ parameter model meant a GPU rack — or paying a cloud API per token. The M5 Max changes that:
- 128 GB unified memory. The entire model lives in fast RAM. No GPU offload, no quantization tricks past 8-bit. The CPU and GPU share the same physical memory, so there’s no copy step between them.
- Mixture-of-experts plays perfectly with Apple Silicon. Qwen 3 Coder is 30B total but only 3B active per token, so each decoded token streams roughly 3 GB of weights instead of 30. That’s a memory-bandwidth problem the M5 Max eats for breakfast (back-of-envelope after this list).
- MLX runs at near-CUDA speed. Apple’s native ML framework hits ~55 tok/s on the 8-bit quant. No CUDA tax, no Nvidia driver politics, no $40,000 GPU bill.
- It’s a regular store-bought laptop. No GPU rack. No data center. No cloud bill. You can run it on a plane.
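A quick back-of-envelope on that bandwidth point. Token generation is memory-bound: every decoded token has to stream all active weights through memory, so the theoretical ceiling is bandwidth divided by bytes read per token. The bandwidth number below is a placeholder for illustration, not an official M5 Max spec:

```python
# Decode ceiling for a memory-bound model: bandwidth / bytes-per-token.
# The bandwidth value is a PLACEHOLDER, not an official spec; plug in yours.
active_params = 3e9      # ~3B active per token (the "A3B" in the model name)
bytes_per_param = 1      # 8-bit quantization
bandwidth = 500e9        # bytes/sec, assumed for illustration

bytes_per_token = active_params * bytes_per_param            # ~3 GB per token
print(f"ceiling ~ {bandwidth / bytes_per_token:.0f} tok/s")  # ~167 with this guess
```

A dense 30B at 8-bit would stream ~30 GB per token instead, cutting that ceiling by 10×. That’s the MoE advantage in one division.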
This wasn’t possible on a laptop a year ago. It is now.
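If you want to watch the raw model run before wiring up any agent tooling, the mlx-lm package gets you there in a few lines. A minimal sketch; double-check the exact mlx-community model ID on Hugging Face before kicking off a ~30 GB download:

```python
# Smoke test with mlx-lm (pip install mlx-lm). Model ID assumed; verify first.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit")

messages = [{"role": "user", "content": "Write a Python find_median() function."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints the output plus tokens-per-second stats as it generates
generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```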
What you can do with it now
Read a legal contract — and have it never leave your machine.
Most AI tools pipe your document to a server somewhere. With this setup, the bytes don’t leave the laptop. NDAs, supplier agreements, employment contracts — review them at your kitchen table without uploading them to anyone.
Write production code in a couple of seconds.
The video shows it: real Python function, real Qwen output, no edits. The agent’s tool-calling is good enough to drop into Claude Code’s loop, where it’ll edit files, run shell commands, and iterate. It’s plenty for everything from one-off scripts to refactoring real production code.
Analyze patient charts without a HIPAA violation.
For doctors, therapists, intake clinics — anything with PHI on it — local-only AI isn’t a nice-to-have, it’s the only legal option. Same model, same speed, zero bytes leaving the device.
Build agents that don’t charge you per call.
This is the one most people sleep on. Pay-per-token cloud APIs make agents expensive to leave running. Once the model is local, you can let an agent loop overnight, hit it with thousands of requests, kick off a watcher that scans your inbox every two minutes — and the cost stays at zero.
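To make the zero-marginal-cost point concrete, here’s a hedged sketch of that inbox watcher. It assumes the localhost:4000 proxy described in the next section; fetch_new_mail() is a stand-in for whatever source you actually watch:

```python
# Polling watcher against the local model; every pass costs $0 in API fees.
import time
import requests

def ask_local(prompt: str) -> str:
    r = requests.post("http://localhost:4000/v1/messages", json={
        "model": "qwen3-coder-local",   # illustrative model name
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    })
    r.raise_for_status()
    return r.json()["content"][0]["text"]   # Messages API text block

def fetch_new_mail() -> list[str]:
    return []   # stand-in: wire this to your actual inbox or a watched folder

while True:
    for item in fetch_new_mail():
        print(ask_local(f"Summarize this and flag anything urgent:\n\n{item}"))
    time.sleep(120)   # every two minutes, forever, for free
```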
The full stack
Here’s the receipt:
- Hardware: M5 Max MacBook Pro, 128 GB unified memory.
- Model: Qwen3-Coder-30B-A3B-Instruct-MLX-8bit — about 30 GB on disk. Mixture-of-experts, ~3B params active per token.
- Server: A small Python proxy at localhost:4000 that speaks the Anthropic Messages API, so the Claude Code CLI thinks it’s talking to the cloud, when it’s actually talking to a model held in this laptop’s RAM (sketched below).
- Total monthly cost: $0 once it’s downloaded.
That’s it. No Docker, no Kubernetes, no VPS. Just a laptop on a desk.
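Here’s the proxy idea in one screen. A hedged sketch of the shape, not the actual code from the repo: accept a Messages-API request, run the local model, reply in the response shape Claude Code expects. A real version also needs streaming and tool-call translation:

```python
# Minimal Anthropic-Messages-shaped proxy over a local MLX model. Sketch only:
# no streaming, no tool calls, and it assumes string-only message content.
from flask import Flask, jsonify, request
from mlx_lm import load, generate

app = Flask(__name__)
model, tokenizer = load("mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit")

@app.post("/v1/messages")
def messages():
    body = request.get_json()
    prompt = tokenizer.apply_chat_template(
        body["messages"], add_generation_prompt=True)   # token ids
    text = generate(model, tokenizer, prompt=prompt,
                    max_tokens=body.get("max_tokens", 1024))
    return jsonify({
        "id": "msg_local", "type": "message", "role": "assistant",
        "model": body.get("model", "qwen3-coder-local"),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": len(prompt),
                  "output_tokens": len(tokenizer.encode(text))},
    })

app.run(port=4000)
```

Claude Code reads an ANTHROPIC_BASE_URL environment variable, so pointing the CLI at http://localhost:4000 is a one-line change; the repo’s README walks through the exact wiring.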
The performance, honestly
The local-AI space is full of overclaims, so the straight numbers:
- 55 tokens per second on a real coding task. Sustained, not peak. (A quick script to reproduce this follows the list.)
- Two seconds to write a working find_median() function. Three to four seconds for most refactors.
- Tool-calling reliability is good enough for the Claude Code agentic loop. Not as consistent as Sonnet 4.6, but plenty for getting work done.
- What it’s not: a Sonnet replacement for nuanced reasoning, long contexts, or really tricky debugging. For day-to-day code agent work, it more than holds its own.
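Reproducing the throughput number is a ten-line script against the proxy. This assumes the response fills in usage.output_tokens (the sketch above does); if yours doesn’t, count tokens client-side instead:

```python
# Crude sustained-throughput check against the local proxy.
import time
import requests

t0 = time.time()
r = requests.post("http://localhost:4000/v1/messages", json={
    "model": "qwen3-coder-local",
    "max_tokens": 512,
    "messages": [{"role": "user",
                  "content": "Write a Python find_median() with tests."}],
}).json()
elapsed = time.time() - t0

out_tokens = r["usage"]["output_tokens"]   # assumes the proxy reports usage
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens/elapsed:.1f} tok/s")
```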
Why the offline part matters
The reason “Wi-Fi off” keeps coming back in the demo isn’t a gimmick. It’s the whole thesis.
If a tool needs the internet, three things are true:
- Someone else can read what you sent.
- Someone else can charge you for it.
- Someone else can take it away.
If the same tool runs locally, none of those are true. That’s a different category of software. Not better at every task — but yours.
Benchmarks — actually run, not cited
Big claims need numbers. So here’s what Qwen 3 Coder 30B-A3B (8-bit MLX) actually scores on this MacBook, run end-to-end against the localhost:4000 server. For each problem, the model wrote one answer, the answer ran in a Python subprocess, and the result was scored pass/fail.
| Benchmark | Problems run | Pass@1 | Notes |
|---|---|---|---|
| HumanEval | 164/164 (full) | 81.7% | Python function-completion classic. Saturated benchmark; modern coding models cluster 75–95%. 14 min total wall-clock. |
| MBPP (sanitized) | 168/427 (sampled) | 83.3% | Mostly Basic Python Problems. The pass rate had stabilized by n=120; a few outlier tasks provoke very long model responses, so I stopped at 168. |
Both runs used pass@1: temperature=0, one sample per problem, a 10-second execution timeout, no retries, no best-of-N tricks, all on the local 8-bit MLX quant. A sketch of the harness follows.
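For transparency, this is the shape of the harness those numbers came from. A hedged sketch with illustrative names, not the exact script, but the pass/fail rule and the 10-second cap match the runs:

```python
# Pass@1 scoring: run the model's completion plus the benchmark's test code
# in a fresh subprocess; pass iff it exits 0 within the timeout.
import subprocess
import sys
import tempfile

def passes(completion: str, test_code: str, timeout: float = 10.0) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0   # failed asserts -> non-zero exit
    except subprocess.TimeoutExpired:
        return False                  # the 10-second execution cap

# pass@1, one greedy (temperature=0) sample per problem, no retries:
# score = sum(passes(s, t) for s, t in samples) / len(samples)
```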
For context — what the bigger sibling scores on harder benchmarks
The Qwen team didn’t publish HumanEval/MBPP for any Qwen3-Coder variant; they consider those benchmarks saturated. Their official benchmarks are agentic, and they ran them on the flagship Qwen3-Coder-480B-A35B-Instruct (the bigger sibling: 16× the total params and roughly 12× the active params of the 30B-A3B running on this laptop). Here’s what the flagship 480B scores on those harder agentic benchmarks next to the major closed models and DeepSeek-V3:
| Agentic Benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 | DeepSeek-V3 |
|---|---|---|---|---|
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | — | — |
| Terminal-Bench | 37.5 | 35.5 | 25.3 | 2.5 |
| BFCL-v3 (function calling) | 68.7 | 73.3 | 62.9 | 64.7 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 | 56.9 |
| WebArena | 49.9 | 51.1 | 44.3 | 40.0 |
Source: Qwen team’s official blog. The 30B-A3B running on this MacBook is a smaller sibling of the 480B — it trades absolute peak agentic ceiling for fitting in 30 GB and running 24/7 on local hardware. For most coding tasks people actually do in a day, HumanEval/MBPP-class accuracy matters more than the SWE-bench top-line, and on those it sits where it should: useful, fast, local.
Where this is heading
The next year of the AI conversation isn’t going to be “which model is smartest.” It’s going to be “which workloads belong on your machine, and which belong on someone else’s.”
Compliance-bound work — legal, medical, financial — is going to move local fast. Code-agent loops will follow because the math (per-call cost vs. zero) is brutal. The M5 Max with 128 GB of unified memory is the laptop that lets that happen.
Try it yourself
The launchers are open source on GitHub: nicedreamzapp/claude-code-local. The README walks through downloading the model and pointing Claude Code at the local server.
For law firms, medical practices, and accountants that want help getting this running on their own hardware — that’s what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.
— matt

