If you write code for a living and you’ve been watching the local-AI space, May 9, 2026 is the date to circle. Salvatore Sanfilippo (yes, the guy who wrote Redis) shipped ds4 — a few thousand lines of hand-written C with Metal compute kernels, built for exactly one model: DeepSeek V4 Flash.
I ran the same prompt through three engines on the same 128 GB MacBook Pro:
- DeepSeek V4 Flash via ds4 — fully local, off-cloud
- Cloud Claude through my Max plan
- Gemma 4 31B via MLX, also local
Local DeepSeek beat cloud Claude on wall-clock time. That sentence used to be science fiction.
▶ Watch the companion video (youtu.be/7l8-s8xkpms) — three engines, one prompt, three completely different aurora animations rendered in real time on the same machine.
The benchmark, for people who don’t want filler
| Engine | Time | Output | Where it ran |
|---|---|---|---|
| DeepSeek V4 Flash (ds4 local) | 103 s | 3,259 tokens | Apple Silicon GPU |
| Cloud Claude (Max plan) | 192 s | ~3,500 tokens | Anthropic data center |
| Gemma 4 31B (MLX local) | 131 s | 1,992 tokens | Apple Silicon GPU |
The prompt was a single creative HTML task: “Build an animated northern lights scene — single file, vanilla JS, mountains, pine trees, twinkling stars, flowing aurora bands.”
Each engine produced a completely different aurora. None of them hit the network during inference. (Yes, I checked with lsof. Yes, this is the same lsof audit pattern from the AirGap NDA piece.)
Three architectural decisions in ds4 worth understanding
This is the part that matters if you’re a developer thinking about local AI infrastructure.
1. Asymmetric 2-bit quantization (only where quality is forgiving)
The naive approach to quantization treats every weight the same. ds4 doesn’t. Only the routed experts in the Mixture-of-Experts layers get compressed to 2-bit (specifically IQ2_XXS for up/gate projections, Q2_K for down projections). Every quality-critical path — shared experts, attention projections, routing, output head — stays at higher precision (Q8 or full).
Those routed experts are about 90% of the weight footprint. The other 10% is where small precision losses cause big accuracy losses. Quantize the 90%, leave the 10%, and you get an 81 GB file that still calls tools cleanly and writes coherent code.
This is the kind of tradeoff that only makes sense if you’ve stared at a specific model’s loss landscape long enough to know which weights tolerate compression. It’s a model-specific engineering decision dressed as a quantization recipe.
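To make the recipe concrete, here’s a minimal sketch of a role-based quantization policy in C. The tensor-naming scheme and the `pick_quant` helper are hypothetical illustrations, not ds4’s actual code; only the quant types (IQ2_XXS, Q2_K, Q8) come from the recipe above.

```c
#include <stdio.h>
#include <string.h>

/* GGML-style quant types named in the recipe above. */
typedef enum { IQ2_XXS, Q2_K, Q8_0 } quant_t;

/* Hypothetical policy sketch, not ds4's actual API: routed experts
 * (~90% of the weight footprint) take the 2-bit hit; every
 * quality-critical tensor stays at Q8 or better. */
static quant_t pick_quant(const char *name) {
    int routed = strstr(name, ".exps.") != NULL;  /* assumed tensor naming */
    if (routed && strstr(name, "ffn_down")) return Q2_K;    /* down proj */
    if (routed)                             return IQ2_XXS; /* up/gate   */
    return Q8_0;  /* shared experts, attention, router, output head */
}

int main(void) {
    const char *tensors[] = {
        "blk.7.exps.ffn_up",    /* routed expert, up proj   */
        "blk.7.exps.ffn_down",  /* routed expert, down proj */
        "blk.7.attn_q",         /* attention projection     */
    };
    static const char *label[] = { "IQ2_XXS", "Q2_K", "Q8_0" };
    for (int i = 0; i < 3; i++)
        printf("%-22s -> %s\n", tensors[i], label[pick_quant(tensors[i])]);
    return 0;
}
```

The whole policy is a switch on tensor role, which is exactly why it only works when you know the one model you’re serving.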
2. KV cache moved to disk (in 2026 SSDs are fast enough)
The “KV cache must live in RAM” assumption is from 2023. Modern Apple SSDs do 5+ GB/s sequential reads. ds4 writes session state to disk and reuses it across runs, keyed by SHA1 of token IDs.
The practical effect: when Claude Code sends its 25k-token system prompt, that prefill happens exactly once, ever. Every subsequent session — including totally different agent runs that happen to share that prefix — reads from disk in milliseconds instead of recomputing from token zero.
If you’ve used long-context models locally, you know prefill is the slowest thing in the loop. ds4 makes it free after the first hit. That’s the kind of “small change, huge implication” move that took years to normalize. (See also: the disk-KV section in the ds4 README.)
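Here’s a minimal sketch of the prefix-keying mechanism, using macOS’s CommonCrypto for the SHA1. The cache layout, file naming, and `kv_cache_path` helper are hypothetical, not ds4’s actual on-disk format:

```c
#include <stdio.h>
#include <stdint.h>
#include <CommonCrypto/CommonDigest.h>  /* CC_SHA1, ships with macOS */

/* Hypothetical sketch, not ds4's on-disk format: derive a cache
 * filename from the SHA1 of the token-ID prefix, so any session
 * sharing that prefix reuses the precomputed KV state instead of
 * re-running prefill from token zero. */
static void kv_cache_path(const int32_t *tokens, size_t n_tokens,
                          const char *dir, char *out, size_t out_len) {
    unsigned char digest[CC_SHA1_DIGEST_LENGTH];
    CC_SHA1(tokens, (CC_LONG)(n_tokens * sizeof tokens[0]), digest);

    char hex[2 * CC_SHA1_DIGEST_LENGTH + 1];
    for (int i = 0; i < CC_SHA1_DIGEST_LENGTH; i++)
        snprintf(hex + 2 * i, 3, "%02x", digest[i]);
    snprintf(out, out_len, "%s/%s.kv", dir, hex);
}

int main(void) {
    int32_t prompt[] = { 1, 15043, 3186, 29991 };  /* token IDs of a prefix */
    char path[512];
    kv_cache_path(prompt, 4, "/tmp/ds4-kv", path, sizeof path);
    /* On a cache hit this file is read back; on a miss, prefill runs
     * once and the resulting KV state is written here. */
    printf("KV cache file: %s\n", path);
    return 0;
}
```

Because the key depends only on the token IDs, two unrelated agent sessions that share the same 25k-token system prompt hash to the same file, and both skip the prefill.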
3. Pure Metal, not CUDA-with-a-shim
There’s no PyTorch, no TensorFlow, no llama.cpp wrapper layer in the hot path. The compute kernels under metal/*.metal are written specifically for this one model on this one architecture. The acknowledgments thank llama.cpp and GGML — ds4 borrows quant layouts and select kernels — but it’s not a fork.
This narrowness is the point. Generic frameworks pay a tax for being generic. When you commit to one model on one chip, you can hand-tune away that tax. ~27 tok/s on an M3 Max 128 GB. ~32 tok/s on M5 Max. For agent loops on a laptop, that’s plenty.
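A toy C example of that generic tax, purely illustrative (the `HIDDEN` width is a placeholder, not DeepSeek V4 Flash’s real dimension): when a shape is a compile-time constant, the inner loop has a known trip count the compiler can unroll and vectorize; a generic engine has to handle whatever arrives at runtime.

```c
#include <stddef.h>

/* Toy illustration, not ds4 code. HIDDEN stands in for a model
 * dimension that a single-model engine can bake in at compile time
 * instead of reading from a config at runtime. */
#define HIDDEN 4096  /* placeholder width */

/* Specialized path: the trip count is a compile-time constant, so
 * the compiler can fully unroll and vectorize the inner loop. */
void matvec_fixed(const float *w, const float *x, float *y, size_t rows) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < HIDDEN; c++)
            acc += w[r * HIDDEN + c] * x[c];
        y[r] = acc;
    }
}

/* Generic path: shapes arrive at runtime, so the compiler must plan
 * for any trip count; a general framework pays this kind of tax
 * again across dtypes, layouts, and backends. */
void matvec_generic(const float *w, const float *x, float *y,
                    size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += w[r * cols + c] * x[c];
        y[r] = acc;
    }
}
```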
Why this matters for compliance-sensitive devs
The same week, I’m still maintaining AirGap AI — a Wi-Fi-off, lsof-audited workflow for analyzing privileged documents (NDAs, client files, PHI, etc.) on a laptop with no outbound connections. Until last week, that was a Llama 3.3 70B story. The capability ceiling was real.
ds4 raises that ceiling materially:
- 1M-token context — entire codebases, full deposition transcripts, complete contract sets, all in-memory in a single conversation
- Quasi-frontier reasoning — if you’ve used Claude Sonnet or Opus, DeepSeek V4 Flash sits in the same neighborhood for most agentic tasks
- Tool calling that works — Antirez tested it under coding agents (opencode, Pi, Claude Code) and the tool calls land reliably
For law firms, medical practices, and compliance-bound shops, the math just changed. You don’t have to choose between “frontier-grade reasoning” and “data never leaves the building.” The hardware exists, the engine exists, the model exists, and the integration with Claude Code exists.
(If you’re trying to get a bar-association-defensible AI workflow off the ground, the AirGap landing page is where I keep my notes. The ds4 stack is going in there next week.)
How to actually run it
The full stack, all open-source:
```bash
# 1. Build the engine (Apple Silicon with Metal)
git clone https://github.com/antirez/ds4
cd ds4 && make

# 2. Pull the q2 weights (~81 GB)
./download_model.sh q2

# 3. Boot the local Anthropic-compatible server
./ds4-server --ctx 200000 --kv-disk-dir ~/Library/Caches/ds4-kv \
  --kv-disk-space-mb 16384

# 4. Point Claude Code at it
ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
ANTHROPIC_AUTH_TOKEN=dsv4-local \
ANTHROPIC_MODEL=deepseek-v4-flash \
claude
```
Or just clone nicedreamzapp/claude-code-local — DeepSeek V4 Flash is now the fourth fighter in the lineup, with a claude-ds4 wrapper that handles all of the above for you.
What this slots into
This isn’t a one-off. It’s the next click in a longer arc I’ve been writing about:
- Three Generations of Running Claude Code Locally on a MacBook — What I Actually Learned — the long path from “barely works” to “actually replaces my cloud usage”
- Cloud AI Coding Costs Keep Climbing — How to Pay $0 and Still Use Claude Code — the economic angle, before ds4
- Pulling 10x My Subscription Value Out of Claude — what the cloud math actually looks like
- What It’s Actually Like to Code By Voice — With the AI Replying In My Own Cloned Voice — the voice loop these models now plug into
- Your Medical Practice Is Probably Using Cloud AI on PHI Right Now — why on-device matters for healthcare
- If Your Law Firm Is Using Cloud AI on Client Files, You Probably Have a Problem — the legal angle
- A Field Guide to Ambient Computing — the bigger frame this all sits inside
ds4 is the engine that finally makes the local-first version of all of those usable for production work. The local agent doesn’t have to pick which workload it’s good at anymore.
Where to follow
- 🛠️ github.com/nicedreamzapp/claude-code-local — the lineup, launchers, and benchmarks
- 🐳 github.com/antirez/ds4 — the engine itself
- 🌿 marijuanaunion.com — the broader writing on local AI, voice, and ambient computing
- 🔒 nicedreamzwholesale.com/airgap — the compliance-grade workflow notes
- 💬 Discord — NiceDreamzApps server
May 9, 2026. The day a few thousand lines of C caught up to the data centers.
This is the technical companion to the headline piece on Marijuana Union. Companion video: youtu.be/7l8-s8xkpms. For local-AI consulting on compliance-sensitive workloads, see AirGap AI.
