HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

That’s not a slide deck. It’s three of the actual HumanEval problems being solved by the local model on this MacBook — prompt on the left, generated code streaming on the right, real test going green at the end. The stopwatch is real. The Wi-Fi is off.

Final score: 81.7% pass@1. One hundred and thirty-four out of one hundred and sixty-four. Fourteen minutes wall-clock. Single sample per problem, temperature zero, no retries.

Why anyone should care about this number

The Qwen team didn’t publish HumanEval scores for any Qwen3-Coder variant. They consider the benchmark saturated; for cloud-served frontier models, fair enough, it is. They went straight to agentic benchmarks (SWE-bench Verified, BFCL, Aider-Polyglot) and ran them on the flagship 480B sibling.

For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

The methodology, in one screen

| Setting | Value |
| --- | --- |
| Benchmark | HumanEval — 164 Python tasks (full) |
| Metric | pass@1 (first attempt only) |
| Temperature | 0.0 — deterministic |
| Sampling | single sample per problem, no best-of-N |
| Execution | Python subprocess, 10 s timeout |
| Hardware | M5 Max MacBook Pro · 128 GB unified memory |
| Model | Qwen3-Coder-30B-A3B-Instruct-MLX-8bit |
| Network | Wi-Fi OFF the entire run |
| Wall clock | 14 minutes |
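
The execution row in that table reduces to a few lines of Python. This is a sketch, not the harness used for the run: `run_candidate` and the toy task below are illustrative names, and real HumanEval scoring concatenates the task prompt, the model completion, and the canonical test into one program before executing it.

```python
import subprocess
import sys
import tempfile

def run_candidate(program: str, timeout: float = 10.0) -> bool:
    """Return True iff the program exits cleanly within the timeout (a pass)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # matches the 10 s budget in the table
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # A hung generation counts as a failure, not a crash of the harness.
        return False

# Toy stand-in task: a completion plus its unit test, as one program.
program = (
    "def add(a, b):\n"
    "    return a + b\n"
    "assert add(2, 3) == 5\n"
)
```

With temperature 0 and a single sample per problem, pass@1 is simply the fraction of tasks whose assembled program exits 0.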

Three real problems from the run

Picked from the actual results file — these are the problems that play in the video, with the canonical solutions verified to pass:

  1. HumanEval/13 · greatest_common_divisor — model output: while b: a, b = b, a % b; return a · 1.76s · PASS
  2. HumanEval/2 · truncate_number — model output: return number % 1.0 · 1.83s · PASS
  3. HumanEval/14 · all_prefixes — model output: a list comprehension over range(len(string)) · 1.49s · PASS
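
Written out in full (reconstructed from the outputs listed above, so the exact formatting is mine, not a verbatim transcript of the model), the three solutions are:

```python
from typing import List

def greatest_common_divisor(a: int, b: int) -> int:
    # HumanEval/13: Euclid's algorithm, the exact loop the model emitted.
    while b:
        a, b = b, a % b
    return a

def truncate_number(number: float) -> float:
    # HumanEval/2: fractional part via modulo.
    return number % 1.0

def all_prefixes(string: str) -> List[str]:
    # HumanEval/14: list comprehension over range(len(string)).
    return [string[:i + 1] for i in range(len(string))]
```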

And 131 more passes like them, plus 30 failures along the way. That’s the honest part. The full results file is in the GitHub repo if you want to see exactly which ones failed.

The companion number — MBPP

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample (140/168). The pass rate had been stable since n=120; a full run over all 427 problems was impractical because a few outlier tasks induce very long model responses (10+ minutes each, likely repetitive generation patterns). At n=168 the 95% confidence interval is roughly ±5–6 percentage points, so the number is solid.
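
The interval can be sanity-checked with a normal-approximation (Wald) binomial confidence interval. A sketch, assuming the 83.3% corresponds to 140 of 168 passes; a Wilson interval would give a slightly different, asymmetric answer:

```python
import math

def wald_ci_halfwidth(passed: int, total: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a pass rate (z = 1.96)."""
    p = passed / total
    return z * math.sqrt(p * (1 - p) / total)

# 140/168 = 83.3% pass@1; half-width comes out around 0.056 (±5.6 points).
half = wald_ci_halfwidth(140, 168)
```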

For context — the 480B flagship’s agentic numbers

| Agentic benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 |
| --- | --- | --- | --- |
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | n/a |
| Terminal-Bench | 37.5 | 35.5 | 25.3 |
| BFCL-v3 | 68.7 | 73.3 | 62.9 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 |

Source: Qwen team’s official blog. The 30B running on this MacBook is a smaller sibling — trades absolute peak agentic ceiling for fitting in 30 GB and running 24/7 on local hardware.

Reproduce it yourself
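
A minimal local smoke test, assuming the mlx-lm CLI and an MLX 8-bit build of the model on Hugging Face. The model ID below is my guess at the published name, not a verified transcript, so check it against your local setup before pulling tens of gigabytes:

```shell
# Assumed setup: mlx-lm provides the mlx_lm.generate CLI.
pip install mlx-lm

# First run downloads the weights, so do this once BEFORE turning Wi-Fi off.
# Model ID is an assumption; substitute whatever MLX 8-bit build you use.
mlx_lm.generate \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit \
  --prompt "def greatest_common_divisor(a: int, b: int) -> int:"
```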

For law firms, medical practices, and accountants who want help getting this running on their own hardware — that’s what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt
