HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

That’s not a slide deck. It’s three of the actual HumanEval problems being solved by the local model on this MacBook — prompt on the left, generated code streaming on the right, real test going green at the end. The stopwatch is real. The Wi-Fi is off.

Final score: 81.7% pass@1. One hundred and thirty-four out of one hundred and sixty-four. Fourteen minutes wall-clock. Single sample per problem, temperature zero, no retries.

Why anyone should care about this number

The Qwen team didn’t publish HumanEval scores for any Qwen3-Coder variant. They consider the benchmark saturated; for cloud-served frontier models, fair enough, it is. They went straight to agentic benchmarks (SWE-bench Verified, BFCL, Aider-Polyglot) and ran them on the flagship 480B sibling.

For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

The methodology, in one screen

| Setting | Value |
| --- | --- |
| Benchmark | HumanEval — 164 Python tasks (full) |
| Metric | pass@1 (first attempt only) |
| Temperature | 0.0 — deterministic |
| Sampling | single sample per problem, no best-of-N |
| Execution | Python subprocess, 10 s timeout |
| Hardware | M5 Max MacBook Pro · 128 GB unified memory |
| Model | Qwen3-Coder-30B-A3B-Instruct-MLX-8bit |
| Network | Wi-Fi OFF the entire run |
| Wall clock | 14 minutes |
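
The execution row in that table reduces to a few lines of Python. This is a sketch, not the harness used for the run: `run_candidate` and the toy task below are illustrative names, and real HumanEval scoring concatenates the task prompt, the model completion, and the canonical test into one program before executing it.

```python
import subprocess
import sys
import tempfile

def run_candidate(program: str, timeout: float = 10.0) -> bool:
    """Return True iff the program exits cleanly within the timeout (a pass)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # matches the 10 s budget in the table
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # A hung generation counts as a failure, not a crash of the harness.
        return False

# Toy stand-in task: a completion plus its unit test, as one program.
program = (
    "def add(a, b):\n"
    "    return a + b\n"
    "assert add(2, 3) == 5\n"
)
```

With temperature 0 and a single sample per problem, pass@1 is simply the fraction of tasks whose assembled program exits 0.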

Three real problems from the run

Picked from the actual results file — these are the problems that play in the video, with the canonical solutions verified to pass:

  1. HumanEval/13 · greatest_common_divisor — model output: while b: a, b = b, a % b; return a · 1.76s · PASS
  2. HumanEval/2 · truncate_number — model output: return number % 1.0 · 1.83s · PASS
  3. HumanEval/14 · all_prefixes — model output: a list comprehension over range(len(string)) · 1.49s · PASS
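
Written out in full (reconstructed from the outputs listed above, so the exact formatting is mine, not a verbatim transcript of the model), the three solutions are:

```python
from typing import List

def greatest_common_divisor(a: int, b: int) -> int:
    # HumanEval/13: Euclid's algorithm, the exact loop the model emitted.
    while b:
        a, b = b, a % b
    return a

def truncate_number(number: float) -> float:
    # HumanEval/2: fractional part via modulo.
    return number % 1.0

def all_prefixes(string: str) -> List[str]:
    # HumanEval/14: list comprehension over range(len(string)).
    return [string[:i + 1] for i in range(len(string))]
```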

And 131 more passes like them, plus 30 failures along the way. That’s the honest part. The full results file is in the GitHub repo if you want to see exactly which ones failed.

The companion number — MBPP

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample (140/168). The pass rate had been stable since n=120; a full run over all 427 problems was impractical because a few outlier tasks induce very long model responses (10+ minutes each, likely repetitive generation patterns). At n=168 the 95% confidence interval is roughly ±5–6 percentage points, so the number is solid.
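
The interval can be sanity-checked with a normal-approximation (Wald) binomial confidence interval. A sketch, assuming the 83.3% corresponds to 140 of 168 passes; a Wilson interval would give a slightly different, asymmetric answer:

```python
import math

def wald_ci_halfwidth(passed: int, total: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a pass rate (z = 1.96)."""
    p = passed / total
    return z * math.sqrt(p * (1 - p) / total)

# 140/168 = 83.3% pass@1; half-width comes out around 0.056 (±5.6 points).
half = wald_ci_halfwidth(140, 168)
```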

For context — the 480B flagship’s agentic numbers

| Agentic benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 |
| --- | --- | --- | --- |
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | n/a |
| Terminal-Bench | 37.5 | 35.5 | 25.3 |
| BFCL-v3 | 68.7 | 73.3 | 62.9 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 |

Source: Qwen team’s official blog. The 30B running on this MacBook is a smaller sibling — trades absolute peak agentic ceiling for fitting in 30 GB and running 24/7 on local hardware.

Reproduce it yourself
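
A minimal local smoke test, assuming the mlx-lm CLI and an MLX 8-bit build of the model on Hugging Face. The model ID below is my guess at the published name, not a verified transcript, so check it against your local setup before pulling tens of gigabytes:

```shell
# Assumed setup: mlx-lm provides the mlx_lm.generate CLI.
pip install mlx-lm

# First run downloads the weights, so do this once BEFORE turning Wi-Fi off.
# Model ID is an assumption; substitute whatever MLX 8-bit build you use.
mlx_lm.generate \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit \
  --prompt "def greatest_common_divisor(a: int, b: int) -> int:"
```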

For law firms, medical practices, and accountants who want help getting this running on their own hardware — that’s what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt
