That’s not a slide deck. It’s three of the actual HumanEval problems being solved by the local model on this MacBook — prompt on the left, generated code streaming on the right, real test going green at the end. The stopwatch is real. The Wi-Fi is off.
Final score: 81.7% pass@1. One hundred and thirty-four out of one hundred and sixty-four. Fourteen minutes wall-clock. Single sample per problem, temperature zero, no retries.
Why anyone should care about this number
The Qwen team didn’t publish HumanEval scores for any Qwen3-Coder variant. They consider the benchmark saturated, and for cloud-served frontier models, fair enough: it is. They went straight to agentic benchmarks (SWE-bench Verified, BFCL-v3, Aider-Polyglot) and ran them on the flagship 480B sibling.
For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.
The methodology, in one screen
| Setting | Value |
|---|---|
| Benchmark | HumanEval — 164 Python tasks (full) |
| Metric | pass@1 (first attempt only) |
| Temperature | 0.0 — deterministic |
| Sampling | single sample per problem, no best-of-N |
| Execution | Python subprocess, 10s timeout |
| Hardware | M5 Max MacBook Pro · 128 GB unified memory |
| Model | Qwen3-Coder-30B-A3B-Instruct-MLX-8bit |
| Network | Wi-Fi OFF the entire run |
| Wall clock | 14 minutes |
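The execution row is the part worth trusting least on faith, so here is roughly what it looks like in code. This is a minimal sketch, not the exact harness from the repo; it assumes the HumanEval JSONL schema (prompt, test, and entry_point fields), which is what the official dataset ships:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(completion: str, problem: dict, timeout: float = 10.0) -> bool:
    # Stitch together prompt + model completion + the benchmark's own test
    # code, then execute in a throwaway subprocess so a hang or a crash
    # can't take down the harness. check() is defined by the test field.
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0  # non-zero exit means an assertion failed
    except subprocess.TimeoutExpired:
        return False  # the 10s timeout from the table
    finally:
        os.remove(path)
```

A zero exit code means every assertion in the task's check() passed; anything else, including the 10-second timeout, counts as a fail. That is the whole pass@1 story.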
Three real problems from the run
Picked from the actual results file: these are the problems that play in the video, with each generated solution verified to pass the benchmark’s tests:
- HumanEval/13 · greatest_common_divisor — model output: while b: a, b = b, a % b; return a · 1.76s · PASS
- HumanEval/2 · truncate_number — model output: return number % 1.0 · 1.83s · PASS
- HumanEval/14 · all_prefixes — model output: a list comprehension over range(len(string)) · 1.49s · PASS
And 161 more where those came from, including 30 failures; that’s the honest part. The full results file is in the GitHub repo if you want to see exactly which ones failed.
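For concreteness, here are those three solutions as runnable Python. The first two bodies are verbatim from the model output above; the all_prefixes body is a plausible reconstruction of the comprehension described there, and the signatures come from the HumanEval prompts:

```python
from typing import List

def greatest_common_divisor(a: int, b: int) -> int:
    # HumanEval/13: Euclid's algorithm, exactly as generated.
    while b:
        a, b = b, a % b
    return a

def truncate_number(number: float) -> float:
    # HumanEval/2: the fractional part of a positive float.
    return number % 1.0

def all_prefixes(string: str) -> List[str]:
    # HumanEval/14: every prefix, shortest first (body reconstructed
    # from the description above, not copied verbatim).
    return [string[: i + 1] for i in range(len(string))]
```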
The companion number — MBPP
I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. The pass rate had been stable since n=120, and the full 427-problem run was impractical because a few outlier tasks trigger very long model responses (10+ minutes each, likely repetitive generation). At n=168 the 95% confidence interval is about ±5.6 percentage points, so the number is reasonably solid.
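If you want to check that interval yourself, the normal-approximation math is three lines; this is the standard proportion formula, nothing specific to this run, and 140/168 is the pass count implied by 83.3%:

```python
import math

def pass_rate_ci(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    # 95% normal-approximation confidence interval for a pass@1 proportion.
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, half_width

p, half = pass_rate_ci(140, 168)  # 140 of 168 matches the reported 83.3%
print(f"{p:.1%} ± {half:.1%}")    # -> 83.3% ± 5.6%
```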
For context — the 480B flagship’s agentic numbers
| Agentic Benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 |
|---|---|---|---|
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | — |
| Terminal-Bench | 37.5 | 35.5 | 25.3 |
| BFCL-v3 | 68.7 | 73.3 | 62.9 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 |
Source: the Qwen team’s official blog. The 30B running on this MacBook is a smaller sibling: it trades peak agentic ceiling for weights that fit in roughly 30 GB and can run 24/7 on local hardware.
Reproduce it yourself
- Open-source launchers (clone, run): github.com/nicedreamzapp/claude-code-local
- HumanEval dataset: github.com/openai/human-eval (a minimal generation-and-scoring sketch follows this list)
- Hardware: any M-series MacBook with ≥32 GB RAM (128 GB Max preferred for the full 8-bit weights)
- Total monthly cost: $0 after the model download
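Here is a minimal sketch of the generation loop, assuming mlx-lm and the official harness are installed (pip install mlx-lm human-eval). The model path is illustrative; point it at whatever local copy of the 8-bit MLX weights you downloaded, and note that generate() keyword arguments have shifted across mlx-lm versions:

```python
from mlx_lm import load, generate
from human_eval.data import read_problems, write_jsonl

# Illustrative path: substitute your local 8-bit MLX weights.
model, tokenizer = load("Qwen3-Coder-30B-A3B-Instruct-MLX-8bit")

samples = []
for task_id, problem in read_problems().items():
    # Recent mlx-lm versions decode greedily by default, which matches
    # the temperature-0 setting in the methodology table above.
    completion = generate(model, tokenizer, prompt=problem["prompt"], max_tokens=512)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score with the official harness:
#   evaluate_functional_correctness samples.jsonl
```

In practice an instruct-tuned model usually wants the chat template applied and the fenced code extracted from the reply before scoring; that plumbing is the launcher’s job, not shown here.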
For law firms, medical practices, and accountants who want help getting this running on their own hardware — that’s what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.
— matt
