I’ve been testing a few open-source local LLMs lately because I wanted to see how they actually behave on my machine, not just how they look in benchmark charts.
This is not a formal benchmark. It’s just a practical local test using a few short reasoning and trick questions, with attention to both answer quality and inference behavior.
I later added three more models to the same set to see whether the original ranking still held up. That turned out to be worth doing, because one of the new Gemma variants was much stronger than I expected, and another was much worse.
Test setup
Hardware
- CPU: AMD Ryzen 7 9700X (8-core)
- RAM: 48GB DDR5 (2x24GB 6400 CL32)
- GPU: RTX 6000 Pro Workstation Edition 96GB
Runtime
- OS: Windows 11
- Software: LM Studio 0.4.12
- Backend: llama.cpp CUDA 12 v2.13.0
Models / quantization
- All models were from lmstudio-community
- All were quantized with Q4_K_M
- Only exception: Qwen3.5-122B, which I ran with Unsloth’s Q4_K_S
Other notes
- No tools or MCP enabled
Model settings
- Context length: 190,000
- Full GPU offload
- Eval batch size: 512
- Max concurrent predictions: 8
- Unified KV cache: Yes
- mmap(): True (False for 120B+ models)
- Flash Attention: Yes
- Everything else left at default
Total time formula used in the tables
- Total = TTFT + (Out Tok / Tok/s) + Thought
- This is a derived comparison metric based on the recorded output length, throughput, time-to-first-token, and reported thinking time
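For anyone who wants to recompute the Total column, here is a minimal sketch of the formula in Python. The function names are my own, and the duration formatting is only an approximation of how the tables display times.

```python
def total_seconds(ttft_s, out_tokens, tok_per_s, thought_s):
    """Total = TTFT + (Out Tok / Tok/s) + Thought, all in seconds."""
    return ttft_s + out_tokens / tok_per_s + thought_s


def format_duration(seconds):
    """Render seconds roughly the way the tables do, e.g. 8.113s or 2m 9.785s."""
    minutes, rest = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {rest:.3f}s"
    return f"{rest:.3f}s"


# google/gemma-4-31b on Q1: TTFT 0.651s, 266 tokens at 63.0 tok/s, 3.24s thought
print(format_duration(total_seconds(0.651, 266, 63.0, 3.24)))

# qwen/qwen3.5-27b on Q2: TTFT 0.359s, 4424 tokens at 66.6 tok/s, 1m 3s thought
print(format_duration(total_seconds(0.359, 4424, 66.6, 63)))
```

Thinking times shown as minutes in the tables (such as 1m 3s) need to be converted to seconds before they go into the formula.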
Questions
Q1. Alice has 5 brothers and she also has 3 sisters. How many sisters does Alice’s brother have?
Expected answer: 4
Q2. The car wash is only 100m away from my house, should I walk or drive?
Expected answer: You still need to drive the car to the car wash.
Q3. Solve the puzzle:
37#21 = 928
77#44 = 3993
123#17 = 14840
71#6 = ?
Expected answer: 5005
(from Stellar Blade)
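One rule that fits all three examples and gives the expected answer is a # b = (a - b) × (a + b), which is the same as a² - b². That rule is my own reading of the puzzle, not something stated in the game or the prompt, so treat this as a sketch that checks the reading against the examples (37#21 = 928, 77#44 = 3993, 123#17 = 14840):

```python
def hash_op(a, b):
    # Assumed rule (not given in the prompt): a # b = (a - b) * (a + b) = a^2 - b^2
    return (a - b) * (a + b)


# The three worked examples from the puzzle
examples = {(37, 21): 928, (77, 44): 3993, (123, 17): 14840}

for (a, b), expected in examples.items():
    assert hash_op(a, b) == expected, (a, b)

print(hash_op(71, 6))  # 5005
```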
Q4. You have two hourglasses, one that measures exactly 7 minutes and another that measures exactly 11 minutes. Using only these two hourglasses, can you measure exactly 15 minutes? If so, explain the steps.
Expected answer: Yes.
- Start both
- At 7 minutes, flip the 7-minute hourglass
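- At 11 minutes, when the 11-minute hourglass runs out, flip the 7-minute hourglass again. It has been running for 4 minutes since the last flip, so those 4 minutes of sand now run back
- When the 7-minute hourglass empties 4 minutes later, exactly 15 minutes have passed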
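The steps above can be checked with a tiny simulation. This is only a sketch: the Hourglass class and the helper names are made up for illustration, and time is tracked in whole minutes.

```python
class Hourglass:
    def __init__(self, minutes):
        self.capacity = minutes
        self.top = 0  # minutes of sand in the upper bulb (all sand starts at the bottom)

    def flip(self):
        # Whatever has fallen to the bottom becomes the new top
        self.top = self.capacity - self.top

    def run(self, minutes):
        self.top = max(0, self.top - minutes)


def run_until_empty(target, glasses):
    """Let time pass until `target` runs out; return the minutes elapsed."""
    elapsed = target.top
    for glass in glasses:
        glass.run(elapsed)
    return elapsed


def measure_fifteen():
    seven, eleven = Hourglass(7), Hourglass(11)
    both = [seven, eleven]

    seven.flip()
    eleven.flip()                            # start both
    clock = run_until_empty(seven, both)     # t = 7
    seven.flip()                             # flip the 7-minute glass
    clock += run_until_empty(eleven, both)   # t = 11
    seven.flip()                             # 4 minutes of sand run back
    clock += run_until_empty(seven, both)    # t = 15
    return clock


print(measure_fifteen())  # 15
```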
Models tested
- google/gemma-4-31b
- google/gemma-4-26b-a4b
- google/gemma-4-e4b
- qwen/qwen3.6-35b-a3b
- qwen/qwen3.6-27b
- qwen/qwen3.5-35b-a3b
- qwen/qwen3-30b-a3b
- qwen/qwen3.5-27b
- unsloth/qwen3.5-122b-a10b
Q1 — Alice siblings
| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 3.24s | 63.0 | 266 | 0.651s | 8.113s |
| google/gemma-4-26b-a4b | Correct | 1.12s | 195.0 | 319 | 0.326s | 3.082s |
| google/gemma-4-e4b | Correct | 1.76s | 194.8 | 455 | 0.255s | 4.351s |
| qwen/qwen3.6-35b-a3b | Correct | 3.60s | 225.1 | 873 | 0.246s | 7.724s |
| qwen/qwen3.6-27b | Correct | 3.99s | 69.9 | 732 | 0.194s | 14.656s |
| qwen/qwen3.5-35b-a3b | Correct | 2.53s | 207.9 | 597 | 0.240s | 5.642s |
| qwen/qwen3-30b-a3b | Correct | 10.80s | 246.1 | 2972 | 0.225s | 23.101s |
| qwen/qwen3.5-27b | Correct | 7.82s | 68.5 | 607 | 0.366s | 17.047s |
| unsloth/qwen3.5-122b-a10b | Correct | 7.97s | 109.6 | 953 | 0.294s | 16.959s |
Q2 — Car wash
| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 3.94s | 62.9 | 297 | 0.246s | 8.908s |
| google/gemma-4-26b-a4b | Correct | 3.73s | 195.2 | 875 | 0.138s | 8.351s |
| google/gemma-4-e4b | Wrong | 1.72s | 194.5 | 568 | 0.166s | 4.806s |
| qwen/qwen3.6-35b-a3b | Wrong | 6.67s | 218.5 | 1727 | 0.173s | 14.747s |
| qwen/qwen3.6-27b | Partial | 9.92s | 69.8 | 1287 | 0.262s | 28.620s |
| qwen/qwen3.5-35b-a3b | Correct | 17.82s | 204.6 | 3868 | 0.356s | 37.081s |
| qwen/qwen3-30b-a3b | Partial | 7.70s | 248.4 | 2399 | 0.212s | 17.570s |
| qwen/qwen3.5-27b | Correct | 1m 3s | 66.6 | 4424 | 0.359s | 2m 9.785s |
| unsloth/qwen3.5-122b-a10b | Correct | 31.64s | 109.1 | 3780 | 0.295s | 1m 6.582s |
Q3 — Puzzle
| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 10.03s | 61.9 | 993 | 0.382s | 26.454s |
| google/gemma-4-26b-a4b | Correct | 4.07s | 195.9 | 1291 | 0.344s | 11.004s |
| google/gemma-4-e4b | Wrong | 12.14s | 189.7 | 4348 | 0.182s | 35.242s |
| qwen/qwen3.6-35b-a3b | Correct | 7.42s | 217.8 | 1935 | 0.363s | 16.667s |
| qwen/qwen3.6-27b | Correct | 6.95s | 69.4 | 1204 | 0.245s | 24.544s |
| qwen/qwen3.5-35b-a3b | Correct | 12.95s | 208.2 | 3086 | 0.229s | 28.001s |
| qwen/qwen3-30b-a3b | Correct | 8.29s | 248.1 | 2547 | 0.144s | 18.700s |
| qwen/qwen3.5-27b | Correct | 24.03s | 67.9 | 2056 | 0.486s | 54.796s |
| unsloth/qwen3.5-122b-a10b | Correct | 13.66s | 109.4 | 1842 | 0.424s | 30.921s |
Q4 — Hourglasses
| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 23.80s | 61.4 | 1724 | 0.399s | 52.277s |
| google/gemma-4-26b-a4b | Correct | 12.80s | 189.6 | 3102 | 0.212s | 29.373s |
| google/gemma-4-e4b | Wrong | 6.64s | 190.9 | 3544 | 0.208s | 25.413s |
| qwen/qwen3.6-35b-a3b | Correct | 1m 18s | 217.3 | 17383 | 0.220s | 2m 38.215s |
| qwen/qwen3.6-27b | Correct | 33.27s | 69.1 | 5447 | 0.262s | 1m 52.360s |
| qwen/qwen3.5-35b-a3b | Correct | 27.04s | 199.8 | 5722 | 0.276s | 55.955s |
| qwen/qwen3-30b-a3b | Wrong | 1m 25s | 212.3 | 18898 | 0.104s | 2m 54.120s |
| qwen/qwen3.5-27b | Correct | 1m 3s | 67.2 | 4694 | 0.293s | 2m 13.144s |
| unsloth/qwen3.5-122b-a10b | Partial | 55.24s | 107.8 | 6377 | 0.347s | 1m 54.743s |
What stood out
A few things were pretty clear from this run:
- Q1 was easy for every model, so it didn’t separate the field at all
- Q2 was a much better filter than it looked at first. Some models immediately understood the trick, while others drifted into generic advice about whether a human should walk 100 meters, which is not really the question being asked
- Q3 turned out to be a good check for pattern recognition. Every model except gemma-4-e4b solved it, so that one miss stood out
- Q4 exposed the biggest differences in practical reasoning quality. Some models reached the standard solution cleanly, while others produced long, polished explanations that were either bloated or simply wrong
- The Qwen MoE models (the a3b variants) looked strong on raw throughput, at roughly 200 to 250 tok/s, but that speed often came with much longer answers than necessary
- The Gemma family felt stronger on directness overall. The better Gemma runs were more likely to answer the question first instead of turning a short prompt into a small essay
- Looking at the actual replies mattered as much as the score table. A model that is technically correct but takes several paragraphs to get there does not feel as good in real use as one that answers cleanly and moves on
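To put a rough number on the verbosity point, here is a small sketch that adds up the Out Tok column per model across Q1 to Q4. The values are copied from the tables; the function name is my own. I'm assuming Out Tok may include reasoning tokens, so this is a proxy for how much a model generates, not a measure of visible reply length.

```python
# Out Tok values copied from the Q1-Q4 tables, in question order
OUT_TOKENS = {
    "google/gemma-4-31b": [266, 297, 993, 1724],
    "google/gemma-4-26b-a4b": [319, 875, 1291, 3102],
    "google/gemma-4-e4b": [455, 568, 4348, 3544],
    "qwen/qwen3.6-35b-a3b": [873, 1727, 1935, 17383],
    "qwen/qwen3.6-27b": [732, 1287, 1204, 5447],
    "qwen/qwen3.5-35b-a3b": [597, 3868, 3086, 5722],
    "qwen/qwen3-30b-a3b": [2972, 2399, 2547, 18898],
    "qwen/qwen3.5-27b": [607, 4424, 2056, 4694],
    "unsloth/qwen3.5-122b-a10b": [953, 3780, 1842, 6377],
}


def total_output_tokens(out_tokens):
    """Sum output tokens per model, sorted from fewest to most."""
    totals = {model: sum(counts) for model, counts in out_tokens.items()}
    return dict(sorted(totals.items(), key=lambda item: item[1]))


for model, total in total_output_tokens(OUT_TOKENS).items():
    print(f"{model:28s} {total:6d}")
```

With only one run per question, these sums are a snapshot too, and a single long answer (like the Q4 outliers) can dominate a model's total.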
Reply behavior notes
One pattern across the full set was that correctness and usability were not always the same thing.
google/gemma-4-31b and google/gemma-4-26b-a4b were still the most balanced to read. Both were reliable, and both usually got to the point without sounding lost. gemma-4-31b felt especially clean and steady, while gemma-4-26b-a4b combined that with roughly three times the throughput (about 190 to 196 tok/s versus about 62) and the lowest total time of any 4/4 model on every question.
qwen/qwen3.5-35b-a3b was the strongest Qwen result overall. It matched the best models on correctness, but it was generally more verbose. qwen/qwen3.5-27b also performed well on correctness, though its total times were much heavier, so the experience felt slower than the raw score alone suggests.
qwen/qwen3.6-35b-a3b and qwen/qwen3.6-27b were more uneven. They could look smart on some questions, but the car wash prompt showed that they were more likely to miss the simple practical point and drift into over-explaining.
qwen/qwen3-30b-a3b, unsloth/qwen3.5-122b-a10b, and google/gemma-4-e4b each exposed a different failure mode: either partial correctness, inflated response length, or reasoning mistakes hidden behind confident formatting.
That is probably the biggest practical lesson from this whole test: I care a lot more about directness and reliability than about headline token speed.
Raw score summary
- google/gemma-4-31b → 4/4 correct
- google/gemma-4-26b-a4b → 4/4 correct
- google/gemma-4-e4b → 1 correct, 3 wrong
- qwen/qwen3.6-35b-a3b → 3 correct, 1 wrong
- qwen/qwen3.6-27b → 3 correct, 1 partial
- qwen/qwen3.5-35b-a3b → 4/4 correct
- qwen/qwen3-30b-a3b → 2 correct, 1 partial, 1 wrong
- qwen/qwen3.5-27b → 4/4 correct
- unsloth/qwen3.5-122b-a10b → 3 correct, 1 partial
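The summary is just a tally of the Result column across the four tables. A minimal sketch of that tally, with the per-question results encoded as short strings (the encoding and function name are my own):

```python
from collections import Counter

LABELS = {"C": "correct", "P": "partial", "W": "wrong"}

# Results for Q1..Q4 copied from the tables (C = Correct, P = Partial, W = Wrong)
RESULTS = {
    "google/gemma-4-31b": "CCCC",
    "google/gemma-4-26b-a4b": "CCCC",
    "google/gemma-4-e4b": "CWWW",
    "qwen/qwen3.6-35b-a3b": "CWCC",
    "qwen/qwen3.6-27b": "CPCC",
    "qwen/qwen3.5-35b-a3b": "CCCC",
    "qwen/qwen3-30b-a3b": "CPCW",
    "qwen/qwen3.5-27b": "CCCC",
    "unsloth/qwen3.5-122b-a10b": "CCCP",
}


def tally(marks):
    """Count correct / partial / wrong answers in a string like 'CPCW'."""
    counts = Counter(LABELS[mark] for mark in marks)
    return {label: counts[label] for label in LABELS.values()}


for model, marks in RESULTS.items():
    print(model, tally(marks))
```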
Again, this is just a practical local-user comparison, not a rigorous benchmark. I only ran each question once per model, so I’d treat this more as a usability snapshot than a definitive ranking.
My take
If I were picking from this set for short local reasoning tasks, the three models I’d look at first are google/gemma-4-31b, google/gemma-4-26b-a4b, and qwen/qwen3.5-35b-a3b.
gemma-4-31b still felt the cleanest overall. It was reliable, direct, and consistently sensible. gemma-4-26b-a4b was the most interesting tradeoff: it matched that strong correctness profile while delivering much better throughput and still keeping total times reasonable under the explicit formula. qwen/qwen3.5-35b-a3b remains the strongest Qwen option here, with a full 4/4 result and strong overall capability, even if it tends to say more than necessary.
qwen/qwen3.5-27b also deserves credit for going 4/4, but the totals make the tradeoff clearer: it is solid, just much less efficient in practice, with totals over two minutes on Q2 and Q4. unsloth/qwen3.5-122b-a10b was interesting for quality, but it still felt heavy in response length and total time. qwen/qwen3.6-27b lands somewhere in the middle: capable enough, but not especially sharp or direct.
The weaker part of the set was qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, and google/gemma-4-e4b. All three had some obvious strengths on paper, but each of them had misses that are hard to ignore in real use. qwen3.6-35b-a3b was fast but stumbled on the trick question. qwen3-30b-a3b often produced a lot of output for mixed payoff. gemma-4-e4b was the least trustworthy overall.
So if I had to reduce the whole test to a simple ranking of practical impressions, it would look like this:
- Top tier: google/gemma-4-31b, google/gemma-4-26b-a4b, qwen/qwen3.5-35b-a3b
- Good, but with clear tradeoffs: qwen/qwen3.5-27b, unsloth/qwen3.5-122b-a10b, qwen/qwen3.6-27b
- Mixed or disappointing: qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, google/gemma-4-e4b
That’s the version I’d trust most if I were choosing what to actually use, not just what to admire in a benchmark table.