Gemma 4 vs Qwen Models: A Practical Local LLM Test

I’ve been testing a few open-source local LLMs lately because I wanted to see how they actually behave on my machine, not just how they look in benchmark charts.

This is not a formal benchmark. It’s just a practical local test using a few short reasoning and trick questions, with attention to both answer quality and inference behavior.

I later added three more models to the same set to see whether the original ranking still held up. That turned out to be worth doing, because one of the new Gemma variants was much stronger than I expected, and another was much worse.

Test setup

Hardware

  • CPU: AMD Ryzen 7 9700X (8-core)
  • RAM: 48GB DDR5 (2x24GB 6400 CL32)
  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB)

Runtime

  • OS: Windows 11
  • Software: LM Studio 0.4.12
  • Backend: llama.cpp CUDA 12 v2.13.0

Models / quantization

  • All models were from lmstudio-community
  • All were quantized with Q4_K_M
  • Only exception: Qwen3.5-122B, which I ran with Unsloth’s Q4_K_S

Other notes

  • No tools or MCP enabled

Model settings

  • Context length: 190,000
  • Full GPU offload
  • Eval batch size: 512
  • Max concurrent predictions: 8
  • Unified KV cache: Yes
  • mmap(): True
    • False for 120B+ models
  • Flash Attention: Yes
  • Everything else left at default

Total time formula used in the tables

  • Total = TTFT + (Out Tok / Tok/s) + Thought
  • This is a derived comparison metric based on the recorded output length, throughput, time-to-first-token, and reported thinking time
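
As a minimal sketch of how that formula can be applied to one table row, the snippet below converts the recorded durations to seconds and adds the three terms. The helper names `parse_seconds` and `total_time` are made up for illustration and are not part of LM Studio or llama.cpp.

```python
import re


def parse_seconds(text: str) -> float:
    """Convert a duration such as '3.24s', '1m 3s' or '2m 9.785s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def total_time(ttft: str, out_tokens: int, tok_per_s: float, thought: str) -> float:
    """Total = TTFT + (Out Tok / Tok/s) + Thought, returned in seconds."""
    return parse_seconds(ttft) + out_tokens / tok_per_s + parse_seconds(thought)


# Q1 row for google/gemma-4-31b: TTFT 0.651s, 266 tokens at 63.0 tok/s, 3.24s thought
print(round(total_time("0.651s", 266, 63.0, "3.24s"), 3))  # 8.113
```

Thinking times written as "1m 3s" appear to be recorded to the nearest second, so totals built from them carry the same rounding.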

Questions

Q1. Alice has 5 brothers and she also has 3 sisters. How many sisters does Alice’s brother have?
Expected answer: 4

Q2. The car wash is only 100m away from my house, should I walk or drive?
Expected answer: You still need to drive the car to the car wash.

Q3. Solve the puzzle:

  • 37#21 = 928
  • 77#44 = 3993
  • 123#17 = 14840
  • 71#6 = ?

Expected answer: 5005
(from Stellar Blade)
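
The rule behind the puzzle is not stated above. One rule that fits all three given examples is the difference of squares, a#b = (a + b)(a - b), and it produces the expected 5005. The sketch below checks that reading; the function name `op` is just a placeholder, and other rules could in principle fit the same three examples.

```python
def op(a: int, b: int) -> int:
    """Candidate rule: difference of squares, (a + b) * (a - b)."""
    return (a + b) * (a - b)


# The three worked examples from the puzzle statement.
examples = {(37, 21): 928, (77, 44): 3993, (123, 17): 14840}
for (a, b), expected in examples.items():
    assert op(a, b) == expected, (a, b)

print(op(71, 6))  # 5005
```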

Q4. You have two hourglasses, one that measures exactly 7 minutes and another that measures exactly 11 minutes. Using only these two hourglasses, can you measure exactly 15 minutes? If so, explain the steps.
Expected answer: Yes.

  • Start both
  • At 7 minutes, flip the 7-minute hourglass
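  • At 11 minutes, when the 11-minute hourglass empties, flip the 7-minute hourglass again. It has been running for 4 minutes since its last flip, so flipping it puts exactly 4 minutes of sand back on top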
  • When the 7-minute hourglass empties, exactly 15 minutes have passed
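
The steps above can be checked with a small simulation. This is only a sketch: the `Hourglass` class is a hypothetical model that tracks how many minutes of sand remain in the upper bulb, and it assumes flips are instantaneous.

```python
class Hourglass:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.top = capacity  # minutes of sand left in the upper bulb

    def run(self, minutes: int) -> None:
        """Let sand fall for the given number of minutes."""
        self.top = max(0, self.top - minutes)

    def flip(self) -> None:
        """Turn the glass over: the sand that had fallen is now on top."""
        self.top = self.capacity - self.top


seven, eleven = Hourglass(7), Hourglass(11)
clock = 0

# Start both and wait until the 7-minute glass empties.
step = seven.top
seven.run(step)
eleven.run(step)
clock += step
seven.flip()  # at 7 minutes: the 7-minute glass is full again

# Wait until the 11-minute glass empties.
step = eleven.top
seven.run(step)
eleven.run(step)
clock += step
seven.flip()  # at 11 minutes: 4 minutes of sand are back on top

# Wait until the 7-minute glass empties.
step = seven.top
seven.run(step)
clock += step

print(clock)  # 15
```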

Models tested

  • google/gemma-4-31b
  • google/gemma-4-26b-a4b
  • google/gemma-4-e4b
  • qwen/qwen3.6-35b-a3b
  • qwen/qwen3.6-27b
  • qwen/qwen3.5-35b-a3b
  • qwen/qwen3-30b-a3b
  • qwen/qwen3.5-27b
  • unsloth/qwen3.5-122b-a10b

Q1 — Alice siblings

| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 3.24s | 63.0 | 266 | 0.651s | 8.113s |
| google/gemma-4-26b-a4b | Correct | 1.12s | 195.0 | 319 | 0.326s | 3.082s |
| google/gemma-4-e4b | Correct | 1.76s | 194.8 | 455 | 0.255s | 4.351s |
| qwen/qwen3.6-35b-a3b | Correct | 3.60s | 225.1 | 873 | 0.246s | 7.724s |
| qwen/qwen3.6-27b | Correct | 3.99s | 69.9 | 732 | 0.194s | 14.656s |
| qwen/qwen3.5-35b-a3b | Correct | 2.53s | 207.9 | 597 | 0.240s | 5.642s |
| qwen/qwen3-30b-a3b | Correct | 10.80s | 246.1 | 2972 | 0.225s | 23.101s |
| qwen/qwen3.5-27b | Correct | 7.82s | 68.5 | 607 | 0.366s | 17.047s |
| unsloth/qwen3.5-122b-a10b | Correct | 7.97s | 109.6 | 953 | 0.294s | 16.959s |

Q2 — Car wash

| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 3.94s | 62.9 | 297 | 0.246s | 8.908s |
| google/gemma-4-26b-a4b | Correct | 3.73s | 195.2 | 875 | 0.138s | 8.351s |
| google/gemma-4-e4b | Wrong | 1.72s | 194.5 | 568 | 0.166s | 4.806s |
| qwen/qwen3.6-35b-a3b | Wrong | 6.67s | 218.5 | 1727 | 0.173s | 14.747s |
| qwen/qwen3.6-27b | Partial | 9.92s | 69.8 | 1287 | 0.262s | 28.620s |
| qwen/qwen3.5-35b-a3b | Correct | 17.82s | 204.6 | 3868 | 0.356s | 37.081s |
| qwen/qwen3-30b-a3b | Partial | 7.70s | 248.4 | 2399 | 0.212s | 17.570s |
| qwen/qwen3.5-27b | Correct | 1m 3s | 66.6 | 4424 | 0.359s | 2m 9.785s |
| unsloth/qwen3.5-122b-a10b | Correct | 31.64s | 109.1 | 3780 | 0.295s | 1m 6.582s |

Q3 — Puzzle

| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 10.03s | 61.9 | 993 | 0.382s | 26.454s |
| google/gemma-4-26b-a4b | Correct | 4.07s | 195.9 | 1291 | 0.344s | 11.004s |
| google/gemma-4-e4b | Wrong | 12.14s | 189.7 | 4348 | 0.182s | 35.242s |
| qwen/qwen3.6-35b-a3b | Correct | 7.42s | 217.8 | 1935 | 0.363s | 16.667s |
| qwen/qwen3.6-27b | Correct | 6.95s | 69.4 | 1204 | 0.245s | 24.544s |
| qwen/qwen3.5-35b-a3b | Correct | 12.95s | 208.2 | 3086 | 0.229s | 28.001s |
| qwen/qwen3-30b-a3b | Correct | 8.29s | 248.1 | 2547 | 0.144s | 18.700s |
| qwen/qwen3.5-27b | Correct | 24.03s | 67.9 | 2056 | 0.486s | 54.796s |
| unsloth/qwen3.5-122b-a10b | Correct | 13.66s | 109.4 | 1842 | 0.424s | 30.921s |

Q4 — Hourglasses

| Model | Result | Thought | Tok/s | Out Tok | TTFT | Total |
|---|---|---|---|---|---|---|
| google/gemma-4-31b | Correct | 23.80s | 61.4 | 1724 | 0.399s | 52.277s |
| google/gemma-4-26b-a4b | Correct | 12.80s | 189.6 | 3102 | 0.212s | 29.373s |
| google/gemma-4-e4b | Wrong | 6.64s | 190.9 | 3544 | 0.208s | 25.413s |
| qwen/qwen3.6-35b-a3b | Correct | 1m 18s | 217.3 | 17383 | 0.220s | 2m 38.215s |
| qwen/qwen3.6-27b | Correct | 33.27s | 69.1 | 5447 | 0.262s | 1m 52.360s |
| qwen/qwen3.5-35b-a3b | Correct | 27.04s | 199.8 | 5722 | 0.276s | 55.955s |
| qwen/qwen3-30b-a3b | Wrong | 1m 25s | 212.3 | 18898 | 0.104s | 2m 54.120s |
| qwen/qwen3.5-27b | Correct | 1m 3s | 67.2 | 4694 | 0.293s | 2m 13.144s |
| unsloth/qwen3.5-122b-a10b | Partial | 55.24s | 107.8 | 6377 | 0.347s | 1m 54.743s |

What stood out

A few things were pretty clear from this run:

  • Q1 was easy for every model (all nine answered correctly), so it didn’t separate the field at all
  • Q2 was a much better filter than it looked at first. Some models immediately understood the trick, while others drifted into generic advice about whether a human should walk 100 meters, which is not really the question being asked
  • Q3 turned out to be a good check for pattern recognition. Every model except google/gemma-4-e4b solved it, so that single miss stood out
  • Q4 exposed the biggest differences in practical reasoning quality. Some models reached the standard solution cleanly, while others produced long, polished explanations that were either bloated or simply wrong
  • The Qwen MoE models (the a3b variants) posted the highest raw throughput, at roughly 200 to 250 tok/s, but that speed often came with much longer answers than necessary. The dense Qwen 27B models ran at about 67 to 70 tok/s
  • The Gemma family felt stronger on directness overall. The better Gemma runs were more likely to answer the question first instead of turning a short prompt into a small essay
  • Looking at the actual replies mattered as much as the score table. A model that is technically correct but takes several paragraphs to get there does not feel as good in real use as one that answers cleanly and moves on

Reply behavior notes

One pattern across the full set was that correctness and usability were not always the same thing.

google/gemma-4-31b and google/gemma-4-26b-a4b were still the most balanced to read. Both were reliable, and both usually got to the point without sounding lost. gemma-4-31b felt especially clean and steady, while gemma-4-26b-a4b combined that with roughly three times the throughput (about 190 to 196 tok/s versus about 62) and lower total times on all four questions.

qwen/qwen3.5-35b-a3b was the strongest Qwen result overall. It matched the best models on correctness, but it was generally more verbose. qwen/qwen3.5-27b also went 4/4, but at roughly 67 tok/s its totals ran past two minutes on both Q2 and Q4, so the experience felt slower than the raw score alone suggests.

qwen/qwen3.6-35b-a3b and qwen/qwen3.6-27b were more uneven. They could look smart on some questions, but the car wash prompt showed that they were more likely to miss the simple practical point and drift into over-explaining.

qwen/qwen3-30b-a3b, unsloth/qwen3.5-122b-a10b, and google/gemma-4-e4b each exposed a different failure mode: either partial correctness, inflated response length, or reasoning mistakes hidden behind confident formatting.

That is probably the biggest practical lesson from this whole test: I care a lot more about directness and reliability than about headline token speed.

Raw score summary

  • google/gemma-4-31b: 4/4 correct
  • google/gemma-4-26b-a4b: 4/4 correct
  • google/gemma-4-e4b: 1 correct, 3 wrong
  • qwen/qwen3.6-35b-a3b: 3 correct, 1 wrong
  • qwen/qwen3.6-27b: 3 correct, 1 partial
  • qwen/qwen3.5-35b-a3b: 4/4 correct
  • qwen/qwen3-30b-a3b: 2 correct, 1 partial, 1 wrong
  • qwen/qwen3.5-27b: 4/4 correct
  • unsloth/qwen3.5-122b-a10b: 3 correct, 1 partial
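
For anyone who wants to re-derive this summary, here is a small sketch that tallies the per-question results. The `results` dictionary is transcribed by hand from the Result column of the four tables (in Q1 to Q4 order), and the `tally` helper is a made-up name for illustration.

```python
from collections import Counter

# Result column from the Q1, Q2, Q3 and Q4 tables, in that order.
results = {
    "google/gemma-4-31b": ["Correct", "Correct", "Correct", "Correct"],
    "google/gemma-4-26b-a4b": ["Correct", "Correct", "Correct", "Correct"],
    "google/gemma-4-e4b": ["Correct", "Wrong", "Wrong", "Wrong"],
    "qwen/qwen3.6-35b-a3b": ["Correct", "Wrong", "Correct", "Correct"],
    "qwen/qwen3.6-27b": ["Correct", "Partial", "Correct", "Correct"],
    "qwen/qwen3.5-35b-a3b": ["Correct", "Correct", "Correct", "Correct"],
    "qwen/qwen3-30b-a3b": ["Correct", "Partial", "Correct", "Wrong"],
    "qwen/qwen3.5-27b": ["Correct", "Correct", "Correct", "Correct"],
    "unsloth/qwen3.5-122b-a10b": ["Correct", "Correct", "Correct", "Partial"],
}


def tally(outcomes: list[str]) -> dict[str, int]:
    """Count how often each outcome label appears for one model."""
    counts = Counter(outcomes)
    return {label: counts[label] for label in ("Correct", "Partial", "Wrong")}


for model, outcomes in results.items():
    print(model, tally(outcomes))
```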

Again, this is just a practical local-user comparison, not a rigorous benchmark. I only ran each question once per model, so I’d treat this more as a usability snapshot than a definitive ranking.

My take

If I were picking from this set for short local reasoning tasks, the three models I’d look at first are google/gemma-4-31b, google/gemma-4-26b-a4b, and qwen/qwen3.5-35b-a3b.

gemma-4-31b still felt the cleanest overall. It was reliable, direct, and consistently sensible. gemma-4-26b-a4b was the most interesting tradeoff: it matched that strong correctness profile while delivering much better throughput and the lowest total time of the four 4/4 models on every question. qwen/qwen3.5-35b-a3b remains the strongest Qwen option here, with a full 4/4 result and strong overall capability, even if it tends to say more than necessary.

qwen/qwen3.5-27b also deserves credit for going 4/4, but the totals make the tradeoff clear: it is solid, just much less efficient in practice. unsloth/qwen3.5-122b-a10b was interesting for quality, but it still felt heavy in response length and total time. qwen/qwen3.6-27b lands somewhere in the middle: capable enough, but not especially sharp or direct.

The weaker part of the set was qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, and google/gemma-4-e4b. All three had some obvious strengths on paper, but each of them had misses that are hard to ignore in real use. qwen3.6-35b-a3b was fast but stumbled on the trick question. qwen3-30b-a3b often produced a lot of output for mixed payoff. gemma-4-e4b was the least trustworthy overall.

So if I had to reduce the whole test to a simple ranking of practical impressions, it would look like this:

  • Top tier: google/gemma-4-31b, google/gemma-4-26b-a4b, qwen/qwen3.5-35b-a3b
  • Good, but with clear tradeoffs: qwen/qwen3.5-27b, unsloth/qwen3.5-122b-a10b, qwen/qwen3.6-27b
  • Mixed or disappointing: qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, google/gemma-4-e4b

That’s the version I’d trust most if I were choosing what to actually use, not just what to admire in a benchmark table.