Gemma 4 vs Qwen Models: A Practical Local LLM Test

April 23, 2026

#local-llm #open-source #machine-learning

I’ve been testing a few open-source local LLMs lately because I wanted to see how they actually behave on my machine, not just how they look in benchmark charts.

This is not a formal benchmark. It’s just a practical local test using a few short reasoning and trick questions, with attention to both answer quality and inference behavior.

I later added three more models to the same set to see whether the original ranking still held up. That turned out to be worth doing, because one of the new Gemma variants was much stronger than I expected, and another was much worse.

Test setup

Hardware

CPU: AMD Ryzen 7 9700X (8-core)
RAM: 48GB DDR5 (2x24GB 6400 CL32)
GPU: RTX 6000 Pro Workstation Edition 96GB

Runtime

OS: Windows 11
Software: LM Studio 0.4.12
Backend: llama.cpp CUDA 12 v2.13.0

Models / quantization

All models except one came from lmstudio-community and were quantized with Q4_K_M
The exception: Qwen3.5-122B, which I ran with Unsloth’s Q4_K_S quant

Other notes

No tools or MCP enabled

Model settings

Context length: 190,000
Full GPU offload
Eval batch size: 512
Max concurrent predictions: 8
Unified KV cache: Yes
mmap(): True (False for the one 120B+ model, unsloth/qwen3.5-122b-a10b)
Flash Attention: Yes
Everything else left at default

Total time formula used in the tables

Total = TTFT + (Out Tok / Tok/s) + Thought
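This is a derived comparison metric, not a measured wall-clock time. It combines the recorded time-to-first-token (TTFT), output length (Out Tok), throughput (Tok/s), and reported thinking time (Thought), with every term expressed in seconds.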
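As a minimal sketch, the formula can be written as a small Python function. The function and parameter names are illustrative assumptions, not something LM Studio exposes; the sample values are the Q1 row for google/gemma-4-31b.

```python
def total_seconds(ttft_s: float, out_tokens: int, tok_per_s: float, thought_s: float) -> float:
    """Total = TTFT + (Out Tok / Tok/s) + Thought, with every term in seconds."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_s + out_tokens / tok_per_s + thought_s


# Q1 row for google/gemma-4-31b, taken from the table in this post.
q1_gemma_31b = total_seconds(ttft_s=0.651, out_tokens=266, tok_per_s=63.0, thought_s=3.24)
print(f"{q1_gemma_31b:.3f}s")  # prints 8.113s
```

The same call reproduces any other row when given that row's four recorded values.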

Questions

Q1. Alice has 5 brothers and she also has 3 sisters. How many sisters does Alice’s brother have?
Expected answer: 4
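
The trick is that each brother counts Alice herself as a sister. A small sketch that models the family explicitly makes this visible; the sibling names are made up for illustration.

```python
# Alice's family as (name, gender) pairs; the names are hypothetical.
family = (
    [("Alice", "F")]
    + [(f"sister_{i}", "F") for i in range(1, 4)]   # her 3 sisters
    + [(f"brother_{i}", "M") for i in range(1, 6)]  # her 5 brothers
)


def sisters_of(name: str) -> int:
    """Count the female siblings of `name`, excluding the person themselves."""
    return sum(1 for sibling, gender in family if gender == "F" and sibling != name)


print(sisters_of("Alice"))      # 3
print(sisters_of("brother_1"))  # 4
```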

Q2. The car wash is only 100m away from my house, should I walk or drive?
Expected answer: You still need to drive the car to the car wash.

Q3. Solve the puzzle:

37#21 = 928
77#44 = 3993
123#17 = 14840
71#6 = ?

Expected answer: 5005
(from Stellar Blade)
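
The post does not spell out the rule, but one rule that fits all three examples is a#b = (a - b) × (a + b), which equals a² - b². The sketch below checks that inferred rule against the given lines before applying it to 71#6; the function name is an assumption.

```python
def hash_op(a: int, b: int) -> int:
    """Inferred rule for the puzzle: a#b = (a - b) * (a + b) = a^2 - b^2."""
    return (a - b) * (a + b)


# The three worked examples from the puzzle statement.
examples = {(37, 21): 928, (77, 44): 3993, (123, 17): 14840}
for (a, b), expected in examples.items():
    assert hash_op(a, b) == expected, (a, b)

print(hash_op(71, 6))  # 5005
```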

Q4. You have two hourglasses, one that measures exactly 7 minutes and another that measures exactly 11 minutes. Using only these two hourglasses, can you measure exactly 15 minutes? If so, explain the steps.
Expected answer: Yes.

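Start both hourglasses at the same time
At 7 minutes, when the 7-minute hourglass runs out, flip it
At 11 minutes, when the 11-minute hourglass runs out, flip the 7-minute hourglass again. It has been running for 4 minutes, so 4 minutes of sand will flow back
When the 7-minute hourglass empties, exactly 15 minutes have passed (11 + 4)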
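
The steps can be checked with a small simulation. This is a sketch under a simple assumption: each hourglass is modeled only by the minutes of sand left in its upper bulb, and flipping swaps the two bulbs. The class and function names are illustrative.

```python
class Hourglass:
    """Tracks how many minutes of sand remain in the upper bulb."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.top = 0  # all sand starts in the lower bulb

    def flip(self) -> None:
        # The sand that has already fallen becomes the new upper bulb.
        self.top = self.capacity - self.top

    def run(self, minutes: int) -> None:
        self.top = max(0, self.top - minutes)


def measure_fifteen() -> int:
    seven, eleven = Hourglass(7), Hourglass(11)
    clock = 0

    def advance(minutes: int) -> None:
        nonlocal clock
        seven.run(minutes)
        eleven.run(minutes)
        clock += minutes

    # Start both hourglasses.
    seven.flip()
    eleven.flip()

    advance(seven.top)   # 7-minute glass runs out at t = 7
    seven.flip()

    advance(eleven.top)  # 11-minute glass runs out at t = 11
    seven.flip()         # 3 minutes were left on top, so 4 minutes flow back

    advance(seven.top)   # 7-minute glass runs out again
    return clock


print(measure_fifteen())  # 15
```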

Models tested

google/gemma-4-31b
google/gemma-4-26b-a4b
google/gemma-4-e4b
qwen/qwen3.6-35b-a3b
qwen/qwen3.6-27b
qwen/qwen3.5-35b-a3b
qwen/qwen3-30b-a3b
qwen/qwen3.5-27b
unsloth/qwen3.5-122b-a10b

Q1 — Alice siblings

Model	Result	Thought	Tok/s	Out Tok	TTFT	Total
google/gemma-4-31b	Correct	3.24s	63.0	266	0.651s	8.113s
google/gemma-4-26b-a4b	Correct	1.12s	195.0	319	0.326s	3.082s
google/gemma-4-e4b	Correct	1.76s	194.8	455	0.255s	4.351s
qwen/qwen3.6-35b-a3b	Correct	3.60s	225.1	873	0.246s	7.724s
qwen/qwen3.6-27b	Correct	3.99s	69.9	732	0.194s	14.656s
qwen/qwen3.5-35b-a3b	Correct	2.53s	207.9	597	0.240s	5.642s
qwen/qwen3-30b-a3b	Correct	10.80s	246.1	2972	0.225s	23.101s
qwen/qwen3.5-27b	Correct	7.82s	68.5	607	0.366s	17.047s
unsloth/qwen3.5-122b-a10b	Correct	7.97s	109.6	953	0.294s	16.959s

Q2 — Car wash

Model	Result	Thought	Tok/s	Out Tok	TTFT	Total
google/gemma-4-31b	Correct	3.94s	62.9	297	0.246s	8.908s
google/gemma-4-26b-a4b	Correct	3.73s	195.2	875	0.138s	8.351s
google/gemma-4-e4b	Wrong	1.72s	194.5	568	0.166s	4.806s
qwen/qwen3.6-35b-a3b	Wrong	6.67s	218.5	1727	0.173s	14.747s
qwen/qwen3.6-27b	Partial	9.92s	69.8	1287	0.262s	28.620s
qwen/qwen3.5-35b-a3b	Correct	17.82s	204.6	3868	0.356s	37.081s
qwen/qwen3-30b-a3b	Partial	7.70s	248.4	2399	0.212s	17.570s
qwen/qwen3.5-27b	Correct	1m 3s	66.6	4424	0.359s	2m 9.785s
unsloth/qwen3.5-122b-a10b	Correct	31.64s	109.1	3780	0.295s	1m 6.582s
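
The tables mix two duration formats (`37.081s` and `2m 9.785s`), which makes the rows awkward to sort or recheck. A small helper can convert either form to seconds. The helper name and the regular expression are assumptions for illustration; the sample values are the Q2 row for qwen/qwen3.5-27b.

```python
import re

# Optional "<minutes>m" part, followed by "<seconds>s".
_DURATION = re.compile(r"(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s")


def to_seconds(text: str) -> float:
    """Convert strings such as '0.359s', '1m 3s' or '2m 9.785s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# Recompute the Q2 total for qwen/qwen3.5-27b from its recorded columns.
recomputed = to_seconds("0.359s") + 4424 / 66.6 + to_seconds("1m 3s")
print(f"{recomputed:.3f}")  # prints 129.785, i.e. 2m 9.785s
```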

Q3 — Puzzle

Model	Result	Thought	Tok/s	Out Tok	TTFT	Total
google/gemma-4-31b	Correct	10.03s	61.9	993	0.382s	26.454s
google/gemma-4-26b-a4b	Correct	4.07s	195.9	1291	0.344s	11.004s
google/gemma-4-e4b	Wrong	12.14s	189.7	4348	0.182s	35.242s
qwen/qwen3.6-35b-a3b	Correct	7.42s	217.8	1935	0.363s	16.667s
qwen/qwen3.6-27b	Correct	6.95s	69.4	1204	0.245s	24.544s
qwen/qwen3.5-35b-a3b	Correct	12.95s	208.2	3086	0.229s	28.001s
qwen/qwen3-30b-a3b	Correct	8.29s	248.1	2547	0.144s	18.700s
qwen/qwen3.5-27b	Correct	24.03s	67.9	2056	0.486s	54.796s
unsloth/qwen3.5-122b-a10b	Correct	13.66s	109.4	1842	0.424s	30.921s

Q4 — Hourglasses

Model	Result	Thought	Tok/s	Out Tok	TTFT	Total
google/gemma-4-31b	Correct	23.80s	61.4	1724	0.399s	52.277s
google/gemma-4-26b-a4b	Correct	12.80s	189.6	3102	0.212s	29.373s
google/gemma-4-e4b	Wrong	6.64s	190.9	3544	0.208s	25.413s
qwen/qwen3.6-35b-a3b	Correct	1m 18s	217.3	17383	0.220s	2m 38.215s
qwen/qwen3.6-27b	Correct	33.27s	69.1	5447	0.262s	1m 52.360s
qwen/qwen3.5-35b-a3b	Correct	27.04s	199.8	5722	0.276s	55.955s
qwen/qwen3-30b-a3b	Wrong	1m 25s	212.3	18898	0.104s	2m 54.120s
qwen/qwen3.5-27b	Correct	1m 3s	67.2	4694	0.293s	2m 13.144s
unsloth/qwen3.5-122b-a10b	Partial	55.24s	107.8	6377	0.347s	1m 54.743s

What stood out

A few things were pretty clear from this run:

Q1 was easy for all nine models, so it didn’t separate the field
Q2 was a much better filter than it looked at first. Some models immediately understood the trick, while others drifted into generic advice about whether a human should walk 100 meters, which is not really the question being asked
Q3 turned out to be a good check for pattern recognition. Eight of the nine models solved it, so the one miss (google/gemma-4-e4b) stood out
Q4 exposed the biggest differences in practical reasoning quality. Some models reached the standard solution cleanly, while others produced long, polished explanations that were either bloated or simply wrong
The Qwen a3b models posted the highest raw throughput (roughly 200 to 250 tok/s), but that speed often came with much longer answers than necessary
The Gemma family felt stronger on directness overall. The better Gemma runs were more likely to answer the question first instead of turning a short prompt into a small essay
Looking at the actual replies mattered as much as the score table. A model that is technically correct but takes several paragraphs to get there does not feel as good in real use as one that answers cleanly and moves on
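
One rough way to put a number on the verbosity point is to add up each model's Out Tok column across the four questions. The sketch below does that with the values from the tables in this post; the variable names are illustrative, and total output length is only a proxy for how direct an answer feels.

```python
# Out Tok per model for Q1..Q4, copied from the result tables.
out_tokens = {
    "google/gemma-4-31b": [266, 297, 993, 1724],
    "google/gemma-4-26b-a4b": [319, 875, 1291, 3102],
    "google/gemma-4-e4b": [455, 568, 4348, 3544],
    "qwen/qwen3.6-35b-a3b": [873, 1727, 1935, 17383],
    "qwen/qwen3.6-27b": [732, 1287, 1204, 5447],
    "qwen/qwen3.5-35b-a3b": [597, 3868, 3086, 5722],
    "qwen/qwen3-30b-a3b": [2972, 2399, 2547, 18898],
    "qwen/qwen3.5-27b": [607, 4424, 2056, 4694],
    "unsloth/qwen3.5-122b-a10b": [953, 3780, 1842, 6377],
}

token_totals = {model: sum(counts) for model, counts in out_tokens.items()}

# List models from least to most total output.
for model, total in sorted(token_totals.items(), key=lambda item: item[1]):
    print(f"{model:28s} {total:6d}")
```

On these numbers google/gemma-4-31b produced the least output overall (3,280 tokens) and qwen/qwen3-30b-a3b the most (26,816 tokens).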

Reply behavior notes

One pattern across the full set was that correctness and usability were not always the same thing.

google/gemma-4-31b and google/gemma-4-26b-a4b were the most balanced to read. Both were reliable, and both usually got to the point without sounding lost. gemma-4-31b felt especially clean and steady, while gemma-4-26b-a4b combined that with roughly three times the throughput (about 190 to 196 tok/s versus about 61 to 63 tok/s) and the lowest total time of the four 4/4 models on every question.

qwen/qwen3.5-35b-a3b was the strongest Qwen result overall. It matched the best models on correctness, but it was generally more verbose. qwen/qwen3.5-27b also went 4/4, but at around 67 tok/s its total times ran past two minutes on both Q2 and Q4, so the experience felt slower than the raw score alone suggests.

qwen/qwen3.6-35b-a3b and qwen/qwen3.6-27b were more uneven. They could look smart on some questions, but the car wash prompt showed that they were more likely to miss the simple practical point and drift into over-explaining.

qwen/qwen3-30b-a3b, unsloth/qwen3.5-122b-a10b, and google/gemma-4-e4b each exposed a different failure mode: either partial correctness, inflated response length, or reasoning mistakes hidden behind confident formatting.

That is probably the biggest practical lesson from this whole test: I care a lot more about directness and reliability than about headline token speed.

Raw score summary

google/gemma-4-31b → 4/4 correct
google/gemma-4-26b-a4b → 4/4 correct
google/gemma-4-e4b → 1 correct, 3 wrong
qwen/qwen3.6-35b-a3b → 3 correct, 1 wrong
qwen/qwen3.6-27b → 3 correct, 1 partial
qwen/qwen3.5-35b-a3b → 4/4 correct
qwen/qwen3-30b-a3b → 2 correct, 1 partial, 1 wrong
qwen/qwen3.5-27b → 4/4 correct
unsloth/qwen3.5-122b-a10b → 3 correct, 1 partial
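
The summary above can be rebuilt from the Result column of the four tables. This is a minimal sketch with the per-question results typed in by hand in Q1 to Q4 order; the variable names are illustrative.

```python
from collections import Counter

# Result per model for Q1..Q4, copied from the tables.
results = {
    "google/gemma-4-31b": ["Correct", "Correct", "Correct", "Correct"],
    "google/gemma-4-26b-a4b": ["Correct", "Correct", "Correct", "Correct"],
    "google/gemma-4-e4b": ["Correct", "Wrong", "Wrong", "Wrong"],
    "qwen/qwen3.6-35b-a3b": ["Correct", "Wrong", "Correct", "Correct"],
    "qwen/qwen3.6-27b": ["Correct", "Partial", "Correct", "Correct"],
    "qwen/qwen3.5-35b-a3b": ["Correct", "Correct", "Correct", "Correct"],
    "qwen/qwen3-30b-a3b": ["Correct", "Partial", "Correct", "Wrong"],
    "qwen/qwen3.5-27b": ["Correct", "Correct", "Correct", "Correct"],
    "unsloth/qwen3.5-122b-a10b": ["Correct", "Correct", "Correct", "Partial"],
}

# Count Correct / Partial / Wrong for each model.
tallies = {model: dict(Counter(outcomes)) for model, outcomes in results.items()}

# Models that answered all four questions correctly.
perfect = [model for model, tally in tallies.items() if tally.get("Correct", 0) == 4]

for model, tally in tallies.items():
    print(model, tally)
print("4/4:", perfect)
```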

Again, this is just a practical local-user comparison, not a rigorous benchmark. I only ran each question once per model, so I’d treat this more as a usability snapshot than a definitive ranking.

My take

If I were picking from this set for short local reasoning tasks, the three models I’d look at first are google/gemma-4-31b, google/gemma-4-26b-a4b, and qwen/qwen3.5-35b-a3b.

gemma-4-31b felt the cleanest overall. It was reliable, direct, and consistently sensible. gemma-4-26b-a4b was the most interesting tradeoff: it matched that correctness profile while delivering much better throughput and keeping total times low. qwen/qwen3.5-35b-a3b remains the strongest Qwen option here, with a full 4/4 result, even if it tends to say more than necessary.

qwen/qwen3.5-27b also deserves credit for going 4/4, but the totals make the tradeoff clear: it is solid, just much less efficient in practice. unsloth/qwen3.5-122b-a10b got three of four right with a partial on Q4, but it still felt heavy in response length and total time. qwen/qwen3.6-27b lands somewhere in the middle: capable enough, but not especially sharp or direct.

The weaker models in the set were qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, and google/gemma-4-e4b. All three had some obvious strengths on paper, but each of them had misses that are hard to ignore in real use. qwen3.6-35b-a3b was fast but stumbled on the trick question. qwen3-30b-a3b often produced a lot of output for mixed payoff. gemma-4-e4b was the least trustworthy overall, with three wrong answers out of four.

So if I had to reduce the whole test to a simple ranking of practical impressions, it would look like this:

Top tier: google/gemma-4-31b, google/gemma-4-26b-a4b, qwen/qwen3.5-35b-a3b
Good, but with clear tradeoffs: qwen/qwen3.5-27b, unsloth/qwen3.5-122b-a10b, qwen/qwen3.6-27b
Mixed or disappointing: qwen/qwen3.6-35b-a3b, qwen/qwen3-30b-a3b, google/gemma-4-e4b

That’s the version I’d trust most if I were choosing what to actually use, not just what to admire in a benchmark table.