Picking the model for the hardware I have

Last time I ended on: the model is a commodity. But you still have to pick the commodity — ideally one that runs on hardware I already own, not hardware I'd have to pay a lot more for. So I sat down to choose. And the benchmark turned out smarter than my intuition again.

Intuition said: grab the bigger model, it's smarter. Ran the bench — not true.

→ A Mixture-of-Experts model (MoE — only a small slice of the weights fires per query) beat a dense model twice its size. Both on citation quality and on speed — twice as fast. No real reason to pay for the pricier card. → In 4-bit packing the model I need fits in 24 GB of VRAM. Not 48 GB of flagship GDDR7 — 24, which roughly halves the price tag. → And the "thinking" mode the new models brag about only got in the way here: the model deliberated 5–7× longer and produced the same answer. For corpus-grounded extraction you don't need to think, you need to answer. Off it goes.

What the bench confirmed: answer quality doesn't scale with model size. It rides on the corpus, the retrieval, and the guard — the very layer I'm building. Exactly what I wrote last time.

Picking the model for the hardware I have

Related reading

The full turn: the model is an API call

Do I even need my own GPU?

Recorded my daughter's recital — by Saturday night, a working pipeline came out