The full turn: the model is an API call

Last part I ended on the question: do I need my own GPU at all? Here's the answer — and it turned the whole architecture around.

I spent a week picking hardware for the model. But the model is a commodity — you can call it over an API for kopecks. Why buy a GPU to run something that's already available?

So the architecture settled into two circuits.

→ Public — the showcase and the trial. Here the model is a call to someone else's API (a Russian one, in the Russian circuit, 152-ФЗ-compliant), and my trust layer runs on top: the same corpus, retrieval, clause-level citations, honest refusal. No GPU needed at all — the public circuit runs on a plain CPU.

→ Sovereign — for the client who must keep everything inside their own circuit: law, medicine, state data. Here the model runs on iron on-site, without a single external call. And the card is bought by whoever actually needs it — when the contract is signed.

One question remained: would someone else's model hold my "cite-or-refuse" discipline? I ran Russian models through my own bench. An inexpensive Russian model holds it: answers from the corpus, attaches the clause citation, stays honestly silent when there's no basis.

I also ran my own model, but through someone else's API instead of my own iron — same numbers. The delivery channel for the model doesn't affect quality. The layer on top does. Which was the point.

Same conclusion as the first part — now checked three ways: by benchmark, by hardware, and by price. The model is a commodity. So is the GPU. The advantage lives in the layer above them: the corpus, the verifiable citation, the honest refusal. That's what I'm building as the product. The GPU I'll buy when a client pays for it.

The full turn: the model is an API call

Related reading

Picking the model for the hardware I have

Do I even need my own GPU?

Recorded my daughter's recital — by Saturday night, a working pipeline came out