The full turn: the model is an API call
Last part I ended on the question: do I need my own GPU at all? Here's the answer — and it turned the whole architecture around.
I spent a week picking hardware for the model. But the model is a commodity — you can call it over an API for kopecks. Why buy a GPU to run something that's already available?
So the architecture settled into two circuits.
→ Public — the showcase and the trial. Here the model is a call to someone else's API (a Russian one, in the Russian circuit, 152-ФЗ-compliant), and my trust layer runs on top: the same corpus, retrieval, clause-level citations, honest refusal. No GPU needed at all — the public circuit runs on a plain CPU.
→ Sovereign — for the client who must keep everything inside their own circuit: law, medicine, state data. Here the model runs on iron on-site, without a single external call. And the card is bought by whoever actually needs it — when the contract is signed.
One question remained: would someone else's model hold my "cite-or-refuse" discipline? I ran Russian models through my own bench. An inexpensive Russian model holds it: answers from the corpus, attaches the clause citation, stays honestly silent when there's no basis.
I also ran my own model, but through someone else's API instead of my own iron — same numbers. The delivery channel for the model doesn't affect quality. The layer on top does. Which was the point.
Same conclusion as the first part — now checked three ways: by benchmark, by hardware, and by price. The model is a commodity. So is the GPU. The advantage lives in the layer above them: the corpus, the verifiable citation, the honest refusal. That's what I'm building as the product. The GPU I'll buy when a client pays for it.