Fine-tuning, RAG, and reasoning aren't alternatives — they're stack layers

Someone in a forum asked me: why is your approach — fine-tuning your own open-weight models for a domain — better than RAG or reasoning?

Short answer: it isn't "better" — it sits underneath them in the stack.

A fine-tuned domain model is the foundation. RAG is the plumbing that pipes fresh, client-specific information into prompts. Reasoning is the electrical system that lights up multi-step deliberation when the problem warrants it. Nobody standing in a working house asks which is better: the foundation, the plumbing, or the electrical.

Where each layer works, where each breaks

RAG is great when you need fresh, customer-specific information — internal docs, regulations, last quarter's lab results. It pulls relevant chunks at inference time and stitches them into the prompt. Where it breaks: the model itself doesn't "speak" the domain. If the right chunk wasn't retrieved, or the question demands methodology rather than fact, the answer is confident and wrong. Embedding quality and chunking strategy become the bottleneck. And one more thing: if RAG sits on top of the ChatGPT API, you pay API rates on every query — that doesn't go away.
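The retrieve-and-stitch step described above can be sketched in miniature. A toy bag-of-words similarity stands in for a real embedding model and vector store, and the chunks are invented:

```python
from collections import Counter
from math import sqrt

# Invented document chunks; a real pipeline would index the client's
# live document base with a proper embedding model.
CHUNKS = [
    "Q3 lab results: cortisol panels were within reference ranges.",
    "Internal policy: all athlete data must stay on-premises.",
    "Federation rule 4.2 covers out-of-competition testing windows.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(question: str, k: int = 2) -> str:
    """Retrieve the k most similar chunks and stitch them into the prompt."""
    q = embed(question)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What do the lab results say about cortisol?"))
```

The failure mode is visible even in the toy: if the scoring misses the right chunk, the model answers from whatever landed in the context, which is exactly the "confident and wrong" case.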

Reasoning models (o1, R1, Gemini Thinking) shine on novel multi-step logic — math olympiad problems, code debugging, scenario analysis. The problem for the Russian market: most are foreign-cloud only, which is a hard 152-ФЗ wall for any regulated personal data. And even where you can use them, they reason generically — they don't know your federation rules, your anti-doping protocols, your training standards. Inference cost is also an order of magnitude higher per query than a fine-tuned 27B-31B running locally.

Fine-tuning a domain model and deploying it sovereign — what I do — gives the model real domain knowledge: methodology, vocabulary, regulatory context, all baked into weights. It doesn't retrieve at inference. It carries.

One example from this week

Yesterday I re-based the lineup of ЛИИ models from Qwen 3 to Gemma 4 — seven of the eight (Mobile was already locked on Gemma-3n-4B). Provisional, pending an A/B benchmark on May 13. The reason was the tokenizer. Qwen 3 base spends ~3.12 tokens per Russian word (Occiglot benchmark, via T-Bank's T-pro 2.0 paper). Gemma 4 spends ~2.0. That's roughly 56% more tokens for the same Russian text, and it compounds: cheaper CPT, longer effective context, ~8× cheaper output cost on OpenRouter. T-Bank's entire T-pro 2.0 release was predicated on engineering a custom Cyrillic tokenizer for Qwen — replacing 34K low-frequency tokens with high-frequency Cyrillic merges from RuAdapt, cl100k_base, and mGPT. That's a lot of engineering to fix exactly this problem.
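The fertility arithmetic is worth making concrete. A minimal sketch using the ~3.12 and ~2.0 tokens-per-word figures quoted above; the corpus size and context window are illustrative assumptions:

```python
# Why tokenizer fertility compounds: same words, very different token bills.
QWEN_FERTILITY = 3.12   # tokens per Russian word (figure quoted above)
GEMMA_FERTILITY = 2.0

def tokens_for(words: int, fertility: float) -> int:
    return round(words * fertility)

# The same 1M-word Russian corpus costs this many tokens to train on:
corpus_words = 1_000_000
qwen_tokens = tokens_for(corpus_words, QWEN_FERTILITY)    # 3,120,000
gemma_tokens = tokens_for(corpus_words, GEMMA_FERTILITY)  # 2,000,000

# Effective context: how many Russian words fit in a fixed 8K-token window.
window = 8192
print(window / QWEN_FERTILITY)   # ≈ 2625.6 words
print(window / GEMMA_FERTILITY)  # 4096.0 words

# Output cost scales with tokens emitted for the same answer text.
gap = QWEN_FERTILITY / GEMMA_FERTILITY
print(f"{gap:.2f}x tokens per word")  # 1.56x
```

The same multiplier hits every stage: pre-training tokens, context budget, and per-query output billing, which is why a tokenizer gap is never just a tokenizer gap.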

The context that makes this less academic

Last week, H100 / H200 / B200 rentals on Western clouds crossed $1,000/hr sustained, per an r/LocalLLaMA thread (172 upvotes). One commenter, an engineer at a major hyperscaler: "demand is higher than any of us can meet." A Series C startup could find 20 A100s only in Italy. The era of subsidized compute is over.

What this means for a Russian customer: RAG over a foreign-cloud frontier model is not only a legal problem. It's increasingly a structural cost problem. A ~30B Russian-tuned open-weight model on a spot RTX 6000 Pro at Selectel becomes more defensible every month.
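A rough sketch of that cost comparison. Every price and throughput number below is an assumption for illustration, not a quote from OpenRouter, Selectel, or any API provider:

```python
# Hypothetical monthly-cost comparison: API metering vs a rented GPU.
API_PRICE_PER_1M_OUT = 10.0    # $/1M output tokens, frontier API (assumed)
GPU_PRICE_PER_HOUR = 1.5       # $/hr, spot RTX 6000 Pro (assumed)
GPU_TOKENS_PER_SEC = 60        # ~30B model local throughput (assumed)

def api_cost(tokens: int) -> float:
    """Metered API billing: you pay per output token, forever."""
    return tokens / 1_000_000 * API_PRICE_PER_1M_OUT

def gpu_cost(tokens: int) -> float:
    """Rented-GPU billing: you pay for the hours the card runs."""
    hours = tokens / GPU_TOKENS_PER_SEC / 3600
    return hours * GPU_PRICE_PER_HOUR

monthly_tokens = 100_000_000  # 100M output tokens/month (assumed workload)
print(api_cost(monthly_tokens))  # 1000.0
print(gpu_cost(monthly_tokens))  # ≈ 694.4
```

The exact crossover point moves with the assumed prices, but the structure doesn't: API cost is strictly linear in usage, while a rented card amortizes, and GPU rental inflation pushes API rates up with it.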

What's actually in the box for the client

When I deliver an integration, all three layers ship together:

→ A domain-tuned ЛИИ model — Sport, Law, Medical, Education — as the foundation
→ A RAG pipeline over the client's live document base — fresh facts
→ CoT prompting for genuinely hard query classes — where it earns its latency

Plus a custom benchmark for the domain to catch regressions. Plus a safety and PII-memorization audit. Plus deployment on 152-ФЗ-compliant infrastructure or on the client's own hardware.
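One way a PII-memorization audit can probe the model, in sketch form: prompt it with a prefix that stops just short of a known sensitive value and check whether the completion reproduces it. `generate` here is a stub standing in for the real inference call, and the PII strings are invented:

```python
# PII-memorization probe sketch: did training bake a sensitive value
# into the weights? All values here are invented for illustration.
KNOWN_PII = ["+7 912 555-01-23", "ivanov@example.com"]

def generate(prompt: str) -> str:
    # Stub: a real audit would call the deployed model here.
    # This canned reply simulates a model that memorized a phone number.
    return "the athlete's contact number is +7 912 555-01-23"

def leaked_pii(prompt: str) -> list[str]:
    """Return every known sensitive value found in the completion."""
    completion = generate(prompt)
    return [pii for pii in KNOWN_PII if pii in completion]

hits = leaked_pii("The athlete's contact number is")
print(hits)  # a non-empty list means the value was memorized
```

In practice the probe runs over every sensitive value that appeared in the training data, with multiple prefix variants per value; a single hit is grounds to retrain with that record scrubbed.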

What's next

July 15 is the hard gate for the first open-weight release in the lineup: ЛИИ-Спорт-27B Preview. With scoring on a custom benchmark (200+ questions across 8 categories, from rules and regulations through sports medicine and anti-doping), open weights on HuggingFace, a paper on Habr. It'll be the first Russian open-weight LLM in the sports domain. After that: Law, Medical, School, University.
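Per-category scoring is what lets a benchmark like this catch regressions that a single average would hide — a model can improve on rules while quietly degrading on anti-doping. A minimal sketch; the categories and results below are made up:

```python
from collections import defaultdict

# (category, correct?) pairs, as a grading run over the question set
# might emit them. Invented results for illustration.
results = [
    ("rules", True), ("rules", False),
    ("anti-doping", True), ("anti-doping", True),
    ("sports-medicine", False),
]

def score_by_category(results):
    """Accuracy per category, so no category regression hides in the mean."""
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for cat, ok in results:
        tally[cat][0] += int(ok)
        tally[cat][1] += 1
    return {cat: c / t for cat, (c, t) in tally.items()}

print(score_by_category(results))
# {'rules': 0.5, 'anti-doping': 1.0, 'sports-medicine': 0.0}
```

Comparing these per-category dicts across model versions is the regression check: any category that drops between releases gets flagged, even if the overall score went up.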

To circle back: RAG and reasoning aren't going anywhere, and shouldn't be. They're the right tools for their layers. But before you put someone else's general-purpose API beneath them, it's worth laying a foundation that knows your domain and lives in your jurisdiction.

Related reading