Fine-tuned a model to GPT-5.4 quality for $330
Published a second Habr piece today — how I fine-tuned a model for Russian K-12 teachers, and what it cost.
Short version: $330, sixteen hours on an H200, 30,000 synthetic pairs — and Qwen3.5-27B with my LoRA adapter places 9th of 30 on EduBench-RU. Above GigaChat-2 Max, YandexGPT 5.1 Pro, Grok, GLM, and Qwen3 235B. At the same level as GPT-5.4: a 0.01-point gap, well inside the margin of error.
"At the same level" — not "beat it." I wouldn't fly that 0.01 as a victory flag. The difference lives elsewhere: my model runs locally, on servers inside the Russian perimeter. GPT-5.4 doesn't. For schools bound by 152-ФЗ, Russia's personal-data law, that's the deciding factor.
The interesting part of the story isn't what worked — it's what didn't.
I trained a 32B version in parallel. More parameters. Almost three times the GPU time: 45 hours vs 16. Lower training loss: 0.47 vs 0.51, formally better.
Final score: half a point worse.
→ Architecture beats size. Qwen3.5 is a newer architecture than Qwen3: fewer parameters, better results on Russian and on task structure.
→ Training loss and benchmark quality aren't the same thing. The model got better at predicting the next token and worse at answering a teacher's question.
→ A day of wasted GPU time sometimes costs more than the useful model. Don't ask me why I trained both in parallel instead of sequentially.
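For readers who haven't run QLoRA before, the setup looks roughly like the sketch below. Every number here is an illustrative placeholder, not my actual config — the real values (rank, target modules, quantization settings) are in the Habr post.

```python
# A minimal QLoRA config sketch (peft + bitsandbytes via transformers).
# All hyperparameters are placeholders, NOT the config from the article.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, standard for QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,
)

# The small trainable adapter layered on top of the frozen base.
lora_config = LoraConfig(
    r=16,                                   # placeholder adapter rank
    lora_alpha=32,                          # placeholder scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    bias="none",
    task_type="CAUSAL_LM",
)

# These configs are then passed to AutoModelForCausalLM.from_pretrained(...)
# and peft.get_peft_model(...) before training.
```

The point of the quantized base + small adapter split is exactly why the bill stays at $330: only the adapter's parameters train and ship, while the 27B base sits frozen in 4-bit.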
Full breakdown — cost tables, QLoRA config, the story of how max_tokens: 512 killed the first Gemini run, and the ±0.3-point spread between judges on the same answer — on Habr →