655 questions in one evening. A benchmark for my own model.
Tonight I sat down to build a benchmark for LII-Sport, the language model for the Russian sports domain that is in my pipeline. Before the public release on June 15 I need an instrument to test how deeply the model knows sports: rules, training methodology, regulatory frameworks. Without that testing there is no client demo and no HuggingFace upload.
Finished by 1 a.m.: LII-Sport-Bench-RU v0.1, 655 expert questions across 35 sports. Eight categories per sport: rules and regulations, training methodology, biomechanics and physiology, sport psychology, federations and regulatory framework, history, anti-doping (RUSADA / WADA), and scenario reasoning. Each question carries a reference answer with a cited source, plus three criteria for the LLM judge.
The story is not speed. The story is how it came together and what I do with it next.
I wasn't alone
Eleven AI agents worked in parallel, each owning a slice: one for basketball, one for volleyball, one for winter sports, and so on. I drove the orchestration: sport selection, category distribution, question tone and format, and retry logic for when an agent stalled or hit a token limit (one of each happened tonight, which is normal at parallel scale).
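The retry logic is the only part of that orchestration that needs real code. A minimal sketch of the pattern, assuming a hypothetical `agent_fn` that generates one sport's questions (the function name and signature are my illustration, not the actual pipeline):

```python
import time

def run_with_retry(task, agent_fn, max_attempts=3, base_delay=2.0):
    """Re-run an agent's slice when it stalls or hits a token limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = agent_fn(task)
            if result:          # empty output counts as a stall
                return result
            raise RuntimeError(f"agent returned no output for {task!r}")
        except Exception:
            if attempt == max_attempts:
                raise
            # exponential backoff before handing the slice back
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In practice each of the eleven agents would be wrapped in this, so one stalled slice retries without blocking the others.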
This is methodology, not magic. SportQA, the academic baseline (NAACL 2024), was assembled the same way — synthetic generation plus expert review. Doing 655 expert questions by hand alone is three to four months of work. With agents in parallel it's one evening for a draft.
What's in the file — but it's a draft
Eight sports go deep (50 questions each): basketball, volleyball, football, hockey, athletics, wrestling, gymnastics, swimming. These are the sports the program targets with federation and university partners.
Twelve sports get broad coverage (15 questions each): boxing, judo, biathlon, tennis, table tennis, shooting, cross-country skiing, weightlifting, fencing, snowboarding, beach volleyball, and sambo.
Fifteen sports test the recognition floor (5 questions each): chess, curling, triathlon, rugby, sport tourism, and others. The point is to verify the model doesn't go blank on less mainstream disciplines.
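The three tiers above add up exactly; a quick check of the arithmetic:

```python
# questions-per-sport -> number of sports in that tier
tiers = {50: 8, 15: 12, 5: 15}

total_sports = sum(tiers.values())                      # 8 + 12 + 15
total_questions = sum(q * n for q, n in tiers.items())  # 400 + 180 + 75
print(total_sports, total_questions)  # 35 655
```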
Each question carries a reference answer with a cited source (a federation rule clause, an FSSP article, a RUSADA module) and three judge criteria: what counts as a correct answer, what counts as an incomplete one, and what earns a bonus.
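Structurally, each entry is a small record. A sketch of what one item might look like as a Python dataclass; the field names and the example question are my assumption for illustration, not the actual file format:

```python
from dataclasses import dataclass

@dataclass
class BenchItem:
    sport: str             # one of the 35 sports
    category: str          # one of the eight categories
    question: str
    reference_answer: str
    source: str            # e.g. a federation rule clause or a RUSADA module
    judge_criteria: dict   # keys: "correct", "incomplete", "bonus"

item = BenchItem(
    sport="basketball",
    category="rules and regulations",
    question="How long is the shot clock after an offensive rebound?",
    reference_answer="14 seconds under FIBA rules.",
    source="FIBA Official Basketball Rules, shot-clock article",
    judge_criteria={
        "correct": "names 14 seconds",
        "incomplete": "says the clock resets but not to what",
        "bonus": "cites the rule article",
    },
)
```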
What the machine wrote is a candidate, not a final version
Next phase is expert review. Between tonight and May 13 I'm going through the questions one by one, checking phrasing and factuality and cross-checking against official documents from the Russian Basketball Federation, the Volleyball Federation, RUSADA, and the Ministry of Sport.
After review comes the first run: two candidate base models for LII-Sport get evaluated on the benchmark. Base model decision: May 13. Public release of LII-Sport Preview with the scores: June 15, on HuggingFace and Habr.
What it's for, for me
The benchmark is an instrument. Before each LII-Sport release I run the model through these 655 questions and see where it sags — regulatory, scenario, biomechanics. It's a map for where to work on corpus and training next.
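Finding where the model sags is just per-category aggregation of the judge's verdicts. A minimal sketch, assuming the judge emits 1.0 for correct, 0.5 for incomplete, and 0.0 for wrong (the scoring scale is my assumption):

```python
from collections import defaultdict

def category_scores(verdicts):
    """verdicts: iterable of (category, score) pairs from the LLM judge.
    Returns the mean score per category, so weak categories stand out."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, score in verdicts:
        sums[category] += score
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

scores = category_scores([
    ("rules and regulations", 1.0),
    ("rules and regulations", 1.0),
    ("scenario reasoning", 0.5),
    ("scenario reasoning", 0.0),
])
# scores -> {"rules and regulations": 1.0, "scenario reasoning": 0.25}
```

Sorting that dict by value gives the map of where to work on corpus and training next.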
Whether it opens publicly — I'll decide by June 15. If yes, it'll be the first Russian-language sport reference point. Not the main goal. Main goal is building the model.
→ Manual review — through May 13
→ First evaluation run on the candidates — this week
→ Public model release plus benchmark — June 15