Recorded my daughter's recital — by Saturday night, a working pipeline came out

My youngest performed on stage for the first time on Friday. The All-Russian Festival-Competition «The Wondrous Chime of Gusli» at ЧДМШ №1 in Cheboksary. 33 performances over three hours.

I recorded on iPhone with an Insta360 gimbal — as a parent, not as a producer. One continuous file: 3 hours 5 minutes, 8.8 GB.

Saturday morning I was looking at that file thinking: dozens of families won't get their kid's clip if I don't do this.

So I decided to build a pipeline — cut the recording into individual portfolio clips for each performer, with a branded school header card.

By Saturday evening 33 files were on Google Drive, ready to share. Between those two points — 30 hours and a fair number of rakes underfoot.

What was supposed to be simple

The idea is simple: AI listens to the audio, finds the announcer's introductions, everything between two announcements is a performance, ffmpeg cuts at those boundaries.

The reality is simple: nothing works the first try.

→ Gemini 3.1 Pro listened to 3 hours of audio plus 5 program pages. Returned timestamps. On my daughter's slot it was off by 23 minutes — said 1:49 when it was actually 1:26. Didn't work.

→ Groq Whisper API — throttled by free tier. They allow 7,200 seconds of audio per hour. I had 11,084 seconds. Didn't work.

→ OpenRouter Whisper — returned a 500 on their end. Didn't work.

→ Local Whisper.cpp without VAD — transcribed «Субтитры создавал DimaTorzok» («Subtitles by DimaTorzok») 370 times in a row across the entire recording. Whisper hallucinates on music when there's no clear speech. Didn't work.

→ Whisper.cpp with Silero VAD — caught 195 seconds of speech out of 11,000. Threshold too strict. Didn't work.

→ Same Whisper.cpp with threshold 0.2 plus aggressive ffmpeg loudness pre-normalization — caught 532 seconds. Worked.

(by attempt five I'd started to enjoy the exact way each next layer was breaking)
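Roughly, the winning step can be sketched as two command builders. This is a sketch, not my exact invocation: the `whisper-cli` VAD flag names (`--vad`, `--vad-model`, `--vad-threshold`) are from recent whisper.cpp builds and may differ in yours, and the loudnorm parameters shown are common EBU R128 defaults, not necessarily the ones I settled on.

```python
def normalize_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg command: loudness-normalize and downmix to 16 kHz mono WAV,
    the input format whisper.cpp expects."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # EBU R128 loudness normalization
        "-ar", "16000", "-ac", "1",
        dst,
    ]

def transcribe_cmd(wav: str, model: str, vad_model: str,
                   threshold: float = 0.2) -> list[str]:
    """whisper-cli command: Silero VAD at a permissive threshold,
    SRT output (see the timestamp bug below)."""
    return [
        "whisper-cli", "-m", model, "-f", wav,
        "--vad", "--vad-model", vad_model,
        "--vad-threshold", str(threshold),
        "-l", "ru",
        "--output-srt",
    ]
```

The threshold of 0.2 is what finally let quiet announcer speech through after normalization; the default is stricter and was what caught only 195 seconds.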

Two whisper-cpp bugs that ate a couple of hours

First: with VAD on, timestamps in the JSON output are relative to the concatenated speech, not to the original timeline. Whisper compressed 11,000 seconds of audio into 532 seconds of speech, and the timestamps ran from 0 to 532. All my segments collapsed into one cluster. It took an hour and a half to discover that the SRT output keeps the original timestamps.
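Switching to the SRT output meant parsing its timestamps back into seconds. A minimal sketch (the function name is mine):

```python
import re

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def srt_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '01:26:03,250' into seconds.
    Accepts both ',' and '.' as the millisecond separator."""
    h, m, s, ms = map(int, SRT_TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0
```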

Second: whisper-cpp's SRT writer stretches each segment's end time to the next segment's start. The phrase «Сергей Маков» (two seconds of speech) and «Областной колледж» (four minutes of music later) — SRT wrote them as one four-minute segment. Everything collapsed into one cluster again.

Fixed by estimating real duration from word count: Russian announcer speech ≈ 2.5 words per second, so «Сергей Маков» is 0.8 seconds, not four minutes. Clustering by real pauses — worked.
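The fix above, as a sketch. The 2.5 words-per-second rate is the empirical figure from the post; the 30-second minimum gap between announcement clusters is an illustrative value, not my tuned one.

```python
WORDS_PER_SECOND = 2.5  # rough rate for Russian announcer speech

def estimated_end(start: float, text: str) -> float:
    """Ignore the stretched SRT end time; estimate duration from word count."""
    return start + len(text.split()) / WORDS_PER_SECOND

def cluster_by_pauses(segments: list[tuple[float, str]],
                      min_gap: float = 30.0) -> list[list[tuple[float, str]]]:
    """Group (start, text) segments into announcement clusters.
    A pause longer than min_gap — i.e. a performance — starts a new cluster."""
    clusters: list[list[tuple[float, str]]] = []
    prev_end = None
    for start, text in segments:
        if prev_end is None or start - prev_end > min_gap:
            clusters.append([])
        clusters[-1].append((start, text))
        prev_end = estimated_end(start, text)
    return clusters
```

With this, «Сергей Маков» gets its real 0.8 seconds, the four minutes of music after it becomes a genuine pause, and the next announcement opens a new cluster.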

Then it got interesting

After the fixes, automatic segmentation hit 22 of 33 performances. The remaining 11 — my daughter's ensemble «Янрав», the Tatar «Хэзинэ», the Mari «Чинчывий» — ended up in wrong clusters because of name distortions in transcription.

I could have kept tuning the algorithm. But I realized: AI catches 60-80% of cases, the last 20-40% is a UI-for-manual-review problem, not a better-AI problem.

In two hours I built a local web editor in Python + HTML. Left panel — Whisper transcript by segments. Center — video player. Right — cut list. Each row has a «⬅» button that writes the current player time into start or end. The «+ Add from unmatched» button adds a missing performer from the program's canonical list.
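Before cutting, the saved cut list gets a sanity pass. A sketch of that check (field names and the overlap rule here are illustrative, not the editor's actual schema):

```python
def validate_cuts(rows: list[dict]) -> list[dict]:
    """Sanity-check cut rows saved by the editor: every row needs a
    performer and start < end, and cuts must not overlap. Returns the
    rows sorted by start time, ready for ffmpeg."""
    rows = sorted(rows, key=lambda r: r["start"])
    for i, r in enumerate(rows):
        if not r["performer"] or r["start"] >= r["end"]:
            raise ValueError(f"bad row {i}: {r}")
        if i and r["start"] < rows[i - 1]["end"]:
            raise ValueError(f"row {i} overlaps the previous cut")
    return rows
```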

In 25 minutes I verified all 33 rows, fixed misalignments, added what was missing. Saved.

Then ffmpeg cut by timestamps, Python with Pillow generated branded header cards for each cut, ffmpeg concatenated header + clip + closing card.
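The cutting and concatenation steps, sketched as command builders. Assumptions: re-encoding on the cut (stream copy would snap to keyframes and drift the boundaries), and the concat demuxer with `-c copy`, which requires the header card, clip, and closing card to share codec parameters. The Pillow card generation is omitted here.

```python
def cut_cmd(src: str, start: float, end: float, out: str) -> list[str]:
    """ffmpeg command: extract one performance, re-encoding so the cut
    lands on the exact timestamps rather than the nearest keyframe."""
    return [
        "ffmpeg", "-y", "-ss", f"{start:.2f}", "-to", f"{end:.2f}",
        "-i", src, "-c:v", "libx264", "-c:a", "aac", out,
    ]

def concat_cmd(listfile: str, out: str) -> list[str]:
    """ffmpeg command: join header card + clip + closing card via the
    concat demuxer. listfile contains lines like: file 'header.mp4'"""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", listfile, "-c", "copy", out]
```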

10 minutes — 33 files in the final/ folder.

What I didn't expect

This pipeline isn't a one-off festival gift.

The same scripts run on any next concert this school holds — or our summer project together (Liga Zaliva, a multi-discipline academy for local kids that includes a music track). The school records on iPhone — I process.

The web editor I built in a rush in two hours is the admin interface for a platform I'm about to build for arts schools over the next two months.

One 8.8 GB file Saturday morning — turned out by evening it was three launches at once. A gift for dozens of families. A reusable pipeline. A template for a new product.

And my daughter now has her first concert video, properly cut and labeled. With «ЛИИ × ЧДМШ №1» in the bottom-right corner.

More as it arrives.
