How three raw cameras and a transcript became a finished, speaker-following, subtitled episode — with no one touching a timeline.
Two men named Tony Zhang sat down in a study full of books and talked, in Mandarin, for forty-four minutes. Three cameras rolled. The goal of post-production was simple to state and fiddly to do: turn the raw footage into an episode that cuts to whoever is speaking, reads cleanly with Chinese subtitles, and sounds like a proper podcast — without a human editor scrubbing a timeline shot by shot.
Everything below was done programmatically: the angles were identified, the cameras were synchronized to each other, the cut was generated from a speaker-diarized transcript, the subtitles were written and timed, and the whole thing was rendered. This page is the record of what was built, how, and why each decision was made.
Six files arrived. Their names told the first part of the story: four MVI_* clips from a Canon (which splits a long recording into 4 GB chunks and auto-stops at its 29:59 limit), one IMG_* clip from an iPhone, and one earlier export of a third angle. Probing each for resolution, frame rate, codec, and audio confirmed a clean three-camera shoot — and a single frame from each revealed who each camera was pointed at.
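The probing step can be sketched with `ffprobe`, which ships with FFmpeg. This is a minimal sketch and not the pipeline's actual code: the function names `probe` and `summarize` are made up for illustration, and it assumes `ffprobe` is on the PATH and that each file has at least one video stream.

```python
import json
import subprocess
from fractions import Fraction


def probe(path):
    """Run ffprobe on one file and return its JSON description."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(info):
    """Reduce ffprobe JSON to the fields used to tell the cameras apart."""
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    audio = [s for s in info["streams"] if s["codec_type"] == "audio"]
    return {
        "resolution": f'{video["width"]}x{video["height"]}',
        # r_frame_rate is a rational string such as "30000/1001"
        "fps": round(float(Fraction(video["r_frame_rate"])), 3),
        "video_codec": video["codec_name"],
        "audio_codec": audio[0]["codec_name"] if audio else None,
        "duration_s": round(float(info["format"]["duration"]), 2),
    }
```

Calling `summarize(probe(path))` on each of the six files gives one small record per clip, which is enough to group clips by camera before looking at any frames.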



That gave the editing grammar for the entire episode: a wide establishing two-shot, a tight guest angle, and a tight host angle.
Three cameras started and stopped at different moments, and camera clocks drift — so their internal timestamps can't be trusted to line them up. The fix is to listen instead of look: every camera recorded the same room audio, so the soundtracks can be cross-correlated to find the exact offset between them.
Rather than match raw waveforms (each microphone colors the sound differently), the pipeline compares a log-energy envelope of each track — the rhythm of speech onsets, laughs, and pauses, which every mic hears the same way. The peak of the correlation is the offset, refined down to the sample.
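The envelope idea can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not the pipeline's implementation: the names `log_energy_envelope` and `best_lag`, the window size `hop`, and the brute-force search over a bounded lag range are all illustrative choices. A real run over 44 minutes of audio would more likely use an FFT-based correlation, and the final sample-level refinement is not shown here.

```python
import math


def log_energy_envelope(samples, hop):
    """Log of the summed squared amplitude in consecutive windows of `hop` samples."""
    env = []
    for start in range(0, len(samples) - hop + 1, hop):
        energy = sum(x * x for x in samples[start:start + hop])
        env.append(math.log(energy + 1e-9))  # small floor avoids log(0) on silence
    mean = sum(env) / len(env)
    # Subtracting the mean removes a constant gain difference between mics.
    return [e - mean for e in env]


def best_lag(ref, other, max_lag):
    """Return the lag, in envelope frames, at which `other` best matches `ref`.

    A positive lag means the same event appears `lag` frames later in `other`.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(ref[i], other[i + lag])
                 for i in range(len(ref)) if 0 <= i + lag < len(other)]
        if len(pairs) < 2:
            continue
        score = sum(a * b for a, b in pairs) / len(pairs)
        if score > best_score:
            best, best_score = lag, score
    return best
```

The lag converts to seconds as `lag * hop / sample_rate`. Because the envelope is a logarithm, a quieter microphone only shifts the curve by a constant, which the mean subtraction removes; that is why the envelope correlates across cameras where raw waveforms would not.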
With every angle placed on one shared timeline, a frame pulled from each camera at the same instant shows the same gesture from three directions — the proof that the sync is real.
The conversation had already been transcribed and speaker-diarized — 1,179 timed segments, each labeled guest or host. That transcript is the director. The rule is intuitive: when the guest talks, show the guest; when the host talks, show the host; open on the wide, and break long monologues with it for variety.
The hard part is restraint. Of those 1,179 segments, 693 were under 1.5 seconds — "mm", "right", "yeah", a half-second of overlap. Cutting on each would strobe. So short turns are merged, a minimum shot length is enforced, and back-channels are absorbed into the surrounding shot. The result is a calm, conversation-led edit.
| The automatic cut | |
|---|---|
| Shots generated | 216, gap-free across 43.7 min |
| Median shot length | ~10 seconds |
| Screen time — guest | 62% |
| Screen time — host | 27% |
| Screen time — wide | 11% |
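The restraint rules (ignore back-channels, merge same-speaker turns, enforce a minimum shot length) can be sketched as a small planner. The function name `plan_shots`, the segment tuple layout, and both thresholds are assumptions for illustration: the 1.5 second figure echoes the statistic quoted above, while the 3 second minimum shot is an invented placeholder. Opening on the wide and inserting wide shots into long monologues are left out.

```python
def plan_shots(segments, min_turn=1.5, min_shot=3.0):
    """Turn diarized segments into a calm, gap-free shot list.

    segments: list of (start_s, end_s, speaker), sorted by start time.
    Returns a list of dicts with "start", "end", and "angle".
    """
    # 1. Ignore back-channels: turns too short to justify a cut.
    turns = [s for s in segments if s[1] - s[0] >= min_turn]

    # 2. Merge consecutive turns by the same speaker. Each shot runs until
    #    the next speaker starts, so the shot list has no gaps.
    shots = []
    for start, end, who in turns:
        if shots and shots[-1]["angle"] == who:
            shots[-1]["end"] = end
        else:
            if shots:
                shots[-1]["end"] = start
            shots.append({"start": start, "end": end, "angle": who})

    # 3. Absorb shots shorter than min_shot into the shot before them,
    #    then rejoin neighbours that now show the same angle.
    calm = []
    for shot in shots:
        if calm and shot["end"] - shot["start"] < min_shot:
            calm[-1]["end"] = shot["end"]
        elif calm and calm[-1]["angle"] == shot["angle"]:
            calm[-1]["end"] = shot["end"]
        else:
            calm.append(dict(shot))
    return calm
```

With this shape, a half-second "mm" from the host in the middle of a guest answer never reaches the shot list, and a two-second interjection is folded into the guest's shot instead of producing two quick cuts.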
The same transcript becomes the subtitles. Each speaker gets a color — gold for the guest, blue for the host — so you can follow the conversation at a glance. Long lines are split into readable, word-timed cues using the per-word timestamps, so the text advances with the speech instead of dumping a paragraph on screen. A transcription hiccup (the show's name, "T4", looped a dozen times) was detected and collapsed. 1,131 cues in all.
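Both subtitle steps are simple list operations once per-word timestamps exist. The sketch below is illustrative only: `split_cues`, `collapse_loops`, the `(text, start, end)` word layout, the 16-character line limit, and the repeat threshold are all assumptions, and words are joined without spaces because the captions are Chinese.

```python
from itertools import groupby


def split_cues(words, max_chars=16):
    """Group timed words into cues no longer than max_chars characters.

    words: list of (text, start_s, end_s). Each cue takes its start time
    from its first word and its end time from its last word.
    """
    cues, current = [], []
    for word in words:
        length = sum(len(w[0]) for w in current) + len(word[0])
        if current and length > max_chars:
            cues.append(("".join(w[0] for w in current),
                         current[0][1], current[-1][2]))
            current = []
        current.append(word)
    if current:
        cues.append(("".join(w[0] for w in current),
                     current[0][1], current[-1][2]))
    return cues


def collapse_loops(tokens, max_repeats=3):
    """Collapse any run of one token repeated more than max_repeats times
    down to a single occurrence; shorter runs are kept as spoken."""
    out = []
    for token, group in groupby(tokens):
        count = len(list(group))
        out.extend([token] * (1 if count > max_repeats else count))
    return out
```

Running `collapse_loops` before `split_cues` keeps a looped token from filling several cues with the same word.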


Only the picture cuts between angles; the soundtrack is one continuous, loudness-normalized master (−16 LUFS). Because a picture cut never touches the audio, it cannot introduce a click, a pop, or sync drift.
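One common way to hit a −16 LUFS target is FFmpeg's `loudnorm` filter. The sketch below only builds the command line; the helper name `loudnorm_command`, the file names, and the true-peak and loudness-range values are assumptions, and nothing here claims to be the command the pipeline actually ran.

```python
def loudnorm_command(src, dst, target_lufs=-16, true_peak=-1.5, lra=11):
    """Build an ffmpeg command that loudness-normalizes one audio file.

    target_lufs comes from the episode spec; true_peak and lra are
    assumed, commonly used values.
    """
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak}:LRA={lra}"
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]
```

A single pass like this adjusts gain dynamically; a two-pass run that first measures the file and then feeds the measurements back to the filter tends to land closer to the target.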
The four minutes of mic-checking at the top and the after-chat at the end are cut. The episode opens on "well… Tony and Tony" and ends on the sign-off — 39:23.
Each shot is encoded to identical specs so the pieces join seamlessly; the captions are burned into the final 1080p file.
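Identical encoding specs matter because they allow the shots to be joined without re-encoding, for example with FFmpeg's concat demuxer. The following is a sketch under that assumption; the helper names and shot file names are hypothetical, and the caption burn-in step (which does require a re-encode) is not shown.

```python
def concat_list(shot_files):
    """Return the text of a concat-demuxer list file, one shot per line.

    Single quotes inside a path are escaped the way the demuxer expects.
    """
    lines = ["file '{}'".format(path.replace("'", "'\\''"))
             for path in shot_files]
    return "\n".join(lines) + "\n"


def concat_command(list_path, output_path):
    """Build an ffmpeg command that joins the listed shots by stream copy."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]
```

Stream copy only works when every piece shares codec, resolution, frame rate, and pixel format, which is the reason each shot is encoded to the same specs first.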
Six raw files (about 18.8 GB, from three cameras) went in, and one finished episode came out: speaker-following, subtitled, sound-mastered, 39 minutes 23 seconds. No timeline was scrubbed by hand.
Episode 01 of Tony & Tony Talks in the Triangle — two career-and-life veterans, one Mandarin conversation in North Carolina, on what AI is doing to the shape of a career.
A forty-minute episode is a quarry for short clips. The transcript already pins every line to the millisecond, so making a vertical short does not mean re-watching the whole thing. It takes three steps: locate the line → take the whole sentence → apply the vertical caption template.
Step one is always to check the timestamps. A "~28:30" written from memory turned out to be 29:07 once checked against merged.json, and two of the six candidates were off in this way. The validated lineup, with all six now produced (★ = priority picks):
| Line | Time | Audience |
|---|---|---|
| ★ "You don't have eight pay grades anymore — my AI has only one" | 26:25 | pharma / professionals (top hook) |
| ★ "Needing no product at all is your biggest competitor" | 16:48 | marketing one-liner |
| ★ "Dear young people: the gap between us is shrinking" | 29:07 | students / new grads |
| Don't make your best salesperson a manager — make them a trainer | 36:26 | managers |
| The Peter Principle: everyone in their post is incompetent | 37:47 | spicy / comment-bait |
| BTS: AI cuts the cameras like a director | 00:58 | behind-the-scenes / meta |
The last one hides an Easter egg: in the pre-show small talk, the guest is describing exactly what AI should do — "like a director: when someone gets serious and their voice rises, cut to them" — and that is precisely how this episode was cut by this pipeline.
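The timestamp check described earlier amounts to a lookup in the transcript. The layout of merged.json is not documented here, so this sketch assumes a list of segments with `start` (seconds) and `text` fields; the function names and the five-second tolerance are likewise hypothetical.

```python
def mmss(seconds):
    """Format a time in seconds as MM:SS, truncating fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def parse_mmss(stamp):
    """Parse an MM:SS string into seconds."""
    minutes, secs = stamp.split(":")
    return int(minutes) * 60 + int(secs)


def check_timestamp(segments, phrase, remembered, tolerance_s=5):
    """Compare a remembered MM:SS against where the phrase really starts.

    segments: assumed schema, a list of {"start": float, "text": str}.
    Returns None when the phrase is not found in any segment.
    """
    for seg in segments:
        if phrase in seg["text"]:
            off_by = round(seg["start"] - parse_mmss(remembered))
            return {"actual": mmss(seg["start"]),
                    "off_by_s": off_by,
                    "ok": abs(off_by) <= tolerance_s}
    return None
```

A candidate whose `ok` flag is false gets its timestamp replaced with `actual` before any clip is cut.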
Vertical 1080×1920, auto speaker-follow crop + karaoke captions, reusing the same sync and edit data as the full cut. All six produced — ready to publish.
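The crop arithmetic for a vertical frame is worth spelling out. This sketch assumes a 1920×1080 landscape source and a horizontal speaker position `center_x` supplied by some earlier step; how the pipeline locates the speaker is not described here, and the function names are illustrative.

```python
def vertical_crop(src_w, src_h, center_x, aspect=(9, 16)):
    """Return (width, height, x, y) of a full-height 9:16 window.

    The window is centred on center_x and clamped so it never leaves the
    frame. The width is rounded down to an even number, which 4:2:0
    video encoders generally require.
    """
    crop_w = int(src_h * aspect[0] / aspect[1]) // 2 * 2
    x = int(round(center_x - crop_w / 2))
    x = max(0, min(x, src_w - crop_w))
    return crop_w, src_h, x, 0


def crop_filter(src_w, src_h, center_x, out_w=1080, out_h=1920):
    """Build an ffmpeg filter string: crop the window, then scale it up."""
    w, h, x, y = vertical_crop(src_w, src_h, center_x)
    return f"crop={w}:{h}:{x}:{y},scale={out_w}:{out_h}"
```

For a 1080-pixel-tall source the ideal window is 607.5 pixels wide, so the sketch uses 606; the resulting stretch on upscale is well under one percent.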
The finished episode and all six vertical clips — click to play. Public links, no sign-in.