T4 · EP01 — Production Notes

The brief

Two men named Tony Zhang sat down in a study full of books and talked, in Mandarin, for forty-four minutes. Three cameras rolled. The goal of post-production was simple to state and fiddly to do: turn the raw footage into an episode that cuts to whoever is speaking, reads cleanly with Chinese subtitles, and sounds like a proper podcast — without a human editor scrubbing a timeline shot by shot.

Everything below was done programmatically: the angles were identified, the cameras were synchronized to each other, the cut was generated from a speaker-diarized transcript, the subtitles were written and timed, and the whole thing was rendered. This page is the record of what was built, how, and why each decision was made.

STEP 01 · Reading the footage

Six files arrived. Their names told the first part of the story: four MVI_* clips from a Canon (which splits a long recording into 4 GB chunks and auto-stops at its 29:59 limit), one IMG_* clip from an iPhone, and one earlier export of a third angle. Probing each for resolution, frame rate, codec, and audio confirmed a clean three-camera shoot — and a single frame from each revealed who each camera was pointed at.

[Images: wide two-shot, guest close-up] Three angles, one moment: the same instant of the conversation seen from each camera.

That gave the editing grammar for the entire episode: a wide establishing two-shot, a tight guest angle, and a tight host angle.
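The probing step can be sketched as below. This is a minimal sketch, not the pipeline's actual code: it assumes ffprobe is installed, and the function names probe and summarize are made up for illustration.

```python
import json
import subprocess

def probe(path):
    """Run ffprobe on one media file and return its JSON description."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def summarize(info):
    """Reduce ffprobe JSON to the fields that matter for a multicam shoot."""
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    audio = [s for s in info["streams"] if s["codec_type"] == "audio"]
    # ffprobe reports frame rate as a fraction string such as "30000/1001".
    num, den = video["r_frame_rate"].split("/")
    return {
        "resolution": f'{video["width"]}x{video["height"]}',
        "fps": round(int(num) / int(den), 3),
        "video_codec": video["codec_name"],
        "audio_codec": audio[0]["codec_name"] if audio else None,
        "duration_s": float(info["format"]["duration"]),
    }
```

Calling summarize(probe(path)) on each of the six files gives a one-line comparison of the cameras; files whose resolution and frame rate match are likely chunks from the same camera.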

STEP 02 · Syncing without a clapperboard

Three cameras started and stopped at different moments, and camera clocks drift — so their internal timestamps can't be trusted to line them up. The fix is to listen instead of look: every camera recorded the same room audio, so the soundtracks can be cross-correlated to find the exact offset between them.

Rather than match raw waveforms (each microphone colors the sound differently), the pipeline compares a log-energy envelope of each track — the rhythm of speech onsets, laughs, and pauses, which every mic hears the same way. The peak of the correlation is the offset, refined down to the sample.
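The envelope comparison might look like the sketch below. It is a plain-Python illustration under stated assumptions, not the pipeline's implementation: the 10 ms hop, the normalized score, and both function names are hypothetical, and the final refinement down to the sample is left out.

```python
import math

def log_energy_envelope(samples, sample_rate, hop_s=0.01):
    """Log of the mean squared amplitude per hop, shifted to zero mean."""
    hop = max(1, int(sample_rate * hop_s))
    env = []
    for i in range(0, len(samples) - hop + 1, hop):
        frame = samples[i:i + hop]
        energy = sum(x * x for x in frame) / hop
        env.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    mean = sum(env) / len(env)
    return [v - mean for v in env]

def best_lag(ref, other, max_lag):
    """Find the lag (in hops) where other[i] best matches ref[i + lag].

    A positive lag means `other` started recording that many hops after
    `ref`. Returns (lag, score), with score a normalized correlation.
    """
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)
        hi = min(len(other), len(ref) - lag)
        if hi - lo < 2:
            continue  # not enough overlap to compare
        num = a2 = b2 = 0.0
        for i in range(lo, hi):
            a, b = ref[i + lag], other[i]
            num += a * b
            a2 += a * a
            b2 += b * b
        score = num / math.sqrt(a2 * b2) if a2 and b2 else 0.0
        if score > best[1]:
            best = (lag, score)
    return best
```

The direct loop is fine for short envelopes; over forty-four minutes of audio an FFT-based correlation would be much faster. The normalized score is one plausible way to express a match confidence, though these notes do not say how the 0.87 figure was computed.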

It checked its own work

The Canon's four chunks were synced independently, yet they lined up end-to-end to within ~30 milliseconds, and the gap between the third and fourth chunk landed exactly where the camera's 29:59 auto-stop would put it. Independent measurements agreeing to within about a frame is about as strong a validation as audio sync gets. The iPhone matched with 0.87 confidence.

With every angle placed on one shared timeline, a frame pulled from each camera at the same instant shows the same gesture from three directions — the proof that the sync is real.
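The end-to-end check on the Canon chunks can be expressed in a few lines. This is a sketch only: the chunk names and numbers in the usage note are invented for illustration, not measurements from this shoot.

```python
def chunk_gaps(chunks):
    """Gap in seconds between each chunk's end and the next chunk's start.

    chunks: list of (name, start_s, duration_s) on the shared timeline,
    where start_s comes from audio sync and duration_s from the file itself.
    """
    gaps = []
    for (a_name, a_start, a_dur), (b_name, b_start, _) in zip(chunks, chunks[1:]):
        gap = b_start - (a_start + a_dur)
        gaps.append((a_name, b_name, round(gap, 3)))
    return gaps
```

For chunks split by the 4 GB limit the gap should be close to zero; a larger gap is expected only where the camera stopped and was restarted. With made-up values, chunk_gaps([("A", 0.0, 600.0), ("B", 600.02, 600.0)]) reports a gap of 0.02 s between A and B.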

STEP 03 · Cutting to whoever is talking

The conversation had already been transcribed and speaker-diarized — 1,179 timed segments, each labeled guest or host. That transcript is the director. The rule is intuitive: when the guest talks, show the guest; when the host talks, show the host; open on the wide, and break long monologues with it for variety.

The hard part is restraint. Of those 1,179 segments, 693 were under 1.5 seconds — "mm", "right", "yeah", a half-second of overlap. Cutting on each would strobe. So short turns are merged, a minimum shot length is enforced, and back-channels are absorbed into the surrounding shot. The result is a calm, conversation-led edit.
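The restraint rules can be sketched as follows. The thresholds are assumptions (1.5 seconds echoes the figure above, the 3-second minimum shot is invented), the function name is hypothetical, and the wide-shot inserts are left out to keep the example short.

```python
def plan_shots(segments, total_s, min_turn_s=1.5, min_shot_s=3.0):
    """Turn diarized segments into a gap-free list of (start, end, speaker).

    segments: list of (start_s, end_s, speaker), sorted by start time.
    """
    # 1. Back-channels: turns shorter than min_turn_s never trigger a cut.
    turns = [s for s in segments if s[1] - s[0] >= min_turn_s]

    # 2. A cut happens where a qualifying turn by a different speaker begins.
    cuts = []
    for start, _end, speaker in turns:
        if not cuts:
            cuts.append((0.0, speaker))
        elif speaker != cuts[-1][1]:
            cuts.append((start, speaker))

    # 3. Close every shot at the next cut (no gaps), and absorb shots that
    #    are too short, or that repeat the previous speaker, into the
    #    previous shot.
    shots = []
    for i, (start, speaker) in enumerate(cuts):
        end = cuts[i + 1][0] if i + 1 < len(cuts) else total_s
        if shots and (end - start < min_shot_s or shots[-1][2] == speaker):
            shots[-1][1] = end
        else:
            shots.append([start, end, speaker])
    return [tuple(s) for s in shots]
```

In this sketch a brief interjection by the host in the middle of a guest answer leaves the guest on screen for the whole stretch, which is the behavior the paragraph above describes.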

The automatic cut
Shots generated	216, gap-free across 43.7 min
Median shot length	~10 seconds
Screen time — guest	62%
Screen time — host	27%
Screen time — wide	11%
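Figures like those in the table can be derived from the shot list itself. A minimal sketch, assuming shots are (start_s, end_s, angle) tuples; both function names are hypothetical.

```python
def is_gap_free(shots, tol=1e-6):
    """True if every shot starts exactly where the previous one ends."""
    return all(abs(a[1] - b[0]) <= tol for a, b in zip(shots, shots[1:]))

def screen_share(shots):
    """Percentage of total screen time per angle, rounded to whole percent."""
    total = sum(end - start for start, end, _ in shots)
    seconds = {}
    for start, end, angle in shots:
        seconds[angle] = seconds.get(angle, 0.0) + (end - start)
    return {angle: round(100 * t / total) for angle, t in seconds.items()}
```

Because each share is rounded independently, the percentages may not always sum to exactly 100.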

STEP 04 · Writing the subtitles

The same transcript becomes the subtitles. Each speaker gets a color — gold for the guest, blue for the host — so you can follow the conversation at a glance. Long lines are split into readable, word-timed cues using the per-word timestamps, so the text advances with the speech instead of dumping a paragraph on screen. A transcription hiccup (the show's name, "T4", looped a dozen times) was detected and collapsed. 1,131 cues in all.

[Images: cold open with subtitle, guest with subtitle] Speaker-colored captions, burned in at 1080p.
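Collapsing a looped token and splitting on word timestamps might look like this. It is a sketch under assumptions: the word format ({"text", "start", "end"}), the 16-character limit, and the 0.6-second pause threshold are all invented for illustration.

```python
def collapse_repeats(words, max_run=1):
    """Keep at most max_run consecutive copies of the same token."""
    out = []
    run = 0
    for w in words:
        run = run + 1 if out and w["text"] == out[-1]["text"] else 1
        if run <= max_run:
            out.append(w)
    return out

def split_cues(words, max_chars=16, max_gap_s=0.6):
    """Group timed words into cues, breaking on length or on a pause."""
    groups, cur = [], []
    for w in words:
        length = sum(len(x["text"]) for x in cur) + len(w["text"])
        too_long = bool(cur) and length > max_chars
        paused = bool(cur) and w["start"] - cur[-1]["end"] > max_gap_s
        if too_long or paused:
            groups.append(cur)
            cur = []
        cur.append(w)
    if cur:
        groups.append(cur)
    # Chinese text is joined without spaces between tokens.
    return [
        {"start": g[0]["start"], "end": g[-1]["end"],
         "text": "".join(x["text"] for x in g)}
        for g in groups
    ]
```

Each cue inherits its start from its first word and its end from its last, which is what keeps the text advancing with the speech.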

STEP 05 · The render

Switch video, never audio

Only the picture cuts between angles; the soundtrack is one continuous, loudness-normalized master (−16 LUFS). Because the audio is never cut, an edit cannot introduce a click, a pop, or sync drift.
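One common way to reach a −16 LUFS target is ffmpeg's loudnorm filter. The sketch below only builds the command line; whether this pipeline used loudnorm is not stated, and the true-peak and loudness-range values are assumed defaults, not figures from these notes.

```python
def loudnorm_cmd(src, dst, target_lufs=-16.0, true_peak_db=-1.5, lra=11.0):
    """Build an ffmpeg command that writes a loudness-normalized audio master."""
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA={lra}"
    return [
        "ffmpeg", "-y", "-i", src,
        "-vn",                # audio only: the picture is handled separately
        "-af", audio_filter,
        "-ar", "48000",       # loudnorm resamples internally, so pin the rate
        dst,
    ]
```

A single pass like this normalizes dynamically; running loudnorm twice, feeding the measured values from the first pass into the second, generally lands closer to the target.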

Trim to the show

The four minutes of mic-checking at the top and the after-chat at the end are cut. The episode opens on "well… Tony and Tony" and ends on the sign-off — 39:23.

Normalize, concatenate, burn

Each shot is encoded to identical specs so the pieces join seamlessly; the captions are burned into the final 1080p file.
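The normalize, concatenate, and burn steps map naturally onto three ffmpeg invocations. This sketch builds the commands without running them; the file names, the 30 fps default, and the codec choices are assumptions, and the subtitles filter requires an ffmpeg build with libass.

```python
def shot_cmd(src, start_s, dur_s, dst, fps=30):
    """Encode one shot from one angle to the shared spec, without audio."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start_s:.3f}", "-i", src,   # seek on the input side
        "-t", f"{dur_s:.3f}",
        "-an",
        "-vf", f"scale=1920:1080,fps={fps}",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        dst,
    ]

def concat_list(paths):
    """Text of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)

def final_cmd(list_path, audio, subs, dst):
    """Join the shots, lay the single audio master under them, burn captions."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,
        "-i", audio,
        "-map", "0:v", "-map", "1:a",
        "-vf", f"subtitles={subs}",
        "-c:v", "libx264", "-c:a", "aac",
        "-shortest",
        dst,
    ]
```

Encoding every shot with the same resolution, frame rate, and pixel format is what lets the concat demuxer join them without glitches at the seams.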

Why it was built this way

Sync on sound, not on clocks. Camera timestamps lie; shared audio doesn't. With no clapperboard and no trustworthy camera clocks, cross-correlating the audio is what lines three independent cameras up to the frame.
The transcript is the editor. Who is on screen should follow who is talking — and the diarized transcript already knows, for every moment, who that is.
Protect the viewer from the data. Real speech is full of half-second interruptions. Merging them is what separates an edit from a strobe light.
One audio bed. Keeping the sound continuous and only switching the picture is how professional multicam works — and it makes the whole thing robust.
Every step verifies. Sync was checked against the cameras' own physics; the cut was checked for gaps; the render was checked frame by frame. Nothing was assumed.

The result

Six raw files (~18.8 GB, three cameras) in; one finished episode out: speaker-following, subtitled, sound-mastered, 39 minutes 23 seconds. No timeline was scrubbed by hand.

Episode 01 of Tony & Tony Talks in the Triangle — two career-and-life veterans, one Mandarin conversation in North Carolina, on what AI is doing to the shape of a career.

From one episode to many clips

A forty-minute episode is a quarry for short clips. The transcript already pins every line to the millisecond, so making a vertical short isn't re-watching the whole thing — it's: locate the line → take the whole sentence → apply the vertical caption template.
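Locating a line and widening it to the whole sentence could be sketched like this. The segment format (a list of dicts with "start", "end", and "text") is an assumption about the transcript file, and all three function names are hypothetical.

```python
ENDINGS = ("。", "！", "？", ".", "!", "?")

def locate(segments, needle):
    """Index of the first segment whose text contains needle, else None."""
    for i, seg in enumerate(segments):
        if needle in seg["text"]:
            return i
    return None

def whole_sentence(segments, i):
    """Widen segment i to sentence boundaries; return (start_s, end_s)."""
    first = i
    while first > 0 and not segments[first - 1]["text"].rstrip().endswith(ENDINGS):
        first -= 1
    last = i
    while last < len(segments) - 1 and not segments[last]["text"].rstrip().endswith(ENDINGS):
        last += 1
    return segments[first]["start"], segments[last]["end"]

def mmss(seconds):
    """Format seconds as MM:SS for comparison with hand-written notes."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"
```

Printing mmss of the located start next to the time written down from memory makes a mismatch obvious before any clip is rendered.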

Step one is always to check the timestamps: a "~28:30" written from memory turned out to be 29:07 once checked against merged.json, and two of the six candidates were off like that. The validated lineup, with all six now produced (★ = priority picks):

Line	Time	Audience
★ "You don't have eight pay grades anymore — my AI has only one"	26:25	pharma / professionals (top hook)
★ "Needing no product at all is your biggest competitor"	16:48	marketing one-liner
★ "Dear young people: the gap between us is shrinking"	29:07	students / new grads
Don't make your best salesperson a manager — make them a trainer	36:26	managers
The Peter Principle: everyone in their post is incompetent	37:47	spicy / comment-bait
BTS: AI cuts the cameras like a director	00:58	behind-the-scenes / meta

The last one hides an Easter egg: in the pre-show small talk, the guest is describing exactly what AI should do — "like a director: when someone gets serious and their voice rises, cut to them" — and that is precisely how this episode was cut by this pipeline.

Vertical 1080×1920, auto speaker-follow crop + karaoke captions, reusing the same sync and edit data as the full cut. All six produced — ready to publish.
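The crop geometry for a vertical short is simple arithmetic. In this sketch, center_x (the speaker's horizontal position in the source frame) is an assumed input; how the pipeline finds it is not described here.

```python
def vertical_crop(src_w, src_h, center_x, out_w=1080, out_h=1920):
    """Build an ffmpeg filter that crops a 9:16 window around center_x."""
    crop_h = src_h
    # Width of a 9:16 window at full source height, rounded down to even.
    crop_w = int(src_h * out_w / out_h) // 2 * 2
    x = int(center_x - crop_w / 2)
    x = max(0, min(x, src_w - crop_w))  # keep the window inside the frame
    return f"crop={crop_w}:{crop_h}:{x}:0,scale={out_w}:{out_h}"
```

From a 1920×1080 source the window is only 606 pixels wide, so the final 1080×1920 frame is an upscale of roughly 1.8×.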

Watch

The finished episode and all six vertical clips — click to play. Public links, no sign-in.

▶ Full episode · 39:23

1. No more eight pay grades
2. Your biggest competitor
3. The gap is shrinking
4. Make them a trainer
5. The Peter Principle
6. BTS: AI cuts like a director