Eight AI-driven TikTok channels, one pipeline. The complete mechanism.
From the JSON config file to a vertical MP4 ready to ship on six platforms — what's actually under the hood when AI video production runs solo. Real numbers, real decisions, real bugs.
As I'm writing this, eight TikTok and YouTube Shorts channels are publishing continuously from my machine. No content team. No editor. A solo operation that consumes between four and ten minutes of my time per episode, script-writing excluded. The rest (voiceover, captions, B-roll, audio mix, render, multi-platform publishing) is delegated to a pipeline that fits in a nine-hundred-line Python file and a handful of calls to off-the-shelf tools.
What follows is the complete mechanism. The tools, the technical decisions, the costs. The way this cottage industry actually functions, as opposed to the version Twitter threads sell — the one where "a single prompt generates a channel." I won't lie to you: the prompt is the easy part.
The eight channels
| Channel | Format | State |
|---|---|---|
| WhyFactory | FR 50-60s, micro-documentaries "why does this everyday thing exist" | 8 episodes published, 18 in backlog |
| Cocorico Histoire | FR 60s, French history and reversals | 21 episodes; one did 3× the rest (the post-mortem) |
| Cocorico POV | FR 45s, historical scenes in first person | Experimental fork |
| Cocorico Quiz | FR 60s, French civic test questions | French naturalization niche |
| Tech-Malin | FR 45s, plain-language tech tips | Tech-savvy audience bet |
| Motivation Motor | EN 30s, motivational quote cuts | Test bench for very short formats |
| Reddit Horror | EN 60s, narrated r/nosleep posts | Volume strategy |
| Aliasify TikTok | EN, persona-driven channel | Incarnation test |
Each channel carries its own visual identity, its own locked editorial doc, its own prioritized backlog. What they share is exclusively the production chain that functions as their infrastructure.
Architecture
JSON config
→ Python orchestrator (scripts/generate.py)
→ Three-tier TTS (Kokoro / F5 / ElevenLabs)
→ Whisper + script-forced-align (captions)
→ Pexels + Wikimedia (B-roll)
→ FFmpeg (sidechain ducking + loudnorm -14 LUFS)
→ HyperFrames render (MP4 1080×1920)
→ Shortflow MCP call from Claude
→ publication on YouTube, TikTok, Instagram, Facebook, Threads, LinkedIn
The orchestration is deliberately trivial. Each step is a CLI invocation — cacheable, re-runnable, inspectable. No magic. A chain of tools that each do one thing, and do it correctly. The loose coupling is what makes the whole readable and debuggable — the first time an audio mix exits at the wrong loudness, you'll know exactly which link failed.
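To make the "chain of CLI invocations" idea concrete, here is a minimal sketch of such a runner. It is not the actual orchestrator: the function name `run_steps` and the step names are hypothetical, and the real chain would call the TTS, caption, mix and render commands instead of placeholders.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple

Step = Tuple[str, Sequence[str]]  # (step name, argv); names are illustrative


def run_steps(steps: List[Step]) -> Optional[str]:
    """Run each step as its own process, in order.

    Returns the name of the first step that exits non-zero,
    or None when the whole chain succeeds.
    """
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop at the first failure so the broken link is obvious.
            print(f"[{name}] failed with code {result.returncode}", file=sys.stderr)
            print(result.stderr, file=sys.stderr)
            return name
        print(f"[{name}] ok")
    return None
```

Because each step is a separate process with its own exit code, a failure points at one named link rather than at the pipeline as a whole.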
Step 1 — The script
A configuration file describes an episode: the narrative segments, the associated B-roll hints, the pronunciation overrides for the voice engine, and the factual sources destined for the public description.
{
  "id": "ep01",
  "slug": "azerty",
  "title": "Why your keyboard is AZERTY",
  "segments": [
    {
      "text": "I just learned a wild thing about the AZERTY keyboard.",
      "broll": "vintage typewriter",
      "duration_hint": 2.5
    },
    {
      "text": "AZERTY has no ergonomic logic. It's the inheritance of 1873 typewriters.",
      "broll": "1873 typewriter wikimedia",
      "duration_hint": 4.0
    }
  ],
  "sources": [
    { "title": "Wikipedia — AZERTY", "url": "https://en.wikipedia.org/wiki/AZERTY" }
  ]
}
The script is the only part I don't fully automate. I write it, often with Claude as a copilot, then a Python script segments it, validates the token alignment between the displayed text and the spoken text (more on that in the captions section), and emits the JSON above.
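A config like the one above is cheap to validate before any audio is generated. The sketch below is one possible loader, not the generator's real code: the `Segment` and `Episode` classes and the specific checks are assumptions, and only the field names come from the JSON shown.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str
    broll: str
    duration_hint: float


@dataclass
class Episode:
    id: str
    slug: str
    title: str
    segments: List[Segment]


def parse_episode(raw: str) -> Episode:
    """Parse an episode config and fail early on missing or empty fields."""
    data = json.loads(raw)
    segments = [
        Segment(
            text=s["text"],
            broll=s["broll"],
            duration_hint=float(s["duration_hint"]),
        )
        for s in data["segments"]
    ]
    if not segments:
        raise ValueError("an episode needs at least one segment")
    for i, seg in enumerate(segments):
        if not seg.text.strip():
            raise ValueError(f"segment {i} has empty text")
        if seg.duration_hint <= 0:
            raise ValueError(f"segment {i} has a non-positive duration_hint")
    return Episode(id=data["id"], slug=data["slug"], title=data["title"], segments=segments)
```

A missing key raises `KeyError` at load time, which is the point: a malformed episode should stop before the first TTS call, not halfway through a render.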
I have tried full LLM-generated scripts. It's technically possible. It is editorially mediocre, and consistently so. What distinguishes AI content from AI content that reads as human isn't the render quality or the caption precision: it's the editorial voice. I devote the final section of this piece to that point, because it is in reality the only non-delegable aspect.
Step 2 — Three-tier voice synthesis
This is the single architectural decision that has saved me money at scale.
ElevenLabs produces the best quality on the market. It also costs money. At my cadence (five to ten episodes per channel per month, multiplied by eight channels, so forty to eighty final renders), routing every script iteration through ElevenLabs would burn through my credits in days. The strategy fits in three engines:
| Tier | Engine | Cadence | Cost | Use |
|---|---|---|---|---|
| v1 | Kokoro ff_siwis (local) | ~1s/segment | Free | Script iteration. Twenty passes per episode is normal. |
| v2 | F5 local | ~3min/segment | Free | Montage validation. Slower, but the prosody actually holds up. |
| prod | ElevenLabs Liam | ~5s/segment | Credits | Publishable render. One pass, exactly one. |
A command-line flag picks the tier:
python3 scripts/generate.py episodes/configs/ep01.json # v1
python3 scripts/generate.py episodes/configs/ep01.json --v2 # v2
python3 scripts/generate.py episodes/configs/ep01.json --prod # ElevenLabs
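For readers wiring up a similar switch, this is roughly how the three invocations above could be parsed with the standard library. It is a sketch under assumptions: the `tier` attribute name and the help strings are made up, and only the `--v2` and `--prod` flags come from the commands shown.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate one episode from its JSON config.")
    parser.add_argument("config", help="path to the episode JSON config")
    # The two flags are mutually exclusive; with neither, the tier stays "v1".
    tier = parser.add_mutually_exclusive_group()
    tier.add_argument("--v2", dest="tier", action="store_const", const="v2",
                      help="slower local engine, for montage validation")
    tier.add_argument("--prod", dest="tier", action="store_const", const="prod",
                      help="paid engine, for the publishable render")
    parser.set_defaults(tier="v1")
    return parser.parse_args(argv)
```

Making the cheapest tier the default means a forgotten flag costs nothing; the paid engine has to be asked for explicitly.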
The cache is indexed on a hash of each segment's content. Modifying a comma in segment three invalidates the cache only for that segment. The other eleven survive intact. That's what makes the iteration economically sustainable: the unit of re-render is the segment, not the episode.
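A per-segment cache key can be sketched in a few lines. The version below is an illustration, not the pipeline's code: the function names are hypothetical, and folding the engine and voice into the hash is an added assumption (the text above only says the key is a hash of the segment's content).

```python
import hashlib
from pathlib import Path
from typing import List


def segment_cache_key(text: str, engine: str, voice: str) -> str:
    """Hash everything that changes the audio: the text, the engine, the voice."""
    payload = "\x1f".join([engine, voice, text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def stale_segments(segments: List[str], engine: str, voice: str, cache_dir: Path) -> List[int]:
    """Return the indexes of segments whose audio is not in the cache yet."""
    return [
        i for i, text in enumerate(segments)
        if not (cache_dir / f"{segment_cache_key(text, engine, voice)}.wav").exists()
    ]
```

Editing one segment changes one key, so `stale_segments` returns a single index and only that segment goes back through the voice engine.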
Real costs on a sixty-second episode:
- v1 (script iteration): twenty cycles, ten minutes of cumulative compute, $0.
- v2 (montage validation): one cycle, around twelve minutes, $0.
- prod (final): one cycle, one minute, around three thousand ElevenLabs characters, $0.16.
A published episode costs sixteen cents in voice synthesis. At eighty episodes per month, the bill stops at thirteen dollars. Compare that with the few hundred dollars an "always on ElevenLabs" strategy would have imposed.
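The arithmetic behind those figures is short enough to write down. In this sketch the $0.16 per pass comes from the list above; the "always on" scenario is an assumption, namely that all twenty-two passes (twenty script iterations, one montage check, one final) would each re-render the full episode on the paid engine with no cache hits.

```python
COST_PER_PROD_PASS_USD = 0.16  # one full-episode paid pass, figure quoted above


def monthly_tts_cost(episodes: int, paid_passes_per_episode: int = 1) -> float:
    """Monthly voice bill in USD for a given number of paid passes per episode."""
    return episodes * paid_passes_per_episode * COST_PER_PROD_PASS_USD


three_tier = monthly_tts_cost(80)  # only the final pass is paid
# Assumption: every one of the 22 passes goes through the paid engine.
always_on = monthly_tts_cost(80, paid_passes_per_episode=22)

print(f"three-tier: ${three_tier:.2f}/month")
print(f"always-on:  ${always_on:.2f}/month")
```

Under that assumption the two lines print $12.80 and $281.60, which is where the gap between "thirteen dollars" and "a few hundred" comes from.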
Step 3 — Captions: Whisper and forced alignment
Whisper-generated captions, delivered raw, are robotic. Whisper mishears certain words, cuts at the wrong place, ignores your original line breaks. The pipeline corrects this by feeding the reference script into the transcription process.
Concretely: Whisper receives the audio and the script. A forced-align step matches every Whisper-transcribed token to a token from the script. The output is a word-level timing file in which the displayed text is rigorously what you wrote, and the chronology comes from the real audio.
LOOKAHEAD = 6        # tokens
MIN_WORD_GAP = 0.08  # seconds
These two parameters make karaoke captions credible. LOOKAHEAD lets the aligner skip past Whisper's parasitic insertions ("uh", hesitations, false starts). MIN_WORD_GAP prevents two consecutive words from overlapping when the voice synthesis runs them together a bit too tightly.
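Here is one way the two parameters could interact, as a simplified sketch rather than the pipeline's aligner. Assumptions: Whisper words arrive as `(text, start, end)` tuples, matching is done on lowercased text with punctuation stripped, unmatched script tokens are left untimed for later interpolation, and the gap is enforced by trimming the earlier word's end.

```python
import re
from typing import List, Optional, Tuple

LOOKAHEAD = 6        # how many Whisper words to scan for each script token
MIN_WORD_GAP = 0.08  # seconds kept between one word's end and the next word's start

Word = Tuple[str, float, float]  # (text, start, end) as transcribed
Timed = Tuple[str, Optional[float], Optional[float]]


def _norm(token: str) -> str:
    """Lowercase and drop punctuation so "AZERTY." matches "azerty"."""
    return re.sub(r"[^\w]", "", token.lower())


def enforce_gap(words: List[Timed]) -> List[Timed]:
    """Trim a word's end when it runs into the next word's start."""
    out = list(words)
    for i in range(len(out) - 1):
        text, start, end = out[i]
        next_start = out[i + 1][1]
        if start is None or end is None or next_start is None:
            continue
        latest_end = max(start, next_start - MIN_WORD_GAP)
        if end > latest_end:
            out[i] = (text, start, latest_end)
    return out


def force_align(script_tokens: List[str], whisper_words: List[Word]) -> List[Timed]:
    """Keep the script's text, borrow the timing from the transcription."""
    aligned: List[Timed] = []
    cursor = 0
    for token in script_tokens:
        match = None
        # Scan a small window so inserted fillers ("uh") are skipped, not matched.
        for j in range(cursor, min(cursor + LOOKAHEAD, len(whisper_words))):
            if _norm(whisper_words[j][0]) == _norm(token):
                match = j
                break
        if match is None:
            aligned.append((token, None, None))  # left for interpolation
            continue
        _, start, end = whisper_words[match]
        aligned.append((token, start, end))
        cursor = match + 1
    return enforce_gap(aligned)
```

The displayed text always comes from `script_tokens`; the transcription only contributes timestamps, which is what keeps a misheard word out of the captions.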
A constraint that can bite: the displayed text and the spoken text must carry the same number of whitespace-separated tokens. If you display "3,000" and want "three thousand" spoken, that's one token against two, and alignment breaks. Prefer "3000" displayed and "three-thousand" spoken: one against one. Or "3 000" displayed and "three thousand" spoken: two against two. The generator validates this constraint before emitting the JSON; a mismatch produces a render warning and a uniformly distributed timing, which is not wrong but imprecise.
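Both halves of that rule, the check and the fallback, fit in a short sketch. The function names are hypothetical and the generator's real validation may differ; the only behavior taken from the text is whitespace tokenization and an even spread of timing on mismatch.

```python
from typing import List, Tuple


def token_counts_match(displayed: str, spoken: str) -> bool:
    """Both strings must split into the same number of whitespace-separated tokens."""
    return len(displayed.split()) == len(spoken.split())


def uniform_timing(displayed: str, start: float, end: float) -> List[Tuple[str, float, float]]:
    """Fallback when alignment is impossible: give every token an equal slice."""
    tokens = displayed.split()
    if not tokens:
        return []
    step = (end - start) / len(tokens)
    return [(tok, start + i * step, start + (i + 1) * step) for i, tok in enumerate(tokens)]
```

`token_counts_match("3,000", "three thousand")` is false, while the "3000" / "three-thousand" and "3 000" / "three thousand" pairs both pass.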
This is exactly the type of invisible detail that eats two days when you discover it alone. You owe me two days.
Step 4 — B-roll: Pexels first, Wikimedia second
Each segment carries a B-roll hint, as a short search string. The pipeline queries the Pexels API for vertical videos; failing that, it falls back on Wikimedia Commons images.
Three rules pulled from previous mistakes:
- Pexels ID blocklist. Some theoretically free videos carry watermarks, or are mis-tagged as portrait. I maintain a blocklist of IDs that always end up unsightly. Thirty minutes saved per episode.
- Automatic Ken Burns on still images. A motionless image breaks the engagement curve. A light pan combined with a progressive zoom maintains perceived motion and preserves attention.
- Bottom vignette overlay. A gradient at 85% opacity at the bottom of every frame guarantees caption legibility regardless of B-roll colorimetry. Without it, each clip would need manual color correction.
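The selection logic described above (vertical video first, blocklist applied, still image as fallback) can be sketched independently of the two APIs. Everything here is illustrative: `pick_broll`, the candidate dictionary shape and the injected search callables are assumptions, and the real searches would wrap the Pexels video search and the Wikimedia Commons API.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Set

Candidate = Dict[str, Any]  # assumed shape: {"id": ..., "width": ..., "height": ...}
Search = Callable[[str], Iterable[Candidate]]


def is_portrait(candidate: Candidate) -> bool:
    """Trust the pixel dimensions rather than the orientation tag."""
    return int(candidate["height"]) > int(candidate["width"])


def pick_broll(hint: str, search_videos: Search, search_images: Search,
               blocklist: Set[int]) -> Optional[Candidate]:
    """Return the first usable vertical video, else the first still image."""
    for video in search_videos(hint):
        if video["id"] in blocklist:
            continue  # known watermark or otherwise unusable clip
        if not is_portrait(video):
            continue  # returned as portrait but actually landscape
        return {**video, "kind": "video"}
    for image in search_images(hint):
        return {**image, "kind": "image"}  # stills get the Ken Burns pass later
    return None
```

Passing the searches in as callables keeps the rule testable offline and makes it easy to swap either source without touching the fallback order.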
Details devoid of glamour, sure. But this is precisely what separates "looks like AI video" from "looks like a human-edited cut."
Step 5 — FFmpeg audio mix
Two chained operations:
# 1. Silent padding to line the voiceover up on the video timeline
ffmpeg -i vo.wav -af "apad=pad_dur=2" vo-padded.wav
# 2. Mix voice + music with sidechain ducking and LUFS normalization
ffmpeg -i vo-padded.wav -i music.wav \
-filter_complex "[1:a][0:a]sidechaincompress=threshold=0.05:ratio=8[ducked]; \
[0:a][ducked]amix=inputs=2:duration=longest, \
loudnorm=I=-14:LRA=11:TP=-1.0" \
-c:a aac -b:a 192k final.aac
The -14 LUFS target matches the integrated loudness level commonly cited for TikTok and YouTube Shorts. Above it, the platform automatically attenuates the signal; below it, the video sounds flat next to the rest of the feed. Hitting the target at mix time keeps your published videos clear of the "this one is louder than the others" problem.
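Since a single loudnorm pass adjusts gain dynamically and can land slightly off target, a quick measurement of the finished file is a cheap safety net. The sketch below runs an analysis-only pass with `print_format=json` and reads the `input_i` field of the report; the one-LUFS tolerance and the function names are assumptions, not part of the pipeline described here.

```python
import json
import re
import subprocess

TARGET_LUFS = -14.0


def parse_loudnorm_report(stderr_text: str) -> dict:
    """Extract the JSON block that loudnorm prints when print_format=json is set."""
    blocks = re.findall(r"\{[^{}]*\}", stderr_text)
    if not blocks:
        raise ValueError("no loudnorm JSON block found in ffmpeg output")
    return json.loads(blocks[-1])


def measure_integrated_loudness(path: str) -> float:
    """Analyze a file without writing output; return its integrated loudness in LUFS."""
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats", "-i", path,
        "-af", "loudnorm=I=-14:LRA=11:TP=-1.0:print_format=json",
        "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return float(parse_loudnorm_report(result.stderr)["input_i"])


def within_target(measured: float, tolerance: float = 1.0) -> bool:
    """True when the measured loudness sits within the tolerance of the target."""
    return abs(measured - TARGET_LUFS) <= tolerance
```

Calling `within_target(measure_integrated_loudness("final.aac"))` after the mix turns the loudness target into a pass/fail check instead of something you notice in the feed.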
Step 6 — HyperFrames render
HyperFrames is the framework I use to turn an HTML/CSS composition, animated with GSAP, into an MP4. It's not a traditional video editor: it's a deterministic render engine for HTML compositions, in the spirit of Remotion but with a stricter contract.
Three properties that make it a fit for this production chain:
- Compositions are HTML files. You write them in any IDE, you version them under Git, you structure them the way you'd structure a web page.
- Animations express themselves as GSAP timelines. The framework guarantees pause and seek, which lets the engine inspect and capture frames at any sampling rate.
- Determinism is non-negotiable: no Math.random(), no Date.now(), no setTimeout. Two renders of the same file produce the same MP4, byte for byte.
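The byte-for-byte claim is easy to check from outside the render engine. This sketch compares two output files by SHA-256 digest; it assumes nothing about HyperFrames itself, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large MP4s stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def renders_identical(first: Path, second: Path) -> bool:
    """True when two render outputs are the same, byte for byte."""
    return file_digest(first) == file_digest(second)
```

Rendering the same composition twice and running `renders_identical` on the two files is a simple regression test for any stray source of nondeterminism.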
The render fits in one command:
npx hyperframes render
The output is a renders/<id>-<slug>.mp4 file. The complete path, from JSON to MP4, takes three to five minutes for a sixty-second episode, depending on B-roll composition.
Step 7 — Publishing to six platforms via Shortflow MCP
This is the step where Shortflow enters the chain — yes, the product this blog is an extension of.
The render sits on disk. From Claude Code, one MCP call publishes it to YouTube, TikTok, Instagram, Facebook Pages, Threads and LinkedIn:
> Publish renders/ep01-azerty.mp4 to all my connected platforms
with the title "Why your keyboard is AZERTY" and schedule it
for tomorrow at 9am.
⏵ Calling create_publication...
✓ Uploaded to library (12.4 MiB, 60.2s)
✓ Scheduled on 6 targets: YouTube, TikTok, Instagram, Facebook, Threads, LinkedIn
Before Shortflow, the publishing step was the most painful part of the pipeline. Six tabs to open, the file to drag six times, the caption to compose six times, the thumbnail to choose six times. Around twenty minutes per episode. At eighty monthly episodes, that came to twenty-six hours of repetitive clicks per month — a part-time job consisting exclusively of pushing the same file into different interfaces.
With Shortflow's MCP server, the publishing call is part of the same Claude conversation in which I edited the script. Total publishing time per episode now sits at around fifteen seconds, the bulk of which is the file upload to our servers.
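The before-and-after publishing times are simple to work out from the figures in this section. The sketch only restates them: twenty minutes per episode by hand, about fifteen seconds per episode through the MCP call, eighty episodes a month.

```python
def monthly_hours(episodes: int, seconds_per_episode: float) -> float:
    """Total monthly time in hours for a per-episode duration in seconds."""
    return episodes * seconds_per_episode / 3600


manual = monthly_hours(80, 20 * 60)  # six tabs, six uploads, six captions
via_mcp = monthly_hours(80, 15)      # one call from the same conversation

print(f"manual:  {manual:.1f} hours/month")
print(f"via MCP: {via_mcp * 60:.0f} minutes/month")
```

That is roughly twenty-seven hours a month by hand against about twenty minutes through the single call.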
The real subject: editorial voice
I lied a little in the introduction. The technical layer of this pipeline isn't hard. It's laborious, occasionally messy, but it settles in two weeks of rigorous tinkering. What is truly hard — what determines whether your AI content reads as AI or not — isn't in the render: it's in the voice.
Here's what I learned writing for WhyFactory: the narrator isn't a presenter. They're a friend who just discovered something and feels an urgent need to relay it to you. That posture determines everything downstream. It isn't "let me explain X to you", it's "hold on, listen to this." The urgency comes from real excitement, not from staging — and the viewer can sense the difference in milliseconds of cadence.
A handful of concrete rules, distilled from thirty-plus scripts:
- The narrator discovers with the viewer, never in front of them. Open in first person and signal the recent discovery: "I just learned something, I can't get over it." The narrator is learning what the viewer is learning, at the same pace. Zero asymmetry.
- The narrator has opinions. A neutral voice is a forgettable voice. "My favorite is this one." "I have to say I share that view." Partiality is what carves out a presence.
- Facts are plot twists, not information. The "except" conjunction inverts every revelation. Every fact betrays what we believed a second ago.
- Anticipate the viewer. "For those of you thinking this is AI, I get it." "I know what you're thinking as you read this." The technique collapses the distance between the voice and the person watching.
- Weave the CTA into the narrative, never tack it on. "By the way, we're approaching a hundred thousand", slipped between two segments, works. "Subscribe to support the content", parked in the outro, kills the rhythm. The distinction is small; the effects aren't.
WhyFactory's editorial doc holds fifteen hundred words of rules like these. It's the set that makes a fifty-second video about Pringles tubes read like a story told at a bar at eleven p.m., rather than a TED Talk script.
What I'd do differently
Three things, if I were restarting today:
First, write the editorial doc before the technical stack. Every channel I shipped that underperformed is one where I rushed into production before formalizing the voice. The technique settles in a week of focused work. The voice demands twenty to thirty scripts of iteration before it truly holds.
Second, integrate thumbnail generation into the HyperFrames composition. Today I generate them in a separate FFmpeg step, which occasionally produces aesthetic mismatches between the thumbnail and the video. Better to bake them into the composition, guaranteeing they inherit the same graphic grid automatically.
Third, instrument multi-platform analytics from day one. I built Shortflow in part because I reached a point where I had eight channels and no consolidated view of performance. Don't repeat that mistake.
The numbers, end to end
Per episode, all in:
- Human time: around ninety minutes — forty minutes of writing, twenty minutes of configuration and iteration, ten minutes of B-roll selection, twenty minutes of review and publishing.
- Direct cost: around $0.16 in voice synthesis, the rest at zero.
- Output: a sixty-second vertical MP4, scheduled across six platforms.
At eighty monthly episodes, the math settles at a hundred and twenty hours of human work, thirteen dollars of TTS, four hundred and eighty multi-platform publish events. Without Shortflow, add the roughly twenty-six hours of manual operations inside platform interfaces that twenty minutes of publishing per episode adds up to.
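For anyone adapting these numbers to a different cadence, the human-time and publish-event totals reduce to a few multiplications. The per-step minutes below are the ones listed above; the function name and dictionary layout are only for illustration.

```python
EPISODES_PER_MONTH = 80
PLATFORMS = 6
MINUTES_PER_EPISODE = {
    "writing": 40,
    "configuration and iteration": 20,
    "B-roll selection": 10,
    "review and publishing": 20,
}


def monthly_totals(episodes: int = EPISODES_PER_MONTH) -> dict:
    """Scale the per-episode figures to a monthly volume."""
    minutes = sum(MINUTES_PER_EPISODE.values())
    return {
        "minutes_per_episode": minutes,
        "human_hours": episodes * minutes / 60,
        "publish_events": episodes * PLATFORMS,
    }


print(monthly_totals())
```

At eighty episodes this gives ninety minutes per episode, 120 human hours and 480 publish events; halving the cadence halves the last two.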
Where to start
If you're starting next week, the highest-leverage move isn't in the pipeline. It's in the editorial document.
Pick one channel concept — one niche, one voice, one format. Write the editorial doc: positioning, voice rules, red flags. Aim for a thousand words. Then write five scripts by hand, with that document as the reference. Then start automating.
Channels rarely die because their render engine is too slow. They die because their voice is flat. Get the voice right first; the pipeline is a solvable engineering problem, and there are about a dozen viable stacks — HyperFrames is my pick, Remotion is a legitimate alternative, Revid.ai and Captions.ai work for different use cases.
If you want the publishing layer of your stack — the part that ships your render to six platforms from a single MCP call — handled for you, that's exactly what Shortflow does. Connecting a channel from the dashboard takes two minutes; publishing via Claude is operational ten minutes later.
A question about the pipeline? Contact details are in the legal notice. I read every message, and I reply.