Lesson 3 of 5
Auto-Transcription & Clips
Estimated time: 10 minutes
Your pipeline can transcribe and extract clips automatically. In this lesson, you'll fine-tune transcription quality, generate YouTube-ready chapters, and configure intelligent clip detection that finds the best moments in your content.
Building on the pipeline
This lesson assumes you've set up the content factory pipeline from the previous lesson. We'll be configuring the transcriber and media-processor skills in more detail.
Transcription Deep Dive
Configure transcription quality
The transcriber skill has several options that affect output quality:
Key settings:
| Setting | Options | Recommendation |
|---|---|---|
language | auto, en, es, etc. | Use auto for multilingual content |
speaker-detection | on, off | Always on for podcasts/interviews |
max-speakers | 1-10 | Set to your typical guest count + host |
timestamps | segment, word-level | word-level for accurate clip cutting |
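Putting those recommendations together, a podcast setup might look like this. The key names mirror the table above; the surrounding JSON layout and where the file lives are assumptions about how your pipeline stores skill config, so adapt to your setup:

```json
{
  "transcriber": {
    "language": "auto",
    "speaker-detection": "on",
    "max-speakers": 3,
    "timestamps": "word-level"
  }
}
```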
Test transcription on a real episode
Process a full episode and review the output:
cp ~/Downloads/episode-42.mp3 ~/content-factory/inbox/

Once complete, check the transcript:
cat ~/content-factory/processed/episode-42/transcript.txt | head -30

Example output with speaker detection:
[00:00:00] HOST: Welcome back to the show. Today we're
talking about building AI automations with OpenClaw.
[00:00:08] HOST: My guest is Sarah Chen, who's been
building chat-based tools for the last three years.
[00:00:15] SARAH: Thanks for having me. I'm excited
to talk about this because I think most people
underestimate how powerful chat interfaces can be.
[00:00:24] HOST: Let's start with the basics. What
made you switch from traditional web apps to
chat-first tools?
[00:00:31] SARAH: It was honestly an accident. I built
a Slack bot for our internal team and people started
using it more than the actual web dashboard...
Generate YouTube chapters
The transcriber auto-generates chapter markers based on topic shifts:
cat ~/content-factory/processed/episode-42/chapters.json

{
"chapters": [
{ "time": "00:00:00", "title": "Introduction & Guest Intro" },
{ "time": "00:02:45", "title": "Why Chat-First Tools" },
{ "time": "00:08:12", "title": "Building Your First Bot" },
{ "time": "00:18:30", "title": "Scaling to Production" },
{ "time": "00:31:15", "title": "Common Mistakes to Avoid" },
{ "time": "00:42:00", "title": "The Future of AI Agents" },
{ "time": "00:53:20", "title": "Rapid Fire Q&A" }
]
}
Copy-paste these directly into your YouTube description for automatic chapter markers.
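If you'd rather not copy fields by hand, a few lines of Python can flatten chapters.json into the `HH:MM:SS Title` lines YouTube expects in a description. The `chapters_to_description` helper here is illustrative, not part of the pipeline:

```python
import json

def chapters_to_description(chapters_json: str) -> str:
    """Convert a chapters.json payload into YouTube description lines.

    YouTube recognizes chapters when the description contains
    "HH:MM:SS Title" lines, with the first chapter at 00:00:00.
    """
    data = json.loads(chapters_json)
    return "\n".join(f"{c['time']} {c['title']}" for c in data["chapters"])

sample = json.dumps({"chapters": [
    {"time": "00:00:00", "title": "Introduction & Guest Intro"},
    {"time": "00:02:45", "title": "Why Chat-First Tools"},
]})
print(chapters_to_description(sample))
```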
Intelligent Clip Detection
The real magic is finding clip-worthy moments automatically. OpenClaw analyzes the transcript for high-energy segments, quotable statements, and topic-complete sections.
Configure clip detection
| Setting | Description |
|---|---|
clip-min-duration | Minimum clip length in seconds |
clip-max-duration | Maximum clip length in seconds |
clip-count | How many clips to extract |
clip-criteria | What makes a good clip |
clip-format | Output aspect ratios |
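As a sketch, a clip-detection config using those settings might look like the following. The key names come from the table; the values and JSON layout are illustrative assumptions, not required defaults:

```json
{
  "media-processor": {
    "clip-min-duration": 30,
    "clip-max-duration": 90,
    "clip-count": 3,
    "clip-criteria": "quotable, actionable, high-energy",
    "clip-format": ["9:16", "16:9"]
  }
}
```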
Review suggested clips
After processing, check the clips report in chat or in the output folder:
🎬 Clip Suggestions for episode-42
Clip 1 — "The Aha Moment" (0:31-1:12)
Score: 94/100 | Type: Quotable + Surprising
"I built a Slack bot for our internal team and
people started using it more than the actual web
dashboard. That's when I knew chat-first was the future."
📎 clip-01-vertical.mp4 | clip-01-horizontal.mp4
Clip 2 — "The 3-Minute Rule" (18:42-19:55)
Score: 89/100 | Type: Actionable
"If your automation takes more than 3 minutes to
set up, you've over-engineered it. Start with the
simplest version that works."
📎 clip-02-vertical.mp4 | clip-02-horizontal.mp4
Clip 3 — "AI Agents vs Chatbots" (42:15-43:28)
Score: 87/100 | Type: High-energy
"An AI agent isn't just a chatbot with better prompts.
It's the difference between asking for directions and
having a driver."
📎 clip-03-vertical.mp4 | clip-03-horizontal.mp4
Approve or adjust clips
From chat, you can refine clips:
Clip 1 looks great, approve it.
Clip 2 — extend to 19:30-20:15 to include the example.
Clip 3 — skip this one, regenerate a different clip.
OpenClaw re-processes only the changed clips:
✅ Clip 1 approved (no changes)
✅ Clip 2 re-cut: 19:30-20:15 (45s)
🔄 Clip 3 regenerating... found alternative at 53:40-54:52
"The biggest mistake is thinking you need to automate
everything at once. Pick one workflow, nail it, then expand."
Score: 85/100 | Type: Actionable
For vertical Shorts (9:16 aspect ratio), the media processor automatically:
- Crops to center on the active speaker
- Adds captions burned into the video
- Applies your brand colors to the caption style
You can also customize the burned-in caption style to match your brand.
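A caption-style config might look like the following. Every key here is a hypothetical example of the kind of options a caption renderer exposes, not a documented schema, so check your media-processor settings for the real names:

```json
{
  "captions": {
    "font": "Inter Bold",
    "font-size": 64,
    "primary-color": "#FFFFFF",
    "highlight-color": "#FFD166",
    "position": "lower-third"
  }
}
```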
For specialized content (technical jargon, brand names, etc.), add a custom vocabulary.
This helps the transcriber correctly spell domain-specific terms instead of guessing.
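A custom vocabulary is typically just a list of terms the transcriber should prefer. The key name and layout below are assumptions about your config format; the terms are examples from this lesson:

```json
{
  "transcriber": {
    "custom-vocabulary": ["OpenClaw", "Whisper", "chat-first", "SRT"]
  }
}
```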
The pipeline automatically generates SRT files. Upload them to YouTube for accurate closed captions:
1
00:00:00,000 --> 00:00:08,200
Welcome back to the show. Today we're
talking about building AI automations.
2
00:00:08,200 --> 00:00:15,400
My guest is Sarah Chen, who's been building
chat-based tools for the last three years.
YouTube's auto-captions are often inaccurate. Your Whisper-generated SRT will be significantly better.
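Before uploading, it can be worth sanity-checking a generated SRT file. This is a minimal sketch (both helper functions are illustrative, not pipeline features) that parses cue timestamps and verifies they're well ordered:

```python
import re

def srt_timestamps(srt_text: str):
    """Extract (start, end) pairs, in milliseconds, from SRT cue lines."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
    )
    pairs = []
    for m in pattern.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
        end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
        pairs.append((start, end))
    return pairs

def srt_is_well_formed(srt_text: str) -> bool:
    """True if every cue ends after it starts and cues never overlap."""
    pairs = srt_timestamps(srt_text)
    return all(s < e for s, e in pairs) and all(
        pairs[i][1] <= pairs[i + 1][0] for i in range(len(pairs) - 1)
    )
```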
Checkpoint
What's the advantage of word-level timestamps over segment-level?
You should now have:
- Transcription configured with speaker detection and word-level timestamps
- YouTube chapters auto-generated from topic analysis
- Clip detection running with configurable criteria and formats
- A review workflow for approving or adjusting suggested clips
Next: turning your transcript into blog posts, social content, and newsletters.