YouTube/Podcast Factory

Lesson 3 of 5

Auto-Transcription & Clips

Estimated time: 10 minutes

Your pipeline can transcribe and extract clips automatically. In this lesson, you'll fine-tune transcription quality, generate YouTube-ready chapters, and configure intelligent clip detection that finds the best moments in your content.

Building on the pipeline

This lesson assumes you've set up the content factory pipeline from the previous lesson. We'll be configuring the transcriber and media-processor skills in more detail.

Transcription Deep Dive

Configure transcription quality

The transcriber skill has several options that affect output quality:

Key settings:

Setting            Options               Recommendation
language           auto, en, es, etc.    Use auto for multilingual content
speaker-detection  on, off               Always on for podcasts/interviews
max-speakers       1-10                  Set to your typical guest count + host
timestamps         segment, word-level   word-level for accurate clip cutting
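The exact configuration syntax depends on how your OpenClaw setup stores skill settings; the file name and JSON structure below are purely illustrative, but the recommended values from the table might be sketched as:

```json
{
  "transcriber": {
    "language": "auto",
    "speaker-detection": "on",
    "max-speakers": 3,
    "timestamps": "word-level"
  }
}
```

Here max-speakers is set to 3 on the assumption of one host plus up to two guests; adjust it to match your show.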

Test transcription on a real episode

Process a full episode and review the output:

Process episode
cp ~/Downloads/episode-42.mp3 ~/content-factory/inbox/

Once complete, check the transcript:

Preview transcript
head -30 ~/content-factory/processed/episode-42/transcript.txt

Example output with speaker detection:

[00:00:00] HOST: Welcome back to the show. Today we're
talking about building AI automations with OpenClaw.

[00:00:08] HOST: My guest is Sarah Chen, who's been
building chat-based tools for the last three years.

[00:00:15] SARAH: Thanks for having me. I'm excited
to talk about this because I think most people
underestimate how powerful chat interfaces can be.

[00:00:24] HOST: Let's start with the basics. What
made you switch from traditional web apps to
chat-first tools?

[00:00:31] SARAH: It was honestly an accident. I built
a Slack bot for our internal team and people started
using it more than the actual web dashboard...

Generate YouTube chapters

The transcriber auto-generates chapter markers based on topic shifts:

View chapters
cat ~/content-factory/processed/episode-42/chapters.json
{
  "chapters": [
    { "time": "00:00:00", "title": "Introduction & Guest Intro" },
    { "time": "00:02:45", "title": "Why Chat-First Tools" },
    { "time": "00:08:12", "title": "Building Your First Bot" },
    { "time": "00:18:30", "title": "Scaling to Production" },
    { "time": "00:31:15", "title": "Common Mistakes to Avoid" },
    { "time": "00:42:00", "title": "The Future of AI Agents" },
    { "time": "00:53:20", "title": "Rapid Fire Q&A" }
  ]
}

Paste the timestamps and titles into your YouTube description, one "00:00:00 Title" line per chapter starting at 00:00:00, and YouTube will render them as chapter markers automatically.
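If you'd rather not flatten the JSON by hand, a one-liner can do it. Assuming you have jq installed (it is not part of the pipeline itself), this converts chapters.json into the timestamp-plus-title lines YouTube expects:

```shell
# Flatten chapters.json into "timestamp title" lines for the
# YouTube description (requires jq)
jq -r '.chapters[] | "\(.time) \(.title)"' \
  ~/content-factory/processed/episode-42/chapters.json
# → 00:00:00 Introduction & Guest Intro
#   00:02:45 Why Chat-First Tools
#   ...
```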

Intelligent Clip Detection

The real magic is finding clip-worthy moments automatically. OpenClaw analyzes the transcript for high-energy segments, quotable statements, and topic-complete sections.

Configure clip detection

Setting            Description
clip-min-duration  Minimum clip length in seconds
clip-max-duration  Maximum clip length in seconds
clip-count         How many clips to extract
clip-criteria      What makes a good clip
clip-format        Output aspect ratios
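As with transcription, the concrete syntax is whatever your setup uses; the JSON below is a hypothetical sketch, with durations and counts chosen to match the example clips in this lesson and criteria names taken from the clip report's "Type" labels:

```json
{
  "media-processor": {
    "clip-min-duration": 30,
    "clip-max-duration": 90,
    "clip-count": 3,
    "clip-criteria": ["quotable", "actionable", "high-energy"],
    "clip-format": ["9:16", "16:9"]
  }
}
```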

Review suggested clips

After processing, check the clips report in chat or in the output folder:

🎬 Clip Suggestions for episode-42

Clip 1 — "The Aha Moment" (0:31-1:12)
  Score: 94/100 | Type: Quotable + Surprising
  "I built a Slack bot for our internal team and
  people started using it more than the actual web
  dashboard. That's when I knew chat-first was the future."
  📎 clip-01-vertical.mp4 | clip-01-horizontal.mp4

Clip 2 — "The 3-Minute Rule" (18:42-19:55)
  Score: 89/100 | Type: Actionable
  "If your automation takes more than 3 minutes to
  set up, you've over-engineered it. Start with the
  simplest version that works."
  📎 clip-02-vertical.mp4 | clip-02-horizontal.mp4

Clip 3 — "AI Agents vs Chatbots" (42:15-43:28)
  Score: 87/100 | Type: High-energy
  "An AI agent isn't just a chatbot with better prompts.
  It's the difference between asking for directions and
  having a driver."
  📎 clip-03-vertical.mp4 | clip-03-horizontal.mp4

Approve or adjust clips

From chat, you can refine clips:

Clip 1 looks great, approve it.
Clip 2 — extend to 19:30-20:15 to include the example.
Clip 3 — skip this one, regenerate a different clip.

OpenClaw re-processes only the changed clips:

✅ Clip 1 approved (no changes)
✅ Clip 2 re-cut: 19:30-20:15 (45s)
🔄 Clip 3 regenerating... found alternative at 53:40-54:52
   "The biggest mistake is thinking you need to automate
   everything at once. Pick one workflow, nail it, then expand."
   Score: 85/100 | Type: Actionable

For vertical Shorts (9:16 aspect ratio), the media processor automatically:

  • Crops to center on the active speaker
  • Adds captions burned into the video
  • Applies your brand colors to the caption style

You can also customize the caption style to match your brand.
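The available caption options depend on your OpenClaw version; every key and value below is illustrative rather than documented, but a brand-matched style might be sketched as:

```json
{
  "media-processor": {
    "caption-style": {
      "font": "Inter Bold",
      "text-color": "#FFFFFF",
      "highlight-color": "#FF5A36",
      "position": "lower-third"
    }
  }
}
```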

For specialized content (technical jargon, brand names, etc.), add a custom vocabulary.

This helps the transcriber correctly spell domain-specific terms instead of guessing.
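The configuration key and format here are assumptions for illustration only, but a vocabulary list could look something like this, using terms from this episode:

```json
{
  "transcriber": {
    "custom-vocabulary": ["OpenClaw", "Sarah Chen", "chat-first", "SRT"]
  }
}
```

Without entries like these, a speech-to-text model will often guess phonetically similar spellings (for example "open claw" or "Sara Chen").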

The pipeline automatically generates SRT files. Upload them to YouTube for accurate closed captions:

1
00:00:00,000 --> 00:00:08,200
Welcome back to the show. Today we're
talking about building AI automations.

2
00:00:08,200 --> 00:00:15,400
My guest is Sarah Chen, who's been building
chat-based tools for the last three years.

YouTube's auto-captions are often inaccurate. Your Whisper-generated SRT will be significantly better.

Checkpoint

Knowledge Check

What's the advantage of word-level timestamps over segment-level?

You should now have:

  • Transcription configured with speaker detection and word-level timestamps
  • YouTube chapters auto-generated from topic analysis
  • Clip detection running with configurable criteria and formats
  • A review workflow for approving or adjusting suggested clips

Next: turning your transcript into blog posts, social content, and newsletters.