YouTube/Podcast Factory

Lesson 2 of 5

Setting Up the Upload Pipeline

Estimated time: 8 minutes

In this lesson, you'll install the content factory skills and configure a watched folder that automatically triggers processing when you drop in a new audio or video file.

Prerequisites

Make sure your Gateway is running and that you have either an OpenAI API key (for Whisper transcription) or a local Whisper installation.

Installing the Skills

The content factory uses three ClawHub skills working together. Install all of them:

Install the media processing skill

Install media skill
clawhub install media-processor

This skill handles audio/video file detection, format conversion, and ffmpeg operations.

Install the transcription skill

Install transcriber
clawhub install transcriber

Configure it with your preferred transcription backend. The quickest path is the OpenAI Whisper API:

Configure Whisper API
openclaw skills config transcriber --provider openai --model whisper-1

This is the fastest option: it uses your existing OpenAI API key and costs about $0.006 per minute of audio. A local Whisper installation (see Prerequisites) avoids the per-minute cost at the expense of speed.
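To budget for a backlog, the per-minute rate makes cost estimation a one-liner (a quick sketch; set minutes to your episode length):

```shell
# Estimate Whisper API transcription cost at $0.006 per audio minute.
minutes=45   # length of your episode, in minutes
cost=$(awk "BEGIN { printf \"%.2f\", $minutes * 0.006 }")
echo "~\$$cost to transcribe a ${minutes}-minute episode"
```

At that rate, even a 10-hour backlog (600 minutes) comes to about $3.60.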

Install the content repurposer skill

Install repurposer
clawhub install content-repurposer

This is the AI-powered skill that turns transcripts into blog posts, quotes, and social content.

Verify all skills are installed

List installed skills
clawhub list

You should see all three:

Installed skills:
  media-processor     v1.8.2   Audio/video processing & ffmpeg
  transcriber         v2.1.0   Speech-to-text transcription
  content-repurposer  v1.5.3   AI content generation from transcripts

Configuring the Watch Folder

Create the content folders

Create inbox and output folders
mkdir -p ~/content-factory/inbox ~/content-factory/processed

The inbox is where you drop new recordings; processed/ is where the pipeline writes its output.

Configure the watch folder

Tell the media processor to monitor your inbox:

Configure watch folder
openclaw skills config media-processor \
  --watch ~/content-factory/inbox \
  --output ~/content-factory/processed \
  --formats "mp4,mp3,wav,mov,m4a"
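Under the hood, the --formats list is just an extension allowlist. A rough shell equivalent of that filter (the skill's exact matching rules are an assumption here; this sketch matches case-insensitively on the file extension):

```shell
# Rough equivalent of the watch-folder extension filter.
# Assumption: matching is on the case-insensitive file extension.
formats="mp4 mp3 wav mov m4a"
matches() {
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case " $formats " in
    *" $ext "*) return 0 ;;
    *)          return 1 ;;
  esac
}
matches episode-42.MP3 && echo "would process"   # → would process
matches notes.txt      || echo "would skip"      # → would skip
```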

Create the processing pipeline

Now wire the skills together into a pipeline. Create a pipeline config:

Create pipeline

The pipeline works like this: when a new file lands in the inbox, it is transcribed, content assets are generated from the transcript, and the review channel is notified.
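The exact pipeline config syntax depends on your OpenClaw version, so treat the following as a hypothetical sketch of the flow described above (every key name here is an assumption, not the real schema; consult your OpenClaw docs):

```yaml
# Hypothetical sketch only: key names are assumptions, not the real schema.
name: content-factory
trigger:
  skill: media-processor        # fires when a new file lands in the inbox
  event: new-file
steps:
  - skill: transcriber          # step 1: speech-to-text
  - skill: content-repurposer   # step 2: blog post, quotes, clips
notify:
  channel: review               # ping the review channel when done
```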

Test with a sample file

Drop a short audio file (even a 1-minute voice memo) into the inbox:

Add test file
cp ~/Downloads/test-episode.mp3 ~/content-factory/inbox/

Watch the pipeline process it:

Check pipeline status
openclaw pipeline status content-factory
Pipeline: content-factory
Status: Processing
Current step: transcriber (2/3)
File: test-episode.mp3
Duration: 1:23
Started: 30 seconds ago

When complete, you'll see a notification in your chat:

🎬 Content Factory — Processing Complete

File: test-episode.mp3 (1:23)
Transcript: ✅ 247 words
Blog draft: ✅ Generated
Quotes: ✅ 3 extracted
Clips: ✅ 1 suggested (0:15-0:42)

Review the output: ~/content-factory/processed/test-episode/

If you upload content to a cloud service (Google Drive, Dropbox, S3), you can use a webhook instead:

Webhook pipeline

Then configure your cloud storage to send a webhook to http://localhost:18789/webhooks/content-upload when new files are added.
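The payload your storage provider sends varies by service; a hypothetical notification body might look like the following (field names are assumptions, so map your provider's actual fields in the webhook config):

```json
{
  "event": "file.created",
  "name": "episode-42.mp3",
  "url": "https://example.com/files/episode-42.mp3",
  "size": 48211233
}
```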

After processing, your output folder looks like this:

~/content-factory/processed/episode-42/
  ├── transcript.txt           # Plain text transcript
  ├── transcript.srt           # SRT subtitles with timestamps
  ├── chapters.json            # Timestamped chapter markers
  ├── blog-post.md             # Full blog post draft
  ├── quotes.json              # Extracted quote snippets
  ├── clips/
  │   ├── clip-01.mp4          # Short clip 1
  │   ├── clip-02.mp4          # Short clip 2
  │   └── clip-03.mp4          # Short clip 3
  ├── newsletter.md            # Newsletter draft
  └── show-notes.md            # Podcast show notes
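For example, you might skim the extracted quotes when drafting social posts. The quotes.json field names below are assumptions about the repurposer's output (adjust to what you actually see); the sketch mocks a file in that shape and prints one line per quote:

```shell
# Mock a quotes.json in the assumed shape (array of text/start/end objects).
cat > /tmp/quotes.json <<'EOF'
[
  {"text": "Ship the draft, polish later.", "start": "0:15", "end": "0:22"},
  {"text": "Consistency beats intensity.",  "start": "0:48", "end": "0:55"}
]
EOF
# Print each quote with its timestamp range.
python3 - <<'EOF'
import json
for q in json.load(open("/tmp/quotes.json")):
    print(f'{q["start"]}-{q["end"]}  "{q["text"]}"')
EOF
```

The timestamp ranges line up with the clip suggestions, so a quote can double as the caption for its clip.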

All files are drafts for your review — nothing is published automatically.

Checkpoint

Knowledge Check

What triggers the content factory pipeline?

You should now have:

  • All three skills installed (media-processor, transcriber, content-repurposer)
  • A watched inbox folder configured
  • A working pipeline tested with a sample file

Next: diving deeper into transcription quality, timestamps, and clip extraction.