Ingestion
Your upload gets read, transcribed, and classified.
The moment a file lands, several pipelines fire in parallel, and none of them generates a single frame. Bytes go straight to object storage; ffmpeg normalizes formats; a transcription pass turns spoken words into a timed word list, each word bound to the exact frame it was said on; and a small classifier reads the transcript plus the video's shape to build a profile of what kind of video this is.
- Direct-to-S3 multipart upload: bytes never proxy through our serverless functions, so a 4-hour podcast moves at network speed.
- Word-level transcription with brand-aware bias: proper nouns and product names you've told us about land correctly the first time.
- Post-pass LLM cleanup repairs proper nouns the transcriber misheard, preserving exact per-word timing.
- Content fingerprint classifier writes a profile: content type, energy, cadence, audio role, visual pace, recommended caption preset.
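
To make the "preserving exact per-word timing" idea concrete, here is a minimal sketch of one way a cleanup pass could swap in corrected spellings without touching timestamps. It is an illustration under assumptions, not our actual implementation: the `Word` type, the `apply_cleanup` function, and the rule of rejecting any correction that changes the word count are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    """One transcribed word with its timing (hypothetical shape)."""
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def apply_cleanup(words: list[Word], corrected: list[str]) -> list[Word]:
    """Swap in corrected spellings while leaving every timestamp untouched.

    Assumption: the cleanup model returns one corrected string per input
    word. If the counts differ, the originals are returned unchanged, so
    a bad correction can never shift timing.
    """
    if len(corrected) != len(words):
        return list(words)
    # dataclasses.replace copies each word, changing only the text field.
    return [replace(w, text=c) for w, c in zip(words, corrected)]


words = [
    Word("welcome", 0.00, 0.42),
    Word("to", 0.42, 0.55),
    Word("acme", 0.55, 0.98),
    Word("cast", 0.98, 1.40),
]

# Hypothetical model output: same word count, only spellings changed.
cleaned = apply_cleanup(words, ["welcome", "to", "Acme", "Cast"])
```

Because the correction is applied word by word, captions built from `cleaned` stay aligned with the same frames as the raw transcript. A correction that merges or splits words would need a re-alignment step, which this sketch deliberately leaves out.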