How it works

Five layers, no generation.

ClipWith reads your video, reasons about how to edit it, then invokes traditional editing tools to do the actual work. Here is what happens, layer by layer, from the moment a file lands to the moment a finished MP4 ships back to you.

01

Ingestion

Your upload gets read, transcribed, and classified.

The moment a file lands, several pipelines fire in parallel, and none of them generates a single frame. Bytes go straight to object storage; ffmpeg normalizes formats; a transcription pass turns spoken words into a word-level list, each word timed to the exact frames where it was said; and a small classifier reads the transcript plus the video's shape to build a profile of what KIND of video this is.

  • Direct-to-S3 multipart upload — bytes never proxy through our serverless functions, so a 4-hour podcast moves at network speed.
  • Word-level transcription with brand-aware bias — proper nouns and product names you've told us about land correctly the first time.
  • Post-pass LLM cleanup repairs proper nouns the transcriber misheard while preserving exact per-word timing.
  • Content fingerprint classifier writes a profile: content type, energy, cadence, audio role, visual pace, recommended caption preset.
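
To make "timed to the exact frame" concrete, here is a minimal Python sketch of a word-level transcript and a cleanup pass that changes text but never timing. It is an illustration under stated assumptions, not ClipWith's code: the `Word` shape, the function names, and the simple one-to-one correction table are all hypothetical.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    """One transcribed word (hypothetical shape, for illustration only)."""
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def frame_span(word, fps):
    """Return the (first, last) frame indexes the word overlaps."""
    first = math.floor(word.start * fps + 1e-9)
    last = math.ceil(word.end * fps - 1e-9) - 1
    return first, max(first, last)


def apply_corrections(words, corrections):
    """Swap misheard text for corrected text; start and end stay untouched."""
    return [replace(w, text=corrections.get(w.text, w.text)) for w in words]
```

Because the correction step only rewrites `text`, any caption or cut already aligned to a word's frames stays aligned after cleanup.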

02

Visual understanding

On paid plans, the video gets watched frame by frame.

Vision-tier projects also run through a visual AI layer that indexes every second — faces, objects, actions, camera movement, scene transitions, on-screen text, time-locked to the exact frames where they appear. After this layer runs, the question 'when does the product hit the desk' has a frame-accurate answer the editor can act on.

  • Semantic frame index — every face, object, action, scene tagged with timestamps.
  • Structural pass identifies chapters, highlights, and hook candidates without you scrubbing.
  • Free-tier users get the transcript-only path; visual reasoning is gated to Starter and above.
  • Index is cached on your project so subsequent edits feel instant — no re-scan per prompt.
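
A question like "when does the product hit the desk" can be pictured as a lookup against a time-stamped tag index. The Python sketch below is a toy, in-memory version built on assumptions: the `Tag` fields, the `FrameIndex` class, and its method names are hypothetical and do not describe the real index format.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One indexed observation (hypothetical shape)."""
    label: str    # e.g. "product on desk"
    kind: str     # "face", "object", "action", "scene", or "text"
    start: float  # seconds
    end: float    # seconds


class FrameIndex:
    """Built once per project, then reused for every prompt."""

    def __init__(self, tags):
        self._by_label = {}
        for tag in sorted(tags, key=lambda t: t.start):
            self._by_label.setdefault(tag.label.lower(), []).append(tag)

    def first_appearance(self, label, fps):
        """Return the frame where `label` first appears, or None."""
        hits = self._by_label.get(label.lower())
        if not hits:
            return None
        return math.floor(hits[0].start * fps + 1e-9)

    def active_at(self, seconds):
        """Return the sorted labels visible at a given time."""
        return sorted({
            tag.label
            for tags in self._by_label.values()
            for tag in tags
            if tag.start <= seconds < tag.end
        })
```

Building the dictionary once and querying it many times mirrors the caching idea: the expensive scan happens up front, and each later prompt only pays for a lookup.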

03

Reasoning

When you type a prompt, the editor reasons before it cuts.

Your prompt joins everything the system already knows about the footage — transcript, visual index, content fingerprint, your brand context, the active style preset — and a tiered LLM stack reads it all. A small fast model handles trivial edits; a stronger model takes anything multi-step or visually complex. The output is never a video. The output is a list of EDIT OPERATIONS — what to trim, what to overlay, what to color-grade, which tool to call next.
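
As a rough picture of how a prompt might be routed to the cheapest capable model, here is a deliberately naive Python sketch. Everything in it is an assumption made for illustration: the tier names, the keyword lists, and the rule that "multi-step plus visual" goes to the largest tier. A production router would rely on far richer signals than substring matching.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"    # trivial, single-step edits
    MEDIUM = "medium"  # multi-step OR visually grounded
    LARGE = "large"    # multi-step AND visually grounded


# Hypothetical markers; chosen only to keep the example readable.
MULTI_STEP_MARKERS = (" then ", " after that ", ";")
VISUAL_MARKERS = ("when ", "where ", "on screen", "face", "shot", "scene")


def pick_tier(prompt, has_visual_index):
    """Pick the cheapest tier that the prompt seems to need."""
    text = f" {prompt.lower()} "
    multi_step = any(marker in text for marker in MULTI_STEP_MARKERS)
    # Visual reasoning is only possible when a visual index exists.
    visual = has_visual_index and any(m in text for m in VISUAL_MARKERS)
    if multi_step and visual:
        return Tier.LARGE
    if multi_step or visual:
        return Tier.MEDIUM
    return Tier.SMALL
```

The design point is the ordering: start from the cheapest tier and escalate only when the prompt shows a reason to.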

  • Three-tier model ladder routes each prompt to the cheapest model that can handle it well.
  • System prompt encodes universal editing principles (cut motivation, pacing, eye-trace, attention curves) rather than creator-specific recipes.
  • Content fingerprint feeds the reasoning so defaults match the footage even when your prompt is vague.
  • Output is RFC 6902 JSON Patch operations + a natural-language explanation — fully auditable, never opaque.
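
To show what "a list of edit operations" can look like, the Python sketch below applies a few RFC 6902 operations to a small composition document. The composition fields (`clips`, `overlays`, `grade`) are invented for the example, and the function covers only the `add`, `remove`, and `replace` operations from the spec, not `move`, `copy`, or `test`.

```python
import copy


def parse_pointer(pointer):
    """Split an RFC 6901 JSON Pointer into unescaped reference tokens."""
    if pointer == "":
        return []
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    # Unescape "~1" before "~0", as RFC 6901 requires.
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def list_index(token, length, allow_end=False):
    """Turn a pointer token into a checked list index ("-" means append)."""
    if allow_end and token == "-":
        return length
    index = int(token)
    upper = length if allow_end else length - 1
    if not 0 <= index <= upper:
        raise IndexError(f"index {token!r} out of range")
    return index


def apply_patch(document, operations):
    """Apply add / remove / replace operations to a deep copy of `document`."""
    doc = copy.deepcopy(document)
    for op in operations:
        kind = op["op"]
        if kind not in ("add", "remove", "replace"):
            raise ValueError(f"unsupported op in this sketch: {kind!r}")
        tokens = parse_pointer(op["path"])
        if not tokens:
            raise ValueError("this sketch does not patch the document root")
        parent = doc
        for token in tokens[:-1]:
            if isinstance(parent, list):
                parent = parent[list_index(token, len(parent))]
            else:
                parent = parent[token]
        key = tokens[-1]
        if isinstance(parent, list):
            if kind == "add":
                parent.insert(list_index(key, len(parent), allow_end=True), op["value"])
            elif kind == "remove":
                del parent[list_index(key, len(parent))]
            else:
                parent[list_index(key, len(parent))] = op["value"]
        elif kind == "add":
            parent[key] = op["value"]
        elif kind == "remove":
            del parent[key]  # raises KeyError if the member is missing
        else:
            if key not in parent:
                raise KeyError(key)
            parent[key] = op["value"]
    return doc


# Hypothetical composition and patch, for illustration only.
composition = {
    "clips": [{"id": "a", "in": 0, "out": 900}],
    "overlays": [],
    "grade": {"preset": "neutral"},
}
patch = [
    {"op": "replace", "path": "/clips/0/in", "value": 90},
    {"op": "add", "path": "/overlays/-", "value": {"type": "caption", "preset": "karaoke"}},
    {"op": "remove", "path": "/grade"},
]
edited = apply_patch(composition, patch)
```

Each operation names exactly one path and one change, which is what makes the edit auditable: you can read the patch and see what moved. Working on a copy also means a failed patch leaves the original composition untouched.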

04

Specialized tools

Beyond cuts, dedicated systems handle the things one model can't.

Captions, music, voice cleanup, silence detection, and on-screen eye-contact correction: each lives behind a tool the reasoning layer can call when it's the right move. These tools are deterministic, billed per use, and labeled in the explanation so you always know exactly what produced each element of your edit.

  • Karaoke captions with per-word emphasis presets (color/font/scale/case/gradient/frosted-glass) for creator-style reels.
  • Background music generated to match the cadence and energy of your video, ducked under speech automatically.
  • Studio-voice cleanup uses a specialized speech-restoration model — denoise, EQ, level — at broadcast quality.
  • Eye-contact correction subtly re-aligns gaze toward the camera when you were reading off a teleprompter.
  • Silence detection finds dead air with frame accuracy; trims are surfaced for one-click application.
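
Frame-accurate silence detection can be sketched as a scan for long runs of quiet frames. The Python example below assumes one loudness reading per video frame; the `-45.0` dB threshold and the 12-frame minimum are placeholder values picked for the example, not ClipWith's settings.

```python
def find_silences(loudness_db, threshold_db=-45.0, min_frames=12):
    """Return (first_frame, last_frame) runs where every frame is below threshold.

    `loudness_db` holds one loudness reading per video frame.
    Runs shorter than `min_frames` are ignored as natural pauses.
    """
    runs = []
    run_start = None
    for frame, level in enumerate(loudness_db):
        if level < threshold_db:
            if run_start is None:
                run_start = frame  # a quiet run begins here
        elif run_start is not None:
            if frame - run_start >= min_frames:
                runs.append((run_start, frame - 1))
            run_start = None
    # Close a quiet run that lasts until the end of the clip.
    if run_start is not None and len(loudness_db) - run_start >= min_frames:
        runs.append((run_start, len(loudness_db) - 1))
    return runs
```

The function only reports candidate ranges. That matches the one-click model described above: detection proposes trims, and nothing is cut until you apply them.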

05

Render

The composition becomes a real MP4, your footage preserved.

When you export, the composition document goes to a programmatic video renderer running in parallel cloud workers. Your original frames are composited with the overlays, captions, color grades, and music — never regenerated. The result is a deterministic MP4 at the aspect ratio of your choice, ready to download or publish.

  • Lambda-based parallel chunked rendering — a 5-minute clip finishes in under a minute, no local CPU burn.
  • Outputs are deterministic: same composition document, same export, every time.
  • Aspect ratios on demand — 9:16 / 16:9 / 1:1 / 4:5 from the same source.
  • Direct publishing to TikTok, Reels, and Shorts on your own schedule is planned; manual download is always available.
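
Parallel chunked rendering starts with splitting the timeline into frame ranges that separate workers can render independently. Here is a minimal Python sketch of that planning step; the function name and the chunk size used in the usage note are assumptions for illustration, not the renderer's actual configuration.

```python
def plan_chunks(total_frames, frames_per_chunk):
    """Split frames 0..total_frames-1 into contiguous, non-overlapping ranges.

    Each range is an inclusive (first_frame, last_frame) pair that one
    worker could render on its own before the pieces are stitched together.
    """
    if total_frames < 0:
        raise ValueError("total_frames must not be negative")
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    return [
        (start, min(start + frames_per_chunk, total_frames) - 1)
        for start in range(0, total_frames, frames_per_chunk)
    ]
```

For example, a 5-minute clip at 30 fps is 9,000 frames; with a hypothetical chunk size of 600 frames, `plan_chunks(9000, 600)` yields 15 ranges that can render at the same time.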

The stack

Real systems doing real work.

We don’t use a single big model for everything — that way leads to hallucinated frames. Instead, ClipWith composes best-in-class systems, each doing one job extremely well, orchestrated by the reasoning layer. The video that comes out has been touched by editorial logic, not synthesized pixels.

Visual understanding

TwelveLabs Marengo + Pegasus

Frame-level semantic index + structural summary of every uploaded video.

Transcription

OpenAI Whisper + Anthropic Haiku cleanup

Word-level timing with proper-noun bias and a post-pass repair layer for accuracy.

Reasoning

Anthropic Claude (Haiku / Sonnet / Opus ladder)

Picks the edit operations to make based on your prompt and the understanding layers.

Voice cleanup + eye contact

NVIDIA Maxine

Studio Voice restoration and Eye Contact correction on the audio + video.

Music

Suno (via KIE.ai)

Background-track generation matched to the energy of your video.

Rendering

Remotion + AWS Lambda

Programmatic video renderer running parallel cloud chunks — your footage is composited, not regenerated.

What this gets you

Three things most AI video tools can’t deliver.

Your footage stays your footage

No frame is regenerated. The export is the video you uploaded — trimmed, captioned, color-graded, scored — never invented. Your face, voice, and brand are intact end-to-end.

Deterministic outputs

Same composition, same export, every time. The reasoning layer makes editorial decisions; the render layer executes them deterministically. No slot-machine outputs.
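
One practical consequence of determinism: a composition document can be reduced to a stable fingerprint, so two identical compositions can be recognized as the same export. The Python sketch below shows the idea with canonical JSON and SHA-256. The function name and the composition fields are hypothetical, and nothing here describes how ClipWith actually keys its exports.

```python
import hashlib
import json


def composition_key(composition):
    """Return a stable SHA-256 fingerprint of a JSON-serializable composition.

    Keys are sorted and whitespace is removed so that two documents with
    the same content always serialize to the same bytes.
    """
    canonical = json.dumps(composition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Key order does not matter to the fingerprint, but any real change to the edit does, which is the property "same composition, same export" depends on.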

Roughly 1/100th the compute

We're orchestrating tools, not regenerating pixels. The cost and carbon footprint per edit are dramatically lower than those of generative video tools: cheaper for us, cleaner for you.