Sora 2 vs Veo 3.1 vs Runway Gen-4 vs Kling: Best AI Video Model in 2026

Sora 2 vs Veo 3.1 vs Runway Gen-4 vs Kling 2.1 — same prompt, four models. Honest verdict by use case after the Sora app shutdown in April 2026.

Vlad
Founder, Lumigen

The "best AI video model" question got harder in 2026, not easier, and on April 26 it got harder again — the day OpenAI shut down the Sora consumer app and put the API on a clock that runs out September 24, 2026. Sora 2 is still the most physically convincing model anyone has shipped. It's also the one you can no longer build a roadmap around.

The four-way comparison everyone wants (Sora 2, Veo 3.1, Runway Gen-4, Kling 2.1) is now a comparison with an asterisk. Veo 3.1 and Kling 3.0 are racing to absorb Sora's social audience. Runway Gen-4 is hardening its position at the cinematic high end. The market that was supposed to settle into a stable equilibrium for 2026 is in flux again.

We did what most "vs" posts skip: ran the same prompt through all four. Same length, same resolution, same evaluation rubric. Then we asked which one we'd actually reach for given a real brief: a cinematic shot, a performance ad, a TikTok, a tight budget. The Sora-shutdown caveats are baked into each verdict.

Quick verdict (May 2026). Veo 3.1 is the safest default for most teams — native audio, predictable Vertex AI access, sane pricing. Runway Gen-4 still wins cinematic shot work where the camera language matters. Kling 2.1 is unbeatable on price-per-clip if you're producing volume. Sora 2 is still the visual-physics king, but the API window closes September 24, 2026, so don't build a long-term pipeline on it. Source: OpenAI's Sora discontinuation help article.

Use case | Winner | Runner-up
Cinematic / VFX | Runway Gen-4 | Sora 2 (until Sept 2026)
Performance ads | Veo 3.1 | Sora 2 (until Sept 2026)
Social / TikTok | Veo 3.1 (post-shutdown) | Kling 2.1
Budget / volume | Kling 2.1 | Veo 3.1 Fast
Long-term safe bet | Veo 3.1 | Runway Gen-4

If you don't already have a workspace that lets you switch models per shot, look at our 12 best AI video generators rundown — Lumigen routes between Veo 3.1, Runway Gen-4, and Kling in one prompt box, which is how we ran this test in production. Beginners new to AI video should start with our beginner's guide, and anyone wrestling with prompt structure should bookmark our prompts guide.

How we tested

One prompt. Four models. Same evaluation rubric. No retries — we used the first generation per model so we'd see what the model actually does, not what a curated highlight reel looks like.

The prompt. We wrote it to stress every axis we cared about: physics, motion, faces, text, brand realism.

"Cinematic medium shot, slow dolly-in toward a young woman holding a steaming ceramic mug labeled 'Cold Brew Co.' on a sunlit Brooklyn rooftop at golden hour. She tucks a strand of hair behind her ear, smiles, and turns toward the camera. Shallow depth of field, 35mm lens look, gentle steam, ambient city sound, 5 seconds, 1080p, 16:9."

That single prompt forces each model to handle: human face consistency through motion, text on a held object (the mug label), depth of field, lighting (golden hour), atmospheric effect (steam), and (for Sora 2 and Veo 3.1, the two models with native audio) synthesized audio.

Settings.

  • 5-second duration on each model (the shortest tier all four support natively)
  • 1080p, 16:9
  • Default sampling parameters; no LoRAs, no style packs
  • One generation per model — we kept the first result, walked away from the second
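As a rough illustration of what "same settings for every model" means in practice, the sketch below builds one identical job spec per model. The field names and the `build_jobs` helper are hypothetical (they are not any provider's API schema), and the model labels are the names used in this article, not real API model IDs.

```python
from dataclasses import dataclass, asdict

# Labels as used in this article; NOT real API model identifiers.
MODELS = ["Sora 2", "Veo 3.1", "Runway Gen-4", "Kling 2.1"]


@dataclass(frozen=True)
class TestSettings:
    """Settings from the list above; field names are illustrative."""
    duration_s: int = 5        # shortest tier all four support natively
    resolution: str = "1080p"
    aspect_ratio: str = "16:9"
    generations: int = 1       # first result only, no retries


def build_jobs(prompt: str, settings: TestSettings = TestSettings()) -> list[dict]:
    """Return one job spec per model, all sharing the same prompt and settings."""
    return [{"model": m, "prompt": prompt, **asdict(settings)} for m in MODELS]


# Placeholder text; the full test prompt is quoted earlier in the article.
jobs = build_jobs("<the test prompt>")
for job in jobs:
    print(job["model"], job["duration_s"], job["resolution"], job["aspect_ratio"])
```

Keeping the settings in one frozen object makes it harder to accidentally give one model a longer clip or a second attempt.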

Scoring rubric (1–10 each).

  • Physics & motion realism
  • Subject consistency (the woman's face across frames)
  • Detail fidelity (the mug, the rooftop, the city below)
  • Text rendering (the "Cold Brew Co." label)
  • Cinematography (composition, focus pull, lens feel)
  • Audio (native on Sora 2 and Veo 3.1 only; not penalized for absence elsewhere)
  • Prompt adherence
  • Time to render
  • Cost per 5-second clip at 1080p

The frame stills below are illustrative composite reconstructions of the four runs we logged, since we can't legally redistribute model outputs at frame level. The numerical scores are from our actual run.

Same prompt, four models, identical evaluation rubric

At-a-glance: the four models

Spec | Sora 2 | Veo 3.1 | Runway Gen-4 | Kling 2.1
Built by | OpenAI | Google DeepMind | Runway | Kuaishou
Released | Sept 30, 2025 | Oct 15, 2025 | March 31, 2025 | May 2025
Max duration | 12s native | 8s native (extendable) | 16s+ (with chaining, up to 60s) | 10s
Max resolution | 1024p (Pro) | 4K (via Vertex AI) | 4K | 1080p
Native audio | Yes (synced dialogue + SFX) | Yes (dialogue + SFX + ambient) | No (post in Aleph or DAW) | No
Camera controls | Prompt + scene grammar | Prompt + reference images | Best in class (Motion Brush 3.0) | Prompt + presets
Image-to-video | Yes | Yes (3 reference images) | Yes (best-in-class consistency) | Yes
Pricing entry | $20/mo Plus (until shutdown) | Bundled in Google AI Pro $19.99 | $15/mo Standard ($12 annual) | $6.99/mo Standard
API access | Active until Sept 24, 2026 | Vertex AI + Gemini API (paid preview) | Mature; $0.01/credit | Mature; via Kling API + fal.ai
API cost (1080p) | $0.10/s (~$0.50 for 5s) | $0.30–0.40/s with audio | ~$0.50 for 5s | ~$0.10 for 5s
Hands-on score | 8.7 | 8.9 | 8.6 | 7.5

Pricing verified May 2026 against OpenAI's pricing page, Vertex AI's Veo pricing, Runway's pricing page, and Kling's official subscription page. Numbers do drift; treat them as a snapshot, not a commitment.

Now the per-model deep dive.

Sora 2 (OpenAI)

The model behind it

OpenAI launched Sora 2 on September 30, 2025, alongside an iOS app and a TikTok-style social feed. The architecture is a denoising latent diffusion transformer that operates on 3D patches in latent space, then decodes back to video. OpenAI's recaptioning pipeline (where a video-to-text model generates dense training captions) is widely credited with Sora's unusually good prompt adherence on cinematographic vocabulary. The defining design decision was treating video as a unified latent volume rather than a sequence of frames; Sora 2's contribution was scaling that approach with synchronized audio generated jointly with the visuals.

The model was discontinued less than seven months after launch. The Sora app went dark on April 26, 2026; the API is scheduled to shut down September 24, 2026 (per OpenAI's discontinuation notice). The widely reported December 2025 Disney partnership ($1B investment, 200+ Disney/Marvel/Pixar/Star Wars characters integrated) was abandoned three months later. We're including Sora 2 in this comparison because the API still works as of May 2026, but a four-month build window is not a foundation.

What Sora 2 is genuinely good at

Real-world physics. Steam, water, fabric, hair under wind, crowd motion, cloth deformation under collision: Sora 2 still produces the most physically convincing motion on this list. Light refracts through glass correctly; hot liquid produces turbulent steam that disperses at the right rate; a thrown object follows a believable parabolic arc with the right drag. No other model on this list has caught up to Sora 2 here.

Cinematographic prompt adherence. Tell Sora "shot on 35mm anamorphic, golden hour, slow dolly-in" and it produces something that reads like a camera operator made the shot. Veo and Runway are now competitive, but Sora was first to make cinematic vocabulary feel like the model knew what those words meant.

Character-driven dialogue scenes. With audio enabled, Sora 2 generates a 12-second clip of a person speaking with synced lip movement, plausible mouth shapes, and matching ambient room tone. Most other models can produce one of those three; Sora 2 was the first to do all three in a single pass.

Long-form coherence at 12s. Sora 2's max duration is 12 seconds at 1024p (Pro tier); it holds character continuity across that span better than most competitors at equivalent length.

Where Sora 2 fails

Text on objects. The mug label rendered as a smear in our first generation. Sora 2 still struggles with crisp text on curved surfaces — true of every model on this list, but no better here than Veo or Runway.

Hands. As with every generative model since Stable Diffusion, hands occasionally grow a sixth finger under fast motion. It happens less often than with Kling 2.1 and more often than with Veo 3.1.

Long-form (>15s) coherence. Sora 2's hard cap is 12s, and stitched clips of three or four shots show seam artifacts where the model's idea of the character drifts between segments.

Access risk. This is the load-bearing failure now. Building a production pipeline on a model whose API shuts down in four months is not a strategy.

Audio support

Sora 2 generates native synced audio, but the quality varies by prompt category. Dialogue with realistic mouth movement is hit-and-miss: when it works it's the best on the market; when it misses you get a face that's clearly trying to talk but landing on the wrong phonemes. Ambient and SFX (rooftop wind, distant traffic, a kettle hiss) are reliably good. For "Cold Brew Co." the audio came back as plausible Brooklyn rooftop ambience (distant traffic, a passing siren, faint chatter) and was usable on first generation.

Character consistency

Best-in-class at 5s. Strong at 10s. Visible drift at 12s if your subject turns away from camera and back. Across separate shots (image-to-video chained), Sora 2 holds character continuity better than Runway Gen-4 but slightly worse than Runway Gen-4.5 (which was Runway's response to exactly this gap).

Motion fidelity & physics

The strongest of the four. We ran a side test with a glass of water tipping off a counter — Sora rendered correct splash dynamics, the right number of droplets at the right scale, and a believable puddle on the floor. Veo's version was OK; Runway's was acceptable; Kling's looked particle-emitted rather than fluid-simulated. This category alone is why some studios stuck with Sora 2 right up to the shutdown.

Prompt adherence

Sora 2 follows literal instructions tightly when the prompt is well-formed. It interprets creatively when given vague briefs ("make it cinematic"). The split between literal and interpretive depends on whether the prompt contains specific cinematographic vocabulary; the more concrete the brief, the more literal the output.

Pricing & access (May 2026)

  • Consumer (until April 26, 2026): Sora app + ChatGPT Plus at $20/mo with capped generations; Pro at $200/mo with higher caps.
  • API (until September 24, 2026): $0.10/s for Sora 2 Standard at 720p; $0.30/s for Sora 2 Pro at 720p; $0.50/s at 1024p (per OpenAI's published rates as of May 2026).
  • Regional access: API access requires a Tier 2 OpenAI account ($10 minimum prepay).
  • Queue times: ~80s for a 5s 1080p clip in our testing; longer under load.

The case for picking Sora 2

If you're shipping in the next four months and need physics-accurate motion no other model can produce (fluid dynamics, complex collisions, realistic crowd behavior), Sora 2 is still the right tool. If your timeline extends past September 2026, choose anything else. Sora 2 in May 2026 is a tactical choice for short campaigns, not a strategic platform bet. Score: 8.7 / 10; access risk knocks it out of default-recommendation status.
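The "next four months" rule is easy to turn into a date check. This is a minimal sketch, assuming the September 24, 2026 shutdown date quoted above; the function names are hypothetical, and treating the shutdown day itself as unusable is a conservative assumption rather than something OpenAI's notice is known to specify.

```python
from datetime import date

# API shutdown date as reported in this article.
SORA_API_SHUTDOWN = date(2026, 9, 24)


def sora_window_days(today: date) -> int:
    """Days of API access left as of `today` (0 once the date has passed)."""
    return max((SORA_API_SHUTDOWN - today).days, 0)


def sora_is_viable(last_render_date: date) -> bool:
    """True only if the project's final render lands before the shutdown.

    Assumption: the shutdown day itself is treated as unusable.
    """
    return last_render_date < SORA_API_SHUTDOWN


print(sora_window_days(date(2026, 5, 1)))   # days left at the start of May 2026
print(sora_is_viable(date(2026, 8, 15)))    # short campaign
print(sora_is_viable(date(2026, 10, 1)))    # anything extending into Q4
```

Pass the date of the last render the project will ever need (including re-renders and revisions), not the launch date.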

Veo 3.1 (Google DeepMind)

The model behind it

Google DeepMind shipped Veo 3 in May 2025 and Veo 3.1 on October 15, 2025. DeepMind's video lineage goes back to Phenaki and Lumiere; the Veo line consolidated those research threads with Google's audio research (Lyria, the music model). Demis Hassabis' framing at launch ("the moment AI video generation left the era of silent film") captured the design goal: video and audio generated jointly, not stitched together after the fact.

Veo 3.1 introduced reference image guidance (up to three reference images per generation), scene extension (chained clips that connect to previous footage), and first/last-frame control for transitions. The differentiator is distribution: Veo ships through Vertex AI, the Gemini API, Google AI Studio, the Gemini consumer app, and Flow (Google's dedicated video editor). For teams already on Google Cloud, Veo is the lowest-friction frontier model on the market.

What Veo 3.1 is genuinely good at

Native audio in production-ready quality. Veo's audio includes ambient (traffic, wind), SFX (footsteps timed to footfalls, door clicks), and dialogue with synchronized mouth movement. For ad creative, this cuts 20–40 minutes of production time per asset.

Detail fidelity on environments. Brooklyn rooftop came back specifically right (angled water tower, correct tar texture, specific skyline angle), not a generic composite. Veo's environment specificity is consistently better than Sora's, which leans cinematic-generic.

Reference image conditioning. Drop in a brand reference and Veo 3.1 maintains it through motion better than any model we tested except Runway Gen-4. For ad creatives needing the exact product or character, Veo's three-reference workflow is faster than Runway's.

Prompt-to-output predictability. Veo's output rarely surprises you. For ad teams running 20 generations a week, predictability is a feature.

Where Veo 3.1 fails

Default 8s duration. Longer clips require scene-extension chaining. It works (first/last-frame control is the right primitive), but stitching seams are visible if you don't plan transitions deliberately.

Camera-control vocabulary trails Runway. Veo improves with each release, but the explicit numerical control Runway exposes (focal length, dolly speed, ease curves) isn't there yet. You're still describing camera moves in prose.

Subject consistency on faces is a half-step behind Sora. The smile transition in our test introduced a brief facial morph that we'd notice on a second viewing.

Audio support

Best in class. Veo 3.1's audio is dialogue + SFX + ambient bed in a single render. The Vertex AI pricing is structured around this: $0.30/s for video-only, $0.40/s for video with audio (per Google's published rates as of May 2026, varying by tier). For one-person ad teams, "render and ship" is actually true with Veo, no foley pass required.

Character consistency

Reference-image guidance maintains characters across shots reliably. We tested with three reference images of the same fictional creator persona and got the same person across five separate shots with different lighting, different camera angles, and different wardrobe. Runway Gen-4 still does this better at the high end (4K, longer clips), but Veo 3.1's approach is more accessible: you don't need to learn a new control surface.

Motion fidelity & physics

Strong but not Sora-strong. Hair behaves under wind correctly. Clothing folds well. Fluid dynamics (water, smoke, steam) are competitive but a notch below Sora 2's particular strength here. For 95% of briefs, the gap doesn't matter; for the 5% where it does, it matters a lot.

Prompt adherence

Veo 3.1 follows prompts literally when prompts are concrete. It interprets creatively when prompts are abstract. This is similar behavior to Sora 2, with one difference: Veo 3.1 is more conservative about creative reinterpretation. It will produce something safer and more on-brief; Sora 2 will sometimes produce something more interesting that strays slightly. For client work, Veo's behavior is the right one.

Pricing & access (May 2026)

  • Consumer: Google AI Pro at $19.99/mo (bundled with Gemini, includes Veo access). Google AI Ultra at $249.99/mo for higher caps and 4K.
  • API (Vertex AI): $0.30/s video-only, ~$0.40/s with audio for Veo 3.1 Standard at 1080p; $0.15/s for Veo 3.1 Fast (the cost-effective tier launched April 2026); rates climb to $0.60/s at 4K.
  • Regional access: Available everywhere Vertex AI is, with broad coverage including EU, UK, US, APAC. No waitlist as of May 2026.
  • Queue times: ~60s for a 5s 1080p clip; faster on Veo 3.1 Fast.
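To turn the per-second rates above into a per-clip number, multiply by clip length. The sketch below hard-codes the rates quoted in this article as a May 2026 snapshot; the tier keys and the `veo_clip_cost` helper are hypothetical names, and the article does not say whether the Fast and 4K rates include audio, so no audio variant is invented for them.

```python
# Per-second Vertex AI rates as quoted in this article (snapshot, not a commitment).
VEO_RATES_PER_SECOND = {
    "standard_video_only": 0.30,  # Veo 3.1 Standard, 1080p
    "standard_with_audio": 0.40,  # article quotes "~$0.40/s"
    "fast": 0.15,                 # Veo 3.1 Fast
    "4k": 0.60,                   # top of the quoted range
}


def veo_clip_cost(seconds: float, tier: str) -> float:
    """Cost in USD for one clip of `seconds` length at the given tier."""
    if tier not in VEO_RATES_PER_SECOND:
        raise ValueError(f"no rate quoted in the article for tier {tier!r}")
    return round(VEO_RATES_PER_SECOND[tier] * seconds, 2)


for tier in VEO_RATES_PER_SECOND:
    print(f"{tier}: ${veo_clip_cost(5, tier):.2f} per 5s clip")
```

At these rates a 5-second Standard clip comes to $1.50 video-only and $2.00 with audio, so it is worth checking which of the two a quoted "per clip" figure refers to.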

The case for picking Veo 3.1

If you're a performance-creative team shipping ads to Meta or TikTok this week, Veo 3.1 is the default. Native audio collapses your production timeline; Vertex AI pricing is predictable; access is stable; the model is GA, not on a shutdown clock. For ecommerce ad creative where the output is a finished asset, Veo wins on throughput. Score: 8.9 / 10 — the new default recommendation for most teams in May 2026.

Runway Gen-4 (and Gen-4.5)

The model behind it

Runway shipped Gen-4 on March 31, 2025 and followed with Gen-4.5 in late 2025. The company has been in this space since 2018, longer than any competitor on this list. Runway Research helped author the original Stable Diffusion paper, which is why their video models read like they were built by people who think about generative video as a craft, not a benchmark.

The Gen-4 line's design center is "world consistency": the same character, object, location, and lighting across multiple shots, generated by separate prompts but referenced via a single image or seed. That focus shows up everywhere: Motion Brush (paint specific regions to direct motion), the reference-image system, Aleph (their video editing model), Act-Two (performance capture). Runway treats text-to-video as one tool in a 30+ tool suite, not as the product. Architectural specifics aren't published, but the behavior (strong reference conditioning, granular control surfaces, longer durations) suggests a different inference path than the pure text-conditioned diffusion approach Sora and Veo use.

What Runway Gen-4 is genuinely good at

Camera control. This is the differentiator nothing else matches. We can specify focal length, dolly speed, and ease curves explicitly. The dolly-in in our test had perceptibly correct ease-in/ease-out: not a uniform constant-velocity zoom, but a real-camera ramp. No other model on this list exposes that level of direct control.

Character consistency across shots. Gen-4's reference-image system maintains character appearance, clothing, facial features, and body proportions across dramatically different shots. We tested with one reference image of a fictional brand spokesperson and got the same person across eight different setups (different lighting, different wardrobe, different camera angles). Veo 3.1 with three reference images is competitive at the basic level; Runway is better at the edge cases (extreme angles, unusual lighting).

4K output. Runway has had 4K longest. The other models are catching up (Veo 3.1 supports 4K via Vertex AI, Kling 3.0 ships native 4K), but Runway's 4K pipeline is the most mature.

Cinematography read. Composition, focus pull, lens feel: Gen-4 outputs read like a camera operator made the shot. Highest cinematography score in our rubric.

Production integration. Aleph (video editing) and Act-Two (performance capture) plug into Gen-4 outputs in the same Runway workspace. For a music-video or brand-film workflow, you can stay inside Runway end-to-end.

Where Runway Gen-4 fails

Subject-face consistency at long duration. Third out of four on our face-morphing rubric. A 12-second Gen-4 take is more likely to drift on the face than the same length in Sora or Veo. Gen-4.5 narrowed this gap; the gap still exists.

No native audio. You'll bring it into a DAW, Aleph, or a Lumigen timeline to finish. For ad creative, this is the key disadvantage versus Veo.

Credit burn at 4K. A 16-second 4K clip can eat $5–8 worth of credits, while the Standard plan ($15/mo) includes only 625 credits, about $6.25 at the $0.01/credit equivalent. For volume work, the math doesn't work; for hero shots, it's fine.
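The credit arithmetic is worth spelling out. This is a small worked sketch using only the figures quoted in this article ($0.01 per credit equivalent, 625 credits on Standard, $5–8 per 16-second 4K clip); the helper names are hypothetical.

```python
CREDIT_PRICE_USD = 0.01          # article's "$0.01/credit equivalent"
STANDARD_MONTHLY_CREDITS = 625   # Standard plan allowance quoted in the article


def credits_for(usd: float) -> int:
    """Convert a dollar cost into credits at the quoted credit price."""
    return round(usd / CREDIT_PRICE_USD)


def share_of_standard_plan(usd: float) -> float:
    """Fraction of one month's Standard credits that a cost of `usd` consumes."""
    return credits_for(usd) / STANDARD_MONTHLY_CREDITS


for clip_cost in (5.0, 8.0):
    print(
        f"${clip_cost:.0f} clip = {credits_for(clip_cost)} credits "
        f"= {share_of_standard_plan(clip_cost):.0%} of a Standard month"
    )
```

At the low end of the estimate one such clip uses 500 credits (80% of the monthly allowance); at the high end it needs 800 credits, more than the plan includes.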

Pricing scales steeply. Standard at $15/mo is the entry; Pro at $35/mo gets meaningful credits; serious volume needs Unlimited or enterprise. For a one-person creator, the entry tier is workable; for an agency producing hundreds of shots, costs escalate.

Audio support

None native. You finish in Aleph, in a DAW, or in a unified workspace like Lumigen. Runway has been silent on whether Gen-5 will add native audio; the leaks suggest yes, but don't build on a leak.

Character consistency

Best in class for high-stakes work where one character has to appear across many shots. The reference-image conditioning is the most reliable on the market for "make this exact person, in this exact wardrobe, doing this exact thing, in eight different scenes."

Motion fidelity & physics

Strong on cinematographic motion (camera moves, parallax, perspective shifts). Mid-tier on physics edge cases (water, fire, complex collision). For most briefs, the cinematographic strength is what matters.

Prompt adherence

Gen-4 follows camera and shot-grammar instructions tightly when those instructions are explicit. It interprets character behavior creatively when prompts are vague. Runway's documentation strongly recommends using shot grammar (medium close-up, dolly-in, 35mm focal length) and the model rewards that style.

Pricing & access (May 2026)

  • Free: $0, 125 one-time credits (not monthly).
  • Standard: $15/mo monthly or $12/mo annual. 625 credits/month, up to 5 users.
  • Pro: $35/mo monthly or $28/mo annual. 2,250 credits/month, up to 10 users.
  • Unlimited: $76/mo annual. Unlimited generations on selected models.
  • API: $0.01 per credit equivalent (developer portal). Gen-4 image API at $0.08 per generated image.
  • Regional access: Global, no waitlist.
  • Queue times: ~45s for 5s 1080p; longer at 4K.

The case for picking Runway Gen-4

If your output is a finished cinematic shot (music video, brand film, title sequence, faceless YouTube channel where production value matters), Runway is the right tool. The camera-control panel is the differentiator nothing else matches, and the 4K ceiling matters when the deliverable is a master file. The 30+ tool suite (Motion Brush, Aleph, Act-Two) is genuinely useful when AI generation is one stage of many. Where Runway slips is rapid-iteration ad workflows where audio and throughput matter more than cinematography. Score: 8.6 / 10, the cinematic specialist's choice.

Kling 2.1 (Kuaishou)

The model behind it

Kuaishou (the Beijing-based short-video platform that competes with ByteDance domestically) released the original Kling in mid-2024 and shipped Kling 2.1 in May 2025. Kling AI announced an annualized revenue run rate above $100M in its tenth month, the fastest-growing video-generation product to that point. The architecture combines a diffusion-based transformer with a custom 3D variational autoencoder (VAE) for synchronous spatiotemporal compression, designed to preserve training efficiency while keeping output quality high.

Kling's design center is price-performance. Where Sora optimizes for physics and Veo for audio integration, Kling optimizes for "good enough at a third of the price." For teams running volume work where the bar is "watchable" and the constraint is budget, Kling 2.1 is in a different price tier than the US frontier models. Kuaishou shipped Kling 3.0 on February 4, 2026 (covered in the roadmap section), but Kling 2.1 remains the production-ready version most teams are using as of May 2026.

What Kling 2.1 is genuinely good at

Physics simulation, especially fluids. Steam, water, smoke, fabric: Kling's physical motion holds up under close inspection. Water vapor in our test behaved like water vapor, not like a particle effect. This is the area where Kling competes with Sora 2 directly, despite the price gap.

Image-to-video reliability. Drop in a reference frame and Kling's I2V pipeline preserves likeness through motion better than expected for the price tier. For Shopify product shots where you're animating from an existing product image, Kling is genuinely competitive.

Long-duration coherence at 10s. Kling holds character consistency across 10-second clips better than Sora 2 at the same length on our other prompt batches.

Price-per-clip economics. $0.10–0.20 per 5-second 1080p clip on the Standard tier. That's roughly 2.5–5× cheaper than Runway Gen-4 (~$0.50) at the same resolution and duration. For "100 ad variants this week" workflows, the math is unbeatable.

Where Kling 2.1 fails

English idiomatic prompt adherence. "35mm lens look" was interpreted loosely; "golden hour" rendered closer to mid-afternoon. The training corpus has a Mandarin-first center of gravity, and English cinematographic vocabulary translates inconsistently. For TikTok-style social content where the prompt is descriptive rather than technical, this matters less.

Latin-character text rendering. The mug label rendered as gibberish glyphs in our test. If your shot needs a brand name, a product label, or any English text on a surface, Kling will fail more often than it succeeds. Composite the text in post.

Web product polish. No timeline, awkward export flow, English-language UI rough in places. The product has improved across 2025–26 but trails the US competitors on workflow quality.

No native audio. You finish elsewhere.

Audio support

None native. Native audio arrived with Kling 3.0 (shipped in Feb 2026, in five languages including English), but Kling 2.1 itself is silent.

Character consistency

Strong via image-to-video conditioning, weaker via pure text-to-video. For a workflow where you generate one reference frame in another tool (Midjourney, ChatGPT image, Imagen) and then animate in Kling, character consistency is reliably good. For a pure text-only workflow, Kling drifts more than Sora or Veo.

Motion fidelity & physics

Top-tier on fluids and fabric. Mid-tier on faces and hands. The split is consistent with a model that was trained with heavy emphasis on real-world short-video footage (Kuaishou's native data), which has a lot of physical motion and not much cinematographic vocabulary.

Prompt adherence

Loose on English idiomatic instructions. Tight on direct descriptive prompts. The pattern ("say what you want plainly, don't lean on cinematographic shorthand") is the right way to prompt Kling and works well once you internalize it.

Pricing & access (May 2026)

  • Free: $0, 66 daily credits (with watermark and quality cap).
  • Standard: $6.99/mo monthly. 660 credits/month, no watermark.
  • Pro: $25.99/mo monthly. ~3,000 credits/month, higher quality tier.
  • Premier: $64.99/mo monthly. Premium model access (Master mode, higher credits).
  • Ultra: $127.99/mo monthly. Enterprise-grade caps.
  • Annual billing: 20–34% discount on monthly rates.
  • API access: Via the official Kling API and third-party providers (fal.ai, WaveSpeedAI, others). Pricing varies; expect ~$0.20–0.40 per 5s 1080p clip on third-party providers.
  • Regional access: Global. EU and UK access is available; regional latency varies.
  • Queue times: ~70s for a 5s 1080p clip.

The case for picking Kling 2.1

If your job is volume (50+ clips a week, throwaway iteration on ad creative, Shopify product animations, TikTok testing batches), Kling 2.1 is unbeatable on price-per-clip. English-prompt limitations matter less when you're generating dozens and curating the best 10%. The catch: anyone watching the output critically will notice the cinematography gap, and prompts that need accurate English text will fail more often than they succeed. Score: 7.5 / 10, the volume player's choice.

Side-by-side scoring matrix

Across the same nine criteria, scored 1–10 (with audio scored only for models that ship it natively — non-native audio is "–"):

Criterion | Sora 2 | Veo 3.1 | Runway Gen-4 | Kling 2.1
Physics & motion realism | 9 | 8 | 8 | 9
Subject consistency | 9 | 8 | 7 | 7
Detail fidelity | 8 | 9 | 9 | 8
Text rendering | 5 | 6 | 6 | 4
Cinematography | 9 | 8 | 10 | 7
Native audio | 8 | 9 | – | –
Prompt adherence | 9 | 9 | 8 | 7
Time to render (5s clip) | ~80s | ~60s | ~45s | ~70s
Cost per 5s clip @ 1080p | ~$0.50 (API) | ~$1.50 video-only, ~$2.00 with audio (API) | ~$0.50 (API credits) | ~$0.20 (Standard)
Composite score | 8.7 | 8.9 | 8.6 | 7.5

A few notes on how to read this:

  • The composite weights visual quality heavily. Weight access stability or audio differently and the ordering changes.
  • Sora 2's 8.7 reflects current output quality, not access risk. Factor shutdown risk and Sora drops below Kling for any project shipping past September 2026.
  • Runway and Kling 2.1 aren't penalized for missing audio in the line item; they're at a workflow disadvantage that shows up in time-to-finish.
  • Cost-per-clip varies by access path. The numbers above come from each provider's published pricing as of May 2026: API rates for Sora 2, Veo 3.1, and Runway, and the Standard subscription tier for Kling.
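To see how much the ordering depends on weighting, here is a minimal sketch of a weighted composite over the seven 1–10 quality criteria in the matrix. The weights are hypothetical: the article's actual weighting is not published, so this will not reproduce the 8.7 / 8.9 / 8.6 / 7.5 composites. Criteria with no score (native audio for Runway and Kling) are skipped rather than counted as zero, which is one possible reading of "not penalized".

```python
# 1-10 scores from the matrix above; None means the criterion was not scored.
SCORES = {
    "Sora 2":       {"physics": 9, "subject": 9, "detail": 8, "text": 5,
                     "cinematography": 9, "audio": 8, "prompt": 9},
    "Veo 3.1":      {"physics": 8, "subject": 8, "detail": 9, "text": 6,
                     "cinematography": 8, "audio": 9, "prompt": 9},
    "Runway Gen-4": {"physics": 8, "subject": 7, "detail": 9, "text": 6,
                     "cinematography": 10, "audio": None, "prompt": 8},
    "Kling 2.1":    {"physics": 9, "subject": 7, "detail": 8, "text": 4,
                     "cinematography": 7, "audio": None, "prompt": 7},
}


def composite(scores: dict, weights: dict) -> float:
    """Weighted mean over scored criteria; any weight not given defaults to 1."""
    total = weight_sum = 0.0
    for criterion, value in scores.items():
        if value is None:
            continue  # unscored criteria are skipped, not treated as zero
        w = weights.get(criterion, 1.0)
        total += w * value
        weight_sum += w
    return total / weight_sum


def ranking(weights: dict) -> list[str]:
    """Model names sorted from highest to lowest composite."""
    return sorted(SCORES, key=lambda m: composite(SCORES[m], weights), reverse=True)


print(ranking({"cinematography": 5}))  # a cinematography-heavy brief
print(ranking({"audio": 5}))           # an audio-heavy brief
```

With these illustrative weights, a cinematography-heavy brief puts Runway Gen-4 first and an audio-heavy brief puts Veo 3.1 first, which is the point of the first note above.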

Where each model wins, where each one slips — across motion, prompt adherence, audio, and value

Sample outputs from a single test prompt

We can't republish frames from the actual model outputs (terms vary by provider on redistribution), so the description below is what we logged from our run. The frames behind the inline images are illustrative composites.

Sora 2's take. The most natural camera ramp: slight ease-in, faster middle, ease-out at the close. The woman's face held together across the full 5 seconds. Hair behaved under wind like real hair. Steam rose with realistic turbulence (visible micro-eddies, correct dispersion). The mug label was a smear. The Brooklyn skyline was generic-cinematic, recognizable as "city," not specifically Brooklyn. Audio came back as plausible rooftop ambience: distant traffic, faint chatter, a passing siren. Render: 78 seconds.

Veo 3.1's take. Slightly less elegant camera ramp, closer to constant-velocity. Subject consistency was strong, apart from a brief facial morph at frame 90 that we'd notice on a second viewing but not the first. The Brooklyn skyline came back specifically right (angled water tower, characteristic tar-paper texture). The mug label rendered the most legibly of any model: "Cold Brew" was readable, "Co." was a smear. Steam less convincing than Sora's. Audio was the cleanest of any output: distant city traffic, a faint AC hum that felt like it belonged. Render: 62 seconds.

Runway Gen-4's take. The cleanest cinematography. Clear focus pull from the rooftop background to the subject during the dolly-in. Lens character (slight barrel distortion at frame edges, characteristic of a real 35mm lens) was the strongest signal Runway has cinematographic priors built in. Subject's face drifted slightly between frames 90 and 120; nothing a colorist's pass wouldn't smooth. Mug label was a smear. Steam acceptable. No audio. Render: 47 seconds.

Kling 2.1's take. Steam was the most physically convincing of all four: water vapor behaved like water vapor with correct dispersion. Subject's face was strong across the full 5 seconds. "Golden hour" rendered closer to flat mid-afternoon; Kling's lighting interpretation was the loosest. Mug label was gibberish Latin glyphs. The skyline was plausible-but-generic. No audio. Render: 73 seconds.

What we'd use each for, given this output: Veo for the ad (audio + readable label + correct skyline). Runway for the cinematic cut (lens character + camera ramp). Sora for a 9:16 social variant where audio is replaced with a music bed. Kling for a 50-clip batch where this is one of fifty.

Use-case decision tree

The right model depends on the shot you're trying to make. Here's how we'd pick from a real brief, with the post-shutdown context factored in.

Cinematic / VFX work — Runway Gen-4

If your output is a finished shot (for a music video, a brand film, a title sequence), Runway is the right tool. The camera-control panel is the differentiator nothing else matches, and the 4K ceiling matters when the deliverable is a master file. Where Runway slips against Sora is character continuity over longer takes; with Sora's API on a clock, that gap becomes less relevant for new pipelines.

Performance ads — Veo 3.1

For ad creative where the deliverable ships to Meta or TikTok this afternoon, Veo's native audio collapses the production timeline. The visual quality is competitive with Sora, and skipping the separate foley/ambience pass saves a real 30 minutes per asset. We timed it on a small batch: three Veo renders shipped to Meta Ads Manager in 41 minutes, including captions and an export pass. The same batch with Sora plus a separate audio step took 1 hour 18 minutes to reach the same finished state.

Social / TikTok — Veo 3.1 (post-shutdown), Sora 2 (until Sept 2026)

For 9:16 social where the visual is doing all the work, Sora 2's prompt-following and subject consistency historically won. With the consumer app shut down, Veo 3.1 takes over the social-default slot for new workflows: its prompt adherence on social-friendly directives ("trending warm filter," "iPhone front-facing camera look," "Studio Ghibli style") is now competitive with where Sora 2 was at launch. Kling 2.1 is the runner-up at a fraction of the price if you're producing volume. See our TikTok playbook for the wider context on social-specific workflows.

Budget / volume — Kling 2.1

If your job is "100 ad variants this week" and your bar is "watchable," Kling at $7/mo is unbeatable. The English-prompt limitations matter less when you're generating dozens of clips and curating the best 10%. We've used Kling for first-pass volume on Shopify ad batches: generate 40 clips overnight, keep the four that hit, and regenerate the 36 that didn't, using Veo or Runway as a finisher. Per-clip cost on the Standard tier comes out to roughly 10–20 cents per 5-second 1080p render, which makes throwaway iteration economically feasible.
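
The economics of that batch can be sketched as a quick calculation. This is a rough illustration, not a pricing tool: the 10–20 cent per-clip range and the 40-clip, four-keeper batch come from the paragraph above, while the function name `cost_per_keeper` is ours and the finisher pass in Veo or Runway is not included.

```python
def cost_per_keeper(clips: int, keepers: int, cost_per_clip: float) -> float:
    """First-pass batch spend divided by the number of clips worth keeping."""
    if keepers <= 0:
        raise ValueError("need at least one keeper to compute a per-keeper cost")
    return clips * cost_per_clip / keepers


# Shopify batch from the text: 40 clips generated overnight, 4 kept (10%).
CLIPS, KEEPERS = 40, 4
low = cost_per_keeper(CLIPS, KEEPERS, 0.10)   # low end of the per-clip range
high = cost_per_keeper(CLIPS, KEEPERS, 0.20)  # high end of the per-clip range

print(f"Batch spend: ${CLIPS * 0.10:.2f} to ${CLIPS * 0.20:.2f}")
print(f"Cost per kept clip: ${low:.2f} to ${high:.2f}")
```

Even at a 10% keep rate, each usable clip costs about $1 to $2 in first-pass spend, which is why discarding 90% of a batch is still affordable.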

Quick decision tree

  • Need precise camera language and 4K masters? → Runway Gen-4
  • Need native audio + ship today? → Veo 3.1
  • Need cinematic feel for social content (and shipping before Sept 2026)? → Sora 2 while you can
  • Need cinematic feel for social content (and building a pipeline)? → Veo 3.1
  • Need 50+ clips a week without breaking the budget? → Kling 2.1
  • Need character consistency across 10 shots? → Runway Gen-4 (best) or Veo 3.1 (good enough, simpler)
  • Want to switch between all four per shot? → A multi-model workspace like the ones reviewed in our 12-best listicle
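
For teams that want to encode the list above in tooling, it can be sketched as a first-match rule chain. This is a minimal illustration under stated assumptions: the `Brief` fields, the rule order (top to bottom, as listed), the exact thresholds of 50 clips and 10 shots, and the Veo 3.1 fallback (the safest default named in the quick verdict) are our reading of the bullets, not a feature of any provider's API.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical description of a shot brief; field names are illustrative."""
    needs_camera_control_4k: bool = False
    needs_native_audio: bool = False
    social_cinematic: bool = False
    long_term_pipeline: bool = True  # True if the workflow must outlive Sept 2026
    clips_per_week: int = 0
    consistent_shots: int = 0        # shots that must share the same character


def pick_model(brief: Brief) -> str:
    """Walk the bullets in order and return the first model that matches."""
    if brief.needs_camera_control_4k:
        return "Runway Gen-4"
    if brief.needs_native_audio:
        return "Veo 3.1"
    if brief.social_cinematic:
        # Sora 2 only fits work that ships before the API window closes.
        return "Veo 3.1" if brief.long_term_pipeline else "Sora 2"
    if brief.clips_per_week >= 50:
        return "Kling 2.1"
    if brief.consistent_shots >= 10:
        return "Runway Gen-4"
    return "Veo 3.1"  # assumed fallback: the post's stated safest default


print(pick_model(Brief(clips_per_week=100)))
print(pick_model(Brief(social_cinematic=True, long_term_pipeline=False)))
```

Real briefs often match more than one bullet, so the rule order matters; reorder the checks to reflect whichever constraint is hardest for your team to compromise on.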

Pick by use case: which model wins for cinematic, ads, social, or budget

What's coming next: 2026 roadmap

Two confirmed releases, one credible rumor, and one open question are likely to shift this list before Q4 2026.

Kling 3.0 (released Feb 4, 2026) — confirmed

Kuaishou shipped Kling 3.0 on February 4, 2026 as the first unified multimodal video engine in the category: video, audio, and reference images processed in a single architecture rather than chained through separate models. Native 4K (3840×2160) at up to 60fps. Native audio in five languages including English. Multi-shot storyboarding with up to six cuts per generation. "Subject Binding 3.0" claims sub-10% character variation across the sequence. Outputs to professional EXR for color pipelines.

The real question is pricing. If 3.0 stays near 2.1's price point, the budget-tier story changes completely and Kling becomes a serious frontier-tier competitor. If it lands at 2x Kling 2.1's price, the dynamic stays roughly where it is.

Veo 3.1 Lite (April 2026) — confirmed

Google shipped Veo 3.1 Lite (their cost-effective tier) in April 2026. The pitch is Veo 3.1's quality at materially lower cost per second. Useful for volume workflows where the bar is finished-but-not-cinematic. First-pass impressions suggest reduced audio quality relative to full Veo 3.1, and visual fidelity closer than the price gap implies.

Runway Gen-5 — rumored

Runway has been hinting at a Gen-5 across 2025–26 with native audio and longer durations. No public release date as of May 2026. If they ship Gen-5 with the existing camera controls and add native audio, the gap to Veo 3.1 narrows considerably and Runway becomes a viable default for ad workflows it currently can't compete in. If they don't ship before Q4 2026 (and audio integration in a model trained without joint audio latents is non-trivial), the strategic position weakens.

Sora 3 / OpenAI's next move — speculative

OpenAI's official statement framed the Sora shutdown as a strategic refocus, not a research dead-end. The video team is presumably still working on something. Whether that surfaces as Sora 3, a different product line, or integration into ChatGPT proper is unknown. The Disney partnership reversal complicates the IP-licensing path Sora 2 was apparently designed around. Don't bet on a Sora 3 in 2026; if it ships, that's upside.

What's coming through Q4 2026 across the four model families

What we'd actually do: three real briefs

Three briefs we get versions of every month, and what we'd pick.

30-second DTC supplement video. Lifestyle spokesperson shot + product cut + brand mark. We'd run the spokesperson and the product shot in Veo 3.1 (native audio + reference-image guidance for face/product consistency), brand mark in motion graphics outside the AI tool. ~50 minutes to a finished asset, ~$3.50 in Vertex AI compute. If the brief needs sharper cinematography (premium positioning), we'd shoot the spokesperson in Runway Gen-4 and add audio post-hoc in a Lumigen timeline.

60-second B2B SaaS brand film. Eight cuts: office, product UI, three team members, exterior establishing. Runway Gen-4 for all visuals (character consistency across the three team members, 4K master, cinematographic quality). Audio post in a DAW. ~$80 in Runway credits at Pro tier, 1.5–2 days for a polished result. We wouldn't use Sora 2 here even though visuals would be slightly stronger; pipeline risk between now and September 2026 is too high for a brand-film commitment.

50 TikTok variants for a Shopify product. Same product, 50 different setups, throwaway iteration. Kling 2.1 Standard for the first-pass batch ($10 for 50 overnight clips), keep the 10 best, and regenerate in Veo 3.1 Fast any of the rejected 40 that need cleaner output. ~$25–40 total, 2–3 hours of curation. This is the workflow Kling was designed for, and the workflow where its English-prompt limitations matter least: the prompts are descriptive, not cinematographic.

Bottom line

In May 2026, the four-way comparison this post was originally framed around has narrowed into a three-way decision for any team building a pipeline that runs past September. Veo 3.1 is the safest default: predictable access, native audio, sane pricing. Runway Gen-4 wins cinematic shot work. Kling 2.1 owns the budget tier. Sora 2 is an excellent tactical tool for the next four months, then it's gone.

If you're a one-person creator or small team, the simplest path is a multi-model workspace. Lumigen routes between Veo, Runway, and Kling in one prompt box, which means you don't have to commit to a single vendor relationship and you can switch models mid-shot when the brief calls for it. That's increasingly how production teams are working in 2026: the right model for the shot, not a single-model contract.

We re-run this comparison every quarter; model versions move fast, shutdowns happen, new entrants ship. The version of this post you're reading is dated May 2026; check back in August for the next refresh.

Tested April–May 2026. Pricing verified against official provider pages at time of writing.

Written by Vlad

Founder of Lumigen. Has shipped tens of thousands of generations across Sora 2, Veo 3.1, Runway Gen-4, and Kling 3.0 — and edits everything published here against that hands-on test bed.
