To clone a viral TikTok-style UGC video, the first thing you need is a multimodal model that can quickly and accurately break down the source video's shot structure.
In March 2026, I tested four mainstream video-capable models on the same clip to see how well they could identify each shot's structure, including dialogue, scene setup, background, and audio cues.
Original video
For this test, I used a TikTok UGC video for a nasal spray product. The clip contains three main shots:
- A man standing indoors holding both a medical bill and a nasal spray bottle.
- The same person sitting in a car and introducing the product.
- A first-person product close-up showing the bottle in detail.
All four models were tested through OpenRouter; a minimal request sketch follows the list:
- Healer Alpha
- ByteDance Seed: Seed-2.0-Lite
- Qwen3.5 Plus
- Google: Gemini 3.1 Pro
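To make the setup concrete, here is a minimal sketch of how a run like this can be issued through OpenRouter's OpenAI-compatible chat endpoint. The model slug, video URL, and the video content part are all placeholders and assumptions (multimodal payload formats vary by provider), so treat this as a shape rather than a drop-in script.

```python
# Minimal sketch: send the shot-analysis prompt to one model through
# OpenRouter's OpenAI-compatible endpoint. MODEL_SLUG and VIDEO_URL are
# placeholders, and the "video_url" content part is an assumption;
# how a video is attached varies by provider, so check the docs for
# the model you are targeting.
import os
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

MODEL_SLUG = "google/gemini-3.1-pro"  # placeholder; look up the real slug
VIDEO_URL = "https://example.com/nasal-spray-ugc.mp4"  # hypothetical clip
PROMPT = "..."  # the full test prompt from the section below

response = client.chat.completions.create(
    model=MODEL_SLUG,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            # Assumed content-part shape for video input; some models
            # expect an uploaded file or a different part type instead.
            {"type": "video_url", "video_url": {"url": VIDEO_URL}},
        ],
    }],
)

shots = json.loads(response.choices[0].message.content)
print(json.dumps(shots, indent=2))
```

Swapping models is then just a matter of changing MODEL_SLUG and re-running the same prompt against the same clip.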
Test prompt
You are an expert in video language analysis and cinematography. Your task is to analyze the shot structure of the provided video and output a JSON array describing each shot.
Instructions
- Segment the video into shots. A shot is defined as a continuous sequence captured without a cut.
- For each shot, extract the following information.
- Output ONLY a JSON array. Do not include explanations, markdown, or additional text.
- All content must be in English.
- If information is uncertain or not observable, use null.
JSON Schema
Each element in the array represents one shot with the following fields:
- shot_id: Sequential number of the shot, starting from 1.
- subject: The main object, person, animal, or scene that appears in the shot.
- context_environment: The background, setting, or location where the subject exists.
- action: What the subject is doing (e.g., walking, turning head, typing, waving).
- style: The visual or artistic style of the video (e.g., cinematic, documentary, cartoon, futuristic, vintage, black-and-white).
- camera_motion_positioning: Camera movement or viewpoint (e.g., static, tracking shot, pan left, tilt up, aerial view, over-the-shoulder).
- composition: Shot framing or composition (e.g., close-up, medium shot, wide shot, extreme close-up, low-angle shot).
- ambiance_colour_lighting: Lighting, color tone, atmosphere, time of day, and emotional tone.
- audio: Description of audio elements (e.g., background music, ambient noise, footsteps, silence).
- dialog: Any spoken dialogue by characters. If no speech is present, return null.
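For reference, here is what a single element of a well-formed response could look like for the first shot of the test clip. These values are illustrative, written by hand to match the schema; they are not any model's actual output.

```json
[
  {
    "shot_id": 1,
    "subject": "A man holding a medical bill and a nasal spray bottle",
    "context_environment": "Indoors, likely a home living space",
    "action": "Standing and holding both items up toward the camera",
    "style": "UGC, handheld smartphone footage",
    "camera_motion_positioning": "Static, roughly eye-level",
    "composition": "Medium shot",
    "ambiance_colour_lighting": "Soft indoor lighting, neutral tones, casual mood",
    "audio": "Speaker's voice over light room ambience",
    "dialog": null
  }
]
```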
Healer Alpha
Healer Alpha delivered what was essentially an A-plus answer: one of the cleanest and most complete results in the test, with little room left for the other models to improve on it. Its grasp of wardrobe details, presentation style, and shot intent was consistently sharp. We still do not know the model's real identity, but based on this result alone, it is clearly a serious contender.
ByteDance Seed-2.0-Lite
ByteDance Seed-2.0-Lite performed in line with what you would expect from a Lite model. It correctly identified the three core shots and did a solid job tracking the character and setting. That said, its dialogue transcription still has plenty of room to improve.
Qwen3.5 Plus
Qwen3.5 Plus was steady and reliable. It also returned three shots, and its visual understanding was good enough to pass the test comfortably. From the character's clothing to the overall meaning of the video, its interpretation was basically correct. Dialogue extraction, however, is still not one of its strongest areas.
Gemini 3.1 Pro
The last model in the lineup was Gemini 3.1 Pro, and it was in a different class. It captured nearly every important element in the video with striking precision, from the speaker's clothing and recording style to the dialogue itself. Its transcript quality in particular was far ahead of the rest. In this test, Gemini still looked like the dominant force in multimodal understanding.
Conclusion
After this simple head-to-head test, the conclusion is fairly clear: Gemini remains the strongest multimodal system overall, especially for precise dialogue extraction and shot-level understanding. That said, models like Qwen and Seed are already getting close enough to be useful in real production workflows.
If you are building AI video tools or designing your own automated content pipeline, this comparison should give you a practical sense of what each model is good at today.
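One practical note if you go that route: model output is rarely safe to consume blindly. Even with an output-format instruction, models sometimes wrap the array in markdown fences or drop fields, so a small validation step pays off. Below is a minimal sketch, assuming the field names from the schema above; the fence-stripping heuristic is a guess at a common failure mode, not a guaranteed fix.

```python
# Minimal sketch: validate a model's shot-structure output before it
# feeds anything downstream. The field list comes from the schema above;
# the fence-stripping step is an assumption about a common failure mode
# (models wrapping the array in markdown fences despite the prompt).
import json

REQUIRED_FIELDS = {
    "shot_id", "subject", "context_environment", "action", "style",
    "camera_motion_positioning", "composition",
    "ambiance_colour_lighting", "audio", "dialog",
}

def parse_shots(raw: str) -> list[dict]:
    text = raw.strip()
    if text.startswith("```"):
        # Strip a markdown fence if the model ignored the JSON-only rule.
        text = text.strip("`").removeprefix("json").strip()
    shots = json.loads(text)
    if not isinstance(shots, list):
        raise ValueError("expected a JSON array of shots")
    for shot in shots:
        missing = REQUIRED_FIELDS - shot.keys()
        if missing:
            raise ValueError(f"shot {shot.get('shot_id')} is missing {missing}")
    return shots
```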
If you want to try this kind of workflow yourself, Flowtra's latest agent mode already supports cloning viral TikTok videos, swapping models, and more, powered by the latest Kling 3.0 generation stack.
