To clone a viral TikTok-style UGC video, the first thing you need is a multimodal model that can quickly and accurately break down the source video''s shot structure.

As of March 2026, I tested four mainstream video-capable models on the exact same clip to see how well they could identify each shot''s structure, including dialogue, scene setup, background, and audio cues.

Original video

For this test, I used a TikTok UGC video for a nasal spray product. The clip contains three main shots:

A man standing indoors holding both a medical bill and a nasal spray bottle.
The same person sitting in a car and introducing the product.
A first-person product close-up showing the bottle in detail.

All four models were tested through OpenRouter:

Healer Alpha
ByteDance Seed: Seed-2.0-Lite
Qwen3.5 Plus
Google: Gemini 3.1 Pro

Test prompt

You are an expert in video language analysis and cinematography.

Your task is to analyze the shot structure of the provided video and output a JSON array describing each shot.

Instructions
	1.	Segment the video into shots.
A shot is defined as a continuous sequence captured without a cut.
	2.	For each shot, extract the following information.
	3.	Output ONLY a JSON array.
Do not include explanations, markdown, or additional text.
	4.	All content must be in English.
	5.	If information is uncertain or not observable, use null.

⸻

JSON Schema

Each element in the array represents one shot.

{
  "shot_id": integer,
  "subject": string,
  "context_environment": string,
  "action": string,
  "style": string,
  "camera_motion_positioning": string,
  "composition": string,
  "ambiance_colour_lighting": string,
  "audio": string,
  "dialog": string
}


⸻

Field Definitions

shot_id
Sequential number of the shot starting from 1.

subject
The main object, person, animal, or scene that appears in the shot.

context_environment
The background, setting, or location where the subject exists.

action
What the subject is doing (e.g., walking, turning head, typing, waving).

style
The visual or artistic style of the video (e.g., cinematic, documentary, cartoon, futuristic, vintage, black-and-white).

camera_motion_positioning
Camera movement or viewpoint (e.g., static, tracking shot, pan left, tilt up, aerial view, over-the-shoulder).

composition
Shot framing or composition (e.g., close-up, medium shot, wide shot, extreme close-up, low-angle shot).

ambiance_colour_lighting
Lighting, color tone, atmosphere, time of day, and emotional tone (e.g., warm sunset light, neon cyberpunk lighting, gloomy blue tones).

audio
Description of audio elements (e.g., background music, ambient noise, footsteps, silence).

dialog
Any spoken dialogue by characters. If no speech is present, return null.

⸻

Example Output

[
  {
    "shot_id": 1,
    "subject": "a young woman with a laptop",
    "context_environment": "modern coffee shop interior",
    "action": "typing on a laptop while looking at the screen",
    "style": "cinematic realistic",
    "camera_motion_positioning": "static camera at eye level",
    "composition": "medium shot",
    "ambiance_colour_lighting": "warm indoor lighting with soft shadows",
    "audio": "light background music and quiet cafe chatter",
    "dialog": null
  },
  {
    "shot_id": 2,
    "subject": "barista preparing coffee",
    "context_environment": "coffee counter area",
    "action": "pouring milk into a cup of coffee",
    "style": "cinematic realistic",
    "camera_motion_positioning": "slow push-in camera movement",
    "composition": "close-up",
    "ambiance_colour_lighting": "warm tones with soft morning light",
    "audio": "sound of milk steaming and coffee machine",
    "dialog": null
  }
]

Healer Alpha

Healer Alpha delivered what was essentially an A-plus answer. It was the cleanest and most complete result in the test, leaving very little room for the other models to outperform it. Its grasp of wardrobe details, presentation style, and shot intent was consistently sharp. We still do not know the model''s real identity, but based on this result alone, it is clearly a serious contender.

[
  {
    "shot_id": 1,
    "subject": "middle-aged man wearing a plaid flannel shirt",
    "context_environment": "living room with Christmas tree, stockings, and holiday decorations in background",
    "action": "speaking directly to camera while holding a small bottle in one hand and a medical bill in the other",
    "style": "direct-to-camera commercial testimonial",
    "camera_motion_positioning": "static camera at eye level",
    "composition": "medium shot from waist up",
    "ambiance_colour_lighting": "bright natural daylight from windows, warm indoor atmosphere",
    "audio": "man speaking clearly with no background music",
    "dialog": "The math is simple. Spend $20 now or risk a $30,000 bill later."
  },
  {
    "shot_id": 2,
    "subject": "same man in plaid shirt",
    "context_environment": "inside a parked car",
    "action": "spraying nasal spray into his right nostril",
    "style": "direct-to-camera commercial testimonial",
    "camera_motion_positioning": "static camera mounted inside vehicle",
    "composition": "close-up on face",
    "ambiance_colour_lighting": "natural daylight through car windows",
    "audio": "sound of nasal spray being administered",
    "dialog": null
  },
  {
    "shot_id": 3,
    "subject": "man holding Nasomin nasal spray bottle",
    "context_environment": "inside a parked car",
    "action": "presenting product to camera while speaking enthusiastically",
    "style": "direct-to-camera commercial testimonial",
    "camera_motion_positioning": "static camera, slight handheld feel",
    "composition": "close-up on face and product",
    "ambiance_colour_lighting": "natural daylight through car windows",
    "audio": "man speaking",
    "dialog": "Give your nose a daily power wash with Nasomin. It''s a natural iodine spray, not a drug."
  },
  {
    "shot_id": 4,
    "subject": "hand holding Nasomin Nasal Iodine Spray bottle",
    "context_environment": "car interior with steering wheel and dashboard visible in background",
    "action": "holding and slightly rotating bottle to display label",
    "style": "product insert shot for commercial",
    "camera_motion_positioning": "static point-of-view shot looking down at hand",
    "composition": "extreme close-up of product",
    "ambiance_colour_lighting": "natural daylight, clean product lighting",
    "audio": "male voiceover",
    "dialog": "Nasomin uses 1% iodine to wash away impurities before they become a problem. Smart prevention costs $20. Get yours now."
  }
]

ByteDance Seed-2.0-Lite

ByteDance Seed-2.0-Lite performed in line with what you would expect from a Lite model. It correctly identified the three core shots and did a solid job tracking the character and setting. That said, it still has a lot of room to improve when it comes to accurately transcribing dialogue.

[
  {
    "shot_id": 1,
    "subject": "A man in a plaid flannel shirt",
    "context_environment": "Christmas-decorated residential living room, featuring a small decorated Christmas tree and holiday stockings hung on a fireplace mantel",
    "action": "Holding a nasal spray bottle and a medical document, presenting both to the camera while speaking",
    "style": "realistic vlog style",
    "camera_motion_positioning": "static eye-level camera",
    "composition": "medium full shot",
    "ambiance_colour_lighting": "warm natural indoor daytime lighting with soft neutral tones",
    "audio": "man''s speaking voice, quiet ambient home background noise",
    "dialog": "This $20 nasal spray is an alternative to the $30000 medical expense shown on the document"
  },
  {
    "shot_id": 2,
    "subject": "The same man in a plaid flannel shirt",
    "context_environment": "Driver''s seat of a personal car, with car interior dashboard and seats visible",
    "action": "Uses the nasal spray, then holds the bottle up to the camera to explain its purpose",
    "style": "realistic vlog style",
    "camera_motion_positioning": "static eye-level camera",
    "composition": "medium close-up",
    "ambiance_colour_lighting": "bright natural daylight filtering through car windows",
    "audio": "man''s speaking voice, soft ambient outdoor traffic noise",
    "dialog": "This Nasomin nasal iodine spray works as a natural nasal cleanse to keep your nose healthy"
  },
  {
    "shot_id": 3,
    "subject": "Triquetra Nasomin Nasal Iodine Spray bottle",
    "context_environment": "Driver''s seat of a car, with steering wheel and car dashboard in the background",
    "action": "Held up to the camera to clearly display its full product label",
    "style": "realistic product showcase vlog style",
    "camera_motion_positioning": "static camera",
    "composition": "extreme close-up",
    "ambiance_colour_lighting": "bright natural daylight illuminating the product label",
    "audio": "man''s voiceover continuing to explain the product, soft ambient background noise",
    "dialog": "This spray naturally cleanses your nose as a mineral nasal cleanse, affordable and effective"
  }
]

Qwen3.5 Plus

Qwen3.5 Plus was steady and reliable. It also returned three shots, and its visual understanding was good enough to pass the test comfortably. From the character''s clothing to the overall meaning of the video, its interpretation was basically correct. Dialogue extraction, however, is still not one of its strongest areas.

[
  {
    "shot_id": 1,
    "subject": "man in plaid shirt holding a spray bottle and a document",
    "context_environment": "living room with Christmas tree and stockings",
    "action": "speaking to camera while comparing a 20 dollar product to a 30000 dollar bill",
    "style": "social media promotional video",
    "camera_motion_positioning": "static camera at eye level",
    "composition": "medium shot",
    "ambiance_colour_lighting": "bright natural indoor lighting",
    "audio": "man speaking directly to audience",
    "dialog": "comparing costs of nasal spray versus medical bills"
  },
  {
    "shot_id": 2,
    "subject": "man in car driver seat",
    "context_environment": "interior of a vehicle",
    "action": "spraying nasal iodine into nose and holding bottle up to camera",
    "style": "vlog style selfie video",
    "camera_motion_positioning": "handheld selfie angle",
    "composition": "close-up",
    "ambiance_colour_lighting": "natural daylight through car windows",
    "audio": "man speaking about the product",
    "dialog": "explaining the benefits of the nasal spray"
  },
  {
    "shot_id": 3,
    "subject": "hand holding Nasomin Nasal Iodine Spray bottle",
    "context_environment": "car interior with steering wheel and dashboard visible",
    "action": "holding the product steady for display",
    "style": "product showcase",
    "camera_motion_positioning": "static point of view shot",
    "composition": "extreme close-up",
    "ambiance_colour_lighting": "clear daylight highlighting the label",
    "audio": "background audio continues or silence",
    "dialog": null
  }
]

Gemini 3.1 Pro

The last model in the lineup was Gemini 3.1 Pro, and it was in a different class. It captured nearly every important element in the video with striking precision, from the speaker''s clothing and recording style to the dialogue itself. Its transcript quality in particular was far ahead of the rest. In this test, Gemini still looked like the dominant force in multimodal understanding.

[
  {
    "shot_id": 1,
    "subject": "A man wearing a plaid shirt holding a nasal spray bottle and a document",
    "context_environment": "A living room decorated for Christmas with a tree and stockings by a fireplace",
    "action": "Speaking enthusiastically to the camera while presenting the bottle and the document, with text overlays showing prices",
    "style": "Promotional vlog style",
    "camera_motion_positioning": "Static camera at eye level",
    "composition": "Medium shot",
    "ambiance_colour_lighting": "Bright, natural daytime lighting with a warm and festive atmosphere",
    "audio": "Spoken dialogue",
    "dialog": "The math is simple. Spend $20 now, or risk a $30,000 bill later."
  },
  {
    "shot_id": 2,
    "subject": "The same man in a plaid shirt holding the nasal spray bottle",
    "context_environment": "Inside a car, sitting in the driver''s seat",
    "action": "Demonstrating how to use the nasal spray by spraying it into his nose, then holding the bottle up and pointing to it while talking",
    "style": "Promotional vlog style",
    "camera_motion_positioning": "Static camera, slight low angle",
    "composition": "Close-up",
    "ambiance_colour_lighting": "Bright, natural daylight",
    "audio": "Sound of the spray bottle being used, followed by spoken dialogue",
    "dialog": "Give your nose a daily power wash with Nasomin. It''s a natural iodine spray, not a drug."
  },
  {
    "shot_id": 3,
    "subject": "A hand holding a bottle of Nasomin Nasal Iodine Spray",
    "context_environment": "Inside a car, with the steering wheel and dashboard forming the background",
    "action": "Holding the product bottle still for the camera to read the label",
    "style": "Product showcase",
    "camera_motion_positioning": "Static camera, first-person perspective",
    "composition": "Close-up",
    "ambiance_colour_lighting": "Natural daylight",
    "audio": "Spoken voiceover covering the shot",
    "dialog": "Nasomin uses 1% iodine to wash away impurities before they become a problem. Smart prevention costs $20. Get yours now."
  }
]

Conclusion

After this simple head-to-head test, the conclusion is fairly clear: Gemini remains the strongest multimodal system overall, especially for precise dialogue extraction and shot-level understanding. That said, models like Qwen and Seed are already getting close enough to be useful in real production workflows.

If you are building AI video tools or designing your own automated content pipeline, this comparison should give you a practical sense of what each model is good at today.

If you want to try this kind of workflow yourself, Flowtra's latest agent mode already supports cloning viral TikTok videos, swapping models, and more, powered by the latest Kling 3.0 generation stack.