Is AI Avatar Livestream Selling Outdated?
As of September 2025, you have probably heard plenty about AI avatar livestream selling and other AI-generated product-introduction videos.
Does this model actually drive sales growth and cut costs? Let's take a look at its latest market performance.
1. Rapid Market Expansion in Developed Regions
AI avatar selling is expanding rapidly in overseas markets, especially in developed regions such as North America and Europe. The global AI avatar market was estimated at approximately $2 billion in 2024 and is projected to grow to $15 billion by 2028, a compound annual growth rate (CAGR) of over 40%.
2. Cost Advantages Win Over Brands
AI avatar livestream selling is winning over more brands because it costs little and removes concerns about human streamers' fatigue or negative public incidents. For instance, JD.com (a leading Chinese e-commerce platform) ran livestreams fronted by an AI avatar of its founder, achieving sales of tens of millions of RMB.
3. Enhanced User Experience Driven by Advanced Tech
With the continuous advancement of large language models (LLMs) and multimodal AI capabilities, AI avatars have become more realistic, with significantly improved interactivity and personalized experiences. They are well-suited for long-duration livestreams and multilingual broadcasts, effectively boosting user conversion rates and brand influence.
Flowtra AI's latest AI avatar selling features, together with the AI avatar ad creation workflow video published in September by YouTube creator RoboNuggets (https://www.youtube.com/watch?v=KrdthDfS-KI&t=65s), show that this remains a cost-effective way to promote products. Now let's dive into the core of this workflow and explore how to craft prompts that get AI to generate professional-grade product introduction scripts.
Before we start, let’s quickly understand how Flowtra AI modifies and automates the n8n workflow from the video. You can first refer to the workflow diagram below:
The key steps involve letting AI complete basic character and product recognition. The biggest challenge, however, lies in getting AI to generate appropriate scripts and realistic backgrounds for the avatar’s dialogue.
Step 1: Analyze the Product and Character
In the original video, the creator mentions using image editing software (e.g., Photoshop) to combine the character and product images. This step is actually unnecessary: OpenRouter's multimodal LLM API accepts multiple image inputs per request, so we can skip it. We also need to optimize the original prompt, which was designed for generating ads in multiple styles. The revised prompt focuses on having the AI first identify whether the image depicts a product, a character, or both:
```
Analyze the given image and determine if it primarily depicts a product or a character, or BOTH.

- If the image is of a product, return the analysis in JSON format with the following fields:

{
  "type": "product",
  "brand_name": "(Name of the brand shown in the image, if visible or inferable)",
  "color_scheme": [
    {
      "hex": "(Hex code of each prominent color used)",
      "name": "(Descriptive name of the color)"
    }
  ],
  "font_style": "(Describe the font family or style used: serif/sans-serif, bold/thin, etc.)",
  "visual_description": "(A full sentence or two summarizing what is seen in the image, ignoring the background)"
}

- If the image is of a character, return the analysis in JSON format with the following fields:

{
  "type": "character",
  "outfit_style": "(Description of clothing style, accessories, or notable features)",
  "visual_description": "(A full sentence or two summarizing what the character looks like, ignoring the background)"
}

- If it is BOTH, return both descriptions in JSON format:

{
  "type": "both",
  "product": {
    "brand_name": "...",
    "color_scheme": [...],
    "font_style": "...",
    "visual_description": "..."
  },
  "character": {
    "outfit_style": "...",
    "visual_description": "..."
  }
}
```
For this specific feature, the first two options (single product or single character) are irrelevant—we only need to retain the "BOTH" option. The analysis will extract:
- For the product: Brand name, color scheme, and font style.
- For the character: Outfit style and appearance description.
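Because OpenRouter's chat completions endpoint is OpenAI-compatible, both images can be attached to a single request. Here is a minimal TypeScript sketch of that call; the model ID, image URLs, and variable names are illustrative assumptions, not values taken from the original workflow:

```typescript
// Minimal sketch of Step 1: one multimodal call that analyzes both images at once.
// Assumes OPENROUTER_API_KEY is set; model ID and image URLs are placeholders.
const analysisPrompt = "Analyze the given image and determine..."; // full prompt from above
const characterImageUrl = "https://example.com/character.jpg";
const productImageUrl = "https://example.com/product.jpg";

const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "google/gemini-2.5-flash", // any multimodal model on OpenRouter works
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: analysisPrompt },
          { type: "image_url", image_url: { url: characterImageUrl } },
          { type: "image_url", image_url: { url: productImageUrl } },
        ],
      },
    ],
  }),
});

const analysis = (await res.json()).choices[0].message.content;
```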
AI handles this task easily. For example, here’s the analysis result when identifying an image of a woman and a red necklace:
```json
{
  "type": "both",
  "product": {
    "brand_name": "N/A",
    "font_style": "N/A",
    "color_scheme": [
      { "hex": "#B22222", "name": "Crimson Red" },
      { "hex": "#D3D3D3", "name": "Light Gray" },
      { "hex": "#FFFFFF", "name": "White" }
    ],
    "visual_description": "A silver necklace with a large, oval-shaped crimson red gemstone surrounded by smaller white gemstones is displayed on a white mannequin bust. The necklace has a delicate silver chain and a smaller cluster of white gemstones above the main pendant."
  },
  "character": {
    "outfit_style": "Casual and comfortable, wearing a white sweatshirt with a light beige or cream-colored jacket or cardigan over it. She has small stud earrings.",
    "visual_description": "A young woman with fair skin, dark brown hair with lighter brown highlights, and delicate features is looking directly at the camera. She has a gentle expression and light pink lipstick."
  }
}
```
It’s clear that AI demonstrates sharp color perception and provides detailed descriptions of the character’s outfit.
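For downstream workflow nodes, it helps to pin this result down with types. Here is a sketch of the "both" schema as TypeScript interfaces; the field names mirror the prompt above, while the interface names are ours:

```typescript
// Shape of the "both" analysis result, mirroring the prompt's JSON schema.
interface ColorSwatch {
  hex: string;  // e.g. "#B22222"
  name: string; // e.g. "Crimson Red"
}

interface ProductAnalysis {
  brand_name: string;
  color_scheme: ColorSwatch[];
  font_style: string;
  visual_description: string;
}

interface CharacterAnalysis {
  outfit_style: string;
  visual_description: string;
}

interface BothAnalysis {
  type: "both";
  product: ProductAnalysis;
  character: CharacterAnalysis;
}
```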
Step 2: Generate Image and Video Scripts (The Most Critical Step)
Next, we need to clarify AI’s task and guide it to use the analysis results from the previous step. Below is the prompt we used to generate image and video scripts for the "woman + red necklace" example:
```
Your task: Create 1 image prompt and {number of videos} video prompts as guided by your system guidelines. Scene 0 will be the image prompt, and Scenes 1 onward will be the video prompts.

***

Descriptions of the reference images are given below. Most likely these are descriptions of a character advertising the product, or of the product to be advertised itself.

{result from image analysis}
```
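In the workflow, the two placeholders are filled in from the earlier steps. A hypothetical helper to illustrate (the function and parameter names are ours, not n8n node names):

```typescript
// Hypothetical helper: fills the {number of videos} and {result from image
// analysis} placeholders in the user prompt shown above.
function buildUserPrompt(videoCount: number, analysisJson: string): string {
  return [
    `Your task: Create 1 image prompt and ${videoCount} video prompts as guided by your system guidelines. Scene 0 will be the image prompt, and Scenes 1 onward will be the video prompts.`,
    "***",
    "Descriptions of the reference images are given below. Most likely these are descriptions of a character advertising the product, or of the product to be advertised itself.",
    analysisJson,
  ].join("\n\n");
}
```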
We then added a system prompt to make sure the AI understands the output requirements: Scene 0 carries the image prompt and Scenes 1+ carry the video prompts. The scripts must adhere to the "UGC-style casual realism" principle:
- Everyday, relatable environments.
- Amateur iPhone shooting style.
- Natural lighting and slightly imperfect framing.
- Candid poses and genuine expressions.
Additionally, the character’s accent must remain consistent across video scenes.
In short, the goal is to transform a product into a series of structured, realistic image and video prompts that mimic content created by influencers or content creators.
### System Prompt
UGC Image + Video Prompt Generator 🎥🖼️
Have Scene 0 as the image prompt
and Scenes 1 onward are the video prompts
At the beginning of each image prompt, use this prefix, replacing (verb) with a word appropriate to the product (e.g., show it to the camera, wear it on camera): "Take the product in the image and have the character (verb) it to the camera. Place them at the center of the image with both the product and character visible"
-----
If the user wants UGC authentic casual content: use **UGC-style casual realism** as instructed below, unless the user specifies otherwise.
If the user explicitly requests a different style or a more professional setting, follow their instructions.
-----
### Ask
Your task: Take the reference image or the product in the reference image and place it into realistic, casual scenes as if captured by content creators or influencers.
Your task is to generate **both image and video prompts** based on the user’s request.
- Use the number of scenes explicitly stated by the user. If not specified, default to **2 scenes**.
- Output must be an array of scene objects, each containing:
- `scene`: A number starting from 0 and incrementing by 1
Have Scene 0 as the image prompt
and Scenes 1 onward are the video prompts
- `prompt`: A JSON object describing the scene
-----
### Guidance
- Always follow **UGC-style casual realism** principles unless the user asks otherwise:
  - Everyday realism with authentic, relatable environments
  - Amateur iPhone photo/video style
  - Slightly imperfect framing and natural lighting
  - Candid poses, genuine expressions
  - Imperfections allowed (e.g., blemishes, uneven skin) unless otherwise specified by the user or unless the image reference shows otherwise
- **Camera parameter** must include casual descriptors:
- "amateur iPhone selfie", "uneven framing", etc.
- have the camera movement be fixed unless otherwise stated
- If dialogue is needed:
- Use the EXACT dialogue from the script description given by the user if it looks like dialogue. Note that each scene is only 8 seconds long, so split the dialogue across Scene 1 and onward if it's too long
- But if the user asks you to think of the dialogue - keep it casual, spontaneous, under 150 characters
- Natural tone (as if talking to a friend)
- Avoid formal, salesy, or scripted language
- Use ellipses (...) to signal pauses
- Describe the character's accent and style of voice in every video prompt, and reuse that voice description to keep it consistent across scenes
- prefix the video prompt with: "dialogue, the character in the video says:"
- Default age range: 21 to 38 unless stated otherwise by the user
- **Avoid**:
- Using double quotes inside prompts
- Mentioning copyrighted characters
- Generating more or fewer scenes than requested
- Overly describing the product in the image prompt and video prompts. If you need to describe it, just say to refer to the reference image provided
- For the video prompt, avoid having the character use the product unless otherwise stated
- For the dialogue, avoid having words in all caps
- For the video prompts, don't refer back to previous scenes
-----
### Examples
good_example:
  - |
    {
      "scene": 0,
      "prompt": {
        "action": "character holds product casually",
        "character": "inferred from image",
        "product": "the product in the reference image",
        "setting": "casual everyday environment, such as a kitchen or car",
        "camera": "amateur iPhone selfie, slightly uneven framing, casual vibe",
        "style": "UGC, unfiltered, realistic"
      }
    },
    {
      "scene": 1,
      "prompt": {
        "video_prompt": "dialogue, the character in the video says: this stuff's actually pretty good... and it's got protein too",
        "voice_type": "Australian accent, deep female voice",
        "emotion": "chill, upbeat",
        "setting": "car, parked",
        "camera": "amateur iPhone selfie video",
        "camera movement": "fixed"
      }
    }
and so on depending on how many scenes
### N - Notation:
Format: JSON
Example Output:
```json
{
  "scenes": [
    {
      "scene": 0,
      "prompt": "image prompt as a JSON object"
    },
    {
      "scene": 1,
      "prompt": "video prompt as a JSON object"
    },
    {
      "scene": 2,
      "prompt": "video prompt as a JSON object"
    }
  ]
}
```
and so on, depending on how many scenes the user needs
Generated Results
Here’s the final output from AI:
```json
{
  "scenes": [
    {
      "scene": 0,
      "prompt": {
        "style": "UGC, unfiltered, realistic",
        "action": "character holds product casually",
        "camera": "amateur iPhone selfie, slightly uneven framing, casual vibe",
        "product": "silver necklace with a large, oval-shaped crimson red gemstone surrounded by smaller white gemstones",
        "setting": "brightly lit living room",
        "character": "young woman with fair skin, dark brown hair with lighter brown highlights, wearing a white sweatshirt and light beige jacket"
      }
    },
    {
      "scene": 1,
      "prompt": {
        "camera": "amateur iPhone selfie video",
        "emotion": "chill, upbeat",
        "setting": "brightly lit living room",
        "voice_type": "Australian accent, deep male voice",
        "video_prompt": "dialogue, the character in the video says: This necklace is unreal! The red stone is so vibrant, it goes with everything.",
        "camera_movement": "fixed"
      }
    }
  ]
}
```
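From here, the workflow splits this array: Scene 0 goes to the image model and Scenes 1+ go to the video model. A minimal TypeScript sketch of that routing step (the interface mirrors the output above; the function name is ours):

```typescript
// Route generated scenes: scene 0 is the image prompt, scenes 1+ are video prompts.
interface Scene {
  scene: number;
  prompt: Record<string, unknown>;
}

function routeScenes(output: { scenes: Scene[] }) {
  const imagePrompt = output.scenes.find((s) => s.scene === 0);
  const videoPrompts = output.scenes
    .filter((s) => s.scene >= 1)
    .sort((a, b) => a.scene - b.scene); // keep clips in narrative order
  return { imagePrompt, videoPrompts };
}
```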
Performance of the Generated Image and Video
- Image Quality: The necklace blends naturally with the character's neck, and both the character and the product stay highly faithful to the reference images. The character's clothing and background also look realistic:
- Video Quality: The character's facial expressions transition smoothly. There is a slight discrepancy in the voice, but this can be resolved with further post-processing:
We used Veo 3, currently one of the leading video AI models, which generates 8-second clips at a time. If you need longer videos, simply specify the number of clips in your prompt and merge them with fal.ai's merge-videos tool (https://fal.ai/models/fal-ai/ffmpeg-api/merge-videos/api).
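A minimal sketch of that merge step using fal.ai's JavaScript client; we're assuming the endpoint's `video_urls` input field, and the clip URLs are placeholders, so check the linked API page before relying on this:

```typescript
// Merge the 8-second Veo 3 clips into one video via fal.ai's ffmpeg endpoint.
// Assumes FAL_KEY is set in the environment; clip URLs are placeholders.
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/ffmpeg-api/merge-videos", {
  input: {
    video_urls: [
      "https://example.com/scene1.mp4",
      "https://example.com/scene2.mp4",
    ],
  },
});

console.log(result.data); // contains the merged video's URL
```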
Thank you for reading! Flowtra AI is an AI advertising platform specifically designed for small businesses, serving Etsy/Shopify sellers and physical stores such as bars and pizzerias.