Alibaba Unveils Wan 2.5-Preview: Unified AI for Synchronized Text, Image, Video, and Audio Creation
Alibaba today announced Wan 2.5-Preview, a natively multimodal model that brings understanding and generation across text, images, video, and audio into one architecture. Available in public preview through Alibaba Cloud Model Studio's text-to-video and image-to-video endpoints, the release centers on joint multimodal training and reinforcement learning from human feedback (RLHF) to improve instruction following, visual fidelity, and synchronized audio-visual output.

Native multimodality as the core design
Unlike systems that connect separate models for different media types, Wan 2.5-Preview uses a unified backbone trained jointly on textual, auditory, and visual data.
One model, many modalities: the same system is designed to understand context across words, visuals, and sound, and generate them together with tight, consistent alignment.
For creators, marketers, and brand teams, that unification matters because it reduces friction between ideation and finished assets. With multimodal inputs such as text, reference images, and even audio, the model is positioned to keep timing, tone, and style coherent across everything it produces.
Video: synchronized audio-visual generation with cinematic control
Wan 2.5-Preview’s headline capability is native audio-visual generation. The model produces up to 10-second 1080p clips in a single pass with synchronized vocals, music, and sound effects aligned to on-screen motion and scene changes.
- Audio built-in: support for narration, multi-person vocals, ambient sound, and background music without manual post-sync.
- Multimodal prompting: guide a clip’s look, pacing, and soundtrack with text, reference images, or audio stems in one workflow.
- Upgraded control system: director-style instructions for composition, motion, pacing, and cinematic camera moves.
- Structural stability: consistent motion and identity throughout each shot for more believable dynamics.
According to Alibaba Cloud’s preview documentation, the wan2.5-t2v-preview model supports standard HD outputs and synchronized audio via supplied files, aligning video to the provided soundtrack when present. Technical references for formats, durations, and size options are listed in the Model Studio documentation for text-to-video and image-to-video.
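For teams evaluating integration, the snippet below is a minimal sketch of submitting an asynchronous text-to-video request to Model Studio's HTTP API. It assumes a DashScope-style endpoint and request shape; the path, the `X-DashScope-Async` header, and fields such as `audio_url`, `size`, and `duration` are illustrative assumptions and should be confirmed against the text-to-video preview documentation.

```python
# Minimal sketch of an async text-to-video request to Alibaba Cloud Model Studio.
# Endpoint path, headers, and field names are assumptions for illustration only;
# confirm exact values in the wan2.5-t2v-preview documentation.
import os
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]          # Model Studio API key
BASE = "https://dashscope.aliyuncs.com/api/v1"     # region-specific base URL (assumed)

payload = {
    "model": "wan2.5-t2v-preview",                 # preview model named in the release
    "input": {
        "prompt": "A slow dolly shot through a rain-lit street market at dusk",
        # Optional reference audio; the model aligns video to a supplied soundtrack.
        # This field name is hypothetical.
        "audio_url": "https://example.com/temp-track.mp3",
    },
    "parameters": {
        "size": "1920*1080",   # 1080p output (exact format of this field is assumed)
        "duration": 10,        # seconds; the preview lists 5s and 10s options
    },
}

resp = requests.post(
    f"{BASE}/services/aigc/video-generation/video-synthesis",  # path is an assumption
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-DashScope-Async": "enable",   # video generation runs as an async task
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
resp.raise_for_status()
task_id = resp.json()["output"]["task_id"]
print("submitted task:", task_id)
```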
At-a-glance: Wan 2.5-Preview capabilities
| Category | Highlights |
|---|---|
| Architecture | Native multimodal backbone (text, image, video, audio) with RLHF alignment |
| Video output | Up to 10 seconds; resolutions including 1080p HD; strong dynamics and structural stability |
| Audio | Native synchronized generation; supports vocals, SFX, ambience, and music; alignment to provided audio |
| Inputs | Text prompts; reference images; optional audio input for A/V sync |
| Control | Cinematic composition and motion guidance; upgraded instruction-following for timing and camera moves |
| Image creation | Photorealism and diverse styles; creative typography; professional-grade charts |
| Image editing | Conversational editing with pixel-level precision; multi-concept fusion; material transformation; color swapping |
| Access | Preview models via Alibaba Cloud Model Studio; API endpoints for text-to-video and image-to-video |
Images: from photorealism to precise, instruction-based edits
Wan 2.5-Preview also advances image generation and editing with emphasis on instruction fidelity and fine control. The model is designed to handle photorealistic renders, illustration and graphic styles, creative text effects, and data visuals with professional polish.
- Conversational editing: modify images through natural language instructions with pixel-level accuracy.
- Multi-concept fusion: combine motifs, subjects, or attributes with fewer visual artifacts.
- Material and color control: change textures, finishes, and product colors for campaign variants or prototyping.
- Style breadth: from product photography and portraiture to illustration, infographic, and typographic art.
For brand builders and visual teams, this consolidates tasks that typically span multiple tools (drafting, style matching, and detailed retouching) into one model-driven flow, reducing iteration cycles while preserving intent.
Deep alignment: joint training plus RLHF
The preview centers on what Alibaba describes as deep alignment: training Wan 2.5 across paired text, audio, and visual datasets and refining with large-scale RLHF. The aim is to keep every modality in step with the others and with the creator’s direction.
- Coherence over time: frame-to-frame and beat-to-beat stability for motion and sound.
- Instruction fidelity: improved adherence to complex, nuanced prompts and scene directions.
- Quality lift: more natural video dynamics and sharper images versus prior modular approaches.
This focus is especially relevant to short-form video, where pacing, rhythm, and on-screen identity must remain consistent, and to branded visuals, where typography, color, and product details need to stay reliable across versions.
Preview access through Model Studio
Wan 2.5-Preview is accessible via Alibaba Cloud Model Studio, with endpoints for text-to-video and image-to-video. The image-to-video preview model (wan2.5-i2v-preview) brings the same synchronized audio capability to animation workflows, preserving key identity features from source frames while adding motion and sound. Documentation for both preview endpoints is available here: text-to-video and image-to-video.
In the preview phase, documentation for both models lists support for common HD sizes and 5- and 10-second durations, along with synchronized audio from provided files. These references outline accepted formats and parameters so teams can assess fit for prototyping, content tests, and early creative development within existing brand workflows.
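As a companion to those parameters, here is a minimal sketch of an image-to-video submission followed by task polling, under the same assumptions as the earlier text-to-video example: the endpoint path, field names such as `img_url` and `video_url`, and the status values are illustrative, not confirmed API details.

```python
# Minimal sketch of an image-to-video request plus task polling against a
# DashScope-style async API. Paths, fields, and status values are assumptions;
# check the wan2.5-i2v-preview documentation for authoritative parameters.
import os
import time
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]
BASE = "https://dashscope.aliyuncs.com/api/v1"     # assumed base URL

submit = requests.post(
    f"{BASE}/services/aigc/video-generation/video-synthesis",  # assumed path
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-DashScope-Async": "enable",
        "Content-Type": "application/json",
    },
    json={
        "model": "wan2.5-i2v-preview",             # image-to-video preview model
        "input": {
            "prompt": "Animate the product with a gentle 360-degree turntable move",
            "img_url": "https://example.com/hero-shot.png",   # source frame (field name assumed)
        },
        "parameters": {"duration": 5},             # 5s or 10s per the preview references
    },
    timeout=30,
)
submit.raise_for_status()
task_id = submit.json()["output"]["task_id"]

# Poll the generic task endpoint until the clip is ready.
while True:
    status = requests.get(
        f"{BASE}/tasks/{task_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    ).json()["output"]
    if status["task_status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

print(status.get("video_url", status))             # download URL on success (field assumed)
```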
What creators and brands can expect
For creators working across design, video, and audio, Wan 2.5-Preview signals a shift from toolchains toward unified, instruction-based production:
- Faster short-form storycraft: sound and picture are born together, reducing manual sync and revision loops.
- Visual identity consistency: stronger stability for characters, products, and typography across cuts and variations.
- Reference-driven direction: align outputs with campaign mood boards, voiceover reads, or temp tracks without jumping tools.
- Lower overhead for pilots: preview-ready assets without a full post stack can accelerate testing and concept approvals.
For startups and small teams, the preview suggests new ways to scale creative throughput without scaling headcount, especially in social, product marketing, and brand experimentation. For agencies and studios, native multimodality points to tighter collaboration between writers, art directors, editors, and sound designers inside a single model surface.
Context: part of a broader push toward unified creative stacks
Alibaba Cloud has been investing in end-to-end AI platforms, and Model Studio is positioned as its development and deployment layer for enterprise and creator use cases. The addition of synchronized audio-visual generation in Wan 2.5-Preview extends that stack into more production-ready territory for short-form content, motion branding, and interactive experiences.
As the preview rolls out, the key focus areas will be instruction fidelity for complex prompts, identity preservation in image-to-video, and the breadth of synchronized audio scenarios, from dialog-heavy scenes to music-led edits. The company frames the release as a step toward fully unified creative workflows where text, image, video, and audio are composed and refined within one deeply aligned model.
Availability
Wan 2.5-Preview is available now in Alibaba Cloud Model Studio for testing and evaluation through preview endpoints. For feature details and supported parameters, see the product page and API references:
- Alibaba Cloud Model Studio
- Wan text-to-video (preview) documentation
- Wan image-to-video (preview) documentation
With a native multimodal core, synchronized audio-visual generation, and fine-grained image creation and editing, Wan 2.5-Preview arrives as a practical milestone for creators seeking cohesive outputs and faster concept-to-cut cycles, all inside a unified model surface.



