Hey everyone! Just wanted to drop a quick breakdown of my experience using WAN 2.2, the latest version of the AI model from WAN-AI. I’ve been playing around with AI video generation tools for a while, and I was really hyped for this update after using WAN 2.1 for a few months.
Why I Tried WAN 2.2
I’d been using WAN 2.1 mainly for text-to-video (T2V) and image-to-video (I2V) generation. The visuals were decent, but I often saw weird motion artifacts or jittery frames. So when I saw WAN 2.2 was out with promises of higher resolution, better motion control, and even VFX improvements, I had to try it out.
The new release has three model versions:
- T2V-A14B (Text-to-Video)
- I2V-A14B (Image-to-Video)
- TI2V-5B (Text + Image to Video hybrid)
I tested all three in ComfyUI, and here’s what I found.
Setup & Requirements
I ran WAN 2.2 locally using:
- ComfyUI
- Python + CUDA
- GPU: RTX 4090 (24 GB VRAM)
If you’re on a card with 16 GB of VRAM or less, you might need to enable model offloading or use a reduced variant (the 5B model is the lighter option).
Models are available on HuggingFace and GitHub.
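For anyone who wants to script it instead of (or alongside) ComfyUI, here’s a rough sketch of loading the 5B model through Diffusers with offloading turned on for lower-VRAM cards. The repo id and the WanPipeline class are assumptions on my part, so double-check the HuggingFace model card before copying this.

```python
# Hedged sketch: loading the TI2V-5B checkpoint via Diffusers with model
# offloading for cards with less than 24 GB of VRAM. The repo id and the
# WanPipeline class are assumptions; verify against the model card.
import torch
from diffusers import WanPipeline

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"  # assumed repo id; check HuggingFace

pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Moves submodules to the GPU only while they are needed; trades speed for VRAM.
pipe.enable_model_cpu_offload()

# Even more aggressive (much slower, but fits far smaller cards):
# pipe.enable_sequential_cpu_offload()
```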
What’s New & Actually Improved?
1. Mixture of Experts Architecture
WAN 2.2 uses an MoE setup with specialized denoising experts for the high-noise and low-noise stages of the denoising process. It has 27B parameters total but only activates ~14B per step, so it’s relatively efficient while improving quality. I definitely saw smoother transitions in motion.
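To make the MoE idea concrete, here’s a toy sketch of how routing by noise level works in principle: one expert handles the early, noisy steps and the other handles the later, cleaner ones, so only one ~14B expert runs per step. This is just my illustration of the concept; the boundary value and function names are made up, not WAN’s actual code.

```python
# Conceptual toy sketch of noise-level MoE routing (NOT WAN 2.2's actual code).
# Two denoising experts; only one runs per sampling step, which is how a
# 27B-parameter model ends up activating only ~14B parameters at a time.

BOUNDARY = 0.9  # hypothetical switch point on a [0, 1] noise scale

def denoise_step(latents, sigma, high_noise_expert, low_noise_expert, cond):
    """Pick one expert per step based on how noisy the latents still are."""
    expert = high_noise_expert if sigma >= BOUNDARY else low_noise_expert
    return expert(latents, sigma, cond)  # only the chosen ~14B expert executes
```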
2. Native 1080p Support
No more upscaling tricks. The output looks amazing. Color balance, lighting, and cinematic composition are on point. Great for storytelling and animation loops.
3. New Motion Engine (VACE 2.0)
This is a big one: WAN 2.2 adds much better camera movement and background stability. I tested zooms, pans, even simulated dolly shots — all looked way more fluid than WAN 2.1.
4. Prompt Planning & In-Context Control
With prompt extension enabled, you can use a large language model (such as Qwen, locally or via the DashScope API) to flesh out scenes and automate some of the prompt writing. It helps with frame-to-frame coherence.
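Under the hood, prompt extension amounts to something like the snippet below: send your short prompt to an LLM and ask for a cinematic, shot-by-shot rewrite before handing it to the video model. The endpoint URL and model name here are placeholders I haven’t verified, and the official prompt-extend option wires this up for you, so treat it purely as a sketch of the idea.

```python
# Sketch of LLM-based prompt extension. The base URL and model name are
# placeholders (assumed DashScope-compatible endpoint); the built-in
# prompt-extend integration does this for you.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

def extend_prompt(short_prompt: str) -> str:
    """Ask an LLM to expand a terse prompt into a detailed, cinematic one."""
    response = client.chat.completions.create(
        model="qwen-plus",  # assumed model name; any capable LLM should work
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's video prompt with detailed subject, motion, "
                    "camera, lighting, and style descriptions. Keep it under 120 words."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

extended = extend_prompt("a white cat skateboarding on a sunny beach")
print(extended)
```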
5. LoRA Fine-Tuning (Fast & Low Sample)
Training LoRA models only takes 10–20 images now, and it supports multi-LoRA blending. I trained one for a “cyberpunk fire dancer” concept in like 15 minutes.
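If you’re scripting outside ComfyUI, I believe the standard Diffusers LoRA loaders cover the blending part; here’s a rough sketch with placeholder adapter names and file paths from my own setup (the repo id is also an assumption, so check HuggingFace).

```python
# Hedged sketch: blending two trained LoRAs through the standard Diffusers
# LoRA API, assuming the Wan pipeline exposes it. Adapter names, file paths,
# and the repo id are placeholders.
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",  # assumed repo id; check HuggingFace
    torch_dtype=torch.bfloat16,
)

# Load each trained LoRA under its own adapter name.
pipe.load_lora_weights("loras/cyberpunk_fire_dancer.safetensors", adapter_name="fire_dancer")
pipe.load_lora_weights("loras/neon_city.safetensors", adapter_name="neon_city")

# Blend them with per-adapter strengths.
pipe.set_adapters(["fire_dancer", "neon_city"], adapter_weights=[0.9, 0.6])
```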
My Workflow (TI2V Example)
- Prompt Used: A white cat with sunglasses skateboarding on a sunny beach, cinematic lighting, slow-motion camera rotating around
- Settings:
  - Resolution: 1280x704
  - FPS: 24
  - Duration: 120 frames
  - Steps: 20
  - CFG: 3.5
  - Sampler: Euler
  - Scheduler: Simple
- Render Time: On my RTX 4090, it took around 5 minutes to render a 5-second video at 720p.
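If you’d rather run the same settings from a script than click through ComfyUI, here’s roughly how they’d map onto a Diffusers call. Same caveat as before: the repo id and pipeline class are assumptions, and ComfyUI’s Euler/simple sampler choice doesn’t translate one-to-one, so I just leave the pipeline’s default scheduler.

```python
# Hedged sketch: the same TI2V settings expressed as a Diffusers call
# (assumed repo id/pipeline class; scheduler left at the pipeline default,
# since ComfyUI's Euler/simple choice doesn't map one-to-one).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers",  # assumed repo id; check HuggingFace
    torch_dtype=torch.bfloat16,
).to("cuda")

prompt = (
    "A white cat with sunglasses skateboarding on a sunny beach, "
    "cinematic lighting, slow-motion camera rotating around"
)

frames = pipe(
    prompt=prompt,
    height=704,
    width=1280,
    num_frames=120,          # 120 frames at 24 fps = 5 seconds
    num_inference_steps=20,
    guidance_scale=3.5,      # CFG
).frames[0]

export_to_video(frames, "cat_skateboard.mp4", fps=24)
```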
Downsides I Noticed
- VRAM Hungry: Even with optimization, the A14B models need ~24 GB for smooth runs. Definitely not entry-level hardware-friendly.
- Prompt Engineering Still Matters: Garbage in, garbage out. It’s not magic. You still need well-written, detailed prompts for cinematic results.
- TI2V Can Overwhelm New Users: While TI2V gives the most flexibility, it takes some getting used to if you’re not familiar with coordinating an image input with a text prompt.
Final Thoughts
WAN 2.2 is easily one of the most impressive open-source multimodal AI models I’ve used in 2025. If you’re already familiar with ComfyUI, it’s a no-brainer to test it out. Motion control is leagues ahead of the previous generation, and the native 1080p output makes it production-worthy.
If you’re running on high VRAM and want to create stylized short films, animated concepts, or AI-generated music videos, give WAN 2.2 a shot.
