Cross-modal memory keeps identity alive.
Slot-paired visual and audio memories preserve face, appearance, voice timbre, and face-voice correspondence across distant shots.
Long video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, which limits its use in interactive scenarios. We present JoyAI-Echo, a framework that addresses these barriers through four key advances. First, a cross-modal audio-visual memory bank keeps character appearance and voice timbre consistent over five-minute videos. Second, a post-training pipeline combines memory-based reinforcement learning with distribution matching distillation, delivering a 7.5× speedup while substantially improving visual quality and alignment. With these two components, JoyAI-Echo decisively outperforms Happy Oyster (Directing mode) on long-form generation and even surpasses the short-video specialist Wan 2.6 on human-centric tasks. Beyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency, delivering instantly editable, conversation-speed video creation. For the first time, JoyAI-Echo simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output, without compromise, inaugurating a new era of interactive video generation. Code and weights will be open-sourced.
Memory-conditioned acceleration and low-step distillation deliver a 7.5× speedup over the original multi-step pipeline.
The Director Agent expands rough user intent into structured screenplay, shots, characters, scenes, and localized revisions.
A one-step audio-visual super-resolution stage sharpens 1K output without breaking streaming generation constraints.
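As a purely illustrative picture of the slot-paired memory idea, the sketch below stores a visual feature and an audio feature together under one character slot, so that a later shot retrieves both at once. This is not the JoyAI-Echo implementation, whose internals are not described here. The names `MemorySlot`, `SlotPairedMemory`, `write`, and `read`, as well as the use of plain float lists as embeddings, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vector = List[float]  # stand-in for a learned embedding (assumption)


@dataclass
class MemorySlot:
    """One character slot holding paired visual and audio features."""
    visual: Vector  # hypothetical face/appearance embedding
    audio: Vector   # hypothetical voice-timbre embedding


@dataclass
class SlotPairedMemory:
    """Toy memory bank keyed by character; both modalities share a slot."""
    slots: Dict[str, MemorySlot] = field(default_factory=dict)

    def write(self, character_id: str, visual: Vector, audio: Vector) -> None:
        # Both modalities are written in one call, so the pair cannot drift apart.
        self.slots[character_id] = MemorySlot(list(visual), list(audio))

    def read(self, character_id: str) -> Optional[Tuple[Vector, Vector]]:
        # A later shot retrieves face and voice features together, or None
        # if the character has not been seen yet.
        slot = self.slots.get(character_id)
        if slot is None:
            return None
        return slot.visual, slot.audio


bank = SlotPairedMemory()
bank.write("alice", visual=[0.1, 0.9], audio=[0.7, 0.2])  # written in an early shot
paired = bank.read("alice")                                # read in a distant shot
```

Keying both features by the same slot is one simple way to express face-voice correspondence: whichever character a shot conditions on, its appearance and timbre arrive as a unit.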
Human preference study across long-form consistency, audio quality, prompt following, and human-centric generation.
The numbers denote the percentage of user preferences.
| Aspect | Long video: JoyAI-Echo | Long video: Tie | Long video: Happy Oyster (Directing) | Short video (human-centric): JoyAI-Echo | Short video (human-centric): Tie | Short video (human-centric): Wan 2.6 |
|---|---|---|---|---|---|---|
| Visual aesthetics | 63.6% | 8.8% | 27.6% | 58.8% | 14.7% | 26.5% |
| Audio quality | 81.7% | 6.5% | 11.8% | 32.3% | 30.9% | 36.8% |
| Prompt following | 80.6% | 13.5% | 5.9% | 33.8% | 36.8% | 29.4% |
| IP consistency | 59.4% | 12.9% | 27.7% | -- | -- | -- |
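To make the table easier to compare row by row, the following sketch derives two summary figures from the reported percentages: the percentage-point margin between JoyAI-Echo and the baseline, and JoyAI-Echo's share of the non-tie preferences. The input numbers are copied from the table; the two summary metrics and the function names `net_margin` and `decisive_win_rate` are choices made for this example and are not part of the original study.

```python
# Preference percentages copied from the table: (JoyAI-Echo, tie, baseline).
LONG_VIDEO = {  # baseline: Happy Oyster (Directing)
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {  # baseline: Wan 2.6 (IP consistency was not reported)
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_margin(win: float, tie: float, loss: float) -> float:
    """Percentage-point gap between JoyAI-Echo and the baseline."""
    return round(win - loss, 1)


def decisive_win_rate(win: float, tie: float, loss: float) -> float:
    """JoyAI-Echo's share of the non-tie preferences, in percent."""
    return round(100.0 * win / (win + loss), 1)


def summarize(rows):
    """Map each aspect to (net margin, decisive win rate)."""
    return {
        aspect: (net_margin(*r), decisive_win_rate(*r))
        for aspect, r in rows.items()
    }


if __name__ == "__main__":
    for title, rows in (
        ("Long video vs Happy Oyster (Directing)", LONG_VIDEO),
        ("Short video (human-centric) vs Wan 2.6", SHORT_VIDEO),
    ):
        print(title)
        for aspect, (margin, rate) in summarize(rows).items():
            print(f"  {aspect}: {margin:+.1f} pts, {rate:.1f}% of non-tie preferences")
```

A positive margin means JoyAI-Echo was preferred more often than the baseline on that aspect; a negative margin means the baseline was preferred more often.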
Haoran Li1,* Jie Huang1,* Fredreic Li2,* Shichen Ma1,* Yijun Liu3 Jiaqi Shi4 Yanwen Ma5 Yaofeng Su6 Xin Lu4 Haoyu Wang3 Xiaoxiao Ma4 Guohui Zhang4 Yaowei Li2 Mingchen Zhong4 Junhao Zhuang1 Songchun Zhang7 Weiyang Jin8 Yuxuan Bian9 Shiyi Zhang3 Haojun Xu5 Shuai Lu1 Xin Han1 Wei Tang1 Tong He1 Jiaqi Wang1 Ping Luo8 Haoyang Huang1 Zeyue Xue1,8,*,+ Nan Duan1
1Joy Future Academy, JD; 2Peking University; 3Tsinghua University; 4The University of Science and Technology of China; 5Beihang University; 6Fudan University; 7The Hong Kong University of Science and Technology; 8The University of Hong Kong; 9The Chinese University of Hong Kong
@techreport{joyai2026echo,
title = {JoyAI-Echo: Pushing the Frontier of Long Audio-Visual Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {May}
}