JoyAI-Echo

Long Video Generation

Technical Report | May 28, 2026 | Echo Team @ Joy Future Academy, JD

JoyAI-Echo: Pushing the Frontier of Long Audio-Visual Generation

A memory-driven audio-visual generation framework for minute-level coherent video, real-time streaming, conversational control, and high-resolution output.

5 min: minute-level coherent stories

7.5×: generation speedup

A/V: memory-driven consistency

Abstract

Long video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, which limits its applicability to interactive scenarios. We present JoyAI-Echo, a framework that breaks these barriers through four key advances. Central to its performance, a cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos, while a post-training pipeline combines memory-based reinforcement learning with distribution matching distillation, delivering a 7.5× speedup while substantially boosting visual quality and alignment. Empowered by these two components, JoyAI-Echo decisively outperforms Happy Oyster (Directing mode) on long-form generation and even surpasses the short-video specialist Wan 2.6 on human-centric tasks. Beyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency constraints, further elevating the overall experience and delivering instantly editable, conversation-speed video creation. For the first time, JoyAI-Echo simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output without compromise, inaugurating a new era of interactive video generation. Code and weights will be open-sourced.

Key Conclusions

Cross-modal memory keeps identity alive.

Slot-paired visual and audio memories preserve face, appearance, voice timbre, and face-voice correspondence across distant shots.
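As a rough illustration of what "slot-paired" could mean, the sketch below stores one visual feature vector and one audio feature vector under the same character key, so a lookup always returns a matching face-voice pair. This is a toy model only: the class names, the use of plain float lists as features, and the keying by character name are all assumptions made for illustration, not details taken from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemorySlot:
    """Hypothetical slot: visual and audio features for ONE character."""
    visual: List[float]
    audio: List[float]


@dataclass
class SlotPairedMemory:
    """Hypothetical memory bank keyed by character id (an assumption)."""
    slots: Dict[str, MemorySlot] = field(default_factory=dict)

    def write(self, character_id: str, visual: List[float], audio: List[float]) -> None:
        # Both modalities are written together, so they can never drift apart.
        self.slots[character_id] = MemorySlot(list(visual), list(audio))

    def read(self, character_id: str) -> Tuple[List[float], List[float]]:
        # A single lookup returns the paired face and voice features.
        slot = self.slots[character_id]
        return slot.visual, slot.audio


memory = SlotPairedMemory()
memory.write("alice", visual=[0.1, 0.9], audio=[0.4, 0.2])
memory.write("bob", visual=[0.7, 0.3], audio=[0.8, 0.5])

# A later, distant shot that features "alice" retrieves her paired features.
alice_visual, alice_audio = memory.read("alice")
```

Because the two modalities share a key, a distant shot cannot retrieve one character's face together with another character's voice, which is the correspondence property the conclusion describes.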

Streaming generation becomes practical.

Memory-conditioned acceleration and low-step distillation deliver a 7.5× speedup over the original multi-step pipeline.

Interaction is part of the generation loop.

The Director Agent expands rough user intent into structured screenplay, shots, characters, scenes, and localized revisions.
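To make "structured screenplay" and "localized revisions" concrete, here is a minimal sketch of one possible representation: a screenplay is an ordered tuple of shots, and a revision replaces a single shot while leaving the others untouched. The `Shot` and `Screenplay` types, their fields, and the `revise_shot` helper are hypothetical names chosen for this example; the report does not specify the agent's actual schema.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Shot:
    """Hypothetical shot record; field names are assumptions."""
    shot_id: int
    scene: str
    characters: Tuple[str, ...]
    description: str


@dataclass(frozen=True)
class Screenplay:
    """Hypothetical screenplay: a title plus an ordered tuple of shots."""
    title: str
    shots: Tuple[Shot, ...]


def revise_shot(screenplay: Screenplay, shot_id: int, new_description: str) -> Screenplay:
    """Return a new screenplay in which only the targeted shot is changed."""
    shots = tuple(
        replace(shot, description=new_description) if shot.shot_id == shot_id else shot
        for shot in screenplay.shots
    )
    return replace(screenplay, shots=shots)


screenplay = Screenplay(
    title="Morning at the cafe",
    shots=(
        Shot(1, "cafe interior", ("alice",), "Alice orders a coffee."),
        Shot(2, "cafe interior", ("alice", "bob"), "Bob waves from a corner table."),
    ),
)

# A conversational instruction such as "make Bob stand up instead" would map
# to a revision of shot 2 only.
revised = revise_shot(screenplay, shot_id=2, new_description="Bob stands up and waves.")
```

Keeping untouched shots identical is what would let a generator regenerate only the revised segment rather than the whole video.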

Resolution stays aligned with latency.

A one-step audio-visual super-resolution stage sharpens 1K output without breaking streaming generation constraints.

Results

Human preference study across long-form consistency, audio quality, prompt following, and human-centric generation.

GSB (Good/Same/Bad) user study on long- and short-video generation

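The numbers denote the percentage of user preferences for each option; within each comparison, the three columns sum to 100%. "--" marks a value that is not reported.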

Aspect	Long Video: JoyAI-Echo	Long Video: Tie	Long Video: Happy Oyster (Directing)	Short Video (Human-Centric): JoyAI-Echo	Short Video (Human-Centric): Tie	Short Video (Human-Centric): Wan 2.6
Visual aesthetics	63.6%	8.8%	27.6%	58.8%	14.7%	26.5%
Audio quality	81.7%	6.5%	11.8%	32.3%	30.9%	36.8%
Prompt following	80.6%	13.5%	5.9%	33.8%	36.8%	29.4%
IP consistency	59.4%	12.9%	27.7%	--	--	--
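A GSB table is often condensed into two derived numbers per row: the net preference (preferred minus not preferred) and the win rate with ties excluded. The sketch below computes both from the percentages in the table above. The two summary formulas and the function names are this example's own choices, not metrics reported by the authors.

```python
# GSB percentages copied from the table above: (JoyAI-Echo, tie, baseline).
LONG_VIDEO = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Percentage points by which JoyAI-Echo leads (negative if it trails)."""
    return round(good - bad, 1)


def win_rate_excluding_ties(good: float, same: float, bad: float) -> float:
    """Share of decisive votes (ties dropped) that favor JoyAI-Echo, in percent."""
    return round(100.0 * good / (good + bad), 1)


def summarize(results: dict) -> dict:
    """Map each aspect to (net preference, win rate excluding ties)."""
    return {
        aspect: (net_preference(*row), win_rate_excluding_ties(*row))
        for aspect, row in results.items()
    }


long_summary = summarize(LONG_VIDEO)
short_summary = summarize(SHORT_VIDEO)

for name, summary in (("Long video", long_summary), ("Short video", short_summary)):
    for aspect, (net, win) in summary.items():
        print(f"{name} | {aspect}: net {net:+.1f} pts, win rate {win:.1f}%")
```

Under this summary, short-video audio quality is the only row where the baseline is preferred more often than JoyAI-Echo (32.3% versus 36.8%).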

Authors

Haoran Li¹,* Fredreic Li²,* Shichen Ma¹,* Jie Huang¹ Yijun Liu³ Jiaqi Shi⁴ Yanwen Ma⁵ Yaofeng Su⁶ Xin Lu⁴ Haoyu Wang³ Xiaoxiao Ma⁴ Guohui Zhang⁴ Yaowei Li² Mingchen Zhong⁴ Junhao Zhuang¹ Songchun Zhang⁷ Weiyang Jin⁸ Yuxuan Bian⁹ Shiyi Zhang³ Haojun Xu⁵ Shuai Lu¹ Xin Han¹ Wei Tang¹ Tong He¹ Jiaqi Wang¹ Ping Luo⁸ Haoyang Huang¹ Zeyue Xue¹,⁸,*,+ Nan Duan¹

¹Joy Future Academy, JD ²Peking University ³Tsinghua University ⁴The University of Science and Technology of China ⁵Beihang University ⁶Fudan University ⁷The Hong Kong University of Science and Technology ⁸The University of Hong Kong ⁹The Chinese University of Hong Kong

* Equal contribution. + Project lead.

BibTeX

@techreport{joyai2026echo,
  title        = {JoyAI-Echo: Pushing the Frontier of Long Audio-Visual Generation},
  author       = {{Echo Team @ Joy Future Academy, JD}},
  institution  = {Joy Future Academy, JD},
  year         = {2026},
  month        = {May}
}