Echo-Infinity: Learnable Evolving Memory for Real-Time Infinite Video Generation

1 The Chinese University of Hong Kong 2 Joy Future Academy, JD 3 The Hong Kong University of Science and Technology 4 Tsinghua University 5 The University of Hong Kong 6 Peking University 7 University of Science and Technology of China

† Corresponding Author

Echo-Infinity demonstrates hour-scale and real-time video generation with a learnable memory to filter, abstract, and compress any-length history at constant cost, suggesting a practical path toward infinite video generation.


Abstract

We present Echo-Infinity, an autoregressive (AR) framework toward real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress history of any length at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors, because their cache window is limited and they ignore the noise introduced by autoregressive generation. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism whenever past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformer (DiT), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce the Unified Relative RoPE Recipe, which anchors the sink frame at id 0 and lets the newest frame id grow at most to the DiT's pretrained maximum temporal RoPE id fmax throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In both long and short video generation, Echo-Infinity achieves state-of-the-art performance and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

Contributions

Three core advances that together unlock infinite, real-time video generation.

End-to-End Memory Query

We propose Echo-Infinity, an autoregressive framework toward real-time infinite video generation that replaces handcrafted memory curation with an end-to-end trainable Memory Query, optimized to filter, abstract, and compress arbitrary-length history at constant computation cost.

Unified Relative RoPE Recipe

We employ the Unified Relative RoPE Recipe, which keeps every active temporal RoPE id within the trained range throughout training and inference, avoiding the RoPE train-test extrapolation gap.

Generalizable & Real-Time

We achieve generalizable and state-of-the-art performance on long, short, and interactive video generation benchmarks. We further provide the first demonstration of stable quality over 24-hour real-time video generation (>1.3M frames), paving the way toward infinite video generation.
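As a rough sanity check on the figure of more than 1.3M frames in 24 hours, the arithmetic below assumes a playback rate of 16 frames per second. That rate is an assumption made for illustration only; this page does not state it.

```python
# Assumption (not stated on this page): 16 frames per second playback.
ASSUMED_FPS = 16
HOURS = 24

frames = HOURS * 3600 * ASSUMED_FPS  # seconds in 24 h times frames per second
print(frames)  # 1382400
```

Under that assumption, a 24-hour rollout comes to roughly 1.38M frames, which is consistent with the ">1.3M frames" claim.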

Continuous Generation: first > 24 h, infinite-time
Frames Generated: > 1.3M, stable visual quality
FPS on H100: real-time streaming
Throughput Overhead: vs. memory-free baseline

Method

A learnable memory state, updated by attention and gating on every eviction, plus a relative RoPE schedule shared between training and inference.


Figure 2. Echo-Infinity introduces an end-to-end trainable Memory Query that filters, abstracts, and compresses the KV caches of evicted history through attention and gating, enabling evolving compression of arbitrarily long histories. To avoid temporal RoPE extrapolation, or even overflow, during inference, Echo-Infinity uses Relative RoPE throughout training and inference: the sink frame is anchored at id 0, and the newest frame id grows at most to the backbone's pretrained maximum temporal RoPE id fmax (e.g., fmax=20 for Wan2.1-1.3B), closing the RoPE extrapolation gap.

Three-Tier KV Organization

Inspired by the human memory hierarchy, the per-layer cache is split into sink frames (global anchor), a local window (working buffer of recent frames), and a learnable memory query (long-term evolving store).
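The three tiers can be pictured with a small bookkeeping sketch. This is a minimal illustration, not the authors' implementation: the class name `ThreeTierCache`, the parameters `num_sink` and `window`, the string placeholders standing in for KV tensors, and the ordering of tiers in `context()` are all assumptions made here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ThreeTierCache:
    """Hypothetical per-layer cache: sink frames, local window, memory slots."""
    num_sink: int   # assumed number of sink frames kept for the whole rollout
    window: int     # assumed capacity of the local window, in frames
    memory: list    # stand-in for the learnable memory-query state
    sink: list = field(default_factory=list)
    local: deque = field(default_factory=deque)

    def append(self, frame_kv):
        """Add a new frame's KV; return the evicted frame's KV, or None."""
        if len(self.sink) < self.num_sink:
            self.sink.append(frame_kv)       # earliest frames become the anchor
            return None
        self.local.append(frame_kv)
        if len(self.local) > self.window:
            return self.local.popleft()      # oldest recent frame leaves the window
        return None

    def context(self):
        """Entries visible to attention (tier order is an assumption)."""
        return self.sink + self.memory + list(self.local)


cache = ThreeTierCache(num_sink=1, window=3, memory=["m0", "m1"])
evicted = [cache.append(f"f{i}") for i in range(6)]
print(evicted)          # [None, None, None, None, 'f1', 'f2']
print(cache.context())  # ['f0', 'm0', 'm1', 'f3', 'f4', 'f5']
```

However many frames are appended, the attention context stays at `num_sink + len(memory) + window` entries, which is the property that keeps per-step cost constant. In this sketch the evicted entries are simply returned; in the method they are what the memory query consumes.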

Memory Update on Eviction

When the local window slides forward, evicted KVs feed a cross-attention encoder that refreshes the queries, followed by a sigmoid-gated residual that selectively overwrites memory state — jointly optimized end-to-end with the video DiT.
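The eviction step described above can be sketched in plain Python. This is a toy under stated assumptions, not the paper's layer: the function name `update_memory`, the scalar gate with parameters `gate_w` and `gate_b`, and the blend form `(1 - g) * old + g * read` are hypothetical choices that merely satisfy the description "cross-attention followed by a sigmoid-gated residual". The real module is trained jointly with the DiT and likely uses learned projections that are omitted here.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_memory(memory, evicted_k, evicted_v, gate_w, gate_b):
    """One eviction step: cross-attend to evicted KVs, then gated blend.

    memory:    list of M query vectors of dimension d
    evicted_k: list of N key vectors; evicted_v: list of N value vectors
    gate_w, gate_b: hypothetical gate parameters (length 2d vector, scalar)
    """
    d = len(memory[0])
    new_memory = []
    for q in memory:
        # Scaled dot-product attention of this query over the evicted keys.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in evicted_k])
        read = [sum(w * v[j] for w, v in zip(weights, evicted_v)) for j in range(d)]
        # Scalar sigmoid gate computed from the old state and the attention read.
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, q + read) + gate_b)))
        # Gated residual: g near 0 keeps the old state, g near 1 overwrites it.
        new_memory.append([(1 - g) * qi + g * ri for qi, ri in zip(q, read)])
    return new_memory


memory = [[1.0, 0.0], [0.0, 1.0]]
evicted_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
evicted_v = [[0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
zero_w = [0.0, 0.0, 0.0, 0.0]

kept = update_memory(memory, evicted_k, evicted_v, zero_w, gate_b=-50.0)
overwritten = update_memory(memory, evicted_k, evicted_v, zero_w, gate_b=50.0)
print(kept)         # approximately the old memory
print(overwritten)  # approximately the attention read of the evicted values
```

The output always has as many slots as the input memory, regardless of how many KV entries were evicted, so the memory footprint does not grow with video length.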

Unified Relative RoPE Schedule

Sink stays at id 0; the newest frame grows up to fmax; older frames rotate backward once that bound is reached. Every active id stays inside the trained range [0, fmax], in both training and inference — no overflow, no extrapolation.
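One way to read this schedule is as an id assignment that counts back from a clamped newest id. The sketch below is an interpretation, not the released code: the function name `relative_rope_ids` and its parameters are hypothetical, only sink and local-window frames are assigned ids, and the ids used by the memory queries are left out because the text above does not specify them.

```python
def relative_rope_ids(num_generated, num_sink, window, f_max):
    """Temporal RoPE ids for the active sink and local-window frames.

    num_generated: total number of frames produced so far
    Returns (sink_ids, local_ids), every id lying in [0, f_max].
    """
    # Assumed constraint so local ids never collide with sink ids.
    assert num_sink + window <= f_max + 1
    sink_ids = list(range(min(num_sink, num_generated)))
    num_local = max(0, min(window, num_generated - num_sink))
    # The newest frame's id follows its absolute index until it hits f_max.
    newest_id = min(num_generated - 1, f_max)
    # Older local frames count backward from the newest id.
    local_ids = [newest_id - (num_local - 1 - i) for i in range(num_local)]
    return sink_ids, local_ids


print(relative_rope_ids(4, num_sink=1, window=3, f_max=20))
# ([0], [1, 2, 3])
print(relative_rope_ids(10, num_sink=1, window=3, f_max=20))
# ([0], [7, 8, 9])
print(relative_rope_ids(1_300_000, num_sink=1, window=3, f_max=20))
# ([0], [18, 19, 20])
```

Once the newest id is pinned at `f_max`, a frame that held id `f_max` one step earlier now holds `f_max - 1`, which corresponds to rotating older frames backward. The id pattern at frame 1,300,000 is the same as at frame 21, so inference never sees an id outside the range used in training.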

Qualitative Comparison

Each row shows the same prompt rendered by four methods: Echo-Infinity (Ours) on the left, followed by LongLive, MemFlow, and Memorize-and-Generate.

Quantitative Results

Table 1. Single-Prompt 30s / 240s Long Video Evaluation (VBench-Long / MovieGen)

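| Method | #Params | Throughput (FPS) ↑ | 30s Quality ↑ | 30s Semantic ↑ | 30s User Pref. ↑ | 240s Quality ↑ | 240s User Pref. ↑ |
|---|---|---|---|---|---|---|---|
| LongLive | 1.3B | 20.7 | 83.59 | 80.28 | 10.47 | 79.79 | 6.13 |
| MemFlow | 1.3B | 18.7 | 83.35 | 80.85 | 10.13 | 79.31 | 5.93 |
| Memorize-and-Generate | 1.3B | 21.7 | 83.69 | 81.01 | 14.73 | 75.49 | 2.13 |
| ∞-RoPE | 1.3B | 17.0 | 83.38 | 74.67 | 5.13 | 79.99 | 14.13 |
| Echo-Infinity (Ours) | 1.3B | 18.5 | 85.61 | 82.01 | 59.53 | 81.23 | 71.67 |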

Table 2. Single-Prompt 5s Video Evaluation (VBench)

| Method | #Params | Throughput (FPS) ↑ | Total ↑ | Quality ↑ | Semantic ↑ |
|---|---|---|---|---|---|
| LTX-Video | 1.9B | 8.98 | 80.00 | 82.30 | 70.79 |
| Wan-2.1 | 1.3B | 0.78 | 84.26 | 85.30 | 80.09 |
| SkyReels-V2 | 1.3B | 0.49 | 82.67 | 84.70 | 74.53 |
| MAGI-1 | 4.5B | 0.19 | 79.18 | 82.04 | 67.74 |
| Self Forcing (chunk-wise) | 1.3B | 17.0 | 83.08 | 83.97 | 79.53 |
| Causal Forcing (chunk-wise) | 1.3B | 17.0 | 83.94 | 84.59 | 81.35 |
| LongLive | 1.3B | 20.7 | 83.29 | 84.09 | 80.06 |
| MemFlow | 1.3B | 18.7 | 83.62 | 84.52 | 80.02 |
| Memorize-and-Generate | 1.3B | 21.7 | 84.06 | 84.84 | 80.96 |
| Echo-Infinity (w/o Memory Update) | 1.3B | 18.9 | 84.57 | 85.51 | 80.80 |
| Echo-Infinity (w/ Memory Update) | 1.3B | 18.5 | 85.35 | 86.32 | 81.49 |

Table 3. Multi-Prompt 60s Interactive Evaluation (MemFlow benchmark)

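| Method | Quality Score ↑ | CLIP Score ↑ 0–10s | 10–20s | 20–30s | 30–40s | 40–50s | 50–60s |
|---|---|---|---|---|---|---|---|
| LongLive | 79.38 | 34.08 | 32.09 | 32.03 | 31.55 | 30.88 | 30.49 |
| MemFlow | 78.91 | 33.48 | 31.94 | 31.95 | 30.87 | 30.53 | 30.23 |
| Memorize-and-Generate | 79.15 | 33.58 | 31.43 | 31.14 | 30.65 | 30.48 | 30.27 |
| ∞-RoPE | 79.22 | 33.15 | 32.47 | 31.41 | 30.46 | 30.29 | 30.17 |
| Echo-Infinity (Ours) | 81.71 | 34.10 | 32.42 | 31.99 | 31.18 | 30.83 | 30.74 |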

Citation

@article{bian2026echoinfinity,
  title   = {Echo-Infinity: Learnable Evolving Memory for Real-Time Infinite Video Generation},
  author  = {Bian, Yuxuan and Xue, Zeyue and Zhang, Songchun and Zhang, Shiyi and Jin, Weiyang and Li, Yaowei and Zhuang, Junhao and Li, Haoran and Huang, Jie and Huang, Haoyang and Duan, Nan and Xu, Qiang},
  journal = {arXiv preprint arXiv:2606.04527},
  year    = {2026}
}