Echo Team · Joy Future Academy, JD · June 2026 · CC BY 4.0

Echo-Memory
A Controlled Study of Memory in Action World Models

When the camera leaves and returns, which memory keeps the same world instead of a plausible but different scene?


Controlled memory ablations on a shared Wan action-to-video stack — reproducible rows, evaluation scripts, and qualitative revisit panels.

Authors & affiliations

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

HKU · Joy Future Academy, JD · CUHK · PKU · Fudan · Tsinghua · HKUST · UMich

01 · Overview

One backbone, one protocol — only memory changes.

Echo-Memory holds the video backbone, training recipe, and data protocol fixed, and swaps only the memory module. The goal is to separate replay fidelity from return memory when the camera leaves and comes back to the same place.

Shared stack — chunk-wise action-conditioned world generation on Wan.
Controlled variable — Context, Compression, Spatial, or State-Space memory.
Three probes — replay metrics, in-domain 180° loop, open-domain edited return.
Release — ablation scripts, GT replay, revisit assets, and paper-aligned figures.

Echo-Memory framework overview: controlled memory study over chunk-wise action-world generation.

02 · Memory Design

Context · Compression · Spatial · State-Space

All variants plug into the same write–read interface; we only change what is stored and how history is retrieved. A no-memory I2V floor re-generates from the first frame as a lower bound.

Context

Raw recent frames at K = 1, 5, or 20 chunks — tests whether longer windows alone stop drift.

Compression

Learned compact tokens at ratio r = 4 — history without growing raw-frame storage.

Spatial

Explicit spatial read/write state — targets layout, object pose, and viewpoint carry.

State-Space

Block-wise SSM updates — recurrent carry beyond short context windows on revisit.
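
The shared write–read interface can be pictured with a small sketch. Everything below is illustrative and assumed rather than taken from the Echo-Memory code: the `Memory` protocol, the class names, and the use of plain float lists as stand-ins for chunk latents are all hypothetical. It only shows how a raw-context window of K chunks and a no-memory floor could sit behind the same two calls.

```python
from collections import deque
from typing import Deque, List, Optional, Protocol, Sequence

# Hypothetical stand-in for one generated chunk (real chunks would be latents).
Chunk = Sequence[float]


class Memory(Protocol):
    """Assumed interface: every variant exposes the same write/read pair."""

    def write(self, chunk: Chunk) -> None: ...

    def read(self) -> List[Chunk]: ...


class NoMemory:
    """No-memory floor: only the first chunk is ever available for conditioning."""

    def __init__(self) -> None:
        self._first: Optional[Chunk] = None

    def write(self, chunk: Chunk) -> None:
        if self._first is None:
            self._first = chunk  # later chunks are discarded

    def read(self) -> List[Chunk]:
        return [] if self._first is None else [self._first]


class ContextMemory:
    """Raw-context memory: keeps the K most recent chunks, oldest first."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self._window: Deque[Chunk] = deque(maxlen=k)

    def write(self, chunk: Chunk) -> None:
        self._window.append(chunk)  # deque drops the oldest chunk when full

    def read(self) -> List[Chunk]:
        return list(self._window)


def rollout(memory: Memory, chunks: Sequence[Chunk]) -> List[List[Chunk]]:
    """Return what the memory exposes before each chunk is written."""
    reads = []
    for chunk in chunks:
        reads.append(memory.read())
        memory.write(chunk)
    return reads
```

Under this sketch, swapping `ContextMemory(k=5)` for `ContextMemory(k=20)` or `NoMemory()` changes only what `read()` returns, which mirrors the idea that the memory module is the single controlled variable.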

Memory design matrix: four memory families under a shared write–read interface.

03 · Checkpoints

Paper baselines on Hugging Face

Wan 2.1 1.3B memory rows — epoch-0, 30,000 steps, static in-domain pool. Released weights: Echo-Team/Echo-Memory

Family	Paper row	HF path	Steps
Raw context	Context K=1	`context_k1/epoch-0.safetensors`	30,000
Raw context	Context K=20	TODO	TODO
Spatial	Spatial Memory	TODO	TODO
State-space	Block-wise SSM	TODO	TODO
State-space	Legacy Hybrid	TODO	TODO
Spatial	concat text (abl.)	TODO	TODO
Spatial	inject none (abl.)	TODO	TODO
Spatial	cross-attn t32 (abl.)	TODO	TODO
State-space	SSM ctx1/e4/h21	TODO	TODO
State-space	SSM ctx5/e1/h21	TODO	TODO
State-space	SSM ctx5/e4/h81	TODO	TODO

Download

huggingface-cli download Echo-Team/Echo-Memory context_k1/epoch-0.safetensors --local-dir ./ckpts
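
For scripted setups, the same download can probably be done from Python with `huggingface_hub.hf_hub_download`. This is a sketch, not part of the released tooling: the helper names are made up here, and only the repo id and the `context_k1` file path come from the table above.

```python
from pathlib import Path

REPO_ID = "Echo-Team/Echo-Memory"            # from the checkpoint table
FILENAME = "context_k1/epoch-0.safetensors"  # the one released path listed above


def expected_local_path(local_dir: str, filename: str = FILENAME) -> Path:
    """Where the file should land: local_dir keeps the repo's folder layout."""
    return Path(local_dir) / filename


def download_checkpoint(local_dir: str = "./ckpts", filename: str = FILENAME) -> str:
    """Fetch one checkpoint file; requires `pip install huggingface_hub`."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)
```

Calling `download_checkpoint()` should leave the file at `./ckpts/context_k1/epoch-0.safetensors`, which keeps the row folder in the path.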

In-domain eval (Echo-Memory repo)

export WAN_BASE_MODEL=/path/to/Wan2.1-T2V-1.3B
export DATASET_BASE_PATH=data/Context-as-Memory-Dataset
export CKPT=./ckpts/context_k1/epoch-0.safetensors
bash eval/v2/run_static_consistency_loop_and_revisit.sh

Keep the row folder (here `context_k1/`) in the CKPT path: env/memory_baseline_runtime.py infers memory flags from it, so moving the weights file out of its row folder loses that information. Full index: doc/checkpoints.md.
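
To see why the folder matters, here is a hypothetical sketch of path-based inference. It is not the logic in env/memory_baseline_runtime.py, which this page does not show; the function names and the `context_k<N>` pattern are assumptions based only on the single released path.

```python
import re
from pathlib import Path
from typing import Optional


def row_folder(ckpt_path: str) -> str:
    """Name of the directory that directly contains the checkpoint file."""
    return Path(ckpt_path).parent.name


def guess_context_k(row: str) -> Optional[int]:
    """Assumed naming scheme: 'context_k<N>' means a raw-context window of N chunks."""
    match = re.fullmatch(r"context_k(\d+)", row)
    return int(match.group(1)) if match else None


# With the row folder kept, the window size is recoverable from the path.
kept = row_folder("./ckpts/context_k1/epoch-0.safetensors")

# With the file moved up one level, the folder name carries no memory information.
flattened = row_folder("./ckpts/epoch-0.safetensors")
```

Here `guess_context_k(kept)` yields 1, while `guess_context_k(flattened)` yields `None`, which is the failure mode the note above warns about.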

04 · Evaluation

Replay · In-domain revisit · Open-domain return

Each branch asks a different question: Can the model reconstruct the past? Can it close a loop in-domain? After an edited first frame, does it return to the same world?

Replay

PSNR, SSIM, LPIPS on chunk-wise reconstruction — measures short-horizon pixel fidelity.
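
As a reminder of what the simplest of these metrics measures, the sketch below computes PSNR per chunk from the standard definition, 10 · log10(MAX² / MSE). It is a pure-Python illustration on flat pixel lists, not the repo's evaluation code; the function names and the 8-bit `max_value` default are assumptions.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized, flattened frames."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def chunkwise_psnr(
    pred_chunks: Sequence[Sequence[float]],
    target_chunks: Sequence[Sequence[float]],
    max_value: float = 255.0,
) -> List[float]:
    """One PSNR value per (generated chunk, ground-truth chunk) pair."""
    if len(pred_chunks) != len(target_chunks):
        raise ValueError("need the same number of predicted and target chunks")
    return [psnr(p, t, max_value) for p, t in zip(pred_chunks, target_chunks)]
```

Because PSNR compares pixels at matching timesteps only, a high score says nothing about whether the scene is still the same one after the camera has left and come back.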

In-domain

180° trajectory loop closure with VLM-assisted scoring on held-out layouts.

Open-domain

Edited first frames and 45° return probes — stresses object identity and scene persistence.
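
One way to picture a leave-and-return probe such as the 180° loop or the 45° return is as a yaw schedule that sweeps out to a maximum angle and retraces its path. The sketch below is an assumed parameterization for illustration only; the actual camera trajectories, step counts, and function names in the Echo-Memory evaluation may differ.

```python
from typing import List, Tuple


def leave_and_return_yaws(max_yaw_deg: float, steps_each_way: int) -> List[float]:
    """Yaw angles that sweep from 0 to max_yaw_deg and back to 0."""
    if steps_each_way < 1:
        raise ValueError("steps_each_way must be at least 1")
    out = [max_yaw_deg * i / steps_each_way for i in range(steps_each_way + 1)]
    back = out[-2::-1]  # retrace the outbound path, skipping the turning point
    return out + back


def revisit_pairs(yaws: List[float]) -> List[Tuple[int, int]]:
    """Index pairs (outbound, return) that look at the scene from the same yaw."""
    n = len(yaws)
    return [(i, n - 1 - i) for i in range(n // 2)]


# Hypothetical settings: a 180° loop sampled at 4 steps each way.
loop = leave_and_return_yaws(180.0, 4)
pairs = revisit_pairs(loop)
```

Each pair names two frames that should show the same content if memory holds, so a return score can compare the frame at the second index against the frame at the first.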

Dynamic SpatialVID

Training and inference wrappers are public; the dynamic eval protocol is not yet released (TODO).

Three-branch evaluation summary: replay health vs. return memory under the same stack.

05 · Qualitative Evidence

Return probes expose identity drift.

Qualitative panels follow a simple diagnostic: first frame → leave the view → revisit tail. We compare whether memory restores the same object, pose, background, and camera geometry — not merely a plausible new scene.

Representative memory comparisons across variants.

Static Replay

Static Compression r=4 replay — Compression r = 4

Static Spatial Memory replay — Spatial Memory

Static legacy VideoSSM replay — Legacy Hybrid

Static Block-wise SSM replay — Block-wise SSM

SpatialVID Replay

Dynamic Context K=1 replay — Context K=1

Dynamic Context K=5 replay — Context K=5

Dynamic Context K=20 replay — Context K=20

Dynamic Spatial Memory replay — Spatial Memory

Dynamic legacy VideoSSM replay — Legacy Hybrid

Dynamic Block-wise SSM replay — Block-wise SSM

Dynamic previews use one randomly selected training scene replayed with the same first frame, prompt, and GT camera trajectory across all six rows.

06 · Main Conclusions

Replay quality ≠ memory quality.

Replay metrics and return probes do not always agree — a model can look sharp on reconstruction yet fail when the camera returns. Rankings reorder once identity under revisit is measured.

Raw context — more history helps open-domain return more than replay alone.
Compression — compact tokens can preserve replay while losing identity on return.
Spatial vs. SSM — explicit state and block-wise SSM trade off layout carry and long-horizon stability.
Takeaway — treat replay as a health check, not the final memory benchmark.
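
The rank reordering described above can be made concrete with a small sketch. The variant names and scores below are invented toy values, not results from the paper, and the helper names are hypothetical; the point is only to show how a ranking under a replay metric can differ from a ranking under a return metric.

```python
from typing import Dict


def ranks(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, int]:
    """Rank 1 is best. Ties are broken by input order; no tie handling beyond that."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shift(replay: Dict[str, float], revisit: Dict[str, float]) -> Dict[str, int]:
    """Replay rank minus revisit rank: positive means the variant moves up on revisit."""
    replay_ranks = ranks(replay)
    revisit_ranks = ranks(revisit)
    return {name: replay_ranks[name] - revisit_ranks[name] for name in replay}


# Toy numbers for illustration only (NOT measurements from Echo-Memory).
toy_replay = {"variant_a": 0.9, "variant_b": 0.8, "variant_c": 0.7}
toy_revisit = {"variant_a": 0.3, "variant_b": 0.6, "variant_c": 0.5}

shift = rank_shift(toy_replay, toy_revisit)
```

In this toy case `variant_a` is best on replay but last on revisit, giving a shift of -2, while the other two variants each move up one place.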

Replay vs revisit metrics: rank shift from replay to return; replay is not the final memory score.

07 · News & Roadmap

Release notes and next steps.

News

2026/06/13
SpatialVID support added: dynamic training/inference recipes and 5-second first-chunk replay previews; dynamic eval is still TODO.
2026/06/06
Echo-Memory released: paper, project page, public code, replay/revisit eval assets, and baseline checkpoints.

Roadmap

Models

Wan 2.1 1.3B backbone and training recipes
Four memory families — Context, Compression, Spatial, State-Space
Dynamic training pool — SpatialVID subset export & settings
Paper checkpoints — Echo-Team/Echo-Memory
Wan 2.2 + multi-scale 5B / 14B

Eval

Dynamic eval beyond static replay/revisit
More revisit probes and scoring presets

08 · Citation

BibTeX

Echo-Memory: A Controlled Study of Memory in Action World Models (June 2026). Licensed under CC BY 4.0. Cite the arXiv preprint below.

Source: arXiv
arXiv ID: 2606.09803
PDF: arxiv.org/pdf/2606.09803
License: CC BY 4.0

@article{king2026echomemory,
  title={Echo-Memory: A Controlled Study of Memory in Action World Models},
  author={King, Wayne and Xue, Zeyue and Bian, Yuxuan and Huang, Jie and Li, Haoran and Li, Yaowei and Su, Yaofeng and Li, Yuming and Wang, Haoyu and Zhang, Shiyi and Zhang, Songchun and Niu, Yuwei and Xu, Sihan and Zhuang, Junhao and Huang, Haoyang and Duan, Nan},
  journal={arXiv preprint arXiv:2606.09803},
  year={2026},
  month={jun},
  eprint={2606.09803},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.09803}
}
