Echo Team · Joy Future Academy, JD · June 2026 · CC BY 4.0

Echo-Memory
A Controlled Study of Memory in Action World Models

When the camera leaves and returns, which memory keeps the same world instead of a plausible but different scene?

Controlled memory ablations on a shared Wan action-to-video stack — reproducible rows, evaluation scripts, and qualitative revisit panels.

Authors & affiliations

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

HKU · Joy Future Academy, JD · CUHK · PKU · Fudan · Tsinghua · HKUST · UMich

01 · Overview

One backbone, one protocol — only memory changes.

Echo-Memory holds the video backbone, training recipe, and data protocol fixed, and swaps only the memory module. The goal is to separate replay fidelity from return memory when the camera leaves and comes back to the same place.

  • Shared stack — chunk-wise action-conditioned world generation on Wan.
  • Controlled variable — Context, Compression, Spatial, or State-Space memory.
  • Three probes — replay metrics, in-domain 180° loop, open-domain edited return.
  • Release — ablation scripts, GT replay, revisit assets, and paper-aligned figures.
Figure: Echo-Memory framework overview. Controlled memory study over chunk-wise action-world generation.

02 · Memory Design

Context · Compression · Spatial · State-Space

All variants plug into the same write–read interface; we only change what is stored and how history is retrieved. A no-memory I2V floor re-generates from the first frame as a lower bound.

Context

Raw recent frames at K = 1, 5, or 20 chunks — tests whether longer windows alone stop drift.

Compression

Learned compact tokens at ratio r = 4 — history without growing raw-frame storage.

Spatial

Explicit spatial read/write state — targets layout, object pose, and viewpoint carry.

State-Space

Block-wise SSM updates — recurrent carry beyond short context windows on revisit.

Figure: Memory design matrix. Four memory families under a shared write–read interface.
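The shared write–read interface can be pictured with a minimal Python sketch. Every name here (`Memory`, `write`, `read`, `ContextMemory`, `rollout`) is a hypothetical illustration and is not taken from the Echo-Memory codebase; chunks are plain lists standing in for latent frames. Only the raw-context variant is shown, since it is fully specified by its window size K.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, List

Chunk = List[float]  # stand-in for one chunk of latent frames (assumption)


class Memory(ABC):
    """Hypothetical shared interface: each variant writes chunks and reads history."""

    @abstractmethod
    def write(self, chunk: Chunk) -> None: ...

    @abstractmethod
    def read(self) -> List[Chunk]: ...


class ContextMemory(Memory):
    """Raw-context variant: keeps only the K most recent chunks."""

    def __init__(self, k: int) -> None:
        self.buffer: Deque[Chunk] = deque(maxlen=k)

    def write(self, chunk: Chunk) -> None:
        # deque(maxlen=k) silently drops the oldest chunk once full.
        self.buffer.append(chunk)

    def read(self) -> List[Chunk]:
        return list(self.buffer)


def rollout(memory: Memory, chunks: List[Chunk]) -> List[int]:
    """Read before each write and record how many past chunks were visible."""
    visible = []
    for chunk in chunks:
        visible.append(len(memory.read()))
        memory.write(chunk)
    return visible
```

Under this sketch, swapping `ContextMemory(k=1)` for `ContextMemory(k=20)` changes only what `read()` returns, which mirrors the idea that the backbone and protocol stay fixed while the memory varies.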

03 · Checkpoints

Paper baselines on Hugging Face

Wan 2.1 1.3B memory rows — epoch-0, 30,000 steps, static in-domain pool. All weights: Echo-Team/Echo-Memory

Family | Paper row | HF path | Steps
Raw context | Context K=1 | context_k1/epoch-0.safetensors | 30,000
Raw context | Context K=20 | context_k20/epoch-0.safetensors | 30,000
Spatial | Spatial Memory | spatial_mem/epoch-0.safetensors | 30,000
State-space | Block-wise SSM | block_wise_ssm_two_chunk/epoch-0.safetensors | 30,000
State-space | Legacy Hybrid | videossm_hybrid/epoch-0.safetensors | 30,000
Spatial | concat text (abl.) | spatial_concat_text_two_chunk/epoch-0.safetensors | 30,000
Spatial | inject none (abl.) | spatial_inject_none_two_chunk/epoch-0.safetensors | 30,000
Spatial | cross-attn t32 (abl.) | spatial_cross_attn_readout_t32_g4_two_chunk/epoch-0.safetensors | 30,000
State-space | SSM ctx1/e4/h21 | ssm_ablation_ctx1_every4_hint21/epoch-0.safetensors | 30,000
State-space | SSM ctx5/e1/h21 | ssm_ablation_ctx5_every1_hint21/epoch-0.safetensors | 30,000
State-space | SSM ctx5/e4/h81 | ssm_ablation_ctx5_every4_hint81/epoch-0.safetensors | 30,000

Download

huggingface-cli download Echo-Team/Echo-Memory spatial_mem/epoch-0.safetensors --local-dir ./ckpts
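The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the helper names `checkpoint_filename` and `download_row` are hypothetical and exist only for this illustration.

```python
from pathlib import Path

REPO_ID = "Echo-Team/Echo-Memory"  # repository named on this page


def checkpoint_filename(row_folder: str) -> str:
    """Build the in-repo path for a row: <row folder>/epoch-0.safetensors."""
    return f"{row_folder}/epoch-0.safetensors"


def download_row(row_folder: str, local_dir: str = "./ckpts") -> Path:
    """Download one row's weights; needs huggingface_hub and network access."""
    # Imported lazily so checkpoint_filename() works without the package.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(row_folder),
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_row("spatial_mem")` should place the file under `./ckpts/spatial_mem/`, keeping the row folder in the local path as the CLI command does.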

In-domain eval (Echo-Memory repo)

export WAN_BASE_MODEL=/path/to/Wan2.1-T2V-1.3B
export DATASET_BASE_PATH=data/Context-as-Memory-Dataset
export CKPT=./ckpts/spatial_mem/epoch-0.safetensors
bash eval/v2/run_static_consistency_loop_and_revisit.sh

Keep the row folder in CKPT: env/memory_baseline_runtime.py infers memory flags from the path. Full index: doc/checkpoints.md.
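To see why the row folder matters, consider a hypothetical check that reads the folder directly above the weights file. This is not the logic in `env/memory_baseline_runtime.py`, which may differ; `KNOWN_ROWS` and `row_from_ckpt` are illustrative names, and the folder list is copied from the checkpoint table on this page.

```python
from pathlib import PurePosixPath

# Row folders listed in the checkpoint table on this page.
KNOWN_ROWS = {
    "context_k1",
    "context_k20",
    "spatial_mem",
    "block_wise_ssm_two_chunk",
    "videossm_hybrid",
    "spatial_concat_text_two_chunk",
    "spatial_inject_none_two_chunk",
    "spatial_cross_attn_readout_t32_g4_two_chunk",
    "ssm_ablation_ctx1_every4_hint21",
    "ssm_ablation_ctx5_every1_hint21",
    "ssm_ablation_ctx5_every4_hint81",
}


def row_from_ckpt(ckpt: str) -> str:
    """Return the folder directly above the weights file, or raise if unknown."""
    row = PurePosixPath(ckpt).parent.name
    if row not in KNOWN_ROWS:
        raise ValueError(
            f"cannot infer memory row from {ckpt!r}; keep the row folder in the path"
        )
    return row
```

With this sketch, `./ckpts/spatial_mem/epoch-0.safetensors` resolves to `spatial_mem`, while a flattened path such as `./epoch-0.safetensors` raises an error because the row name has been lost.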

04 · Evaluation

Replay · In-domain revisit · Open-domain return

Each branch asks a different question: Can the model reconstruct the past? Can it close a loop in-domain? After an edited first frame, does it return to the same world?

Replay

PSNR, SSIM, LPIPS on chunk-wise reconstruction — measures short-horizon pixel fidelity.
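As a reference point for the replay metrics, PSNR is 10·log10(MAX² / MSE) between a ground-truth frame and a generated frame. The sketch below uses only the standard library and treats each frame as a flat list of pixel values in [0, 1]; the per-chunk averaging in `chunkwise_psnr` is an assumption for illustration and may not match the aggregation used in the released evaluation scripts.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # flattened pixel values (assumption: range [0, max_value])


def psnr(reference: Frame, generated: Frame, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def chunkwise_psnr(
    ref_chunks: Sequence[Sequence[Frame]],
    gen_chunks: Sequence[Sequence[Frame]],
    max_value: float = 1.0,
) -> List[float]:
    """Mean per-frame PSNR inside each chunk; returns one value per chunk."""
    results = []
    for ref_chunk, gen_chunk in zip(ref_chunks, gen_chunks):
        values = [psnr(r, g, max_value) for r, g in zip(ref_chunk, gen_chunk)]
        results.append(sum(values) / len(values))
    return results
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of about 20 dB.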

In-domain

180° trajectory loop closure with VLM-assisted scoring on held-out layouts.

Open-domain

Edited first frames and 45° return probes — stresses object identity and scene persistence.

Figure: Three-branch evaluation summary. Replay health vs. return memory under the same stack.

05 · Qualitative Evidence

Return probes expose identity drift.

Qualitative panels follow a simple diagnostic: first frame → leave the view → revisit tail. We compare whether memory restores the same object, pose, background, and camera geometry — not merely a plausible new scene.

Figure: Representative memory comparisons across variants.

06 · Main Conclusions

Replay quality ≠ memory quality.

Replay metrics and return probes do not always agree — a model can look sharp on reconstruction yet fail when the camera returns. Rankings reorder once identity under revisit is measured.

  • Raw context — more history helps open-domain return more than replay alone.
  • Compression — compact tokens can preserve replay while losing identity on return.
  • Spatial vs. SSM — explicit state and block-wise SSM trade off layout carry and long-horizon stability.
  • Takeaway — treat replay as a health check, not the final memory benchmark.
Figure: Replay vs. revisit metrics. Rank shift from replay to return: replay is not the final memory score.
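The rank shift between replay and return can be computed with a few lines of Python. The sketch below is a generic illustration: the function names are hypothetical, both scores are assumed to be higher-is-better, and the variant names and numbers in the usage note are placeholders, not results from the report.

```python
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Order variant names from best to worst, assuming higher is better."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def rank_shift(replay: Dict[str, float], revisit: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) when moving from the
    replay ranking to the revisit ranking."""
    replay_rank = {name: i for i, name in enumerate(ranking(replay))}
    revisit_rank = {name: i for i, name in enumerate(ranking(revisit))}
    return {name: replay_rank[name] - revisit_rank[name] for name in replay}


# Placeholder scores for illustration only (not measurements from the report).
example_replay = {"variant_a": 0.9, "variant_b": 0.8, "variant_c": 0.7}
example_revisit = {"variant_a": 0.4, "variant_b": 0.9, "variant_c": 0.6}
example_shift = rank_shift(example_replay, example_revisit)
```

With these placeholder numbers, `variant_a` leads on replay but falls two places on revisit, which is the kind of reordering the conclusion describes.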

07 · News & Roadmap

Release notes and next steps.

News

  • Bilingual project page (EN / 中文) and Developer Guide released.

  • Report on ResearchGate (CC BY 4.0) and project page released.

  • Paper baseline checkpoints released on Hugging Face — Echo-Team/Echo-Memory (Wan 2.1 1.3B, epoch-0, 30k steps).

  • Code release: Wan 2.1 1.3B memory ablations, replay/revisit eval, eval/v2/revisit_suite/, and assets/opendomain_revisit/.

Roadmap

Models

  • Wan 2.1 1.3B backbone and training recipes
  • Four memory families — Context, Compression, Spatial, State-Space
  • Dynamic training pool — SpatialVID subset export & settings
  • Paper checkpoints: Echo-Team/Echo-Memory
  • Wan 2.2 + multi-scale 5B / 14B

Eval

  • Dynamic eval beyond static replay/revisit
  • More revisit probes and scoring presets

Community

Join the Echo-Memory WeChat group for release updates, checkpoint questions, and maintainer coordination.

Echo-Memory WeChat discussion group QR code · scan to join (QR refreshes periodically)

08 · Citation

BibTeX

Echo-Memory: A Controlled Study of Memory in Action World Models (June 2026). Licensed under CC BY 4.0.

Source: ResearchGate · License: CC BY 4.0
@article{king2026echomemory,
  title={Echo-Memory: A Controlled Study of Memory in Action World Models},
  author={King, Wayne and Xue, Zeyue and Bian, Yuxuan and Huang, Jie and Li, Haoran and Li, Yaowei and Su, Yaofeng and Li, Yuming and Wang, Haoyu and Zhang, Shiyi and Zhang, Songchun and Niu, Yuwei and Xu, Sihan and Zhuang, Junhao and Huang, Haoyang and Duan, Nan},
  journal={Echo-Memory technical report},
  publisher={ResearchGate},
  year={2026},
  month={jun},
  doi={10.13140/RG.2.2.19906.34248},
  url={https://doi.org/10.13140/RG.2.2.19906.34248},
  note={Licensed under CC BY 4.0}
}