When the camera leaves and returns, which memory keeps the same world instead of a plausible but different scene?
Controlled memory ablations on a shared Wan action-to-video stack — reproducible rows, evaluation scripts, and qualitative revisit panels.
01 · Overview
Echo-Memory holds the video backbone, training recipe, and data protocol fixed, and swaps only the memory module. The goal is to separate replay fidelity from return memory when the camera leaves and comes back to the same place.
02 · Memory Design
All variants plug into the same write–read interface; we only change what is stored and how history is retrieved. A no-memory I2V floor re-generates from the first frame as a lower bound.
Raw recent frames at K = 1, 5, or 20 chunks — tests whether longer windows alone stop drift.
Learned compact tokens at ratio r = 4 — history without growing raw-frame storage.
Explicit spatial read/write state — targets layout, object pose, and viewpoint carry.
Block-wise SSM updates — recurrent carry beyond short context windows on revisit.
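The shared write/read interface can be sketched as follows. This is a minimal illustration, not the repository's code: the class names `Memory`, `NoMemory`, and `RawContext`, the `Chunk` type, and the `rollout` helper are all hypothetical, and `NoMemory` only stands in for the no-memory floor by returning an empty history. Only the raw-context variant is shown, since its behavior (keep the last K chunks) follows directly from the description above.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, List, Sequence

Chunk = Sequence[float]  # stand-in for one chunk of latent frames (assumption)


class Memory(ABC):
    """Hypothetical shared interface: every variant implements write() and read()."""

    @abstractmethod
    def write(self, chunk: Chunk) -> None:
        """Store information about a newly generated chunk."""

    @abstractmethod
    def read(self) -> List[Chunk]:
        """Return the history handed to the backbone for the next chunk."""


class NoMemory(Memory):
    """Stand-in for the lower-bound floor: nothing is carried between chunks."""

    def write(self, chunk: Chunk) -> None:
        pass

    def read(self) -> List[Chunk]:
        return []


class RawContext(Memory):
    """Raw recent frames: keep only the last K chunks."""

    def __init__(self, k: int) -> None:
        self.buffer: Deque[Chunk] = deque(maxlen=k)

    def write(self, chunk: Chunk) -> None:
        # Once K chunks are stored, appending evicts the oldest one.
        self.buffer.append(chunk)

    def read(self) -> List[Chunk]:
        return list(self.buffer)


def rollout(memory: Memory, chunks: Sequence[Chunk]) -> List[int]:
    """Record how many history chunks are visible before each write."""
    visible = []
    for chunk in chunks:
        visible.append(len(memory.read()))
        memory.write(chunk)
    return visible


if __name__ == "__main__":
    stream = [[float(i)] for i in range(8)]
    for k in (1, 5, 20):
        print(k, rollout(RawContext(k), stream))
```

Because every variant exposes the same two methods, the rollout loop never needs to know which memory it is driving, which is the property that lets an ablation swap the module while holding everything else fixed.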
03 · Checkpoints
Wan 2.1 1.3B memory rows — epoch-0, 30,000 steps, static in-domain pool. All weights: Echo-Team/Echo-Memory
| Family | Paper row | HF path | Steps |
|---|---|---|---|
| Raw context | Context K=1 | context_k1/epoch-0.safetensors | 30,000 |
| Raw context | Context K=20 | context_k20/epoch-0.safetensors | 30,000 |
| Spatial | Spatial Memory | spatial_mem/epoch-0.safetensors | 30,000 |
| State-space | Block-wise SSM | block_wise_ssm_two_chunk/epoch-0.safetensors | 30,000 |
| State-space | Legacy Hybrid | videossm_hybrid/epoch-0.safetensors | 30,000 |
| Spatial | concat text (abl.) | spatial_concat_text_two_chunk/epoch-0.safetensors | 30,000 |
| Spatial | inject none (abl.) | spatial_inject_none_two_chunk/epoch-0.safetensors | 30,000 |
| Spatial | cross-attn t32 (abl.) | spatial_cross_attn_readout_t32_g4_two_chunk/epoch-0.safetensors | 30,000 |
| State-space | SSM ctx1/e4/h21 | ssm_ablation_ctx1_every4_hint21/epoch-0.safetensors | 30,000 |
| State-space | SSM ctx5/e1/h21 | ssm_ablation_ctx5_every1_hint21/epoch-0.safetensors | 30,000 |
| State-space | SSM ctx5/e4/h81 | ssm_ablation_ctx5_every4_hint81/epoch-0.safetensors | 30,000 |
Download
huggingface-cli download Echo-Team/Echo-Memory spatial_mem/epoch-0.safetensors --local-dir ./ckpts
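For fetching several rows at once, the same download could be scripted with the `huggingface_hub` Python package. The sketch below is an assumption-laden convenience, not part of the release: `checkpoint_filename` and `download_rows` are hypothetical helpers, and the `<row>/epoch-0.safetensors` pattern is taken from the HF path column of the table above.

```python
from pathlib import Path
from typing import Iterable, List

REPO_ID = "Echo-Team/Echo-Memory"


def checkpoint_filename(row_folder: str) -> str:
    """Repo-relative path of a row's weights, following the table's HF path column."""
    return f"{row_folder}/epoch-0.safetensors"


def download_rows(row_folders: Iterable[str], local_dir: str = "./ckpts") -> List[Path]:
    """Download each row's weights into local_dir and return the local paths."""
    # Imported lazily so checkpoint_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    paths = []
    for row in row_folders:
        local_path = hf_hub_download(
            repo_id=REPO_ID,
            filename=checkpoint_filename(row),
            local_dir=local_dir,
        )
        paths.append(Path(local_path))
    return paths
```

Calling `download_rows(["context_k1", "spatial_mem"])` should place each file under `./ckpts/<row>/epoch-0.safetensors`, since `hf_hub_download` keeps the repo-relative path beneath `local_dir`.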
In-domain eval (Echo-Memory repo)
export WAN_BASE_MODEL=/path/to/Wan2.1-T2V-1.3B
export DATASET_BASE_PATH=data/Context-as-Memory-Dataset
export CKPT=./ckpts/spatial_mem/epoch-0.safetensors
bash eval/v2/run_static_consistency_loop_and_revisit.sh
Keep the row folder in CKPT — env/memory_baseline_runtime.py infers memory flags from the path.
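To see why the row folder matters, consider a path-based lookup of the following kind. This is only an illustration of the idea and does not reproduce the actual logic in env/memory_baseline_runtime.py: the `ROW_FAMILY` mapping and both functions are hypothetical, with row folders and family labels copied from the checkpoint table.

```python
from pathlib import PurePosixPath

# Row folders and family labels copied from the checkpoint table (subset).
ROW_FAMILY = {
    "context_k1": "Raw context",
    "context_k20": "Raw context",
    "spatial_mem": "Spatial",
    "block_wise_ssm_two_chunk": "State-space",
    "videossm_hybrid": "State-space",
}


def row_from_ckpt(ckpt: str) -> str:
    """Return the row folder, i.e. the directory that directly contains the weights."""
    return PurePosixPath(ckpt).parent.name


def family_from_ckpt(ckpt: str) -> str:
    """Look up the memory family for a checkpoint path, failing loudly if unknown."""
    row = row_from_ckpt(ckpt)
    if row not in ROW_FAMILY:
        raise ValueError(f"cannot infer memory variant from row folder {row!r}")
    return ROW_FAMILY[row]


if __name__ == "__main__":
    print(family_from_ckpt("./ckpts/spatial_mem/epoch-0.safetensors"))
```

Under this scheme, a checkpoint moved to `./ckpts/epoch-0.safetensors` would resolve to the folder `ckpts`, which matches no row, so the variant could no longer be inferred.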
Full index: doc/checkpoints.md.
04 · Evaluation
Each branch asks a different question: Can the model reconstruct the past? Can it close a loop in-domain? After an edited first frame, does it return to the same world?
PSNR, SSIM, LPIPS on chunk-wise reconstruction — measures short-horizon pixel fidelity.
180° trajectory loop closure with VLM-assisted scoring on held-out layouts.
Edited first frames and 45° return probes — stresses object identity and scene persistence.
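As a reference point for the replay branch, chunk-wise PSNR can be computed as in the sketch below. It uses the standard definition PSNR = 10 · log10(MAX² / MSE). The assumptions here are that each chunk is flattened to a sequence of pixel values in [0, 1] and that one score is reported per chunk; the function names are illustrative and this is not the repository's evaluation script. SSIM and LPIPS need dedicated implementations and are not shown.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def chunkwise_psnr(
    pred_chunks: Sequence[Sequence[float]],
    target_chunks: Sequence[Sequence[float]],
    max_val: float = 1.0,
) -> List[float]:
    """One PSNR value per chunk, so fidelity can be tracked along the rollout."""
    if len(pred_chunks) != len(target_chunks):
        raise ValueError("prediction and target must have the same number of chunks")
    return [psnr(p, t, max_val) for p, t in zip(pred_chunks, target_chunks)]
```

Reporting the score per chunk rather than as a single average makes drift visible: a model whose PSNR falls steadily across chunks is losing fidelity over the horizon even if its mean looks acceptable.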
05 · Qualitative Evidence
Qualitative panels follow a simple diagnostic: first frame → leave the view → revisit tail. We compare whether memory restores the same object, pose, background, and camera geometry — not merely a plausible new scene.
06 · Main Conclusions
Replay metrics and return probes do not always agree — a model can look sharp on reconstruction yet fail when the camera returns. Rankings reorder once identity under revisit is measured.
07 · News & Roadmap
Bilingual project page (EN / 中文) and Developer Guide released.
Report on ResearchGate (CC BY 4.0) and project page released.
Paper baseline checkpoints released on Hugging Face — Echo-Team/Echo-Memory (Wan 2.1 1.3B, epoch-0, 30k steps).
Code release: Wan 2.1 1.3B memory ablations, replay/revisit eval, eval/v2/revisit_suite/, and assets/opendomain_revisit/.
Join the Echo-Memory WeChat group for release updates, checkpoint questions, and maintainer coordination.
08 · Citation
Echo-Memory: A Controlled Study of Memory in Action World Models (June 2026). Licensed under CC BY 4.0. BibTeX entries for the ResearchGate and arXiv versions follow.
@article{king2026echomemory,
title={Echo-Memory: A Controlled Study of Memory in Action World Models},
author={King, Wayne and Xue, Zeyue and Bian, Yuxuan and Huang, Jie and Li, Haoran and Li, Yaowei and Su, Yaofeng and Li, Yuming and Wang, Haoyu and Zhang, Shiyi and Zhang, Songchun and Niu, Yuwei and Xu, Sihan and Zhuang, Junhao and Huang, Haoyang and Duan, Nan},
journal={Echo-Memory technical report},
publisher={ResearchGate},
year={2026},
month={jun},
doi={10.13140/RG.2.2.19906.34248},
url={https://doi.org/10.13140/RG.2.2.19906.34248},
note={Licensed under CC BY 4.0}
}
@article{king2026echomemory,
title={Echo-Memory: A Controlled Study of Memory in Action World Models},
author={King, Wayne and Xue, Zeyue and Bian, Yuxuan and Huang, Jie and Li, Haoran and Li, Yaowei and Su, Yaofeng and Li, Yuming and Wang, Haoyu and Zhang, Shiyi and Zhang, Songchun and Niu, Yuwei and Xu, Sihan and Zhuang, Junhao and Huang, Haoyang and Duan, Nan},
journal={arXiv preprint arXiv:xxxx.xxxxx},
year={2026},
month={jun},
eprint={xxxx.xxxxx},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/TBD},
note={Licensed under CC BY 4.0. Replace xxxx.xxxxx when posted.}
}