TL;DR. Multi-scale granularity pruning of SCM attention — at token, block, and chain levels — yields a 9.7× end-to-end speedup with negligible quality loss.

How it works

Turbo4DGen identifies and removes redundant computations at multi-scale granularity in the SCM attention mechanism — at the token, block, and chain levels — through a rolling cache and an adaptive bypassing mechanism, while maintaining high generation quality. To the best of our knowledge, this is the first systematic framework for accelerating 4D generation.

Figure 1. System overview. Turbo4DGen wraps the SCM attention pipeline with caching and bypassing modules acting at three levels of granularity.
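
As a rough illustration of how a rolling cache can be combined with adaptive bypassing, the sketch below wraps an expensive computation and reuses a recent output whenever the new input is close enough to a cached one. This is not the Turbo4DGen implementation. The class name `RollingCacheBypass`, the relative-change test, the `threshold` and `window` parameters, and the toy `expensive_block` function are all assumptions made for illustration, and the sketch works on plain lists rather than at the token, block, or chain level of SCM attention.

```python
from collections import deque
from typing import Callable, List, Sequence

Vector = List[float]


def relative_change(a: Sequence[float], b: Sequence[float]) -> float:
    """Relative L1 distance of `a` from the reference vector `b`."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0  # guard against an all-zero reference
    return num / den


class RollingCacheBypass:
    """Hypothetical wrapper: skip `compute` when the input barely changed.

    The cache is "rolling" in the sense that it keeps only the `window`
    most recent (input, output) pairs; older entries are evicted.
    """

    def __init__(self, compute: Callable[[Vector], Vector],
                 threshold: float = 0.05, window: int = 2) -> None:
        self.compute = compute
        self.threshold = threshold          # assumed similarity tolerance
        self.cache = deque(maxlen=window)   # recent (input, output) pairs
        self.computed = 0
        self.bypassed = 0

    def __call__(self, x: Vector) -> Vector:
        # Check the most recent entries first.
        for cached_in, cached_out in reversed(self.cache):
            if relative_change(x, cached_in) < self.threshold:
                self.bypassed += 1
                return cached_out           # bypass: reuse the cached output
        out = self.compute(x)               # no close match: compute and cache
        self.cache.append((list(x), out))
        self.computed += 1
        return out


def expensive_block(x: Vector) -> Vector:
    """Toy stand-in for a costly attention block."""
    return [2.0 * v for v in x]


layer = RollingCacheBypass(expensive_block, threshold=0.05, window=2)
steps = [[1.0, 2.0], [1.01, 2.0], [3.0, 4.0], [3.0, 4.02]]
outputs = [layer(step) for step in steps]
print(layer.computed, layer.bypassed)
```

In this toy run the second and fourth inputs are within the tolerance of a cached input, so their outputs are reused instead of recomputed. A looser threshold bypasses more work but reuses staler outputs, which is the speed versus quality trade-off that any such scheme has to tune.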

Performance and efficiency

We report end-to-end latency, speedup, and peak memory across baselines. “” indicates an out-of-memory error; F and V denote the number of generated frames and views, respectively. Methods marked with “*” are not directly applicable to 4D generation, and their code is modified for fair evaluation.

Table 1. Latency, speedup, and peak memory against baselines. Lower latency and memory are better; higher speedup is better.

Comparisons on the iPhone dataset

We compare against TrajectoryCraft on five in-the-wild scenes. Each scene provides an input monocular video, a geometric warp render from a novel camera (incomplete, with holes), and a mask indicating regions to be filled. The diffusion model produces the completed novel-view output conditioned on these signals.

Method            Apple          Block          Paper          Spin           Teddy
TrajectoryCraft   256.1 / 1.0×   263.2 / 1.0×   264.5 / 1.0×   296.7 / 1.0×   276.6 / 1.0×
Ours              52.3 / 4.9×    51.6 / 5.1×    56.3 / 4.7×    57.1 / 5.2×    57.6 / 4.8×
Table 2. Per-scene latency (seconds) / speedup over TrajectoryCraft on identical hardware.
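
The speedup entries in Table 2 follow directly from the latencies: each is the baseline latency divided by our latency for the same scene. The short check below recomputes them from the reported numbers. The variable and function names are illustrative only, and the aggregate ratio of total latencies is an extra quantity computed here that the table does not report.

```python
# Latencies in seconds, copied from Table 2.
baseline = {"Apple": 256.1, "Block": 263.2, "Paper": 264.5,
            "Spin": 296.7, "Teddy": 276.6}
ours = {"Apple": 52.3, "Block": 51.6, "Paper": 56.3,
        "Spin": 57.1, "Teddy": 57.6}


def speedups(reference: dict, accelerated: dict) -> dict:
    """Per-scene speedup = reference latency / accelerated latency."""
    return {scene: reference[scene] / accelerated[scene] for scene in reference}


per_scene = speedups(baseline, ours)
rounded = {scene: round(value, 1) for scene, value in per_scene.items()}

# Aggregate speedup over all five scenes (ratio of total latencies).
overall = sum(baseline.values()) / sum(ours.values())

for scene, value in rounded.items():
    print(f"{scene}: {value}x")
print(f"overall: {overall:.2f}x")
```

Rounded to one decimal place, the recomputed values agree with the speedups listed in the table.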
Panels: input monocular video; TrajectoryCraft (baseline); ours.
Figure 3. Per-scene comparison on the iPhone dataset. Our method matches the baseline's qualitative fidelity at roughly 5× the speed (4.7× to 5.2× per scene, Table 2).

Comparisons on the Objaverse-Dy-4D dataset

We compare against the diffusion baseline on six dynamic objects from Objaverse-Dy-4D. Each clip shows synthesized novel-view trajectories over time at the indicated azimuth.

Panels: input; baseline; ours.
Figure 4. Per-scene comparison on Objaverse-Dy-4D. Our method preserves the baseline's visual fidelity at a fraction of the latency.

BibTeX

turbo4dgen.bib
@inproceedings{man2026turbo4dgen,
  title     = {Turbo4DGen: Ultra-Fast Acceleration for 4D Generation},
  author    = {Man, Yuanbin and Huang, Ying and Ren, Zhile and Yin, Miao},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}