Turbo4DGen: Ultra-Fast Acceleration for 4D Generation

TL;DR. Pruning redundant computation in SCM attention at three granularities (token, block, and chain) yields a 9.7× end-to-end speedup with negligible quality loss.

How it works

Turbo4DGen identifies and removes redundant computation in the SCM attention mechanism at three granularities: the token, block, and chain levels. It does so through a rolling cache and an adaptive bypassing mechanism while maintaining high generation quality. To the best of our knowledge, this is the first systematic framework for accelerating 4D generation.
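To make the idea of a rolling cache combined with adaptive bypassing concrete, here is a minimal, illustrative sketch at a single granularity (one block). It is not the Turbo4DGen implementation: the class name `RollingBypassCache`, the relative-change criterion, the `window` size, and the `threshold` value are all assumptions chosen for illustration, since this page does not specify how bypass decisions are made.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Vector = List[float]


def relative_change(a: Vector, b: Vector) -> float:
    """L1 difference between a and b, relative to the L1 norm of b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den


class RollingBypassCache:
    """Hypothetical wrapper: skip an expensive block when its input is
    close to a recently seen input, and reuse the cached output instead."""

    def __init__(self, compute: Callable[[Vector], Vector],
                 window: int = 2, threshold: float = 0.05) -> None:
        self.compute = compute
        self.threshold = threshold  # assumed bypass criterion
        # Rolling cache: only the `window` most recent (input, output)
        # pairs are kept; older entries are evicted automatically.
        self.history: Deque[Tuple[Vector, Vector]] = deque(maxlen=window)
        self.computed = 0
        self.bypassed = 0

    def __call__(self, x: Vector) -> Vector:
        # Check the newest cached entries first.
        for cached_in, cached_out in reversed(self.history):
            if relative_change(x, cached_in) < self.threshold:
                self.bypassed += 1
                return cached_out
        out = self.compute(x)
        self.computed += 1
        self.history.append((list(x), out))
        return out


def expensive_block(x: Vector) -> Vector:
    """Stand-in for a costly attention block (assumption)."""
    return [2.0 * v for v in x]


cached = RollingBypassCache(expensive_block, window=2, threshold=0.05)
steps = [[1.0, 2.0], [1.01, 2.0], [3.0, 4.0], [3.0, 4.01]]
outputs = [cached(x) for x in steps]
print(cached.computed, cached.bypassed)
```

In this toy run the second and fourth inputs differ only slightly from their predecessors, so the block is evaluated twice and bypassed twice. A real system would have to choose the similarity measure and threshold so that reused outputs do not degrade generation quality.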

Figure 1. System overview. Turbo4DGen wraps the SCM attention pipeline with caching and bypassing modules acting at three levels of granularity.

Performance and efficiency

We report end-to-end latency, speedup, and peak memory across baselines. “✗” indicates an out-of-memory error; F and V denote the number of generated frames and views, respectively. Methods marked with “*” are not directly applicable to 4D generation, and their code is modified for fair evaluation.

Comparisons on the iPhone dataset

We compare against TrajectoryCraft on five in-the-wild scenes. Each scene provides an input monocular video, a geometric warp render from a novel camera (incomplete, with holes), and a mask indicating regions to be filled. The diffusion model produces the completed novel-view output conditioned on these signals.

Method	Apple	Block	Paper	Spin	Teddy
TrajectoryCraft	256.1 / 1.0×	263.2 / 1.0×	264.5 / 1.0×	296.7 / 1.0×	276.6 / 1.0×
Ours	52.3 / 4.9×	51.6 / 5.1×	56.3 / 4.7×	57.1 / 5.2×	57.6 / 4.8×

Table 2. Per-scene latency (seconds) / speedup over TrajectoryCraft on identical hardware.
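As a quick sanity check on Table 2, the speedup column can be recomputed from the reported latencies. The snippet below is only an arithmetic sketch: the variable names are ours, and the mean and aggregate figures are derived here rather than reported on this page.

```python
# Latencies in seconds copied from Table 2: (TrajectoryCraft, Ours).
latency_s = {
    "Apple": (256.1, 52.3),
    "Block": (263.2, 51.6),
    "Paper": (264.5, 56.3),
    "Spin": (296.7, 57.1),
    "Teddy": (276.6, 57.6),
}

# Per-scene speedup = baseline latency / our latency, to one decimal place.
speedups = {
    scene: round(baseline / ours, 1)
    for scene, (baseline, ours) in latency_s.items()
}

# Unrounded mean of the per-scene ratios (derived, not from the table).
mean_speedup = sum(b / o for b, o in latency_s.values()) / len(latency_s)

# Ratio of total latencies over all five scenes (derived, not from the table).
total_baseline = sum(b for b, _ in latency_s.values())
total_ours = sum(o for _, o in latency_s.values())
aggregate_speedup = total_baseline / total_ours

print(speedups)
print(f"mean speedup: {mean_speedup:.2f}x")
print(f"aggregate speedup: {aggregate_speedup:.2f}x")
```

The recomputed per-scene values agree with the table, and both summary figures come out just under 5×, consistent with the "roughly 5×" claim for this dataset.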

[Figure 3 panels. Scenes: Apple, Block, Paper, Spin, Teddy. Per scene: input monocular video, TrajectoryCraft (baseline), Ours.]

Figure 3. Per-scene comparison on the iPhone dataset. Our method matches the baseline's qualitative fidelity at roughly 5× the speed.

Comparisons on the Objaverse-Dy-4D dataset

We compare against the diffusion baseline on six dynamic objects from Objaverse-Dy-4D. Each clip shows synthesized novel-view trajectories over time at the indicated azimuth.

[Figure 4 panels. Scenes 1 to 6. Per scene: Input, Baseline, Ours.]

Figure 4. Per-scene comparison on Objaverse-Dy-4D. Our method preserves the baseline's visual fidelity at a fraction of the latency.

BibTeX

@inproceedings{man2026turbo4dgen,
  title     = {Turbo4DGen: Ultra-Fast Acceleration for 4D Generation},
  author    = {Man, Yuanbin and Huang, Ying and Ren, Zhile and Yin, Miao},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}