## SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

#### Under Review at ICLR 2023

Anonymous Authors

## Abstract

Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to video prediction on datasets with complex object interactions. Moreover, SlotFormer's unsupervised dynamics model can be used to improve performance on supervised downstream tasks, such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics while retaining high-quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, where it is competitive with methods designed specifically for such tasks.

## Method

SlotFormer architecture overview. Taking multiple video frames $\{x_t\}_{t=1}^T$ as input, we first extract object slots $\{\mathcal{S}_t\}_{t=1}^T$ using a pretrained object-centric model. The slots are then linearly projected and summed with a temporal positional encoding. The resulting tokens are fed to the Transformer module, which generates future slots $\{\hat{\mathcal{S}}_{T+k}\}_{k=1}^K$ in an autoregressive manner.
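The pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a single randomly initialised self-attention layer stands in for the Transformer module, the sinusoidal encoding is only one possible choice of temporal positional encoding, and the names `ToySlotPredictor`, `rollout`, and `history_len`, as well as the slot count and feature sizes, are hypothetical.

```python
import numpy as np

def temporal_pos_enc(num_steps, dim):
    """Sinusoidal encoding with one vector per time step (shared by all slots of that step)."""
    pos = np.arange(num_steps)[:, None]                 # (T, 1)
    i = np.arange(dim)[None, :]                         # (1, D)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (T, D)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

class ToySlotPredictor:
    """Assumed stand-in for the Transformer: one untrained self-attention layer."""

    def __init__(self, slot_dim, model_dim, history_len, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(model_dim)
        self.w_in = rng.normal(0.0, 1.0 / np.sqrt(slot_dim), (slot_dim, model_dim))
        self.w_q = rng.normal(0.0, s, (model_dim, model_dim))
        self.w_k = rng.normal(0.0, s, (model_dim, model_dim))
        self.w_v = rng.normal(0.0, s, (model_dim, model_dim))
        self.w_out = rng.normal(0.0, s, (model_dim, slot_dim))
        self.pos = temporal_pos_enc(history_len, model_dim)

    def __call__(self, slots):
        # slots: (T, N, slot_dim) with T <= history_len
        t, n, _ = slots.shape
        # Linear projection, then add the temporal encoding of each step to all its slots.
        tokens = slots @ self.w_in + self.pos[:t, None, :]     # (T, N, D)
        tokens = tokens.reshape(t * n, -1)                      # time-major token sequence
        q, k, v = tokens @ self.w_q, tokens @ self.w_k, tokens @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))          # all slots attend to all slots
        out = layer_norm(tokens + attn @ v)
        # The tokens of the last time step are read out as the next-step slots.
        return out[-n:] @ self.w_out                            # (N, slot_dim)

def rollout(predictor, burn_in_slots, num_future, history_len):
    """Autoregressive rollout: each prediction is appended and fed back as input."""
    history = list(burn_in_slots)
    preds = []
    for _ in range(num_future):
        window = np.stack(history[-history_len:])               # most recent steps only
        nxt = predictor(window)
        preds.append(nxt)
        history.append(nxt)
    return np.stack(preds)                                      # (K, N, slot_dim)

# Random arrays stand in for slots extracted by the pretrained object-centric model.
burn_in = np.random.default_rng(1).normal(size=(6, 7, 32))      # 6 steps, 7 slots, 32 dims
predictor = ToySlotPredictor(slot_dim=32, model_dim=64, history_len=6)
future = rollout(predictor, burn_in, num_future=44, history_len=6)
```

With 6 burn-in steps and 44 predicted steps, the sketch spans 50 steps in total, which mirrors the OBJ3D rollout setting. Feeding predictions back into a fixed-length window is one simple way to realise the autoregressive generation; the weights here are untrained, so the output only demonstrates shapes and data flow.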

SlotFormer integrated with downstream task models. Given a video and a question about a future event, (left) a vanilla VQA model reasons only over the observed frames and predicts the wrong answer. (right) In contrast, we leverage SlotFormer to simulate accurate future frames, which contain useful information for producing the correct answer. In this way, the unsupervised dynamics knowledge learned by SlotFormer can be transferred to improve downstream supervised tasks.

## Qualitative Results

### OBJ3D

We show prediction results of SlotFormer and baselines on OBJ3D. We use 6 burn-in frames and roll out to 50 frames.

### CLEVRER

We show prediction results of SlotFormer and baselines on CLEVRER. We use 6 burn-in frames and roll out to 48 frames.

### PHYRE

We show prediction results of SlotFormer on PHYRE. All of the rollouts are generated by observing only the first frame. We pause the first frame for 0.5s to show the initial configuration more clearly.

### Physion

We show prediction results of SlotFormer on Physion. As discussed in the paper, the Physion dataset features diverse physical phenomena and complicated visual appearance, such as textured objects, distractors, and diverse backgrounds. We adopt STEVE as the object-centric model, since it can handle visually more complex data than SAVi.

However, STEVE uses a Transformer-based slot decoder to reconstruct the patch tokens produced by a trained dVAE encoder. As a result, even its reconstructed images are of low quality (see the Recon. column). In addition, we do not train SlotFormer with a pixel reconstruction loss on the Physion dataset due to GPU memory constraints. The generated videos are therefore visually unpleasant (see the Rollout column). Nevertheless, they still preserve the correct motion of objects, such as collisions between objects and objects falling to the ground.

#### Columns (left to right): GT, Recon., Rollout, GT, Recon., Rollout

You may wonder whether STEVE really learns a meaningful scene decomposition. Here we show the segmentation masks produced by STEVE, which exhibit a reasonable object segmentation. Empirically, we observe that the image reconstruction loss helps with visual quality (e.g., color, shape) but has little effect on object dynamics. Therefore, a slot reconstruction loss alone, computed in the latent space, is enough to learn the dynamics.
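A slot reconstruction loss in the latent space can be sketched as a mean squared error between predicted slots and target slots. The sketch below makes several assumptions that this page does not confirm: the targets are taken to be the slots that the pretrained object-centric model extracts from the ground-truth future frames, the squared error is summed over the feature dimension and averaged over rollout steps and slots, and the function name `slot_reconstruction_loss` is hypothetical.

```python
import numpy as np

def slot_reconstruction_loss(pred_slots, target_slots):
    """Squared L2 distance per slot, averaged over steps and slots.

    Both inputs have shape (K, N, slot_dim): K rollout steps, N slots per step.
    The normalisation is an assumption made for this sketch.
    """
    pred_slots = np.asarray(pred_slots, dtype=float)
    target_slots = np.asarray(target_slots, dtype=float)
    if pred_slots.shape != target_slots.shape:
        raise ValueError("pred_slots and target_slots must have the same shape")
    sq_dist = np.sum((pred_slots - target_slots) ** 2, axis=-1)   # (K, N)
    return float(sq_dist.mean())

# Toy check: every slot differs from its target by 1 in each of 4 dimensions.
target = np.zeros((2, 3, 4))
pred = np.ones((2, 3, 4))
loss = slot_reconstruction_loss(pred, target)
```

Because this loss never decodes slots back to pixels, it avoids the memory cost of image reconstruction, which is consistent with the motivation given for omitting the pixel loss on Physion.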

## References

[1] Lin, Zhixuan, et al. "Improving generative imagination in object-centric world models." ICML. 2020.

[2] Wang, Yunbo, et al. "PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs." NeurIPS. 2017.

[3] Zoran, Daniel, et al. "PARTS: Unsupervised segmentation with slots, attention and independence maximization." ICCV. 2021.

[4] Yi, Kexin, et al. "CLEVRER: CoLlision Events for Video REpresentation and Reasoning." ICLR. 2020.

[5] Bakhtin, Anton, et al. "PHYRE: A new benchmark for physical reasoning." NeurIPS. 2019.

[6] Bear, Daniel, et al. "Physion: Evaluating Physical Prediction from Vision in Humans and Machines." NeurIPS Datasets and Benchmarks Track. 2021.

[7] Kipf, Thomas, et al. "Conditional Object-Centric Learning from Video." ICLR. 2022.

[8] Singh, Gautam, Yi-Fu Wu, and Sungjin Ahn. "Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos." NeurIPS. 2022.