The proposed VDG. (a) VDG Initialization: uses the off-the-shelf VO networks \(\mathcal{P}(\cdot)\), \(\mathcal{M}(\cdot)\), and \(\mathcal{D}(\cdot)\) to estimate the global poses \(T_t\), motion masks \(M_t\), and depth maps \(D_t\) (see Sec. IV-A.1). Given the poses \(T_t\) and corresponding depth maps \(D_t\), we back-project the depth maps into 3D space to initialize the Gaussian points \(G^k_t = \{\tilde{\mu}^k_t, \Sigma^k, \tilde{\alpha}^k_t, S^k\}\). Note that the velocity \(v\) of each Gaussian is initialized to 0 (see Sec. IV-A.2). (b) VDG Training Procedure: Given the initialized Gaussians \(G^k_t\), we train our VDG with RGB and depth supervision (see Sec. IV-A.3). Moreover, we apply motion-mask supervision to decompose the scene into static and dynamic components (Sec. IV-B). Finally, we adopt a training strategy that refines the VO-given poses \(T_t\) (Sec. IV-C).
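The initialization in (a) amounts to lifting each pixel of \(D_t\) into world space with the camera intrinsics and the VO pose \(T_t\). Below is a minimal sketch of that back-projection, assuming a pinhole camera model with intrinsics \(K\); the function name, tensor layout, and intrinsics are illustrative assumptions, not the paper's actual code.

```python
import torch

def backproject_depth(depth, K, T):
    """Back-project a depth map D_t into world-space 3D points.

    depth: (H, W) depth map D_t from the depth network D(.)
    K:     (3, 3) camera intrinsics (assumed pinhole model)
    T:     (4, 4) camera-to-world pose T_t from the pose network P(.)
    Returns (H*W, 3) world-space points used as initial Gaussian means.
    """
    H, W = depth.shape
    # Build a homogeneous pixel grid (u, v, 1) for every pixel.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    # Lift pixels to camera-space rays, then scale each ray by its depth.
    cam = (torch.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Transform camera-space points into the world frame with T_t.
    cam_h = torch.cat([cam, torch.ones(cam.shape[0], 1)], dim=-1)
    return (T @ cam_h.T).T[:, :3]

# Per the caption, each initialized Gaussian also starts with zero velocity:
# velocity = torch.zeros_like(means)
```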