VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Brain and Artificial Intelligence Lab, Northwestern Polytechnical University
Baidu VIS
Aerial Robotics Group, HKUST

Video (with voiceover)

Abstract

Dynamic Gaussian splatting has led to impressive advances in scene reconstruction and novel-view image synthesis. Existing methods, however, rely heavily on pre-computed poses and Gaussian initialization from Structure-from-Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised visual odometry (VO) into our pose-free dynamic Gaussian method (VDG) to boost pose and depth initialization and static-dynamic decomposition. Moreover, VDG works with RGB image input only and reconstructs dynamic scenes faster, and at larger scale, than pose-free dynamic view-synthesis methods. We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over state-of-the-art dynamic view synthesis methods.

Our proposed VDG is designed to reconstruct large, dynamic urban scenes and to predict camera poses effectively and uniformly from image input alone. Here, we showcase our reconstruction results and pose evaluation on the KITTI and Waymo datasets and compare them with the latest pose-free methods. The reconstructed visualizations show that our method models static and dynamic objects without pose priors. Moreover, our method achieves much more accurate pose predictions than other pose-free methods.


Method

Overview of the proposed VDG. (a) VDG initialization: we use the off-the-shelf VO networks \(\mathcal{P}(\cdot)\), \(\mathcal{M}(\cdot)\), and \(\mathcal{D}(\cdot)\) to estimate the global poses \(T_t\), motion masks \(M_t\), and depth maps \(D_t\) (see Sec. IV-A.1). Given the poses \(T_t\) and corresponding depth maps \(D_t\), we project the depth maps into 3D space to initialize the Gaussian points \(G^k_t = \{\tilde{\mu}^k_t, \Sigma^k, \widetilde{\alpha}^k_t, S^k\}\). Note that the velocity \(v\) of each Gaussian is set to 0 (see Sec. IV-A.2). (b) VDG training procedure: given the initialized Gaussians \(G^k_t\), we train VDG with RGB and depth supervision (see Sec. IV-A.3). Moreover, we apply motion-mask supervision to decompose static and dynamic scenes (Sec. IV-B). Finally, we adopt a training strategy that refines the VO-given poses \(T_t\) (Sec. IV-C).
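To make the initialization step above more concrete, here is a minimal sketch (not the authors' released code) of unprojecting a VO-predicted depth map into world-space 3D points with a VO-estimated pose to seed the Gaussian means; the intrinsic matrix `K`, the camera-to-world pose convention, and all function names are assumptions for illustration.

```python
# Minimal sketch of VDG-style Gaussian initialization (all names are assumptions).
# Assumes: depth (H, W) and T_cw (4x4 camera-to-world pose) come from the VO
# networks, and K is the 3x3 pinhole intrinsic matrix.
import numpy as np

def init_gaussian_means(depth: np.ndarray, K: np.ndarray, T_cw: np.ndarray) -> np.ndarray:
    """Unproject a depth map into world-space 3D points used as Gaussian means."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                           # back-project to camera rays
    pts_cam = rays * depth[..., None]                         # scale rays by depth
    pts_cam_h = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    pts_world = (pts_cam_h.reshape(-1, 4) @ T_cw.T)[:, :3]    # transform to world frame
    return pts_world                                          # one candidate mean per pixel

# Per the caption above, each initialized Gaussian starts with zero velocity, e.g.:
# gaussians = {"mu": pts_world, "velocity": np.zeros_like(pts_world), ...}
```

In the full method each Gaussian also carries a covariance, opacity, and appearance; the sketch only illustrates the depth unprojection that seeds the means.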

Results

Quantitative performance of novel-view synthesis on the Waymo Open Dataset and the KITTI benchmark. '-' indicates that SplaTAM cannot render images at the original resolution on a single NVIDIA V100 GPU.
Pose accuracy on the Waymo Open and KITTI datasets. Note that $RPE_r$ is reported in degrees, ATE is in the ground-truth scale, and $RPE_t$ is scaled by 100.
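For readers unfamiliar with the pose metrics in this table, the snippet below is a rough sketch of how ATE and the translational RPE (scaled by 100, as in the table) can be computed from aligned trajectories; the input format and the use of per-frame displacement differences for RPE are simplifying assumptions, not the paper's evaluation protocol.

```python
# Illustrative sketch of the reported pose metrics (ATE, RPE_t x 100).
# Assumes pred_xyz and gt_xyz are (N, 3) camera positions already aligned
# (scale / SE(3)) to the ground truth; this is not the paper's evaluation script.
import numpy as np

def ate_rmse(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """Absolute trajectory error: RMSE of per-frame position differences."""
    return float(np.sqrt(np.mean(np.sum((pred_xyz - gt_xyz) ** 2, axis=1))))

def rpe_t_x100(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """Approximate translational RPE between consecutive frames, scaled by 100."""
    d_pred = np.diff(pred_xyz, axis=0)        # per-frame predicted displacements
    d_gt = np.diff(gt_xyz, axis=0)            # per-frame ground-truth displacements
    err = np.linalg.norm(d_pred - d_gt, axis=1)
    return float(100.0 * np.mean(err))
```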

BibTeX

@article{li2024vdg,
      title={VDG: Vision-Only Dynamic Gaussian for Driving Simulation},
      author={Hao Li and Jingfeng Li and Dingwen Zhang and Chenming Wu and Jieqi Shi and Chen Zhao and Haocheng Feng and Errui Ding and Jingdong Wang and Junwei Han},
      year={2024},
      eprint={2406.18198},
      archivePrefix={arXiv},
}