The proposed VDG. (a) VDG Initialization: uses the off-the-shelf VO networks \(\mathcal{P}(\cdot)\), \(\mathcal{M}(\cdot)\), and \(\mathcal{D}(\cdot)\) to estimate the global poses \(T_t\), motion masks \(M_t\), and depth maps \(D_t\) (see Sec. IV-A.1). Given the poses \(T_t\) and corresponding depth maps \(D_t\), we back-project the depth maps into 3D space to initialize the Gaussian points \(G^k_t = \{\tilde{\mu}^k_t, \Sigma^k, \tilde{\alpha}^k_t, S^k\}\). Note that the velocity \(v\) of each Gaussian is initialized to 0 (see Sec. IV-A.2). (b) VDG Training Procedure: Given the initialized Gaussians \(G^k_t\), we train our VDG with RGB and depth supervision (see Sec. IV-A.3). Moreover, we apply motion-mask supervision to decompose the scene into static and dynamic components (Sec. IV-B). Finally, we adopt a training strategy that refines the VO-given poses \(T_t\) (Sec. IV-C).
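The initialization in (a) amounts to lifting each pixel of \(D_t\) into world space with the camera intrinsics and the VO pose \(T_t\). Below is a minimal sketch of that back-projection, assuming a pinhole camera model with intrinsics \(K\); the function name, tensor layout, and intrinsics are illustrative assumptions, not the paper's actual code.

```python
import torch

def backproject_depth(depth, K, T):
    """Back-project a depth map D_t into world-space 3D points.

    depth: (H, W) depth map D_t from the depth network D(.)
    K:     (3, 3) camera intrinsics (assumed pinhole model)
    T:     (4, 4) camera-to-world pose T_t from the pose network P(.)
    Returns (H*W, 3) world-space points used as initial Gaussian means.
    """
    H, W = depth.shape
    # Build a homogeneous pixel grid (u, v, 1) for every pixel.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    # Lift pixels to camera-space rays, then scale each ray by its depth.
    cam = (torch.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Transform camera-space points into the world frame with T_t.
    cam_h = torch.cat([cam, torch.ones(cam.shape[0], 1)], dim=-1)
    return (T @ cam_h.T).T[:, :3]

# Per the caption, each initialized Gaussian also starts with zero velocity:
# velocity = torch.zeros_like(means)
```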