OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

1Peking University, 2Baidu VIS, 3Beihang University
*Corresponding authors

Abstract

This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from the observation that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations. To ensure robust feature representation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency. These features exhibit both intra-object consistency and inter-object distinction. Then, we propose a two-level codebook to discretize these features from coarse to fine. At the coarse level, we consider the positional information of 3D points to achieve location-based clustering, which is then refined at the fine level. Finally, we introduce an instance-level 3D-2D feature association method that links 3D points to 2D masks, which are in turn associated with 2D CLIP features. Extensive experiments, including open vocabulary-based 3D object selection, 3D point cloud understanding, click-based 3D object selection, and ablation studies, demonstrate the effectiveness of our proposed method.

Method


Two types of open-vocabulary understanding based on 3DGS. (a) 2D pixel-level understanding: the language-embedded 3D Gaussians are rendered into 2D feature maps, and image regions relevant to the query text are obtained as results. (b) 3D point-level understanding (ours): Gaussians relevant to the query text are selected directly in 3D space as results.
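
To make the distinction concrete, here is a minimal sketch of point-level selection, assuming each Gaussian already carries a feature in the CLIP embedding space; the tensor names and the similarity threshold are illustrative assumptions, not part of the paper.

# Minimal sketch: select Gaussians in 3D space whose (assumed) CLIP-space
# feature matches a text query. `gaussian_feats` and `threshold` are made up.
import torch
import torch.nn.functional as F

def select_gaussians_by_text(gaussian_feats: torch.Tensor,   # (N, D) per-Gaussian features
                             text_embedding: torch.Tensor,   # (D,) CLIP text embedding
                             threshold: float = 0.25) -> torch.Tensor:
    feats = F.normalize(gaussian_feats, dim=-1)
    text = F.normalize(text_embedding, dim=-1)
    sim = feats @ text                # (N,) cosine similarity per Gaussian
    return sim > threshold            # boolean mask over Gaussians in 3D space

# Usage with dummy data: 10k Gaussians, 512-d features.
selected = select_gaussians_by_text(torch.randn(10000, 512), torch.randn(512))
print(selected.sum().item(), "Gaussians selected")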


Framework. (a) We use view-independent SAM boolean masks to train 3D-consistent instance features for 3DGS. (b) We propose a two-level codebook to discretize the instance features from coarse to fine. (c) An instance-level 3D-2D feature association method links 2D CLIP features to 3D points without training.
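
The supervision in (a) can be pictured with the rough sketch below; the specific loss terms (pulling features inside each mask toward the mask mean, pushing mean features of different masks apart by a margin) and all names are assumptions for illustration, not the released implementation.

import torch
import torch.nn.functional as F

def instance_feature_loss(feat_map: torch.Tensor,   # (H, W, D) rendered instance features
                          sam_masks: torch.Tensor,  # (K, H, W) boolean SAM masks, no cross-frame IDs
                          margin: float = 1.0) -> torch.Tensor:
    means, intra = [], feat_map.new_zeros(())
    for k in range(sam_masks.shape[0]):
        m = sam_masks[k]
        if m.sum() == 0:
            continue
        feats_k = feat_map[m]                         # (n_k, D) features inside mask k
        mean_k = feats_k.mean(dim=0)
        intra = intra + ((feats_k - mean_k) ** 2).sum(dim=-1).mean()  # intra-mask consistency
        means.append(mean_k)
    if len(means) < 2:
        return intra
    means = torch.stack(means)                        # one mean feature per mask
    dists = torch.cdist(means, means)                 # pairwise distances between mask means
    off_diag = ~torch.eye(len(means), dtype=torch.bool, device=means.device)
    inter = F.relu(margin - dists[off_diag]).mean()   # inter-mask distinctiveness
    return intra / sam_masks.shape[0] + inter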


Visualization of 3D point features at different stages. (a) Reference image/mesh; (b) instance features learned in Sec. 3.1; (c)-(d) point features after discretization by the coarse-level and fine-level codebooks (Sec. 3.2).
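
The coarse-to-fine discretization can be sketched roughly as follows, using off-the-shelf k-means as a stand-in for the actual codebook learning; clustering on concatenated [feature, xyz] at the coarse level mimics the location-based grouping, and each coarse cluster is then refined on features alone. Function and variable names are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def two_level_codebook(feats: np.ndarray, xyz: np.ndarray,
                       k_coarse: int = 64, k_fine: int = 10):
    """feats: (N, D) instance features; xyz: (N, 3) Gaussian positions.
    Returns per-point (coarse_id, fine_id) assignments."""
    # Coarse level: position-aware clustering on [feature, xyz].
    coarse_id = KMeans(n_clusters=k_coarse, n_init=10).fit_predict(
        np.concatenate([feats, xyz], axis=1))
    # Fine level: re-cluster features within each coarse cluster.
    fine_id = np.zeros(len(feats), dtype=np.int64)
    for c in range(k_coarse):
        idx = np.where(coarse_id == c)[0]
        if len(idx) == 0:
            continue
        k = min(k_fine, len(idx))
        fine_id[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(feats[idx])
    return coarse_id, fine_id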


3D point-2D CLIP feature association (Sec. 3.3). We render the 3D points of each instance to an arbitrary training view and associate them with 2D masks, whose mask-level CLIP features have already been extracted, based on joint IoU and feature similarity, thereby indirectly associating the 3D points with CLIP features.
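
A small sketch of this training-free association is given below; the scoring weight and all tensor names are assumptions. Each 3D instance, rendered as a binary mask in a training view, is matched to the SAM mask with the highest combined IoU and feature-similarity score and inherits that mask's pre-extracted CLIP feature.

import torch
import torch.nn.functional as F

def associate_instances_to_clip(rendered_masks,   # (I, H, W) bool, one rendered mask per 3D instance
                                rendered_feats,   # (I, D) mean instance feature per 3D instance
                                sam_masks,        # (M, H, W) bool SAM masks in the same view
                                sam_feats,        # (M, D) mean rendered feature inside each SAM mask
                                sam_clip,         # (M, C) pre-extracted mask-level CLIP features
                                w_iou: float = 0.5):
    inter = (rendered_masks.unsqueeze(1) & sam_masks.unsqueeze(0)).flatten(2).sum(-1).float()
    union = (rendered_masks.unsqueeze(1) | sam_masks.unsqueeze(0)).flatten(2).sum(-1).float()
    iou = inter / union.clamp(min=1)                                              # (I, M)
    sim = F.normalize(rendered_feats, dim=-1) @ F.normalize(sam_feats, dim=-1).T  # (I, M)
    best = (w_iou * iou + (1 - w_iou) * sim).argmax(dim=1)    # best-matching SAM mask per instance
    return sam_clip[best]                                     # (I, C) CLIP feature per 3D instance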



Videos

Instance feature visualization of 3D points.

Demonstration of scene editing capabilities. Based on the original scene reconstructed with OpenGaussian, we can select objects for removal, insertion, or color modification.
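
As a toy illustration (names assumed, not the project's editing code), once a boolean selection mask over Gaussians is available, removal and recoloring reduce to simple tensor operations on per-Gaussian attributes:

import torch

def remove_selected(xyz, colors, opacities, selected):
    # Drop the selected Gaussians from the scene.
    keep = ~selected
    return xyz[keep], colors[keep], opacities[keep]

def recolor_selected(colors, selected, new_rgb=(1.0, 0.0, 0.0)):
    # Paint the selected Gaussians with a new RGB color.
    colors = colors.clone()
    colors[selected] = torch.tensor(new_rgb, dtype=colors.dtype)
    return colors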

Text- and click-based 3D object selection.


BibTeX

@article{wu2024opengaussian,
  title={OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding},
  author={Yanmin Wu and Jiarui Meng and Haijie Li and Chenming Wu and Yahao Shi and Xinhua Cheng and Chen Zhao and Haocheng Feng and Errui Ding and Jingdong Wang and Jian Zhang},
  year={2024},
  eprint={2406.02058},
  archivePrefix={arXiv},
}