You can find the project poster here.
The project’s goal was to train a panoptic segmentation model for 3D scenes in an unsupervised manner. Additionally, each returned object instance carries a feature vector in CLIP space, so the scene can be queried using natural language.
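To illustrate the querying step, here is a minimal sketch of ranking instances against a text query by cosine similarity in CLIP space. All names are hypothetical, and the toy features below are random stand-ins; in the real pipeline the text embedding would come from a CLIP text encoder and the instance features from the aggregation described below.

```python
import numpy as np

def query_instances(instance_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Rank object instances by cosine similarity to a query embedding.

    instance_feats: (N, D) per-instance feature vectors in CLIP space.
    text_feat: (D,) CLIP embedding of the natural-language query.
    Returns instance indices ordered from best to worst match.
    """
    inst = instance_feats / np.linalg.norm(instance_feats, axis=1, keepdims=True)
    txt = text_feat / np.linalg.norm(text_feat)
    sims = inst @ txt            # cosine similarity per instance
    return np.argsort(-sims)     # best match first

# Toy example with random stand-in features (a real query would use a CLIP encoder).
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
ranking = query_instances(feats, feats[2])  # query identical to instance 2's features
```

Because the query here equals instance 2's feature vector, that instance ranks first; with a real text embedding, the same ranking would surface the instances best matching the query.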
The segmentation model was trained with a self-training loop. Initial instances were created by leveraging 2D pixel-wise feature extractors, projecting the pixel-wise features to 3D, and segmenting object instances with an approach based on spectral graph clustering, as presented in this paper. The initial instances’ feature vectors were likewise extracted in 2D, projected to 3D, and aggregated over all points of an instance. (In fact, these were the same features used for segmentation.)
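The projection and aggregation steps can be sketched as follows, assuming a pinhole camera model with known intrinsics and a per-pixel depth map. Function names and the averaging choice are illustrative assumptions, not the project's exact implementation:

```python
import numpy as np

def unproject_features(depth: np.ndarray, feats_2d: np.ndarray, K: np.ndarray):
    """Lift per-pixel features into 3D via a depth map and camera intrinsics K.

    depth: (H, W) depth values; feats_2d: (H, W, D) pixel-wise features; K: (3, 3).
    Returns points (H*W, 3) and features (H*W, D), one row per pixel.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]  # standard pinhole back-projection
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)
    return points, feats_2d.reshape(-1, feats_2d.shape[-1])

def aggregate_instance_features(feats_3d: np.ndarray, instance_ids: np.ndarray):
    """Average point features over each instance; the result stays in CLIP space."""
    return {i: feats_3d[instance_ids == i].mean(axis=0)
            for i in np.unique(instance_ids)}
```

Averaging is one simple aggregation choice; since CLIP space is roughly linear under normalization, the mean of an instance's point features remains a usable query target.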