Point-Cloud Completion with Pretrained Text-to-image Diffusion Models


NVIDIA, Bar-Ilan University
NeurIPS 2023

We present SDS-Complete: a test-time optimization method for completing point clouds captured by depth sensors, leveraging a pre-trained text-to-image diffusion model. The inputs to our method are an incomplete point cloud (blue) along with a textual description of the object. The output is a complete surface (gray) that is consistent with the input points (blue). The method works well on a variety of objects captured by real-world point-cloud sensors.

Abstract


Point-cloud data collected in real-world applications are often incomplete, because objects are observed from specific viewpoints that capture only one perspective. Data can also be incomplete due to occlusion and low-resolution sampling. Existing completion approaches rely on training models with datasets of predefined objects to guide the completion of point clouds. Unfortunately, these approaches fail to generalize when tested on objects or real-world setups that are poorly represented in their training set. Here, we leverage recent advances in text-guided 3D shape generation, showing how to use image priors for generating 3D objects. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given incomplete point cloud of an object to obtain a complete surface representation. SDS-Complete can complete a variety of objects using test-time optimization, without expensive collection of 3D data. We evaluate SDS-Complete on a collection of incomplete scanned objects captured by real-world depth sensors and LiDAR scanners. We find that it effectively reconstructs objects that are absent from common datasets, reducing Chamfer loss by about 50% on average compared with current methods.


Overview




Here, we address the challenge of completing 3D objects in the wild from real-world partial point clouds. This is achieved by leveraging priors about object shapes that are encoded in pretrained text-to-image diffusion models. Our key idea is that since text-to-image diffusion models were trained on a vast number of diverse objects, they contain a strong prior about the shape and texture of objects, and that prior can be used for completing an object's missing parts. For example, given a partial point cloud, knowing that it corresponds to a chair can guide the completion process, because objects from this class are expected to exhibit certain symmetries and parts that are captured in 2D images.

A similar intuition has been used for generating 3D objects “from scratch” (DreamFusion). DreamFusion uses the SDS loss, which measures the agreement between the prior of a 2D model and renderings of the 3D shape. Unfortunately, naively applying the SDS loss to our problem of point-cloud completion fails. This is because, as we show below, it does not reconcile the hard constraints implied by the points collected from the sensor with the prior embedded in the diffusion model.
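As a reminder of how this loss operates (notation follows DreamFusion; the exact weighting used in our setting is specified in the paper), the SDS gradient with respect to the shape parameters \(\theta\) is

\[
\nabla_\theta \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\mathbf{y},t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right],
\]

where \(x\) is a rendered image, \(x_t\) its noised version at diffusion timestep \(t\), \(\epsilon\) the injected Gaussian noise, \(\hat{\epsilon}_\phi\) the noise predicted by the frozen text-to-image model conditioned on the text prompt \(\mathbf{y}\), and \(w(t)\) a timestep-dependent weight.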

To address these challenges, we introduce SDS-Complete: a method for completing a given partial point cloud that combines several considerations. First, we use a Signed Distance Function (SDF) surface representation and constrain the zero level set of the SDF to pass through the input points. Second, we use information about areas with no collected points to rule out object parts in those areas. Third, we use a prior on camera position and orientation, together with an out-painting curriculum when sampling camera positions. Finally, we use the SDS loss to incorporate a prior, guided by the object's class, on the rendered images.
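To make the first two constraints concrete, the sketch below shows how they can be enforced with simple losses on the SDF network (a minimal, illustrative PyTorch sketch; the function names, the hinge margin, and the free-space sampling are assumptions for illustration, not our exact implementation):

import torch

def sensor_compatibility_losses(sdf_net, surface_pts, free_space_pts, eps=0.01):
    # sdf_net:        network f_theta mapping 3D points (N, 3) to signed distances (N,)
    # surface_pts:    points measured by the depth sensor; the zero level set
    #                 of the SDF should pass through them
    # free_space_pts: points sampled along sensor rays in front of the measured
    #                 depth, where no surface should exist
    # (Names and the margin `eps` are illustrative, not the paper's exact choices.)

    # Zero-level-set constraint: |f_theta(p)| should vanish on observed points.
    surface_loss = sdf_net(surface_pts).abs().mean()

    # Free-space constraint: empty-space points must lie outside the surface,
    # i.e., have positive SDF; penalize values below a small margin.
    free_space_loss = torch.relu(eps - sdf_net(free_space_pts)).mean()

    return surface_loss, free_space_loss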

We demonstrate that SDS-Complete generates completions for objects of diverse shape types from two real-world datasets: the Redwood dataset, which contains partial real-world depth-camera scans of various objects, and the KITTI dataset, which contains LiDAR scans of objects from driving scenarios. In both cases, SDS-Complete outperforms current state-of-the-art methods.

Pipeline


SDS-Complete optimizes two neural functions: a signed distance function \(f_\theta\) representing the surface and a volumetric coloring function \(\mathbf{c}_\varphi\). Together, \((\mathbf{c}_\varphi,f_\theta)\) define a radiance field, which is used to render novel image views \(Im_0,\ldots,Im_n\). The SDS loss is applied to the renderings to encourage them to be compatible with the input text \(\mathbf{y}\). Three sensor-compatibility losses verify that the reconstructed surface is compatible with the sensor observations in various respects.
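Concretely, rendering follows standard volume rendering of a radiance field: the SDF is converted to a volume density (one common choice, used e.g. by VolSDF, is \(\sigma(\mathbf{x})=\alpha\,\Psi_\beta\!\big(-f_\theta(\mathbf{x})\big)\) with \(\Psi_\beta\) the CDF of a Laplace distribution; we mention it here only as an illustration of this family of transforms), and pixel colors are obtained by integrating \(\mathbf{c}_\varphi\) along camera rays:

\[
\hat{C}(\mathbf{r}) \;=\; \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}_\varphi\big(\mathbf{r}(t)\big)\,dt,
\qquad
T(t) \;=\; \exp\!\Big(-\!\int_{t_n}^{t}\sigma\big(\mathbf{r}(s)\big)\,ds\Big).
\]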

Results


Our primary goal is to evaluate SDS-Complete and the baseline methods in real-world scenarios, in contrast to evaluating on test splits of the synthetic datasets that were used for training the baselines. To obtain relevant evaluation data, we base the evaluation on partial real-world point clouds obtained from depth images and LiDAR scans.
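For reference, the Chamfer distance we report can be computed as in the following minimal NumPy sketch (the exact variant, e.g. squared vs. unsquared distances and normalization, follows the paper; this sketch uses mean squared nearest-neighbor distances in both directions):

import numpy as np

def chamfer_distance(pred, gt):
    # pred, gt: point sets of shape (N, 3) and (M, 3).
    # Pairwise squared distances between all points, shape (N, M).
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    # Mean nearest-neighbor distance in both directions (symmetric Chamfer).
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()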

Redwood dataset


Examples are arranged in two columns, where each column has the following structure. Left: the partial scan that is given as input to our model. Middle: the completed surface. Right: the completed surface together with the input points. For more examples, please refer to the paper.

KITTI dataset


The completed surface (gray) together with the input points (blue).

Citation


@article{kasten2023point,
  title={Point-Cloud Completion with Pretrained Text-to-image Diffusion Models},
  author={Kasten, Yoni and Rahamim, Ohad and Chechik, Gal},
  journal={arXiv preprint arXiv:2306.10533},
  year={2023}
}