ZeroShape: Presenting Our Architecture for Shape Reconstruction

30 Dec 2024

Abstract and 1 Introduction

2. Related Work

3. Method and 3.1. Architecture

3.2. Loss and 3.3. Implementation Details

4. Data Curation

4.1. Training Dataset

4.2. Evaluation Benchmark

5. Experiments and 5.1. Metrics

5.2. Baselines

5.3. Comparison to SOTA Methods

5.4. Qualitative Results and 5.5. Ablation Study

6. Limitations and Discussion

7. Conclusion and References

A. Additional Qualitative Comparison

B. Inference on AI-generated Images

C. Data Curation Details

3. Method

3.1. Architecture

We now present our architecture (see Fig. 3) for shape reconstruction. Our architecture is based on two established practices from prior works in this field: 1) the use of intermediate geometric representations [33, 56, 64, 67, 70] and 2) explicit reasoning with spatial feature maps [5, 63, 68]. Specifically, our model consists of three submodules: a depth and camera estimator, a geometric unprojection unit, and a projection-guided shape reconstructor.

Depth and camera estimator. We propose to estimate the 3D visible object surface as an intermediate representation. To infer the full shape of an object, one must understand the visible surface, not only because the visible surface is often a large part of the full surface, but also because an accurate visible surface facilitates geometric reasoning about the full object reconstruction. This is because reconstruction cues that allow for generalization, such as symmetry, curvature, and repetition, can be detected and leveraged more effectively in 3D space. For example, if an object is symmetric, then accurately inferring the 3D symmetry planes from a partial 3D surface is much easier than from 2D RGB or relative depth.

Figure 3. Overview of our model. Our model consists of three modules: a depth and camera estimator, a geometric unprojection unit, and a projection-guided shape reconstructor. The depth and camera estimator predicts the depth and camera intrinsics from the input image with a DPT backbone. The geometric unprojection unit converts the depth and intrinsics estimates into a normalized 3D visible surface, which is parameterized by a three-channel projection map. The shape reconstructor finally reconstructs the full occupancy field by fetching localized information from the projection map through cross attention.

We use a view-centric coordinate system, because prior works show that view-centric learning is beneficial to generalization [55, 56]. The camera coordinate frame therefore serves as the “world” coordinate frame for shape reconstruction, which means that only the camera intrinsics matrix is required to unproject pixels to 3D. Note that unprojection is fully differentiable w.r.t. the depth map D and the intrinsics matrix K, so we can easily use it as a module in an end-to-end learning-based model. Additionally, the projection maps are foreground-segmented, and the represented visible surface is normalized in 3D space to be zero-mean and unit-scale before being fed into the next module.
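The unprojection and normalization steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `unproject_to_projection_map`, the pixel-center convention (adding 0.5), and the reading of "unit-scale" as "largest distance from the centroid equals 1" are all assumptions made for illustration, and a differentiable version would use an autodiff framework rather than NumPy.

```python
import numpy as np

def unproject_to_projection_map(depth, K, mask):
    """Hypothetical helper: turn a depth map into a normalized projection map.

    depth: (H, W) depth along the camera z-axis
    K:     (3, 3) camera intrinsics matrix
    mask:  (H, W) boolean foreground segmentation
    Returns an (H, W, 3) map of camera-frame XYZ coordinates, zero on background.
    """
    H, W = depth.shape
    # Homogeneous pixel coordinates [u, v, 1]; the +0.5 pixel-center
    # convention is an assumption, not something the text specifies.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    # Back-project each pixel: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    points = rays * depth[..., None]
    # Normalize only the foreground surface to zero mean and unit scale.
    fg = points[mask]
    center = fg.mean(axis=0)
    # Assumed scale definition: maximum distance from the centroid.
    scale = np.linalg.norm(fg - center, axis=1).max()
    proj = np.zeros_like(points)
    proj[mask] = (fg - center) / scale
    return proj

# Toy usage: a flat surface two units in front of the camera.
H, W = 4, 6
K = np.array([[5.0, 0.0, 3.0],
              [0.0, 5.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((H, W), 2.0)
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 2:5] = True
proj = unproject_to_projection_map(depth, K, mask)
```

Because the camera frame doubles as the world frame, no extrinsics appear anywhere in the sketch; the intrinsics matrix alone determines the back-projection.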

Figure 4. Effect of Intrinsics. Unprojecting an accurate depth map into a 3D surface with erroneous intrinsics leads to a skewed shape with the wrong 3D aspect ratio.
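The effect in Fig. 4 can be reproduced numerically with a small sketch. The points, the intrinsics values, and the helper names `reproject` and `width_to_depth_ratio` are illustrative assumptions, not taken from the paper. The sketch projects a few 3D points with the true intrinsics, then unprojects the resulting pixels and exact depths with a wrong focal length and compares the width-to-depth extent of the recovered shape.

```python
import numpy as np

def reproject(points, K_true, K_used):
    """Project points with K_true, then unproject with K_used (hypothetical helper)."""
    uvw = points @ K_true.T                  # rows are [u*z, v*z, z]
    depth = uvw[:, 2:3]                      # exact depth is kept unchanged
    pix_h = uvw / depth                      # homogeneous pixels [u, v, 1]
    return (pix_h @ np.linalg.inv(K_used).T) * depth

def width_to_depth_ratio(points):
    """Ratio of the x extent to the z extent of a point set."""
    extent = points.max(axis=0) - points.min(axis=0)
    return extent[0] / extent[2]

# Four corners of a slanted square patch: 1 unit wide and 1 unit deep.
points = np.array([[-0.5, -0.5, 2.0],
                   [ 0.5, -0.5, 2.0],
                   [-0.5,  0.5, 3.0],
                   [ 0.5,  0.5, 3.0]])

K_true = np.array([[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]])
K_wrong = K_true.copy()
K_wrong[0, 0] = K_wrong[1, 1] = 1000.0       # focal length overestimated by 2x

ratio_true = width_to_depth_ratio(reproject(points, K_true, K_true))
ratio_wrong = width_to_depth_ratio(reproject(points, K_true, K_wrong))
```

With the correct intrinsics the patch keeps its 1:1 width-to-depth ratio. With the doubled focal length the lateral extent halves while the depth extent is unchanged, so the ratio drops to 0.5 even though the depth map itself is exact.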

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);

(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);

(3) Anh Thai, Georgia Institute of Technology;

(4) Varun Jampani, Stability AI;

(5) James M. Rehg, University of Illinois at Urbana-Champaign.