ZeroShape: A Comparison to SOTA Methods

1 Jan 2025

Abstract and 1 Introduction

2. Related Work

3. Method and 3.1. Architecture

3.2. Loss and 3.3. Implementation Details

4. Data Curation

4.1. Training Dataset

4.2. Evaluation Benchmark

5. Experiments and 5.1. Metrics

5.2. Baselines

5.3. Comparison to SOTA Methods

5.4. Qualitative Results and 5.5. Ablation Study

6. Limitations and Discussion

7. Conclusion and References

A. Additional Qualitative Comparison

B. Inference on AI-generated Images

C. Data Curation Details

5.3. Comparison to SOTA Methods

We compare our approach to other state-of-the-art methods on the benchmark we curated. We now present and analyze the quantitative results for each dataset.

Results on OmniObject3D. We present our main quantitative comparison on OmniObject3D, which covers a wide variety of object types. The results are shown in Tab. 1. Compared with other SOTA zero-shot 3D reconstruction methods, our approach achieves significantly better performance.

Results on Ocrtoc3D. We present additional quantitative comparison results on Ocrtoc3D. Ocrtoc3D is smaller than OmniObject3D, but it still covers many object types, and its input images are real photos. The results are shown in Tab. 3. As on OmniObject3D, our approach outperforms previous SOTA methods by a large margin.

Table 4. Ablation study on OmniObject3D. The design choices of our architecture are quantitatively justified: enforcing explicit geometric reasoning and implementing it through unprojection with estimated depth and intrinsics are both essential.

Results on Pix3D. We also present quantitative comparison results on Pix3D. Unlike OmniObject3D and Ocrtoc3D, this evaluation dataset has much lower object variety: all objects are furniture, and more than two thirds of the images are chairs and sofas. The evaluation results are therefore highly biased towards this specific class of objects. The results are shown in Tab. 2, and our method still achieves state-of-the-art performance. It is worth noting that Point-E and Shap-E also perform well on this dataset. We hypothesize that this might relate to the abundance of similar furniture categories in their training sets.
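The class-imbalance point made for Pix3D can be illustrated with a small sketch. It compares a plain per-image average of a metric with an average taken per class first. The class labels, the scores, and the helper names `per_image_mean` and `per_class_mean` are all hypothetical and chosen only for illustration; they are not results or code from the paper, and the paper does not state that it reports per-class averages.

```python
from collections import defaultdict


def per_image_mean(samples):
    """Average a metric over all images, ignoring class labels."""
    return sum(score for _, score in samples) / len(samples)


def per_class_mean(samples):
    """Average within each class first, then average the class means."""
    by_class = defaultdict(list)
    for label, score in samples:
        by_class[label].append(score)
    class_means = [sum(scores) / len(scores) for scores in by_class.values()]
    return sum(class_means) / len(class_means)


# Hypothetical scores (higher is better), NOT taken from the paper.
# Chairs and sofas make up 7 of 10 images, mimicking a skewed test set.
samples = (
    [("chair", 0.8)] * 5
    + [("sofa", 0.8)] * 2
    + [("table", 0.4)] * 2
    + [("bed", 0.4)] * 1
)

if __name__ == "__main__":
    print("per-image mean:", round(per_image_mean(samples), 3))
    print("per-class mean:", round(per_class_mean(samples), 3))
```

With these made-up numbers, the per-image average is pulled towards the score of the dominant classes, while the per-class average weights every class equally. A method that happens to do well on chairs and sofas can therefore look stronger on a skewed dataset than it would on a more varied one.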

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);

(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);

(3) Anh Thai, Georgia Institute of Technology;

(4) Varun Jampani, Stability AI;

(5) James M. Rehg, University of Illinois at Urbana-Champaign.