ZeroShape: Data Curation Details - Synthetic Training Dataset Generation and More

3 Jan 2025

Abstract and 1 Introduction

2. Related Work

3. Method and 3.1. Architecture

3.2. Loss and 3.3. Implementation Details

4. Data Curation

4.1. Training Dataset

4.2. Evaluation Benchmark

5. Experiments and 5.1. Metrics

5.2. Baselines

5.3. Comparison to SOTA Methods

5.4. Qualitative Results and 5.5. Ablation Study

6. Limitations and Discussion

7. Conclusion and References

A. Additional Qualitative Comparison

B. Inference on AI-generated Images

C. Data Curation Details

C. Data Curation Details

In this section we describe how we generate our synthetic training data and how we render the object scans from OmniObject3D to produce one of our benchmark test sets.

C.1. Synthetic Training Dataset Generation

Image Rendering. For an arbitrary 3D mesh asset, our Blender-based rendering pipeline first loads it into a scene and normalizes it to fit inside a unit cube. The scene consists of a large rectangular bowl with a flat bottom, a setup that 3D artists commonly use to obtain realistic shading, lit by four point light sources and one area light source. We randomly place cameras around the object with focal lengths from 30mm to 70mm (35mm sensor size equivalent). We randomly vary the camera's distance, elevation (from 5 to 65 degrees), and LookAt point, and generate images at 600 × 600 resolution (see Fig. 11). This variation in object/camera geometry captures the variability of projective geometry in real-world scenarios, which arises from different capture devices and camera poses. This is in contrast with prior work, which uses fixed intrinsics, a fixed distance, and a LookAt point at the center of the object.
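As a rough illustration of the sampling described above, the following Python sketch normalizes a mesh to a unit cube and draws one random camera. It is not the authors' pipeline. The focal length and elevation ranges come from the text. The function names, the distance range, the LookAt jitter, the uniform azimuth, and the 36mm sensor width (the width of the full-frame 35mm format) are assumptions made for this example.

```python
import math
import random


def normalize_to_unit_cube(vertices):
    """Center vertices at the origin and scale uniformly so that the
    longest bounding-box side has length 1."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    extent = max(hi - lo for lo, hi in zip(mins, maxs))
    scale = 1.0 / extent if extent > 0 else 1.0
    return [tuple((v[i] - center[i]) * scale for i in range(3)) for v in vertices]


def sample_camera(rng, image_size=600, sensor_width_mm=36.0,
                  distance_range=(1.5, 3.0), lookat_jitter=0.1):
    """Sample one camera around an object centered at the origin.

    distance_range and lookat_jitter are hypothetical values; the source
    does not state them. The z axis is taken to point up.
    """
    focal_mm = rng.uniform(30.0, 70.0)        # range from the text
    elevation_deg = rng.uniform(5.0, 65.0)    # range from the text
    azimuth_deg = rng.uniform(0.0, 360.0)     # assumed uniform around the object
    distance = rng.uniform(*distance_range)

    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    position = (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )
    # Perturb the LookAt point instead of always aiming at the object center.
    look_at = tuple(rng.uniform(-lookat_jitter, lookat_jitter) for _ in range(3))

    # Pinhole intrinsics in pixels for a square image.
    focal_px = focal_mm / sensor_width_mm * image_size
    c = image_size / 2.0
    K = [[focal_px, 0.0, c], [0.0, focal_px, c], [0.0, 0.0, 1.0]]

    return {
        "focal_mm": focal_mm,
        "elevation_deg": elevation_deg,
        "azimuth_deg": azimuth_deg,
        "distance": distance,
        "position": position,
        "look_at": look_at,
        "K": K,
    }
```

Calling `sample_camera(random.Random(0))` returns a dictionary with the sampled pose and a 3 × 3 intrinsics matrix. Passing an explicit `random.Random` instance keeps the sampling reproducible.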

In addition to RGB images, we extract segmentation masks, depth maps, intrinsics, extrinsics, and object pose. We center-crop the objects, mask out the background, resize the images to 224 × 224, and update the additional annotations to account for the crop, segmentation, and resize.
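To make the annotation update concrete, here is a minimal Python sketch of how pinhole intrinsics could be adjusted after an object-centered square crop and a resize to 224 × 224. The source does not describe the authors' exact procedure. The helper names, the padding parameter, and the convention of ignoring half-pixel offsets are assumptions made for this example.

```python
def square_crop_from_mask(mask, pad=0.0):
    """Return (x0, y0, size) of a square crop centered on the foreground.

    mask is a list of rows of 0/1 values; pad is a hypothetical fractional
    margin added around the object's bounding box.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    x_min, x_max = min(xs), max(xs) + 1   # exclusive upper bound
    y_min, y_max = min(ys), max(ys) + 1
    size = max(x_max - x_min, y_max - y_min) * (1.0 + pad)
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    return cx - size / 2.0, cy - size / 2.0, size


def crop_and_resize_intrinsics(K, x0, y0, crop_size, out_size=224):
    """Update a 3x3 pinhole intrinsics matrix for a square crop whose
    top-left corner is (x0, y0), followed by a resize to out_size."""
    s = out_size / crop_size
    fx = K[0][0] * s                 # focal lengths scale with the resize
    fy = K[1][1] * s
    cx = (K[0][2] - x0) * s          # principal point shifts, then scales
    cy = (K[1][2] - y0) * s
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
```

The crop only translates the principal point, while the resize scales both the focal lengths and the principal point. The extrinsics are unchanged by either operation, and depth values keep their metric scale when the depth map is cropped and resized with the same parameters.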

C.2. Generating the OmniObject3D Testing Set

The original videos released with the OmniObject3D dataset have noisy foreground masks and are mostly taken indoors on a tabletop. To improve lighting variability and ensure accurate segmentations, we follow the rendering procedure described in Section C.1 to generate testing data. Unlike in our training set generation, we use HDRI environment maps for scene lighting, which results in high lighting quality and diversity (see Fig. 12).

Figure 7. Additional qualitative results and comparison on OmniObject3D.

Figure 8. Additional qualitative results and comparison on Ocrtoc3D.

Figure 9. Qualitative results and comparison on Pix3D.

Figure 10. Qualitative results on images generated with DALL·E 3. These results demonstrate the zero-shot generalization ability of ZeroShape to complex novel images.

Figure 11. Synthetic Training Data Generation. We render training images with varying lighting, camera intrinsics and extrinsics. The images are center-cropped, foreground-segmented and resized before being used as training input.

Figure 12. OmniObject3D Testing Data Generation. For OmniObject3D, we generate realistic testing images with varying lighting, camera intrinsics and extrinsics. To increase rendering realism and diversity, we use diverse HDRI environment maps for scene lighting.

This paper is available on arXiv under the CC BY 4.0 DEED license.


Authors:

(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);

(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);

(3) Anh Thai, Georgia Institute of Technology;

(4) Varun Jampani, Stability AI;

(5) James M. Rehg, University of Illinois at Urbana-Champaign.