WildRoomGen: 3D Room Generation from In-the-Wild Indoor Image Collections

Ming-Jia Yang 1    Yang Liu 2    Bin Zhou 1   
1 Beihang University    2 Microsoft Research Asia   

Abstract

3D indoor scene generation is challenging due to the lack of high-quality and diverse 3D datasets. Recent generative approaches that learn from in-the-wild image collections offer a promising solution, but issues with scene quality, diversity, and multi-view consistency persist. In this paper, we introduce WildRoomGen, an efficient image-conditioned 3D room generation framework designed to overcome these limitations. WildRoomGen comprises two key components: (1) RoomGen, a GAN-based single-view conditioned 3D room generator that learns from large-scale, single-view room images to generate diverse NeRF-based 3D rooms. RoomGen significantly improves generation quality and diversity through enhanced camera estimation, perspective projection-based image feature embedding, and the use of pretrained image-feature and pseudo-depth priors. (2) RoomRecon, a feedforward NeRF reconstruction network that addresses the 3D inconsistency that RoomGen and prior methods exhibit because they rely on image super-resolution for image enhancement. RoomRecon is trained solely on RoomGen's generated results, without the need for 3D room data. We extensively evaluate the quality and diversity of the 3D rooms generated by WildRoomGen, highlighting its effectiveness and efficiency. Furthermore, we demonstrate the generality of our approach and its scalability with training data size.

Method

Overview of the WildRoomGen framework. Given a single-view room image as input, RoomGen extracts its DINO features and projects them into triplanes via perspective projection. After being combined with noise and room-size features, the triplane features are decoded as a NeRF, whose rendered color and depth images are used to train the generator, the discriminator, and the image camera pose estimator. RoomRecon takes multi-view super-resolved renderings from RoomGen as input and reconstructs a triplane-based NeRF that improves rendering fidelity while ensuring multi-view consistency. RoomGen is trained only on in-the-wild images and relies on off-the-shelf pretrained image encoders and pseudo-depth priors. RoomRecon is trained solely on RoomGen's results.
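To make the idea of projecting image features into triplanes more concrete, the following NumPy sketch lifts a 2D feature map into three axis-aligned feature planes using a pinhole camera model. It is only an illustration under stated assumptions, not the RoomGen implementation: the function names (`project_points`, `sample_features`, `lift_to_triplanes`), the nearest-neighbour sampling, the camera-space bounding box, and the mean pooling along each axis are all hypothetical choices made for this example.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coords.

    Returns the pixel coordinates and a mask of points in front of the camera.
    """
    z = points_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)      # avoid division by zero
    uv_h = points_cam @ K.T                  # rows: (f*x + cx*z, f*y + cy*z, z)
    return uv_h[:, :2] / safe_z[:, None], in_front

def sample_features(feat_map, uv, valid):
    """Nearest-neighbour lookup of an (H, W, C) feature map at pixel coords."""
    H, W, C = feat_map.shape
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = valid & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    out = np.zeros((uv.shape[0], C), dtype=feat_map.dtype)
    out[inside] = feat_map[rows[inside], cols[inside]]
    return out, inside

def lift_to_triplanes(feat_map, K, bounds, res):
    """Assumed scheme: sample features on a res^3 grid, then mean-pool per axis."""
    lo, hi = bounds
    axes = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (R, R, R, 3)
    uv, valid = project_points(grid.reshape(-1, 3), K)
    feats, inside = sample_features(feat_map, uv, valid)
    vol = feats.reshape(res, res, res, -1)
    count = inside.reshape(res, res, res, 1).astype(feat_map.dtype)
    planes = []
    for axis in (2, 1, 0):  # collapse z -> XY plane, y -> XZ plane, x -> YZ plane
        planes.append(vol.sum(axis) / np.maximum(count.sum(axis), 1))
    return planes

# Toy usage: a constant 16x16 feature map with 2 channels.
K = np.array([[10.0, 0.0, 8.0],
              [0.0, 10.0, 8.0],
              [0.0, 0.0, 1.0]])
feat_map = np.ones((16, 16, 2))
bounds = ((-0.5, -0.5, 1.0), (0.5, 0.5, 3.0))
xy_plane, xz_plane, yz_plane = lift_to_triplanes(feat_map, K, bounds, res=4)
```

Because every grid point in this toy setup projects inside the image, each plane has shape `(4, 4, 2)` and simply reproduces the constant feature value. A learned model would typically replace the mean pooling with a trainable aggregation and use bilinear rather than nearest-neighbour sampling.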

Results

Various 3D scenes generated by WildRoomGen, conditioned on AI-generated images.

Comparison

Qualitative comparisons with Text2Room and ZeroNVS.