ModeDreamer

Mode Guiding Score Distillation for Text-to-3D Generation
using Reference Image Prompts


¹VinAI Research, ²Trinity College Dublin
*Equal contribution

ModeDreamer can generate high-quality and diverse 3D objects comparable to state-of-the-art approaches.

Abstract

Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, the 3D models they produce tend to be over-smoothed and of low quality. These issues arise from the mode-seeking behavior of current methods, in which the scores used to update the model oscillate between multiple modes, making optimization unstable and degrading output quality. To address this problem, we introduce a novel image prompt score distillation loss, ISD, which uses a reference image to direct text-to-3D optimization toward a specific mode. The ISD loss can be implemented with IP-Adapter, a lightweight adapter that adds image prompt capability to a text-to-image diffusion model, serving as a mode-selection module. A variant of this adapter, when not prompted by a reference image, can serve as an efficient control variate that reduces variance in score estimates, improving both output quality and optimization stability. Qualitative and quantitative evaluations on the T3Bench benchmark suite show that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods.
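The control-variate idea mentioned in the abstract is a standard Monte Carlo technique: subtract a correlated quantity with a known mean so that the estimate keeps the same expectation but has lower variance. The sketch below is a generic toy illustration only, not the ISD estimator; the target function `exp(z)`, the control `z`, and the helper name `compare_estimators` are all assumptions chosen for the example.

```python
import numpy as np

def compare_estimators(n_samples=2000, n_trials=200, seed=0):
    """Return the empirical variance of a plain Monte Carlo mean and of the
    same mean corrected with a control variate (toy example, not ISD)."""
    rng = np.random.default_rng(seed)
    plain, corrected = [], []
    for _ in range(n_trials):
        z = rng.standard_normal(n_samples)
        f = np.exp(z)  # target: estimate E[exp(Z)] for Z ~ N(0, 1)
        g = z          # control variate: correlated with f, known mean 0
        # Variance-minimizing coefficient Cov(f, g) / Var(g), estimated from
        # the same samples (adds a slight bias, acceptable for a demo).
        c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)
        plain.append(f.mean())
        corrected.append((f - c * g).mean())
    return float(np.var(plain)), float(np.var(corrected))

var_plain, var_cv = compare_estimators()
print(f"plain estimator variance:   {var_plain:.6f}")
print(f"control-variate variance:   {var_cv:.6f}")
```

In score distillation the analogous role is played by a term subtracted from the predicted score; here, per the abstract, that term comes from the adapter variant run without a reference image.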

Pipeline Method

ModeDreamer Pipeline

An overview of our method. Starting with an input prompt \(y\), we generate a reference image \(x_{\text{ref}}\) using a text-to-image model. Both the text prompt and the image prompt are passed to the IP-Adapter for score distillation, following our ISD gradient \(\nabla_\theta \mathcal{L}_{\text{ISD}}\). To mitigate the view bias introduced by the reference image, as well as the Janus problem, we add multi-view regularization by jointly optimizing \(\nabla_\theta \mathcal{L}_{\text{ISD}}\) with \(\nabla_\theta \mathcal{L}_{\text{SDS-MVD}}\).
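As a rough sketch of what jointly optimizing the two gradients could look like, the snippet below takes one gradient-descent step on a weighted sum of an ISD gradient and an SDS-MVD gradient. The weighted-sum form, the weight `lambda_mvd`, the learning rate, and the function name `joint_update` are assumptions made for illustration; the page does not specify how the two terms are balanced, and the gradient values are placeholders rather than outputs of a diffusion model.

```python
import numpy as np

def joint_update(theta, grad_isd, grad_sds_mvd, lr=0.01, lambda_mvd=1.0):
    """One gradient-descent step on a combined objective.

    grad_isd and grad_sds_mvd stand in for the two gradients with respect
    to the 3D parameters theta; lambda_mvd is a hypothetical weight.
    """
    combined = grad_isd + lambda_mvd * grad_sds_mvd
    return theta - lr * combined

# Placeholder values, only to show the shapes and the update rule.
theta = np.zeros(3)
grad_isd = np.array([1.0, 0.0, -1.0])
grad_sds_mvd = np.array([0.0, 2.0, 0.0])
theta_next = joint_update(theta, grad_isd, grad_sds_mvd, lr=0.1, lambda_mvd=0.5)
print(theta_next)
```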

GPTEval3D Results

| Methods | Text-Asset Alignment | 3D Plausibility | Text-Geometry Alignment | Texture Details | Geometry Details | Overall ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| RichDreamer [39] | 1295 | 1225 | 1260 | 1356 | 1251 | 1277 |
| MVDream [46] | 1271 | 1147 | 1251 | 1325 | 1255 | 1250 |
| ProlificDreamer [53] | 1262 | 1059 | 1152 | 1246 | 1181 | 1180 |
| LatentNeRF [35] | 1222 | 1145 | 1157 | 1180 | 1161 | 1173 |
| Instant3D [22] | 1200 | 1088 | 1153 | 1152 | 1181 | 1155 |
| Magic3D [24] | 1152 | 1001 | 1084 | 1178 | 1100 | 1100 |
| DreamGaussian [48] | 1101 | 954 | 1159 | 1126 | 1131 | 1094 |
| SJC [51] | 1130 | 995 | 1034 | 1080 | 1043 | 1056 |
| Fantasia3D [2] | 1068 | 892 | 1006 | 1109 | 1027 | 1021 |
| Dreamfusion [38] | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| One2345 [25] | 872 | 829 | 850 | 911 | 860 | 864 |
| Shap-E [17] | 843 | 842 | 846 | 784 | 846 | 836 |
| Point-E [37] | 725 | 690 | 689 | 716 | 746 | 713 |
| ISD (ours) | 1291 | 1271 | 1269 | 1370 | 1266 | 1294 |

Table 4. Comparison with text-to-3D methods on the GPTEval3D [56] benchmark. The best results are in red while the second best results are in yellow.

T3Bench Results

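| Method | Time (mins) | Single Object: Qual. ↑ | Single Object: Align. ↑ | Single Object: Avg ↑ | Single Object with Surr: Qual. ↑ | Single Object with Surr: Align. ↑ | Single Object with Surr: Avg ↑ | Multiple Objects: Qual. ↑ | Multiple Objects: Align. ↑ | Multiple Objects: Avg ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dreamfusion [38] | 30 | 24.9 | 24.0 | 24.4 | 19.3 | 29.8 | 24.6 | 17.3 | 14.8 | 16.1 |
| Magic3D [24] | 40 | 38.7 | 35.3 | 37.0 | 29.8 | 41.0 | 35.4 | 26.6 | 24.8 | 25.7 |
| LatentNeRF [35] | 65 | 34.2 | 32.0 | 33.1 | 23.7 | 37.5 | 30.6 | 21.7 | 19.5 | 20.6 |
| Fantasia3D [2] | 45 | 29.2 | 23.5 | 26.4 | 21.9 | 32.0 | 27.0 | 22.7 | 14.3 | 18.5 |
| SJC [51] | 25 | 26.3 | 23.0 | 24.7 | 17.3 | 22.3 | 19.8 | 11.7 | 5.8 | 8.7 |
| ProlificDreamer [53] | 240 | 51.1 | 47.8 | 49.4 | 42.5 | 47.0 | 44.8 | 45.7 | 25.8 | 35.8 |
| MVDream [46] | 30 | 53.2 | 42.3 | 47.8 | 36.3 | 48.5 | 42.4 | 39.0 | 28.5 | 33.8 |
| DreamGaussian [48] | 7 | 19.9 | 19.8 | 19.8 | 10.4 | 17.8 | 14.1 | 12.3 | 9.5 | 10.9 |
| GeoDream [32] | 400 | 48.4 | 33.8 | 41.1 | 35.2 | 34.5 | 34.9 | 34.3 | 16.5 | 25.4 |
| RichDreamer [39] | 70 | 57.3 | 40.0 | 48.6 | 43.9 | 42.3 | 43.1 | 34.8 | 22.0 | 28.4 |
| ISD (ours) | 40 | 55.4 | 52.6 | 54.0 | 45.7 | 59.0 | 52.4 | 43.4 | 39.4 | 41.4 |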

Table 1. Comparative results for the text-to-3D task across three settings of T3Bench. The best results are in red while the second best results are in yellow.
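The Avg columns in Table 1 appear to be the arithmetic mean of the Qual. and Align. scores for each setting, up to rounding. That reading is an inference from the reported numbers rather than something stated on this page; the short check below uses the ISD (ours) row, and the helper name `mean_score` is illustrative.

```python
# ISD (ours) row from Table 1: (Qual., Align., reported Avg) per setting.
isd_row = {
    "Single Object": (55.4, 52.6, 54.0),
    "Single Object with Surr": (45.7, 59.0, 52.4),
    "Multiple Objects": (43.4, 39.4, 41.4),
}

def mean_score(qual, align):
    """Arithmetic mean of the quality and alignment scores."""
    return (qual + align) / 2

# Absolute gap between the recomputed mean and the reported Avg.
gaps = {
    name: abs(mean_score(qual, align) - avg)
    for name, (qual, align, avg) in isd_row.items()
}
for name, gap in gaps.items():
    print(f"{name}: gap between mean and reported Avg = {gap:.2f}")
```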

BibTeX

@article{tran2024modedreamer,
  title={ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts},
  author={Tran, Uy Dieu and Luu, Minh and Nguyen, Phong Ha and Nguyen, Khoi and Hua, Binh-Son},
  journal={arXiv preprint arXiv:2411.18135},
  year={2024}
}