In recent years, we've witnessed a remarkable surge in text-to-3D and image-to-3D AI models, revolutionizing the way we create and interact with three-dimensional content. These innovative approaches have opened up new possibilities for artists, designers, developers, and creators across various industries, enabling them to bring their ideas to life in the digital realm with unprecedented ease and speed.
While closed-source apps like Meshy.ai, Sloyd.ai, Alpha3D, 3DFY.ai, and Luma Labs's Genie have dominated the field with their impressive capabilities, a new wave of open-source alternatives is rapidly gaining ground. These emerging open-source models, architectures, and frameworks are not only democratizing access to 3D content creation but also pushing the boundaries of what's possible in this exciting domain.
The closed-source solutions have set a high bar for performance, quality, and speed, offering users powerful tools to generate complex 3D models from simple text descriptions or 2D images. They have found applications in diverse fields such as gaming, film production, virtual reality, augmented reality, and product visualization. However, their proprietary nature often means limited accessibility and customization options for users and researchers.
Enter the world of open-source text-to-3D and image-to-3D models. These projects, driven by passionate communities of developers and researchers, are making significant strides. By leveraging cutting-edge machine learning techniques, computer vision algorithms, and 3D modeling principles, these open-source initiatives are slowly but surely closing the gap with their closed-source counterparts.
The advantages of open-source models in this space are numerous. They offer transparency, allowing users to understand and modify the underlying algorithms. This openness fosters innovation, as developers can build upon existing work, experiment with new approaches, and contribute improvements back to the community. Additionally, open-source solutions often provide greater flexibility in deployment, integration with other tools, and customization to specific use cases.
As these open-source projects continue to evolve, we're seeing exciting developments in areas such as:
Improved accuracy and detail in 3D model generation
Enhanced support for diverse input formats and styles
Faster processing times and more efficient resource utilization
Better integration with popular 3D modeling and rendering software
Expanded datasets for training and fine-tuning models
Novel architectures that combine the strengths of different approaches
The implications of these advancements are far-reaching. As open-source text-to-3D and image-to-3D models become more sophisticated, we can expect to see their adoption across a wide range of industries. From rapid prototyping in product design to creating immersive virtual environments for education, training, and simulation, the potential applications are vast and varied.
Moreover, the democratization of 3D content creation through open-source tools has the power to level the playing field for small businesses, independent creators, and educational institutions. It enables them to compete with larger entities by providing access to powerful 3D generation capabilities without substantial financial investments in proprietary software, although some investment is still needed on the hardware side, typically in powerful NVIDIA GPUs, because of the performance advantages of the CUDA library.
Without further ado, let's explore the latest developments in open-source text-to-3D and image-to-3D models and frameworks!
DreamFusion
In their paper titled "DreamFusion: Text-to-3D using 2D Diffusion," authors Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall propose a novel method for text-to-3D synthesis that leverages pretrained image diffusion models as effective priors. This approach eliminates the need for specialized training data in the 3D domain and requires no modifications to existing image diffusion models.
Generating 3D content from text poses unique challenges compared to traditional text-to-image synthesis. One of the main obstacles is the lack of large-scale labeled datasets pairing textual descriptions with corresponding 3D assets, which makes it difficult for deep learning models to learn to generate 3D objects from textual input alone. Another challenge is finding efficient denoising architectures for 3D data. Denoising plays a crucial role in generating high-quality outputs from noisy or incomplete information, but denoising architectures designed for two-dimensional (2D) data do not transfer directly to 3D.
To address these challenges, Poole et al. propose DreamFusion, a method that uses a pretrained 2D text-to-image diffusion model as a prior for text-to-3D synthesis. The starting point is an image diffusion model trained on a large dataset of 2D images and captions; no 3D data is involved at this stage. The authors then keep this 2D model frozen and use it to guide a randomly initialized 3D model, a Neural Radiance Field (NeRF), via a loss based on probability density distillation, known as Score Distillation Sampling (SDS). Gradient descent refines the NeRF so that its 2D renderings from randomly sampled camera angles look plausible under the diffusion prior.
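To make the mechanics concrete, here is a minimal PyTorch-style sketch of a single SDS update. It assumes a differentiable render_nerf function, a frozen diffusion model exposing a predict_noise method and an alphas_cumprod schedule, and an externally sampled camera pose; these names are illustrative assumptions, not the authors' actual code.

```python
import torch

def sds_step(nerf_params, render_nerf, diffusion, text_embedding, camera, optimizer):
    """One Score Distillation Sampling update (simplified sketch).

    Assumed interfaces (illustrative, not DreamFusion's actual API):
      render_nerf(nerf_params, camera) -> differentiable image [1, 3, H, W]
      diffusion.predict_noise(noisy, t, text_embedding) -> predicted noise
      diffusion.alphas_cumprod -> 1-D tensor of cumulative alpha values
    """
    # Render the current NeRF from the sampled camera pose.
    image = render_nerf(nerf_params, camera)

    # Diffuse the rendering: pick a random noise level t and add Gaussian noise.
    t = torch.randint(20, 980, (1,), device=image.device)
    noise = torch.randn_like(image)
    alpha_bar = diffusion.alphas_cumprod[t].view(1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

    # Ask the frozen 2D diffusion model to predict the noise, conditioned on the text.
    with torch.no_grad():
        pred_noise = diffusion.predict_noise(noisy, t, text_embedding)

    # SDS gradient: (predicted noise - injected noise), pushed back through the
    # differentiable renderer onto the NeRF parameters only.
    grad = (1.0 - alpha_bar) * (pred_noise - noise)
    loss = (grad * image).sum()  # surrogate loss whose gradient w.r.t. the image is `grad`

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Repeating this step over many random cameras and noise levels is what gradually sculpts the NeRF into a scene the diffusion model considers plausible for the prompt.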
The resulting 3D models generated from textual input can be visualized from any perspective, illuminated by different light sources, or seamlessly integrated into diverse 3D environments. This allows for immersive and interactive experiences with virtual objects created solely from text descriptions. One potential application of DreamFusion is in the field of virtual reality (VR) and augmented reality (AR). By generating realistic 3D objects from text, it could enable users to interact with virtual objects in VR/AR environments without requiring specialized training data or complex denoising techniques.
ProlificDreamer
In their paper titled "ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation," authors Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu introduce a novel approach to address the limitations of score distillation sampling (SDS) in text-to-3D generation. SDS has shown promise in leveraging pretrained text-to-image diffusion models but has been plagued by issues such as over-saturation, over-smoothing, and low diversity in generated samples. The key innovation proposed by the authors is the modeling of the 3D parameter as a random variable rather than a constant as done in SDS. This leads to the development of variational score distillation (VSD), a particle-based variational framework that aims to tackle these challenges.
Example prompt: "Michelangelo style statue of dog reading news on a cellphone."
The authors demonstrate that SDS can be viewed as a special case of VSD but often produces subpar samples across different classifier-free guidance (CFG) weights. In contrast, VSD proves effective across a range of CFG weights by employing ancestral sampling from diffusion models, enhancing not only sample diversity but also overall sample quality.
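The sketch below mirrors the SDS example above and illustrates how a VSD update differs in spirit: the subtracted term comes from an auxiliary, fine-tunable copy of the diffusion model (the paper uses LoRA adapters and also conditions it on camera pose, omitted here for brevity) rather than from the injected noise. All interfaces are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vsd_step(particles, render_nerf, diffusion, aux_diffusion,
             text_embedding, camera, opt_particles, opt_aux):
    """One Variational Score Distillation update (rough sketch).

    `particles` is a small list of NeRF parameter sets treated as samples of a
    random 3D variable; `aux_diffusion` is a fine-tunable copy of the diffusion
    model that learns the score of the particles' own renderings.
    """
    nerf = particles[torch.randint(len(particles), (1,)).item()]
    image = render_nerf(nerf, camera)

    t = torch.randint(20, 980, (1,), device=image.device)
    noise = torch.randn_like(image)
    alpha_bar = diffusion.alphas_cumprod[t].view(1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

    with torch.no_grad():
        eps_pretrained = diffusion.predict_noise(noisy, t, text_embedding)
        eps_learned = aux_diffusion.predict_noise(noisy, t, text_embedding)

    # VSD gradient on the particle: pretrained score minus learned score
    # (SDS is recovered if eps_learned is replaced by the injected noise).
    grad = (1.0 - alpha_bar) * (eps_pretrained - eps_learned)
    loss_particle = (grad * image).sum()
    opt_particles.zero_grad()
    loss_particle.backward()
    opt_particles.step()

    # Separately, fine-tune the auxiliary model with the ordinary denoising loss
    # on the particle's (detached) rendering so it tracks the current 3D scenes.
    pred = aux_diffusion.predict_noise(noisy.detach(), t, text_embedding)
    loss_aux = F.mse_loss(pred, noise)
    opt_aux.zero_grad()
    loss_aux.backward()
    opt_aux.step()
```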
Additionally, the authors present several enhancements in the design space for text-to-3D generation including optimizations related to distillation time schedule and density initialization. The proposed approach, dubbed ProlificDreamer, showcases impressive capabilities in generating high rendering resolution (512x512) outputs and high-fidelity Neural Radiance Fields (NeRF) with intricate structures and complex visual effects like smoke and drops. By fine-tuning meshes initialized from NeRF using VSD, the generated 3D models exhibit high details and photorealistic qualities. This research was presented at NeurIPS 2023 as a Spotlight paper.
Magic3D
Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image-conditioning techniques and a prompt-based editing approach, it provides new ways to control 3D synthesis, opening up new avenues for various creative applications.
Existing techniques for generating 3D models from text prompts have several limitations that hinder their efficiency and quality. For example, DreamFusion - one of the most widely used methods - relies on a single-stage optimization that can take around 1.5 hours to generate a model: the NeRF is supervised only by a low-resolution (64 × 64) diffusion prior, and its volumetric scene representation is expensive in both memory and compute. This not only increases processing time but also limits the level of detail that can be achieved in the final model.
To address these limitations, the authors of the Magic3D paper, a team at NVIDIA, propose a two-stage optimization framework that combines multiple diffusion priors with an efficient scene representation based on hash grids.
In the first stage, Magic3D optimizes a coarse neural field representation guided by a low-resolution diffusion prior, using a sparse hash grid for the scene representation. This allows for quick generation of view-consistent geometry while reducing memory usage and computation time compared to DreamFusion. The second stage optimizes a textured mesh representation with a high-resolution (up to 512 × 512) diffusion prior, rendered through an efficient differentiable rasterizer with camera close-ups. This allows for the recovery of high-frequency details in geometry and texture, resulting in a more realistic and detailed final model.
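As a rough, framework-agnostic sketch, the two-stage schedule can be pictured as follows; every callable here is an assumption injected for illustration rather than a real Magic3D API.

```python
def two_stage_generation(prompt, coarse_field, extract_mesh,
                         low_res_sds_step, high_res_sds_step,
                         coarse_steps=5000, fine_steps=3000):
    """Magic3D-style two-stage schedule (conceptual sketch, not the real API).

    coarse_field       -- hash-grid NeRF parameters for stage 1
    extract_mesh       -- converts the coarse field into a textured mesh
    low_res_sds_step   -- one SDS update using a low-resolution image prior
    high_res_sds_step  -- one SDS update using a 512x512 prior rendered through
                          a differentiable rasterizer with close-up cameras
    """
    # Stage 1: quick, memory-light optimization of coarse, view-consistent geometry.
    for _ in range(coarse_steps):
        low_res_sds_step(coarse_field, prompt)

    # Stage 2: refine the mesh to recover high-frequency geometry and texture.
    mesh = extract_mesh(coarse_field)
    for _ in range(fine_steps):
        high_res_sds_step(mesh, prompt)
    return mesh
```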
In terms of speed, on average, Magic3D can generate high-quality 3D models in just 40 minutes - half the time required by DreamFusion.
One of the most significant advantages of Magic3D is its ability to provide control over the 3D synthesis process. By incorporating advancements from text-to-image editing applications, users can now manipulate various aspects such as lighting, materials, textures, and camera angles through simple text prompts. This not only makes 3D content creation more accessible for novices but also enhances the workflow for expert artists.
The efficiency and quality offered by Magic3D open up new possibilities for creative applications across various industries. In gaming and entertainment, it can be used to quickly generate realistic characters or environments based on text descriptions provided by writers or game designers. In architecture, it can assist architects in creating virtual representations of their designs with ease. For robotics simulation, it can aid engineers in generating accurate models for testing purposes.
HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance
Recent advancements in text-to-3D generation have been remarkable, with most existing methods leveraging pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. However, these methods often result in artifacts and inconsistencies across different views due to suboptimal optimization approaches and limited understanding of 3D geometry. Additionally, the inherent constraints of NeRFs in rendering crisp geometry and stable textures often require a two-stage optimization to attain high-resolution details.
To address these limitations, the paper's authors, Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo, propose holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation in a single-stage optimization. Their method computes denoising scores in both the text-to-image diffusion model's latent space and image space. Instead of randomly sampling timesteps, they introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization, ensuring a more controlled and efficient optimization process.
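As a small illustration of the annealing idea, the helper below lowers the sampled timestep as optimization progresses; the exact schedule used in the paper may differ, so treat this as an assumed example rather than the authors' formula.

```python
import math

def annealed_timestep(step, max_steps, t_max=980, t_min=20):
    """Return a diffusion timestep that shrinks as optimization proceeds.

    Early iterations see high-noise timesteps (coarse structure), later
    iterations see low-noise timesteps (fine detail). The square-root decay
    here is illustrative, not necessarily the paper's schedule.
    """
    frac = min(step / max_steps, 1.0)              # 0 -> 1 over training
    t = t_max - (t_max - t_min) * math.sqrt(frac)  # decay from t_max to t_min
    return int(round(t))

# Example: t starts near 980 and ends near 20 over 10,000 optimization steps.
print(annealed_timestep(0, 10_000), annealed_timestep(10_000, 10_000))
```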
To generate high-quality renderings, the authors propose a regularization on the variance of z-coordinates along NeRF rays, which helps stabilize the rendered geometry and texture. The paper also introduces a kernel smoothing technique that refines importance-sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions.
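To make the z-variance regularizer concrete, here is one way such a penalty could be computed from standard NeRF volume-rendering quantities; the exact weighting in the paper may differ, so this is only an assumed sketch.

```python
import torch

def z_variance_loss(weights, z_vals, eps=1e-8):
    """Penalize the spread of sample depths along each ray (sketch).

    weights: [num_rays, num_samples] volume-rendering weights per sample
    z_vals:  [num_rays, num_samples] depth (z-coordinate) of each sample

    Concentrating the weight distribution around a single depth per ray tends
    to produce crisper surfaces and more stable textures.
    """
    w = weights / (weights.sum(dim=-1, keepdim=True) + eps)  # normalize per ray
    z_mean = (w * z_vals).sum(dim=-1, keepdim=True)          # expected depth
    z_var = (w * (z_vals - z_mean) ** 2).sum(dim=-1)         # weighted variance
    return z_var.mean()
```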
Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation
Fantasia3D is a cutting-edge method for high-quality text-to-3D content creation. Proposed by Rui Chen et al., this innovative approach sets itself apart from existing methods in the realm of automatic 3D content generation.
Recent advancements in this field have been fueled by the availability of pre-trained large language models and image diffusion models. This has led to the emergence of text-to-3D content creation as a prominent research topic. However, existing methods often utilize implicit scene representations that link geometry and appearance through volume rendering. While effective to some extent, these approaches fall short in capturing finer geometries and achieving photorealistic rendering. This limits their ability to generate top-notch 3D assets.
Fantasia3D addresses this issue by introducing a novel approach that focuses on disentangling the modeling and learning of geometry and appearance. For geometry learning, the method relies on a hybrid scene representation and introduces the encoding of surface normals extracted from this representation as input for the image diffusion model. On the other hand, for appearance modeling, Fantasia3D incorporates spatially varying bidirectional reflectance distribution function (BRDF) into the text-to-3D task. By learning surface materials for photorealistic rendering of generated surfaces, this disentangled framework enhances compatibility with popular graphics engines while enabling functionalities such as relighting, editing, and physical simulation of the resulting 3D assets.
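The disentanglement can be pictured as two alternating phases driven by the same 2D diffusion guidance; the sketch below assumes injected callables for normal-map rendering, physically based shading, and an SDS-style loss, so it is illustrative rather than the authors' implementation.

```python
def fantasia3d_style_step(phase, geometry, brdf, camera,
                          render_normals, pbr_render, diffusion_guidance,
                          prompt, optimizer):
    """One step of a Fantasia3D-style disentangled loop (illustrative sketch).

    phase == "geometry":   render surface normals of the hybrid surface
                           representation and feed them to the 2D diffusion
                           guidance, updating only the geometry.
    phase == "appearance": freeze geometry and fit a spatially varying BRDF
                           (e.g. albedo, roughness, specular) under physically
                           based shading, updating only the material field.
    """
    if phase == "geometry":
        rendered = render_normals(geometry, camera)    # normal map as the "image"
    else:
        rendered = pbr_render(geometry, brdf, camera)  # physically based shading

    loss = diffusion_guidance(rendered, prompt)        # SDS-style loss from the 2D prior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the materials are learned explicitly, the optimized mesh and BRDF can be exported to standard graphics engines for relighting, editing, and physical simulation, as described above.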
The efficacy of Fantasia3D is demonstrated through comprehensive experiments showcasing its superiority over existing methods across various text-to-3D task settings. Presented at ICCV 2023, this approach opens up new possibilities in high-quality 3D content creation by bridging the gap between geometry and appearance modeling.
Zero-1-to-3: Zero-shot One Image to 3D Object
In recent years, there has been growing interest in developing computer vision systems that can accurately manipulate and reconstruct 3D objects from limited visual input. This has led to significant advancements in view synthesis and 3D reconstruction, which have numerous applications in areas such as virtual reality, gaming, and robotics. However, most existing approaches require multiple images or depth information to generate novel views of an object or reconstruct its 3D structure. To address this limitation, a team of researchers from Columbia University and Toyota Research Institute created a novel framework called "Zero-1-to-3" for manipulating the camera viewpoint of an object given just a single RGB image. Their research paper, titled "Zero-1-to-3: Zero-shot One Image to 3D Object," introduces this groundbreaking approach, which leverages geometric priors learned by large-scale diffusion models from natural images to enable accurate view synthesis in an under-constrained setting.
Most existing methods rely on large datasets with multiple images or depth information for training their models. This limits their applicability to real-world scenarios where obtaining such data may not be feasible. Moreover, even if trained on synthetic data generated using computer graphics techniques, these models often fail to generalize well when presented with out-of-distribution datasets or diverse real-world images. This is because they lack robustness against variations in lighting conditions, textures, and object appearances.
To overcome these limitations, the authors propose a conditional diffusion model that utilizes a synthetic dataset for learning parameters controlling the relative camera viewpoint. This enables the generation of new images depicting the same object from different perspectives following a specified camera transformation. The key innovation lies in leveraging geometric priors learned by large-scale diffusion models from natural images to enable novel view synthesis in an under-constrained setting. The authors demonstrate that their approach significantly outperforms existing state-of-the-art models for single-view 3D reconstruction and novel view synthesis by harnessing Internet-scale pre-training.
The proposed framework, Zero-1-to-3, consists of two main components - a conditional diffusion model and a synthetic dataset. The conditional diffusion model is trained on the synthetic dataset to learn parameters controlling the relative camera viewpoint. This allows for accurate manipulation of object viewpoints from limited visual input. The synthetic dataset used for training is created using computer graphics techniques and contains various objects with different textures, lighting conditions, and backgrounds. This diverse dataset ensures that the model learns robust representations that can generalize well to real-world scenarios.
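To illustrate the kind of conditioning involved, the snippet below builds a compact relative-viewpoint vector from two spherical camera poses; Zero-1-to-3 conditions its diffusion model on the relative viewpoint together with an embedding of the input image, but this particular encoding and the function name are assumptions for illustration.

```python
import math
import torch

def relative_view_embedding(src_view, tgt_view):
    """Encode the relative camera transform between two views (sketch).

    Each view is (polar, azimuth, radius) of a camera on a sphere around the
    object. The exact encoding used by Zero-1-to-3 may differ; this version is
    illustrative.
    """
    d_polar = tgt_view[0] - src_view[0]
    d_azimuth = tgt_view[1] - src_view[1]
    d_radius = tgt_view[2] - src_view[2]
    # Encode azimuth with sin/cos so angles that differ by 2*pi map to the same values.
    return torch.tensor(
        [d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius],
        dtype=torch.float32,
    )

# Example: request a view rotated 90 degrees around the object, at the same
# elevation and distance; the resulting vector conditions the diffusion model
# alongside an embedding of the single input image.
cond = relative_view_embedding((math.pi / 3, 0.0, 1.5),
                               (math.pi / 3, math.pi / 2, 1.5))
```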
The proposed framework has numerous applications in areas such as virtual reality, gaming, robotics, autonomous driving, etc., where precise control over camera transformations or accurate 3D scene reconstruction is crucial. It can also be used for generating realistic images from limited visual input or enhancing low-quality images by synthesizing new views. Moreover, the viewpoint-conditioned diffusion methodology introduced in this work can also be employed for other tasks such as image translation and style transfer by conditioning on different camera transformations.
Conclusion
The landscape of text-to-3D and image-to-3D technologies is rapidly evolving, with open-source solutions making significant strides. While proprietary models still hold the lead in many aspects, the gap is narrowing, and the future looks incredibly promising for open-source alternatives.
The democratization of 3D content creation through these open-source tools is not just a technological advancement; it's a paradigm shift that has the potential to revolutionize industries and empower creators worldwide. As these models continue to improve, we can expect to see an explosion of creativity and innovation across various fields, from entertainment and education to manufacturing and scientific visualization.
However, the journey is far from over. The open-source community faces challenges in terms of computational resources, data quality, and achieving consistently photorealistic results. Yet, the developers and researchers are pushing the boundaries, refining algorithms, and expanding the possibilities of what can be achieved.
I encourage you to experiment with these tools, or simply stay informed about the latest breakthroughs. The future of 3D content creation is being shaped right now, and it's more accessible and exciting than ever before!