You may have heard about how powerful text-to-image AI models have become in the past couple of years. But did you know that the same technology could help make the leap from 2D to 3D?
AI-generated 3D models have broad applications in today’s digital landscape. Video games and film rely on skilled 3D artists and modeling software such as Blender to create the assets that populate computer-generated scenes.
However, is it possible that the industry could use machine learning to create 3D assets with less effort, similar to how 2D artists today are starting to adopt technology such as DALL-E and Midjourney?
This article explores a novel algorithm that builds an effective text-to-3D pipeline on top of existing 2D diffusion models.
What Is DreamFusion?
One major issue with creating a diffusion model that generates 3D assets directly is that there simply isn’t much 3D data available. 2D diffusion models have become so powerful because of the vast dataset of images found on the internet. The same can’t be said for 3D assets.
Some 3D generative techniques work around this lack of data by taking advantage of the abundance of 2D data instead.
DreamFusion is a generative model that can create 3D models based on a provided text description. The DreamFusion model uses a pre-trained text-to-image diffusion model to generate realistic three-dimensional models from text prompts.
Despite having no 3D training data, this approach has generated coherent 3D assets with high-fidelity appearance and depth.
How Does It Work?
The DreamFusion algorithm consists of two main models: a 2D diffusion model and a neural network that can convert 2D images into a cohesive 3D scene.
Google’s Imagen Text-to-Image Model
The first part of the algorithm is the diffusion model. This model is responsible for converting text to images.
Imagen is a diffusion model that can generate a large sample of image variations of a particular object. In this case, our image variations should cover all possible angles of the provided object. For example, if we wanted to generate a 3D model of a horse, we would want 2D images of the horse from all possible angles. The goal is to use Imagen to provide as much information as possible (colors, reflections, density) for the next model in our algorithm.
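Imagen itself isn’t publicly available, so as a rough stand-in, here is a minimal sketch of sampling several 2D variations of an object prompt with an open-source diffusion model through the diffusers library. The model ID, seeds, and prompt are illustrative choices, not the ones DreamFusion uses.

```python
# Minimal sketch: sample several 2D variations of an object prompt
# with an open-source diffusion model as a stand-in for Imagen.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a horse"
images = []
for seed in range(4):
    generator = torch.Generator("cuda").manual_seed(seed)
    # Each seed produces a different variation of the same object.
    images.append(pipe(prompt, generator=generator).images[0])

for i, img in enumerate(images):
    img.save(f"horse_variation_{i}.png")
```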
Creating 3D Models with NeRF
Next, DreamFusion uses a model known as a Neural Radiance Field, or NeRF, to actually create the 3D model from the generated images. NeRFs can reconstruct complex 3D scenes given a dataset of 2D images.
Let’s try to understand how a NeRF works.
The model learns a continuous volumetric scene function, optimized against the provided dataset of 2D images.
If the model learns a function, what are its inputs and outputs?
The scene function takes a 3D location and a 2D viewing direction as input. It then outputs a color (as RGB values) and a volume density.
To generate a 2D image from a specific viewpoint, the model will generate a set of 3D points and run those points through the scene function to return a set of color and volume density values. Volume rendering techniques will then convert those values into a 2D image output.
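To make this concrete, here is a minimal sketch (in PyTorch) of a scene function that maps a 3D position and viewing direction to a color and density, plus the volume rendering step that composites samples along a ray into a pixel color. The layer sizes and sampling range are made up for illustration, and a real NeRF also applies positional encoding and conditions density on position only.

```python
# Minimal sketch of a NeRF-style scene function and volume rendering.
# Layer sizes are illustrative; a real NeRF also uses positional encoding.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Input: 3D position (x, y, z) + 3D viewing direction = 6 values.
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # Output: RGB color (3) + volume density (1)
        )

    def forward(self, positions, directions):
        out = self.mlp(torch.cat([positions, directions], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3])     # non-negative density
        return rgb, sigma

def render_ray(model, origin, direction, near=2.0, far=6.0, n_samples=64):
    """Composite colors along a single ray using volume rendering weights."""
    t = torch.linspace(near, far, n_samples)        # sample depths along the ray
    points = origin + t[:, None] * direction        # 3D sample points
    dirs = direction.expand(n_samples, 3)
    rgb, sigma = model(points, dirs)

    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])  # sample spacing
    alpha = 1.0 - torch.exp(-sigma * delta)         # opacity of each sample
    # Transmittance: probability the ray reaches each sample unblocked.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), 0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)      # final pixel color

# Usage: one ray shot from the origin along +z.
model = TinyNeRF()
color = render_ray(model, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```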
Using NeRF and 2D Diffusion Models Together
Now that we know how a NeRF works, let’s see how this model can generate accurate 3D models from our generated images.
For each text prompt, DreamFusion trains a randomly initialized NeRF from scratch. Each iteration samples a random camera position in spherical coordinates. Think of the model encased in a glass sphere: each time we generate a new image of our 3D model, we choose a random point on that sphere as the vantage point. DreamFusion also chooses a random light position to use for rendering.
Once we have a camera and light position, the NeRF renders an image of the scene. DreamFusion randomly chooses between a full color render, a textureless render, and a render of the albedo without any shading.
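As a rough sketch of that sampling step, here is how a random camera and light position on a sphere around the object might be drawn each iteration. The angle ranges, radius, and light jitter are placeholder assumptions, not the exact values from the paper.

```python
# Sketch of sampling a random camera and light position each iteration.
# The angle/distance ranges are illustrative, not the paper's exact values.
import numpy as np

def sample_camera_and_light(radius=4.0):
    # Spherical coordinates: azimuth sweeps around the object,
    # elevation moves from near the equator toward overhead.
    azimuth = np.random.uniform(0, 2 * np.pi)
    elevation = np.random.uniform(np.radians(-10), np.radians(90))

    camera_position = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])

    # Place the point light somewhere near the camera.
    light_position = camera_position + np.random.normal(scale=1.0, size=3)
    return camera_position, light_position, azimuth, elevation

cam, light, az, el = sample_camera_and_light()
```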
We mentioned earlier that we want our text-to-image model (Imagen) to produce images covering a representative sample of viewpoints.
How does DreamFusion accomplish this?
DreamFusion simply modifies the input prompt slightly to achieve the intended angles. For example, we can achieve high elevation angles by appending “overhead view” to the prompt. Other angles come from appending phrases such as “front view”, “side view”, and “back view”.
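Here is a small sketch of that prompt augmentation. The elevation and azimuth thresholds are assumptions chosen for illustration rather than the paper’s exact cutoffs.

```python
# Sketch of view-dependent prompt augmentation.
# The thresholds below are illustrative assumptions, not the paper's exact values.
def augment_prompt(prompt, azimuth_deg, elevation_deg):
    if elevation_deg > 60:
        view = "overhead view"
    else:
        # Map azimuth to a coarse viewing direction.
        azimuth_deg = azimuth_deg % 360
        if azimuth_deg < 45 or azimuth_deg >= 315:
            view = "front view"
        elif 135 <= azimuth_deg < 225:
            view = "back view"
        else:
            view = "side view"
    return f"{prompt}, {view}"

print(augment_prompt("a DSLR photo of a horse", azimuth_deg=180, elevation_deg=20))
# -> "a DSLR photo of a horse, back view"
```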
Scenes are repeatedly rendered from random camera positions. These renderings then pass through a score distillation sampling (SDS) loss function. Plain gradient descent then gradually improves the 3D model until it matches the scene described by the text.
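In code form, one score distillation update looks roughly like the sketch below. Here `predict_noise` is a hypothetical stand-in for a frozen text-to-image diffusion model that predicts the noise added to an image, and the timestep range and weighting w(t) are common choices rather than values dictated by the paper.

```python
# Sketch of one score distillation sampling (SDS) step.
# `predict_noise` is a hypothetical stand-in for a frozen text-to-image
# diffusion model that predicts the noise added to a noisy image.
import torch

def sds_step(rendered_image, text_embedding, predict_noise, alphas_cumprod, optimizer):
    # rendered_image: differentiable NeRF render, shape (1, 3, H, W)
    t = torch.randint(20, 980, (1,))                 # random diffusion timestep
    alpha_bar = alphas_cumprod[t].view(1, 1, 1, 1)

    noise = torch.randn_like(rendered_image)
    noisy_image = alpha_bar.sqrt() * rendered_image + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():                            # the diffusion model stays frozen
        predicted_noise = predict_noise(noisy_image, t, text_embedding)

    # SDS gradient: difference between predicted and injected noise,
    # scaled by a timestep-dependent weight w(t) (choice of w(t) is assumed here).
    w = 1.0 - alpha_bar
    grad = w * (predicted_noise - noise)

    optimizer.zero_grad()
    # Push the gradient through the rendering into the NeRF weights,
    # skipping the diffusion model's own Jacobian (as SDS does).
    rendered_image.backward(gradient=grad)
    optimizer.step()
```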
Once the NeRF has been optimized, we can use the Marching Cubes algorithm to extract a 3D mesh from it. This mesh can then be imported into popular 3D renderers or modeling software.
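As a sketch of that last step, assuming you have already sampled the NeRF’s density field onto a regular 3D grid (the grid file, resolution, and iso-surface threshold here are illustrative), the mesh extraction could look like this with scikit-image:

```python
# Sketch: extract a triangle mesh from a density grid with Marching Cubes
# and write it out as a Wavefront OBJ file.
# Assumes `density` is the NeRF's volume density sampled on a regular 3D grid.
import numpy as np
from skimage import measure

density = np.load("density_grid.npy")   # e.g. shape (256, 256, 256); hypothetical file

# The iso-surface threshold is a tunable choice, not a value from the paper.
verts, faces, normals, _ = measure.marching_cubes(density, level=10.0)

with open("model.obj", "w") as f:
    for v in verts:
        f.write(f"v {v[0]} {v[1]} {v[2]}\n")
    for tri in faces:
        # OBJ indices are 1-based.
        f.write(f"f {tri[0] + 1} {tri[1] + 1} {tri[2] + 1}\n")
```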
Limitations
While DreamFusion’s output is impressive, especially given that it uses existing text-to-image diffusion models in a novel way, the researchers have noted a few limitations.
The SDS loss has been observed to produce oversaturated and over-smoothed results, which shows up as unnatural coloring and a lack of precise detail in the outputs.
The DreamFusion algorithm is also limited by the resolution of the Imagen model output, which is 64 x 64 pixels. This leads to the synthesized models lacking finer details.
Lastly, the researchers have noted that there is an inherent challenge in synthesizing 3D models from 2D data. There are many possible 3D models that we can generate from a set of 2D images, which makes optimization quite difficult and even ambiguous.
Conclusion
DreamFusion’s 3D renderings work so well because of the ability of text-to-image diffusion models to create any object or scene. It’s impressive how a neural network can understand a scene in 3D space without any 3D training data. I recommend reading the entire paper to learn more about the technical details of the DreamFusion algorithm.
Hopefully, this technology will improve to eventually create photo-realistic 3D models. Imagine entire video games or simulations that use AI-generated environments. It could lower the barrier of entry for video game developers to create immersive 3D worlds!
What role do you think text-to-3D models will play in the future?