Stable Diffusion has now been approved for public release for both commercial and non-commercial use after passing legal and ethical review. It’s great news for developers and academics that the code has finally been made publicly available online. So, without further ado, let’s dive into the model’s architecture and its various applications.
The field of artificial intelligence (AI) has been progressing rapidly, enabling developers and researchers to do things that were previously impossible. With minimal user input, AI models can now generate poetry, music, and artwork. Stable Diffusion, a new text-to-image model developed by Stability AI, LMU, and Runway with backing from EleutherAI and LAION, produces striking artwork in a matter of seconds from a short text prompt. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer are all contributors to the model’s development. While OpenAI’s DALL-E 2 and Google’s Imagen are both formidable image synthesis models, Stable Diffusion has the advantage of being freely available to the public.
The diffusion model, introduced by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli, is currently the gold standard for image synthesis, and it serves as the foundation for Stable Diffusion. However, because it operates directly on pixels, a diffusion model (DM) is computationally expensive and slow. Training the most powerful DMs can take a long time on many GPUs. As a result, only researchers with access to extensive computing resources can benefit from them, and the training also leaves a large carbon footprint. Researchers therefore developed a refined model that works in latent space rather than pixel space. Let’s get a handle on the overall structure of diffusion models before we delve into Stable Diffusion.
Most recently, diffusion models have surpassed GANs as the most effective and accurate models for image synthesis. They are generative probabilistic models built around two processes, forward and reverse diffusion. In forward diffusion, Gaussian noise is added to an image at each time step; reverse diffusion restores the image to its original state. The forward process is a Markov chain: Gaussian noise is added at each link in the chain until the final image is made up entirely of random noise. The model is then trained to undo this noising step by step, so that it can reconstruct an image starting from one containing only Gaussian noise. In applications such as text-to-image conversion, inpainting (reconstructing a portion of an image that has been corrupted or cropped out), and image modification, diffusion models have proven to be highly effective.
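To make the forward process concrete, here is a minimal NumPy sketch of how an image can be noised to an arbitrary time step in one shot. It assumes a linear noise schedule and the commonly used closed form x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise; the function names, the schedule endpoints, and the 8×8 toy "image" are illustrative choices, not details taken from the Stable Diffusion code.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample the noised image x_t directly from the clean image x0."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for an image
alpha_bar = make_alpha_bar()

# Early step: mostly signal. Last step: almost entirely noise.
x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)
x_late, _ = forward_diffuse(x0, 999, alpha_bar, rng)
```

In training, a network would be asked to predict the returned noise from x_t and t; that learned prediction is what drives the reverse process.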
Stable Diffusion Model
The Stable Diffusion Model adds an autoencoder to the preceding model design. The encoder accepts an image as input and transforms it into a representation in a lower-dimensional latent space. The forward and reverse diffusion processes are then applied to this latent representation, and the accompanying decoder converts the result back into an image. Images in a decent-quality dataset are typically 512×512 pixels at the small end and 1024×1024 at the large end, so running diffusion on the latent representation instead of on the pixels significantly reduces the computation required. The authors termed this the Latent Diffusion Model (LDM). LDMs can drastically cut the time and energy spent on training and running the model without compromising its performance. Because the model runs on consumer GPUs, even resource-constrained individuals and research facilities can benefit from it. The model weights have also been made publicly available, so the model can be used immediately without any additional training.
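A quick back-of-the-envelope calculation shows why working in latent space helps. The sketch below assumes the autoencoder downsamples each spatial side by a factor of 8 and produces 4 latent channels, which matches the commonly used Stable Diffusion configuration but is not stated in this article; the helper names are hypothetical.

```python
def tensor_size(height, width, channels):
    """Number of values in an image-like tensor."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent shape under an assumed downsampling factor and channel count."""
    return height // downsample, width // downsample, latent_channels

pixel_elems = tensor_size(512, 512, 3)   # RGB image in pixel space
h, w, c = latent_shape(512, 512)         # (64, 64, 4) under the assumptions
latent_elems = tensor_size(h, w, c)
ratio = pixel_elems / latent_elems

print(pixel_elems, latent_elems, ratio)  # 786432 16384 48.0
```

Under these assumptions, every denoising step touches roughly 48 times fewer values than it would in pixel space.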
Stable Diffusion, like the diffusion model, has many practical and theoretical uses. To support them, Stable Diffusion uses a conditional denoising autoencoder that can create images from a variety of inputs, including text, semantic maps, and images themselves. The conditioning mechanism is a domain-specific encoder that projects the input to an intermediate representation, which is then mapped into the intermediate layers of the DM’s denoising UNet via cross-attention layers.
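The cross-attention step can be sketched in a few lines of NumPy. Queries come from the UNet's intermediate features and keys and values come from the encoded condition (for example, text tokens). This is a single-head illustration with random placeholder weights and made-up dimensions, not the model's actual layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, context, w_q, w_k, w_v):
    """Let each image position attend over the conditioning tokens."""
    q = features @ w_q                       # (n_positions, d)
    k = context @ w_k                        # (n_tokens, d)
    v = context @ w_v                        # (n_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 32))  # 16 latent positions (hypothetical)
context = rng.standard_normal((5, 24))    # 5 condition tokens (hypothetical)
w_q = rng.standard_normal((32, 8))
w_k = rng.standard_normal((24, 8))
w_v = rng.standard_normal((24, 8))

out, weights = cross_attention(features, context, w_q, w_k, w_v)
```

Each row of weights says how strongly one image position draws on each conditioning token, which is how the prompt can steer different regions of the image.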
Text-to-Image Generation: With this kind of modeling, the user provides a written description, and high-quality images that correspond to the description are returned as output. In the original latent diffusion experiments, a BERT-style language transformer serves as the conditioning mechanism, encoding the input text into the representation that guides denoising; the released Stable Diffusion checkpoints use a frozen CLIP text encoder for the same purpose.
Inpainting and Outpainting: This application involves filling masked regions of an image with new content, either repairing a corrupted or cropped region or introducing something new into it. In inpainting, the user masks an area of an image, then feeds the masked image into the model to generate a new image without the unwanted region. New information, such as a missing hand, can be introduced in the same way. The model can also be used for outpainting, where the image is extended beyond its original borders and additional content is generated to fill the new area.
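One simple way to picture the masking idea is as a final compositing step: keep the original pixels outside the mask and take the model's output inside it. The sketch below illustrates only that blend, with a constant array standing in for generated content; it is not how Stable Diffusion's inpainting pipeline is implemented internally.

```python
import numpy as np

def composite(original, generated, mask):
    """Use generated content where mask == 1, original pixels where mask == 0."""
    return mask * generated + (1.0 - mask) * original

original = np.zeros((4, 4))        # toy grayscale image
generated = np.ones((4, 4))        # stand-in for model output
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0               # region the user wants replaced

result = composite(original, generated, mask)
```

Outpainting fits the same picture: the original is padded to a larger canvas and the mask covers the new border area.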
Image Transformations: The model can also transform an existing image in response to a user-provided textual description. To do that, the user supplies both an image and a written description of the desired result, and the model alters the image so that it more closely resembles the description.
Super-Resolution: Stable Diffusion can also be used to improve image quality by increasing the resolution of low-quality or blurred images. For this task, the model is trained on low-resolution photos paired with their matching high-resolution versions, so that when a user inputs a poor-quality image, the model returns a high-quality version of it.
The release of the Stable Diffusion model’s source code paves the way for a wide range of new uses across disciplines, but it also poses risks, such as the proliferation of false information and “deep fakes.” What do you think about the newly released image synthesis model? Let us know in the comments section.