Quickly generating high-quality images is crucial for creating realistic simulated environments, particularly for training self-driving cars to navigate unpredictable hazards safely. Traditional generative AI techniques, while impressive, have their drawbacks. Diffusion models, for example, can produce stunningly realistic images but are often too slow and computationally intensive for many applications. In contrast, autoregressive models, the kind that power large language models (LLMs) like ChatGPT, generate images rapidly but usually compromise on quality, producing images riddled with errors.
Researchers from MIT and NVIDIA have collaborated on a solution that leverages the strengths of both techniques. Their hybrid image-generation tool, known as HART (short for hybrid autoregressive transformer), uses an autoregressive model to quickly sketch the overall image, followed by a small diffusion model that refines its details.
HART can create images that match, and often exceed, the quality of leading diffusion models, while generating them about nine times faster. The process consumes far fewer computational resources, enough that HART can run on a standard laptop or smartphone. Users generate an image by entering a single natural language prompt into the HART interface.
This tool has a wide range of potential applications, from helping researchers train robots for complex real-world tasks to assisting designers in creating striking video game environments. Haotian Tang, co-lead author of a paper describing HART, draws an analogy to painting: “If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART.”
Alongside Tang, co-lead author Yecheng Wu, an undergraduate student at Tsinghua University, and senior author Song Han, an associate professor at MIT and a distinguished scientist at NVIDIA, will present HART at the International Conference on Learning Representations.
Traditional diffusion models, such as Stable Diffusion and DALL-E, are known for their remarkably detailed output. They generate an image through an iterative process, repeatedly predicting the noise on each pixel and removing it, often over 30 or more steps, until a clean image emerges. This yields high-quality images, but it is slow and resource-intensive. Autoregressive models, by contrast, predict an image as a sequence of discrete tokens, which is much faster, but because those tokens are a compressed representation of the image, detail and accuracy are lost along the way.
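To make that trade-off concrete, here is a minimal PyTorch sketch of the two generation styles. The tiny models and update rules below are illustrative stand-ins, not the internals of Stable Diffusion, DALL-E, or any production system.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a diffusion model's noise predictor."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)

def diffusion_sample(denoiser, steps=30, shape=(1, 3, 64, 64)):
    """Start from pure noise and refine it over many sequential passes."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):          # one full model evaluation per step
        noise_pred = denoiser(x, t)
        x = x - noise_pred / steps            # crude update; real samplers are more careful
    return x

class ToyTokenPredictor(nn.Module):
    """Stand-in for an autoregressive transformer over discrete image tokens."""
    def __init__(self, vocab=1024, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        h = self.embed(tokens).mean(dim=1)    # toy summary of the tokens so far
        return self.head(h)                   # logits for the next token

def autoregressive_sample(model, num_tokens=16):
    """Predict discrete tokens one at a time; fast, but quantization drops fine detail."""
    tokens = torch.zeros(1, 1, dtype=torch.long)
    for _ in range(num_tokens):
        logits = model(tokens)
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```

The contrast to notice: the diffusion loop calls its network dozens of times per image, while the autoregressive loop is bounded by the much smaller number of tokens but works only with a coarse, quantized description of the image.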
HART’s design combines the two: an autoregressive model first predicts the compressed, discrete image tokens, and a smaller diffusion model then predicts residual tokens, the information those discrete tokens could not capture. By correcting the details the first stage overlooks, the diffusion model gives HART a substantial edge in reconstruction quality.
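The following is a minimal sketch of that two-stage flow, assuming toy components; the class names and the networks inside them are hypothetical, not HART's actual architecture. The autoregressive stage commits to discrete tokens that set the big picture, and a small diffusion stage predicts only the continuous residual that gets added back on top.

```python
import torch
import torch.nn as nn

class CoarseARStage(nn.Module):
    """Toy autoregressive stage: emits discrete tokens that sketch the whole image."""
    def __init__(self, vocab=1024, num_tokens=64, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)
        self.vocab, self.num_tokens = vocab, num_tokens

    def generate(self):
        # Stand-in for sampling tokens one by one from an autoregressive transformer.
        token_ids = torch.randint(0, self.vocab, (1, self.num_tokens))
        return self.codebook(token_ids)                  # discrete tokens -> coarse latents

class ResidualDiffusionStage(nn.Module):
    """Toy diffusion stage: predicts the continuous residual the discrete tokens missed."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def sample(self, coarse, steps=8):                   # a handful of denoising steps
        residual = torch.randn_like(coarse)
        for _ in range(steps):
            residual = residual - self.net(residual + coarse) / steps
        return residual

def generate_latents():
    coarse = CoarseARStage().generate()                  # fast sketch of the whole image
    residual = ResidualDiffusionStage().sample(coarse)   # fine, high-frequency corrections
    return coarse + residual                             # final latents, ready for a pixel decoder
```

The `steps=8` here anticipates the point made below: because the residual stage only has to clean up details rather than build an image from scratch, it needs far fewer denoising steps than a standalone diffusion model.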
“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” explains Tang.
Because HART’s diffusion model only has to refine the image the autoregressive model has already laid out, it can do so in eight steps, rather than the 30 or more a standard diffusion model needs to generate a full image. HART therefore retains the speed of the autoregressive model while significantly improving the fine details it generates.
During HART’s development, the team faced challenges in integrating the diffusion model effectively. They found that applying it early in the autoregressive process caused errors to accumulate; using the diffusion model only for the final step, to predict the residual tokens, dramatically improved overall image quality.
The final architecture pairs an autoregressive transformer with 700 million parameters and a lightweight diffusion model with 37 million parameters, yet it produces images of comparable quality to diffusion models with 2 billion parameters while running about nine times faster and using 31 percent less computation.
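A quick back-of-the-envelope check of those figures, a sketch using only the parameter counts quoted above:

```python
ar_params = 700e6          # autoregressive transformer
diffusion_params = 37e6    # lightweight residual diffusion model
baseline_params = 2e9      # comparison diffusion model

total = ar_params + diffusion_params
print(f"HART total: {total / 1e6:.0f}M parameters")                   # 737M
print(f"Diffusion refiner's share: {diffusion_params / total:.1%}")   # ~5%
print(f"Fraction of the 2B baseline: {total / baseline_params:.0%}")  # ~37%
```

The lightweight refiner accounts for only about 5 percent of HART's total parameters, which is consistent with it adding little runtime overhead on top of the autoregressive stage.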
Furthermore, because HART is built around an autoregressive model, it fits naturally with the emerging class of unified vision-language models. That could open the door to new capabilities, such as models that illustrate the steps needed to complete a task, like assembling a piece of furniture.
As Tang notes, “LLMs are a good interface for all sorts of models, like multimodal models and those capable of reasoning. An efficient image-generation model would unlock a lot of possibilities.” The researchers also aim to extend HART’s architecture for applications in video generation and audio prediction tasks, considering its scalable and adaptable nature.
This research has received funding from the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation, with training infrastructure support from NVIDIA.