ELM

Abstract

The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, given the essential differences between the text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness than text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models show improved capabilities in this area, which helps explain the performance gains achieved when scaling up model size. We further elucidate the design space of language models for vision generation, including tokenizer choice, model choice, model scalability, vocabulary design, and sampling strategy, through extensive comparative experiments. Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains. Finally, our elucidated language model for image generation, termed ELM, achieves state-of-the-art performance on the ImageNet 256×256 benchmark.

Method

ELM uses a Binary Autoencoder (BAE) as its image tokenizer, splits each binary code into two subcodes, and applies a high degree of randomness during inference.
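As a rough illustration of the subcode split, the sketch below treats a binary code as a list of bits, cuts it into equal parts, and maps each part to an integer token index. The function names `split_code` and `merge_code` and the big-endian bit order are assumptions made for this example; they are not taken from the ELM code base.

```python
def split_code(bits, parts=2):
    """Split a binary code into `parts` equal subcodes and return one integer index per subcode."""
    if len(bits) % parts != 0:
        raise ValueError("code length must be divisible by the number of parts")
    width = len(bits) // parts
    indices = []
    for p in range(parts):
        chunk = bits[p * width:(p + 1) * width]
        index = 0
        for b in chunk:
            # Big-endian: the first bit of the chunk is the most significant (an assumption).
            index = (index << 1) | b
        indices.append(index)
    return indices


def merge_code(indices, width):
    """Inverse of split_code: rebuild the full bit list from subcode indices of `width` bits each."""
    bits = []
    for index in indices:
        bits.extend((index >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


code = [1, 0, 1, 1, 0, 0, 0, 1]      # hypothetical 8-bit binary code
subcodes = split_code(code)           # two 4-bit subcode indices
restored = merge_code(subcodes, 4)    # round-trips back to the original bits
```

For a code of 2k bits, splitting in two replaces a single index space of 2^(2k) values with two index spaces of 2^k values each.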

The major contributions of ELM:

For the difference between image and language sequence generation, we identify a fundamental difference between the token distributions of discretized images and text: image tokens exhibit much higher randomness than language tokens, which makes image sequences harder to model.
For the image tokenizer, we examine Vector Quantization (VQ-VAE) and the Binary Autoencoder (BAE). We find that BAE consistently achieves 100% code utilization, resulting in better reconstruction ability and generation performance. Binary codes also allow greater flexibility when constructing the vocabulary.
For the language modeling method, we thoroughly examine AutoRegressive (AR) models and Masked Language Models (MLMs) within the realm of image generation. Our findings suggest that the AR mechanism holds greater potential in the visual domain.
For the learning behavior of language models in the image domain, we show that AR models can learn effective image patterns without inductive bias, and we identify distinct patterns across model sizes. Specifically, models of all sizes effectively capture the importance of local information during image generation, while larger models also learn global relationships in some layers, which offers a concise explanation of the performance gain of larger models.
For the vocabulary design, leveraging an image discretization mechanism with BAE, our results reveal that vocabulary decomposition helps improve performance and reduce computational cost. Specifically, we conclude that splitting each code into two subcodes is the most suitable choice; a stronger BAE with a larger code dimension offers higher potential for image generation, but also requires a larger generation model.
For the sampling strategies, we thoroughly explore effective combinations of the key components during inference, including the classifier-free guidance (CFG) scale, the introduction of randomness (temperature of the Gumbel noise for MLMs and top-k for AR models), and the number of sampling iterations for MLMs. We find that a linearly increasing CFG scale during generation and a large degree of randomness are important for achieving a low FID.
Combining all the key ingredients of the design space explored above, we arrive at a strong Elucidated Language model for iMage generation, termed ELM, which achieves state-of-the-art performance on the ImageNet 256×256 benchmark.
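The sampling ingredients named in the list above (a linearly increasing CFG scale and top-k randomness for AR models) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the start and end scale values, and the exact shape of the linear schedule are all hypothetical. The guidance formula is the standard classifier-free guidance combination of conditional and unconditional logits.

```python
import math
import random


def linear_cfg_scale(step, total_steps, start=1.0, end=4.0):
    """Guidance scale that grows linearly from `start` to `end` (values are illustrative)."""
    if total_steps <= 1:
        return end
    return start + (end - start) * step / (total_steps - 1)


def apply_cfg(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def sample_top_k(logits, k, rng):
    """Sample a token index from the k highest-scoring logits via a softmax over that subset."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]  # subtract peak for numerical stability
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
total_steps = 4                      # hypothetical sequence length
cond = [0.2, 1.5, -0.3, 0.9]         # made-up conditional logits
uncond = [0.1, 0.4, 0.0, 0.8]        # made-up unconditional logits
tokens = []
for step in range(total_steps):
    scale = linear_cfg_scale(step, total_steps)
    guided = apply_cfg(cond, uncond, scale)
    tokens.append(sample_top_k(guided, k=3, rng=rng))
```

A larger `k` admits more candidate tokens at each step and therefore more randomness; in a real model the logits would be recomputed at every step rather than reused.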

See more detailed results in our paper!

Results

Comparison of ELM with other language models on ImageNet 256×256.

Generated samples of ELM-2B with 2-12 BAE

Visualization of the performance improvement from scaling up the tokenizer and model size.


Elucidating the Design Space of Language Models for Image Generation