We scale Representation Autoencoders (RAEs) to large-scale text-to-image synthesis. Compared to their VAE counterparts, RAEs offer faster convergence and improved generation quality across model scales, training compute, and training stages.
TL;DR: RAE scales well to large-scale text-to-image generation, achieving faster convergence and improved generation quality.
Representation Autoencoders (RAEs) transform diffusion modeling by training in high-dimensional semantic spaces rather than compressed latent representations. We bring RAEs to large-scale text-to-image (T2I) generation, demonstrating that they are not only more effective but also simpler to train than traditional VAEs. By operating directly on semantic tokens from a frozen vision encoder (SigLIP-2), RAEs avoid the information loss typical of VAEs and enable a more natural integration with multimodal systems.
Key Insights:
To scale the representation autoencoder in the T2I domain, we first train a RAE decoder on a larger and more diverse dataset than ImageNet.
Throughout this section, we choose SigLIP-2 So400M (patch size 14) as the vision encoder.
Following RAE, we adopt $\ell_1$, LPIPS
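For reference, here is a minimal sketch of such a reconstruction objective, combining only the $\ell_1$ and LPIPS terms named above; the actual recipe may include additional terms (e.g., an adversarial loss), and the loss weights and the `decoder` callable below are placeholders.

```python
import lpips  # pip install lpips

# Frozen perceptual-loss network; the weights below are illustrative placeholders.
lpips_fn = lpips.LPIPS(net="vgg").eval()
L1_WEIGHT, LPIPS_WEIGHT = 1.0, 1.0

def decoder_reconstruction_loss(decoder, z, images):
    """Reconstruction loss for a RAE decoder on frozen-encoder latents z.

    `decoder` maps latent tokens to images in [-1, 1]; `images` holds the
    ground-truth pixels in the same range.
    """
    recon = decoder(z)
    l1 = (recon - images).abs().mean()            # pixel-level l1 term
    perceptual = lpips_fn(recon, images).mean()   # LPIPS perceptual term
    return L1_WEIGHT * l1 + LPIPS_WEIGHT * perceptual
```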
We use a dataset combining roughly 73M data from three main data sources: web image sources from FuseDiT
As shown in the following table, expanding decoder training beyond ImageNet to include web-scale and synthetic data yields only marginal gains on ImageNet itself, but provides moderate improvements on more diverse images (YFCC). This indicates that exposure to a broader distribution enhances the decoder’s generalizability. Text images, however, form a notable exception. For text reconstruction, training on Web + Synthetic data yields little improvement over ImageNet-only training. In contrast, performance improves substantially once text-specific data is included, highlighting that reconstruction quality is very sensitive to the composition of the training data. As shown in the figure above, training the RAE decoder with additional text data is essential for accurate text reconstruction. Overall, RAE reconstruction improves with scale, but the composition of data---not just its size---matters: each domain benefits most from domain-matched coverage.
We also evaluate RAE using different pretrained encoders. In particular, we replace SigLIP-2 with WebSSL-L
In this section, we extend the recently proposed RAE framework to the T2I domain and systematically stress-test its core design choices under large-scale multimodal settings. In particular, we investigate whether the dimension-dependent noise schedule, the noise-augmented decoding strategy, and the wide DDT head (DiTDH)---all central to RAE’s effectiveness on ImageNet---remain equally important when scaling diffusion models.
We adopt the MetaQuery architecture
For this DiT model, we adopt a design based on LightningDiT
We also train visual instruction tuning
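For intuition, the sketch below shows a MetaQuery-style bridge between the language model and the DiT: learnable query tokens are appended to the embedded prompt, passed through the LLM, and the hidden states at the query positions are projected into the DiT conditioning space. The module names, query count, and dimensions are placeholders, not our exact implementation.

```python
import torch
import torch.nn as nn

class MetaQueryBridge(nn.Module):
    """Learnable queries that extract DiT conditioning from an LLM (sketch).

    Real systems differ in how the queries are injected, which LLM layers
    are read out, and how the connector is parameterized.
    """
    def __init__(self, llm, num_queries=64, llm_dim=1536, cond_dim=1792):
        super().__init__()
        self.llm = llm  # e.g., a Qwen2.5-1.5B backbone returning hidden states
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.connector = nn.Sequential(  # project LLM states into the DiT conditioning space
            nn.Linear(llm_dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, prompt_embeds):  # (B, T, llm_dim) embedded prompt tokens
        batch = prompt_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([prompt_embeds, q], dim=1)          # append queries to the prompt
        h = self.llm(inputs_embeds=x).last_hidden_state   # run through the LLM
        return self.connector(h[:, -q.size(1):])          # conditioning tokens for the DiT
```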
The RAE work
We follow the RAE setting and use $n{=}4096$ as the base dimension for computing the scaling factor $\alpha$. We experiment with and without applying the dimension-dependent shift when training text-to-image diffusion models on RAE latents, as shown below.
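Below is a minimal sketch of this shift, assuming the SD3-style timestep shift adopted by RAE with $\alpha = \sqrt{m/n}$, where $m$ is the total latent dimensionality (token count times token width) and $n{=}4096$ is the base dimension; the token count in the example is a placeholder that depends on resolution and patch size.

```python
import math
import torch

def shift_timesteps(t, num_tokens, token_dim, base_dim=4096):
    """Dimension-dependent timestep shift (sketch, assuming the SD3-style
    shift adopted by RAE): alpha = sqrt(m / n), where m = num_tokens *
    token_dim is the total latent dimensionality and n is the base dimension.
    """
    alpha = math.sqrt(num_tokens * token_dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# Illustrative usage: SigLIP-2 tokens are 1152-dimensional; the token count
# below is a placeholder.
t = torch.rand(8)                                   # timesteps sampled uniformly in [0, 1]
t_shifted = shift_timesteps(t, num_tokens=729, token_dim=1152)
```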
Consistent with
While dimension-aware noise scheduling proves essential, we find that other design choices in RAE, which was originally developed for smaller-scale ImageNet models, provide diminishing returns at T2I scale. Here we examine two such techniques: noise-augmented decoding and the wide DDT head (DiTDH).
Noise-augmented decoding. RAE proposes a noise-augmented decoding strategy to bridge the mismatch between clean encoder latents used during training and slightly perturbed latents generated at inference. Formally, it trains the RAE decoder on smoothed inputs \(z' = z + n\), where \(n \sim \mathcal{N}(0,\, \sigma^2 I)\) and \(\sigma\) is sampled from \(|\mathcal{N}(0,\, \tau^2)|\).
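A minimal sketch of this augmentation is shown below; the value of $\tau$ is a placeholder, not the setting used in training.

```python
import torch

def noise_augment(z, tau=0.5):
    """Noise-augmented decoder inputs: z' = z + n with n ~ N(0, sigma^2 I),
    where sigma ~ |N(0, tau^2)| is drawn independently per sample."""
    sigma = torch.randn(z.size(0), device=z.device).abs() * tau  # sigma ~ |N(0, tau^2)|
    sigma = sigma.view(-1, *([1] * (z.dim() - 1)))               # broadcast over tokens and channels
    return z + sigma * torch.randn_like(z)
```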
We visualize the effect of noise-augmented decoding at different training stages in the following figure. The gains are noticeable early in training (before \(\sim\)15k steps), when the model is still far from convergence, but become negligible at later stages. This suggests that noise-augmented decoding acts as a form of regularization that matters most when the model has not yet learned a robust latent manifold.
Wide DDT Head. The DiTDH architecture augments a standard DiT with a shallow but wide DDT head, increasing denoising width without widening the entire backbone. The original RAE experiments were conducted at smaller model scales, where the DiT backbones had hidden widths around 1024, comparable to the RAE latent width (e.g., SigLIP-2 has 1152-dim tokens). In that regime, widening the DDT head compensates for the backbone's limited width without incurring the computational cost of widening the full network architecture.
Our T2I setting is substantially different: DiTs at $\geq$2B parameters are already wide by construction, and the data regime is far more diverse than ImageNet. We revisit DiTDH under these larger-scale conditions to determine whether its advantages persist when both model capacity and data complexity increase. Specifically, we train three DiT variants (0.5B, 2.4B, 3.1B) and construct their corresponding DiTDH counterparts by appending a two-layer, wide (\(d{=}2688\)) DDT head, which introduces an additional +0.28B parameters to each model configuration.
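The sketch below illustrates the general shape of such a wide head: a shallow stack of wide transformer blocks that fuses backbone features with the noisy latent tokens before the final prediction. The block design, head count, and conditioning details are placeholders and differ from the actual architecture.

```python
import torch
import torch.nn as nn

class WideDDTHead(nn.Module):
    """Shallow but wide denoising head appended to a DiT backbone (sketch)."""
    def __init__(self, backbone_dim, latent_dim=1152, head_dim=2688, depth=2, num_heads=21):
        super().__init__()
        self.proj_in = nn.Linear(backbone_dim + latent_dim, head_dim)
        block = nn.TransformerEncoderLayer(
            d_model=head_dim, nhead=num_heads, dim_feedforward=4 * head_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.proj_out = nn.Linear(head_dim, latent_dim)

    def forward(self, backbone_feats, noisy_latents):
        # Fuse backbone features with the noisy RAE tokens and denoise at width 2688.
        x = self.proj_in(torch.cat([backbone_feats, noisy_latents], dim=-1))
        return self.proj_out(self.blocks(x))
```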
The following figure shows that the benefits of DiTDH are most pronounced at smaller scales. At 0.5B parameters, DiTDH achieves substantial improvement, demonstrating that the wide DDT head effectively addresses the width bottleneck when backbone capacity is limited. However, as model size increases to 2.4B and 3.1B, the performance gap narrows considerably, suggesting that raw model capacity increasingly dominates over architectural modifications.
In this section, we conduct an extensive study comparing text-to-image diffusion training using the RAE (SigLIP-2) encoder versus a standard VAE. For the VAE baseline, we adopt the state-of-the-art VAE from FLUX.
We organize our comparison into two stages: pretraining and finetuning. To remove confounding factors and ensure an apples-to-apples comparison, we train the Diffusion Transformer from scratch in each latent space; the only component that differs is the latent space and its decoder (SigLIP-2 RAE vs. FLUX VAE).
For the VAE baseline, this corresponds to the standard two-tower vision–language setup used in recent unified models such as Bagel
Convergence. We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in the figure above, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.
Scaling.
We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models
In the figure above, we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.
We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity.
This observation aligns with discussions in large-scale visual SSL literature
We also experiment with training RAE with WebSSL ViT-L
Following standard practice in T2I training
RAE-based models consistently outperform VAE-based models. We finetune both families of models for \{4, 16, 64, 128, 256\} epochs and compare their performance on GenEval and DPG-Bench in the following figure. Across all settings, the RAE-based model maintains a clear advantage on both benchmarks.
RAE-based models are less prone to overfitting. As shown in the following figure, VAE-based models begin to degrade in performance after 64 epochs and deteriorate more noticeably by 256, whereas RAE-based models remain stable and show only a mild decline. Examining the diffusion loss curves (see the appendix) suggests that this difference stems from overfitting in the VAE setting—the training loss drops rapidly and deeply—while the RAE loss decreases more gradually and stabilizes at a higher value. We hypothesize that the higher-dimensional and semantically structured latent space of the RAE\footnote{SigLIP-2 produces 1152-dim. tokens vs. $<$100 in typical VAEs} may provide an implicit regularization effect, helping mitigate overfitting during finetuning.
RAE's advantage generalizes across settings. To verify whether RAE's advantage over VAE extends beyond our main setup, we conduct two additional experiments: 1) fine-tuning only the DiT while freezing the LLM (following recent works
We conduct a comparative study of how the choice of visual generation backbone—VAE versus RAE—affects multimodal understanding performance.
We evaluate the trained models on standard benchmarks: MME
Similar to prior findings
A unique advantage of using RAE for generation is that the LLM operates entirely in the same latent space used for image understanding, leaving the representation and pixel spaces fully decoupled. This allows the LLM to produce latents it can directly interpret, without the need for repeated decode–re-encode cycles between pixels and features.
Here, we demonstrate one direct benefit of operating in a unified latent space: the LLM itself can act as a verifier for test-time scaling.
We consider two verifier metrics: Prompt Confidence and Answer Logits. For Prompt Confidence, we follow
With the verifier defined, we adopt the standard test-time scaling protocol
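A minimal sketch of this verifier-guided best-of-$N$ selection is given below; `generate_latents` and `score_latents` are placeholder callables standing in for the DiT sampler and the LLM verifier (Prompt Confidence or Answer Logits), respectively.

```python
def best_of_n(prompt, generate_latents, score_latents, n=8):
    """Verifier-guided test-time scaling (sketch): sample n candidate RAE
    latents for a prompt and keep the one the verifier scores highest.

    `generate_latents(prompt)` returns candidate latent tokens from the DiT;
    `score_latents(prompt, z)` returns a scalar verifier score computed by
    the LLM directly on z, without decoding to pixels.
    """
    candidates = [generate_latents(prompt) for _ in range(n)]
    scores = [float(score_latents(prompt, z)) for z in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]
```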
We demonstrate that Representation Autoencoders (RAEs) successfully scale to large-scale text-to-image generation. Our findings show that RAEs not only work at scale but actually simplify the design: complex modifications like wide DDT heads become unnecessary as model capacity increases. By offering faster convergence, better generation quality, and a shared latent space for unified modeling, RAE establishes itself as a simple yet powerful foundation for next-generation generative models.
@article{scale-rae-2026,
title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
author={
Shengbang Tong and Boyang Zheng and Ziteng Wang and
Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie
},
journal={arXiv preprint},
year={2026}
}