Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Authors

Affiliations

Resources

TL;DR: RAE scales well to large-scale text-to-image generation, achieving faster convergence and improved generation quality.

Overview

Representation Autoencoders (RAEs) transform diffusion modeling by training in high-dimensional semantic spaces rather than compressed latent representations. We bring RAEs to large-scale text-to-image (T2I) generation, demonstrating that they are not only more effective but also simpler to train than traditional VAEs. By operating directly on semantic tokens from a frozen vision encoder (SigLIP-2), RAEs avoid the information loss typical of VAEs and enable a more natural integration with multimodal systems.

Key Insights:

Decoding: Training decoders on diverse web and synthetic data is critical for high-fidelity reconstruction, especially for text.
Scalability: RAEs consistently outperform VAEs across all model scales, from 0.5B to 9.8B parameters.
Simplicity: Scaling simplifies the architecture; complex tricks like wide DDT heads become unnecessary as capacity increases.
Stability: RAEs are remarkably robust to overfitting during fine-tuning, maintaining quality where VAEs degrade.
Unified Modeling: A shared latent space allows the model to "see" what it generates, enabling capabilities like self-verification.

Pipeline overview for representation autoencoders — **RAE converges faster than VAE in text-to-image pretraining.** We train Qwen-2.5 1.5B + DiT 2.4B models from scratch on both RAE (SigLIP-2) and VAE (FLUX) latent spaces for up to 60k iterations. RAE converges significantly faster than VAE on both GenEval (4.0×) and DPG-Bench (4.6×).

Scaling Decoder Training Beyond ImageNet

To scale the representation autoencoder in the T2I domain, we first train a RAE decoder on a larger and more diverse dataset than ImageNet. Throughout this section, we choose SigLIP-2 So400M (patch size 14) as the frozen encoder, and train a ViT-based decoder to reconstruct the image from these tokens at $224\times224$ resolution.

RAE decoders trained on more data (web, synthetic text) generalize across domains. — **Decoders trained only on ImageNet reconstruct natural images well but struggle with text-rendering scenes.** Adding web and text data greatly improves text reconstruction while maintaining natural-image quality. Compared to proprietary VAEs, RAE achieves competitive overall fidelity.

Training objective and data.

Following RAE, we adopt $\ell_1$, LPIPS, and adversarial losses. Additionally, we integrate Gram Loss, which is found beneficial for reconstruction. The training objective is set as $ L(x, \hat{x}) = \ell_1(x, \hat{x}) + \omega_L \text{LPIPS}(x, \hat{x}) + \omega_G \text{Gram}(x, \hat{x}) + \omega_A \text{Adv}(x,\hat{x})$.

We use a dataset combining roughly 73M data from three main data sources: web image sources from FuseDiT, synthetic images generated by FLUX.1-schnell, and RenderedText, which focuses on text-rendering scenes. Details are provided in Section.

Data composition matters for reconstruction fidelity of RAE

As shown in the following table, expanding decoder training beyond ImageNet to include web-scale and synthetic data yields only marginal gains on ImageNet itself, but provides moderate improvements on more diverse images (YFCC). This indicates that exposure to a broader distribution enhances the decoder’s generalizability. Text images, however, form a notable exception. For text reconstruction, training on Web + Synthetic data yields little improvement over ImageNet-only training. In contrast, performance improves substantially once text-specific data is included, highlighting that reconstruction quality is very sensitive to the composition of the training data. As shown in the figure above, training the RAE decoder with additional text data is essential for accurate text reconstruction. Overall, RAE reconstruction improves with scale, but the composition of data---not just its size---matters: each domain benefits most from domain-matched coverage.

Data Sources	#Data	ImageNet ↓	YFCC ↓	Text ↓
ImageNet	1.28M	0.462	0.970	2.640
Web	39.3M	0.529	0.629	2.325
Web + Synthetic	64.0M	0.437	0.683	2.406
Web + Synthetic + Text	73.0M	0.435	0.702	1.621

Data matters for RAE's reconstruction fidelity. Training on web-scale images consistently improves reconstruction quality across all domains.

Different encoders

We also evaluate RAE using different pretrained encoders. In particular, we replace SigLIP-2 with WebSSL-L, a large-scale self-supervised model. As shown in the following table, WebSSL-L achieves stronger reconstruction performance than SigLIP-2 across all domains. Both SigLIP-2 and WebSSL-L consistently outperform SDXL VAE, though they still fall short of FLUX VAE.

Family	Model	ImageNet ↓	YFCC ↓	Text ↓
VAE	SDXL	0.930	1.168	2.057
VAE	FLUX	0.288	0.410	0.638
RAE	WebSSL ViT-L	0.388	0.558	1.372
RAE	SigLIP-2 ViT-So	0.435	0.702	1.621

Comparison of reconstruction performance. After expanding training data, RAE outperforms SDXL-VAE across all domains, though it still trails FLUX-VAE. Within RAE variants, WebSSL reconstructs better than SigLIP-2.

RAE is Simpler in T2I

In this section, we extend the recently proposed RAE framework to the T2I domain and systematically stress-test its core design choices under large-scale multimodal settings. In particular, we investigate whether the dimension-dependent noise schedule, the noise-augmented decoding strategy, and the wide DDT head (DiT^DH)---all central to RAE’s effectiveness on ImageNet---remain equally important when scaling diffusion models.

We adopt the MetaQuery architecture for text-to-image (T2I) generation and unified modeling. The model begins with a pretrained language model and introduces a set of learnable query tokens that are prepended to the text prompt. The LLM jointly processes the text and queries, producing query-token representations that serve as the conditioning signal. A 2-layer MLP connector then projects these representations from the LLM’s hidden space into the Diffusion Transformer (DiT).

For this DiT model, we adopt a design based on LightningDiT and train it using the flow matching objective. Critically, our model does not operate in a compressed VAE space. Instead, the DiT learns to model the distribution of high-dimensional, semantic representations generated by the frozen representation encoder. During inference, the DiT generates a set of features conditioned on the query tokens, which are then passed to our trained RAE decoder for rendering into pixel space.

We also train visual instruction tuning for image understanding. For this, we use a separate 2-layer MLP projector that maps visual tokens into the LLM’s embedding space. Importantly, these visual tokens come from the same frozen representation encoder whose features the diffusion model is trained to generate.

Model architecture operating on RAE tokens — Overview of training pipeline. Left: RAE decoder training stage. We train a decoder on the representations (yellow tokens) produced by the frozen RAE encoder. Right: End-to-end unified training of the autoregressive model, diffusion transformer, and learnable query tokens (gray tokens) using cross-entropy (CE) loss for text prediction and a flow-matching objective for image prediction.

Noise scheduling remains crucial

The RAE work argues that conventional noise schedules become suboptimal when applied to high-dimensional latent spaces. The paper proposes a dimension-dependent noise schedule shift that rescales the diffusion timestep according to the effective data dimension $m = N \times d$ (number of tokens $\times$ token dimension). Formally, given a base schedule $t_n \in [0, 1]$ defined for a reference dimension $n$, the shifted timestep is computed as \begin{equation*} t_m = \frac{\alpha t_n}{1 + (\alpha - 1)t_n}, \quad \text{where} \quad \alpha = \sqrt{\frac{m}{n}}. \end{equation*}

We follow the RAE setting and use $n{=}4096$ as the base dimension for computing the scaling factor $\alpha$. We experiment with and without applying the dimension-dependent shift when training text-to-image diffusion models on RAE latents, as shown below.

Setting	GenEval ↑	DPG-Bench ↑
w/o shift	23.6	54.8
w/ shift	49.6	76.8

Effect of shift on GenEval and DPG-Bench performance.

Consistent with, applying the noise shift dramatically improves both GenEval and DPG-Bench scores, demonstrating that adjusting the schedule to the effective latent dimension is critical for T2I.

Design Choices that Saturate

While dimension-aware noise scheduling proves essential, we find that other design choices in RAE, which was originally developed for smaller-scale ImageNet models, provide diminishing returns at T2I scale. Here we examine two such techniques: noise-augmented decoding and the wide DDT head (DiT^DH).

Noise-augmented decoding. RAE proposes a noise-augmented decoding strategy to bridge the mismatch between clean encoder latents used during training and slightly perturbed latents generated at inference. Formally, it trains the RAE decoder on smoothed inputs $z' = z + n$, where $n \sim \mathcal{N}(0,\, \sigma^2 I)$ and $\sigma$ is sampled from $|\mathcal{N}(0,\, \tau^2)|$.

We visualize the effect of noise-augmented decoding at different training stages in the following figure. The gains are noticeable early in training (before $\sim$15k steps), when the model is still far from convergence, but become negligible at later stages. This suggests that noise-augmented decoding acts as a form of regularization that matters most when the model has not yet learned a robust latent manifold.

Wide DDT Head. The DiT^DH architecture augments a standard DiT with a shallow but wide DDT head, increasing denoising width without widening the entire backbone. The original RAE experiments were conducted at smaller model scales, where the DiT backbones had hidden widths around 1024, comparable to the RAE latent width (e.g., SigLIP-2 has 1152-dim tokens). In that regime, widening the DDT head compensates for the backbone's limited width without incurring the computational cost of widening the full network architecture.

Our T2I setting is substantially different: DiTs at $\geq$2B parameters are already wide by construction, and the data regime is far more diverse than ImageNet. We revisit DiT^DH under these larger-scale conditions to determine whether its advantages persist when both model capacity and data complexity increase. Specifically, we train three DiT variants (0.5B, 2.4B, 3.1B) and construct their corresponding DiT^DH counterparts by appending a two-layer, wide ($d{=}2688$) DDT head, which introduces an additional +0.28B parameters to each model configuration.

The following figure shows that the benefits of DiT^DH are most pronounced at smaller scales. At 0.5B parameters, DiT^DH achieves substantial improvement, demonstrating that the wide DDT head effectively addresses the width bottleneck when backbone capacity is limited. However, as model size increases to 2.4B and 3.1B, the performance gap narrows considerably, suggesting that raw model capacity increasingly dominates over architectural modifications.

DiT with DDT head vs. standard DiT at different model scales — DiT^DH yields large gains at 0.5B (+11.2 GenEval), but the advantage diminishes at $>$2.4B, where backbone capacity dominates.

Training Diffusion Model with RAE vs. VAE

In this section, we conduct an extensive study comparing text-to-image diffusion training using the RAE (SigLIP-2) encoder versus a standard VAE. For the VAE baseline, we adopt the state-of-the-art model from FLUX. All experiments follow the same setup described in Section, with identical training configurations; the only difference lies in whether diffusion is performed in the RAE or VAE latent space.

Experimental Protocol.

We organize our comparison into two stages: pretraining and finetuning. We train the Diffusion Transformer from scratch in each latent space (RAE vs. VAE) to remove confounding factors. We ensure apples-to-apples comparison. The only component that differs is the latent space and its decoder (SigLIP-2 RAE vs. FLUX VAE). For the VAE baseline, this corresponds to the standard two-tower vision–language setup used in recent unified models such as Bagel and UniFluid.

Pretraining.

Convergence. We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in the figure above, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.

Scaling. We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models. In this experiment, we train all the models for 30k iterations with a batch size of 2048.

Pretraining scaling comparison between RAE and VAE — **RAE consistently outperforms VAE across all model scales during pretraining.** The performance gap widens with scale, indicating that RAE scales more effectively than VAE.

In the figure above, we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.

We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity. This observation aligns with discussions in large-scale visual SSL literature, which highlight the need for high-quality data scaling to fully exploit model capacity.

Generalizing to other vision encoders.

We also experiment with training RAE with WebSSL ViT-L. Under the same 1.5B LLM and 2.4B DiT setup, the WebSSL RAE performs slightly below the SigLIP-2 version but still exceeds the FLUX VAE baseline (the following table), showing that RAE works well with different pretrained encoders.

Model Variant	GenEval ↑	DPG-Bench ↑
VAE-based models
FLUX VAE	39.6	70.5
RAE-based models
WebSSL ViT-L	46.0	72.8
SigLIP-2 ViT-So	49.5	76.9

SSL encoders are effective RAE backbones for T2I. A WebSSL-based RAE performs slightly worse than SigLIP-2 but remains stronger than FLUX VAE.

Finetuning.

Following standard practice in T2I training, models are finetuned on a smaller high-quality dataset after large-scale pretraining. We run this finetuning stage for both RAE- and VAE-based models under identical settings. Unless otherwise noted, we use the BLIP-3o 60k dataset and start from the 1.5B LLM + 2.4B DiT checkpoint trained for 30k steps in Section. We update both the LLM and the DiT; additional details are provided in the appendix.

RAE-based models consistently outperform VAE-based models. We finetune both family of models for \{4, 16, 64, 128, 256\} epochs and compare the performance on GenEval and DPG-Bench in the following figure. We observe that across all iterations, the RAE-based model shows an advantage on both GenEval and DPG-Bench across all settings.

RAE-based models are less prone to overfitting. As shown in the following figure, VAE-based models begin to degrade in performance after 64 epochs and deteriorate more noticeably by 256, whereas RAE-based models remain stable and show only a mild decline. Examining the diffusion loss curves (see the appendix) suggests that this difference stems from overfitting in the VAE setting—the training loss drops rapidly and deeply—while the RAE loss decreases more gradually and stabilizes at a higher value. We hypothesize that the higher-dimensional and semantically structured latent space of the RAE\footnote{SigLIP-2 produces 1152-dim. tokens vs. $<$100 in typical VAEs} may provide an implicit regularization effect, helping mitigate overfitting during finetuning.

Finetuning comparison between RAE and VAE — **RAE-based models outperform VAE-based models and are less prone to overfitting.** We train both models for 256 epochs and observe that (1) RAE-based models consistently achieve higher performance, and (2) VAE-based models begin to overfit rapidly after 64 epochs.

RAE's advantage generalizes across settings. To verify whether RAE's advantage over VAE extends beyond our main setup, we conduct two additional experiments: 1) fine-tuning only the DiT while freezing the LLM (following recent works), and 2) scaling to different sizes DiT models (0.5B--9.8B parameters). The following figure shows that RAE consistently outperforms VAE in both settings. The left panel shows that both selective fine-tuning (DiT-only) and joint fine-tuning (LLM+DiT) favor RAE over VAE; notably, the top-performing VAE configuration reaches 78.2, while the weakest RAE approach achieves 79.4. The right panel shows continued RAE gains across the scaling range, with larger models exhibiting greater improvements. We include the DPG-bench results in the appendix.

Scaling finetune comparing RAE and VAE — **RAE-based models outperform VAEs across different settings.** *Left*: When fine-tuning only the DiT versus the full LLM+DiT system, RAE models consistently achieve higher GenEval scores. *Right*: RAE models maintain their advantage over VAE across all DiT model scales (0.5B--9.8B parameters), with the performance gap widening as model size increases.

Implications for Unified Models

Visual understanding.

We conduct a comparative study to study how the choice of visual generation backbone—VAE versus RAE—affects multimodal understanding performance. We evaluate the trained models on standard benchmarks: MME, TextVQA, AI2D, SeedBench, MMMU, and MMMU-Pro. We emphasize that the goal of this work is not to build a SOTA VQA model; achieving that would require additional components such as any-resolution inputs, multimodal continual pretraining, and very high-quality data.

Similar to prior findings, we observe in the following table that adding generative modeling does not degrade visual understanding performance. The choice of RAE vs. VAE in the generative path has little impact, likely because both variants share the same frozen understanding encoder.

Model	MME_P	TVQA	AI2D	Seed	MMMU	MMMU_P
Und.-only	1374.8	44.7	63.9	67.1	40.2	20.5
RAE-based	1468.7	39.6	66.7	69.8	41.1	19.8
VAE-based	1481.7	39.3	66.7	69.7	37.2	18.7

Generative training leaves understanding intact; RAE and VAE perform similarly. Across VL benchmarks, both latent choices produce comparable understanding performance.

Test-time scaling in latent space.

A unique advantage of using RAE for generation is that the LLM operates entirely in the same latent space used for image understanding, leaving the representation and pixel spaces fully decoupled. This allows the LLM to produce latents it can directly interpret, without the need for repeated decode–re-encode cycles between pixels and features.

Here, we demonstrate one direct benefit of operating in a unified latent space: the LLM itself can act as a verifier for the latents generated by the diffusion model. This enables a new test-time scaling (tts) method that operates only in the latent space, which we refer to as latent-tts (the following figure).

Latent-space test-time scaling with LLM verifier — **Test-time scaling in latent space.** Our framework allows the LLM to directly evaluate and select generation results within the latent space, bypassing the decode-re-encode process.

We consider two verifier metrics: Prompt Confidence and Answer Logits. For Prompt Confidence, we follow : re-injecting the generated latents along with the original prompt into the LLM and aggregating token-level logits to obtain a confidence score. For Answer Logits, we query the LLM with the question, Does this generated image $\langle$image$\rangle$ align with the $\langle$prompt$\rangle$?, and uses the logit of the yes token. If the LLM responds no, we use the negative logit of the no token as the score.

With the verifier defined, we adopt the standard test-time scaling protocol using a best-of-$N$ selection strategy. As shown in the following table, both verification metrics yield consistent improvements on GenEval, demonstrating that latent-space TTS is not only feasible but also an effective way to enhance generation quality.

Best-of-N	Prompt Confidence	Answer Logits
1.5B LLM + 5.5B DiT (GenEval = 53.2)
4/8	56.7	59.6
4/16	57.5	62.5
4/32	60.0	64.3
7.0B LLM + 5.5B DiT (GenEval = 55.5)
4/8	58.3	62.5
4/16	59.6	65.8
4/32	60.1	67.8

TTS results across LLM–DiT configurations. Substantial performance improvements are observed with both verifier metrics on GenEval. “4/8” refers to selecting the best 4 out of 8 samples.

Conclusion

We demonstrate that Representation Autoencoders (RAEs) successfully scale to large-scale text-to-image generation. Our findings show that RAEs not only work at scale but actually simplify the design: complex modifications like wide DDT heads become unnecessary as model capacity increases. By offering faster convergence, better generation quality, and a shared latent space for unified modeling, RAE establishes itself as a simple yet powerful foundation for next-generation generative models.