Representation Autoencoders (RAEs) reuse pretrained, frozen representation encoders together with lightweight trained decoders to provide high-fidelity, semantically rich latents for diffusion transformers.
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT). We replace the usual VAE with a Representation Autoencoder (RAE): a frozen pretrained representation encoder (e.g., DINOv2) paired with a lightweight trained decoder, which yields a semantically rich, high-fidelity latent space for diffusion.
Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256×256 (no guidance) and 1.13 / 1.13 at 256×256 and 512×512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
As with training VAEs, there are two questions to answer: can the RAE latents be decoded back into high-quality images, and are they suitable for diffusion modeling?
We challenge the common assumption that pretrained representation encoders, such as DINOv2 and SigLIP2, are unsuitable for the reconstruction task because they “emphasize high-level semantics while downplaying low-level details”.
As shown below, RAEs achieve consistently better reconstruction quality (rFID) than SD-VAE. For instance, RAE with MAE-B/16 reaches an rFID of 0.16, clearly outperforming SD-VAE and challenging the assumption that representation encoders cannot recover pixel-level detail.
We next study the scaling behavior of both encoders and decoders. As shown in Table 1c, reconstruction quality remains stable across DINOv2-S, B, and L, indicating that even small representation encoder models preserve sufficient low-level detail for decoding. On the decoder side (Table 1b), increasing capacity consistently improves rFID: from 0.58 with ViT-B to 0.49 with ViT-XL. Importantly, ViT-B already outperforms SD-VAE while being 14× more efficient in GFLOPs, and ViT-XL further improves quality at only one-third of SD-VAE’s cost. We also evaluate representation quality via linear probing on ImageNet-1K in Table 1d. Because RAEs use frozen pretrained encoders, they directly inherit the strong representations of the underlying encoders. In contrast, SD-VAE achieves only approximately 8% accuracy.
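To make the setup concrete, below is a minimal sketch of an RAE: a frozen pretrained encoder paired with a small trainable transformer decoder that maps patch tokens back to pixel patches. The timm model name, image size, decoder depth, and reconstruction objective are illustrative assumptions for this sketch, not the actual architecture or training losses used in the paper.

```python
import math
import torch
import torch.nn as nn
import timm

class RAE(nn.Module):
    """Sketch of a Representation Autoencoder: frozen pretrained encoder + trainable decoder."""

    def __init__(self, encoder_name="vit_base_patch14_dinov2.lvd142m", dec_patch=16, dim=768):
        super().__init__()
        # Frozen representation encoder (DINOv2-B loaded via timm; img_size=224 with patch 14
        # gives a 16x16 grid, i.e. 256 patch tokens, roughly matching the 256-token setting above).
        self.encoder = timm.create_model(encoder_name, pretrained=True, num_classes=0, img_size=224)
        self.encoder.requires_grad_(False).eval()
        # Lightweight trainable decoder: a few transformer layers followed by a linear
        # projection that turns each token into a dec_patch x dec_patch pixel patch.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(dim, dec_patch * dec_patch * 3)
        self.dec_patch = dec_patch

    @torch.no_grad()
    def encode(self, x):
        tokens = self.encoder.forward_features(x)           # (B, 1+N, D) incl. CLS token
        return tokens[:, self.encoder.num_prefix_tokens:]   # keep only the N patch tokens

    def decode(self, z):
        B, N, _ = z.shape
        g = int(math.isqrt(N))                               # side of the patch grid, e.g. 16
        patches = self.to_pixels(self.decoder(z))            # (B, N, p*p*3)
        patches = patches.view(B, g, g, self.dec_patch, self.dec_patch, 3)
        img = patches.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, g * self.dec_patch, g * self.dec_patch)
        return img

    def forward(self, x):
        # Only the decoder receives gradients; it is trained with a reconstruction loss.
        return self.decode(self.encode(x))
```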
With RAE demonstrating good reconstruction quality, we now proceed to investigate the diffusability of its latent space, i.e., how well diffusion transformers can be trained on it.
Following standard practice, we adopt the flow matching objective.
We adopt a patch size of 1, which results in a token length of 256 for all RAEs on 256×256 images, matching the sequence length used by VAE-based DiTs. We note that, since the computational cost of DiT depends primarily on the sequence length, using RAE latents does not incur additional computational cost compared to using VAE latents.
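As a concrete reference, here is a hedged sketch of a flow-matching training step on RAE latents: a straight interpolation path between the clean latent and Gaussian noise, with the DiT trained to predict the constant velocity along that path. The time convention (t = 0 is data, t = 1 is noise) and the `dit(xt, t, y)` signature are assumptions for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, z, y):
    """Velocity-prediction (flow matching) loss on RAE latents.

    z: clean latents from the frozen RAE encoder, shape (B, N, D)
    y: class labels, shape (B,)
    Assumed convention: t = 0 is data, t = 1 is pure noise, with the linear path
    x_t = (1 - t) * z + t * eps, so the target velocity is eps - z.
    """
    eps = torch.randn_like(z)                      # noise endpoint of the path
    t = torch.rand(z.size(0), device=z.device)     # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * z + t_ * eps                   # point on the straight path
    v_target = eps - z                             # constant velocity along the path
    v_pred = dit(xt, t, y)                         # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)
```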
DiT Does Not Work Out of the Box. Surprisingly, the standard diffusion recipe fails with RAE: training directly on RAE latents causes a small backbone such as DiT-S to fail completely, while a larger backbone like DiT-XL significantly underperforms its counterpart trained on SD-VAE latents. To investigate this, we form several hypotheses, which we examine in the following sections.
To better understand the training dynamics of Diffusion Transformers in RAE latent space, we construct a simplified experiment: we randomly select a single image, encode it with the RAE, and test whether the diffusion model can reconstruct it. Starting from a DiT-S, we first vary the model width while fixing the depth. We fix the RAE encoder to DINOv2-B, whose token dimension is 768.
As shown in the following figure, sample quality is poor when the model width $d$ is smaller than the token dimension $n = 768$, but improves sharply and reproduces the input almost perfectly once $d \ge n$. Training losses exhibit the same trend, converging only when $d \ge n$. One might suspect that this improvement simply arises from the larger model capacity, but as the same figure shows, even when doubling the depth from 12 to 24, the generated images remain artifact-heavy, and the training losses fail to converge to the level reached at $d = 768$.
Together, the results indicate that for generation in RAE's latent space to succeed, the diffusion model's width must match or exceed the RAE's token dimension.
This appears to contradict the common belief that data usually has a low intrinsic dimension.
We further extend our investigation to a more practical setting by examining three models of varying width, {DiT-S, DiT-B, DiT-L}. Each model is overfit on a single image encoded by {DINOv2-S, DINOv2-B, DINOv2-L}, respectively, corresponding to token dimensions of {384, 768, 1024}.
As shown above, convergence occurs only when the model width is at least as large as the token dimension (e.g., DiT-B with DINOv2-B), while the loss fails to converge otherwise (e.g., DiT-S with DINOv2-B).
Many prior works have observed that the noise schedule must be adapted to the dimensionality of the data: at the same noise level, a higher-dimensional input retains more information about the clean signal, so the schedule should be shifted toward higher noise.

Specifically, we adopt a resolution-dependent shifting strategy from prior work, treating the overall latent dimension the way those works treat image resolution. This yields significant performance gains, showing its importance for training diffusion models in the high-dimensional RAE latent space.
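Below is a minimal sketch of a dimension-dependent timestep shift, assuming an SD3-style time warp with the shift factor tied to the ratio of total latent dimensions. The reference dimension and the exact formula used in the paper may differ, so treat the constants here as placeholders.

```python
import torch

def shift_timesteps(t, dim, base_dim=4 * 32 * 32):
    """Shift uniformly sampled timesteps toward higher noise for higher-dimensional latents.

    Uses the SD3-style time warp t' = a*t / (1 + (a - 1)*t), with t = 1 being pure noise
    (matching the convention in the flow-matching sketch above).  Tying the shift factor
    `a` to the ratio of total latent dimensions, with an SD-VAE-sized latent (4x32x32) as
    the reference, is an assumption for illustration only.
    """
    a = (dim / base_dim) ** 0.5        # larger latents -> stronger shift toward t = 1 (more noise)
    return a * t / (1 + (a - 1) * t)

# Example: DINOv2-B latents for a 256x256 image have 256 tokens x 768 channels.
t = torch.rand(8)
t_shifted = shift_timesteps(t, dim=256 * 768)
```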
Unlike VAEs, whose latents follow a continuous distribution $\mathcal{N}(\mu, \sigma^2\mathbf{I})$, RAE latents are deterministic: the decoder only ever sees exact encoder outputs during training, so it can struggle with the slightly off-distribution latents produced by the diffusion model at inference time. To mitigate this, we train the decoder with noise augmentation, adding noise drawn from $p_\mathbf{n}(\mathbf{z})$ to the latents before decoding.

We analyze how $p_\mathbf{n}(\mathbf{z})$ affects reconstruction and generation. As shown on the right, it improves gFID but slightly worsens rFID. This trade-off is expected: adding noise smooths the latent distribution and therefore helps reduce OOD issues for the decoder, but it also removes fine details, lowering reconstruction quality.
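The noise augmentation can be sketched as follows, assuming a simple isotropic Gaussian for $p_\mathbf{n}(\mathbf{z})$ applied with some probability during decoder training; the noise scale, probability, and reconstruction loss here are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def decoder_training_step(rae, x, noise_std=0.2, p_aug=0.5):
    """One decoder update with noise augmentation on RAE latents (sketch).

    noise_std and p_aug are illustrative hyperparameters; the paper's p_n(z) may
    differ (e.g., sample the noise scale per example rather than use a fixed std).
    """
    with torch.no_grad():
        z = rae.encode(x)                          # frozen encoder, no gradients
    if torch.rand(()) < p_aug:
        z = z + noise_std * torch.randn_like(z)    # smooth the latent distribution
    x_rec = rae.decode(z)
    return F.l1_loss(x_rec, x)                     # plus whatever perceptual terms are used in practice
```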
Combining all the above techniques, we train a DiT-XL model on RAE latents, which achieves a gFID of 4.28 (Right) after only 80 epochs and 2.39 after 720 epochs.
With the same model size, this not only surpasses prior diffusion baselines such as SiT-XL, but also does so with far fewer training epochs and without any auxiliary representation alignment losses.
In the following sections, we investigate ways to make RAE generation more efficient and effective, pushing it toward state-of-the-art performance.
As discussed in Section 2, within the standard DiT framework, handling higher-dimensional RAE latents requires scaling up the width of the entire backbone, which quickly becomes computationally expensive.
To overcome this limitation, we draw inspiration from DDT and attach a wide yet shallow transformer head to the DiT backbone.
Wide DDT Head. Formally, a DiTDH model consists of a base DiT \(M\) and an additional wide, shallow transformer head \(H\). Given a noisy input \(x_t\), timestep \(t\), and an optional class label \(y\), the combined model predicts the velocity as \(v_t = H\big(M(x_t, t, y),\, x_t,\, t\big)\): the head refines the backbone's features, conditioned on the noisy input and timestep.
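A hedged sketch of this composition is given below: a base DiT backbone produces features, and a wide, shallow transformer head fuses them with the projected noisy tokens to predict the velocity. The fusion by addition, the omission of timestep modulation (e.g., AdaLN) inside the head, and all module names are simplifying assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class DiTDH(nn.Module):
    """Base DiT backbone M plus a wide, shallow DDT-style head H (sketch)."""

    def __init__(self, dit_backbone, token_dim, backbone_dim,
                 head_dim=2048, head_layers=2, nhead=16):
        super().__init__()
        self.backbone = dit_backbone                          # M(x_t, t, y) -> (B, N, backbone_dim)
        self.proj_feat = nn.Linear(backbone_dim, head_dim)    # backbone features -> head width
        self.proj_x = nn.Linear(token_dim, head_dim)          # noisy latent tokens -> head width
        layer = nn.TransformerEncoderLayer(d_model=head_dim, nhead=nhead, batch_first=True)
        self.head = nn.TransformerEncoder(layer, num_layers=head_layers)   # wide but shallow
        self.out = nn.Linear(head_dim, token_dim)             # predict velocity in latent space

    def forward(self, x_t, t, y):
        feats = self.backbone(x_t, t, y)                      # semantic features from the base DiT
        h = self.proj_feat(feats) + self.proj_x(x_t)          # fuse with the noisy input (assumed: addition)
        # Timestep conditioning inside the head (e.g., AdaLN on t) is omitted in this sketch.
        return self.out(self.head(h))                         # v_t, shape (B, N, token_dim)
```

This matches the 2-layer, 2048-dim head configuration used for all DiTDH models below; widening only the head avoids the quadratic cost of widening the entire backbone.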
DiTDH converges faster than DiT. We train a series of DiTDH models with varying backbone sizes (DiTDH-S, B, L, and XL) on RAE latents. We use a 2-layer, 2048-dim DiTDH head for all models. Performance is compared against the standard DiT-XL baseline. DiTDH is substantially more FLOP-efficient than DiT. For example, DiTDH-B requires only \( \sim 40\% \) of the training FLOPs yet outperforms DiT-XL by a large margin; when scaled to DiTDH-XL under a comparable training budget, it achieves an FID of 2.16—nearly half that of DiT-XL.
Convergence Comparison. We compare the convergence behavior of DiTDH-XL with previous state-of-the-art diffusion models.
Scaling Behavior. We compare DiTDH with recent methods across different model scales. Increasing the size of DiTDH consistently improves FID performance. The smallest model, DiTDH-S, achieves a competitive FID of 6.07—already outperforming the much larger REPA-XL. Scaling up to DiTDH-B yields a substantial improvement from 6.07 to 3.38, surpassing all prior works of similar or even larger scale. The performance continues to improve with DiTDH-XL, reaching a new state-of-the-art FID of 2.16 at 80 training epochs.
Performance. We provide a quantitative comparison between DiTDH-XL, our most performant model, and recent state-of-the-art diffusion models on ImageNet 256×256 and 512×512 in Table 1 and Table 2. Our method outperforms all prior diffusion models by a large margin, setting new state-of-the-art FID scores of 1.51 without guidance and 1.13 with guidance at 256×256. On 512×512, with 400-epoch training, DiTDH-XL achieves an FID of 1.13 with guidance, surpassing the previous best of 1.25 set by EDM2.
**Generation@256**

| Method | Epochs | #Params | gFID↓ (w/o guid.) | IS↑ (w/o guid.) | Prec.↑ (w/o guid.) | Rec.↑ (w/o guid.) | gFID↓ (w/ guid.) | IS↑ (w/ guid.) | Prec.↑ (w/ guid.) | Rec.↑ (w/ guid.) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Autoregressive** | | | | | | | | | | |
| VAR | 350 | 2.0B | 1.92 | 323.1 | 0.82 | 0.59 | 1.73 | 350.2 | 0.82 | 0.60 |
| MAR | 800 | 943M | 2.35 | 227.8 | 0.79 | 0.62 | 1.55 | 303.7 | 0.81 | 0.62 |
| xAR | 800 | 1.1B | - | - | - | - | 1.24 | 301.6 | 0.83 | 0.64 |
| **Pixel Diffusion** | | | | | | | | | | |
| ADM | 400 | 554M | 10.94 | 101.0 | 0.69 | 0.63 | 3.94 | 215.8 | 0.83 | 0.53 |
| RIN | 480 | 410M | 3.42 | 182.0 | - | - | - | - | - | - |
| PixelFlow | 320 | 677M | - | - | - | - | 1.98 | 282.1 | 0.81 | 0.60 |
| PixNerd | 160 | 700M | - | - | - | - | 2.15 | 297.0 | 0.79 | 0.59 |
| SiD2 | 1280 | - | - | - | - | - | 1.38 | - | - | - |
| **Latent Diffusion with VAE** | | | | | | | | | | |
| DiT | 1400 | 675M | 9.62 | 121.5 | 0.67 | 0.67 | 2.27 | 278.2 | 0.83 | 0.57 |
| MaskDiT | 1600 | 675M | 5.69 | 177.9 | 0.74 | 0.60 | 2.28 | 276.6 | 0.80 | 0.61 |
| SiT | 1400 | 675M | 8.61 | 131.7 | 0.68 | 0.67 | 2.06 | 270.3 | 0.82 | 0.59 |
| MDTv2 | 1080 | 675M | - | - | - | - | 1.58 | 314.7 | 0.79 | 0.65 |
| VA-VAE | 80 | 675M | 4.29 | - | - | - | - | - | - | - |
| VA-VAE | 800 | 675M | 2.17 | 205.6 | 0.77 | 0.65 | 1.35 | 295.3 | 0.79 | 0.65 |
| REPA | 80 | 675M | 7.90 | 122.6 | 0.70 | 0.65 | - | - | - | - |
| REPA | 800 | 675M | 5.78 | 158.3 | 0.70 | 0.68 | 1.29 | 306.3 | 0.79 | 0.64 |
| DDT | 80 | 675M | 6.62 | 135.2 | 0.69 | 0.67 | 1.52 | 263.7 | 0.78 | 0.63 |
| DDT | 400 | 675M | 6.27 | 154.7 | 0.68 | 0.69 | 1.26 | 310.6 | 0.79 | 0.65 |
| REPA-E | 80 | 675M | 3.46 | 159.8 | 0.77 | 0.63 | 1.67 | 266.3 | 0.80 | 0.63 |
| REPA-E | 800 | 675M | 1.70 | 217.3 | 0.77 | 0.66 | 1.15 | 304.0 | 0.79 | 0.66 |
| **Latent Diffusion with RAE (Ours)** | | | | | | | | | | |
| DiT-XL (DINOv2-S) | 800 | 676M | 1.87 | 209.7 | 0.80 | 0.63 | 1.41 | 309.4 | 0.80 | 0.63 |
| DiTDH-XL (DINOv2-B) | 20 | 839M | 3.71 | 198.7 | 0.86 | 0.50 | - | - | - | - |
| DiTDH-XL (DINOv2-B) | 80 | 839M | 2.16 | 214.8 | 0.82 | 0.59 | - | - | - | - |
| DiTDH-XL (DINOv2-B) | 800 | 839M | 1.51 | 242.9 | 0.79 | 0.63 | 1.13 | 262.6 | 0.78 | 0.67 |
**Generation@512**

| Method | gFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|
| BigGAN-deep | 8.43 | 177.9 | 0.88 | 0.29 |
| StyleGAN-XL | 2.41 | 267.8 | 0.77 | 0.52 |
| VAR | 2.63 | 303.2 | - | - |
| MAGVIT-v2 | 1.91 | 324.3 | - | - |
| xAR | 1.70 | 281.5 | - | - |
| ADM | 3.85 | 221.7 | 0.84 | 0.53 |
| SiD2 | 1.50 | - | - | - |
| DiT | 3.04 | 240.8 | 0.84 | 0.54 |
| SiT | 2.62 | 252.2 | 0.84 | 0.57 |
| DiffiT | 2.67 | 252.1 | 0.83 | 0.55 |
| REPA | 2.08 | 274.6 | 0.83 | 0.58 |
| DDT | 1.28 | 305.1 | 0.80 | 0.63 |
| EDM2 | 1.25 | - | - | - |
| DiTDH-XL (DINOv2-B) | 1.13 | 259.6 | 0.80 | 0.63 |
A central challenge in generating high-resolution images is that resolution scales with the number of tokens: doubling image size in each dimension requires roughly four times as many tokens. To address this, we let the decoder handle resolution scaling by allowing its patch size \(p_d\) to differ from the encoder patch size \(p_e\). When \(p_d = p_e\), the output matches the input resolution; setting \(p_d = 2p_e\) produces a 2× upsampled image, reconstructing a 512×512 image from the same tokens used at 256×256.
Method | #Tokens | gFID ↓ | rFID ↓ |
---|---|---|---|
Direct | 1024 | 1.13 | 0.53 |
Upsample | 256 | 1.61 | 0.97 |
Since the decoder is decoupled from both the encoder and the diffusion process, we can reuse diffusion models trained at 256×256 resolution, simply swapping in an upsampling decoder to produce 512×512 outputs without retraining. This approach slightly increases both rFID and gFID, but is 4× more efficient than quadrupling the number of tokens.
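The resolution arithmetic above is easy to verify with a small helper (the function name is just for illustration): with a 16×16 grid of 256 tokens, a decoder patch size equal to the encoder's yields 256×256 outputs, while doubling it yields 512×512 from the same tokens.

```python
import math

def output_resolution(num_tokens, dec_patch):
    """Output side length when each token is decoded into a dec_patch x dec_patch pixel patch."""
    grid = math.isqrt(num_tokens)        # e.g. 256 tokens -> 16 x 16 grid
    return grid * dec_patch

print(output_resolution(256, 16))        # 256: matches the encoder resolution (p_d = p_e)
print(output_resolution(256, 32))        # 512: 2x upsampling from the same tokens (p_d = 2 * p_e)
```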
In this work, we propose and study RAE and DiTDH. In Section 2, we showed that RAE combined with DiT already brings substantial benefits, even without the additional head. Here, we turn the question around: can DiTDH still provide improvements without the latent space of RAE?
| | VAE | DINOv2-B |
|---|---|---|
DiT-XL | 7.13 | 4.28 |
DiTDH-XL | 11.70 | 2.16 |
To investigate, we train both DiT-XL and DiTDH-XL for 80 epochs on SD-VAE latents with a patch size of 2, using DINOv2-B latents as the reference point, and report unguided FID. As shown on the left, DiTDH performs even worse than DiT on SD-VAE, despite the additional computation introduced by the diffusion head. This indicates that the head provides little benefit in low-dimensional latent spaces; its primary strength arises in the high-dimensional diffusion task introduced by RAE.
DiTDH achieves strong performance when paired with the high-dimensional latent space of RAE. This raises a key question: is the structured representation of RAE essential, or would DiTDH work equally well on unstructured high-dimensional inputs such as raw pixels?
| | Pixel | DINOv2-B |
|---|---|---|
DiT-XL | 51.09 | 4.28 |
DiTDH-XL | 30.56 | 2.16 |
To evaluate this, we train DiT-XL and DiTDH-XL directly on raw pixels. For 256×256 images with a patch size of 16, the resulting DiT input token dimensionality is 16×16×3 = 768, matching that of the DINOv2-B latents. We report unguided FID after 80 epochs. As shown on the right, DiTDH outperforms DiT on pixels, but both models perform far worse than their counterparts trained on RAE latents. These results demonstrate that high dimensionality alone is not sufficient—the structured representation provided by RAE is crucial for achieving strong performance gains.
In this work, we challenge the belief that pretrained representation encoders are too high-dimensional and too semantic for reconstruction or generation. We show that a frozen representation encoder, paired with a lightweight trained decoder, forms an effective Representation Autoencoder (RAE). On this latent space, we train Diffusion Transformers in a stable and efficient way with three added components: (1) matching the DiT width to the encoder token dimensionality, (2) applying a dimension-dependent shift to the noise schedule, and (3) adding decoder noise augmentation so the decoder handles diffusion outputs. We also introduce DiTDH, a shallow-but-wide diffusion transformer head that increases width without quadratic compute. Empirically, RAEs enable strong visual generation: on ImageNet, our RAE-based DiTDH-XL achieves an FID of 1.51 at 256×256 (no guidance) and 1.13 / 1.13 at 256×256 and 512×512 (with guidance). We believe RAE latents are a strong candidate for training diffusion transformers efficiently and robustly in future generative modeling research.
@misc{zheng2025diffusiontransformersrepresentationautoencoders,
title={Diffusion Transformers with Representation Autoencoders},
author={Boyang Zheng and Nanye Ma and Shengbang Tong and Saining Xie},
year={2025},
eprint={2510.11690},
archivePrefix={arXiv},
primaryClass={cs.CV}
}