Spring 2026 CS 444

Assignment 5: Diffusion Transformers (DiT)

Due date: May 6th, 11:59:59 PM

Welcome to the last assignment for CS 444! In this assignment, you will implement Scalable Diffusion Models with Transformers (Diffusion Transformers, or DiT) and train them with DDPM on the MNIST dataset to produce results like those in the image below.


Authors: Jay Mahajan & Ozgur Kara -- shout out to the CS 444 TAs for helping us with this assignment and verifying our implementations!

Download the starter code here, and see setup instructions below under Assignment Setup.

The top-level notebook (MP5.ipynb) will guide you through all the steps. You will mainly focus on implementing (a) the Diffusion Transformer architecture in DiT.py, including patchify/unpatchify, multi-head self-attention, the feed-forward network, and the DiT block with adaptive layer-norm conditioning, and (b) the DDPM training and sampling routines in the notebook.

  • For the transformer, the architecture is inspired by the original DiT paper. You will build it from scratch using PyTorch, including a custom multi-head self-attention module, a feed-forward network, and the adaLN-Zero conditioning mechanism that injects the diffusion timestep into each transformer block.

    You can develop and test Part A on your own PC, since the per-module unit tests in tests.py are lightweight.

  • For diffusion, we follow the standard DDPM formulation with a linear variance schedule. You will implement the precomputed noise schedule, the noisy forward process used during training, and the iterative reverse process used during sampling.

    We train on MNIST so the model is small enough to fit on a free Google Colab T4 GPU.

Notes for compute

For this assignment, you will need Google Colab or access to a strong GPU. We recommend that you use a T4 GPU, which is free for Google Colab. You should expect training to take around 2-3 hours.

To avoid high compute usage, you can develop and test Part A on your own PC. Once the unit tests in tests.py pass, you can copy over the code and finish the rest of the assignment on Colab.

Part A: Building the Transformer

From class, you learned what a transformer is and how diffusion works. In this project, you will build your own transformer and use it to perform diffusion. Let's first start with building the transformer. Our transformer looks something like this:

I know this seems like a lot, but let's start small. For each of these smaller parts, we have given a test bench so you can check whether you are on the right track. You can run it with python tests.py.

Part I: Patchify and Unpatchify

From the lecture, we learned that transformers take in a sequence of tokens. For images, a standard choice is to split each image into patches and use each patch as a token.

For patchify, we want to convert a set of images of shape (batch size, channels, height, width) into a set of tokens of shape (batch size, number of patches, patch size * patch size * channels) given a patch size. Implement this in the Patchify class in DiT.py.

Similarly, for unpatchify, we want to go back from patches of shape (batch size, number of patches, patch size * patch size * channels) back to images. Implement this in the UnPatchify class in DiT.py.

Both of these modules should have no trainable parameters. Hint: You can use einops to implement this easily. You can safely assume that the image is square. You can test this by running tests.py.
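
To make the shapes concrete, here is a minimal sketch of the two operations written as plain functions with reshape and permute instead of einops. The function names, the (patch row, patch column, channel) ordering inside each token, and the 28×28 input with patch size 4 are assumptions for illustration; the ordering expected by tests.py may differ.

```python
import math
import torch

def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """(B, C, H, W) -> (B, num_patches, patch_size * patch_size * C)."""
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size      # patches per column / per row
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                # (B, gh, gw, p, p, C)
    return x.reshape(b, gh * gw, patch_size * patch_size * c)

def unpatchify(tokens: torch.Tensor, patch_size: int, channels: int) -> torch.Tensor:
    """Inverse of patchify, assuming the image is square."""
    b, n, _ = tokens.shape
    g = math.isqrt(n)                              # patches per side
    x = tokens.reshape(b, g, g, patch_size, patch_size, channels)
    x = x.permute(0, 5, 1, 3, 2, 4)                # (B, C, g, p, g, p)
    return x.reshape(b, channels, g * patch_size, g * patch_size)

images = torch.randn(2, 1, 28, 28)                 # MNIST-sized batch
tokens = patchify(images, patch_size=4)            # (2, 49, 16)
restored = unpatchify(tokens, patch_size=4, channels=1)
```

A round trip through both functions should reproduce the input exactly, which is a useful sanity check before running the provided tests.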

Part II: Multi-Head Self-Attention

Attention lets each patch weigh how relevant every other patch is to it. To simplify the implementation, we have split it into two steps: a single attention operation, and a multi-head attention module that puts everything together.

First, the attention operation:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where Q, K, and V are your query, key, and value matrices, and d_k is the key dimension. Implement this in the Attention class in DiT.py.
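
The attention operation described above can be sketched as a standalone function. This assumes the standard scaled dot-product form; the function name and the tensor sizes are illustrative, and the Attention class in the starter code may also own the projections that produce Q, K, and V.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, tokens, d_k). Returns the output and the attention weights."""
    d_k = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, tokens, tokens)
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
out, weights = scaled_dot_product_attention(q, k, v)    # out: (2, 5, 8)
```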

Once you have implemented a single attention operation, we can implement multi-head self-attention.

In this case, each self-attention head has an inner dimension that is equal to the hidden dimension divided by the number of heads. Finally, you should have a final linear layer that projects back to the hidden dimension. Implement this in the MultiHeadSelfAttention class in DiT.py.

Hint: Store the per-head self-attention layers in an nn.ModuleList rather than a plain Python list, so that they are registered as submodules and their parameters are visible to the optimizer. You can test this by running tests.py.
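
The following small experiment illustrates why nn.ModuleList matters. The two classes are hypothetical stand-ins, not part of the starter code, and nn.Linear plays the role of an attention head.

```python
import torch.nn as nn

class PlainListHeads(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: PyTorch does not register these layers.
        self.heads = [nn.Linear(8, 4) for _ in range(2)]

class RegisteredHeads(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer as a submodule.
        self.heads = nn.ModuleList([nn.Linear(8, 4) for _ in range(2)])

plain_count = sum(p.numel() for p in PlainListHeads().parameters())
registered_count = sum(p.numel() for p in RegisteredHeads().parameters())
```

Here plain_count is 0, so an optimizer built from parameters() would never update those heads, while registered_count is 72 (two layers with 8 * 4 weights and 4 biases each).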

Part III: Feed-Forward Network

The feed-forward network consists of two linear layers with a ReLU activation function in between. The first linear layer should project to an inner dimension, and the second linear layer should project back to the hidden dimension. Implement this in the FeedForward class in DiT.py. You can test this by running tests.py.
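
As a shape check, the structure described above can be sketched with nn.Sequential. The sizes (hidden dimension 64, inner dimension 256) are assumed for illustration only; use the values from the notebook.

```python
import torch
import torch.nn as nn

hidden_dim, inner_dim = 64, 256          # illustrative sizes, not the notebook's

ffn = nn.Sequential(
    nn.Linear(hidden_dim, inner_dim),    # project up to the inner dimension
    nn.ReLU(),
    nn.Linear(inner_dim, hidden_dim),    # project back to the hidden dimension
)

ffn_in = torch.randn(2, 49, hidden_dim)  # (batch, tokens, hidden)
ffn_out = ffn(ffn_in)                    # same shape as the input
```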

Part IV: DiT Block

Now that we have all of the components, we can build a single DiT block. The inputs to this DiT block are image tokens of shape (batch size, number of tokens, hidden dimension) and a time embedding of shape (batch size, time embedding dimension).

The first thing you will need to do is implement the MLP for the time embedding. The MLP consists of two linear layers with a SiLU activation in between. The first linear layer projects from the time embedding dimension to the time embedding dimension divided by 4. The second linear layer projects that output to hidden dimension * 6, giving an output of shape (batch size, hidden dimension * 6). (You will also need to zero-initialize the linear layers; there are multiple ways to do this.) Split this output into 6 parameters, each of shape (batch size, hidden dimension), to get α1, β1, γ1, α2, β2, γ2.

Refer to the diagram above. For Scale & Shift, given parameters (γ and β) and the input, the formula is:

Out = (γ + 1) * LayerNorm(x) + β

where LayerNorm should have elementwise_affine set to False.

From here, pass your output into your MultiHeadSelfAttention class.

For Scale, given α and input x:

Out = α * x

Then add your original input back into the output as a residual connection. The result here will also be used as the next residual connection.

Repeat the Scale & Shift using the next set of parameters and then pass the output into the feed-forward network. Scale again with the final parameter and finally add the second residual connection.

The output of the block should have the same dimension as your image tokens.

Hint: The parameters produced by the time embedding MLP have only batch and hidden dimensions, while the image tokens also have a token dimension. To multiply or add them with the image tokens, unsqueeze the parameters and broadcast along the token dimension.
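
The conditioning arithmetic for the first half of the block can be sketched as follows. The sizes are illustrative, the attention call is replaced by a stand-in, and zeroing every linear layer with nn.init.zeros_ is just one of the possible ways to zero-initialize.

```python
import torch
import torch.nn as nn

batch, num_tokens, hidden_dim, time_dim = 2, 49, 64, 128   # illustrative sizes

time_mlp = nn.Sequential(
    nn.Linear(time_dim, time_dim // 4),
    nn.SiLU(),
    nn.Linear(time_dim // 4, hidden_dim * 6),
)
for layer in time_mlp:
    if isinstance(layer, nn.Linear):
        nn.init.zeros_(layer.weight)     # one way to zero-initialize
        nn.init.zeros_(layer.bias)

norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
x = torch.randn(batch, num_tokens, hidden_dim)
t_emb = torch.randn(batch, time_dim)

# Six chunks, each of shape (batch, hidden_dim).
alpha1, beta1, gamma1, alpha2, beta2, gamma2 = time_mlp(t_emb).chunk(6, dim=-1)

# unsqueeze(1) adds the token dimension so (batch, 1, hidden) broadcasts over tokens.
modulated = (gamma1.unsqueeze(1) + 1) * norm(x) + beta1.unsqueeze(1)
branch = modulated                        # stand-in for the attention output
x_after_attn = x + alpha1.unsqueeze(1) * branch
```

Because the MLP starts at zero, every α, β, and γ is zero at initialization, so the block initially passes x through unchanged. This is the "Zero" in adaLN-Zero.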

Implement this in the DiTBlock class in DiT.py. You can test this in tests.py.

Part V: Building the Entire DiT

Once you have that, you can build the full DiT model. The inputs are an image of shape (batch size, channels, height, width) and a timestep tensor of shape (batch size, 1).

  1. Patchify the image using your Patchify class.

  2. Create a 2D positional embedding. To make this easy, we have given this for free — use get_positional_embedding to initialize it.

  3. Create a linear layer that projects from your patch dimension to the hidden dimension. This should NOT be zero-initialized. Add the positional embedding to its output.

  4. Create a time embedding using nn.Embedding. This should take in the total number of timesteps and project to your time embedding dimension.

  5. Initialize the DiT blocks using your DiTBlock class. You will need to pass the list of blocks into nn.ModuleList in order to make this work.

  6. Initialize a LayerNorm and a final linear layer that projects from your hidden dimension to the patch dimension. The final linear layer should be zero-initialized.

  7. Finally, unpatchify the tokens into an image using your UnPatchify class. The output size of the entire DiT should be the same as your input image size.
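
For step 4, it helps to check what nn.Embedding returns for a timestep tensor of shape (batch size, 1). The sketch below assumes 1000 timesteps and a time embedding dimension of 128, which are placeholder values rather than the notebook's hyperparameters.

```python
import torch
import torch.nn as nn

num_timesteps, time_dim, batch = 1000, 128, 4        # placeholder values

time_embedding = nn.Embedding(num_timesteps, time_dim)
t = torch.randint(0, num_timesteps, (batch, 1))      # integer timesteps, (batch, 1)

t_emb = time_embedding(t)                            # (batch, 1, time_dim)
t_emb_flat = t_emb.squeeze(1)                        # (batch, time_dim)
```

The lookup keeps the trailing size-1 dimension of the input, so it has to be squeezed away before the embedding matches the (batch size, time embedding dimension) shape that the DiT block expects.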

Implement this in the DiT class in DiT.py. You can test this in tests.py.

Part B: Training with Diffusion (DDPM)

For this part, you will implement DDPM. Follow the instructions in the notebook for a proper implementation.

Part I: Init Function

In order to properly complete DDPM, we need to noise the image properly. In the __init__ function, you will need to do the following:

  • Create a list β of length T such that β_0 = 0.0001 and β_{T−1} = 0.02, with all other elements evenly spaced between the two.
  • α_t = 1 − β_t
  • ᾱ_t = ∏_{s=0}^{t} α_s
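
The three quantities above can be precomputed in a few lines. This sketch assumes T = 1000 for illustration; use the value given in the notebook.

```python
import torch

T = 1000                                     # assumed for illustration
betas = torch.linspace(1e-4, 0.02, T)        # evenly spaced, endpoints included
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # running product of alphas up to index t
```

Since every α_t is slightly below 1, alpha_bars decreases monotonically from just under 1 toward 0, which is what makes late timesteps almost pure noise.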

Next, initialize the DiT model with the hyperparameters in the notebook. Then, create an AdamW optimizer and an MSE loss function.

Part II: Diffusion Training & Sampling

You will need to implement the training loop. This is defined in the train function and follows Algorithm 1 below:
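
As a rough guide, one training step in the standard DDPM formulation noises a clean batch at random timesteps and regresses the model output onto the noise that was added. The sketch below is an assumption based on that standard formulation, not the notebook's code: q_sample, training_loss, and zero_model are hypothetical names, and T = 1000 is a placeholder.

```python
import torch
import torch.nn.functional as F

T = 1000                                                  # placeholder value
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)                # broadcast over (C, H, W)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

def training_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0], 1))             # one timestep per image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)               # predict the added noise

def zero_model(x_t, t):
    """Stand-in for the DiT: always predicts zero noise."""
    return torch.zeros_like(x_t)

x0 = torch.randn(8, 1, 28, 28)
loss = training_loss(zero_model, x0)
```

In the real loop, the loss would be followed by the usual optimizer zero_grad, backward, and step calls.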

You will also need to implement the sampling function. Implement this in run_inference following Algorithm 2 below:
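
For orientation, the standard DDPM reverse process starts from Gaussian noise and repeatedly removes the predicted noise, adding fresh noise at every step except the last. The sketch below assumes that standard formulation with the variance choice σ_t² = β_t; the function names and T = 1000 are placeholders, and run_inference in the notebook may be organized differently.

```python
import torch

T = 1000                                                  # placeholder value
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    x = torch.randn(shape)                                # start from pure noise
    for step in reversed(range(T)):
        t = torch.full((shape[0], 1), step, dtype=torch.long)
        eps = model(x, t)                                 # predicted noise
        coef = betas[step] / (1.0 - alpha_bars[step]).sqrt()
        mean = (x - coef * eps) / alphas[step].sqrt()
        # No noise is added on the final step.
        z = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        x = mean + betas[step].sqrt() * z
    return x

def zero_model(x_t, t):
    """Stand-in for the trained DiT: always predicts zero noise."""
    return torch.zeros_like(x_t)

samples = sample(zero_model, (2, 1, 8, 8))
```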

Once you have correctly implemented it, run the training! You should expect training to take 2-3 hours to finish.

For full points for DDPM, you must reach an FID in the range of 120 to 150 on MNIST after training. To do this, you should follow the default hyperparameters in the notebook; the bulk of your work is to get the implementation correct rather than tuning.

As you start this assignment, you will realize that training a diffusion model is more computationally intensive than what you are used to (around 2-3 hours on a free Colab T4 GPU). In order to get an idea whether your implementation is on the right track without waiting a long time for training to converge, here are reference samples from a correct implementation at different epochs:

[Figure: reference samples at Epoch 1, Epoch 11, Epoch 21, and Epoch 51]

For reference, here are the expected training loss and FID curves:

  

Extra Credit

There are multiple things you can do to achieve extra credit. Please note that the first three items are extra-large, worth 20 points each, while the remaining ones are standard extra credit (up to 5 points each).

  • Implement a diffusion model using the traditional U-Net architecture instead of DiT. For detailed instructions, feel free to refer to last year's assignment. (+20)

  • Train a latent diffusion model. The inputs to your DiT are latents from a pretrained VAE (stabilityai/sdxl-vae). We recommend using CelebA or LSUN for this task. For simplicity, perform this at 64×64 resolution. Your results will not be perfect, but we expect decent results. Note any changes you made to get it to work. (+20)

  • Implement Flow Matching. For flow matching, you must use a different time embedding — luckily we have given this to you for free in DiT.py as a continuous time embedding. For full credit, you must reach an FID < 100. You must also complete the inference in 100 steps. (+20)

  • Implement a custom DiT architecture and train it. Describe in detail what you changed.

  • Implement RoPE instead of the fixed positional embeddings.

  • Train a class-conditioned DiT and add classifier-free guidance. Explain how you added class conditioning into the DiT.

  • Train using a different dataset (LSUN, CelebA, etc.). Note anything you changed to get it to work.

  • Implement DDIM to speed up inference.

  • Train using a learned positional embedding. This should be a simple nn.Parameter that requires gradients. Display samples from epoch 1, epoch 21, epoch 51, and the final epoch, and compare them with the results from the fixed positional embeddings.

  • Anything impressive with DiTs — feel free to try something!

Assignment Setup

Download the starter code here.

To complete this assignment in a reasonable amount of time, you'll need to use a GPU. This can either be your personal GPU or Google Colab free / Colab Pro with GPU enabled.

Environment Setup

We suggest that you use Anaconda to manage Python package dependencies (https://www.anaconda.com/download). This guide provides useful information on how to use Conda: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html. Ensure that IPython is installed (https://ipython.org/install). You may then navigate to the assignment directory in a terminal and start a local notebook server using the jupyter notebook command.

Unless you have a machine with a GPU, running the training portion of this assignment on your local machine will be very slow and is not recommended. Please use Google Colab or other GPU resources (refer to this Campuswire post). If you are using Colab, make sure to select a GPU runtime by going to "Runtime" > "Change runtime type" and selecting "GPU" as the hardware accelerator. Be mindful of Colab's usage limits if you are using Colab Pro so that you don't run out of GPU hours.

Data Setup

Instructions for downloading MNIST are provided in the notebook; the torchvision dataloader will download it automatically the first time you run the notebook.

If you are running this assignment on Colab: Note that loading data from a Google Drive-mounted folder is very slow. To work around this, the notebook is configured by default to download the data to a path local to the Colab runtime. Please note that this location is non-persistent, meaning the data will be lost if you restart the session; you will need to re-download it.

If you are not running this assignment on Colab: Please update the data path variable in the notebook to point to the directory where you want the data to be downloaded.

Submission Instructions

The assignment deliverables are as follows. If you are working in a pair, only one designated student should make the submission.

  1. Upload the following files to Canvas:
    1. netid_mp5_output.pdf: Your IPython notebook with output cells converted to PDF format
    2. netid_mp5_code.zip: All of your code in a single zip file, containing at least the following files:
      1. MP5.ipynb: Your top-level DiT / DDPM notebook
      2. DiT.py: Your main DiT model implementation
      3. tests.py: The provided test bench (unmodified)
      Do not upload datasets or model checkpoints.
    3. netid_mp5_report.pdf: Your assignment report (using this template) in PDF format
    4. (Optional) Any additional files related to extra credit

Please refer to course policies on collaborations, late submission, etc.