Spring 2026 CS 444
Assignment 5: Diffusion Transformers (DiT)
Due date: May 6th, 11:59:59 PM
Welcome to the last assignment for CS 444! In this assignment, you will implement Scalable Diffusion Models with Transformers (Diffusion Transformers, or DiT) and train them with DDPM on the MNIST dataset to produce results like those in the image below.
Authors: Jay Mahajan & Ozgur Kara -- shout out to the CS 444 TAs for helping us with this assignment and verifying our implementations!
Download the starter code here, and see setup instructions below under Assignment Setup.
The top-level notebook (
Notes for compute
For this assignment, you will need Google Colab or access to a strong GPU. We recommend that you use a T4 GPU, which is free for Google Colab. You should expect training to take around 2-3 hours. To avoid high compute usage, you can develop and test Part A on your own PC.
Once you feel confident the unit tests in
Part A: Building the Transformer
From class, you learned what a transformer is and how diffusion works. In this project, you will build your own transformer and use it to perform diffusion. Let's first start with building the transformer. Our transformer looks something like this:
I know this seems like a lot, but let's start small. For each of these smaller
parts, we have given a test bench so you can check whether you are on the right
track. You can run it with
Part I: Patchify and Unpatchify
From the lecture, we learned that transformers need to take in tokens. For images, we found out that creating patches of images works best as our tokens. For patchify, we want to convert a set of images of shape
Similarly, for unpatchify, we want to go back from patches of shape
Both of these modules should have no trainable parameters.
Hint: You can use
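To make the reshaping concrete, here is a minimal NumPy sketch of patchify and unpatchify. The exact shapes are missing from this copy of the handout, so the layout below is an assumption: images of shape (B, C, H, W), non-overlapping square patches of side P, and tokens of shape (B, N, P*P*C) with N = (H/P) * (W/P). The function names and the ordering of values inside each patch are also illustrative; the starter code and its unit tests take precedence.

```python
import numpy as np

def patchify(images, patch_size):
    """Assumed layout: (B, C, H, W) -> (B, N, P*P*C), N = (H/P) * (W/P)."""
    b, c, h, w = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    gh, gw = h // p, w // p
    # Split each spatial axis into (grid index, offset inside the patch).
    x = images.reshape(b, c, gh, p, gw, p)
    # Bring the grid axes forward and the channel axis last.
    x = x.transpose(0, 2, 4, 3, 5, 1)          # (B, gh, gw, P, P, C)
    return x.reshape(b, gh * gw, p * p * c)

def unpatchify(patches, patch_size, channels):
    """Inverse of patchify, assuming a square grid of patches."""
    b, n, d = patches.shape
    p = patch_size
    g = int(round(n ** 0.5))
    assert g * g == n and d == p * p * channels
    x = patches.reshape(b, g, g, p, p, channels)
    x = x.transpose(0, 5, 1, 3, 2, 4)          # (B, C, gh, P, gw, P)
    return x.reshape(b, channels, g * p, g * p)

# Round trip on MNIST-sized inputs (28x28, one channel, 4x4 patches).
images = np.arange(2 * 1 * 28 * 28, dtype=np.float32).reshape(2, 1, 28, 28)
patches = patchify(images, patch_size=4)
restored = unpatchify(patches, patch_size=4, channels=1)
```

Both functions only reshape and permute axes, which is consistent with the requirement that these modules have no trainable parameters.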
Part II: Multi-Head Self-Attention
Attention is a way to learn which patches are important. To simplify this, we have split it into a single attention operation, and then put everything together in multi-head attention. First, the attention operation:
where
Once you have implemented a single attention operation, we can implement multi-head self-attention.
In this case, each self-attention head has an inner dimension that is equal to the
hidden dimension divided by the number of heads. Finally, you should have a final
linear layer that projects back to the hidden dimension. Implement this in the
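As a rough illustration of the two pieces described above, here is a NumPy sketch of a single attention operation followed by multi-head self-attention. The attention formula is missing from this copy of the handout, so the standard scaled dot-product form, softmax(Q K^T / sqrt(d)) V, is assumed. The sketch also fuses all heads into one projection matrix per Q, K and V and omits biases; the starter code may organize the heads differently, for example as separate per-head modules, which is mathematically equivalent. All names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Assumed form: softmax(q k^T / sqrt(d)) v over the last two axes."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (B, N, hidden); each weight matrix: (hidden, hidden)."""
    b, n, hidden = x.shape
    head_dim = hidden // num_heads  # inner dimension per head

    def split_heads(t):
        # (B, N, hidden) -> (B, heads, N, head_dim)
        return t.reshape(b, n, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    out = attention(q, k, v)                          # (B, heads, N, head_dim)
    out = out.transpose(0, 2, 1, 3).reshape(b, n, hidden)
    return out @ w_o                                  # final projection to hidden

rng = np.random.default_rng(0)
hidden, heads = 32, 4
x = rng.normal(size=(2, 49, hidden))
w_q, w_k, w_v, w_o = (rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(4))
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, heads)
```

The output keeps the input shape (batch, tokens, hidden), which is what lets the block be stacked with residual connections later.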
Hint: You need to pass the self-attention layers to
Part III: Feed-Forward Network
The feed-forward network is 2 linear layers with a ReLU activation function in
between. The first linear layer should project to an inner dimension, and the second
linear layer should project back to the hidden dimension. Implement this in the
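The feed-forward network described above can be sketched in a few lines of NumPy. This is only an illustration of the computation: the inner dimension of 128 is an arbitrary assumed value, and the function and variable names are hypothetical rather than taken from the starter code.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear, applied independently to every token."""
    h = np.maximum(x @ w1 + b1, 0.0)  # project to inner dimension, then ReLU
    return h @ w2 + b2                # project back to hidden dimension

rng = np.random.default_rng(0)
hidden, inner = 32, 128               # inner dimension is an assumed value
w1, b1 = rng.normal(size=(hidden, inner)) * 0.1, np.zeros(inner)
w2, b2 = rng.normal(size=(inner, hidden)) * 0.1, np.zeros(hidden)

tokens = rng.normal(size=(2, 49, hidden))
out = feed_forward(tokens, w1, b1, w2, b2)
```

Because the matrix products act on the last axis only, the same weights are shared across all tokens and the output shape matches the input shape.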
Part IV: DiT Block
Now that we have all of the components, we can build a single DiT block. The inputs
to this DiT block are image tokens of shape
The first thing you will need to do is implement the MLP for the time embedding.
The MLP will consist of 2 linear layers with a SiLU activation in between. The first
linear layer will project the time embedding dimension into a time embedding
dimension divided by 4. The second linear layer will then project the output of the
previous linear layer into
Refer to the diagram above. For Scale & Shift, given parameters
(
where
From here, pass your output into your
For Scale, given
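As a shape check for the Scale & Shift and Scale operations, here is a small NumPy sketch. The handout's own equations are missing from this copy, so the form x * (1 + scale) + shift, taken from the public DiT reference implementation, is an assumption; if the handout defines Scale & Shift differently, follow the handout. The names scale_shift and scale_only are hypothetical.

```python
import numpy as np

def scale_shift(x, scale, shift):
    """x: (B, N, D) tokens; scale, shift: (B, D) per-sample parameters.

    Assumed form: x * (1 + scale) + shift.
    """
    # Add a token axis so (B, D) becomes (B, 1, D) and broadcasts over N.
    return x * (1.0 + scale[:, None, :]) + shift[:, None, :]

def scale_only(x, alpha):
    """Multiply every token of a sample by that sample's (D,) vector."""
    return x * alpha[:, None, :]

rng = np.random.default_rng(0)
b, n, d = 2, 49, 32
tokens = rng.normal(size=(b, n, d))
scale = rng.normal(size=(b, d))   # stand-ins for time-embedding MLP outputs
shift = rng.normal(size=(b, d))
alpha = rng.normal(size=(b, d))

modulated = scale_shift(tokens, scale, shift)
gated = scale_only(modulated, alpha)
```

Each sample in the batch gets its own parameters, and those parameters are shared by all tokens of that sample.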
Then add your original input back into the output as a residual connection. The result here will also be used as the next residual connection. Repeat the Scale & Shift using the next set of parameters and then pass the output into the feed-forward network. Scale again with the final parameter and finally add the second residual connection. The output of the block should have the same dimension as your image tokens.
Hint: The time embedding MLP output parameters have only batch and hidden dimensions, so you need to make their shapes suitable for multiplication/addition with the image tokens input, which also has a token dimension. To do this, unsqueeze the parameters and broadcast along the token dimension.
Implement this in the
Part V: Building the Entire DiT
Once you have that, you can build the full DiT model. The inputs are an image of
shape
Implement this in the
Part B: Training with Diffusion (DDPM)
For this part, you will implement DDPM. Follow the instructions in the notebook for a proper implementation.
Part I: Init Function
In order to properly complete DDPM, we need to noise the image properly. In the
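For orientation, the noising step in DDPM is usually written as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The NumPy sketch below precomputes the schedule and applies that formula. The number of timesteps (1000) and the linear beta schedule from 1e-4 to 0.02 follow the original DDPM paper and are assumptions here; the notebook's hyperparameters take precedence. The name q_sample is hypothetical.

```python
import numpy as np

T = 1000                                  # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative product over timesteps

def q_sample(x0, t, noise):
    """Noise clean images x0 to timestep t in closed form.

    x0, noise: (B, C, H, W); t: (B,) integer timesteps in [0, T).
    """
    # Index the schedule per sample, then reshape to broadcast over C, H, W.
    signal = np.sqrt(alpha_bars[t])[:, None, None, None]
    sigma = np.sqrt(1.0 - alpha_bars[t])[:, None, None, None]
    return signal * x0 + sigma * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 1, 28, 28))
t = np.array([0, 10, 500, T - 1])
noise = rng.normal(size=x0.shape)
x_t = q_sample(x0, t, noise)
```

At t = 0 the result is almost the clean image, while at t = T - 1 the signal coefficient is close to zero and the result is almost pure Gaussian noise.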
Next, initialize the DiT model with the hyperparameters in the notebook. Then,
create an
Part II: Diffusion Training & Sampling
You will need to implement the training loop. This is defined in the
You will also need to implement the sampling function. Implement this in
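As a rough guide to what these two functions compute, here is a NumPy sketch of the noise-prediction loss and the ancestral sampling loop from the original DDPM paper (Algorithms 1 and 2). It is a sketch under stated assumptions, not the notebook's implementation: the schedule values, the variance choice sigma_t^2 = beta_t, and every name are assumptions, and dummy_model is a placeholder that always predicts zero noise so the code can run without a trained DiT.

```python
import numpy as np

T = 1000                                  # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def training_loss(model, x0, rng):
    """One training step's loss: MSE between true and predicted noise."""
    b = x0.shape[0]
    t = rng.integers(0, T, size=b)                    # random timestep per sample
    noise = rng.normal(size=x0.shape)
    signal = np.sqrt(alpha_bars[t]).reshape(b, 1, 1, 1)
    sigma = np.sqrt(1.0 - alpha_bars[t]).reshape(b, 1, 1, 1)
    x_t = signal * x0 + sigma * noise                 # noised input
    return np.mean((model(x_t, t) - noise) ** 2)

def sample(model, shape, rng):
    """Start from Gaussian noise and denoise step by step down to t = 0."""
    x = rng.normal(size=shape)
    for step in reversed(range(T)):
        t = np.full(shape[0], step)
        eps = model(x, t)
        coef = betas[step] / np.sqrt(1.0 - alpha_bars[step])
        mean = (x - coef * eps) / np.sqrt(alphas[step])
        # No noise is added on the final step; sigma_t^2 = beta_t is assumed.
        z = rng.normal(size=shape) if step > 0 else 0.0
        x = mean + np.sqrt(betas[step]) * z
    return x

def dummy_model(x_t, t):
    # Placeholder for the DiT: always predicts zero noise.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 1, 28, 28))
loss = training_loss(dummy_model, x0, rng)
samples = sample(dummy_model, (2, 1, 28, 28), rng)
```

In a real training loop the loss would be backpropagated through the DiT with an optimizer step per batch; the sketch only shows how the inputs and targets are formed.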
Once you have correctly implemented it, run the training! For full points for DDPM, you must reach an FID in the range of 120 to 150 on MNIST after training. To do this, you should follow the default hyperparameters in the notebook; the bulk of your work is to get the implementation correct rather than tuning.
Training a diffusion model is more computationally intensive than what you are used to (around 2-3 hours on a free Colab T4 GPU). To check whether your implementation is on the right track without waiting for training to converge, here are reference samples from a correct implementation at different epochs:
For reference, here is the expected training loss curve and FID curve:
Extra Credit
There are multiple things you can do to achieve extra credit. Please note that the first three items are extra-large, worth 20 points each, while the remaining ones are standard extra credit (up to 5 points each).
Assignment Setup
Download the starter code here. To complete this assignment in a reasonable amount of time, you'll need to use a GPU. This can either be your personal GPU or Google Colab free / Colab Pro with GPU enabled.
Environment Setup
We suggest that you use Anaconda to manage Python package dependencies (https://www.anaconda.com/download). This guide provides useful information on how to use Conda: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html. Ensure that IPython is installed (https://ipython.org/install). You may then navigate to the assignment directory in a terminal and start a local IPython server using the jupyter notebook command.
Unless you have a machine with a GPU, running the training portion of this assignment on your local machine will be very slow and is not recommended. Please use Google Colab or other GPU resources (refer to this Campuswire post). If you are using Colab, make sure to select a GPU runtime by going to "Runtime" > "Change runtime type" and selecting "GPU" as the hardware accelerator. Be mindful of Colab's usage limits if you are using Colab Pro so that you don't run out of GPU hours.
Data Setup
Instructions for downloading MNIST are provided in the notebook; the
If you are running this assignment on Colab: Note that loading data from a Google Drive-mounted folder is very slow. To work around this, the notebook is configured by default to download the data to a path local to the Colab runtime. Please note that this location is non-persistent, meaning the data will be lost if you restart the session; you will need to re-download it.
If you are not running this assignment on Colab: Please update the data path variable in the notebook to point to the directory where you want the data to be downloaded.
References
Submission Instructions:
The assignment deliverables are as follows. If you are working in a pair, only one designated student should make the submission.
Please refer to course policies on collaborations, late submission, etc.