Spring 2026 CS 444

Assignment 4: Object Detection

Due date: Monday, April 20, 11:59:59 PM

In this assignment, you will complete the code for two object detectors and train them on the PASCAL VOC 2007 dataset: the convolution-based single-stage YOLO detector and the transformer-based DETR detector.


Download the starter code here, and see setup instructions below under Assignment Setup.

The top-level notebooks (mp4a_yolo.ipynb and mp4b_detr.ipynb) will guide you through all the steps. The two notebooks are similar in structure, with slightly different data and model implementations. You will mainly focus on implementing (a) the YOLO loss function in yolo_loss.py, and (b) the main DETR model architecture in detr.py, using PyTorch's native Transformer module, together with the DETR set matching loss in detr_loss.py.

  • YOLO: The model architecture is inspired by DetNet; however, you are not required to understand it. In principle, it can be replaced by a different network architecture and trained from scratch, but to achieve good accuracy with minimal computational expense and tuning, you should stick to the provided one.

    We will use a pretrained ResNet18 backbone for YOLO.

  • DETR: The model architecture is inspired by the original DETR paper, but with some simplifications to introduce you to PyTorch's built-in nn.Transformer module. You should have a basic understanding of DETR's building blocks so that you can follow the instructions in the detr.py file and implement a simplified DETR model. Then, you will also implement a simple version of the set matching loss to train your DETR model.

    We will use a pretrained DINOv2 (a popular self-supervised Vision Transformer by Facebook/Meta) backbone for DETR.
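To make the idea of set matching concrete: each ground-truth box is paired with one distinct prediction so that the total matching cost is as small as possible. The sketch below is a minimal pure-Python illustration that brute-forces every assignment, which is only practical for toy inputs; the original DETR solves the same problem with the Hungarian algorithm (for example via scipy.optimize.linear_sum_assignment). The function names and the L1-only box cost are assumptions made for illustration, and the cost you are asked to implement in detr_loss.py may include additional terms such as class scores.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_brute_force(pred_boxes, target_boxes):
    """Find the minimum-cost one-to-one assignment of targets to predictions.

    Returns (pairs, total_cost), where pairs is a list of
    (pred_index, target_index). Assumes len(pred_boxes) >= len(target_boxes).
    """
    best_pairs, best_cost = [], float("inf")
    # Try every way of giving each target its own distinct prediction.
    for pred_idx in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[p], target_boxes[t])
            for t, p in enumerate(pred_idx)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_idx)]
    return best_pairs, best_cost


# Toy example: three predictions (object queries), two ground-truth boxes.
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
targets = [(0.8, 0.8, 0.3, 0.3), (0.12, 0.1, 0.1, 0.1)]
pairs, cost = match_brute_force(preds, targets)
print(pairs, cost)
```

In this toy case, prediction 2 is matched to target 0 and prediction 1 to target 1; prediction 0 stays unmatched and would be treated as "no object".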

For full points, you should achieve a minimum mAP of 0.53 after 15 epochs of training for YOLO, and a minimum mAP of 0.58 after 15 epochs of training for DETR. To do this, you should experiment with adjusting the learning rates. Specifically, each model uses separate learning rates for the vision backbone and the detection head, with the head's rate set higher because the head is trained from scratch and requires more aggressive updates. For both models, you should tune the vision backbone's learning rate. We recommend using default values for the remaining hyperparameters.
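One common way to give the backbone and the head separate learning rates is to split the parameters into groups, each with its own "lr" entry. The sketch below shows the idea in plain Python so it runs without a model; the helper name, the "backbone." name prefix, and the learning-rate values are assumptions for illustration only, and the notebooks may already set this up differently.

```python
def build_param_groups(named_params, backbone_lr, head_lr,
                       backbone_prefix="backbone."):
    """Split (name, parameter) pairs into backbone and head groups.

    The prefix used to recognize backbone parameters is a hypothetical
    naming convention; check how the provided models name their modules.
    """
    backbone, head = [], []
    for name, param in named_params:
        if name.startswith(backbone_prefix):
            backbone.append(param)
        else:
            head.append(param)
    return [
        {"params": backbone, "lr": backbone_lr},
        {"params": head, "lr": head_lr},
    ]


# Stand-in for model.named_parameters(); strings replace real tensors here.
fake_named_params = [
    ("backbone.conv1.weight", "w0"),
    ("backbone.layer1.weight", "w1"),
    ("head.fc.weight", "w2"),
]
# Placeholder values, not recommended settings.
groups = build_param_groups(fake_named_params, backbone_lr=1e-5, head_lr=1e-4)
print([(len(g["params"]), g["lr"]) for g in groups])
```

With a real model, the same list of dictionaries can be built from model.named_parameters() and passed to a PyTorch optimizer such as torch.optim.AdamW(groups), which accepts per-group options in this form.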

As you start this assignment, you will realize that training is more computationally intensive than what you are used to (up to 2 hours per model on the free T4 GPU in regular Colab). To get an idea of whether your implementation works without waiting a long time for training to converge, here are reference mAP values for the first 10 epochs with default hyperparameters:

Epoch   YOLO mAP   DETR mAP
1       0.2619     0.1593
2       0.3080     0.1972
3       0.3599     0.1864
4       0.4263     0.2270
5       0.4553     0.2019
6       0.4588     0.2879
7       0.4706     0.2833
8       0.4930     0.3196
9       0.4932     0.4165
10      0.5058     0.4291

Also, since evaluation on the test set can take a long time (up to 5 minutes), you can modify eval_every in the notebooks to evaluate less frequently. A good strategy is to evaluate after every epoch while debugging, stopping early once you have checked your implementation, and then to evaluate less often during hyperparameter tuning to save time.
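As a rough illustration of what an evaluation interval does, the sketch below lists the epochs at which evaluation would run for a given interval. The helper name and the choice to always evaluate on the final epoch are assumptions; the notebooks' actual handling of eval_every may differ.

```python
def should_evaluate(epoch, eval_every, num_epochs):
    """Return True if this 1-indexed epoch should trigger an evaluation.

    Evaluates every `eval_every` epochs, and always on the last epoch so
    that a final mAP is reported (an assumption of this sketch).
    """
    return epoch % eval_every == 0 or epoch == num_epochs


# With 15 epochs and an interval of 4, evaluation runs four times.
eval_epochs = [e for e in range(1, 16)
               if should_evaluate(e, eval_every=4, num_epochs=15)]
print(eval_epochs)
```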

Useful Resources

For YOLO, the instructions in yolo_loss.py should be sufficient to guide you through the assignment, but it is very helpful to understand the big picture of how YOLO works and how the loss function is defined.

The following resources are useful for understanding YOLO in detail:

Similarly, for DETR, the instructions in detr.py and detr_loss.py will help you complete the assignment. However, you should also understand the building blocks of the DETR approach. You can reference the following resources:

Extra Credit

  • Replace the provided pretrained network with a different one and train with the YOLO loss or DETR Hungarian matching/set losses, if appropriate, and aim to achieve better accuracy.
  • Experiment with any alternative detection architectures and compare performance to your YOLO or DETR detector.
  • Experiment with alternative losses: Smooth L1 loss, Focal loss, advanced IoU losses (e.g., Distance IoU, Complete IoU), etc. Compare performance to the YOLO and DETR losses.
  • Incorporate anchors in your YOLO or DETR implementations.
  • Visualize convolutional feature maps in YOLO or attention maps in DETR to better understand the models' detection behavior. What do these models "see"?
  • Experiment with other ResNet variants for YOLO and/or DETR for more direct comparisons between different model settings (e.g., YOLO+ResNet18 vs. YOLO+ResNet34 vs. DETR+ResNet34 vs. DETR+DINOv2 -- ResNet34 and DINOv2-small have similar parameter counts).

Assignment Setup

Download the starter code here.

To complete this assignment in a reasonable amount of time, you'll need to use a GPU. This can either be your personal GPU or Google Colab free/Colab Pro with GPU enabled.

Environment Setup

We suggest that you use Anaconda to manage Python package dependencies (https://www.anaconda.com/download). This guide provides useful information on how to use Conda: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html. Ensure that IPython is installed (https://ipython.org/install). You may then navigate to the assignment directory in a terminal and start a local notebook server using the jupyter notebook command.

Unless you have a machine with a GPU, running this assignment on your local machine will be very slow and is not recommended. Please use Google Colab or other GPU resources (refer to this Campuswire post). If you are using Colab, make sure to select a GPU runtime by going to "Runtime" > "Change runtime type" and selecting "GPU" as the hardware accelerator. Be mindful of Colab's usage limits if you are using Colab Pro so that you don't run out of GPU hours.
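After selecting a GPU runtime, a quick check like the one below can confirm that PyTorch actually sees the GPU before you start a long training run. This is a small sketch with a hypothetical helper name; it falls back gracefully if PyTorch is not installed in the current environment.

```python
import importlib.util


def describe_device():
    """Return 'cuda', 'cpu', or 'torch-missing' for the current environment."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch  # imported lazily so the check works without PyTorch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = describe_device()
print(f"Running on: {device}")
```

If this prints "cpu" on Colab, the runtime type is most likely still set to CPU.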

Data Setup

Instructions for downloading the data are provided in the notebooks.

If you are running this assignment on Colab: Note that loading data from a Google Drive-mounted folder is very slow. To work around this, the notebook is configured by default to download the data to /content/VOC_DATA, which is local to the Colab runtime. Please note that this location is non-persistent, meaning the data will be lost if you restart the session; you will need to re-download it.

If you are not running this assignment on Colab: Please update the VOC_PATH variable in the notebooks to point to the directory where you want the data to be downloaded.
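If you switch between Colab and a local machine, the data location can be chosen automatically. The sketch below assumes that Colab can be detected by checking whether the google.colab module is loaded (a common idiom, not something the notebooks are known to do), and the local directory shown is a hypothetical placeholder. The notebooks may expect VOC_PATH as a plain string, in which case wrap the result in str().

```python
import sys
from pathlib import Path


def choose_voc_path(local_dir, on_colab=None):
    """Pick the dataset directory for Colab or a local machine.

    If on_colab is None, guess by checking for the google.colab module.
    """
    if on_colab is None:
        on_colab = "google.colab" in sys.modules
    if on_colab:
        # Local to the Colab runtime: fast, but lost when the session restarts.
        return Path("/content/VOC_DATA")
    return Path(local_dir).expanduser()


# "~/data/VOC_DATA" is a placeholder; use any directory with enough space.
VOC_PATH = choose_voc_path("~/data/VOC_DATA")
print(VOC_PATH)
```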

Submission Instructions

The assignment deliverables are as follows. If you are working in a pair, only one designated student should make the submission.

  1. Upload the following files to Canvas:
    1. netid_mp4a_yolo_output.pdf: Your YOLO IPython notebook with output cells converted to PDF format
    2. netid_mp4b_detr_output.pdf: Your DETR IPython notebook with output cells converted to PDF format
    3. netid_mp4_code.zip: Your code in a single zip file, containing at least the following files:
      1. mp4a_yolo.ipynb: Your top-level YOLO notebook
      2. mp4b_detr.ipynb: Your top-level DETR notebook
      3. yolo_loss.py: Your main YOLO loss implementation
      4. detr.py: Your main DETR model implementation
      5. detr_loss.py: Your main DETR loss implementation
      Do not upload the annotations directory or your data.
    4. netid_mp4_report.pdf: Your assignment report (using this template) in PDF format
    5. (Optional) Any additional files related to extra credit

  2. Upload your YOLO output file to the Kaggle competition for the YOLO detector.
  3. Upload your DETR output file to the Kaggle competition for the DETR detector.

    Don't forget to hit "Submit" after uploading your files; otherwise, we will not receive your submission!
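To reduce the chance of a missing file in the code zip described above, the archive can be assembled with a short script. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the five required files all sit directly in one directory, which may not match the starter code layout. The demo at the bottom uses placeholder files in a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path

REQUIRED_FILES = [
    "mp4a_yolo.ipynb",
    "mp4b_detr.ipynb",
    "yolo_loss.py",
    "detr.py",
    "detr_loss.py",
]


def build_code_zip(src_dir, netid, out_dir="."):
    """Create <netid>_mp4_code.zip containing only the required files."""
    src_dir = Path(src_dir)
    missing = [f for f in REQUIRED_FILES if not (src_dir / f).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing required files: {missing}")
    zip_path = Path(out_dir) / f"{netid}_mp4_code.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in REQUIRED_FILES:
            # Only the listed files are added, so data and annotations stay out.
            zf.write(src_dir / name, arcname=name)
    return zip_path


# Demo with placeholder files; replace tmp with your assignment directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES:
        (Path(tmp) / name).write_text("placeholder")
    zip_path = build_code_zip(tmp, "netid", out_dir=tmp)
    with zipfile.ZipFile(zip_path) as zf:
        names_in_zip = sorted(zf.namelist())
print(names_in_zip)
```

Any extra-credit files would need to be added to the list or uploaded separately.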

Please refer to course policies on collaborations, late submission, etc.