# PyTorch Tutorial

This tutorial is mostly based on:

* https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
* https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

In [None]:
import numpy as np
import torch

print(torch.__version__)

## PyTorch Tensors

PyTorch tensors are just like NumPy arrays, and they include many of the same operations you are used to from NumPy.

Construct a tensor of size $5 \times 3$ with random values:



In [None]:
x = torch.rand(5, 3)
print(x)

Construct a matrix filled with zeros and of dtype long:



In [None]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

Make a tensor from a list of values:

In [None]:
x = torch.tensor([1, 2, 3], dtype=float)
print(x)

Create a tensor based on another tensor (inherit size and dtype, unless otherwise specified):

In [None]:
x = x.new_ones(5, 3)  # new_* methods take in sizes
print(x)

x = torch.randn_like(x, dtype=torch.float)  # override dtype!
print(x)                                    # result has the same size

Get the size object of a tensor, an object which supports tuple operations:

In [None]:
print(x.size())
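
As a small illustration of those tuple operations (a sketch; the variable names are chosen here for illustration), the size object can be unpacked, type-checked, and compared like an ordinary tuple:

```python
import torch

x = torch.rand(5, 3)
size = x.size()

rows, cols = size                   # unpack like a tuple
is_tuple = isinstance(size, tuple)  # torch.Size is a subclass of tuple
same_as_shape = size == x.shape     # .shape returns the same torch.Size
print(rows, cols, is_tuple, same_as_shape)
```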

Operations on tensors use similar syntax to NumPy:

In [None]:
y = torch.rand(5, 3)
print(x + y)
print(torch.add(x, y))

PyTorch also supports in-place operations (method names end in '_'):

In [None]:
y.add_(x)
print(y)

Indexing works as you would expect:

In [None]:
x = torch.rand(5, 3)
print(x)
print(x[1:4, :])

Tensor data types and casting:

In [None]:
a = torch.ones(3, 3, dtype=torch.double)
print(a)
b = a.long()
print(b.type())
c = a.int()
print(c)

In recent versions of PyTorch, `/` on integer tensors performs true division, just as in Python 3 and NumPy:

In [None]:
print(3 / 2)
print(np.array(3) / np.array(2))
print(torch.tensor(3) / torch.tensor(2))
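
If floor division is what you want instead, one option is to request it explicitly. This is a sketch that assumes a PyTorch version supporting the `rounding_mode` argument of `torch.div`:

```python
import torch

a = torch.tensor(3)
b = torch.tensor(2)

true_div = a / b                                    # true division, float result
floor_div = torch.div(a, b, rounding_mode="floor")  # floor division, stays integer
print(true_div, true_div.dtype)
print(floor_div, floor_div.dtype)
```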

Documentation on various dtypes: https://pytorch.org/docs/stable/tensors.html

### More useful PyTorch Tensor operations

To see the complete API check here: https://pytorch.org/docs/stable/tensors.html

`.view()` can be used to resize/reshape tensors:


In [None]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)  # the size -1 is inferred from other dimensions
print(x.size(), y.size(), z.size())
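
One detail worth knowing is that `.view()` does not copy data: the result shares memory with the original tensor, and it only works when the memory layout allows it. The sketch below (variable names are illustrative) shows both points, using `.reshape()` as a fallback that copies when needed:

```python
import torch

x = torch.arange(6)
v = x.view(2, 3)
v[0, 0] = 100                      # writes through to x, because v is a view
shares_memory = x[0].item() == 100

t = v.t()                          # transposing makes the tensor non-contiguous
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True             # .view() cannot flatten this layout

r = t.reshape(6)                   # .reshape() copies when a view is impossible
print(shares_memory, view_failed, r)
```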

If you have a one element tensor, use `.item()` to get the value as a Python number:


In [None]:
x = x.sum()
print(x)
print(x.item())

### Converting between NumPy arrays and PyTorch Tensors

Important: A CPU PyTorch Tensor and the NumPy array it is converted to (or from) share the same underlying memory. If you change the values of one, the values of the other change too.

Convert PyTorch Tensor to NumPy array:

In [None]:
a = torch.ones(5)
print(a)
b = a.numpy()
print(b)

a.add_(1)
print(a)
print(b)

Convert NumPy array to PyTorch Tensor:

In [None]:
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)
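
When the sharing is not wanted, one option is to copy the data instead. The following sketch contrasts `torch.from_numpy`, which shares memory, with `torch.tensor`, which copies (variable names are illustrative):

```python
import numpy as np
import torch

a = np.ones(3)
shared = torch.from_numpy(a)  # shares memory with `a`
copied = torch.tensor(a)      # makes an independent copy

a[0] = 5.0                    # modify the NumPy array in place
print(shared)                 # reflects the change
print(copied)                 # unchanged
```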

## CUDA Tensors (On GPU)

PyTorch tensors have the added benefit that they can easily be placed on a GPU to speed up computations.

Query information about the GPU (if CUDA is available):

In [None]:
if torch.cuda.is_available():
    !nvidia-smi

You can use `torch.device` objects to move tensors to and from the GPU:

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings `.to("cuda")`
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))       # `.to` can also change dtype


## Autograd: Automatic Differentiation

From: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

Now that you have learned how to use PyTorch Tensors you will learn how we can use PyTorch for automatic differentiation.

The `autograd` package in PyTorch provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

To allow PyTorch to keep track of operations for automatic differentiation, we need to set `requires_grad` as `True` for a Tensor. Autograd will then start to track all operations on the Tensor. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into the `.grad` attribute.

To stop a tensor from tracking history, you can call `.detach()` to detach
it from the computation history, and to prevent future computation from being
tracked.

To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`. This can be particularly helpful when evaluating a model because the model may have trainable parameters with `requires_grad=True`, but for which we don't need the gradients.

There’s one more class which is very important for autograd implementation - a `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic
graph that encodes a complete history of computation. Each tensor has
a `.grad_fn` attribute that references a `Function` that has created
the `Tensor` (except for Tensors created by the user - their `grad_fn` is `None`).

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to specify any arguments to `backward()`. If it has more elements, however, you need to specify a `gradient` argument that is a tensor of matching shape.

Create a tensor and set `requires_grad=True` to track computation with it:

In [None]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

Perform a tensor operation:

In [None]:
y = x + 2
print(y)

`y` was created as a result of an operation, so it has a `grad_fn`:

In [None]:
print(y.grad_fn)

Do more operations on `y`:

In [None]:
z = y * y * 3
print(z)
out = z.mean()
print(out)

`.requires_grad_()` changes an existing Tensor's `requires_grad` flag in-place. A tensor's `requires_grad` flag is `False` unless set otherwise at creation, and calling `.requires_grad_()` with no argument sets it to `True`:




In [None]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

### Gradients

Let's backprop now. Because `out` contains a single scalar, `out.backward()` is equivalent to `out.backward(torch.tensor(1.))`:



In [None]:
out.backward()

Print gradients $\frac{d(\texttt{out})}{d\texttt{x}}$:

In [None]:
print(x.grad)

You should get a matrix of `4.5`. Let's call the `out` *Tensor* "$o$". We find that $o = \frac{1}{4}\sum_i z_i$, $z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$. Therefore, $\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$, hence $\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
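
As a quick sanity check of this derivation, the sketch below rebuilds the same computation in one self-contained snippet and compares the autograd result against the analytic expression $\frac{3}{2}(x_i+2)$:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()    # o = (1/4) * sum_i 3 * (x_i + 2)^2
out.backward()

analytic = 1.5 * (x.detach() + 2)  # d(o)/d(x_i) = (3/2) * (x_i + 2)
print(x.grad)
print(torch.allclose(x.grad, analytic))
```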

You can do many crazy things with autograd!



In [None]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

In [None]:
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)

print(x.grad)
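
What `y.backward(gradients)` computes for a non-scalar `y` is a vector-Jacobian product: the `gradient` argument weights each output before the derivatives are summed. A minimal sketch with an elementwise function, where the Jacobian is diagonal and the result is easy to check by hand (the names `v` and `expected` are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                            # elementwise, so the Jacobian is diag(2 * x)

v = torch.tensor([0.1, 1.0, 0.0001])  # weights for each output element
y.backward(v)                         # accumulates v^T J into x.grad

expected = v * 2 * x.detach()         # v_i * d(y_i)/d(x_i)
print(x.grad)
print(expected)
```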

You can also stop autograd from tracking history on Tensors with `.requires_grad=True` by wrapping the code block in `with torch.no_grad()`:

In [None]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)
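
The `.detach()` method mentioned earlier works on a single tensor rather than a code block. The sketch below (variable names are illustrative) shows that the detached tensor has no history but still shares memory with the original:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2        # tracked: y has a grad_fn
d = y.detach()   # same data, but cut off from the graph

print(y.requires_grad, y.grad_fn is not None)
print(d.requires_grad, d.grad_fn)
print(d.data_ptr() == y.data_ptr())  # no copy was made
```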

**Read Later:**

Documentation of `torch.autograd` and `Function` is at
https://pytorch.org/docs/stable/autograd.html

## Neural Networks

The `torch.nn` package in PyTorch provides higher level building blocks for neural networks like fully connected or convolutional layers. The `nn` package makes use of the `autograd` functionality to define these model building blocks and differentiate them. This allows us to quickly and easily implement neural networks by putting together layers and using PyTorch to help us update learnable parameters with the gradient.

An `nn.Module` contains layers, and a method `forward(input)` that
returns the `output`.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  `weight = weight - learning_rate * gradient`
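
The procedure above can be sketched on a toy problem before moving to `nn.Module`. The example below is only an illustration: the data, the names `w` and `b`, and the learning rate are made up here, and the "network" is a single line $y = wx + b$ fitted with the simple update rule:

```python
import torch

torch.manual_seed(0)
X = torch.randn(64, 1)
y = 3 * X + 1                         # synthetic targets: true w = 3, b = 1

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.1

for step in range(200):
    pred = X * w + b                  # process input through the "network"
    loss = ((pred - y) ** 2).mean()   # compute the loss
    loss.backward()                   # propagate gradients to w and b
    with torch.no_grad():             # update the weights without tracking
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    w.grad.zero_()                    # clear gradients for the next step
    b.grad.zero_()

print(w.item(), b.item())
```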

### Define the network

In the cell below we define a simple convolutional neural network. Notice that we use the `nn.Conv2d` and `nn.Linear` Modules as building blocks for the network.

There are plenty of other types of layers and tools available in the [torch.nn](https://pytorch.org/docs/stable/nn.html) package such as pooling layers, dropout, and batchnorm.

Conveniently, PyTorch is completely open source so you can check out exactly how each of these Modules is implemented:

* https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/linear.py
* https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/conv.py

**Important:** Whenever you extend the `nn.Module` class (e.g. with the `Net` class below), you need to call the superclass constructor, or an error will be thrown. In the example below, this line is `super().__init__()`.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super().__init__()
        # 1 input image channel, 6 output channels, 5x5 convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # Affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
print(net)

You just have to define the `forward` function, and the `backward`
function (where gradients are computed) is automatically defined for you
using `autograd`.
You can use any of the Tensor operations in the `forward` function.

The learnable parameters of a model are returned by `net.parameters()`:



In [None]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

Continuing, let's try a random 32x32 input:

In [None]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
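
The 32x32 input size is what makes `16 * 5 * 5` the right input width for `fc1`. The sketch below traces the shapes through stand-alone copies of the two convolution layers (freshly initialized here, not the ones inside `net`) to show where the 5x5 comes from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.randn(1, 1, 32, 32)
shapes = [tuple(x.shape)]

x = conv1(x)                # 5x5 kernel, no padding: 32 -> 28
shapes.append(tuple(x.shape))
x = F.max_pool2d(x, 2)      # 2x2 pooling: 28 -> 14
shapes.append(tuple(x.shape))
x = conv2(x)                # 5x5 kernel, no padding: 14 -> 10
shapes.append(tuple(x.shape))
x = F.max_pool2d(x, 2)      # 2x2 pooling: 10 -> 5
shapes.append(tuple(x.shape))
x = x.flatten(start_dim=1)  # 16 * 5 * 5 = 400 features per sample
shapes.append(tuple(x.shape))

for s in shapes:
    print(s)
```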

Zero the gradient buffers of all parameters and backprop with random
gradients:



In [None]:
net.zero_grad()
out.backward(torch.randn(1, 10))

**Note:**

`torch.nn` only supports mini-batches. The entire `torch.nn` package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, `nn.Conv2d` will take in a 4D Tensor of `nSamples x nChannels x Height x Width`.

If you have a single sample, just use `input.unsqueeze(0)` to add a fake batch dimension.
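
A minimal sketch of that step, assuming a single 1-channel 32x32 sample (the names `sample` and `batch` are illustrative):

```python
import torch

sample = torch.randn(1, 32, 32)  # nChannels x Height x Width, no batch dimension
batch = sample.unsqueeze(0)      # add a batch dimension of size 1 at position 0
restored = batch.squeeze(0)      # remove it again

print(sample.shape, batch.shape, restored.shape)
```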

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
- `torch.Tensor` - A *multi-dimensional array* with support for autograd operations like `backward()`. Also *holds the gradient* w.r.t. the tensor.
- `nn.Module` - Neural network module. *Convenient way of encapsulating parameters*, with helpers for moving them to GPU, exporting, loading, etc.
- `nn.Parameter` - A kind of Tensor, that is *automatically registered as a parameter when assigned as an attribute to a* `Module`.
- `autograd.Function` - Implements *forward and backward definitions of an autograd operation*. Every `Tensor` operation creates at least one `Function` node that connects to the functions that created a `Tensor` and *encodes its history*.

**At this point, we covered:**
- Defining a neural network
- Processing inputs and calling backward

**Still Left:**
- Computing the loss
- Updating the weights of the network

### Loss Function

A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

There are several different [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions) under the `nn` package. A simple loss is `nn.MSELoss`, which computes the mean-squared error between the input and the target.

For example:

In [None]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)
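
To make the definition concrete, the sketch below compares `nn.MSELoss` (with its default mean reduction) against the mean of squared differences computed by hand, using random stand-in tensors rather than the network output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)  # stand-in for a network output
target = torch.randn(1, 10)

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()  # mean of squared differences

print(loss.item(), manual.item())
```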

Now, if you follow `loss` in the backward direction, using its `.grad_fn` attribute, you will see a graph of computations that looks like this:

    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
          -> flatten -> linear -> relu -> linear -> relu -> linear
          -> MSELoss
          -> loss

So, when we call `loss.backward()`, the whole graph is differentiated w.r.t. the loss, and all Tensors in the graph that have `requires_grad=True` will have their `.grad` Tensor accumulated with the gradient.

For illustration, let us follow a few steps backward:

In [None]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU

### Backprop

To backpropagate the error all we have to do is call `loss.backward()`. You need to clear the existing gradients though, otherwise the gradients will be accumulated to existing gradients.

Now we'll call `loss.backward()`, and have a look at conv1's bias
gradients before and after the backward step.

In [None]:
net.zero_grad()  # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

Now, we have seen how to use loss functions.

**The only thing left to learn is:**

- Updating the weights of the network

### Update the weights

The simplest update rule used in practice is stochastic gradient
descent (SGD):

     weight = weight - learning_rate * gradient

We can implement this using simple Python code:

```python
learning_rate = 0.01
with torch.no_grad():  # the update itself must not be tracked by autograd
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)
```

However, as you use neural networks, you'll want to use various different
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
To enable this, PyTorch has a small package: `torch.optim` that
implements all these methods. Using it is very simple:

In [None]:
import torch.optim as optim

# Create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# In your training loop:
optimizer.zero_grad()             # zero the gradient buffers
output = net(input)               # compute the forward pass
loss = criterion(output, target)  # compute the loss
loss.backward()                   # compute the gradients
optimizer.step()                  # update the parameters

print(loss)

**Important:** Note how gradient buffers had to be manually set to zero using `optimizer.zero_grad()`. This is because gradients are accumulated, so if you don't zero gradients before each `backward()` call, you will begin accumulating gradients from previous forward/backward passes.
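
The accumulation is easy to observe directly. In this sketch (a single made-up parameter `w`, not the network above), calling `backward()` twice without zeroing doubles the stored gradient:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # gradient of sum(w^2) is 2 * w

(w ** 2).sum().backward()     # no zeroing in between
accumulated = w.grad.clone()  # the second gradient was added to the first

w.grad.zero_()                # what zero_grad() does for every parameter
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```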

#### Note on eval and train modes

**Important**: If you use layers in your network like `torch.nn.Dropout` or `torch.nn.BatchNorm2d`, which behave differently during training and evaluation, you need to make sure the modules in your network are in the right mode. PyTorch makes this easy with the `eval` and `train` methods available on any network extending `nn.Module`. Before training, call `net.train()` to set all modules in the network to train mode; likewise, before evaluating, call `net.eval()`.
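
As an illustration of the difference, the sketch below applies a stand-alone `nn.Dropout` layer to a vector of ones in both modes. In train mode, elements are either zeroed or scaled by `1 / (1 - p)`; in eval mode the layer passes the input through unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # each element is either 0 or 1 / (1 - 0.5) = 2

drop.eval()
eval_out = drop(x)   # identity in eval mode

print(train_out.unique())
print(torch.equal(eval_out, x))
```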

## Training a Classifier

Now that you have seen the basics of how to define neural networks, compute losses, and make training updates, you will see how a simple classifier is trained in PyTorch on CIFAR-10.

### What about data?

Generally, when you have to deal with image, text, audio, or video data,
you can use standard Python packages that load data into a NumPy array.
Then you can convert this array into a `torch.*Tensor`.

-  For images, packages such as Pillow and OpenCV are useful
-  For audio, packages such as SciPy and librosa are useful
-  For text, either raw Python or Cython based loading, or NLTK and
   spaCy are useful

Specifically for vision, there is a package called
`torchvision`, which has data loaders for common datasets such as
ImageNet, CIFAR10, MNIST, etc., models for common architectures, and data transformers for images.

This provides a huge convenience and avoids writing boilerplate code.

For this tutorial, we will use the CIFAR10 dataset.
It has the classes: 'airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck'. The images in CIFAR-10 are of
size $3 \times 32 \times 32$, i.e. 3-channel color images of $32 \times 32$ pixels in size.

### Training an image classifier

We will do the following steps in order:

1. Load and normalize the CIFAR10 training and test datasets using `torchvision`
2. Define a Convolutional Neural Network
3. Define a loss function
4. Train the network on the training data
5. Test the network on the test data

### 1) Loading and normalizing CIFAR10

Using `torchvision`, it’s extremely easy to load CIFAR10.



In [None]:
import torch
import torchvision
import torchvision.transforms as transforms

The torchvision datasets return `PIL` images. `transforms.ToTensor` converts them to Tensors with values in the range [0, 1], and `transforms.Normalize` then maps them to the normalized range [-1, 1].

The [transforms package](https://pytorch.org/vision/stable/transforms.html) has other functions that you might use for **data augmentation**. For example, `torchvision.transforms.RandomResizedCrop` and `torchvision.transforms.RandomHorizontalFlip`.
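
The arithmetic behind the normalization is simple: `Normalize` computes `(x - mean) / std` per channel, so a mean and standard deviation of 0.5 map [0, 1] onto [-1, 1]. The sketch below reproduces this with plain tensor operations on a few made-up pixel values, along with the inverse used when displaying images:

```python
import torch

pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])  # values as produced by ToTensor
mean, std = 0.5, 0.5

normalized = (pixels - mean) / std            # maps [0, 1] onto [-1, 1]
restored = normalized * std + mean            # inverse, same as img / 2 + 0.5

print(normalized)
print(restored)
```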

In [None]:
# Transforms
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Datasets
trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=train_transform)
testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=test_transform)

# Data loaders
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=4, shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 
           'dog', 'frog', 'horse', 'ship', 'truck')

Let us show some of the training images, for fun.



In [None]:
import matplotlib.pyplot as plt
import numpy as np


def imshow(img):
    """Function to display an image."""
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')


# Get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# Show images
imshow(torchvision.utils.make_grid(images))
# Print labels
print('      '.join('%5s' % classes[labels[j]] for j in range(4)))

#### Training on GPU

Just like how you transfer a Tensor on to the GPU, you transfer the neural
net onto the GPU.

Let's first define our device as the first visible cuda device if we have
CUDA available:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(device)

If `device` is in fact a CUDA device, then `net.to(device)` will recursively go over all modules and convert their parameters and buffers to CUDA tensors:

```python
net = net.to(device)
```

Remember that you will have to send the inputs and targets at every step
to the GPU too:

```python
inputs, labels = inputs.to(device), labels.to(device)
```

Why don't I notice MASSIVE speedup compared to CPU? Because your network
is *realllly* small.

**Exercise:** Try increasing the width of your network (argument 2 of
the first `nn.Conv2d`, and argument 1 of the second `nn.Conv2d` –
they need to be the same number), see what kind of speedup you get.

### 2) Define a Convolutional Neural Network

Copy the neural network from the earlier Neural Networks section and modify it to
take 3-channel images (instead of 1-channel images as it was defined).



In [None]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.out_channels1 = 6
        self.out_channels2 = 16
        self.conv1 = nn.Conv2d(3, self.out_channels1, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(self.out_channels1, self.out_channels2, 5)
        self.fc1 = nn.Linear(self.out_channels2 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net().to(device)

### 3) Define a Loss function and optimizer

Let's use a Classification Cross-Entropy loss and SGD with momentum.



In [None]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
#optimizer = optim.Adam(net.parameters(), lr=0.001)
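
For intuition about what `nn.CrossEntropyLoss` computes, the sketch below checks it against a hand-written version: a log-softmax over the class scores followed by the mean negative log-probability of the correct class. The logits and labels are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # raw scores for 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # class indices, not one-hot vectors

loss = nn.CrossEntropyLoss()(logits, labels)

log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), labels].mean()

print(loss.item(), manual.item())
```

Note that the loss expects raw scores, so the network does not need a softmax layer of its own.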

### 4) Train the network


This is when things start to get interesting.
We simply have to loop over our data iterator, and feed the inputs to the
network and optimize.



In [None]:
import time


net = net.train()

# Loop over the dataset for multiple epochs
for epoch in range(1, 3):
    running_loss = 0.0
    t_s = time.time()

    # For each mini-batch...
    for i, data in enumerate(trainloader, 1):
        # Get the inputs
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 0:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch, i, running_loss / 2000))
            running_loss = 0.0
            print('iters time:', time.time() - t_s)
            t_s = time.time()

print('Finished Training')

Let’s quickly save our trained model:

In [None]:
PATH = './cifar_net.pth'
torch.save(net.state_dict(), PATH)

We can load back in a saved model with the following: (note: saving and re-loading the model wasn’t necessary here, we only did it to illustrate how to do so):

In [None]:
net = Net().to(device)
net.load_state_dict(torch.load(PATH, map_location=device))
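
The same round trip can be sketched without touching the file system. This example uses a small stand-in `nn.Linear` model and an in-memory buffer (both chosen here for illustration) to show that loading a `state_dict` restores identical parameters:

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(3, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)  # save only the parameters
buffer.seek(0)

restored = nn.Linear(3, 2)              # fresh model with different random weights
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(p, q)
    for p, q in zip(model.parameters(), restored.parameters())
)
print(same)
```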

### 5) Test the network on the test data

We have trained the network for 2 passes over the training dataset.
But we need to check if the network has learned anything at all.

We will check this by predicting the class label that the neural network
outputs, and checking it against the ground-truth. If the prediction is
correct, we add the sample to the list of correct predictions.

Okay, first step. Let us display an image from the test set to get familiar.

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('Ground truth:')
print('      '.join('%5s' % classes[labels[j]] for j in range(4)))

Okay, now let us see what the neural network thinks these examples above are:



In [None]:
net = net.eval()

outputs = net(images.to(device))

The outputs are energies for the 10 classes.
The higher the energy for a class, the more the network
thinks that the image belongs to that class.
So, let's get the index of the highest energy:



In [None]:
_, predicted = torch.max(outputs, 1)

imshow(torchvision.utils.make_grid(images))
print('Predicted:')
print('      '.join('%5s' % classes[predicted[j]] for j in range(4)))
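
`torch.max` with a dimension argument returns both the maximum values and their indices, which is why the first result is discarded above. A small sketch with made-up scores for two samples and three classes:

```python
import torch

outputs = torch.tensor([[0.1, 2.0, -1.0],
                        [3.0, 0.5, 0.2]])

values, predicted = torch.max(outputs, 1)  # max over the class dimension
also_predicted = torch.argmax(outputs, dim=1)

print(values)     # highest score per sample
print(predicted)  # index of the highest score per sample
```

If only the indices are needed, `torch.argmax` gives the same result directly.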

The results seem pretty good.

Let us look at how the network performs on the whole dataset.



In [None]:
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d%%' % (
    100 * correct / total))

That looks waaay better than chance, which is 10% accuracy (randomly picking
a class out of 10 classes).
Seems like the network learned something.

Hmmm, which classes performed well, and which did not?

In [None]:
class_correct = [0] * 10
class_total = [0] * 10
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = predicted == labels
        for i in range(labels.size(0)):  # works for any batch size
            label = labels[i].item()
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s: %2d%%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))

### Other information

How to write data loading code in PyTorch: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

More details on saving and loading models: https://pytorch.org/tutorials/beginner/saving_loading_models.html

## Transfer Learning

Nowadays, CIFAR-10 is considered to be a small dataset (only 50K training images). Using transfer learning, we can use a model that is pre-trained on a much larger dataset like ImageNet (1.3M training images) and fine-tune it on our smaller dataset to get better performance.

This portion of the tutorial will be more hands-on. It is intended to be done individually or in small groups either after class or in breakout rooms at the end of class. You will need to complete the following steps:

1. Copy the code above from steps 1-5
2. Replace the custom model we defined with a pre-trained model from `torchvision.models`
3. Replace the last layer of the model with a linear layer that outputs 10 classes (CIFAR-10 has 10 classes, but ImageNet had 1,000 classes)
4. Modify the transforms to ensure that the CIFAR-10 images are the dimensions expected by the models pre-trained on ImageNet ($224 \times 224$) using `torchvision.transforms`
5. Re-train the model on CIFAR-10

See https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html if you get stuck. Feel free to play around with other modifications, like SGD vs. Adam or fine-tuning vs. feature extraction. Good luck!