Task Answers (Image Data Processing)

Tutorial (the digit classification task)

Is it worth the effort to use a complex model to gain the 1% performance increase? Whether a complex model is worth the effort depends on your goal and the computing resources available to you. In terms of computing resources, the dataset in the tutorial is small, so adding extra complexity to the model does not cost much time or compute. So it is probably fine to spend an extra hour to get the 1% performance increase. However, if you are working on a very large dataset, you may need to use many GPUs to train the model for several days. In this case, a 1% increase may still be worth a lot of effort if you will deploy the model for many users, because it may help your company save (or earn) money in the long term. However, if you are doing a small-scale academic project, this 1% increase may not be worth the effort, and you might instead spend the extra time studying other types of models that people have not tried before to help the academic field gain more insights into the task. Moreover, from the performance point of view, the 1% increase in the tutorial could just happen by chance. There is randomness in the stochastic gradient descent optimization process, which means that the performance also has randomness (with a mean and standard deviation) if we repeat the same experiment many times. So observing a 1% performance increase in a single run does not provide sufficient evidence that the complex model performs better than the simple model.
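The point about randomness can be made concrete with a small sketch. The accuracy values below are made-up placeholders (not results from the tutorial), and the helper name `summarize_runs` is hypothetical. The idea is only to compare the gap between the mean accuracies with the spread across repeated runs.

```python
import statistics

def summarize_runs(accuracies):
    """Return the mean and sample standard deviation of repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical accuracies from 5 runs with different random seeds.
simple_runs = [0.970, 0.975, 0.968, 0.981, 0.972]
complex_runs = [0.982, 0.971, 0.985, 0.976, 0.979]

simple_mean, simple_std = summarize_runs(simple_runs)
complex_mean, complex_std = summarize_runs(complex_runs)
gap = complex_mean - simple_mean

print(f"simple:  {simple_mean:.4f} +/- {simple_std:.4f}")
print(f"complex: {complex_mean:.4f} +/- {complex_std:.4f}")

# A gap that is not clearly larger than the run-to-run spread is weak evidence.
gap_is_within_noise = gap < simple_std + complex_std
print("gap:", round(gap, 4), "within noise:", gap_is_within_noise)
```

This is a rough rule of thumb rather than a formal statistical test. With more runs, a proper significance test would give a more reliable answer.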
What differences can you spot between the explanations for the different models? The GradCAM for the simple neural net focuses on the digits themselves. The red/yellow heatmaps show that the digits influence the model’s decisions the most, and the activations are more concentrated. However, the GradCAM for the complex model focuses more on the background, and the activations are more spread, which suggests that the model may rely on spurious correlations in the background of the image to make decisions.
Which one of these models will you use? The simple network and the complex one do not differ much in model performance. From the explainability point of view, the GradCAM of the simple network shows that the model is indeed looking at the digits themselves, which matches our expectations better than the complex model (which focuses on the background). So, in this case, the simple neural net looks like a better choice if we do not care about the 1% performance gain too much. However, model selection should not rely solely on one single metric, such as explainability or a performance measurement. To test model robustness, it is better to conduct more experiments. For example, you can erase or blur a part of the image to check if the model is still performing well. You can also do error analysis by feeding edge cases into the model, such as letters instead of digits, and checking the model behavior (both performance and explainability).

Tutorial Advanced (the CIFAR-10 dataset)

Why do we shuffle the training data, but not the validation and test data? The main reason for shuffling the training data is to increase the diversity of the mini-batches in the optimization process when using stochastic gradient descent. Having good mini-batch diversity helps prevent the optimization process from getting stuck in a poor local minimum. When doing gradient descent for a deep learning model, it is usually hard to find the global minimum, and a lot of the time, the model is trying to find a good local minimum. However, it is possible that the model gets stuck in a local minimum and never escapes to find other options if the mini-batches lack diversity. Without shuffling, the model sees exactly the same mini-batches in every epoch (because we train the model for multiple epochs), which could result in redundant or cyclic gradient descent updates. In other words, a diverse set of mini-batches can help the gradient descent process search a wider range of the objective function that we want to optimize, which can lead to a better set of model weights that are robust enough to tackle the task. The validation and test data are only used for evaluation: no gradient updates are computed on them, so their order does not affect the metrics aggregated over the whole set, and keeping a fixed order makes the evaluation easier to reproduce.
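The effect of shuffling on the mini-batches can be sketched without any deep learning library. The helper names `make_batches` and `epoch_batches` below are hypothetical, and the 8 samples stand in for a real dataset.

```python
import random

def make_batches(indices, batch_size):
    """Split a list of sample indices into consecutive mini-batches."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

def epoch_batches(num_samples, batch_size, shuffle, rng):
    """Return the mini-batches for one epoch, optionally reshuffled."""
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)
    return make_batches(indices, batch_size)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Without shuffling, every epoch sees exactly the same mini-batches.
fixed_epoch_1 = epoch_batches(8, 4, shuffle=False, rng=rng)
fixed_epoch_2 = epoch_batches(8, 4, shuffle=False, rng=rng)

# With shuffling, the composition of the mini-batches usually changes
# from one epoch to the next.
shuffled_epoch_1 = epoch_batches(8, 4, shuffle=True, rng=rng)
shuffled_epoch_2 = epoch_batches(8, 4, shuffle=True, rng=rng)

print(fixed_epoch_1, fixed_epoch_2)
print(shuffled_epoch_1, shuffled_epoch_2)
```

In PyTorch, this behavior corresponds to the `shuffle` argument of `torch.utils.data.DataLoader`.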
Which augmentation types lead to the highest increase in performance for this dataset? This depends on the context and whether the augmentation is reasonable. For example, if we are detecting wildfire smoke from camera monitoring video streams, it does not make sense to do vertical flipping for the images. The camera in the real-world situation will not be upside-down, and adding the vertical flipping data augmentation is likely just to confuse the model. However, horizontal flipping makes sense because smoke could travel in the other direction depending on the wind direction, and horizontal flipping helps the model to generalize to unseen types of wildfire smoke. A small rotation of the image also makes sense because the camera could shake due to weather conditions. A small color jittering also makes sense because it helps the model to generalize to different weather and lighting conditions. Some small occlusion (by randomly erasing a part of the image) also makes sense because the camera lens could be dirty, or something could block a part of the view.
Inside a batch, we do not preserve the original version of the augmented images. In other words, augmented images are not copies of the original images, but we are just modifying the original images themselves. Why is this not a problem? The goal of data augmentation is to diversify the data to help the model generalize to unseen situations. So we do not need to stick with the exact original images. Moreover, the original image is just one possible variation of the data augmentation. We usually specify a probability for each type of data augmentation, and there is a certain probability that the model will see the original image.
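The claim that the model still sees the original image some of the time can be illustrated with a minimal sketch. The function `random_horizontal_flip` below is a hypothetical pure-Python stand-in for a library augmentation, and the 2x3 nested list stands in for a real image.

```python
import random

def random_horizontal_flip(image, p, rng):
    """Flip each row left-to-right with probability p; otherwise return the image unchanged."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return image

image = [[1, 2, 3],
         [4, 5, 6]]  # a tiny hypothetical 2x3 "image"
rng = random.Random(42)

# Apply the augmentation once per epoch and count how often the
# original (unflipped) version is what the model would see.
num_epochs = 1000
original_seen = sum(
    random_horizontal_flip(image, p=0.5, rng=rng) == image
    for _ in range(num_epochs)
)
print(f"original version seen in {original_seen} of {num_epochs} epochs")
```

With a flip probability of 0.5, roughly half of the epochs present the original image, so nothing is lost by not storing a separate copy.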
Why are we adding augmentations only to the training dataset? Our goal of data augmentation is to facilitate the training process to help the model generalize to unseen situations with randomness. There is no point in doing this on the validation and test sets because we want consistent evaluation outcomes. Adding data augmentation to the validation and test sets will just confuse our evaluation with uncontrolled randomness.
What does the “19” in the VGG19 model name stand for? It is the number of layers with learnable weights in the model (16 convolutional layers and 3 fully connected layers).

Assignment 3

The loss when using the SGD optimizer drops slowly, and eventually it may still be able to reach a good performance. But when we change the optimizer to Adam, we start to get a boost in model performance and a faster decrease in the loss.
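One way to build intuition for this difference is a toy problem with a very flat loss, where the gradients are tiny. The sketch below is an assumption for illustration only (it is not the assignment's model): it implements plain gradient descent and the standard Adam update rule by hand and runs both with the same learning rate.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = 0.005 * w**2 (a very flat bowl)."""
    return 0.01 * w

def loss(w):
    return 0.005 * w ** 2

def run_sgd(w, lr, steps):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(w, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

sgd_loss = loss(run_sgd(1.0, lr=0.01, steps=100))
adam_loss = loss(run_adam(1.0, lr=0.01, steps=100))
print(f"SGD loss:  {sgd_loss:.6f}")
print(f"Adam loss: {adam_loss:.6f}")
```

Because Adam divides by the running magnitude of the gradient, its step size stays close to the learning rate even when the gradient itself is tiny, while the plain gradient descent step shrinks with the gradient.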

Assignment 4

When we use a very small learning rate, the loss almost does not change, and the performance of the model changes only a little bit (but is still changing). When we use a very large learning rate, we see that the loss and the model performance oscillate (i.e., alternating between some low and high values).
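Both behaviors can be reproduced on a one-dimensional toy loss. The sketch below assumes the loss f(w) = w**2 and is not the assignment's model. With the tiny learning rate the weight barely moves, and with the large one it jumps back and forth across the minimum without getting closer.

```python
def gradient_descent(w, lr, steps):
    """Minimize the toy loss f(w) = w**2 and return the visited weights."""
    trajectory = [w]
    for _ in range(steps):
        w = w - lr * 2 * w  # the gradient of w**2 is 2 * w
        trajectory.append(w)
    return trajectory

tiny_lr = gradient_descent(1.0, lr=1e-4, steps=5)
large_lr = gradient_descent(1.0, lr=1.0, steps=5)

print("lr=1e-4:", [round(w, 4) for w in tiny_lr])
print("lr=1.0: ", large_lr)
```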

Assignment 5

When we initialize all the weights to zero, we see that the loss and model performance almost never change. It looks like the model just stopped working. However, when we change the weight initialization to the Kaiming method, we see that the model can now be trained significantly faster than all the other settings when using the SGD optimizer with the same learning rate.

Assignment 6

The model is trying to use the labels on the garbage (e.g., labels on cans/bottles that show brands, labels on cardboard that show barcodes, and some parts on the paper) to identify their types, which is not a desirable behavior.

Assignment 7 (optional)

We recommend using ResNet18 and a single linear layer at the end for classification. The recommended optimizer is Adam, the suggested learning rate is 5e-5, and the suggested number of epochs is 15. We also suggest using the Kaiming weight initialization method. These hyperparameters were found to produce the highest accuracy on the validation set while also avoiding overfitting. It is recommended that the Gemeente use this network with these hyperparameters as a starting point for further development of the AI garbage classifier. Some inspiration and functions are credited to this Kaggle page.

Assignment 8.1

The last fully connected layer was replaced so that it maps (‘collapses’) the output of the previous layer to the number of classes in our task, so we can actually make predictions.

Assignment 8.2

This issue, the network failing to break symmetry (also known as the “symmetry problem”), is a well-known problem in deep learning. When we have multiple neurons in a layer with the same weights and biases, they will produce the same output, and the gradients will be the same for all the neurons. This makes it impossible for the network to learn different features from the data, as all the neurons in a layer will contribute equally to the output.
By randomly initializing the weights, we break the symmetry, and each neuron will learn different features from the data. This is because the random initialization ensures that each neuron in a layer has a different starting point and is optimized differently, resulting in a diverse set of learned features. This allows the network to learn complex representations of the data, leading to better performance.
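The symmetry argument can be checked numerically on a tiny network. The sketch below assumes a hypothetical network with two inputs, two tanh hidden neurons, one linear output, and a squared-error loss. It computes the gradients of the hidden-layer weights by hand for three initializations.

```python
import math
import random

def hidden_gradients(w1, w2, x, target):
    """Return dLoss/dW1 for a tiny net: y = sum_j w2[j] * tanh(w1[j] . x)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    d_y = y - target  # derivative of 0.5 * (y - target)**2 with respect to y
    return [
        [d_y * w2[j] * (1 - h[j] ** 2) * xi for xi in x]
        for j in range(len(w1))
    ]

x, target = [1.0, 2.0], 1.0

# Zero initialization: no gradient reaches the hidden layer at all.
zero_grads = hidden_gradients([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], x, target)

# Constant initialization: both hidden neurons receive identical gradients.
const_grads = hidden_gradients([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, target)

# Random initialization: the neurons receive different gradients.
rng = random.Random(0)
rand_w1 = [[rng.uniform(-1, 1) for _ in x] for _ in range(2)]
rand_w2 = [rng.uniform(-1, 1) for _ in range(2)]
rand_grads = hidden_gradients(rand_w1, rand_w2, x, target)

print(zero_grads, const_grads, rand_grads, sep="\n")
```

In the constant case the two rows of gradients are equal, so the two neurons stay identical after every update. Only the random case lets them diverge and learn different features.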

Assignment 8.3

Zero initialization: initializes all weights to zero. This can cause problems with symmetry breaking and cause all neurons to update identically during training, leading to poor performance.
Gaussian distributed initialization: initializes weights with random values drawn from a normal distribution. This method can help to break symmetry and enable better training, but the variance of the distribution needs to be carefully chosen to avoid exploding/vanishing gradients.
Kaiming initialization: similar to Xavier initialization, but designed specifically for ReLU activation functions that can cause problems with dying ReLUs if not initialized properly. This method scales the weights based only on the number of input neurons, rather than both input and output neurons as in Xavier initialization.
[EXTRA] Xavier initialization: scales the weights based on the number of input and output neurons, helping to ensure that the variance of the activations remains roughly constant across layers. This can lead to faster convergence and better performance, particularly for tanh activation functions.
More information about weight initialization can be found in this notebook.
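The difference between the two scaling rules can be written down in a few lines. The sketch below uses the commonly stated standard deviations for the normal-distribution variants (Kaiming with fan-in scaling for ReLU, and Xavier with a gain of 1). The layer sizes and helper names are hypothetical.

```python
import math
import random

def kaiming_std(fan_in):
    """Standard deviation used by Kaiming (He) normal initialization for ReLU."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Standard deviation used by Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

fan_in, fan_out = 512, 256  # hypothetical layer sizes
rng = random.Random(0)
weights = init_weights(fan_in, fan_out, kaiming_std(fan_in), rng)

print("Kaiming std:", kaiming_std(fan_in))
print("Xavier std: ", round(xavier_std(fan_in, fan_out), 4))
```

In PyTorch, the corresponding functions live in `torch.nn.init` (for example `kaiming_normal_` and `xavier_normal_`).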

Assignment 8.4

The phenomenon where adding more layers to a neural network leads to worse performance is known as the “overfitting problem”. This occurs when a model is too complex and fits the training data too closely, leading to poor generalization to new data. To fix this issue, we can use regularization methods such as L1, L2, and dropout. L1 regularization adds a penalty to the loss function that encourages sparse weights (many weights are pushed so close to 0 that they effectively act as zero weights), while L2 regularization adds a penalty that encourages small weights. Dropout randomly “drops out” neurons during training, which helps prevent overfitting. By using these regularization techniques, we can help ensure that the model generalizes well to new data, while still keeping the architecture mostly the same.
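As a minimal sketch of how the two penalties enter the loss, the code below adds them to a given data loss. The weights, the data loss value, and the penalty strengths are hypothetical placeholders.

```python
def l1_penalty(weights):
    """Sum of absolute weight values; its gradient pushes weights toward exactly zero."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weight values; large weights are penalized the most."""
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, l1_strength=0.0, l2_strength=0.0):
    """Add weighted L1 and L2 penalties to a data loss."""
    return (data_loss
            + l1_strength * l1_penalty(weights)
            + l2_strength * l2_penalty(weights))

weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical weights
print(regularized_loss(1.0, weights, l1_strength=0.01))
print(regularized_loss(1.0, weights, l2_strength=0.01))
```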

Assignment 8.5

When using a high learning rate, the model’s optimization algorithm may overshoot the optimal weights and biases during training, leading to instability and divergence. This can result in the loss function not converging or even increasing. To address this issue, we can use techniques such as learning rate scheduling, which gradually reduces the learning rate over time, or we can use adaptive optimization algorithms such as Adam, which dynamically adjust the learning rate during training based on the gradients. Additionally, regularization techniques such as dropout and weight decay can help to prevent overfitting and improve generalization.
Conversely, when using a very small learning rate, the optimizer will only take very small steps in updating the model parameters, which means that it will take a very long time to train our model.
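Learning rate scheduling can be sketched on a toy loss as well. The code below assumes the loss f(w) = w**2 and an exponential decay of 10% per step; both are illustrative choices, not the settings from the assignment. The constant rate is deliberately too high, and the decayed schedule starts from the same value.

```python
def train(w, lr_schedule, steps):
    """Gradient descent on the toy loss f(w) = w**2 with a per-step learning rate."""
    for step in range(steps):
        w = w - lr_schedule(step) * 2 * w  # the gradient of w**2 is 2 * w
    return w

def constant_lr(step):
    return 1.0  # deliberately too high for this toy problem

def decayed_lr(step):
    return 1.0 * 0.9 ** step  # same starting value, shrinking by 10% per step

w_constant = train(1.0, constant_lr, steps=30)
w_decayed = train(1.0, decayed_lr, steps=30)
print("constant lr:", w_constant)
print("decayed lr: ", w_decayed)
```

With the constant rate the weight keeps its magnitude and never approaches the minimum at 0, while the decayed schedule ends up very close to it.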

Assignment 8.6

ReLU activation function - ReLU has a derivative of 1 for all positive inputs, which helps prevent the gradient from becoming too small.
Initialization techniques - Properly initializing the weights can help prevent the gradient from becoming too small or too large.
Dropout - Dropout randomly “drops out” neurons during training, which helps prevent overfitting and can also prevent the gradient from becoming too small.
L1 and L2 regularization - L1 regularization adds a penalty term to the loss function that encourages sparse weight matrices, while L2 regularization adds a penalty term that encourages small weights. This helps prevent the gradient from becoming too large.
These methods work by either ensuring that the gradient doesn’t become too small or too large, or by preventing overfitting, which can exacerbate the vanishing gradient problem.
[EXTRA] Batch Normalization - Normalizing the input to each layer of the network can help prevent the gradient from becoming too small or too large.
[EXTRA] Gradient Clipping - Limiting the magnitude of the gradients prevents them from becoming too large. Note that clipping addresses exploding gradients; it does not help when the gradients are too small.
[EXTRA] Residual connections - Adding residual connections to the network can help prevent the gradient from becoming too small as the signal can bypass the problematic layers. This solution corresponds to the ResNet that we used in this tutorial.
More information about the vanishing gradient problem can be found in this notebook.
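The role of the activation function and of residual connections can be illustrated with a strongly simplified calculation. The sketch below assumes a chain of identical layers with all weights equal to 1 and the same pre-activation value at every layer, so that the gradient is just the product of the local derivatives. The sigmoid is used here only as a contrast to ReLU and is an assumption of this example.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(derivative, z, num_layers, residual=False):
    """Product of per-layer local derivatives along a chain of layers.

    Weights are assumed to be 1 so that only the activation matters. With
    residual=True each layer computes x + f(x), so its derivative is 1 + f'(x).
    """
    local = derivative(z) + (1.0 if residual else 0.0)
    return local ** num_layers

print("sigmoid, 10 layers:", gradient_scale(sigmoid_derivative, 0.0, 10))
print("ReLU, 10 layers:   ", gradient_scale(relu_derivative, 1.0, 10))
print("sigmoid + residual:", gradient_scale(sigmoid_derivative, 0.0, 10, residual=True))
```

Even in the best case for the sigmoid (a derivative of 0.25), the product shrinks below one millionth after 10 layers, while the ReLU path keeps a factor of 1 and the residual path cannot drop below 1 in this setup.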