::include_graphics("images/kerasPipeline.png") knitr
Introduction to Keras - Lab
Introduction
This lab has been adapted from the first lab in the Workshop on Deep learning with keras and Tensorflow in R (Rstudio conf. 2020), and has been downladed and re-created to ensure its usability.
The lab is oriented to provide a first contact with Keras in R and presents the main steps required for building training and using a deepl learning model using Keras and R as summarized in the Keras Cheatshhet:
The following steps provide a good mental model for tuning a model.
It is important to recall that though they don’t guarantee you’ll find the optimal model; however, it should give you a higher probability of finding a near optimal one.
- Prepare data
- Balance batch size with a default learning rate
- Tune the adaptive learning rate optimizer
- Add callbacks to control training
- Explore model capacity
- Regularize overfitting
- Repeat steps 1-6
- Evaluate final model results
We’ll demonstrate with one of the most famous benchmark data sets, MNIST. We’ll work with a multi-layer perceptron (MLP), but realize that these steps also translate to other DL models (i.e. CNNs, RNNs, LSTMs).
Package Requirements
library(keras) # for modeling
library(tidyverse) # for wrangling & visualization
library(glue) # for string literals
library(reticulate) # Interface to 'Python'
MNIST
keras
has many built in data sets (or functions to automatically install data sets). Check out the available datasets with dataset_
+ tab.
We’re going to use the MNIST data set which is the “hello world” for learning deep learning! ℹ️
<- dataset_mnist()
mnist str(mnist)
List of 2
$ train:List of 2
..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
$ test :List of 2
..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
Our training images (aka features) are stored as 3D arrays:
- 60,000 images consisting of a…
- 28x28 matrix with…
- values ranging from 0-255 representing gray scale pixel values.
# 60K images of 28x28 pixels
dim(mnist$train$x)
[1] 60000 28 28
# pixel values are gray scale ranging from 0-255
range(mnist$train$x)
[1] 0 255
Notice, however, that the labels, which indicate the “true” number associated with the image are stored as a vector, because per each image (a 2-D array of 28x28) ther is a single number.
# 60K numerical labels
dim(mnist$train$y)
[1] 60000
# integer values ranging from 0 to 9
range(mnist$train$y)
[1] 0 9
Check out the first digit
<- mnist$train$x[1,,]
digit digit
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 3
[7,] 0 0 0 0 0 0 0 0 30 36 94 154 170
[8,] 0 0 0 0 0 0 0 49 238 253 253 253 253
[9,] 0 0 0 0 0 0 0 18 219 253 253 253 253
[10,] 0 0 0 0 0 0 0 0 80 156 107 253 253
[11,] 0 0 0 0 0 0 0 0 0 14 1 154 253
[12,] 0 0 0 0 0 0 0 0 0 0 0 139 253
[13,] 0 0 0 0 0 0 0 0 0 0 0 11 190
[14,] 0 0 0 0 0 0 0 0 0 0 0 0 35
[15,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[16,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[17,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[18,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[19,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[20,] 0 0 0 0 0 0 0 0 0 0 0 0 39
[21,] 0 0 0 0 0 0 0 0 0 0 24 114 221
[22,] 0 0 0 0 0 0 0 0 23 66 213 253 253
[23,] 0 0 0 0 0 0 18 171 219 253 253 253 253
[24,] 0 0 0 0 55 172 226 253 253 253 253 244 133
[25,] 0 0 0 0 136 253 253 253 212 135 132 16 0
[26,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[27,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[28,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 18 18 18 126 136 175 26 166 255 247 127 0
[7,] 253 253 253 253 253 225 172 253 242 195 64 0
[8,] 253 253 253 253 251 93 82 82 56 39 0 0
[9,] 253 198 182 247 241 0 0 0 0 0 0 0
[10,] 205 11 0 43 154 0 0 0 0 0 0 0
[11,] 90 0 0 0 0 0 0 0 0 0 0 0
[12,] 190 2 0 0 0 0 0 0 0 0 0 0
[13,] 253 70 0 0 0 0 0 0 0 0 0 0
[14,] 241 225 160 108 1 0 0 0 0 0 0 0
[15,] 81 240 253 253 119 25 0 0 0 0 0 0
[16,] 0 45 186 253 253 150 27 0 0 0 0 0
[17,] 0 0 16 93 252 253 187 0 0 0 0 0
[18,] 0 0 0 0 249 253 249 64 0 0 0 0
[19,] 0 46 130 183 253 253 207 2 0 0 0 0
[20,] 148 229 253 253 253 250 182 0 0 0 0 0
[21,] 253 253 253 253 201 78 0 0 0 0 0 0
[22,] 253 253 198 81 2 0 0 0 0 0 0 0
[23,] 195 80 9 0 0 0 0 0 0 0 0 0
[24,] 11 0 0 0 0 0 0 0 0 0 0 0
[25,] 0 0 0 0 0 0 0 0 0 0 0 0
[26,] 0 0 0 0 0 0 0 0 0 0 0 0
[27,] 0 0 0 0 0 0 0 0 0 0 0 0
[28,] 0 0 0 0 0 0 0 0 0 0 0 0
[,26] [,27] [,28]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[4,] 0 0 0
[5,] 0 0 0
[6,] 0 0 0
[7,] 0 0 0
[8,] 0 0 0
[9,] 0 0 0
[10,] 0 0 0
[11,] 0 0 0
[12,] 0 0 0
[13,] 0 0 0
[14,] 0 0 0
[15,] 0 0 0
[16,] 0 0 0
[17,] 0 0 0
[18,] 0 0 0
[19,] 0 0 0
[20,] 0 0 0
[21,] 0 0 0
[22,] 0 0 0
[23,] 0 0 0
[24,] 0 0 0
[25,] 0 0 0
[26,] 0 0 0
[27,] 0 0 0
[28,] 0 0 0
Lets plot the first digit and compare to the above matrix.
plot(as.raster(digit, max = 255))
The associated label confirms it is number 5.
$train$y[1] mnist
[1] 5
Now lets check out the first 100 digits
par(mfrow = c(10, 10), mar = c(0,0,0,0))
for (i in 1:100) {
plot(as.raster(mnist$train$x[i,,], max = 255))
}
Prepare Data
When we work with keras:
- training and test sets need to be independent
- features and labels (aka target, response) need to be independent
- use
%<-%
for object unpacking (see?zeallot::%<-%
)
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist
# the above is the same as
# train_images <- mnist$train$x
# train_labels <- mnist$train$y
# test_images <- mnist$test$x
# test_labels <- mnist$test$y
Shape into proper tensor form
The shape of our data is dependent on the type of DL model we are training. Multi Layer Perceptrons (MLPs) require our data to be in a 2D tensor (aka matrix); however, our data are currently in a 3D tensor.
We can reshape our tensor from 3D to 2D. Much like a matrix can be flattened to a vector:
<- matrix(1:9, nrow = 3)
m m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# flattened matrix
as.vector(m)
[1] 1 2 3 4 5 6 7 8 9
# Warning of how you invert the process !!!
# matrix(as.vector(m), nrow=3)
We can reshape a 3D array to a 2D array with array_reshape()
::include_graphics("images/reshape.png") knitr
# reshape 3D tensor (aka array) to a 2D tensor (aka matrix)
<- array_reshape(train_images, c(60000, 28 * 28))
train_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images
# our training data is now a matrix with 60K observations and
# 784 features (28 pixels x 28 pixels = 784)
str(train_images)
num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
Since we are dealing with a multi-classification problem where the target ranges from 0-9, we’ll reformat the labels with to_categorical()
.
This function takes a vector of \(N\) integers of \(k\) distinct classes and creates a matrix of \(N\) rows and \(k\) columns in such a way that if the nth value in the original vector, was a 7, now the nth row, column 8, in the transformed matrix, has a 1 and the remaining columns of this row contain a zero.
Note: column 1 refers to the digit “0”, column 2 refers to the digit “1”, etc.
class(train_labels)
[1] "array"
dim(train_labels)
[1] 60000
<- to_categorical(train_labels)
train_labels <- to_categorical(test_labels)
test_labels class(train_labels)
[1] "matrix" "array"
dim(train_labels)
[1] 60000 10
head(train_labels)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 1 0 0 0 0
[2,] 1 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 1 0 0 0 0 0
[4,] 0 1 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 1
[6,] 0 0 1 0 0 0 0 0 0 0
Stabilize learning by data scaling
When applying DL models, our feature values should not be relatively large compared to the randomized initial weights and all our features should take values in roughly the same range.
When features have large or widely varying values, large gradient updates can be triggered that will prevent the network from converging
Tips:
When all features have the same value range (i.e. images), we can standardize values between 0-1.
When features varying in range from one another (i.e. age, height, longitude) normalize each feature to have mean of 0 and standard deviation of 1 (?
scale()
)
# all our features (pixels) range from 0-255
range(train_images)
[1] 0 255
# standardize train and test features
<- train_images / 255
train_images <- test_images / 255 test_images
Randomize data
Although this dataset is not ordered, it is a good practice to randomize data so that our train and validation datasets are properly represented.
<- nrow(train_images)
obs set.seed(123)
<- sample(seq_len(obs), size = obs, replace = FALSE)
randomize <- train_images[randomize, ]
train_images <- train_labels[randomize, ] train_labels
Once the data has been prepared we can proceed to train some DL models!!
Balance Batch Size & Default Learning Rate
Epoch: A complete pass through the entire training dataset.
- Example: Training a model on 1,000 images where each image is seen once.
Batch: A subset of the training dataset used to update the model’s parameters.
- Example: Processing 100 images at a time during training.
Mini-batch: A smaller subset of the training dataset used for parameter updates.
- Example: Processing 32 images at a time during training with limited memory.
The learning rate is the most important hyperparameter to get right but before I tuning it, I like to find a batch size that balances:
- smoothness of the learning curve
- speed of training
Tips: start with…
- a default learning rate,
- a single hidden layer with units set to around
mean(n_features + n_response_classes)
, - the default batch size of 32,
- then test ranges of batch sizes [16, 32, 64, 128, 256, 512].
# get number of features
<- ncol(train_images)
n_feat
# 1. Define model architecture
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu',
input_shape = n_feat) %>%
layer_dense(units = 10, activation = 'softmax')
# 1. Define model architecture
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu', input_shape = c(n_feat)) %>%
layer_dense(units = 10, activation = 'softmax')
# 2. Define how our model is going to learn
%>% compile(
model loss = "categorical_crossentropy",
optimizer = "sgd",
metrics = "accuracy"
)
summary(model)
Model: "sequential_1"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_3 (Dense) (None, 512) 401920
dense_2 (Dense) (None, 10) 5130
================================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
________________________________________________________________________________
Let’s train our model with the default batch size of 32.
# 3. Train our model
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2
)
history
Final epoch (plot to see history):
loss: 0.1699
accuracy: 0.953
val_loss: 0.1778
val_accuracy: 0.9486
plot(history)
Your turn! (3min)
Now retrain the same model but try with batch sizes of 16, 128, 512 while leaving everything else the same. How do the results (loss, accuracy, and compute time) compare?
# define model architecture
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu', input_shape = n_feat) %>%
layer_dense(units = 10, activation = 'softmax')
# define how the model is gonig to learn
%>% compile(
model loss = "categorical_crossentropy",
optimizer = "sgd",
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
____ = ____
)
Pick an Adaptive Learning Rate Optimizer
Once we’ve picked an adequate batch size, next we want to start tuning the learning rate. There are two main considerations:
- We usually always assess different learning rates ranging from [1e-1, 1e-7].
- Adaptive learning rate optimizers nearly always outperform regular SGD
Adaptive learning rates help to escape “saddle points”.
The primary adaptive learning rates include:
SGD (Stochastic Gradient Descent): Basic optimization algorithm that updates model parameters based on the negative gradient of the loss function.
SGD with momentum: Extends SGD by incorporating a momentum term, which helps accelerate learning and maintain direction in the presence of noisy or sparse gradients.
nAG (Nesterov Accelerated Gradient): Variant of SGD that improves convergence speed by calculating the gradient at a lookahead position, providing more accurate gradient information.
Adagrad (Adaptive Gradient): Adapts the learning rate for each parameter based on the historical gradients, giving more weight to parameters with smaller gradients.
Adadelta: Extension of Adagrad that dynamically adjusts the learning rate using the ratio of accumulated gradients to accumulated parameter updates.
RMSprop (Root Mean Square Propagation): Adaptive learning rate algorithm that normalizes learning rates by dividing them by an exponentially weighted moving average of squared gradients.
These optimization techniques offer different strategies for updating model parameters, improving convergence and adapting the learning rate to optimize model performance.
See ?optimizer_sgd()
or ?optimizer_rmsprop()
See https://ruder.io/optimizing-gradient-descent/ for more details
Your turn! (5min)
Retrain our DL model using:
- SGD with momentum (
optimizer_sgd()
)- Try default learning rate (0.01) with momentum (0.5, 0.9, 0.99)
- Try larger learning rates (0.1, 1) with momentum
- RMSprop (
optimizer_rmsprop()
)- What is the default learning rate? Train with default.
- Try a larger learning rate. What happens?
- Adam (
optimizer_adam()
)- What is the default learning rate? Train with default.
- Try a larger learning rate. What happens?
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu', input_shape = n_feat) %>%
layer_dense(units = 10, activation = 'softmax')
%>% compile(
model loss = "categorical_crossentropy",
optimizer = ____,
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128
)
Add Callbacks
When training a model, sometimes we want to:
- automatically stop a model once performance has stopped improving
- dynamically adjust values of certain parameters (i.e. learning rate)
- log model information to use or visualize later on
- continually save the model during training and save the model with the best performance
Keras provides a suite of tools called _callbacks that help us to monitor, control, and customize the training procedure.
Stop training at the right time
We often don’t know how many epochs we’ll need to reach a minimum loss. The early stopping callback allows us to automatically stop training after we experience no improvement in our loss after patience
number of epochs.
If you are going to use the model after training you always want to retain the “best” model, which is the model with the lowest loss.
restore_best_weights
will restore the weights for this “best” model even after you’ve passed it by n epochs.Sometimes your model will stall on a low validation loss resulting in a tie. Even with early stopping the model will continue to train for all the epochs.
min_delta
allows you to state some small value that the loss must improve by otherwise it will stop.
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu', input_shape = n_feat) %>%
layer_dense(units = 10, activation = 'softmax')
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.1,
momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = callback_early_stopping(patience = 3,
restore_best_weights = TRUE,
min_delta = 0.0001)
)
history
Final epoch (plot to see history):
loss: 0.003236
accuracy: 0.9998
val_loss: 0.06499
val_accuracy: 0.9832
Note how we did not train for all the epochs…only until our loss didn’t improve by 0.0001 for 3 consistent epochs.
plot(history)
Add a learning rate scheduler
Although adaptive learning rate optimizers adjust the velocity of weight updates depending on the loss surface, we can also incorporate additional ways to modify the learning rate “on the fly”.
Learning rate decay reduces the learning rate at each epoch. Note the
decay
argument in?optimizer_xxx()
.There has been some great research on cyclical learning rates (see https://arxiv.org/abs/1506.01186). You can incorporate custom learning rates such as this with
callback_learning_rate_scheduler()
. This is more advanced but definitely worth reading.A simpler, and very practical approach, is to reduce the learning rate after the model has stopped improving with
callback_reduce_lr_on_plateau()
. This approach allows the model to tell us when to reduce the learning rate.
When using decay
or callback_reduce_lr_on_plateau()
, you can usually increase the base learning rate so that you learn quickly early on and then the learning rate will reduce so you can take smaller steps as you approach the minimum loss.
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = 'relu', input_shape = n_feat) %>%
layer_dense(units = 10, activation = 'softmax')
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.1, momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE, min_delta = 0.0001),
callback_reduce_lr_on_plateau(patience = 1, factor = 0.1)
) )
history
Final epoch (plot to see history):
loss: 0.007838
accuracy: 0.9994
val_loss: 0.06073
val_accuracy: 0.9835
lr: 0.0001
plot(history)
plot(history$metrics$lr)
Explore Model Capacity
Now that we have found a pretty good learning rate and we have good control over our training procedure with callbacks, we can start to assess how the capacity of our model effects performance.
The capacity of a deep learning model refers to its ability to learn complex patterns in the data.
It is determined by the architecture, including the number of layers and neurons.
Higher capacity allows the model to capture more intricate details, but it also increases the risk of overfitting.
Finding the right balance is important for optimal performance.
Capacity is usually controlled in two ways:
- width: the number of units in a hidden layer
- depth: the number of hidden layers
Your turn! (5min)
Explore different model capacities while all other parameters constant. Try increasing
- Width: remember that the number of units is usually a power of 2 (i.e. 32, 64, 128, 256, 512, 1024).
- Depth: try using 2 or 3 hidden layers
<- keras_model_sequential() %>%
model %>%
___________ layer_dense(units = 10, activation = 'softmax')
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(lr = 0.1, momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE, min_delta = 0.0001),
callback_reduce_lr_on_plateau(patience = 1, factor = 0.1)
) )
Smart experimenting
As we start to experiment more, it becomes harder to organize and compare our results. Let’s make it more efficient by:
- Creating a function that allows us to dynamically change the number of layers and units.
- Using
callback_tensorboard()
which allows us to save and visually compare results.
There is another approach to performing grid searches. See the extras notebook https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/imdb-grid-search.nb.html for details.
<- function(n_units, n_layers, log_to) {
train_model
# Create a model with a single hidden input layer
<- keras_model_sequential() %>%
model layer_dense(units = n_units, activation = "relu", input_shape = n_feat)
# Add additional hidden layers based on input
if (n_layers > 1) {
for (i in seq_along(n_layers - 1)) {
%>% layer_dense(units = n_units, activation = "relu")
model
}
}
# Add final output layer
%>% layer_dense(units = 10, activation = "softmax")
model
# compile model
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.1, momentum = 0.9),
metrics = "accuracy"
)
# train model and store results with callback_tensorboard()
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE, min_delta = 0.0001),
callback_reduce_lr_on_plateau(patience = 1, factor = 0.1),
callback_tensorboard(log_dir = log_to)
),verbose = FALSE
)
return(history)
}
Now we can create a grid for various model capacities. We include an ID for each model, which we will use to save our results within callback_tensorboard()
.
<- expand_grid(
grid units = c(128, 256, 512, 1024),
layers = c(1:3)
%>%
) mutate(id = paste0("mlp_", layers, "_layers_", units, "_units"))
grid
# A tibble: 12 × 3
units layers id
<dbl> <int> <chr>
1 128 1 mlp_1_layers_128_units
2 128 2 mlp_2_layers_128_units
3 128 3 mlp_3_layers_128_units
4 256 1 mlp_1_layers_256_units
5 256 2 mlp_2_layers_256_units
6 256 3 mlp_3_layers_256_units
7 512 1 mlp_1_layers_512_units
8 512 2 mlp_2_layers_512_units
9 512 3 mlp_3_layers_512_units
10 1024 1 mlp_1_layers_1024_units
11 1024 2 mlp_2_layers_1024_units
12 1024 3 mlp_3_layers_1024_units
Now we can loop through each model capacity combination and train our models.
This will take a few minutes so this is a good time to save your work and take a break.
system.time(
for (row in seq_len(nrow(grid))) {
# get parameters
<- grid[[row, "units"]]
units <- grid[[row, "layers"]]
layers <- paste0("mnist/", grid[[row, "id"]])
file_path
# provide status update
cat(layers, "hidden layer(s) with", units, "neurons: ")
# train model
<- train_model(n_units = units, n_layers = layers, log_to = file_path)
m <- min(m$metrics$val_loss, na.rm = TRUE)
min_loss
# update status with loss
cat(min_loss, "\n", append = TRUE)
} )
1 hidden layer(s) with 128 neurons: 0.07069184
2 hidden layer(s) with 128 neurons: 0.06938207
3 hidden layer(s) with 128 neurons: 0.06946184
1 hidden layer(s) with 256 neurons: 0.06554866
2 hidden layer(s) with 256 neurons: 0.06640807
3 hidden layer(s) with 256 neurons: 0.07036863
1 hidden layer(s) with 512 neurons: 0.06244506
2 hidden layer(s) with 512 neurons: 0.06106765
3 hidden layer(s) with 512 neurons: 0.06557445
1 hidden layer(s) with 1024 neurons: 0.06027493
2 hidden layer(s) with 1024 neurons: 0.05934745
3 hidden layer(s) with 1024 neurons: 0.05973132
user system elapsed
128.89 75.78 287.67
Our results suggest that larger width models tend to perform better but it is unclear if deeper models add much benefit. However, to get more clarity we can analyze the learning rates with tensorboard()
.
Note that callback_tensorboard()
saved all the model runs in /mnist
subdirectory.
tensorboard("mnist")
Started TensorBoard at http://127.0.0.1:3383
Your turn! (3 min)
Retrain the model with:
- 2 hidden layers
- 1024 units in each hidden layer
- early stopping after 3 epochs of no improvement
- reduce learning rate after 1 epoch of no improvement
<- keras_model_sequential() %>%
model %>%
___________
___________
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(lr = 0.1, momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback______(patience = _____, restore_best_weights = TRUE, min_delta = 0.0001),
callback______(patience = _____, factor = 0.1)
) )
Regularize Overfitting
Often, once we’ve found a model that minimizes the loss, there is still some overfitting that is occuring…sometimes a lot.
So our next objective is to try flatten the validation loss learning curve and bring it as close to the training loss curve as possible.
Weight decay
A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take on small values, which makes the distribution of weight values more regular.
This is called weight regularization and its done by adding to the loss function of the network a cost associated with having large weights.
If you a familiar with regularized regression ℹ️ (lasso, ridge, elastic nets) then weight regularization is essentially the same thing. ℹ️
\[Loss = MSE + \lambda \sum^p_{j=1} w^2_j\]
- Although you can use L1, L2 or a combination, L2 is by far the most common and is known as weight decay in the context of neural nets.
- Optimal values vary but when tuning we typically start with factors of \(10^{-s}\) where s ranges between 1-4 (0.1, 0.01, …, 0.0001).
- The larger the weight regularizer, the more epochs generally required to reach a minimum loss.
- Weight decay can cause a noisier learning curve so its often beneficial to increase the
patience
parameter for early stopping if this is noticable.
<- keras_model_sequential() %>%
model layer_dense(
units = 512, activation = "relu", input_shape = n_feat,
kernel_regularizer = regularizer_l2(l = 0.001) # regularization parameter
%>%
) layer_dense(
units = 512, activation = "relu",
kernel_regularizer = regularizer_l2(l = 0.001) # regularization parameter
%>%
) layer_dense(units = 10, activation = "softmax")
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.1,
momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE, min_delta = 0.0001),
callback_reduce_lr_on_plateau(patience = 1, factor = 0.1)
) )
history
Final epoch (plot to see history):
loss: 0.09333
accuracy: 0.996
val_loss: 0.1284
val_accuracy: 0.9823
lr: 0.001
plot(history)
Dropout
Dropout is one of the most effective and commonly used regularization techniques for neural networks.
Dropout applied to a layer randomly drops out (sets to zero) a certain percentage of the output features of that layer.
By randomly dropping some of a layer’s outputs we minimize the chance of fitting patterns to noise in the data, a common cause of overfitting.
Dropout rates typically ranges between 0.2-0.5. Sometimes higher rates are necessary but note that you will get a warning when supplying rate > 0.5.
The higher the dropout rate, the slower the convergence so you may need to increase the number of epochs.
Its common to apply dropout after each hidden layer and with the same rate;
however, this is not necessary.
<- keras_model_sequential() %>%
model layer_dense(units = 512, activation = "relu", input_shape = n_feat) %>%
layer_dropout(0.3) %>% # regularization parameter
layer_dense(units = 512, activation = "relu") %>%
layer_dropout(0.3) %>% # regularization parameter
layer_dense(units = 10, activation = "softmax")
%>% compile(
model loss = "categorical_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.1,
momentum = 0.9),
metrics = "accuracy"
)
<- model %>% fit(
history
train_images, train_labels,validation_split = 0.2,
batch_size = 128,
epochs = 20,
callback = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE, min_delta = 0.0001),
callback_reduce_lr_on_plateau(patience = 1, factor = 0.1)
)%>%
) as.data.frame()
history
epoch value metric data
1 1 3.242483e-01 loss training
2 2 1.482577e-01 loss training
3 3 1.137299e-01 loss training
4 4 9.758937e-02 loss training
5 5 7.881726e-02 loss training
6 6 6.788949e-02 loss training
7 7 6.261241e-02 loss training
8 8 4.050179e-02 loss training
9 9 3.128683e-02 loss training
10 10 2.672179e-02 loss training
11 11 2.508097e-02 loss training
12 12 2.197508e-02 loss training
13 13 2.042278e-02 loss training
14 14 2.059718e-02 loss training
15 15 NA loss training
16 16 NA loss training
17 17 NA loss training
18 18 NA loss training
19 19 NA loss training
20 20 NA loss training
21 1 8.989375e-01 accuracy training
22 2 9.537917e-01 accuracy training
23 3 9.648333e-01 accuracy training
24 4 9.694791e-01 accuracy training
25 5 9.750834e-01 accuracy training
26 6 9.785417e-01 accuracy training
27 7 9.797083e-01 accuracy training
28 8 9.865834e-01 accuracy training
29 9 9.901250e-01 accuracy training
30 10 9.915208e-01 accuracy training
31 11 9.917708e-01 accuracy training
32 12 9.930000e-01 accuracy training
33 13 9.931250e-01 accuracy training
34 14 9.934375e-01 accuracy training
35 15 NA accuracy training
36 16 NA accuracy training
37 17 NA accuracy training
38 18 NA accuracy training
39 19 NA accuracy training
40 20 NA accuracy training
41 1 1.419821e-01 loss validation
42 2 1.198308e-01 loss validation
43 3 9.774947e-02 loss validation
44 4 8.783271e-02 loss validation
45 5 8.001570e-02 loss validation
46 6 7.208887e-02 loss validation
47 7 7.537550e-02 loss validation
48 8 6.350064e-02 loss validation
49 9 6.203371e-02 loss validation
50 10 6.138184e-02 loss validation
51 11 6.125287e-02 loss validation
52 12 6.218536e-02 loss validation
53 13 6.142388e-02 loss validation
54 14 6.137693e-02 loss validation
55 15 NA loss validation
56 16 NA loss validation
57 17 NA loss validation
58 18 NA loss validation
59 19 NA loss validation
60 20 NA loss validation
61 1 9.568334e-01 accuracy validation
62 2 9.630000e-01 accuracy validation
63 3 9.705000e-01 accuracy validation
64 4 9.735833e-01 accuracy validation
65 5 9.779167e-01 accuracy validation
66 6 9.800833e-01 accuracy validation
67 7 9.792500e-01 accuracy validation
68 8 9.827500e-01 accuracy validation
69 9 9.837500e-01 accuracy validation
70 10 9.838333e-01 accuracy validation
71 11 9.838333e-01 accuracy validation
72 12 9.841667e-01 accuracy validation
73 13 9.843333e-01 accuracy validation
74 14 9.843333e-01 accuracy validation
75 15 NA accuracy validation
76 16 NA accuracy validation
77 17 NA accuracy validation
78 18 NA accuracy validation
79 19 NA accuracy validation
80 20 NA accuracy validation
81 1 1.000000e-01 lr training
82 2 1.000000e-01 lr training
83 3 1.000000e-01 lr training
84 4 1.000000e-01 lr training
85 5 1.000000e-01 lr training
86 6 1.000000e-01 lr training
87 7 1.000000e-01 lr training
88 8 1.000000e-02 lr training
89 9 1.000000e-02 lr training
90 10 1.000000e-02 lr training
91 11 1.000000e-02 lr training
92 12 1.000000e-02 lr training
93 13 9.999999e-04 lr training
94 14 9.999999e-05 lr training
95 15 NA lr training
96 16 NA lr training
97 17 NA lr training
98 18 NA lr training
99 19 NA lr training
100 20 NA lr training
plot(history)
Repeat
At this point, we have a pretty good model. However, often, iterating over these steps can improve model performance even further. For brevity, we’ll act as if we have found a sufficient solution.
Evaluate results
Once a final model is chosen, we can evaluate the model on our test set to provide us with a accurate expectation of our generalization error. Our goal is that our test error is very close to our validation error.
%>% evaluate(test_images, test_labels, verbose = FALSE) model
loss accuracy
0.05530928 0.98350000
Confusion matrix
To understand our model’s performance across the different response classes, we can assess a confusion matrix ℹ️.
First, we need to predict our classes and also get the actual response values.
<- model %>% predict(test_images) %>% k_argmax()
predictions <- mnist$test$y actual
We can see the number of missed predictions in our test set
library(reticulate)
<- as.array(predictions)
predictions <- sum(predictions != actual)
missed_predictions missed_predictions
[1] 165
We can use caret::confusionMatrix()
to get our confusion matrix. We can see which digits our model confuses the most by analyzing the confusion matrix.
- 9s are often confused with 4s
- 6s are often confused with 0s & 5s
- etc.
::confusionMatrix(factor(predictions), factor(actual)) caret
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 971 0 4 0 1 2 3 1 4 1
1 1 1128 0 0 1 0 3 5 0 2
2 0 1 1017 4 2 0 0 7 3 0
3 0 2 2 993 1 7 0 1 3 3
4 1 0 1 0 965 0 3 0 1 9
5 0 1 1 4 0 874 7 0 3 2
6 3 2 2 0 3 2 940 0 2 1
7 1 0 3 3 1 1 1 1009 3 4
8 2 1 2 5 2 2 1 2 952 1
9 1 0 0 1 6 4 0 3 3 986
Overall Statistics
Accuracy : 0.9835
95% CI : (0.9808, 0.9859)
No Information Rate : 0.1135
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9817
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.9908 0.9938 0.9855 0.9832 0.9827 0.9798
Specificity 0.9982 0.9986 0.9981 0.9979 0.9983 0.9980
Pos Pred Value 0.9838 0.9895 0.9836 0.9812 0.9847 0.9798
Neg Pred Value 0.9990 0.9992 0.9983 0.9981 0.9981 0.9980
Prevalence 0.0980 0.1135 0.1032 0.1010 0.0982 0.0892
Detection Rate 0.0971 0.1128 0.1017 0.0993 0.0965 0.0874
Detection Prevalence 0.0987 0.1140 0.1034 0.1012 0.0980 0.0892
Balanced Accuracy 0.9945 0.9962 0.9918 0.9905 0.9905 0.9889
Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity 0.9812 0.9815 0.9774 0.9772
Specificity 0.9983 0.9981 0.9980 0.9980
Pos Pred Value 0.9843 0.9834 0.9814 0.9821
Neg Pred Value 0.9980 0.9979 0.9976 0.9974
Prevalence 0.0958 0.1028 0.0974 0.1009
Detection Rate 0.0940 0.1009 0.0952 0.0986
Detection Prevalence 0.0955 0.1026 0.0970 0.1004
Balanced Accuracy 0.9898 0.9898 0.9877 0.9876
Visualize missed predictions
We can also visualize this with the following:
tibble(
actual,
predictions%>%
) filter(actual != predictions) %>%
count(actual, predictions) %>%
mutate(perc = n / n() * 100) %>%
filter(n > 1) %>%
ggplot(aes(actual, predictions, size = n)) +
geom_point(shape = 15, col = "#9F92C6") +
scale_x_continuous("Actual Target", breaks = 0:9) +
scale_y_continuous("Prediction", breaks = 0:9) +
scale_size_area(breaks = c(2, 5, 10, 15), max_size = 5) +
coord_fixed() +
ggtitle(paste(missed_predictions, "mismatches")) +
theme(panel.grid.minor = element_blank()) +
labs(caption = 'Adapted from Rick Scavetta')
Visualize missed predictions
Lastly, lets check out those mispredicted digits.
<- which(predictions != actual)
missed <- ceiling(sqrt(length(missed)))
plot_dim
par(mfrow = c(plot_dim, plot_dim), mar = c(0,0,0,0))
for (i in missed) {
plot(as.raster(mnist$test$x[i,,], max = 255))
}
If we look at the predicted vs actual we can reason about why our model mispredicted some of the digits.
par(mfrow = c(4, 4), mar = c(0,0,2,0))
for (i in missed[1:16]) {
plot(as.raster(mnist$test$x[i,,], max = 255))
title(main = paste("Predicted:", predictions[i]))
}
Key takeaways
Follow these steps and guidelines when tuning your DL model:
- Prepare data
- data needs to be shaped into the right tensor dimensions
- data should be scaled so they don’t trigger exploding gradients
- Balance batch size with a default learning rate so that…
- your learning curve is not too noisy
- the training compute time is sufficient
- Tune the adaptive learning rate optimizer
- compare SGD+momentum with RMSprop & Adam
- assess learning rates on log scale between [1e-1, 1e-7]
- Add callbacks to control training
- use early stopping for more efficient training
- use an adaptive learning rate callback to have more control of the learning rate
- Explore model capacity
- compare adding width vs depth (consider loss vs. training time)
- be smart with your experiments (use tensorboard callback!)
- Regularize overfitting
- weight decay controls magnitude of weights; start by assessing values btwn 0.1, 0.01, …, 0.0001
- dropout minimizes happenstance patterns from noise; typical values range from 0.2-0.5.
- Iterate, iterate, iterate!