The MNIST (Modified National Institute of Standards and Technology) dataset (https://en.wikipedia.org/wiki/MNIST_database) is a classic dataset in machine learning. To develop our intuitions about the problem, we start with a simple linear classifier, achieving an average accuracy of $80\%$. We then build a state-of-the-art convolutional neural network (CNN), achieving an accuracy of over $98\%$.
This notebook is available at https://github.com/jcboyd/deep-learning-workshop.
A Docker image for this project is available on Docker Hub:
$ docker pull jcboyd/deep-learning-workshop:[cpu|gpu]
$ nvidia-docker run -it -p 8888:8888 jcboyd/deep-learning-workshop:[cpu|gpu]
Murphy, Kevin P. *Machine Learning: A Probabilistic Perspective*. MIT Press, 2012. (Figure 1.1)
from pandas import read_csv
read_csv('data/shapes.csv')
from __future__ import print_function
from __future__ import division
import tensorflow.examples.tutorials.mnist.input_data as input_data
mnist = input_data.read_data_sets('MNIST_data', reshape=False, one_hot=False)
Xtr = mnist.train.images
Ytr = mnist.train.labels
Xval = mnist.validation.images
Yval = mnist.validation.labels
Xte = mnist.test.images
Yte = mnist.test.labels
print('Training data shape: ', Xtr.shape)
print('Training labels shape: ', Ytr.shape)
print('Validation data shape: ', Xval.shape)
print('Validation labels shape: ', Yval.shape)
print('Test data shape: ', Xte.shape)
print('Test labels shape: ', Yte.shape)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from src import vis_utils
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg')
fig, ax = plt.subplots(figsize=(6, 6))
idx = np.random.randint(len(Xtr))
vis_utils.plot_image(ax, Xtr[idx, :, :, 0], Ytr[idx])
print(Xtr[idx].reshape(784))  # raw pixel values of the image
fig = plt.figure(figsize=(8, 6))
vis_utils.plot_array(fig, Xtr, Ytr, num_classes=10)
# First, vectorise image data
Xtr_rows = np.reshape(Xtr, (Xtr.shape[0], -1)).copy()
Xval_rows = np.reshape(Xval, (Xval.shape[0], -1)).copy()
Xte_rows = np.reshape(Xte, (Xte.shape[0], -1)).copy()
# As a sanity check, print out the shapes of the data
print('Training data shape: ', Xtr_rows.shape)
print('Validation data shape: ', Xval_rows.shape)
print('Test data shape: ', Xte_rows.shape)
mean_image = np.mean(Xtr, axis=0).reshape(1, 784)
Xtr_rows -= mean_image
Xval_rows -= mean_image
Xte_rows -= mean_image
fig, ax = plt.subplots(figsize=(6, 6))
vis_utils.plot_image(ax, mean_image.reshape(28, 28))
For MNIST, $D = 784$, and we need a weight for every pixel in an image.

Stanford Computer Vision course (Convolutional Neural Networks for Visual Recognition): http://cs231n.stanford.edu/
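As a minimal sketch of the shapes involved (toy values, illustrative only), the linear classifier computes one score per class from a flattened image:

import numpy as np

K, D = 10, 784             # 10 classes, 28 x 28 = 784 pixels
W = np.random.randn(K, D)  # one weight per (class, pixel) pair
b = np.random.randn(K)     # one bias per class
x = np.random.rand(D)      # a flattened image
scores = W.dot(x) + b      # shape (10,): one score per class
print(scores.argmax())     # the predicted class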
Training consists of finding the weights that minimise the loss over the training set.
The most common procedure for optimising a convex differentiable function is known as gradient descent, which repeats the update

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}^{(t)}),$$

where $\alpha$ is referred to as the step size or learning rate. Thus, each iteration is a descent step, and we converge iteratively to a global minimum.
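For intuition, here is a minimal sketch of gradient descent on the simple convex function $f(\mathbf{w}) = \|\mathbf{w}\|^2$, whose gradient is $2\mathbf{w}$ (illustrative only; the models in src/linear_models.py apply the analogous update to the SVM and softmax losses):

import numpy as np

alpha = 0.1                # learning rate
w = np.array([3.0, -2.0])  # initial weights

for step in range(100):
    grad = 2 * w           # gradient of f(w) = ||w||^2
    w -= alpha * grad      # descent step

print(w)  # approaches the global minimum at the origin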
from src.linear_models import MultiSVM, SoftmaxRegression
# Perform bias trick
Xtr_rows = np.append(Xtr_rows, np.ones((Xtr_rows.shape[0], 1)), axis=1)
Xval_rows = np.append(Xval_rows, np.ones((Xval_rows.shape[0], 1)), axis=1)
Xte_rows = np.append(Xte_rows, np.ones((Xte_rows.shape[0], 1)), axis=1)
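The trick works because appending a constant feature of $1$ to each input absorbs the bias into an extra column of the weight matrix: $\mathbf{W}\mathbf{x} + \mathbf{b} = \begin{bmatrix}\mathbf{W} & \mathbf{b}\end{bmatrix}\begin{bmatrix}\mathbf{x} \\ 1\end{bmatrix}$. A quick numerical check with toy values:

import numpy as np

W = np.random.randn(10, 784)
b = np.random.randn(10)
x = np.random.rand(784)

W_aug = np.hstack([W, b[:, None]])  # fold the bias into the weights
x_aug = np.append(x, 1.0)           # append the constant feature

assert np.allclose(W.dot(x) + b, W_aug.dot(x_aug))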
reg = 5e4
batch_size = 200
max_iters = 1500
learning_rate = 1e-7
model = MultiSVM(Xtr_rows, Ytr)
model.train(reg, batch_size, learning_rate, max_iters, Xval_rows, Yval)
num_test = Yte.shape[0]
predictions = [model.predict(Xte_rows[i]) for i in range(num_test)]
print('Error: %.02f%%' % (100 * (1 - float(sum(Yte == np.array(predictions))) / num_test)))
from src.vis_utils import plot_confusion_matrix
num_classes = 10
fig, ax = plt.subplots(figsize=(8, 6))
classes = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
confusion_matrix = np.zeros((num_classes, num_classes), np.int32)
for i in range(len(predictions)):
    confusion_matrix[Yte[i]][predictions[i]] += 1
plot_confusion_matrix(ax, confusion_matrix, classes, fontsize=15)
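Equivalently, the counting loop can be vectorised with np.add.at, which accumulates a 1 at each (true, predicted) index pair (same result, more idiomatic NumPy):

confusion_matrix = np.zeros((num_classes, num_classes), np.int32)
np.add.at(confusion_matrix, (Yte, np.array(predictions)), 1)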
false = np.where(np.not_equal(Yte, predictions))[0]
idx = np.random.choice(false)
print('Prediction: %d\nTrue class: %d' % (predictions[idx], Yte[idx]))
fig, ax = plt.subplots(figsize=(6, 6))
vis_utils.plot_image(ax, Xte[idx, :, :, 0])
fig = plt.figure(figsize=(8, 4))
vis_utils.plot_weights(fig, model.W[:-1,:], classes)
Deep learning is characterised by the modelling of a hierarchy of abstractions in the input data. In the following we focus on applications to images, but note that deep learning has seen great success in fields ranging from natural language processing to speech synthesis.
where $\mathbf{W} \in \mathbb{R}^{K \times D}$ are the weights, $\mathbf{b} \in \mathbb{R}^{K \times 1}$ are the biases (sometimes incorporated into the weights via the bias trick), and $\text{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)}$ generalises the sigmoid (logistic) function to multiple classes.
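In code, the softmax is a few lines of NumPy (a minimal sketch; subtracting the maximum score before exponentiating is a standard stabilisation trick and leaves the result unchanged):

import numpy as np

def softmax(x):
    """Map a vector of scores to a probability distribution."""
    z = x - np.max(x)  # guard against overflow in exp
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1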
Recall from signal processing the convolution between two functions,

$$(f * g)(t) \triangleq \int_{-\infty}^{+\infty}f(\tau)g(t-\tau)d\tau$$

In image processing, a convolution between an image $\mathbf{I}$ and a kernel $\mathbf{K}$ of size $d \times d$, centered at a given pixel $(x, y)$, is defined as,

$$(\mathbf{I} * \mathbf{K})(x, y) = \sum_{i = 1}^{d}\sum_{j = 1}^{d} \mathbf{I}(x + i - d/2, y + j - d/2) \times \mathbf{K}(i, j)$$

The dimension $d \times d$ is referred to as the *receptive field* of the convolution.
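A naive implementation makes the indexing in this sum explicit (a sketch for interior pixels only, with 0-based indices and no border handling; optimised routines exist in libraries such as scipy.signal):

import numpy as np

def convolve_at(I, K, x, y):
    """Evaluate the convolution sum at an interior pixel (x, y)."""
    d = K.shape[0]
    total = 0.0
    for i in range(d):
        for j in range(d):
            total += I[x + i - d // 2, y + j - d // 2] * K[i, j]
    return total

I = np.random.rand(28, 28)
K = np.ones((3, 3)) / 9.0  # 3 x 3 mean (box) filter
print(convolve_at(I, K, 14, 14))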
First, we initialise our model. In TensorFlow, this consists of declaring the operations (and the relationships between them) required to compute the forward pass (from input to loss function) of the model (see src/cnn.py). Note that this is done in a declarative fashion, and it may be counter-intuitive that this code is run only once, to initialise the computational graph. Actual forward passes are performed via a tf.Session() variable, with mini-batches passed through the graph to a nominal reference node (for example, the loss node). TensorFlow then knows how to backpropagate through each graph operation. This paradigm has its drawbacks, however: it is highly verbose, and error traces are often opaque. PyTorch, a TensorFlow alternative, addresses this problem by keeping everything interpreted.
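The declare-then-run pattern can be seen in miniature below (a sketch in the same TensorFlow 1.x API used by this notebook): the first two lines only add nodes to the graph; no arithmetic happens until sess.run is called.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=())  # a graph node, no value yet
y = 3.0 * x + 1.0                         # another node; nothing is computed here

with tf.Session() as sess:
    # the graph is executed only now, once per run() call
    print(sess.run(y, feed_dict={x: 2.0}))  # 7.0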
from src.cnn import ConvolutionalNeuralNetwork
nb_labels = 10
batch_size = 64
model = ConvolutionalNeuralNetwork(img_size=28, nb_channels=1, nb_labels=nb_labels)
import tensorflow as tf
from src.utils import sample_batch, one_hot_encoding, error_rate
max_iters = 1500
with tf.Session() as sess:
    saver = tf.train.Saver()
    num_training = Xtr.shape[0]
    batch = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.01,
                                               batch * batch_size,
                                               num_training,
                                               0.95,
                                               staircase=True)
    optimizer = tf.train.MomentumOptimizer(
        learning_rate, 0.9).minimize(model.loss, global_step=batch)
    tf.global_variables_initializer().run()

    for step in range(max_iters):
        x_batch, y_batch = sample_batch(Xtr, Ytr, augment=False)
        y_batch = one_hot_encoding(y_batch, nb_labels)
        feed_dict = {model.X: x_batch, model.Y: y_batch}
        _, l, lr, pred = sess.run(
            [optimizer, model.loss, learning_rate, model.pred],
            feed_dict=feed_dict)

        if step % 100 == 0:
            error = error_rate(pred, y_batch)
            Yval_one_hot = one_hot_encoding(Yval, nb_labels)
            print('Step %d of %d' % (step, max_iters))
            print('Mini-batch loss: %.5f Error: %.5f Learning rate: %.5f'
                  % (l, error, lr))
            print('Validation error: %.1f%%' % error_rate(
                model.pred.eval(feed_dict={model.X: Xval}), Yval_one_hot))

    # Save the trained weights
    saver.save(sess, '/tmp/model.ckpt')
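With staircase=True, the schedule above multiplies the base rate of $0.01$ by $0.95$ once per epoch, i.e. $\text{lr} = 0.01 \times 0.95^{\lfloor \text{batch} \times \text{batch\_size} / \text{num\_training} \rfloor}$. A quick check of the values it takes over these 1500 steps (plain Python mirroring the TensorFlow formula; 55,000 is the size of the MNIST training split):

num_training = 55000
for step in [0, 500, 1000, 1500]:
    lr = 0.01 * 0.95 ** ((step * 64) // num_training)
    print('step %4d: lr = %.5f' % (step, lr))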
tf.reset_default_graph()
model = ConvolutionalNeuralNetwork(img_size=28, nb_channels=1, nb_labels=nb_labels)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, '/tmp/model.ckpt')
    pred, conv1, conv2 = sess.run([model.pred, model.conv1, model.conv2],
                                  feed_dict={model.X: Xte[:1000]})
pred = np.argmax(pred, axis=1).astype(np.int8)
correct = np.sum(pred == Yte[:1000])
print('Test error: %.02f%%' % (100 * (1 - float(correct) / float(pred.shape[0]))))
classes = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
confusion_matrix = np.zeros((num_classes, num_classes), np.int32)
for i in range(len(pred)):
    confusion_matrix[Yte[i]][pred[i]] += 1
fig, ax = plt.subplots(figsize=(8, 6))
plot_confusion_matrix(ax, confusion_matrix, classes, fontsize=15)
Romanuke, Vadim. "Parallel Computing Center (Khmelnitskiy, Ukraine) represents an ensemble of 5 convolutional neural networks which performs on MNIST at 0.21 percent error rate." Retrieved 24 November 2016.
# Show the first test image, whose activation maps are visualised below
fig, ax = plt.subplots(figsize=(6, 6))
vis_utils.plot_image(ax, Xte[0, :, :, 0])
First, the 32 activations of the first convolutional layer:
fig = plt.figure(figsize=(10, 5))
vis_utils.plot_activation_maps(fig, conv1, 4, 8)
plt.show()
Then, the 64 activations of the second convolutional layer ($14 \times 14$ px):
fig = plt.figure(figsize=(10, 10))
vis_utils.plot_activation_maps(fig, conv2, 8, 8)
plt.show()