All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html
A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).
Here are some tips on using a Jupyter notebook:
Shift+Enter runs the code in the current cell (you can also use the play button in the toolbar).
2 + 2 # hit Shift+Enter to run
Alt+Enter runs the current cell and inserts a new cell below it (you can also use the plus button in the toolbar).
Some tips on using a Jupyter notebook with Python:
Tab will autocomplete the keyword you have started typing.
%bash will allow you to run bash code in a cell.
%paste will allow you to paste a block of code while retaining its formatting.
%matplotlib inline will import the visualization library matplotlib and automatically display its plots inline, that is, below the cell.
Here's a full list of magic commands: http://ipython.readthedocs.io/en/stable/interactive/magics.html
For more info on Jupyter: https://jupyter.org/
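If you are curious which magic commands are available in your own setup, this optional cell lists them all (the %lsmagic magic is built into IPython):
# run this cell with Shift+Enter to list all available line and cell magics
%lsmagic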
In this lab, we will import data (./data/decathlon.txt) relating to the top performances in the Men's decathlon at the 2004 Summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and at Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). (Both events were won by Roman Šebrle.)
pandas is a data analysis library for Python. With pandas we can import our Olympics athletes data into a structured object called a data frame, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas.
import pandas as pd
my_data = pd.read_csv('data/decathlon.txt', sep="\t") # load data
print(type(my_data)) # display my_data data type
my_data.head(n=5) # adjust n to view more data
We can select a single column of the data frame and use the head() function to view its first few rows only.
my_data['100m'].head()
columns = ['100m', '400m'] # select several columns by name
my_data[columns].head()
my_data[my_data['Competition']=='OlympicG'].head() # select the rows that match a condition
We can also access rows by label through the loc object. This behaves like a dictionary whose keys are the data frame's index.
my_data.loc['SEBRLE']
my_data.count() # summarise counts of data
list(my_data.columns) # get the names of the columns
print(my_data.shape) # get the shape (rows x columns)
print(my_data.values) # get the content as a numpy array
print(my_data.dtypes) # get the data type (dtype) of each column
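As an optional extra (not asked for in the lab), pandas can also produce a per-column numerical summary in a single call:
my_data.describe() # count, mean, std, min, quartiles and max for each numeric column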
To create visualisations, we'll use matplotlib, the primordial plotting library for Python. matplotlib may be used in different ways via its built-in interface, pyplot. This allows us to access matplotlib in a variety of ways, from a high-level state-machine environment to a low-level object-oriented approach; the latter is typically recommended (a short example contrasting the two follows the imports below). Another interface, pylab, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).
We also use a Jupyter magic command for inline plotting.
%matplotlib inline
import matplotlib.pyplot as plt
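As a quick illustration (not part of the original lab) of the two pyplot styles mentioned above, here is the same toy plot written both ways; the data is arbitrary:
# high-level state-machine style: pyplot keeps track of the "current" figure for us
plt.plot([1, 2, 3], [1, 4, 9])
plt.title('state-machine interface')
plt.show()
# object-oriented style (recommended): we keep explicit references to the figure and axes
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title('object-oriented interface')
plt.show()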
We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):
from IPython.display import Image, set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg') # toggle vector graphics for a crisp plot!
# basic visualization: athletes' performances depending on two disciplines
my_data.plot(kind='scatter', x='400m', y='Shot.put', s=50, color='blue')
# A scatterplot matrix allows us to visualize:
# * on the diagonal, the density estimation for each of the features
# * on each of the off-diagonal plots, a scatterplot between two of the features.
# Each dot represents a sample.
from pandas.plotting import scatter_matrix
#from pandas.tools.plotting import scatter_matrix # if pandas older than 0.20
scatter_matrix(my_data.get(['Shot.put','High.jump', '400m']), alpha=0.2,
figsize=(6, 6), diagonal='kde');
# fancier plots with seaborn: https://seaborn.pydata.org/
import seaborn as sns # the old seaborn.apionly module is deprecated in recent versions
sns.set_style('whitegrid')
sns.jointplot(x='Shot.put', y='High.jump', data=my_data,
              kind='kde', height=6, space=0, color='b') # 'height' was called 'size' in seaborn < 0.9
# looking at correlated features
sns.jointplot(x='Shot.put', y='Discus', data=my_data,
              kind='reg', height=6, space=0, color='b')
# Remove columns we don't need
data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)
# Verify new column headers
data.columns
We will first implement PCA ourselves. Recall that PCA of $\mathbf{X} \in \mathbb{R}^{N \times D}$ finds $M$ principal components as the columns of $\mathbf{W} \in \mathbb{R}^{D \times M}$ and a matrix of projected data $\mathbf{Z} \in \mathbb{R}^{N \times M}$, such that $\mathbf{Z}\mathbf{W}^T$ minimises the reconstruction error of $\mathbf{X}$ for the chosen $M$ (equivalently, the factorisation maximally explains the variance in $\mathbf{X}$). The principal components, i.e. the columns of $\mathbf{W}$, are the eigenvectors associated with the $M$ largest eigenvalues of the empirical covariance matrix $\boldsymbol\Sigma = \frac{1}{N}\mathbf{X}^T\mathbf{X}$.
Question: Recall that PCA works best on standardised data (mean 0, standard deviation 1). Standardise the data. Hint: you should aim to use numpy's vector-based operations.
import numpy as np
# convert the data frame to a numpy array
X = data.values
# TODO: standardise the data
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_centered = (X - X_mean) / X_std
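A quick optional sanity check (not part of the lab questions): after standardisation, every column should have mean approximately 0 and standard deviation approximately 1.
print(np.allclose(X_centered.mean(axis=0), 0)) # should print True
print(np.allclose(X_centered.std(axis=0), 1)) # should print True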
Question: Find the first two principal components and project the data onto them. These are the eigenvectors associated with the two largest eigenvalues of the empirical covariance matrix, $\boldsymbol\Sigma = \frac{1}{N}\mathbf{X}^T\mathbf{X}$.
from numpy import linalg
# TODO: calculate the covariance matrix with numpy
N = X_centered.shape[0]
X_cov = X_centered.T.dot(X_centered) / N
# TODO: find its two first principal components
w, v = linalg.eig(X_cov)
# np.linalg.eig does not guarantee any ordering of the eigenvalues, so sort them in decreasing order
order = np.argsort(w)[::-1]
w, v = w[order[:2]], v[:, order[:2]]
# TODO: project X onto the principal components
X_projected = X_centered.dot(v)
Let's display the projected data. We use a Jupyter magic command to display plots inline automatically.
# create figure and axis objects
fig, ax = plt.subplots(figsize=(6, 5))
# create scatterplot on axis N.B. we record the return value to feed to the colorbar
cax = ax.scatter(X_projected[:, 0], X_projected[:, 1], c=my_data['Rank'],
cmap=plt.get_cmap('viridis'))
# Set axis limits
ax.set_xlim([-5.5, 5.5])
ax.set_ylim([-4, 4])
# Create color bar
plt.colorbar(cax, label='Rank')
Question: Can you recognize a pattern? Describe what you see.
Answer: Good performances (dark blue points) are on the right side of the graph (PC1 > 0) and poor performances (yellow points) are on the left side (PC1 < 0).
In this course, we will rely heavily on the scikit-learn machine learning toolbox, which implements most classical (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs and compare the results to what we got before. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
from sklearn import decomposition, preprocessing
std_scale = preprocessing.StandardScaler().fit(X)
X_scaled = std_scale.transform(X)
pca = decomposition.PCA(n_components=2)
pca.fit(X_scaled)
Question: Use pca.transform to project the data onto its principal components.
# TODO: project X on principal components
X_projected = pca.transform(X_scaled)
pca.explained_variance_ratio_ gives the percentage of variance explained by each of the components.
print(pca.explained_variance_ratio_)
Question: How is pca.explained_variance_ratio_ computed? Check that this is the case by computing it yourself.
tot_var = np.var(X_scaled, axis=0).sum()
print((1 / tot_var) * np.var(X_projected, axis=0))
Project the data onto its principal components after standardizing.
Question: Plot the fraction of variance explained by each of the first 10 principal components.
# TODO: calculate the first ten components
pca = decomposition.PCA(n_components=10)
pca.fit(X_scaled)
plt.bar(np.arange(10), pca.explained_variance_ratio_, color='blue')
plt.xlim([-1, 9])
plt.xlabel("Number of PCs", fontsize=16)
plt.ylabel("Fraction of variance explained", fontsize=16)
To better understand the information captured by the principal components, we can consider pca.components_. Its rows are the principal components, i.e. the columns of $\mathbf{W}$ (below we only use the first two).
pcs = pca.components_
print(pcs[0])
We can display each row of $\mathbf{W}$ in a 2D plot whose x-axis gives its contribution to the first component and whose y-axis gives its contribution to the second component. Note that whereas before we were visualising the projected data, $\mathbf{Z}$, now we are visualising the projection directions, $\mathbf{W}$. This indicates how the features cluster: if a pair of feature projections are close, observations will tend to be similarly valued over those features.
fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlim([-0.7, 0.7])
ax.set_ylim([-0.7, 0.7])
for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):
    # plot line between origin and point (x, y)
    ax.plot([0, x], [0, y], color='k')
    # display the label of the point
    ax.text(x, y, data.columns[i], fontsize='14')
Question: Based on the two previous graphs, can you find a meaning for the two principal components?
Answer: PC1: variables with a negative impact on PC1 correspond to disciplines for which a good performance is a small number (the running times), and variables with a positive impact on PC1 correspond to disciplines for which a good performance is a large number (the throwing and jumping events). This component therefore separates athletes with excellent performances from those with relatively worse performances in every discipline, as we noticed before.
The second principal component separates athletes who are strong (good at the throwing events) but less enduring (lower performances in the running events) from the others.
Finally, this last graph shows correlations between variables, such as Discus (discus throw) and Shot.put (shot put).
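As a quick numerical check of that last claim (optional, not part of the original lab), we can compute the correlation between the two columns directly:
# correlation between the two throwing events mentioned above
my_data[['Shot.put', 'Discus']].corr()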
We have seen above that PCA can be performed by computing the eigendecomposition of the covariance matrix. An alternative way of performing PCA is with the singular value decomposition (SVD) (https://en.wikipedia.org/wiki/Singular_value_decomposition). The SVD effectively generalises the eigendecomposition to non-square matrices, factorising a matrix $\mathbf{X} \in \mathbb{R}^{N\times D}$ as $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where $\mathbf{U} \in \mathbb{R}^{N\times N}$ and $\mathbf{V} \in \mathbb{R}^{D\times D}$ are orthonormal, and $\mathbf{S} \in \mathbb{R}^{N\times D}$ is a rectangular diagonal matrix of singular values (padded with zero rows). Notice that the covariance matrix of $\mathbf{X}$ (omitting the $\frac{1}{N}$ factor, which only rescales the eigenvalues) satisfies $$\mathbf{\Sigma} = \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{S}^T\mathbf{U}^T\mathbf{U}\mathbf{S}\mathbf{V}^T = \mathbf{V}\mathbf{S}^T\mathbf{S}\mathbf{V}^T,$$ which is an eigenvalue decomposition whose eigenvalues are the squared singular values. Hence, there is a strong link between PCA and SVD: the principal components are the columns of $\mathbf{V}$, i.e. $\mathbf{W} = \mathbf{V}$ (restricted to its first $M$ columns), and $\mathbf{Z} = \mathbf{X}\mathbf{W} = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V} = \mathbf{U}\mathbf{S}$ are the projected data. Therefore, we can use SVD to perform PCA.
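Before moving on to images, here is an optional sanity check (not part of the original lab) that the SVD route gives the same projection as before, up to the sign of each component. It assumes X_scaled and the 10-component pca object from the previous section are still in memory:
# PCA via SVD on the standardised decathlon data
U_d, S_d, Vt_d = np.linalg.svd(X_scaled, full_matrices=False)
Z_svd = (U_d * S_d)[:, :2] # equals X_scaled.dot(Vt_d.T)[:, :2]
Z_skl = pca.transform(X_scaled)[:, :2] # first two components from scikit-learn
print(np.allclose(np.abs(Z_svd), np.abs(Z_skl))) # identical up to sign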
For this task, we will use a classic image from image analysis, "Lena". We start by loading and visualising the image with matplotlib's imread() and imshow() functions:
# scipy.misc.imread was removed in recent SciPy versions, so we use matplotlib's imread instead
img = plt.imread('data/lena.jpg').astype(float)
plt.imshow(img, cmap='gray')
plt.axis('off')
plt.show()
Question: We should first normalise the data by subtracting the mean of each column: $\mathbf{x}'_{:, j} = \mathbf{x}_{:, j} - \frac{1}{N} \sum_{i=1}^N x_{ij}$
# TODO: normalise the data prior to SVD
img -= np.mean(img, axis=0)
Question: use the svd()
function from numpy.linalg to perform SVD.
# TODO: SVD in numpy. The function should return a tuple U, S, V
U, S, V = linalg.svd(img) # note: numpy returns the third factor already transposed, i.e. this V is V^T
Question: Reconstruct the Lena image from the factorisation. You should plot the reconstruction to check.
# TODO: reconstruct Lena
Z = U.dot(np.diag(S)).dot(V)
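The question asks us to plot the reconstruction as a check; a minimal way to do so, reusing the imshow call from above:
# the reconstruction should be visually and numerically identical to the (normalised) image
plt.imshow(Z, cmap='gray')
plt.axis('off')
plt.show()
print(np.allclose(Z, img)) # True: the reconstruction is exact up to floating point error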
To create a lower-rank PCA projection, take the $M$ largest singular values in $\mathbf{S}$ and the corresponding vectors of $\mathbf{U}$ and $\mathbf{V}$. This is equivalent to keeping the eigenvectors with the $M$ largest eigenvalues in PCA, and is known as truncated SVD. Note that we are doing PCA on the rows of the image; these are not independent samples, but this use of PCA as a form of lossy compression is interesting.
fig = plt.figure(figsize=(6, 6))
for i, M in enumerate([5, 10, 20, 100]):
    # TODO: reconstruct X from truncated matrices
    Z = U[:, :M].dot(np.diag(S[:M])).dot(V[:M])
    ax = fig.add_subplot(2, 2, i + 1)
    ax.imshow(Z, cmap='gray')
    ax.set_title('No. components = %d' % M)
    ax.axis('off')
plt.tight_layout()
plt.show()
Question: What do you observe? How much compression (in terms of floating point numbers stored) is achieved in each case?
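A sketch of how one might quantify the compression (this is our own addition; it assumes we store U[:, :M], S[:M] and the first M rows of V as plain floating point numbers):
# number of floats stored for each truncated reconstruction vs. the full image
N, D = img.shape
for M in [5, 10, 20, 100]:
    stored = N * M + M + M * D # entries of U[:, :M], S[:M] and V[:M]
    print('M = %3d: %d floats instead of %d (ratio %.1f%%)'
          % (M, stored, N * D, 100.0 * stored / (N * D)))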