2017-10-02 Principal Components Analysis¶

1. How to use Jupyter¶

All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).

Here are some tips on using a Jupyter notebook:

  • Each block of text is contained in a cell. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the guide.
  • You can run a cell by clicking inside it and hitting Shift+Enter (or the play button in the toolbar).
In [1]:
2 + 2  # hit Shift+Enter to run
Out[1]:
4
  • If you want to create a new cell below the one you're running, hit Alt+Enter (or the plus button in the toolbar).

Some tips on using a Jupyter notebook with Python:

  • A notebook behaves like an interactive Python shell! This means that
    • classes, functions, and variables defined at the cell level have global scope throughout the notebook
    • hitting Tab will autocomplete the keyword you have started typing
    • typing a question mark after a function name will load the interactive help for this function.
  • Jupyter has special Python commands (shortcuts, if you will) called magics. For instance, %%bash will allow you to run bash code, %paste will allow you to paste a block of code while retaining its formatting, and %matplotlib inline will configure the matplotlib visualisation library to display its plots inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html (a quick demo follows this list).
  • Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html
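
For instance, a cell like the following (just a sketch to try out, not part of the lab) demonstrates the interactive help and a magic; %lsmagic lists every available magic command:

import numpy as np
np.mean?    # open the interactive help for np.mean
%lsmagic    # list all available magic commands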

For more info on Jupyter: https://jupyter.org/

2. PCA of the Olympic Athletes data¶

In this lab, we will import data (./data/decathlon.txt) relating to the top performances in the Men's decathlon at the 2004 summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). (Both events were won by Roman Šebrle).

Data description¶

  • The data set consists of 41 observations (rows) and 13 variables (columns).
  • The first row of the file is a header describing the content of the columns; the remaining rows correspond to the 41 athletes considered in this dataset.
  • Columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for each event of the decathlon and columns 11 and 12 correspond respectively to the rank and the points obtained.
  • The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).

Loading and manipulating the data with pandas¶

pandas is a data analysis library for Python. With pandas we can import our Olympics athletes data into a structured object called a data frame, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas.

In [2]:
import pandas as pd
my_data = pd.read_csv('data/decathlon.txt', sep="\t")  # load data
In [3]:
print(type(my_data))  # display my_data data type
<class 'pandas.core.frame.DataFrame'>
In [4]:
my_data.head(n=5)  # adjust n to view more data
Out[4]:
100m Long.jump Shot.put High.jump 400m 110m.hurdle Discus Pole.vault Javeline 1500m Rank Points Competition
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75 5.02 63.19 291.7 1 8217 Decastar
CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72 4.92 60.15 301.5 2 8122 Decastar
KARPOV 11.02 7.30 14.77 2.04 48.37 14.09 48.95 4.92 50.31 300.2 3 8099 Decastar
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87 5.32 62.77 280.1 4 8067 Decastar
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26 4.72 63.44 276.4 5 8036 Decastar

Accessing data¶

  • We can select a column by name. Note the returned object is also a pandas object (a Series, pandas' one-dimensional labelled array), so we can use the head() method to view only the first few rows.
In [5]:
my_data['100m'].head()
Out[5]:
SEBRLE     11.04
CLAY       10.76
KARPOV     11.02
BERNARD    11.02
YURKOV     11.34
Name: 100m, dtype: float64
  • Or pass a list of names to select multiple columns.
In [6]:
columns = ['100m', '400m']
my_data[columns].head()
Out[6]:
100m 400m
SEBRLE 11.04 49.81
CLAY 10.76 49.37
KARPOV 11.02 48.37
BERNARD 11.02 48.93
YURKOV 11.34 50.42
  • We can select rows satisfying one or more conditions by passing a boolean Series.
In [7]:
my_data[my_data['Competition']=='OlympicG'].head()
Out[7]:
100m Long.jump Shot.put High.jump 400m 110m.hurdle Discus Pole.vault Javeline 1500m Rank Points Competition
Sebrle 10.85 7.84 16.36 2.12 48.36 14.05 48.72 5.0 70.52 280.01 1 8893 OlympicG
Clay 10.44 7.96 15.23 2.06 49.19 14.13 50.11 4.9 69.71 282.00 2 8820 OlympicG
Karpov 10.50 7.81 15.93 2.09 46.81 13.97 51.65 4.6 55.54 278.11 3 8725 OlympicG
Macey 10.89 7.47 15.73 2.15 48.97 14.56 48.34 4.4 58.46 265.42 4 8414 OlympicG
Warners 10.62 7.74 14.48 1.97 47.97 14.01 43.73 4.9 55.39 278.05 5 8343 OlympicG
  • To index a row by its label, we can use the data frame's loc indexer. This behaves like a dictionary whose keys are the data frame's index labels.
In [8]:
my_data.loc['SEBRLE']
Out[8]:
100m              11.04
Long.jump          7.58
Shot.put          14.83
High.jump          2.07
400m              49.81
110m.hurdle       14.69
Discus            43.75
Pole.vault         5.02
Javeline          63.19
1500m             291.7
Rank                  1
Points             8217
Competition    Decastar
Name: SEBRLE, dtype: object
In [9]:
my_data.count()  # summarise counts of data
Out[9]:
100m           41
Long.jump      41
Shot.put       41
High.jump      41
400m           41
110m.hurdle    41
Discus         41
Pole.vault     41
Javeline       41
1500m          41
Rank           41
Points         41
Competition    41
dtype: int64

Manipulating data¶

In [10]:
list(my_data.columns)  # get the names of the columns
Out[10]:
['100m',
 'Long.jump',
 'Shot.put',
 'High.jump',
 '400m',
 '110m.hurdle',
 'Discus',
 'Pole.vault',
 'Javeline',
 '1500m',
 'Rank',
 'Points',
 'Competition']
In [11]:
print(my_data.shape)  # get the shape (rows x columns)
(41, 13)
In [12]:
print(my_data.values)  # get the content as a numpy array
[[11.04 7.58 14.83 2.07 49.81 14.69 43.75 5.02 63.19 291.7 1 8217
  'Decastar']
 [10.76 7.4 14.26 1.86 49.37 14.05 50.72 4.92 60.15 301.5 2 8122 'Decastar']
 [11.02 7.3 14.77 2.04 48.37 14.09 48.95 4.92 50.31 300.2 3 8099 'Decastar']
 [11.02 7.23 14.25 1.92 48.93 14.99 40.87 5.32 62.77 280.1 4 8067
  'Decastar']
 [11.34 7.09 15.19 2.1 50.42 15.31 46.26 4.72 63.44 276.4 5 8036 'Decastar']
 [11.11 7.6 14.31 1.98 48.68 14.23 41.1 4.92 51.77 278.1 6 8030 'Decastar']
 [11.13 7.3 13.48 2.01 48.62 14.17 45.67 4.42 55.37 268.0 7 8004 'Decastar']
 [10.83 7.31 13.76 2.13 49.91 14.38 44.41 4.42 56.37 285.1 8 7995
  'Decastar']
 [11.64 6.81 14.57 1.95 50.14 14.93 47.6 4.92 52.33 262.1 9 7802 'Decastar']
 [11.37 7.56 14.41 1.86 51.1 15.06 44.99 4.82 57.19 285.1 10 7733
  'Decastar']
 [11.33 6.97 14.09 1.95 49.48 14.48 42.1 4.72 55.4 282.0 11 7708 'Decastar']
 [11.33 7.27 12.68 1.98 49.2 15.29 37.92 4.62 57.44 266.6 12 7651
  'Decastar']
 [11.36 6.8 13.46 1.86 51.16 15.67 40.49 5.02 54.68 291.7 13 7313
  'Decastar']
 [10.85 7.84 16.36 2.12 48.36 14.05 48.72 5.0 70.52 280.01 1 8893
  'OlympicG']
 [10.44 7.96 15.23 2.06 49.19 14.13 50.11 4.9 69.71 282.0 2 8820 'OlympicG']
 [10.5 7.81 15.93 2.09 46.81 13.97 51.65 4.6 55.54 278.11 3 8725 'OlympicG']
 [10.89 7.47 15.73 2.15 48.97 14.56 48.34 4.4 58.46 265.42 4 8414
  'OlympicG']
 [10.62 7.74 14.48 1.97 47.97 14.01 43.73 4.9 55.39 278.05 5 8343
  'OlympicG']
 [10.91 7.14 15.31 2.12 49.4 14.95 45.62 4.7 63.45 269.54 6 8287 'OlympicG']
 [10.97 7.19 14.65 2.03 48.73 14.25 44.72 4.8 57.76 264.35 7 8237
  'OlympicG']
 [10.8 7.53 14.26 1.88 48.81 14.8 42.05 5.4 61.33 276.33 8 8235 'OlympicG']
 [10.69 7.48 14.8 2.12 49.13 14.17 44.75 4.4 55.27 276.31 9 8225 'OlympicG']
 [10.98 7.49 14.01 1.94 49.76 14.25 42.43 5.1 56.32 273.56 10 8102
  'OlympicG']
 [10.95 7.31 15.1 2.06 50.79 14.21 44.6 5.0 53.45 287.63 11 8084 'OlympicG']
 [10.9 7.3 14.77 1.88 50.3 14.34 44.41 5.0 60.89 278.82 12 8077 'OlympicG']
 [11.14 6.99 14.91 1.94 49.41 14.37 44.83 4.6 64.55 267.09 13 8067
  'OlympicG']
 [10.85 6.81 15.24 1.91 49.27 14.01 49.02 4.2 61.52 272.74 14 8023
  'OlympicG']
 [10.55 7.34 14.44 1.94 49.72 14.39 39.88 4.8 54.51 271.02 15 8021
  'OlympicG']
 [10.68 7.5 14.97 1.94 49.12 15.01 40.35 4.6 59.26 275.71 16 8006
  'OlympicG']
 [10.89 7.07 13.88 1.94 49.11 14.77 42.47 4.7 60.88 263.31 17 7993
  'OlympicG']
 [11.06 7.34 13.55 1.97 49.65 14.78 45.13 4.5 60.79 272.63 18 7934
  'OlympicG']
 [10.87 7.38 13.07 1.88 48.51 14.01 40.11 5.0 51.53 274.21 19 7926
  'OlympicG']
 [11.14 6.61 15.69 2.03 51.04 14.88 41.9 4.8 65.82 277.94 20 7918
  'OlympicG']
 [10.92 6.94 15.15 1.94 49.56 15.12 45.62 5.3 50.62 290.36 21 7893
  'OlympicG']
 [11.08 7.26 14.57 1.85 48.61 14.41 40.95 4.4 60.71 269.7 22 7865
  'OlympicG']
 [11.08 6.91 13.62 2.03 51.67 14.26 39.83 4.8 59.34 290.01 23 7708
  'OlympicG']
 [11.1 7.03 13.22 1.85 49.34 15.38 40.22 4.5 58.36 263.08 24 7592
  'OlympicG']
 [11.33 7.26 13.3 1.97 50.54 14.98 43.34 4.5 52.92 278.67 25 7583
  'OlympicG']
 [10.86 7.07 14.81 1.94 51.16 14.96 46.07 4.7 53.05 317.0 26 7573
  'OlympicG']
 [11.23 6.99 13.53 1.85 50.95 15.09 43.01 4.5 60.0 281.7 27 7495 'OlympicG']
 [11.36 6.68 14.92 1.94 53.2 15.39 48.66 4.4 58.62 296.12 28 7404
  'OlympicG']]
In [13]:
print(my_data.dtypes)  # get the data type (dtype) of each column
100m           float64
Long.jump      float64
Shot.put       float64
High.jump      float64
400m           float64
110m.hurdle    float64
Discus         float64
Pole.vault     float64
Javeline       float64
1500m          float64
Rank             int64
Points           int64
Competition     object
dtype: object

Visualisation¶

To create visualisations, we'll use matplotlib, the standard plotting library for Python. matplotlib can be used in several ways through its built-in interface, pyplot, ranging from a high-level state-machine environment to a lower-level object-oriented approach; the latter is typically recommended. Another interface, pylab, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).
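
To make the difference concrete, here is a minimal sketch (illustrative only, not part of the lab) of the same plot in both styles:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 2 * np.pi, 100)

# state-machine style: pyplot tracks the "current" figure and axes
plt.plot(xs, np.sin(xs))
plt.title('pyplot state-machine interface')
plt.show()

# object-oriented style: we hold explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(xs, np.sin(xs))
ax.set_title('object-oriented interface')
plt.show()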

We also use a Jupyter magic command for inline plotting.

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):

In [15]:
from IPython.display import Image, set_matplotlib_formats 
set_matplotlib_formats('pdf', 'svg') # toggle vector graphics for a crisp plot!
In [16]:
# basic visualization: athletes' performances depending on two disciplines
my_data.plot(kind='scatter', x='400m', y='Shot.put', s=50, color='blue')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1121bc910>
In [17]:
# A scatterplot matrix allows us to visualize:
#   * on the diagonal, the density estimation for each of the features
#   * on each of the off-diagonal plots, a scatterplot between two of the features. 
#     Each dot represents a sample.

from pandas.plotting import scatter_matrix
#from pandas.tools.plotting import scatter_matrix # if pandas older than 0.20
scatter_matrix(my_data.get(['Shot.put','High.jump', '400m']), alpha=0.2,
               figsize=(6, 6), diagonal='kde');
In [18]:
# fancier plot with seaborn : https://seaborn.pydata.org/
import seaborn.apionly as sns
sns.set_style('whitegrid')

sns.jointplot('Shot.put', 'High.jump', data=my_data,
              kind='kde', size=6, space=0, color='b')

# looking at correlated features
sns.jointplot('Shot.put', 'Discus', data=my_data,
              kind='reg', size=6, space=0, color='b')
Out[18]:
<seaborn.axisgrid.JointGrid at 0x1165c4e90>

Cleaning data¶

In [19]:
# Remove columns we don't need
data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)

# Verify new column headers
data.columns
Out[19]:
Index(['100m', 'Long.jump', 'Shot.put', 'High.jump', '400m',
       '110m.hurdle', 'Discus', 'Pole.vault', 'Javeline', '1500m'],
      dtype='object')

Find the first two principal components of the data¶

We will first implement PCA ourselves. Recall PCA of $\mathbf{X} \in \mathbb{R}^{N \times D}$ finds $M$ principal components as the columns of $\mathbf{W} \in \mathbb{R}^{D \times M}$ and a matrix of projected data $\mathbf{Z} \in \mathbb{R}^{N \times M}$, such that $\mathbf{Z}\mathbf{W}^T$ minimises the reconstruction error of $\mathbf{X}$ for the given choice of $M$ (equivalently, the factorisation maximally explains the variance in $\mathbf{X}$). The principal components $\mathbf{W}$ correspond to the eigenvectors with the $M$ largest eigenvalues of the empirical covariance matrix $\boldsymbol\Sigma = \frac{1}{N}\mathbf{X}^T\mathbf{X}$ (assuming $\mathbf{X}$ has been centred).

Question: Recall that PCA works best on standardised data (mean 0, standard deviation 1). Standardise the data. Hint: you should aim to use numpy's vector-based operations.

In [20]:
import numpy as np

# transform the data into a numpy array
X = data.values

# TODO: standardise the data
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_centered = (X - X_mean) / X_std
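
As a quick sanity check (optional, not required by the lab), every column should now have mean 0 and standard deviation 1:

print(np.allclose(X_centered.mean(axis=0), 0))  # columns have zero mean
print(np.allclose(X_centered.std(axis=0), 1))   # columns have unit standard deviation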

Question: Find the first two principal components and project the data. These correspond to the eigenvectors with the two largest eigenvalues of the empirical covariance matrix, $\boldsymbol\Sigma = \frac{1}{N}\mathbf{X}^T\mathbf{X}$.

In [21]:
from numpy import linalg

# TODO: calculate the covariance matrix with numpy
N = X_centered.shape[0]
X_cov = X_centered.T.dot(X_centered) / N

# TODO: find its first two principal components
w, v = linalg.eig(X_cov)
# N.B. eig does not sort its output: order the eigenvectors by decreasing eigenvalue
order = np.argsort(w)[::-1]
w, v = w[order[:2]], v[:, order[:2]]

# TODO: project X onto the principal components
X_projected = X_centered.dot(v)
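
Optionally, we can sanity-check the reconstruction property stated above: $\mathbf{Z}\mathbf{W}^T$ approximates $\mathbf{X}$, and the error would shrink if we kept more components (a quick sketch):

X_rec = X_projected.dot(v.T)               # Z W^T, the rank-2 reconstruction of X
print(np.mean((X_centered - X_rec) ** 2))  # mean squared reconstruction error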

Let's display the projected data. Plots display inline automatically thanks to the %matplotlib inline magic we ran earlier.

In [22]:
# create figure and axis objects
fig, ax = plt.subplots(figsize=(6, 5))
# create scatterplot on axis N.B. we record the return value to feed to the colorbar
cax = ax.scatter(X_projected[:, 0], X_projected[:, 1], c=my_data['Rank'],
                 cmap=plt.get_cmap('viridis'))
# Set axis limits
ax.set_xlim([-5.5, 5.5])
ax.set_ylim([-4, 4])
# Create color bar
plt.colorbar(cax, label='Rank')
Out[22]:
<matplotlib.colorbar.Colorbar at 0x116bb14d0>

Question: can you recognize a pattern? Describe what you see.

Answer: Good performances (dark blue points) are on the right side of the graph (PC1 > 0) and bad performances (yellow points) on the left side (PC1 < 0).

Use scikit-learn to find the PCs¶

In this course, we will rely heavily on the scikit-learn machine learning toolbox, which implements most classical (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs, and compare the results to what we got before. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [23]:
from sklearn import decomposition, preprocessing

std_scale = preprocessing.StandardScaler().fit(X)
X_scaled = std_scale.transform(X)

pca = decomposition.PCA(n_components=2)
pca.fit(X_scaled)
Out[23]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Question: Use pca.transform to project the data onto its principal components.

In [24]:
# TODO: project X on principal components
X_projected = pca.transform(X_scaled)

pca.explained_variance_ratio_ gives the fraction of the total variance explained by each of the components.

In [25]:
print(pca.explained_variance_ratio_)
[ 0.32719055  0.1737131 ]

Question: How is pca.explained_variance_ratio_ computed? Check this is the case by computing it yourself.

In [26]:
tot_var = np.var(X_scaled, axis=0).sum()
print((1 / tot_var) * np.var(X_projected, axis=0))
[ 0.32719055  0.1737131 ]


Question: Plot the fraction of variance explained by each of the first 10 principal components.

In [27]:
# TODO: calculate the first ten components
pca = decomposition.PCA(n_components=10)
pca.fit(X_scaled)

plt.bar(np.arange(10), pca.explained_variance_ratio_, color='blue')
plt.xlim([-1, 9])
plt.xlabel("Number of PCs", fontsize=16)
plt.ylabel("Fraction of variance explained", fontsize=16)
Out[27]:
<matplotlib.text.Text at 0x11623df90>

To better understand the information captured by the principal components, we can consider pca.components_. Its rows are the principal components, i.e. the columns of $\mathbf{W}$ (here we have refit with $M = 10$; we use only the first two below).

In [28]:
pcs = pca.components_
print(pcs[0])
[-0.42829627  0.41015201  0.34414444  0.31619436 -0.3757157  -0.41255442
  0.30542571  0.02783081  0.15319802 -0.03210733]

We can display each row of $\mathbf{W}$ in a 2D plot whose x-axis gives its contribution to the first component and y-axis its contribution to the second component. Note, whereas before we were visualising the projected data, $\mathbf{Z}$, now we are visualising the principal directions (the loadings), $\mathbf{W}$. This indicates how the features cluster: if a pair of feature loadings are close, observations will tend to be similarly valued over those features.

In [29]:
fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlim([-0.7, 0.7])
ax.set_ylim([-0.7, 0.7])

for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):
    # plot line between origin and point (x, y)
    ax.plot([0, x], [0, y], color='k')
    # display the label of the point
    ax.text(x, y, data.columns[i], fontsize='14')

Question: based on the two previous graphs, can you find a meaning for the two principal components?

Answer: PC1: variables with a negative weight on PC1 correspond to disciplines for which a good performance is a small number (the running times), and variables with a positive weight on PC1 correspond to disciplines for which a good performance is a high number (the distances and heights). This component separates athletes with excellent performances in every discipline from those with relatively worse performances, as we noticed before.

The second principal component separates athletes who are strong (good at the throwing events) but less enduring (lower performances in the running events) from the others.

Finally, this last graph shows correlations between variables, such as Discus (discus throw) and Shot.put (shot put).

3. Singular Value Decomposition for PCA¶

We have seen above that PCA can be performed by computing the eigendecomposition of the covariance matrix. An alternative way of performing PCA is with singular value decomposition (SVD) (https://en.wikipedia.org/wiki/Singular_value_decomposition). SVD effectively generalises the eigendecomposition to non-square matrices, factorising a matrix $\mathbf{X} \in \mathbb{R}^{N\times D}$ as $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where $\mathbf{U} \in \mathbb{R}^{N\times N}$ and $\mathbf{V} \in \mathbb{R}^{D\times D}$ are orthonormal, and $\mathbf{S} \in \mathbb{R}^{N\times D}$ is a rectangular diagonal matrix of singular values (padded with zero rows). Notice that the (unnormalised) covariance matrix of $\mathbf{X}$ is $$\mathbf{\Sigma} = \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{S}^T\mathbf{U}^T\mathbf{U}\mathbf{S}\mathbf{V}^T = \mathbf{V}\mathbf{S}^T\mathbf{S}\mathbf{V}^T,$$ which is an eigendecomposition with eigenvalues $\mathbf{S}^T\mathbf{S}$. Hence, there is a strong link between PCA and SVD: $\mathbf{W} = \mathbf{V}$ holds the principal components, and $\mathbf{Z} = \mathbf{X}\mathbf{W} = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V} = \mathbf{U}\mathbf{S}$ is the projected data. Therefore, we can use SVD to perform PCA.
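
As a quick numerical illustration of this link (a sketch on random data, not part of the lab):

import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.randn(20, 5)
X_demo = X_demo - X_demo.mean(axis=0)                 # centre the columns

U_d, S_d, Vt_d = np.linalg.svd(X_demo, full_matrices=False)
evals, evecs = np.linalg.eigh(X_demo.T.dot(X_demo))   # eigh sorts eigenvalues in ascending order

print(np.allclose(np.sort(S_d ** 2), evals))          # squared singular values are the eigenvalues
print(np.allclose(X_demo.dot(Vt_d.T), U_d * S_d))     # X V equals U S, the projected data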

Loading image data¶

For this task, we will use a classic image from image analysis, "Lena". We start by loading and visualising the image with scipy's imread() and matplotlib's imshow() functions:

In [30]:
import scipy.misc
img = scipy.misc.imread('data/lena.jpg').astype(float)
plt.imshow(img, cmap='gray')
plt.axis('off')
plt.show()

Normalise the data¶

Question: We should first normalise the data by subtracting the mean of each column: $\mathbf{x}'_{:, j} = \mathbf{x}_{:, j} - \frac{1}{N} \sum_{i=1}^N x_{ij}$

In [31]:
# TODO: normalise the data prior to SVD
img -= np.mean(img, axis=0)

Perform SVD¶

Question: use the svd() function from numpy.linalg to perform SVD.

In [32]:
# TODO: SVD in numpy. numpy returns the tuple (U, S, V^T), with S as a 1-D array of singular values
U, S, Vt = linalg.svd(img)

Question: Reconstruct the Lena image from the factorisation. You should plot the reconstruction to check.

In [33]:
# TODO: reconstruct Lena
Z = U.dot(np.diag(S)).dot(Vt)
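
As the question asks, we plot the reconstruction to check it matches the original:

plt.imshow(Z, cmap='gray')
plt.axis('off')
plt.show()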

Truncated SVD¶

To create a lower-rank PCA projection, take the $M$ largest singular values in $\mathbf{S}$ and the corresponding vectors of $\mathbf{U}$ and $\mathbf{V}$. This is equivalent to keeping the eigenvectors with the $M$ largest eigenvalues in PCA, and is known as truncated SVD. Note that we are treating the rows of the image as the samples (they are not independent, but this use of PCA as a form of lossy compression is instructive).

In [34]:
fig = plt.figure(figsize=(6, 6))

for i, M in enumerate([5, 10, 20, 100]):
    # TODO: reconstruct X from truncated matrices
    Z = U[:, :M].dot(np.diag(S[:M])).dot(Vt[:M])
    
    ax = fig.add_subplot(2, 2, i + 1)
    ax.imshow(Z, cmap='gray')
    ax.set_title('No. components = %d' % M)
    ax.axis('off')

plt.tight_layout()
plt.show()

Question: What do you observe? How much compression (in terms of floating point numbers stored) is achieved in each case?
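
As a starting point for the compression count (a sketch: it assumes we store U[:, :M], S[:M] and Vt[:M] as floats for an N x D image):

# floats stored by the rank-M truncation, versus the N * D floats of the full image
N, D = img.shape
for M in [5, 10, 20, 100]:
    stored = N * M + M + M * D        # U[:, :M], S[:M], Vt[:M]
    print('M = %3d: %8d floats vs %d (%.1f%% of the original)'
          % (M, stored, N * D, 100.0 * stored / (N * D)))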