Morphological Class Discovery¶

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import seaborn.apionly as sns
import matplotlib.pyplot as plt
import time
from IPython.display import Image, set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')

from src import stats
from src.feature_reader import FeatureReader
from src.class_discovery import ClassDiscovery
from src.vis_utils import ImageCropper, FeaturePlotter

The goal of the morphological characterisation of the drug screen cells is to define a way of quantitatively describing the phenotype in terms of cellular and nuclear morphology. In the drug screen, however, it was difficult to define a set of classes with a clear biological meaning. For this reason, we tried to identify classes useful for training classifiers. The ultimate hope is to develop models that transfer across cell lines, eliminating much of the intermediate analysis between the raw images and the final outputs.

The feature extraction of Cell Cognition produces 517 features (239 on DAPI, 239 on Cy5, and 39 on Cy3). For MDA231 we have 122307 cells, and for MDA468 we have 97134. We use unsupervised learning on the respective datasets to identify cell classes. Our approach is depicted below. The dataset features are first normalised by subtracting their respective means and dividing by their standard deviations. Noting the high dimensionality and the high correlation between some of the features extracted by Cell Cognition [for example, many shape features such as area, perimeter, and convex hull will increase together], we apply PCA to decorrelate the data and retain only the most prominent dimensions of the manifold.

The dataset nevertheless remains too large for certain clustering algorithms, so we subsample it. Rather than direct random sampling, we do the following: first, we apply k-means to the dataset with a large number of clusters, expecting this to identify small, isolated clusters that might be lost by random sampling alone; then, we create a subsample by sampling evenly (insofar as is possible) from each cluster. With this new, smaller dataset, we apply three clustering algorithms robust to anisotropic data: DBSCAN, spectral clustering, and hierarchical clustering. We visualise the discovered clusters of each of these algorithms with t-SNE and by cropping sample nuclei. A satisfactory clustering outcome may be written to annotation files to then be viewed in Cell Cognition. The ultimate specification of morphological classes must be decided by experts, with unsupervised learning providing a starting point.

In [2]:
Image('img/clustering_flowchart.png')
Out[2]:
In [3]:
# base_folder = '/Users/jcboyd/Documents/drug_screen/'
base_folder = '/Users/jcboyd/Documents/drug_screen_widefield/'

plate = '22_384_20X-hNA_D_F_C3_C5_20160031_2016.01.25.17.23.13_MDA231'
# plate = '22_384_20X-hNA_D_F_C3_C5_20160032_2016.01.25.16.27.22_MDA468'
# plate = '25D_20_384_20X_D_F_C3_C5_Decon_20160031_2'
# plate = '25D_20_384_20X_D_F_C3_C5_Decon_20160032_1'

cecog_out = 'cecog_out_expanded_from_primary'

ch5_folder = os.path.join(base_folder, '%s/%s/hdf5/' % (cecog_out, plate))
classifier_folder = os.path.join(base_folder, 'cecog_classifiers', 'clustering2')
channels = ['primary__primary4', 'secondary__expanded', 'tertiary__expanded']

Feature extraction¶

In [4]:
cd = ClassDiscovery(plate, ch5_folder, channels)
cd.read_all()
cd.save_data()
Reading cell data...
Reading cell centers...
Reading primary__primary4 features...
Reading secondary__expanded features...
Reading tertiary__expanded features...
Removing invalid data...
Done!
Saving data...
Done!
In [5]:
X = cd.data[cd.feature_names].as_matrix().astype('float')
num_cells, num_features = X.shape
print 'Shape: %s rows, %s cols' % (num_cells, num_features)
print 'Size: %.02f Mb' % (sys.getsizeof(X) / 10 ** 6)
Shape: 108132 rows, 476 cols
Size: 411.00 Mb

Anomalies (MDA231)¶

The following cells may be deemed anomalous: A02_04_38 (large), A04_03_16 (wtf), A07_02_122 (trinuclear), A13_02_82 (monster), A15_01_39 (binuclear), B02_03_67 (saturation), B17_01_5 (odd membrane), H18_03_25 (membrane warts), N24_02_14 (nuclear waste). Honourable mentions: B12_02 (sniper), D01_02 (dracula), F04_04 (death ray), E09_04 (tapis rouge), M01_01 (ring of power), M15_01 (freak central).

Data preprocessing¶

We first normalise our dataset by subtracting from each column (feature) its mean, and dividing by its standard deviation.

In [6]:
"""
Normalise dataset
"""
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
In [7]:
"""
Compute covariance matrix
This is also the correlation matrix as the data has been standardised.
"""
cov = X.T.dot(X) / X.shape[0]
In [8]:
"""
Feature selection removing highly correlated features
"""
X_selected = stats.remove_correlated_fearures(X, cov)
X_selected.shape
Out[8]:
(108132, 98)

Dimensionality reduction¶

Before proceeding, we reduce the dimensionality of the data. Two techniques are tried: PCA and autoencoders.

Principal components analysis¶

Principal components analysis (PCA) is a technique for identifying the principal components of a matrix, $\mathbf{X} \in \mathbb{R}^{N \times D}$. Should we wish to project the observations in $\mathbf{X}$ to $\mathbf{Z} \in \mathbb{R}^{N \times M}$, $M < D$, we can do so by finding $\mathbf{W} \in \mathbb{R}^{D \times M}$ such that $\mathbf{X}\mathbf{W} \approx \mathbf{Z}$, that is, $\mathbf{W}$ maps $\mathbf{X}$ to the lower-dimensional $\mathbf{Z}$. If we restrict our choice to orthonormal $\mathbf{W}$, we have the factorisation $\mathbf{X} \approx \mathbf{Z}\mathbf{W}^T$. It may be shown that the principal components of $\mathbf{X}$ form the eigenbasis of the empirical covariance matrix, $\boldsymbol\Sigma = \frac{1}{N}\mathbf{X}^T\mathbf{X}$, and that the choice of $\mathbf{W}$ minimising the reconstruction error of $\mathbf{X}$ consists of the eigenvectors corresponding to the $M$ largest eigenvalues of $\boldsymbol\Sigma$. A useful interpretation is that the principal components correspond to the trends in the data, fitting an orthogonal basis optimally through the cloud of points. These trend lines can then be thought of as latent variables underlying the observations.

Should we choose $M = D$, the mapping $\mathbf{Z} = \mathbf{X}\mathbf{W}$ decorrelates the data without reducing its dimension, that is, it rotates $\mathbf{X}$ such that its principal components are aligned with the Cartesian axes, and the features are no longer correlated. This may be seen by showing that the covariance matrix of $\mathbf{Z}$ must be diagonal. Should we additionally divide by the square roots of the eigenvalues in $\mathbf{S}$, we would scale the dimensions of the projected matrix to unit variance. This is called whitening. Whitening, however, runs the risk of promoting smaller, noisy dimensions.

In [9]:
"""
Compute singular-value decomposition
"""
U, S, V = np.linalg.svd(cov)

"""
Decorrelate data -- cov = USV^T with U = V (cov symmetric), so Z = XU gives Z^TZ / N = U^T cov U = S (diagonal covariance)
"""
Z = X.dot(U)
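
"""
Sanity check (illustrative): the covariance of the decorrelated data should be
(numerically) diagonal, with the eigenvalues of the original covariance matrix
on the diagonal.
"""
cov_Z = Z.T.dot(Z) / Z.shape[0]
print 'Max off-diagonal magnitude: %.2e' % np.abs(cov_Z - np.diag(np.diag(cov_Z))).max()
print 'Diagonal close to eigenvalues: %s' % np.allclose(np.diag(cov_Z), S, atol=1e-6)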

We then use truncated SVD to significantly reduce dimensionality, trying different numbers of principal components. It may also be shown that the PCA factorisation is equivalent to the singular value decomposition (SVD) of $\mathbf{X}$. In SVD, we factorise $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where $\mathbf{V}$ is equivalent to the eigenmatrix of $\boldsymbol\Sigma$, and therefore also contains the principal components of $\mathbf{X}$. With SVD, $\mathbf{X}$ as a linear mapping may be seen as a rotation by $\mathbf{V}^T$, then a scaling by diagonal $\mathbf{S}$, and finally a further rotation by $\mathbf{U}$.

In [10]:
"""
Truncated SVD
"""
from sklearn.decomposition import PCA

NUM_COMPONENTS = 20

pca_model = PCA(n_components=NUM_COMPONENTS)
Z_trunc = pca_model.fit_transform(X)  # equivalently X.dot(U[:, :NUM_COMPONENTS]), up to signs

"""
Recall, minimising reconstruction error is equivalent to maximising variance.
"""
print 'Explained variance: %.02f' % pca_model.explained_variance_ratio_.cumsum()[-1]
Explained variance: 0.77

We also try whitening, though we do not find any improvement in the clustering results.

In [11]:
"""
Whiten data -- divide eigenbasis by eigenvalues.
Add small constant to avoid zero-division error.
Optionally increase constant to avoid promoting small, noisy components.
"""
Z_white = Z_trunc / np.sqrt(S[0:NUM_COMPONENTS] + 0.01)

Autoencoders¶

In [12]:
from keras.models import Model
from keras.layers import Input, Dense
Using TensorFlow backend.
In [13]:
input_data = Input(shape=(num_features,))
dense_1 = Dense(32, activation='relu')(input_data)
encoder = Model(inputs=input_data, outputs=dense_1)
dense_2 = Dense(num_features, activation='sigmoid')(dense_1)
ae1 = Model(inputs=input_data, outputs=dense_2)
ae1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 476)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                15264     
_________________________________________________________________
dense_2 (Dense)              (None, 476)               15708     
=================================================================
Total params: 30,972
Trainable params: 30,972
Non-trainable params: 0
_________________________________________________________________
In [14]:
input_data = Input(shape=(num_features,))
dense_1 = Dense(128, activation='relu')(input_data)
dense_2 = Dense(32, activation='relu')(dense_1)
encoder = Model(inputs=input_data, outputs=dense_2)
dense_3 = Dense(128, activation='relu')(dense_2)
dense_4 = Dense(num_features, activation='sigmoid')(dense_3)

ae2 = Model(inputs=input_data, outputs=dense_4)
ae2.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 476)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               61056     
_________________________________________________________________
dense_4 (Dense)              (None, 32)                4128      
_________________________________________________________________
dense_5 (Dense)              (None, 128)               4224      
_________________________________________________________________
dense_6 (Dense)              (None, 476)               61404     
=================================================================
Total params: 130,812
Trainable params: 130,812
Non-trainable params: 0
_________________________________________________________________
In [15]:
from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(X)

model = ae1  # alternatively ae2
model.compile(optimizer='adadelta', loss='mse')

model.fit(x_train, x_train, epochs=20, batch_size=256,
          shuffle=True, validation_data=(x_test, x_test))

# denoise = False

# noise = np.random.normal(scale=0.01, size=(x_train.shape))
# x_train_noisy = np.clip(x_train + noise, 0, 1)
# model.fit(x_train_noisy, x_train, epochs=20, batch_size=256,
#           shuffle=True, validation_data=(x_test, x_test))
Train on 81099 samples, validate on 27033 samples
Epoch 1/20
81099/81099 [==============================] - 3s 43us/step - loss: 0.9069 - val_loss: 0.8039
Epoch 2/20
81099/81099 [==============================] - 3s 33us/step - loss: 0.7636 - val_loss: 0.7518
Epoch 3/20
81099/81099 [==============================] - 2s 28us/step - loss: 0.7302 - val_loss: 0.7313
Epoch 4/20
81099/81099 [==============================] - 2s 25us/step - loss: 0.7152 - val_loss: 0.7203
Epoch 5/20
81099/81099 [==============================] - 2s 23us/step - loss: 0.7064 - val_loss: 0.7131
Epoch 6/20
81099/81099 [==============================] - 2s 27us/step - loss: 0.7001 - val_loss: 0.7077
Epoch 7/20
81099/81099 [==============================] - 2s 26us/step - loss: 0.6953 - val_loss: 0.7035
Epoch 8/20
81099/81099 [==============================] - 2s 29us/step - loss: 0.6916 - val_loss: 0.7002
Epoch 9/20
81099/81099 [==============================] - 2s 29us/step - loss: 0.6886 - val_loss: 0.6975
Epoch 10/20
81099/81099 [==============================] - 2s 27us/step - loss: 0.6862 - val_loss: 0.6953
Epoch 11/20
81099/81099 [==============================] - 2s 29us/step - loss: 0.6842 - val_loss: 0.6934
Epoch 12/20
81099/81099 [==============================] - 3s 32us/step - loss: 0.6824 - val_loss: 0.6918
Epoch 13/20
81099/81099 [==============================] - 3s 32us/step - loss: 0.6809 - val_loss: 0.6904
Epoch 14/20
81099/81099 [==============================] - 3s 31us/step - loss: 0.6796 - val_loss: 0.6890
Epoch 15/20
81099/81099 [==============================] - 2s 25us/step - loss: 0.6783 - val_loss: 0.6878
Epoch 16/20
81099/81099 [==============================] - 2s 27us/step - loss: 0.6771 - val_loss: 0.6867
Epoch 17/20
81099/81099 [==============================] - 3s 33us/step - loss: 0.6759 - val_loss: 0.6856
Epoch 18/20
81099/81099 [==============================] - 2s 28us/step - loss: 0.6749 - val_loss: 0.6846
Epoch 19/20
81099/81099 [==============================] - 2s 30us/step - loss: 0.6739 - val_loss: 0.6837
Epoch 20/20
81099/81099 [==============================] - 2s 30us/step - loss: 0.6730 - val_loss: 0.6828
Out[15]:
<keras.callbacks.History at 0x156cbb990>
In [16]:
X_encoded = encoder.predict(X)
print(X_encoded.shape)
(108132, 32)
In [17]:
"""
Set matrix used for analysis
"""
X = Z_trunc  # X = X_encoded # X = X_selected # X = Z_white
"""
Join to DataFrame
"""
normalised_feature_names = ['z%d' % i for i in range(X.shape[1])]
df_normalised_features = pd.DataFrame(X, columns=normalised_feature_names)
cd.data = pd.concat([cd.data, df_normalised_features], axis=1)

K-means clustering¶

k-means fits unlabeled data to one of $K$ clusters such that the within-cluster variance (centered on the Euclidean mean of each cluster), or inertia [the within-cluster sum of squares, the objective function of k-means], is minimised. This is equivalent to fitting the data with $K$ spherical Gaussians by maximum likelihood. We use both the standard (full-batch) k-means and mini-batch k-means algorithms. Mini-batch k-means takes a random subset of the data in each iteration, and usually converges much faster [note that in any case the standard optimiser, Lloyd's algorithm, gives only an approximate solution to k-means, the problem being hard to solve exactly]. The mini-batch solution is typically within 5% of the inertia of the full version, but executes up to several orders of magnitude faster. Note that k-means is also used as part of the spectral clustering algorithm.
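
For reference, the following is a minimal numpy sketch of Lloyd's algorithm, shown purely for illustration on a generic (small) data matrix X_demo; the analysis itself uses the scikit-learn implementations in the cells below.

import numpy as np

def lloyd_kmeans(X_demo, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Initialise centroids on k randomly chosen points
    centroids = X_demo[rng.choice(X_demo.shape[0], k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance)
        dists = ((X_demo[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([X_demo[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = dists.min(axis=1).sum()  # the k-means objective (within-cluster sum of squares)
    return labels, centroids, inertia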

In [18]:
from sklearn.cluster import KMeans, MiniBatchKMeans
In [19]:
NUM_CLUSTERS = 30
start_time = time.time()

# kmeans30 = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++').fit(X)
kmeans30 = MiniBatchKMeans(n_clusters=NUM_CLUSTERS, init='k-means++').fit(X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 0.39 seconds
In [20]:
NUM_CLUSTERS = 10
start_time = time.time()

# kmeans10 = KMeans(n_clusters=NUM_CLUSTERS, random_state=0).fit(X)
kmeans10 = MiniBatchKMeans(n_clusters=NUM_CLUSTERS, init='k-means++').fit(X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 0.34 seconds
In [21]:
NUM_CLUSTERS = 5
start_time = time.time()

kmeans5 = KMeans(n_clusters=NUM_CLUSTERS).fit(X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 9.58 seconds
In [22]:
start_time = time.time()

mini_batch_kmeans5 = MiniBatchKMeans(n_clusters=NUM_CLUSTERS).fit(X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 0.16 seconds
In [23]:
rel_err = abs(mini_batch_kmeans5.inertia_ - kmeans5.inertia_) / kmeans5.inertia_
print 'Relative error: %.02f' % rel_err
Relative error: 0.05
In [24]:
fp = FeaturePlotter()
models = [kmeans5, kmeans10, kmeans30]

fig = plt.figure(figsize=(8, 6))

for i, model in enumerate(models):
    ax = fig.add_subplot(2, 2, i + 1)
    NUM_CLUSTERS = len(set(model.labels_))
    fp.plot_cluster_populations(ax, model.labels_ + 1,'k-means (%d)' % NUM_CLUSTERS)

fig.tight_layout()
plt.show(fig)
In [25]:
def export_crops(df_cluster, num_cells):
    num_cells = min(num_cells, df_cluster.shape[0])
    fig = plt.figure(figsize=(5, 10))

    for i in range(num_cells):
        sample = df_cluster.sample(n=1)
        well, field, cell_index = sample[['well', 'field', 'cell_index']].values[0]

        ic = ImageCropper(well, field, ch5_folder)
        bounding_boxes = ic.get_bounding_box_list('primary__primary4')

        img_crop = ic.get_image_crop(bounding_boxes[int(cell_index)], padding=64)
        mask_crop = ic.get_mask_crop(bounding_boxes[int(cell_index)], padding=64)
        
        cy5, cy3, dapi = img_crop[:, :, 0], img_crop[:, :, 1], img_crop[:, :, 2]
            
        ax = fig.add_subplot(num_cells, 5, 1 + 5 * i)
        ax.imshow(img_crop)
        if i == 0: ax.set_title('All')
        ax.axis('off')

        ax = fig.add_subplot(num_cells, 5, 2 + 5 * i)
        ax.imshow(dapi, cmap='Greys_r')
        if i == 0: ax.set_title('DAPI')
        ax.axis('off')

        ax = fig.add_subplot(num_cells, 5, 3 + 5 * i)
        ax.imshow(cy5, cmap='Greys_r')
        if i == 0: ax.set_title('Cy5')
        ax.axis('off')

        ax = fig.add_subplot(num_cells, 5, 4 + 5 * i)
        ax.imshow(cy3, cmap='Greys_r')
        if i == 0: ax.set_title('Cy3')
        ax.axis('off')

        mask = mask_crop[:, :, 1].astype('uint8')
        cy5[mask > 0] = 0
        overlay = np.dstack([cy5 + mask, cy3, dapi])

        ax = fig.add_subplot(num_cells, 5, 5 + 5 * i)
        ax.imshow(overlay)
        if i == 0: ax.set_title('Mask')
        ax.axis('off')

    return fig

from matplotlib.backends.backend_pdf import PdfPages

output = './output/'
labels = 'kmeans_30'
cd.data[labels] = kmeans30.predict(X)
num_clusters = len(set(cd.data[labels]))

with PdfPages(os.path.join(output, 'clusters_%s.pdf' % labels)) as pdf:
    for i in range(num_clusters):
        df_cluster = cd.data[cd.data[labels] == i]
        fig = export_crops(df_cluster, 10)
        pdf.savefig()
        plt.close(fig)

Subsampling data¶

In [26]:
"""
Balance populations with balanced sampling
"""
NUM_SAMPLES = 50

cd.data['kmeans_labels'] = kmeans30.labels_
df_subsample = cd.subsample_data(cd.data['kmeans_labels'], NUM_SAMPLES)
"""
Set subsample matrix
"""
subsample_X = df_subsample[normalised_feature_names].as_matrix().astype('float')
print 'Shape: %s rows, %s cols' % subsample_X.shape
Shape: 1500 rows, 20 cols
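
For reference, one way such balanced sampling could be implemented with a pandas groupby is sketched below; this is an illustration only, not necessarily how ClassDiscovery.subsample_data works internally.

def balanced_subsample(df, label_column, n_per_cluster, seed=0):
    # Sample up to n_per_cluster rows from each cluster label
    return df.groupby(label_column, group_keys=False).apply(
        lambda g: g.sample(n=min(n_per_cluster, g.shape[0]), random_state=seed))

# e.g. df_demo = balanced_subsample(cd.data, 'kmeans_labels', NUM_SAMPLES)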

Spectral Clustering¶

Spectral clustering is another clustering approach that often yields better outcomes than the simpler k-means algorithm. The method is called spectral due to its use of the spectrum (eigenvalues) of the graph Laplacian matrix. Spectral clustering begins by choosing a metric to specify a similarity graph, whose edge weights indicate the degree to which two observations are similar. It may be shown that the eigenvectors corresponding to eigenvalue 0 of the graph Laplacian [equivalently, the basis vectors of the nullspace of the Laplacian] are indicator vectors for the membership of nodes in each of the corresponding connected components of the graph. It may further be shown that spectral clustering is equivalent to solving a relaxed version of the RatioCut problem [the RatioCut problem is a variant of the min-cut problem in which the resulting clusters are balanced in size; with this provision, however, the problem becomes NP-hard]. Once the $k$ eigenvectors for a chosen $k$ are acquired, a solution to the relaxed RatioCut problem has been found; however, these vectors are real-valued and must be discretised to indicate the cluster membership of the observations, so k-means is used as a heuristic to allocate clusters. Spectral clustering therefore consists of three main steps (sketched in code after the list):

  • Form graph Laplacian matrix
  • Compute first $k$ eigenvectors of Laplacian for desired $k$ clusters
  • Cluster on eigenvectors
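
The following is a minimal sketch of these three steps using a symmetrised k-nearest-neighbour affinity and the unnormalised Laplacian, for illustration only; the analysis below relies on sklearn.cluster.SpectralClustering, whose internals differ.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_sketch(X_small, k, n_neighbors=10):
    # 1. Form the graph Laplacian L = D - W from a symmetrised kNN affinity graph
    W = kneighbors_graph(X_small, n_neighbors=n_neighbors, mode='connectivity')
    W = 0.5 * (W + W.T).toarray()
    L = np.diag(W.sum(axis=1)) - W
    # 2. Take the eigenvectors of the k smallest eigenvalues (eigh sorts ascending)
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]
    # 3. Discretise the real-valued embedding by clustering its rows with k-means
    return KMeans(n_clusters=k).fit_predict(embedding)
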
In [27]:
from sklearn.cluster import SpectralClustering

NUM_CLUSTERS = 6
sc = SpectralClustering(n_clusters=NUM_CLUSTERS, affinity='nearest_neighbors',
                        assign_labels='kmeans')

start_time = time.time()

sc.fit(subsample_X)

end_time = time.time()

print 'Total time: %.02f seconds' % (end_time - start_time)
Total time: 0.28 seconds
In [28]:
df_subsample['sc_class'] = sc.labels_

fig = plt.figure(figsize=(5, 3))
ax = fig.add_subplot(1, 1, 1)

fp.plot_cluster_populations(ax, df_subsample['sc_class'] + 1, 'spectral (%d)' % NUM_CLUSTERS)

plt.show(fig)

DBSCAN¶

Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering technique. In this approach, a point is reachable from another point if it lies within a distance threshold that we denote $\epsilon$. A point is a core point if at least some minimum number of points, $m$, are reachable from it. Core points form the basis of clusters (hence, density-based), with reachable non-core points on the periphery. Any non-core point not reachable from a core point is labelled an outlier (noise). DBSCAN is flexible in not requiring a pre-specified number of clusters, and in fitting anisotropic clusters, common in spatial data. The main drawback is that the parameters $(\epsilon, m)$ must be chosen, and while a likely starting point can be approximated heuristically, this can entail a costly cross-validation. Another weakness is that the algorithm struggles to find clusters of differing densities.
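
As a minimal, uncommented example, DBSCAN can be run on the subsample as follows; the values eps=8 and min_samples=4 are taken from the exploratory cells below and should be treated as illustrative rather than tuned.

from sklearn.cluster import DBSCAN

db_demo = DBSCAN(eps=8, min_samples=4).fit(subsample_X)
labels_demo = db_demo.labels_  # label -1 marks noise (outliers)
n_clusters_demo = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
print 'Clusters found: %d, noise points: %d' % (n_clusters_demo, int((labels_demo == -1).sum()))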

In [29]:
# # from scipy.spatial.distance import euclidean
# from scipy.spatial import distance_matrix

# dists = distance_matrix(subsample_X, subsample_X)

# dists = dists + np.diag([np.inf] * dists.shape[0])
# nearest_neighbour_distances = np.min(dists, axis=0)
In [30]:
# fig, ax = plt.subplots(figsize=(10, 7.5))

# ax.hist(nearest_neighbour_distances, bins=500)

# ax.set_title('Distance to nearest neighbour', fontsize=20)
# ax.set_xlabel('Distance', fontsize=15)
# ax.set_ylabel('Frequency', fontsize=15)
# ax.grid('on')
# ax.set_xlim([0, 80])

# plt.show()
In [31]:
# EPS = 10
# num_neighbours = np.sum(dists <= EPS, axis=0)
In [32]:
# fig, ax = plt.subplots(figsize=(10, 7.5))

# ax.hist(num_neighbours)

# ax.set_title('Number of neighbours within epsilon (%d)' % EPS, fontsize=20)
# ax.set_xlabel('Distance', fontsize=15)
# ax.set_ylabel('Frequency', fontsize=15)
# ax.grid('on')
# ax.set_xlim([0, 80])

# plt.show()
In [33]:
# from sklearn.cluster import DBSCAN
# from sklearn.model_selection import GridSearchCV

# eps_list = range(4, 10, 2)
# min_samples_list = range(2, 8, 2)

# for eps in eps_list:
#     for min_samples in min_samples_list:
#         start_time = time.time()
#         db = DBSCAN(eps=eps, min_samples=min_samples).fit(subsample_X)
#         end_time = time.time()
#         print 'Fitting time: %.02f seconds' % (end_time - start_time)
#         print 'Eps: %d MinPts: %d, Labels: %s' % (eps, min_samples, np.unique(db.labels_))
In [34]:
# db = DBSCAN(eps=8, min_samples=4).fit(subsample_X)
# db_classes = db.labels_
# plt.hist(db_classes)
# plt.show()

Gaussian Mixture Models¶

In [35]:
from sklearn.mixture import GaussianMixture

NUM_CLUSTERS = 8

gm = GaussianMixture(n_components=NUM_CLUSTERS, init_params='kmeans')
gm.fit(subsample_X)

df_subsample['gmm_class'] = gm.predict(subsample_X)
In [36]:
fig = plt.figure(figsize=(5, 3))
ax = fig.add_subplot(1, 1, 1)

fp.plot_cluster_populations(ax, df_subsample['gmm_class'] + 1, 'GMM (%d)' % NUM_CLUSTERS)

plt.show(fig)

Hierarchical Clustering¶

Hierarchical clustering is a highly flexible way of performing clustering, in which points are clustered recursively according to closeness. It may be either agglomerative (bottom-up) or divisive (top-down), although the former is far more common, divisive clustering rarely being tractable. Hierarchical clustering thus creates a tree-like structure, and the programmer must decide at what distance to cut the tree, somewhere between the root (1 cluster) and the leaves ($N$ clusters). A standard tool for exploring the model is a visualisation of the tree structure called a dendrogram [from the Greek dendron, meaning tree]. The model parameters are a metric for measuring distances between points, and a linkage criterion for prioritising the order in which points are merged into clusters. The simplest is single linkage, whereby the next merge joins the two clusters containing the closest pair of points. Various other linkage criteria exist, notably Ward's method, which selects the merge that minimises the increase in the error sum of squares. This balances minimising the intra-class variance against maximising the inter-class variance, seeking denser, more isolated clusters.
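
To make the thresholding choice concrete, the sketch below shows the two common ways of cutting the tree with scipy: by a distance threshold (as with the dashed line drawn on the dendrogram further down) or by requesting a fixed number of clusters. This is illustrative only and mirrors the cells that follow.

from scipy.cluster.hierarchy import linkage, fcluster

li_demo = linkage(subsample_X, method='ward')  # agglomerative tree with Ward linkage
labels_by_distance = fcluster(li_demo, t=200, criterion='distance')  # cut at a cophenetic distance
labels_by_count = fcluster(li_demo, t=6, criterion='maxclust')       # cut to obtain at most 6 clusters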

In [37]:
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
In [38]:
NUM_CLUSTERS = 6

start_time = time.time()

li = linkage(subsample_X, 'ward')

end_time = time.time()
print 'Total time: %.02f seconds' % (end_time - start_time)
Total time: 0.10 seconds
In [39]:
fig, ax = plt.subplots(figsize=(8, 4))
ax.set_title('Dendrogram of subsample', fontsize=20)
dn = dendrogram(li, truncate_mode='lastp', p=NUM_CLUSTERS, show_contracted=True)
ax.axhline(y=200, c='k', linestyle='--')

plt.show(fig)
In [40]:
df_subsample['li_class'] = fcluster(li, NUM_CLUSTERS, criterion='maxclust') - 1

Visualisation - PCA¶

In [41]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
In [42]:
start_time = time.time()

pca.fit(X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 0.31 seconds

We use the first three principal components to visualise some of our clustering outcomes in two dimensions as illustrated below.

In [43]:
W = np.transpose(pca.components_)
Z = X.dot(W)
In [44]:
set_matplotlib_formats('png')
fig = plt.figure(figsize=(15, 5))
fp = FeaturePlotter()
principal_components = [(0, 1), (0, 2), (1, 2)]

for i, (pci, pcj) in enumerate(principal_components):
    ax = fig.add_subplot(1, 3, i + 1)

    x_min, y_min = np.percentile(Z[:, (pci, pcj)], q=0.01, axis=0)
    x_max, y_max = np.percentile(Z[:, (pci, pcj)], q=99.99, axis=0)

    fp.plot_clusters(
        ax, Z[:, (pci, pcj)], kmeans5.labels_,
        title='PC%d vs. PC%d' % (pci + 1, pcj + 1),
        xlabel='PC%d' % (pci + 1), ylabel='PC%d' % (pcj + 1),)
#         xlim=(x_min, x_max), ylim=(y_min, y_max))

fig.tight_layout()
plt.show()
In [45]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111, projection='3d')

fp.plot_clusters(ax, Z, kmeans5.labels_, title='PC1 vs. PC2 vs. PC3', _3d=True)

plt.show()

Visualisation - t-SNE¶

t-distributed stochastic neighbour embedding (t-SNE) is a technique for dimensionality reduction with particular merit in visualising high-dimensional data. t-SNE models the observations $\mathbf{x}_i$ with a probability distribution that reflects the similarities between observations. It then defines another probability distribution over a set of points in a lower-dimensional space (typically two or three dimensions for the purposes of visualisation), and minimises the KL divergence between the two distributions using gradient descent. This has the effect of finding a representation that reflects the relative proximities of points in the high-dimensional space. We use t-SNE to visualise clustering outputs.

In [46]:
from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
In [47]:
start_time = time.time()

model.fit_transform(subsample_X)

end_time = time.time()
print 'Fitting time: %.02f seconds' % (end_time - start_time)
Fitting time: 11.92 seconds
In [48]:
Z = model.embedding_
In [51]:
fig = plt.figure(figsize=(12, 4))
fp = FeaturePlotter()
LIM = 20

ax = fig.add_subplot(1, 3, 1)
fp.plot_clusters(ax, Z[:, [0, 1]], df_subsample['gmm_class'],
                 title='t-SNE of GMM', xlabel='q1', ylabel='q2',
                 xlim=[-LIM, +LIM], ylim=[-LIM, +LIM])

ax = fig.add_subplot(1, 3, 2)
fp.plot_clusters(ax, Z[:, [0, 1]], df_subsample['sc_class'],
                 title='t-SNE of spectral clustering', xlabel='q1', ylabel='q2',
                 xlim=[-LIM, +LIM], ylim=[-LIM, +LIM])

ax = fig.add_subplot(1, 3, 3)
fp.plot_clusters(ax, Z[:, [0, 1]], df_subsample['li_class'],
                 title='t-SNE of linkage model', xlabel='q1', ylabel='q2',
                 xlim=[-LIM, +LIM], ylim=[-LIM, +LIM])

plt.tight_layout()
plt.show()
# fig.savefig('/Users/jcboyd/Desktop/clustering.pdf')

Export clusters¶

In [52]:
# class_dict = cd.create_class_dict(df_subsample, 'gmm_class')
In [53]:
# cd.export_annotations(classifier_folder, class_dict)

Crops of clustered cells¶

In [54]:
labels = 'sc_class'
NUM_CLUSTERS = len(set(df_subsample[labels]))
NUM_SAMPLES = 5
In [55]:
fig = plt.figure(figsize=(2 * NUM_SAMPLES, 2 * NUM_CLUSTERS))

for class_ in range(NUM_CLUSTERS):
    df_cluster = df_subsample[df_subsample[labels] == class_]
    
    for i in range(min(NUM_SAMPLES, df_cluster.shape[0])):        
        row = df_cluster.iloc[i]
        well, field, cell_index = row[['well', 'field', 'cell_index']]

        ax = fig.add_subplot(NUM_CLUSTERS, NUM_SAMPLES, class_ * NUM_SAMPLES + i + 1)
        ic = ImageCropper(well, field, ch5_folder)
        bounding_boxes = ic.get_bounding_box_list('primary__primary4')
        img_crop = ic.get_image_crop(bounding_boxes[int(cell_index)], padding=64)
        ax.imshow(img_crop)
        ax.axis('off')

plt.show()

N.B. MDA231 cells with Cy5 saturation (usually forming 1-2 small classes) come exclusively from well A02.

In [ ]:
# df_subsample[df_subsample[labels] == 2]