Principal Component Analysis (PCA) with code on the MNIST dataset

Rana singh
4 min read · Sep 3, 2019


PCA is extensively used for dimensionality reduction and for visualizing high-dimensional data. Dimensionality reduction converts a d-dimensional dataset into an n-dimensional one with n < d; for visualization we usually need n to be 2 or 3, since we cannot plot more than three dimensions.

Suppose the spread (variance) of the data is very large along one axis but relatively small along another. Variance can be read as information: directions with high variance carry most of the information about the data. We can therefore drop the dimensions with low variance, since they carry little information. Before doing this, the data must be column standardized.

2-D TO 1-D

We want to find the direction v1 along which the variance is maximum. To do that, imagine rotating the axes in the plane: v1 is the direction of maximum variance and v2 the direction of minimum variance, so v1 carries more information about the dataset. The 2-D dataset with variables (x, y) can then be reduced to a 1-D dataset by keeping only the projection onto v1.

Mathematical Notation

Task: find a unit vector u1 such that the variance of the projections of the points x(i) onto u1 is maximum.
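In symbols, with the data column standardized so that each feature has zero mean, this objective can be written as follows (a standard formulation, using the u1 and x(i) notation from the text above):

u_1^{*} = \arg\max_{\|u_1\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( u_1^{\top} x_i \right)^{2}

The constraint \|u_1\| = 1 is needed because otherwise the variance could be made arbitrarily large simply by scaling u_1.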

Eigenvalues and Eigenvectors

We first compute the covariance matrix of the standardized data and then its eigenvalues and eigenvectors.
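For column-standardized data stored as an n x d matrix X, the covariance matrix and its eigen decomposition take the standard form below (the code later in the post drops the 1/n factor, which rescales the eigenvalues but leaves the eigenvectors unchanged):

S = \frac{1}{n} X^{\top} X, \qquad S\, v_j = \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d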

For every eigenvalue there is a corresponding eigenvector, and the eigenvectors are mutually perpendicular (orthogonal). We sort the eigenvalues in decreasing order: the eigenvector v1 corresponding to the largest eigenvalue has the maximum variance and therefore carries the most information about the dataset. Variance decreases as the eigenvalue decreases.

Projecting the data onto the direction of v1 gives the highest-variance component, i.e. the first principal component.
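If V_k stacks the top k eigenvectors as rows (k = 2 for a 2-D plot), each point is projected as below; this notation is mine, chosen to match the matrix shapes used in the code later in the post:

x_i' = V_k\, x_i \in \mathbb{R}^{k}, \qquad X' = X V_k^{\top}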

Limitations of PCA:

If the data follows some wave-like (non-linear) structure, that shape gets distorted after projection, because PCA only captures linear structure.

PCA on MNIST dataset with code

Detailed code available here: https://github.com/ranasingh-gkp/PCA-TSNE-on-MNIST-dataset
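The snippets that follow assume two objects are already in memory: data (the raw pixel matrix) and labels (the digit class of each image). A minimal loading sketch, assuming a Kaggle-style MNIST CSV whose first column is the label and whose remaining 784 columns are pixel intensities (the exact file name in the linked repo may differ):

import numpy as np
import pandas as pd

# load MNIST from CSV (assumed layout: a 'label' column plus 784 pixel columns)
d0 = pd.read_csv('./mnist_train.csv')
labels = d0['label']               # digit class (0-9) for each image
data = d0.drop('label', axis=1)    # 784 pixel intensities per image
print(data.shape)                  # expected: (n_samples, 784)
print(labels.shape)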

1 — Data preprocessing

Before applying PCA, each variable must be standardized to mean = 0 and standard deviation = 1.

# Data-preprocessing: Standardizing the data
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)

2 — Compute covariance matrix

# find the covariance matrix, which is A^T * A on the standardized data
import numpy as np
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data)
print("The shape of covariance matrix = ", covar_matrix.shape)
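Note that A^T * A on standardized data is the sample covariance matrix up to a constant factor of 1/(n-1); the factor rescales the eigenvalues but does not change the eigenvectors, so the projection is unaffected. If you want the properly scaled covariance matrix, NumPy computes it directly; an equivalent sketch:

# equivalent, properly scaled sample covariance (columns are treated as features)
covar_matrix = np.cov(sample_data, rowvar=False)
print("The shape of covariance matrix = ", covar_matrix.shape)   # (784, 784)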

3 — Compute eigenvalue and eigenvector

# finding the top two eigenvalues and corresponding eigenvectors
# for projecting onto a 2-D space.
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.linalg.eigh.html
from scipy.linalg import eigh

# the parameter 'eigvals' is given as (low index, high index)
# the eigh function returns the eigenvalues in ascending order
# this code generates only the top 2 eigenvalues (indices 782 and 783).
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
print("Shape of eigen vectors = ", vectors.shape)
# converting the eigenvectors into (2, d) shape for ease of further computations
vectors = vectors.T
print("Updated shape of eigen vectors = ", vectors.shape)
# here vectors[1] is the eigenvector corresponding to the 1st principal component
# here vectors[0] is the eigenvector corresponding to the 2nd principal component
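The eigvals keyword shown above follows the older SciPy 0.14 API linked in the comment; in recent SciPy releases it has been replaced by subset_by_index. A sketch of the same computation on a current SciPy version:

from scipy.linalg import eigh

# top two eigenvalues/eigenvectors (indices 782 and 783 of the 784 x 784 matrix,
# since eigh returns eigenvalues in ascending order)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
vectors = vectors.T   # shape (2, 784): one eigenvector per row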

Project the original data samples onto the plane spanned by the two principal eigenvectors by matrix multiplication.

import matplotlib.pyplot as plt
new_coordinates = np.matmul(vectors, sample_data.T)

Append the labels to the 2-D projected data (vertical stack) and create a new data frame for plotting the labeled points.

import pandas as pd

new_coordinates = np.vstack((new_coordinates, labels)).T
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())

(In the printed output, the row index 0, 1, 2, 3, 4 identifies the samples x(i); the other columns are the two principal-axis coordinates and the label.)

4 — Plotting

# plotting the 2-D data points with seaborn
import seaborn as sn
# note: in newer seaborn versions the FacetGrid argument 'size' is called 'height'
sn.FacetGrid(dataframe, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

There is a lot of overlap among the classes, which means PCA is not very good at visualizing this high-dimensional dataset: a few classes separate, but most of them remain mixed. PCA is mainly used for dimensionality reduction rather than for visualization. To visualize high-dimensional data, we mostly use t-SNE (https://github.com/ranasingh-gkp/PCA-TSNE-on-MNIST-dataset).
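For comparison, the t-SNE projection mentioned above can be produced with scikit-learn; a minimal sketch (the parameters are illustrative, not taken from the original post), usually run on a subsample because t-SNE is slow on large datasets:

from sklearn.manifold import TSNE

# t-SNE embedding of a subsample of the standardized data into 2-D
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
tsne_data = tsne.fit_transform(standardized_data[:5000])
print(tsne_data.shape)   # (5000, 2)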

5 — PCA for dimension reduction

# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()
# PCA for dimensionality reduction (non-visualization)
pca.n_components = 784
pca_data = pca.fit_transform(sample_data)
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)

Plotting

# Plot the PCA spectrum
plt.figure(1, figsize=(6, 4))
plt.clf()
plt.plot(cum_var_explained, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

Here we plot the cumulative explained variance against the number of components. Roughly 300 components already explain almost 90% of the variance, so we can reduce the dimensionality according to how much variance we need to retain.
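scikit-learn can also choose the number of components for a target variance fraction directly: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. A short sketch:

# keep the smallest number of components that explains ~90% of the variance
pca = decomposition.PCA(n_components=0.90)
pca_data = pca.fit_transform(sample_data)
print("Number of components kept:", pca.n_components_)
print("Reduced data shape:", pca_data.shape)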

================Thanks==================

