Lviv University
Notions to discuss
Curse of Dimensionality
Motivation
Principal component analysis (PCA):
Uses

Experiment setup
Question
The big question remains: how do we get from this data set to a simple equation of \(x\)?
Issues
Goal
Identify a most meaningful basis to re-express a data set.
In case of spring example, goal of PCA is to determine that \(x\) axis is the one that matters.
Measurement definition
\[ \vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix} \]
Naive basis
Naive basis: reflects the methods we used to measure the data. \[ \boldsymbol{B} = \begin{bmatrix} \boldsymbol{b_1} \\ \vdots \\ \boldsymbol{b_m} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \dots & 1 \end{bmatrix} = \boldsymbol{I} \]
Question
Is there another linear basis that best re-expresses our data set?
Important
Note the linearity assumption!
Basis change: definition
Let \(\boldsymbol{X}\) be the original data set (\(m\times n\) matrix with \(m=6\) and \(n=72000\)).
Let \(\boldsymbol{Y}\) be another \(m\times n\) matrix such that: \[\begin{align} \label{basis} &\boldsymbol{P}\boldsymbol{X} = \boldsymbol{Y} \end{align}\]
\(p_i\): rows of \(P\), \(x_i\): columns of \(X\), \(y_i\): columns of \(Y\)
Interpretation
Basis change
\[\begin{align*} &\boldsymbol{P}\boldsymbol{X} = \begin{bmatrix} \boldsymbol{p_1} \\ \vdots \\ \boldsymbol{p_m} \end{bmatrix} \begin{bmatrix} \boldsymbol{x_1} & \dots & \boldsymbol{x_n} \end{bmatrix} = \\ & = \begin{bmatrix} \boldsymbol{p_1} \cdot \boldsymbol{x_1} & \dots & \boldsymbol{p_1}\cdot\boldsymbol{x_n} \\ \vdots & \ddots & \vdots\\ \boldsymbol{p_m}\cdot\boldsymbol{x_1} & \dots & \boldsymbol{p_m}\cdot\boldsymbol{x_n} \end{bmatrix} \end{align*}\] \(j\)th coefficient of \(\boldsymbol{y_i}\) is a projection on the \(j\)th row of \(\boldsymbol{P}\), therefore, \(\boldsymbol{p_i}\) are a new set of basis vectors for columns of \(\boldsymbol{X}\).
\(\boldsymbol{p_i}\) will become principal components of \(\boldsymbol{X}\).
Questions on \(\boldsymbol{P}\)
How to express best?
Use signal-to-noise ratio. \[ SNR = \dfrac{\sigma^2_{signal}}{\sigma^2_{noise}} \] High SNR - precision measurement, low - noisy data.


How do we generalize to higher dimensions?
Consider two sets of measurements with zero means \[\begin{align*} &A=\left\{a_1, a_2, \dots, a_n\right\}, B=\left\{b_1, b_2, \dots, b_n\right\}. \end{align*}\]
Variances are \[\begin{align*} &\sigma_A^2 = \dfrac{1}{n}\sum\limits_i a_i^2, \, \sigma_B^2 = \dfrac{1}{n}\sum\limits_i b_i^2,\\ \end{align*}\]
Covariance of \(A\) and \(B\) is \[ \sigma_{AB}^2 = \dfrac{1}{n}\sum\limits_i a_i b_i \]
Absolute value of covariance measures the degree of redundancy.
Matrix form
\[\begin{align*} &\boldsymbol{a} = \left[a_1 a_2 \dots a_n\right] \\ &\boldsymbol{b} = \left[b_1 b_2 \dots b_n\right] \\ &\sigma_{\boldsymbol{a}\boldsymbol{b}}^2 \equiv \dfrac{1}{n} \boldsymbol{a}\boldsymbol{b}^T \end{align*}\]
Generalization
Let’s generalize to a multiple number of vectors.
Covariance matrix
\[ \boldsymbol{C_X} \equiv \dfrac{1}{n} \boldsymbol{X} \boldsymbol{X}^T \]
Covariance matrix properties
Proof of symmetricity
Theorem 1. Inverse of orthogonal matrix is its transpose.
Let \(A\) be an \(m \times n\) orthogonal matrix where \(a_i\) is the \(i\)th column vector. We have \[ (A^T A)_{ij} = a_i^T a_j = \begin{cases} 1, \; \text{ if } i=j,\\ 0 \; \text{ otherwise} \end{cases} \] Therefore, \(A^T A = I \Rightarrow A^{-1} = A^T\).
Proof of symmetricity
Theorem 2. For any matrix \(A\), \(A^T A\) and \(A A^T\) are symmetric. \[\begin{align*} & (A A^T)^T = A^{TT} A^T = A A^T,\\ & (A^T A)^T = A^T A^{TT} = A^T A. \end{align*}\]
Covariance
\(\boldsymbol{C_X}\) captures the covariance between all possible pairs of measurements. The covariance values reflect the noise and redundancy in our measurements.
New goals
Let’s transform \(\boldsymbol{C_X}\) into some optimized matrix \(\boldsymbol{C_Y}\).
Optimized matrix \(\boldsymbol{C_Y}\)
Diagonalizing
What are the methods for diagonalizing \(\boldsymbol{C_Y}\)?
We assume that all basis vectors \(\left\{\boldsymbol{p_1}, \dots, \boldsymbol{p_m}\right\}\) are orthonormal, that is, \(\boldsymbol{P}\) is an orthonormal matrix.
How does PCA work?
Looking at figure, it aligns the basis with the axis of maximal variance.
Algorithm
Resulting ordered set of \(\boldsymbol{p}\) are called principal components.
Assumptions review
Setup
We have a dataset \(\boldsymbol{X}\) which is a \(m \times n\) matrix:
Goal
Find some orthonormal matrix \(\boldsymbol{P}\) in \(\boldsymbol{Y}=\boldsymbol{P}\boldsymbol{X}\) such that \(\boldsymbol{C_Y} \equiv \dfrac{1}{n} \boldsymbol{Y}\boldsymbol{Y}^T\) is a diagonal matrix.
The rows of \(\boldsymbol{P}\) are the principal components of \(\boldsymbol{X}\).
\(\boldsymbol{C_Y}\) rewrite
Let’s rewrite \(\boldsymbol{C_Y}\) in terms of unknown variable \(\boldsymbol{P}\): \[\begin{align*} &\boldsymbol{C_Y} = \dfrac{1}{n}\boldsymbol{Y}\boldsymbol{Y}^T = \dfrac{1}{n}(\boldsymbol{P}\boldsymbol{X})(\boldsymbol{P}\boldsymbol{X})^T = \\ & = \dfrac{1}{n} \boldsymbol{P} \boldsymbol{X} \boldsymbol{X}^T \boldsymbol{P}^T = \boldsymbol{P}(\dfrac{1}{n} \boldsymbol{X} \boldsymbol{X}^T) \boldsymbol{P}^T = \\ &= \boldsymbol{P} \boldsymbol{C_X} \boldsymbol{P}^T, \end{align*}\] where \(\boldsymbol{C_X}\) is the covariance matrix of \(\boldsymbol{X}\).
Goal
Any symmetrix matrix \(A\) is diagonalized by an orthogonal matrix of its eigenvectors.
Theorem
Theorem 3. A matrix is symmetric \(\Leftrightarrow\) it is orthogonally diagonalizable.
\((\Rightarrow)\) If \(A\) is orthogonally diagonalizable, then \(A\) is symmetric.
Orthogonally diagonalizable means that \(\exists E: A = E D E^T\), where \(D\) is a diagonal matrix and \(E\) is a matrix that diagonalizes \(A\). Let’s compute \(A^T\): \[ A^T = (E D E^T)^T = E^{TT}D^T E^T = E D E^T = A \]
Theorem
Theorem 4. A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.
Let \(A\) be a square $n n $ symmetric matrix with eigenvectors \(\left\{e_1, \dots, e_n\right\}\). Let \(E=\left[e_1 \dots e_n\right]\). This theorem asserts that \(\exists \text{ diagonal matrix } D: A = E D E^T\).
First, let’s prove that any matrix can be orthogonally diagonalized if and only if it that matrix’s eigenvectors are all linearly independent.
Let \(A\) be some matrix with independent eigenvectors (not degenerate). Let \(D\) be a diagonal matrix where \(i\)th eigenvalue is placed in \(ii\)th position. We will show that \(AE=ED\).
Theorem
\[\begin{align*} & AE = \left[Ae_1 \dots Ae_n\right],\\ & ED = \left[\lambda_1 e_1 \dots \lambda_n e_n\right]. \end{align*}\] Evidently, if \(AE=ED\) then \(Ae_i = \lambda_i e_i \; \forall i\). This is the definition of the eigenvalue equation. Therefore, \(A = E D E^{-1}\).
Theorem
Now let’s prove that a symmetric matrix always has orthogonal eigenvectors. Suppose that \(\lambda_1\) and \(\lambda_1\) are distinct eigenvalues for eigenvectors \(e_1\) and \(e_2\). \[\begin{align*} &\lambda_1 e_1 \cdot e_2 = (\lambda_1 e_1)^T e_2 = (A e_1)^T e_2 =\\ & = e_1^T A^T e_2 = e_1^T A e_2 = e_1^T(\lambda_2 e_2) = \lambda_2 e_1 \cdot e_2. \end{align*}\] As \(\lambda_1 \neq \lambda_2\), then \(e_1 \cdot e_2 = 0\).
So, \(E\) is an orthogonal matrix, and by theorem 1 \(E^T = E^{-1}\) and \(A = E D E^T\).
Theorem
For symmetric matrix \(\boldsymbol{A}\) we have \(\boldsymbol{A} = \boldsymbol{E} \boldsymbol{D} \boldsymbol{E}^T\), where \(D\) is a diagonal matrix and \(E\) is a matrix of eigenvectors of \(A\) arranged as columns.
Note that \(\boldsymbol{A}\) might have \(r \leq m\) orthonormal eigenvectors where \(r\) is the rank. This will mean that \(\boldsymbol{A}\) is . Therefore, we’ll need to select additional \((m-r)\) additional orthogonal vectors to fill matrix \(\boldsymbol{E}\).
These vectors do not affect the final solution because variances associated with these directions are \(0\).
Important
We select the matrix \(\boldsymbol{P}\) to be a matrix where each row \(\boldsymbol{p_i}\) is an eigenvector of \(\dfrac{1}{n}\boldsymbol{X} \boldsymbol{X}^T\). By this selection, \(\boldsymbol{P} \equiv \boldsymbol{E}^T\). Keeping in mind that \(\boldsymbol{P}^{-1} = \boldsymbol{P}^T\) (Theorem 1), we have: \[\begin{align*} & \boldsymbol{C_Y} = \boldsymbol{P} \boldsymbol{C_X} \boldsymbol{P}^T \\ &= \boldsymbol{P}(\boldsymbol{E}^T \boldsymbol{D} \boldsymbol{E})\boldsymbol{P}^T \\ &= \boldsymbol{P}(\boldsymbol{P}^T \boldsymbol{D} \boldsymbol{P})\boldsymbol{P}^T = \\ & = (\boldsymbol{P} \boldsymbol{P}^T) \boldsymbol{D} (\boldsymbol{P} \boldsymbol{P}^T) \\ & = (\boldsymbol{P} \boldsymbol{P}^{-1})\boldsymbol{D}(\boldsymbol{P} \boldsymbol{P}^{-1}). \end{align*}\] Therefore, \(\boldsymbol{C_Y} = \boldsymbol{D}\).
Result
Practical computation
Setup
Let \(\boldsymbol{X}\) be an arbitrary \(n \times m\) matrix (!) and \(\boldsymbol{X}^T \boldsymbol{X}\) be a rank \(r\), square, symmetric \(m \times m\) matrix.
Properties
Theorem
Theorem 5. For any arbitrary \(m \times n\) matrix \(X\), the symmetric matrix \(X^T X\) has a set of orthonormal eigenvectors of \(\left\{\hat{\boldsymbol{v}_1}, \dots, \hat{\boldsymbol{v}_n}\right\}\) and a set of associated eigenvalues \(\left\{\lambda_1, \dots, \lambda_n\right\}\). The set of vectors \(\left\{\boldsymbol{X}\hat{\boldsymbol{v_1}}, \dots, \boldsymbol{X}\hat{\boldsymbol{v_n}}\right\}\) then forms an orthogonal basis, where each vector \(X\hat{\boldsymbol{v_i}}\) is of length \(\sqrt{\lambda_i}\).
\[\begin{align*} &(X \hat{v_i}) \cdot (X \hat{v_j}) = (X \hat{v_i})^T \cdot (X \hat{v_j}) = \\ & = \hat{v_i} X^T X \hat{v_j} = \hat{v_i}^T(\lambda_j v_j) = \lambda_j \hat{v_i} \cdot \hat{v_j} =\\ & = \lambda_j \delta_{ij}. \end{align*}\] And so we have \[ \|X \hat{v_i}\|^2 = (X \hat{v_i})\cdot(X \hat{v_i}) = \lambda_i. \]
The scalar version of SVD
Is a restatement of the third definition: \[\begin{align} \label{scalar_svd} &\boldsymbol{X} \hat{\boldsymbol{v}_i} = \sigma_i \hat{\boldsymbol{u}_i} \end{align}\]
The set of eigenvectors \(\left\{\hat{\boldsymbol{v}_1}, \dots, \hat{\boldsymbol{v}_r}\right\}\) and the set of vectors \(\left\{\hat{\boldsymbol{u}_1}, \dots, \hat{\boldsymbol{u}_r}\right\}\) are both bases in \(r\)-dimensional space.
The matrix version of SVD.
Construct a new diagonal matrix \(\Sigma\): \[ \Sigma \equiv \begin{bmatrix} \sigma_{\tilde{1}} \\ & \ddots & & \text{0} \\ & & \sigma_{\tilde{r}} \\ \text{0} & & & \ddots \\ & & & & 0 \end{bmatrix} \] where \(\sigma_{\tilde{1}} \geq \dots \geq \sigma_{\tilde{r}}\) are the rank-ordered set of singular values.
The matrix version of SVD.
Likewise we construct accompanying orthogonal matrices \[\begin{align*} & \boldsymbol{V} = \left[\hat{\boldsymbol{v_1}}, \dots, \hat{\boldsymbol{v_m}}\right],\; \boldsymbol{U} = \left[\hat{\boldsymbol{u_1}}, \dots, \hat{\boldsymbol{u_n}}\right], \end{align*}\] where we appended additional \((m-r)\) and \((n-r)\) orthonormal vectors to fill up the matrices for \(\boldsymbol{V}\) and \(\boldsymbol{U}\) respectively, in order to deal with degeneracy issues.
\[ \boldsymbol{X}\boldsymbol{V} = \boldsymbol{U} \boldsymbol{\Sigma}, \] where each column of \(\boldsymbol{V}\) and \(\boldsymbol{U}\) perform the scalar version of decomposition \(\eqref{scalar_svd}\).
As \(\boldsymbol{V}\) is orthogonal, we multiply both sides by \(\boldsymbol{V}^{-1}=\boldsymbol{V}^T\) and obtain \[\begin{align} \label{svd} &\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^T. \end{align}\]
Interpretation
Equation \(\eqref{svd}\) states that any arbitrary matrix \(\boldsymbol{X}\) can be converted into an orthogonal matrix, diagonal matrix, and another orthogonal matrix (or a rotation, a stretch, and a second rotation).
Reinterpret equation \(\eqref{scalar_svd}\) as \(\boldsymbol{X}\boldsymbol{a} = k\boldsymbol{b}\), where \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are column vectors and \(k\) is a scalar constant.
The set \(\left\{\hat{\boldsymbol{v_1}}, \dots, \hat{\boldsymbol{v_m}}\right\}\) is analogous to \(\boldsymbol{a}\) and \(\left\{\hat{\boldsymbol{u_1}}, \dots, \hat{\boldsymbol{u_n}}\right\}\) is analogous to \(\boldsymbol{b}\).
\(\left\{\hat{\boldsymbol{v_1}}, \dots, \hat{\boldsymbol{v_m}}\right\}\) and \(\left\{\hat{\boldsymbol{u_1}}, \dots, \hat{\boldsymbol{u_n}}\right\}\) are orthonormal sets of vectors which span an \(m\) or \(n\) dimensional space, respectively.
Inputs are \(\boldsymbol{a}\) and outputs are \(\boldsymbol{b}\). Can we formalize the view that \(\left\{\hat{\boldsymbol{v_1}}, \dots, \hat{\boldsymbol{v_m}}\right\}\) and \(\left\{\hat{\boldsymbol{u_1}}, \dots, \hat{\boldsymbol{u_n}}\right\}\) span all possible inputs and outputs?
Interpretation
From \(\eqref{svd}\) we have \[\begin{align*} & \boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^T \Rightarrow \\ & \boldsymbol{U}^T \boldsymbol{X} = \boldsymbol{\Sigma} \boldsymbol{V}^T \Rightarrow \\ & \boldsymbol{U}^T \boldsymbol{X} = \boldsymbol{Z}, \end{align*}\] where \(\boldsymbol{Z} \equiv \boldsymbol{\Sigma} \boldsymbol{V}^T\).
Interpretation
Comparing to \(\eqref{basis}\), \(\left\{\hat{\boldsymbol{u_1}}, \dots, \hat{\boldsymbol{u_n}}\right\}\) perform the same role as \(\left\{\hat{\boldsymbol{p_1}}, \dots, \hat{\boldsymbol{p_m}}\right\}\). Hence, \(\boldsymbol{U}^T\) is a change of basis from \(\boldsymbol{X}\) to \(\boldsymbol{Z}\).
Therefore, from the fact that the orthonormal basis \(\boldsymbol{U}^T\) (or \(\boldsymbol{P}\)) transforms column vectors it follows that \(\boldsymbol{U}^T\) is a basis that spans the columns of \(\boldsymbol{X}\). Bases that span the columns are termed column spaces.
Column spaces are equivalent to matrix outputs.
Row spaces are equivalent to matrix inputs.
How are PCA and SVD related?
Consider original \(m \times n\) matrix \(\boldsymbol{X}\). Define \[ \boldsymbol{Y} \equiv \dfrac{1}{\sqrt{n}} \boldsymbol{X}^T, \] where each column of \(\boldsymbol{Y}\) has zero mean. Consider: \[\begin{align*} &\boldsymbol{Y}^T \boldsymbol{Y} = \left(\dfrac{1}{\sqrt{n}}\boldsymbol{X}^T\right)^T\left(\dfrac{1}{\sqrt{n}}\boldsymbol{X}^T\right) = \dfrac{1}{n}\boldsymbol{X} \boldsymbol{X}^T = \boldsymbol{C_X}. \end{align*}\] By construction \(\boldsymbol{Y}^T \boldsymbol{Y}\) equals the covariance matrix of \(\boldsymbol{X}\). Principal components of \(\boldsymbol{X}\) are the eigenvectors of \(\boldsymbol{C_X}\). If we calculate the SVD of \(\boldsymbol{Y}\), the columns of matrix \(\boldsymbol{V}\) contain the eigenvectors of \(\boldsymbol{Y}^T \boldsymbol{Y} = \boldsymbol{C_X}\).
Therefore, columns of \(\boldsymbol{V}\) are the principal components of \(\boldsymbol{X}\).
Interpretation
\(\boldsymbol{V}\) spans the row space of \(\boldsymbol{Y} \equiv \dfrac{1}{\sqrt{n}} \boldsymbol{X}^T\). Therefore, \(\boldsymbol{V}\) must also span the column space of \(\dfrac{1}{\sqrt{n}} \boldsymbol{X}\).
We conclude that finding the principal components amounts to finding an orthonormal basis that spans the column space of \(\boldsymbol{X}\).
Summary of PCA

Loss function
It can be proved that under a common loss function, mean squared error (\(L_2\) norm), PCA provides the optimal reduced representation of the data.
This means that selecting orthogonal directions for principal components is the best solution to predicting the original data.
Kernel PCA
Decorrelation is removing second-order dependencies.
Higher-order dependencies: if prior knowledge is known about the problem, then a nonlinearity (i.e. kernel) might be applied to the data to transform the data to a more appropriate naive basis. This parametric approach is often termed kernel PCA.
Basic concepts: LSA (latent semantic analysis)
SVD is applied to a term-document matrix (each cell weighted by log frequency and normalized by entropy), and then the first e.g. 300 dimensions are used as the LSA embedding.
Alternatively, this is PCA applied to NLP data.
Issues with TF-IDF
Solution
Use topics - aggregated words.
Basic concepts
Basic concepts
What is a principal component: recap
In principal component analysis, a principal component is a new feature that is constructed from a linear combination of the original features in a dataset.
Idea behind PCA
Reduce the dimensionality of a dataset by projecting the data onto a lower-dimensional space, while still preserving as much of the variance in the data as possible.
Making sense of principal components
PCs are the directions in which the data varies the most. To make sense of the PCs, you can consider the following:
PCA items
Preliminary
Before applying PCA, missing values need to be removed from the dataset and each variable should be scaled to have zero mean and unit variance.
Compute
\(\text{Covariance matrix} = (1/n) \cdot X^T \cdot X\), where \(X\) is the data matrix with \(n\) samples and \(p\) features.
Then eigenvectors and eigenvalues are obtained by solving the following equation: \[ \text{Covariance matrix} \cdot v = \lambda \cdot v \] where \(v\) is the eigenvector and \(\lambda\) is the corresponding eigenvalue.
Compute
# Compute covariance matrix
cov_matrix = np.cov(data, rowvar=False)
# Compute eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort in descending order
idx = np.argsort(-eigenvalues)
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:,idx]
# Select top principal components
k = 2
eigenvectors = eigenvectors[:,:k]Common steps
Definition
Stochastic Neighbor Embedding (SNE) tries to place the objects in a low-dimensional space so as to optimally preserve neighborhood identity, and can be naturally extended to allow multiple different low-d images of each object.
Step 1.
High-dimensional probabilities.
Compute probability that that object \(i\) would pick \(j\) as neighbor: \[ p_{ij} = \dfrac{exp\left(-d_{ij}^2\right)}{\sum\limits_{k \neq i} exp\left(-d_{ik}^2\right)} \] Values \(d_{ij}\) represent between points \(i\) and \(j\): \[ d_{ij}^2 = \dfrac{\|\boldsymbol{x}_i-\boldsymbol{x}_j\|^2}{2\sigma_i^2} \]
Step 1.
What is the meaning of \(\sigma_i\)? It is found by a binary search that makes the Shannon entropy of the distribution over neighbors equal to \(log_2 k\), where \(k\) is the effective number of local neighbors or .
We compute entropy via: \[\begin{align*} &H = -\sum\limits_j p_{ij} log_2 p_{ij} = log_2 k,\\ &k = 2^{-\sum\limits_j p_{ij} log_2 p_{ij}} \end{align*}\] By tuning \(\sigma_i\) we try to match the value of \(k\) set by the user.
Step 1.
The higher the effective number of local neighbors (perplexity), the higher \(\sigma_i\) and the wider the Gaussian function used in the dissimilarities.
Intuition
Mathematical intuition: The higher the perplexity, the more likely it is to consider points that are far away as neighbors.
Step 2.
Low-dimensional probabilities.
Now that we have the high-dimensional probabilities, we move on to calculate the low dimensional ones, which depend on where the data points are mapped in the low dimensional space.
In the low-dimensional space we also use Gaussian neighborhoods but with a fixed variance (which we set without loss of generality to be \(\dfrac{1}{2}\)) so the induced probability \(q_{ij}\) that point \(i\) picks point \(j\) as its neighbor is a function of the low-dimensional images \(y_i\) of all the objects and is given by expression: \[ q_{ij} = \dfrac{exp\left(-\|\boldsymbol{y}_i-\boldsymbol{y}_j\|^2\right)}{\sum\limits_{k \neq i} exp\left(-\|\boldsymbol{y}_i-\boldsymbol{y}_k\|^2\right)} \]
Step 3.
Choice of cost function.
If the points \(Y_i\) are placed correctly in the low-dimensional space, the conditional probabilities \(p\) and \(q\) will be very similar. To measure the mismatch between both probabilities, SNE uses the as a loss function for each point. Each point in both high and low dimensional space has a conditional probability to call another point its neighbor. Hence, we have as many loss functions as we have data points. We define the cost function as the sum of the KL divergences over all data points, \[ C = \sum\limits_i \sum\limits_j p_{ij} log\dfrac{p_{ij}}{q_{ij}} = \sum\limits_i KL(P_i \| Q_I). \]
Kullback-Leibler Divergence
The KL divergence, which is closely related to relative entropy, information divergence, and information for discrimination, is a non-symmetric measure of the difference between two probability distributions \(p(x)\) and \(q(x)\).
Specifically, the KL divergence of \(q(x)\) from \(p(x)\), denoted DKL(p(x),q(x)), is a measure of the information lost when \(q(x)\) is used to approximate \(p(x)\).
Kullback-Leibler Divergence
Let \(p(x)\) and \(q(x)\) are two probability distributions of a discrete random variable \(x\). That is, both \(p(x)\) and \(q(x)\) sum up to \(1\), and \(p(x) > 0\) and \(q(x) > 0\) \(\forall x \in X\).
KL divergence is defined as: \[ KL(p(x) \| q(x)) = \sum\limits_{x \in X} p(x) ln \dfrac{p(x)}{q(x)}. \] Continuous version: \[ KL(p(x) \| q(x)) = \int\limits_{-\infty}^{\infty} p(x) ln \dfrac{p(x)}{q(x)}dx. \]
Differentiation of the cost function
\[ \dfrac{\partial C}{\partial \boldsymbol{y}_i} = 2\sum\limits_j (\boldsymbol{y}_i-\boldsymbol{y}_j)(p_{ij}-q_{ij}+p_{ji}-q_{ij}). \]
Interpretation
A sum of forces pulling toward or pushing it away depending on whether is observed to be a neighbor more or less often than desired.



Problems
Solutions
Symmetric SNE.
The probability of point \(x_i\) to consider point \(x_j\) as its neighbor is not the same probability that point \(x_j\) would consider point \(x_i\) as a neighbor. We symmetrize pairwise probabilities in high dimensional space by defining \[ \tilde{p}_{ij} = \dfrac{p_{ij}+p_{ji}}{2n} \]
The Crowding Problem
If we want to correctly project close distances between points, the moderate distances are distorted and appear as huge distances in the low dimensional space.
To solve this, use the Student t-Distribution (which is what gives the ‘t’ to t-SNE) with one degree of freedom for the low-dimensional probabilities: \[ q_{ij} = \dfrac{\left(1+\|y_i-y_j\|^2\right)^{-1}}{\sum\limits_{k \neq l} \left(1+\|y_k-y_l\|^2\right)^{-1}} \]
Now \(p_{ij}=p_{ji}\) and \(q_{ij}=q_{ji}\), and the gradient of the cost function \[ C = \sum\limits_i \sum\limits_j \tilde{p}_{ij} log\dfrac{\tilde{p}_{ij}}{q_{ij}} \] is easier to compute.
Local structure
Since t-SNE also uses KL divergence as its loss function, it also carries the problems discussed in the previous section. This is not to say that it is completely ignored, but the main takeaway is that t-SNE severely prioritizes the conservation of the local structure.
Global structure
Since the KL divergence function does not penalize the misplacement in low dimensional space of points that are far away in high dimensional space, we can conclude that the global structure is not well preserved. t-SNE will group similar data points together into clusters, but distances between clusters might not mean anything.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
def read_glove(file_path):
with open(file_path) as f:
for i, line in enumerate(f):
fields = line.rstrip().split(' ')
vec = [float(x) for x in fields[1:]]
word = fields[0]
yield (word, vec)
words = []
vectors = []
for word, vec in read_glove('data/glove/glove.42B.300d.txt'):
words.append(word)
vectors.append(vec)
model = TSNE(n_components=2, init='pca', random_state=0) coordinates = model.fit_transform(vectors)
plt.figure(figsize=(8, 8))
for word, xy in zip(words, coordinates):
plt.scatter(xy[0], xy[1])
plt.annotate(word,
plt.xlim(25, 55)
plt.ylim(-15, 15)
plt.show()GloVe embeddings visualized by t-SNE
Get libraries
from __future__ import print_function
import time
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as snsConvert to Pandas and randomize
feat_cols = [ 'pixel'+str(i) for i in range(X.shape[1]) ]
df = pd.DataFrame(X,columns=feat_cols)
df['y'] = y
df['label'] = df['y'].apply(lambda i: str(i))
X, y = None, None
print('Size of the dataframe: {}'.format(df.shape))
[out] Size of the dataframe: (70000, 785)
# For reproducibility of the results
np.random.seed(42)
rndperm = np.random.permutation(df.shape[0])Use Scikit-learn for PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols].values)
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1]
df['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
Explained variation per principal component: [0.09746116 0.07155445 0.06149531]Scatterplot of first 2 PCs in 3D
Run PCA
N = 10000
df_subset = df.loc[rndperm[:N],:].copy()
data_subset = df_subset[feat_cols].values
pca = PCA(n_components=3)
pca_result = pca.fit_transform(data_subset)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1]
df_subset['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
[out] Explained variation per principal component: [0.09730166 0.07135901 0.06183721]Run t-SNE
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(data_subset)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
[out] [t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.564s...
[t-SNE] Computed neighbors for 10000 samples in 121.191s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
...
[t-SNE] Mean sigma: 2.129023
[t-SNE] KL divergence after 300 iterations: 2.823509
t-SNE done! Time elapsed: 157.3975932598114 secondsPlot t-SNE