Introduction to Principal Component Analysis

Chapter 19 of Norman and Streiner has a discussion of Principal Component Analysis and Factor Analysis that assumes a fairly strong background in working with multidimensional data in linear spaces. The purpose of this module is to fill in some of that background. For additional background, see http://setosa.io/ev/principal-component-analysis/, https://en.wikipedia.org/wiki/Factor_analysis, and https://en.wikipedia.org/wiki/Principal_component_analysis

Vectors

A lot of scientific data comes in the form of "vectors". In physics, a vector is a quantity with magnitude and direction. Magnitude is just a number, but what is direction? In one dimension (on a line), direction is just left or right, up or down; i.e., it is just a sign for the magnitude, so in one dimension a vector is just a signed real number, or "scalar". In two dimensions (in a plane), instead of dealing with left vs. right and up vs. down, we need to deal with both simultaneously. So in two dimensions a vector has two components, two signed numbers. If we call one of those numbers $x$ and the other $y$, we can put them together between two parentheses or square brackets and write

$( x, y )$ or $[x, y]$

Both forms are used. For the rest of this discussion, we will choose the square brackets. We could get up to three dimensions by using $z$, but we'll soon run out of letters, so the general form of a vector of $n$ dimensions using one subscripted letter is

$[x_1, x_2, x_3, ..., x_n]$
where each $x_i, i = 1, ..., n$ is a signed number, e.g. $[.707,-.707]$ or $[-10, 5, 10, -5]$.

See https://en.wikipedia.org/wiki/Euclidean_vector

In statistics, we speak of variables, in which each variable is a label for a column of data, and data instances, in which each instance is a row of data. If the data are all in the form of numbers that can meaningfully be added, subtracted, multiplied, and divided, we can treat each data instance (row) as a vector in a space of dimension equal to the number of columns.
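As a concrete illustration, here is a minimal sketch using the numpy library (the numbers and variable names are made up for this example): a small data set with three instances and two variables stored as a two-dimensional array whose rows are the instance vectors.

```python
import numpy as np

# Each row is a data instance; each column is a variable
# (say, height in meters and weight in kilograms).
data = np.array([
    [1.70, 65.0],
    [1.80, 80.0],
    [1.60, 55.0],
])

print(data.shape)    # (3, 2): 3 rows (instances), 2 columns (variables)
print(data[0])       # the first instance as a vector of dimension 2
```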

How similar or different are two vectors?

Just as we are interested in how similar or how different two data points presented as scalars are, we need a way to compare two data points presented as vectors. Fortunately, we can do arithmetic on vectors. We can add or subtract them component by component, so if two vectors are of the same dimension (have the same number of components) and all of the component-by-component differences are zero, we know the two vectors are the same. But we also need a way to tell how close or far from each other they are.

One measure of the difference between two vectors is the Euclidean distance, i.e. the square root of the sum of the squares of the differences of the vector components. For example, if we have the vectors $[1,2]$ and $[3,-1]$, the Euclidean distance between them is $\sqrt{(1-3)^2+(2-(-1))^2} = \sqrt{4+9} = \sqrt{13}$. This is also called the $L_2$-norm of the difference between the vectors. If one of the vectors is zero, this is just the length of the other vector, usually written as $||x||$ for a vector $x$.
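A minimal sketch of this computation in numpy, using the two vectors from the example above:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, -1])

# Euclidean (L2) distance: square root of the sum of the squared component differences.
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)                  # 3.6055... = sqrt(13)
print(np.linalg.norm(a - b))     # the same value, using numpy's built-in norm
```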

Another measure of the difference between two vectors is the angle between them at the origin. This turns out to be easy to compute using what is called the dot product. The dot product is the sum of the products of the corresponding components. For the vectors $a = [1,2]$ and $b = [2,-1]$, the dot product is $a \cdot b = 1*2 + 2*(-1) = 0$. In all cases $a \cdot b = ||a||*||b||*\cos(θ)$, where $θ$ is the angle between the vectors. In this case the vectors are at right angles to each other and $\cos(θ)$ is zero. Note that $\cos(θ) = a \cdot b/(||a||*||b||)$ provided neither norm is zero.
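The same calculation in numpy, again with the vectors from the example (a sketch only; it does not guard against zero-length vectors):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([2, -1])

dot = np.dot(a, b)                                          # 1*2 + 2*(-1) = 0
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cos(θ) = a.b / (||a|| ||b||)
theta_degrees = np.degrees(np.arccos(cos_theta))

print(dot)              # 0
print(theta_degrees)    # 90.0 -- the vectors are at right angles
```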

Matrices

A matrix is a generalization of a vector that presents an array of numbers in rows and columns. The rows go across the page and the columns go up and down. To remember this distinction, think of columns as being vertical supports for a building. $$\begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix}$$ is a matrix of $m$ rows and $n$ columns.

A matrix of $n$ columns on the left is applied to a vector of dimension $n$ on the right by taking the dot product of each row with the vector: $$\begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \begin{pmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{n} \end{pmatrix} = \begin{pmatrix} x_{1,1}v_{1}+x_{1,2}v_{2}+\cdots+x_{1,n}v_{n} \\ x_{2,1}v_{1}+x_{2,2}v_{2}+\cdots+x_{2,n}v_{n} \\ \vdots \\ x_{m,1}v_{1}+x_{m,2}v_{2}+\cdots+x_{m,n}v_{n} \end{pmatrix}$$ Note that the product only has dimension $m$, so to stay in a space of the same dimension, we must use square matrices.
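A small numerical example (the matrix entries are made up); numpy's @ operator carries out exactly this row-by-row dot product:

```python
import numpy as np

# A 2-by-3 matrix (m = 2 rows, n = 3 columns) applied to a vector of dimension n = 3.
A = np.array([[1, 0, 2],
              [0, 3, 1]])
v = np.array([1, 1, 1])

# Each entry of the result is the dot product of one row of A with v.
print(A @ v)    # [3 4] -- a vector of dimension m = 2
```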

If we exchange rows and columns, we are transposing the matrix.
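In numpy the transpose of an array is available as its .T attribute, for example:

```python
import numpy as np

A = np.array([[1, 0, 2],
              [0, 3, 1]])    # 2 rows, 3 columns

print(A.T)                   # rows and columns exchanged: 3 rows, 2 columns
print(A.shape, A.T.shape)    # (2, 3) (3, 2)
```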

Eigenvalues and Eigenvectors

A solution to the equation $A v = λ v$ for a given matrix $A$ gives an eigenvector $v$ with eigenvalue $λ$. An eigenvector is a direction in the space along which the matrix $A$ acts as a simple scaling, stretching (or shrinking) the vector by a factor equal to the eigenvalue. A real symmetric $n$ by $n$ matrix has $n$ mutually orthogonal eigenvectors, which span the entire space.
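A minimal sketch in numpy, with a made-up symmetric matrix; np.linalg.eigh computes the eigenvalues and eigenvectors of a real symmetric matrix:

```python
import numpy as np

# A small real symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and the
# corresponding eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)              # [1. 3.]

# Check the defining equation A v = λ v for the first eigenvector.
v = eigenvectors[:, 0]
print(A @ v)                    # equals eigenvalues[0] * v
print(eigenvalues[0] * v)
```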

Singular Value Decomposition of a Data Matrix

A data matrix is unlikely to be square, but if we multiply it by its transpose, we get a square matrix. For an $m$ by $n$ matrix $M$ ($m$ instances as rows, $n$ variables as columns), $M^T$ is an $n$ by $m$ matrix, so $M^T M$ is an $n$ by $n$ matrix. If we first subtract the mean of each column, $M^T M$ is proportional to the variance-covariance matrix of the variables (or features), and its eigenvectors and eigenvalues give us a decomposition of the space, with the largest eigenvalues corresponding to the directions of largest variance. If we look at $M M^T$ instead, we are comparing the data instances (rows) with each other. In either case, the first eigenvector gives a sense of the overall direction of maximal variance. If we think of the clusters in the data as elliptical, this gives us a sense of the overall direction of the major axes of all the clusters. The second eigenvector then cuts across the clusters and allows us to see the clusters somewhat separately. For more information and alternative methods of finding the eigenvectors and eigenvalues, see the Wikipedia article, https://en.wikipedia.org/wiki/Principal_component_analysis
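A minimal sketch of this procedure in numpy, on made-up data with most of its variance along one direction: center the columns, form $M^T M$, and order the eigenvectors by eigenvalue so the first one points along the direction of maximal variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 instances (rows), 3 variables (columns), with most of the
# variance along the single direction [2, 1, 0.5] plus a little noise.
signal = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]])
M = signal + 0.1 * rng.normal(size=(200, 3))

# Subtract the mean of each column so every variable is centered.
M_centered = M - M.mean(axis=0)

# M^T M is an n-by-n symmetric matrix proportional to the variance-covariance
# matrix of the variables; its eigenvectors are the principal directions.
C = M_centered.T @ M_centered
eigenvalues, eigenvectors = np.linalg.eigh(C)      # eigh: ascending eigenvalues

# Reorder so the largest eigenvalue (largest variance) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)           # the first eigenvalue dominates the others
print(eigenvectors[:, 0])    # first principal direction, roughly proportional to [2, 1, 0.5] (up to sign)
```

In practice one would usually hand this job to a library routine, for example np.linalg.svd applied to the centered matrix or sklearn.decomposition.PCA, rather than forming $M^T M$ explicitly, but the underlying decomposition is the same.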