Where and How to Use PCA
Understanding how PCA works—in a visual way
I believe most of us prefer to look at a complex problem from a distance to see its essence. It is the same for intangible matters: you might need to step out of the problem and look at it from a different perspective. We prefer to do it this way because it reduces the complexity of the problem and lets us skip unnecessary details to focus on the essential ones. Imagine you are looking at a painting. When you get close to it, you can see every brush mark in detail, but you can’t see how they create a composition together. It is better to look at the painting from afar to understand its story.
It is the same for complex datasets with many variables. You can’t understand their story just by looking at the raw numbers. You need to summarize your dataset and create graphics to find its story. However, even that isn’t enough; there are still too many details to untangle. The same holds for the model you are going to build: using too many variables increases its complexity and reduces its performance. So it is time to step out of the problem and look from a different perspective. What you need to do is extract the main components of your dataset by considering every data point. These components should explain how the data points vary across all the variables in your dataset.
Understanding Components
Let’s walk through a basic example to understand what components are. Suppose you have two variables, Math and Science, showing students’ exam results. You want to extract information that explains the variation of Math and Science together. In Figure 1, you can see how the data points spread in two-dimensional space. Remember, each component should explain these two variables together but from a different perspective. Since there are two variables, we can create two components. They should tell how the data points spread in orthogonal directions. The first direction should be A, which lies along the higher variation, and the second should be B, which is perpendicular to A and lies along the lower variation. The components will then look like the right side of Figure 1.
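To make this concrete, here is a minimal sketch in Python. The exam scores are synthetic numbers invented for illustration (not taken from the figure), and it uses the eigenvectors of the covariance matrix, one of the standard ways to find these directions, which we come back to at the end of the article:

```python
import numpy as np

# Hypothetical exam scores: Science (x-axis) and Math (y-axis), positively correlated.
rng = np.random.default_rng(0)
science = rng.normal(70, 10, size=100)
math = 0.4 * science + rng.normal(0, 4, size=100)
X = np.column_stack([science, math])

# The two orthogonal directions of variation are the eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order

direction_A = eigenvectors[:, -1]  # lies along the higher variation
direction_B = eigenvectors[:, 0]   # perpendicular to A, lies along the lower variation
print(np.dot(direction_A, direction_B))  # ~0.0: the two directions are orthogonal
```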
We drew two lines that are perpendicular to each other. How do we interpret them? As you can see from Figure 1, the components do not explain a single variable each. They stand for both Science and Math together, but in different ratios. This means that a 1-unit change in C1 has more impact on Science than it does on Math, while a 1-unit change in C2 affects Math more than Science. Therefore, Science holds more information than Math for C1, and vice versa for C2. The question is: which component carries more information than the other? To understand that, we need to look at how PCA works.
Principal Component Analysis
Principal Component Analysis is a commonly used dimensionality reduction approach for extracting the main components of a high-dimensional feature set. It projects data points in multi-dimensional space onto uncorrelated principal components. However, we can define an infinite number of directions to project onto. How can we know which components best explain the variables? For example, the left side of Figure 2 shows a scatterplot of the X and Y variables. The first principal component, PC1, always represents the line that lies along the highest variation. The first component explains the most variation in the dataset, so it carries more information than the others. Since all components are orthogonal to each other, i.e., uncorrelated, the second component will be perpendicular to PC1. On the right side of Figure 2, you can see the points projected onto PC1 and PC2.
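If you want to see this projection in code, here is a quick sketch using scikit-learn on the same kind of synthetic scores as before:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D exam scores, as in the earlier sketch (numbers are made up).
rng = np.random.default_rng(0)
science = rng.normal(70, 10, size=100)
math = 0.4 * science + rng.normal(0, 4, size=100)
X = np.column_stack([science, math])

# Project the points onto the two principal components.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)  # column 0: coordinates along PC1, column 1: along PC2

print(pca.components_)                # unit vectors of PC1 and PC2 (one per row)
print(pca.explained_variance_ratio_)  # PC1's share of the variance comes first
```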
The first question is how to find the line that lies along the highest variation. The answer comes from the points projected onto that line. Suppose we draw random lines that pass through the center point (the blue point). For each candidate line, we find the distance between each projected point and the center. The line that gives the highest sum of squared distances is PC1. For example, in Figure 3, two lines have been drawn. Line 2 captures the higher variation, and that is the line that represents PC1.
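Here is a brute-force sketch of that search, again on synthetic data: it tries many directions through the center and keeps the one with the largest sum of squared projected distances. It recovers (approximately) the same direction PCA itself would find:

```python
import numpy as np

# Synthetic 2-D data, centered so every candidate line passes through the origin.
rng = np.random.default_rng(0)
science = rng.normal(70, 10, size=100)
math = 0.4 * science + rng.normal(0, 4, size=100)
X = np.column_stack([science, math])
Xc = X - X.mean(axis=0)

# Try many line directions; keep the one with the highest sum of squared distances.
best_ssd, best_direction = -np.inf, None
for angle in np.linspace(0.0, np.pi, 1800):
    d = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the candidate line
    proj = Xc @ d                                 # distance of each projected point from the center
    ssd = np.sum(proj ** 2)                       # sum of squared distances for this line
    if ssd > best_ssd:
        best_ssd, best_direction = ssd, d

print(best_direction)  # close to the PC1 direction found by PCA above
```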
The second question is how to interpret the line representing PC1. The first thing we have to do is shift the center point to the origin of the coordinate system. Afterward, we can define units based on the slope of PC1. As seen in Figure 4, the slope of PC1 is 0.39. It means that for every 2.54 units we move along Science, we go 1 unit up in Math. This shows that the data mainly spreads along the Science axis (x-axis), so Science is 2.54 times more important than Math for PC1.
Now we can define the influence of a 1-unit change of PC1 on Math and Science. As you can see from Figure 4, 1 unit of PC1 equals 0.93 units of Science and 0.36 units of Math. In other words, to obtain 1 unit of PC1, we need 0.93 units of Science and 0.36 units of Math. These values are called the “loading scores” of PC1.
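The arithmetic behind those numbers is just rescaling the slope vector to unit length; here is a tiny sketch using the slope from Figure 4:

```python
import numpy as np

slope = 0.39                        # slope of PC1, read off Figure 4
direction = np.array([1.0, slope])  # 1 unit of Science, 0.39 units of Math

# Rescale to unit length: this gives the loading scores of PC1.
loadings = direction / np.linalg.norm(direction)
print(loadings)  # ~ [0.93, 0.36]: the Science and Math parts of 1 unit of PC1
```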
Since PC2 is perpendicular to PC1, we can easily find the linear combination of Math and Science for PC2. As seen in Figure 5, a 1-unit change of PC2 equals -0.36 units of Science and 0.93 units of Math. It shows that Math is about 2.5 times more important than Science for PC2.
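In 2-D, a 90-degree rotation just swaps the two coordinates and flips one sign, so PC2 follows directly from PC1’s loading scores:

```python
import numpy as np

pc1 = np.array([0.93, 0.36])       # loading scores of PC1 (Science, Math)
pc2 = np.array([-pc1[1], pc1[0]])  # rotate PC1 by 90 degrees

print(pc2)               # [-0.36, 0.93]: loading scores of PC2
print(np.dot(pc1, pc2))  # 0.0: the components are orthogonal
```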
The values [0.93, 0.36] are also known as the eigenvector or singular vector of PC1, and [-0.36, 0.93] as the eigenvector or singular vector of PC2. Meanwhile, the sum of squared distances between the projected points and the center gives the eigenvalues. For instance, let’s assume the sum of squared distances (SSD) of PC1 is 12, and the SSD of PC2 is 3. That means PC1 explains four times more variance than PC2: 0.8 is the explained variance ratio of PC1, and 0.2 is the explained variance ratio of PC2.
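In code, turning those SSDs into explained variance ratios is a one-liner:

```python
import numpy as np

ssd = np.array([12.0, 3.0])  # sums of squared distances for PC1 and PC2

explained_variance_ratio = ssd / ssd.sum()
print(explained_variance_ratio)  # [0.8, 0.2]: PC1 explains four times as much variance
```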
Conclusions
The purpose of this article was to explain PCA more visually, so I didn’t go through much of the math for finding the principal components; I mainly tried to focus on the idea behind PCA. However, there are several straightforward ways to obtain principal components. Eigendecomposition of the covariance matrix and singular value decomposition (SVD) are the best-known techniques. These techniques give you the “loading scores”, a.k.a. eigenvectors/singular vectors, and the “sums of squared distances”, a.k.a. eigenvalues, or the explained variance ratios of the components. I recommend taking a look at these methods.
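For the curious, here is a minimal sketch comparing the two routes on the synthetic scores used throughout; both recover the same components (up to sign):

```python
import numpy as np

# Same synthetic exam scores as in the earlier sketches, centered first.
rng = np.random.default_rng(0)
science = rng.normal(70, 10, size=100)
math = 0.4 * science + rng.normal(0, 4, size=100)
Xc = np.column_stack([science, math])
Xc = Xc - Xc.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Route 2: singular value decomposition of the centered data.
U, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)

# The rows of Vt match the eigenvectors (up to sign), and
# singular_values**2 / (n - 1) reproduces the eigenvalues.
print(eigenvectors[:, ::-1].T)               # eigenvectors, highest variance first
print(Vt)
print(singular_values ** 2 / (len(Xc) - 1))  # ~ eigenvalues, highest first
print(eigenvalues[::-1])
```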