Nice visualization! This gives me an opportunity to go on a random tangent on PCA:
The post considers PCA from a visualization perspective, but exactly the same technique can also be viewed as a method for reducing the number of dimensions in the original dataset. [1] Now, one of the interesting questions in a dimensionality reduction task is: how do you pick the number of dimensions (principal components)? A good number, chosen in a principled way, instead of just computing the next component and the next and the one after that until you get bored? (That works for visualizations, where you often want only the first two or three components anyway, but suppose we want more from the data than plots.)
I recently learned that there's a fascinating way to do this, presented in Bishop's paper [2] from 1999. In short: the question can be answered by recasting PCA as a Bayesian latent variable model with a hierarchical prior. (Yes, it is a bit of a mouthful to say. Yes, it is fairly mathematical, unlike the visualization.)
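If you want to experiment with this idea without implementing the full Bayesian model, scikit-learn's PCA can also pick the dimensionality automatically via Minka's MLE criterion, which builds on the same probabilistic-PCA framework. A minimal sketch, with synthetic data standing in for a real dataset:

```python
# Sketch: automatic choice of the number of components via Minka's MLE
# (related to the probabilistic/Bayesian PCA idea). Data is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 samples that live mostly in a 5-dimensional subspace of a 20-D space,
# plus a little isotropic noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

pca = PCA(n_components="mle", svd_solver="full").fit(X)
print("chosen number of components:", pca.n_components_)  # typically 5 for data like this
```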
Frank Harrell demonstrates a way to use PCA to reduce dimensionality as part of his regression modeling strategies. His course notes (PDF) are a good reference for a range of regression strategies.
Yeah, PCA will give you the eigenvalues of the PCs in descending order of variance explained, so summing and normalizing those tells you that, say, the first 3 PCs explain 93% of the variance in the data.
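Roughly what that looks like in code (a minimal sketch with scikit-learn and synthetic data; explained_variance_ratio_ is just the eigenvalue spectrum normalized to sum to one):

```python
# Sketch: cumulative variance explained by the leading principal components.
# Synthetic low-rank-ish data; replace X with your own matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, frac in enumerate(cumulative, start=1):
    print(f"first {k} PCs explain {frac:.1%} of the variance")
```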
But the question is when exactly to stop, because the reconstruction error always keeps decreasing. The Bayesian solution, IIUC, is something like "stop when the information required to store the new PCs exceeds the information gained from the reduced reconstruction error."
The only issue with this is that if you have tons of data, there will be less uncertainty in the principal components, so it will recommend keeping as many as possible, even if they only decrease the reconstruction error a tiny bit.
This is the standard approach when looking for reduced-order models in fluid mechanics.
Variations on this:
i) How 'faithfully' does it represent the data, e.g. how many modes (components) are needed to resolve a particular metric, or the entire system, to the required accuracy?
ii) What is the cut-off component number at which the signal is of the order of the measurement uncertainty? (A rough sketch of this cut-off is below.)
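A rough sketch of variation ii) in plain NumPy, with synthetic data and a crude noise-floor estimate standing in for a real measurement uncertainty:

```python
# Sketch: truncate the SVD where the singular values reach the noise floor.
# Both the data and the noise estimate are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 50))   # 8 "true" modes
noise_level = 0.5
X = signal + noise_level * rng.normal(size=signal.shape)

U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# The largest singular value of pure Gaussian noise is roughly
# noise_level * (sqrt(m) + sqrt(n)); this is a crude cut-off, not a rigorous one.
m, n = X.shape
threshold = 1.1 * noise_level * (np.sqrt(m) + np.sqrt(n))
n_modes = int(np.sum(s > threshold))
print("modes retained above the noise floor:", n_modes)  # typically 8 here
```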
Heh, I just happened to have MATLAB on my other screen performing a PCA of a CFD simulation. I got bored of watching the progress bar and decided to browse HN a bit. I guess I cannot escape...
Hey HN. Co-author here. Crazy to see this up here again. Anyway, just letting y'all know I finished my PhD and am teaching for now, so there will be a lot more visualizations like this coming out this summer. Will get Vicapow back in the game for one last score, too.
1. PDEs
2. Lorenz attractor with a waterwheel in threejs
3. Macroscopic fundamental diagram theory of traffic flow in cities
4.???
5. Profit
I love this visualization - but I think there's a very different intuition you get from PCA in high dimensions.
I prefer to think of the singular vectors in PCA as an ordering of "prototype signals" for which some linear combination best reconstructs the data. That explains, for example, why the largest singular vectors of natural time series data give Fourier-like coefficients, and why the largest singular vectors of aligned faces give variations in lighting.
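A small NumPy sketch of that "prototype signal" view, using toy time series built from a few sinusoids (the data and the choice of k are made up for illustration):

```python
# Sketch: reconstruct data from the top-k singular vectors ("prototype signals").
# Toy data: noisy mixtures of a few sinusoids, standing in for natural time series.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 200)
# 100 series, each a random mix of three fixed "prototype" oscillations plus noise.
prototypes = np.stack([np.sin(2 * np.pi * f * t) for f in (1, 3, 7)])
X = rng.normal(size=(100, 3)) @ prototypes + 0.05 * rng.normal(size=(100, 200))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
X_k = (U[:, :k] * s[:k]) @ Vt[:k]        # best rank-k reconstruction
err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(f"relative reconstruction error with {k} prototypes: {err:.3f}")
# Vt[:k] holds the "prototype signals"; here they come out close to the sinusoids.
```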
Very nice, simple post. This is one of the topics that always made me scratch my head, as it wasn't directly applicable to training a model in Andrew Ng's Machine Learning course.
As someone who knows nothing about machine learning and nothing about PCA (well, until now :)), can someone please explain how the two relate to each other? Is one of them a subset of the other, or what?
Machine learning (and more specifically here, supervised learning) is about predicting a specific attribute of a new sample, based on the attributes of the samples that you've acquired. For example, if you have access to the database of the clients of a bank, containing their attributes such as their income, their age, their occupation and whether or not the bank accepted to give them a loan, you may want to create a system that based on this database, can predict whether a new person will get a loan or not.
It happens that having too many different features is not necessarily a good thing, a phenomenon called the curse of dimensionality.
Because of this, we are interested in reducing the number of attributes our algorithm will process. There are two big categories of methods for doing that: feature selection and feature extraction.
In feature selection, you try to select the attributes that are the "best" for predicting your value, for example by computing the statistical correlation between each attribute and the value you want to predict and choosing those with the highest correlations.
In feature extraction, you create new attributes that are linear combinations of the original attributes. PCA is a feature extraction algorithm.
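A minimal scikit-learn sketch of the two approaches side by side, on synthetic data (SelectKBest with an F-test score stands in for correlation-based selection):

```python
# Sketch: feature selection vs. feature extraction on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Feature selection: keep the 5 original attributes most related to the target.
selected = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)

# Feature extraction: build 5 new attributes as linear combinations (PCA).
extracted = PCA(n_components=5).fit_transform(X)

print(selected.shape, extracted.shape)  # both (200, 5), but different attributes
```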
It is often used when you have an abundance of measured input variables that you suspect might be highly correlated. For example, in a study of self-reported lifestyle behaviors you might have questions about the frequency of participation in: jogging, walking, running, weight lifting, cycling, aerobics classes, yoga, CrossFit, martial arts, climbing, tennis, softball, volleyball, golf, Ultimate Frisbee, and many more.
In your effort to predict whether a person will follow dietary guidelines for healthy eating, you could just assign each activity as its own input to the model. Or you could apply PCA (and something like varimax factor rotation), and what you might find is that these activities reflect three somewhat separable latent variables: physical fitness, competitive athletics, and a friendship/team-based social activity dimension. You have now potentially reduced 50 individual activity measures to 3 dimensions.
Next you would think more deeply about the specific items, combine them into 3 scales, and use those scales as a reduced-dimensional input to the predictive model.
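A hedged sketch of that workflow using scikit-learn's FactorAnalysis, which supports a varimax rotation; the activity data below is entirely made up:

```python
# Sketch: look for a small number of rotated latent dimensions behind many
# correlated activity frequencies. All data here is synthetic.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n_people, n_activities, n_latent = 300, 15, 3
latent = rng.normal(size=(n_people, n_latent))        # e.g. fitness, competition, social
loadings = rng.normal(size=(n_latent, n_activities))
activities = latent @ loadings + 0.3 * rng.normal(size=(n_people, n_activities))

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(activities)
scores = fa.transform(activities)   # 3 columns, one per candidate latent dimension
print(scores.shape)                 # (300, 3): the reduced "scales" per person
# Inspect fa.components_ to see which activities load on which dimension.
```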
In very broad terms, PCA can be thought of as a pre-processing step to reduce the original data set to the "components" which account for the most variation in the data.
Essentially it's distilling your data down to what is most relevant, which helps, say, a classification algorithm work better by training only on the reduced, "more manageable" data.
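In code, that pre-processing step often looks something like this (a sketch with scikit-learn and synthetic data; the choice of 10 components is arbitrary):

```python
# Sketch: PCA as a pre-processing step in front of a classifier.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, n_informative=8, random_state=0)

model = make_pipeline(
    StandardScaler(),        # PCA is sensitive to feature scales
    PCA(n_components=10),    # keep only the 10 highest-variance directions
    LogisticRegression(max_iter=1000),
)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```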
[1] https://en.wikipedia.org/wiki/Dimensionality_reduction
[2] https://www.microsoft.com/en-us/research/publication/bayesia...