Overview
This lesson introduces factor analysis.
Objectives
After completing this module, students should be able to:
Reading
NA
Factor analysis is the search for these underlying dimensions, or factors, underneath a larger set of variables. Factor analysis actually long predates the era of big data and machine learning, and has a long history of different methods, most of which produce similar results.
Usually we find strong underlying factors when the variables we measure are correlated with each other. For instance, when x and y values are well-correlated (as in the previous figure) creating a diagonal (American) football shape, then we can speculate that there is actually one underlying factor that explains them both.
In the figure on the right, there is mainly a single underlying factor explaining the data, and x and y are just two measurements of basically the same thing (eg, x and y could be length of the right leg and length of the left leg, and the underlying factor is a person’s height). In the figure on the left, x and y are uncorrelated, and thus there are truly two different variables at play here. And in the middle is something in between: there is a strong diagonal latent variable, but also a lot of variation around that diagonal line, which may constitute a second latent dimension.
So how do we find these underlying dimensions? With two dimensions, one might guess that regression is the answer – just regress y on x to get the line – and that’s close to right, but it breaks down in more than two dimensions. In two dimensions, the main underlying factor, sometimes called the first principal component (we’ll return to this in a moment), is along the main axis of the data, and the second principal component is perpendicular to this, as in the figure on the left:
In three dimensions, the pattern is the same: the first component is along the longest axis of the data, the second component is the second-longest axis perpendicular to the first, and so on. See the image on the right, where red is the first, blue the second, and green the third. (You can interact with this graph to get a better visual sense here.) In fact, there are as many latent dimensions as there are variables (\(p\)), although usually all we care about are the first few, since the later ones are by definition smaller and account for less of the variation in the data.
There are two different ways to interpret these latent dimensions. First, we can consider where each observation falls on each dimension. That is, returning to our original political example, someone with a high score on x and y (social and political conservatism, say) would have a high score on the latent diagonal line – they would be far out towards the upper-right, which is to say, they are basically right-wing. Someone with a high score on x but a low score on y, or vice versa, would fall towards the middle of the diagonal line; and someone with a low score on x and y would fall towards the left. In the first case, they would have a high score on the latent dimension (imagining the diagonal line is a new axis); in the second case a middling score; and in the third case a low score. Basically, you take each point and project it directly onto the latent dimension.
The gif below shows how, as the latent dimension changes (it can be any direction, although there is only one best direction (purple)), the projection of all the points (ie, where they fall on the latent dimension as defined by the red lines connecting the points to the diagonal line) changes. Points that have a high score for a latent dimension direction at 45 degrees might have a very different score for a latent dimension that is completely vertical. Luckily, we don’t have to calculate the projections for every direction, just the direction of the first few principal components.
(Illustration from here, which is also a good discussion of PCA more generally.)
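To make the idea of projection concrete, here is a minimal sketch in R (using made-up data, not the data behind the figures above) that projects a cloud of correlated points onto two candidate directions and compares how much variance each projection captures:

# Simulated "football" of correlated points (illustrative only)
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.4)
X <- cbind(x, y)

u45 <- c(1, 1) / sqrt(2)   # unit vector at 45 degrees
u90 <- c(0, 1)             # unit vector pointing straight up

score45 <- X %*% u45       # projection (score) of every point on the 45-degree direction
score90 <- X %*% u90       # projection on the vertical direction

var(score45)               # larger: the diagonal direction captures more of the spread
var(score90)               # smaller: the vertical direction captures less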
The second way to interpret the latent dimension is in terms not of the observations (what each individual score is on these new axes), but in terms of the variables themselves. Consider the following data:
Here, although there is clearly an underlying factor that explains much of the variation in the data, that dimension is fairly similar to x, and fairly orthogonal (perpendicular) to y. That is, the projection of the principal component on x is large, while the projection on y is small. If these were political opinions again, we would conclude that the political world is mainly determined by one’s social views, and only slightly affected by one’s economic views; while there is still basically just one main political dimension (the first principal component, the blue line), that main political dimension isn’t all that different from the x variable, and the scores individuals would have on x would be similar to the scores they have on the first component.
So when we have found our principal components, then, we can ask two things. First, how do the observations project onto the components? That is, what are the individual scores, not in terms of the original variables, but in terms of the new, transformed dimensions? Second, how do the original variables project onto the components: that is (roughly speaking), what proportion of each of the original variables makes up the new one? Is it 50% x and 30% y and 20% z (ie, is it pointing mainly in the x direction), or is it 10% x, 10% y, and 80% z (ie, pointing almost entirely in the z direction, and not very different from z)? Etc.
As we will see, when we do our factor analysis, we get both. But in terms of understanding the system we are examining, we are mainly interested in the second: how do our new variables (factors or components; I use the terms interchangeably here) explain and relate to the old ones? Only in some applications do we really care about the projections of the observations. One example of the latter is IQ. IQ is perhaps the most famous application of factor analysis: it takes scores on a bunch of different types of mental tests, and finds that they are all strongly correlated with each other and thus can be “explained” by a single underlying factor; when you take an IQ test, you actually get scores on all these separate variables, and then your IQ is the underlying factor that combines these scores into one.
As you might imagine from the fact that the components are straight lines, the equation for each component in terms of the existing variables \(X_1\), \(X_2\), etc, is a simple linear equation
\[C_p = \phi_{p1} X_1 + \phi_{p2} X_2 + \phi_{p3} X_3 + \dots + \phi_{pp} X_p\]
Where \(C_p\) is the \(p\)th component, and as said before, there are always as many components as there are variables (\(p\)). The first component, \(C_1\), is defined by the values \(\{ \phi_{11}, \phi_{12},...\}\) (which, like \(\beta_i\), are just numbers), and in this equation, just as \(X_1\) is a vector of \(n\) observations, \(C_p\) is also a vector of size \(n\): the projection of the observations onto that component. What we want is to find those values \(\{ \phi_{11}, \phi_{12},...\}\) such that the variance of \(C_1\) is maximized – ie, we want to find the latent dimension (line in space) that goes through the longest axis of the “football” of data. After we’ve found the first component \(C_1\), we can then find \(C_2\), the dimension through the next longest axis of the data football, and so on. And the values \(\{ \phi_{11}, \phi_{12},...\}\) are of course the projections of the original variables onto \(C_1\) – the larger \(\phi_{12}\) is, for instance, the larger the projection of \(X_2\) on \(C_1\) – ie, the more similar \(C_1\) is to \(X_2\), or the more similar their directions in space are.
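As a quick check of this formula, here is a small sketch (again with made-up data) using R’s built-in prcomp(): the first-component scores it returns are exactly this linear combination of the (centered) variables, with the weights \(\phi_{11}, \phi_{12},...\) stored in its rotation matrix.

set.seed(2)
X <- cbind(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
pca <- prcomp(X)                        # centers the variables by default

phi1   <- pca$rotation[, 1]             # the weights phi_11, phi_12, phi_13
scores <- scale(X, center = TRUE, scale = FALSE) %*% phi1   # C_1 by hand

all.equal(as.numeric(scores), as.numeric(pca$x[, 1]))       # TRUE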
So how do we find these values \(\{ \phi_{p1}, \phi_{p2},...\}\)? There are a number of methods, and what is powerful about factor analysis (FA) or principal component analysis (PCA) is that all these methods tend to produce very similar results. In fact, factor analysis and principal component analysis aren’t quite the same thing, but rather than worrying about that, we can trust in the fact that fundamentally they tend to produce the same dimensions.
The most common estimation of principal components is using the covariance matrix of the variables. That is, if \(X_1\) and \(X_3\) are highly correlated (and thus have a high covariance), then since an observation’s scores on \(X_1\) and \(X_3\) are very similar, we would likewise expect \(X_1\) and \(X_3\) to have similar scores on any given component. The covariance matrix presents us with all the covariances between every pair of variables, and what we want to do is use this matrix to find which variables are jointly both highly correlated, and have a high degree of variance. One flaw with this approach as described is that the variance of a variable can depend on its units: measured in $10K units, the variance of incomes in the US is one thing, but measured in $1 units, the variance is much larger. So generally we need to standardize (rescale) everything to have a mean of 0 and sd of 1 before we conduct our analysis – just as one does in standardized regression to allow comparison of \(\beta\) sizes between X variables.
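A quick sketch of why the rescaling matters, using made-up variables: after scale(), the covariance matrix of the rescaled data is just the correlation matrix of the original, so the units no longer dominate.

set.seed(3)
X <- cbind(income = rnorm(100, mean = 50, sd = 10),   # made-up income, big units
           age    = rnorm(100, mean = 40, sd = 12))   # made-up age, small units
cov(X)                            # variances depend on the units chosen
all.equal(cov(scale(X)), cor(X))  # TRUE: after standardizing, cov = cor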
Once we’ve got our covariance matrix, we then need to find the component (direction) with the highest variance. This is often done using the eigenvectors of the covariance matrix. Explaining this in perfect detail would take us too far afield, but the eigenvectors of a matrix are much like the principal components of some data set: a \(p\) x \(p\) matrix has exactly \(p\) eigenvectors, where each eigenvector is a vector of length \(p\). An eigenvector is defined as that vector such that, when multiplied by its associated matrix (using matrix multiplication; see Module 1.2), gives as a result the exact same vector, except stretched or shrunk by some value known as the eigenvalue. The equation is
\[M v_p = \lambda_p v_p\]
That is, matrix M times vector \(v_p\) equals that vector \(v_p\) times some real number \(\lambda_p\). Each eigenvector \(v_p\) has its own associated eigenvalue \(\lambda_p\) that defines how stretched or shrunk it is when multiplied by M. And crucially, there are no other vectors besides the eigenvectors that, when multiplied by M, give you a vector pointing in the same direction; every other vector ends up pointing in some different direction after multiplying by M.
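A small sketch to verify the defining equation with R’s eigen(), on a toy symmetric matrix (covariance matrices are always symmetric):

M <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)   # toy symmetric "covariance" matrix
e <- eigen(M)

v1      <- e$vectors[, 1]   # first eigenvector
lambda1 <- e$values[1]      # its eigenvalue

M %*% v1        # multiplying by M ...
lambda1 * v1    # ... just stretches v1 by lambda1 (same direction)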
So why do we care about eigenvectors? Well, it turns out that eigenvectors turn up all over mathematics and science, from quantum physics to social science to image recognition and self-driving cars. Like principal components, they give you a reduced version of the data that retains much of the original information, but in a smaller form. And for temporal processes, they often describe where complex temporal structures with lots of feedback end up in equilibrium over time. If you haven’t heard of them, they are probably the most important mathematical construct you’ve never heard of.
So getting back to the math, every matrix M has \(p\) eigenvectors and \(p\) eigenvalues associated with it: \(\{ v_1,\lambda_1\},\{ v_2,\lambda_2\},\{ v_3,\lambda_3\}...\). We order them by the size of \(\lambda_p\): the eigenvector with the largest \(\lambda_p\) is the first eigenvector (and thus the first principal component), and so on for all the rest. And because eigenvector-based analysis is so ubiquitous, there are a lot of standard methods for finding the \(p\) pairs \(\{ v_1,\lambda_1\},\{ v_2,\lambda_2\},\{ v_3,\lambda_3\}...\) for any given matrix M.
So to summarize, one classic procedure for finding the principal components (ie, the values \(\{ \phi_{p1}, \phi_{p2},...\}\) and the projections \(C_1, C_2...\)) is:

1. Standardize (rescale) each variable to have a mean of 0 and sd of 1.
2. Calculate the covariance matrix of the standardized variables.
3. Find the eigenvectors and eigenvalues of that matrix.
4. Order the eigenvectors by the size of their eigenvalues: these are the components, and the entries of each eigenvector are the loadings \(\{ \phi_{p1}, \phi_{p2},...\}\).
5. Project the observations onto each component to get the scores \(C_1, C_2...\).
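Sketched in code, with a generic made-up matrix X standing in for whatever data you have, the procedure looks roughly like this:

X  <- matrix(rnorm(100 * 3), ncol = 3)   # placeholder data: 100 observations, 3 variables
Xs <- scale(X)                           # 1. standardize (mean 0, sd 1)
S  <- cov(Xs)                            # 2. covariance (= correlation) matrix
e  <- eigen(S)                           # 3. eigenvectors and eigenvalues

phi    <- e$vectors                      # 4. loadings: each column is a component
scores <- Xs %*% phi                     # 5. projections of the observations
e$values                                 #    eigenvalues, largest first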
To put this into more concrete terms, let’s look at some real data. Let’s use the psych package, a package of useful functions and example datasets for analyzing psychological data. Inside, it has the “Motivational State Questionnaire” dataset, a set of 3896 individual responses to a battery of mood questions. The subset we’re interested in here is 71 questions that were asked of each subject about their mood at the time of the survey: how happy, anxious, alert, sleepy, surprised, etc, etc, they were at that moment on a 0-3 scale. The substantive question is: there are a lot of mood questions out there, but do people really have 71 different moods at any given time? Or are there a few underlying emotional factors that explain most of how you are at any given moment, and the myriad specific moods are just different ways of measuring the simpler, underlying factors? For instance, in the study that collected these data, the researchers started with the common assumption that one of the dominant underlying factors is whether you are happy or sad at any given moment; this level of happiness vs sadness shapes many of the other moods people report at any given moment, and is sort of like a basic thermometer reading for anyone at any time. At least, that is the common theory these scientists were testing.
So let’s get the data, clean it up a bit, and take a quick look at it:
#install.packages("psych")
library(psych)
msq1 <- psychTools::msq[,2:72]
msq1[msq1=="9"] = NA
msq1 <- data.frame(msq1)
msq2 <- na.omit(msq1)
names(msq2)
[1] "afraid" "alert" "angry" "anxious"
[5] "aroused" "ashamed" "astonished" "at.ease"
[9] "at.rest" "attentive" "blue" "bored"
[13] "calm" "cheerful" "clutched.up" "confident"
[17] "content" "delighted" "depressed" "determined"
[21] "distressed" "drowsy" "dull" "elated"
[25] "energetic" "enthusiastic" "excited" "fearful"
[29] "frustrated" "full.of.pep" "gloomy" "grouchy"
[33] "guilty" "happy" "hostile" "idle"
[37] "inactive" "inspired" "intense" "interested"
[41] "irritable" "jittery" "lively" "lonely"
[45] "nervous" "placid" "pleased" "proud"
[49] "quiescent" "quiet" "relaxed" "sad"
[53] "satisfied" "scared" "serene" "sleepy"
[57] "sluggish" "sociable" "sorry" "still"
[61] "strong" "surprised" "tense" "tired"
[65] "tranquil" "unhappy" "upset" "vigorous"
[69] "wakeful" "warmhearted" "wide.awake"
dim(msq2)
[1] 1747 71
msq2[1:5,1:5]
afraid alert angry anxious aroused
1 1 1 0 1 1
2 0 1 0 0 0
3 0 0 0 0 0
4 0 1 0 1 1
103 0 0 1 0 0
That’s a lot of columns and a lot of variables. Let’s do some factor analysis on it to see what, if any, are the underlying dimensions.
The psych package also has a very robust factor analysis function, fa(). It actually does much more than simply finding the eigenvectors of the covariance matrix. One of the most important tweaks is that it rotates the principal components a little bit to nudge things so that each component loads only on a few variables, and not on any of the rest. Without the rotation, we might get for the first principal component that it is 40% X1, 30% X2, 20% X3, and 10% X4; after the rotation, we might get 60% X1 and 40% X2, and none of the rest. But as we will see, often this extra rotation doesn’t make much difference, and ultimately we end up with the same eigenvectors as ever.
fact <- fa(msq2,nfactors=2)
fact1 <- fact$loadings[,1]
Here we are only calculating the first two factors, but that’s enough for our purposes for now. The second line extracts from the output the loadings, or the projections of the variables onto the factors, \(\{ \phi_{11}, \phi_{12},...\}\). These are essential for interpreting the factors: if the first factor loads highly positive on all the happy variables and highly negative on all the unhappy variables, then we can know this factor is probably a happiness-unhappiness dimension. But if it loads on some other set of variables, maybe some other underlying factor explains the biggest variance in mood.
The best way to read the factor, then, is to look not at the loadings as ordered by the variables (ie, \(\{ \phi_{11}, \phi_{12},...\}\)), but rather to order them from largest to smallest, so we can see what variables load most positively, and what variables load most negatively: ie, what are the two poles along which the first latent dimension varies.
fact1[order(fact1)]
sluggish tired dull gloomy sleepy
-0.57105799 -0.54881685 -0.54750768 -0.53205954 -0.50569655
grouchy drowsy depressed unhappy irritable
-0.50203289 -0.49831594 -0.49528362 -0.48878774 -0.47579067
inactive blue sad upset frustrated
-0.46837928 -0.46244573 -0.41256029 -0.41080632 -0.40251644
bored hostile distressed lonely angry
-0.37434355 -0.35435492 -0.35003093 -0.34831253 -0.32835989
quiet idle sorry ashamed fearful
-0.29948657 -0.27307778 -0.26551905 -0.23102291 -0.20674325
tense afraid clutched.up guilty scared
-0.20531701 -0.19916658 -0.19462196 -0.17660533 -0.17244643
still nervous placid anxious quiescent
-0.14494122 -0.10041209 -0.04945854 0.01036319 0.07525211
jittery astonished tranquil surprised intense
0.09560217 0.14366556 0.17480685 0.22752845 0.23026932
serene calm at.rest relaxed at.ease
0.24368851 0.26822107 0.30673400 0.38998166 0.47890593
determined strong inspired proud aroused
0.49890726 0.54461269 0.56684424 0.57895328 0.58282944
interested warmhearted confident delighted vigorous
0.59668718 0.61225943 0.62979990 0.64521495 0.64831274
wakeful attentive satisfied elated wide.awake
0.65002080 0.66440296 0.66796561 0.66890261 0.67136906
sociable content alert excited pleased
0.67158295 0.67492386 0.68666737 0.69832570 0.72470873
enthusiastic full.of.pep happy lively energetic
0.75267586 0.77269967 0.77920857 0.78151569 0.78457320
cheerful
0.78675772
This is a lot to look at, but we can immediately tell what variables go into the first factor by looking at the lowest- and highest-scoring variables (the sign, ie, which direction is positive vs negative, is arbitrary and meaningless). What we see is that while one end is happy, enthusiastic, cheerful, full.of.pep, lively, and energetic, the other side is not unhappy so much as sluggish, tired, dull, sleepy, etc. Ie, the most important factor underlying all these 71 moods is not happy vs unhappy, but more like energetic (and happy) vs tired, sleepy, and depressed. An important discovery!
So what is the second factor? Is it happy/unhappy? Or some other dimension of mood? We can again examine it from the fa() output:
fact2 <- fact$loadings[,2]
fact2[order(fact2)]
calm relaxed at.ease tranquil serene
-0.352885876 -0.329328504 -0.314019286 -0.296146699 -0.239425721
at.rest still inactive placid idle
-0.220366409 -0.210805035 -0.181192225 -0.158556289 -0.130717864
content tired drowsy sleepy satisfied
-0.092566444 -0.085640657 -0.084743296 -0.074466172 -0.052861670
sluggish quiet bored quiescent dull
-0.051250126 -0.047187091 -0.037158459 0.008404442 0.017418666
warmhearted happy confident pleased sociable
0.030302370 0.049778277 0.062769206 0.103344356 0.105867790
delighted cheerful proud interested wakeful
0.144332820 0.145129497 0.191750307 0.205150300 0.236644249
enthusiastic attentive wide.awake strong alert
0.245216122 0.251307405 0.267979328 0.275439171 0.276002665
elated full.of.pep lively energetic vigorous
0.277534264 0.302719139 0.306414361 0.334074266 0.354623629
aroused grouchy surprised excited inspired
0.374395194 0.383309096 0.384160442 0.387856349 0.389399539
lonely astonished irritable determined gloomy
0.413086287 0.430430321 0.434064713 0.440353948 0.446182126
hostile blue jittery depressed guilty
0.450331361 0.473217869 0.482896170 0.487026145 0.490809488
unhappy ashamed sorry sad intense
0.515707520 0.527920666 0.543121600 0.549980128 0.555921589
clutched.up angry anxious afraid fearful
0.573135634 0.575964542 0.585729159 0.598757261 0.599046272
upset scared frustrated nervous distressed
0.602792835 0.616661476 0.619454282 0.629479344 0.637585044
tense
0.674806887
Now we see a second dimension that is interestingly different from the first: not energetic vs sleepy, but relaxed and calm versus tense and frustrated. Here too, the factor is not quite what we expected, but at the same time familiar and plausible.
Within the fa() output are also the projections of individuals onto each of these dimensions (ie, how energetic-vs-tired someone seems to be, and how calm-vs-tense they seem to be, and so on for all the other factors we might estimate), and a large number of other diagnostics and outputs (too many to show here!). There’s a whole world to explore in factor analysis, but the basic truth is that the fundamental factors are the same however we estimate them.
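As a quick illustration, those individual projections can be pulled straight from the fa() object (assuming the default scores="regression" and that raw data, rather than a correlation matrix, were passed in):

head(fact$scores)   # each respondent's position on the two estimated factors
cor(fact$scores)    # the factors may correlate, since fa()'s default rotation is oblique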
For instance, here are two other methods for estimating principal components, both built into base R, but using different approaches. The first uses singular value decomposition (another unsupervised approach similar to factor analysis), and the second uses the covariance/eigenvector approach.
# prcomp method using SVD
pcaA <- prcomp(msq2)
pcaA1 <- pcaA$rotation[,1]
# Princomp method using eigen of cov
pcaB <- princomp(msq2)
pcaB1 <- pcaB$loadings[,1]
Finally, let’s do it manually ourselves using cov() to estimate the covariance matrix and eigen() to get the eigenvectors:
# Direct eigen of cov
covm <- cov(msq2)
eigenm <- eigen(covm)
eigen1 <- eigenm$vectors[,1]
How similar are these methods?
f1mat <- cbind(fact1,pcaA1,pcaB1,eigen1)
cor(f1mat)
fact1 pcaA1 pcaB1 eigen1
fact1 1.0000000 0.9836389 -0.9836389 -0.9836389
pcaA1 0.9836389 1.0000000 -1.0000000 -1.0000000
pcaB1 -0.9836389 -1.0000000 1.0000000 1.0000000
eigen1 -0.9836389 -1.0000000 1.0000000 1.0000000
Essentially identical, with fa() being slightly different due to the rotation – but not very different.
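If you want to see how much of that small difference is due to the rotation itself, you can re-run fa() with the rotation turned off and redo the comparison (a quick check, assuming fa()'s rotate argument; "none" skips the rotation step):

fact_unrot <- fa(msq2, nfactors=2, rotate="none")   # same model, no rotation
cor(fact_unrot$loadings[,1], pcaA1)                 # compare to the SVD-based loadings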
One final method for estimating the components directly from the data is perhaps a bit surprising. Because eigenvectors are, in a sense, both the underlying truth of the data and the equilibrium of any process that the data describes, one additional way to find the first eigenvector / principal component is to take any random vector \(v_a\) of length \(p\) and multiply it by the data matrix; then take the resulting vector \(v_b\) (now of length \(n\)) and multiply it back through the matrix to get another vector \(v_a'\), and so on back and forth. If you keep doing this a few times, ultimately the process converges so that the final \(v_a\) is almost identical to the first eigenvector. (This is basically how singular value decomposition works.) It’s a crude but very effective way to get the first component (and can be adapted to get the others), and works both quickly and well on even very large datasets.
tm.msq2 <- t(as.matrix(msq2))   # transposed data matrix (p x n)
m.msq2 <- as.matrix(msq2)       # data matrix (n x p)
va <- rnorm(ncol(msq2))         # start from a random vector of length p
for(i in 1:10){
  vb <- m.msq2 %*% scale(va)    # project forward through the data: length-n vector
  va <- tm.msq2 %*% scale(vb)   # and back again: length-p vector
}
(Note how we have to rescale va and vb each time, otherwise they increase because the eigenvalue stretches the eigenvector each time it’s multiplied through the matrix. We only care about the direction of v (ie, the relative sizes of the loadings), not the absolute size, so we rescale it to keep it from getting unwieldy big.)
So does the final va match the first component?
cor(va,pcaA1)
[,1]
[1,] -0.999497
Yes it does.
Factor analysis can theoretically produce as many factors as there were variables originally (\(p\)), but since one of the purposes of factor analysis is to reduce a high-\(p\) dataset to a smaller set of factors, one might naturally wonder: how many factors do we have to keep? Can a 100-variable dataset be reduced to just three factors that explain most of the variation? How do we know how many to keep?
It turns out this question is a vexed and ongoing one, and has been for decades. Ideally we would want a nice statistical test to tell us that the first k factors are statistically significant in some sense, and the remaining ones are just picking up on random noise. But despite many such tests having been suggested, the community has not really settled on a single best one.
Perhaps one of the oldest and most established methods is looking at the eigenvalues, which in a sense determine the relative importance of each of the factors: the bigger the eigenvalue, the more of the variation in the original data that eigenvector/factor explains.
One common way of examining these values is to plot them, ordered by size, in something called a “scree plot” (because it looks like the scree on the side of a mountain). All of our estimation methods above include these values as well as the factors, and here is a simple scree plot of them from our manual eigenvector method.
plot(eigenm$values,type="b")
There are two common criteria to use for choosing the number of factors based on an examination of these values. First, one looks for the “elbow” in the curve – where it goes from the steep decline to the flat area, the presumption being that the flat area contains the factors that are just noise. So in this figure, counting from the left, it might be 5 or so factors that are significant, and the rest are just noise. Another method is to keep only those factors with values above 1 (because that distinguishes eigenvectors that grow vs those that shrink), which in this case yields 6 factors to retain – so the two methods are in fairly close agreement.
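The second rule is a one-liner to check from the eigenvalues we already computed:

sum(eigenm$values > 1)   # number of eigenvalues above 1, ie factors retained by this rule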
One final way to see this is to plot the proportion of the total variance in the data explained by the factors as you add them together. Since the eigenvalues are equivalent to the amount of variance each component explains in the original data, we can just do the same plot as before, but use the cumsum function to plot the cumulative amount (cumsum just turns a vector \(v\) of \(i\) values into a vector \(c\) where each value \(c_i\) is the sum of all the \(v_j\) up to the \(i\)th), and normalize it by dividing by the total.
plot(cumsum(eigenm$values)/sum(eigenm$values),ylim=c(0,1))
We can see that this is a fairly smooth curve, with a bit of an elbow at around the fourth or fifth value – about the same as before. More interestingly, over 50% of the total variance in the data is explained by just those first 4 factors, so only 4 out of 71 factors already explain over half of the variation. This, plus the previous tests, suggests that 4-6 factors do explain quite a lot of what’s going on in the variation of mood among people over time. We may not be 70-dimensional emotional thinkers, but 4-6 seems like a plausibly complex psychological model.