Overview
This lesson introduces factor analysis.
Objectives
After completing this module, students should be able to:
Reading
NA
Factor analysis is the search for these underlying dimensions, or factors, underneath a larger set of variables. Factor analysis actually long predates the era of big data and machine learning, and has a long history of different methods, most of which produce similar results.
Usually we find strong underlying factors when the variables we measure are correlated with each other. For instance, when x and y values are well-correlated (as in the previous figure) creating a diagonal (American) football shape, then we can speculate that there is actually one underlying factor that explains them both.
In the figure on the right, there is mainly a single underlying factor explaining the data, and x and y are just two measurements of basically the same thing (eg, x and y could be length of the right leg and length of the left leg, and the underlying factor is a person’s height). In the figure on the left, x and y are uncorrelated, and thus there are truly two different variables at play here. And in the middle is something in between: there is a strong diagonal latent variable, but also a lot of variation around that diagonal line, which may constitute a second latent dimension.
So how do we find these underlying dimensions? With two dimensions, one might guess that regression is the answer – just regress y on x to get the line – and that’s close to right, but it breaks down in more than two dimensions. In two dimensions, the main underlying factor, sometimes called the first principal component (we’ll return to this in a moment), is along the main axis of the data, and the second principal component is perpendicular to this, as in the figure on the left:
In three dimensions, the pattern is the same: the first component is along the longest axis of the data, the second component is the second-longest axis perpendicular to the first, and so on. See the image on the right, where red is the first, blue the second, and green the third. (You can interact with this graph to get a better visual sense here.) In fact, there are as many latent dimensions as there are variables (\(p\)), although usually all we care about are the first few, since the later ones are by definition smaller and account for less of the variation in the data.
There are two different ways to interpret these latent dimensions. First, we can consider where each observation falls on each dimension. That is, returning to our original political example, someone with a high score on x and y (social and political conservatism, say) would have a high score on the latent diagonal line – they would be far out towards the upper-right, which is to say, they are basically right-wing. Someone with a high score on x but a low score on y, or vice versa, would fall towards the middle of the diagonal line; and someone with a low score on x and y would fall towards the left. In the first case, they would have a high score on the latent dimension (imagining the diagonal line is a new axis); in the second case a middling score; and in the third case a low score. Basically, you take each point and project it directly onto the latent dimension.
The gif below shows how, as the latent dimension changes (it can be any direction, although there is only one best direction (purple)), the projection of all the points (ie, where they fall on the latent dimension as defined by the red lines connecting the points to the diagonal line) changes. Points that have a high score for a latent dimension direction at 45 degrees might have a very different score for a latent dimension that is completely vertical. Luckily, we don’t have to calculate the projections for every direction, just the direction of the first few principal components.
(Illustration from here, which is also a good discussion of PCA more generally.)
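To make the idea of projection concrete, here is a minimal sketch in R (using made-up data, not the data behind the figures above) that projects a cloud of correlated points onto two candidate directions and compares how much variance each projection captures:

# Simulated "football" of correlated points (illustrative only)
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.4)
X <- cbind(x, y)

u45 <- c(1, 1) / sqrt(2)   # unit vector at 45 degrees
u90 <- c(0, 1)             # unit vector pointing straight up

score45 <- X %*% u45       # projection (score) of every point on the 45-degree direction
score90 <- X %*% u90       # projection on the vertical direction

var(score45)               # larger: the diagonal direction captures more of the spread
var(score90)               # smaller: the vertical direction captures less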
The second way to interpret the latent dimension is in terms not of the observations (what each individual score is on these new axes), but in terms of the variables themselves. Consider the following data:
Here, although there is clearly an underlying factor that explains much of the variation in the data, that dimension is fairly similar to x, and fairly orthogonal (perpendicular) to y. That is, the projection of the principal component on x is large, while the projection on y is small. If these were political opinions again, we would conclude that the political world is mainly determined by one’s social views, and only slightly affected by one’s economic views; while there is still basically just one main political dimension (the first principal component, the blue line), that main political dimension isn’t all that different from the x variable, and the scores individuals would have on x would be similar to the scores they have on the first component.
So when we have found our principal components, then, we can ask two things. First, how do the observations project onto the components? That is, what are the individual scores, not in terms of the original variables, but in terms of the new, transformed dimensions? Second, how do the original variables project onto the components: that is (roughly speaking), what proportion of each of the original variables makes up the new one? Is it 50% x and 30% y and 20% z (ie, is it pointing mainly in the x direction), or is it 10% x, 10% y, and 80% z (ie, pointing almost entirely in the z direction, and not very different from z)? Etc.
As we will see, when we do our factor analysis, we get both. But in terms of understanding the system we are examining, we are mainly interested in the second: how do our new variables (factors or components; I use the terms interchangeably here) explain and relate to the old ones? Only in some applications do we really care about the projections of the observations. One example of the latter is IQ. IQ is perhaps the most famous application of factor analysis: it takes scores on a bunch of different types of mental tests, and finds that they are all strongly correlated with each other and thus can be “explained” by a single underlying factor; when you take an IQ test, you actually get scores on all these separate variables, and then your IQ is the underlying factor that combines these scores into one.
As you might imagine from the fact that the components are straight lines, the equation for each component in terms of the existing variables \(X_1\), \(X_2\), etc, is a simple linear equation
\[C_p = \phi_{p1} X_1 + \phi_{p2} X_2 + \phi_{p3} X_3 + \dots + \phi_{pp} X_p\]
Where \(C_p\) is the \(p\)th component, and as said before, there are always as many components as there are variables (\(p\)). The first component, \(C_1\), is defined by the values \(\{ \phi_{11}, \phi_{12},...\}\) (which, like \(\beta_i\), are just numbers), and in this equation, just as \(X_1\) is a vector of \(n\) observations, \(C_p\) is also a vector of size \(n\): the projection of the observations onto that component. What we want is to find those values \(\{ \phi_{11}, \phi_{12},...\}\) such that the variance of \(C_1\) is maximized – ie, we want to find the latent dimension (line in space) that goes through the longest axis of the “football” of data. After we’ve found the first component \(C_1\), we can then find \(C_2\), the dimension through the next longest axis of the data football, and so on. And the values \(\{ \phi_{11}, \phi_{12},...\}\) are of course the projections of the original variables onto \(C_1\) – the larger \(\phi_{12}\) is, for instance, the larger the projection of \(X_2\) on \(C_1\) – ie, the more similar \(C_1\) is to \(X_2\), or the more similar their directions in space are.
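As a quick check of this formula, here is a small sketch (again with made-up data) using R’s built-in prcomp(): the first-component scores it returns are exactly this linear combination of the (centered) variables, with the weights \(\phi_{11}, \phi_{12},...\) stored in its rotation matrix.

set.seed(2)
X <- cbind(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
pca <- prcomp(X)                        # centers the variables by default

phi1   <- pca$rotation[, 1]             # the weights phi_11, phi_12, phi_13
scores <- scale(X, center = TRUE, scale = FALSE) %*% phi1   # C_1 by hand

all.equal(as.numeric(scores), as.numeric(pca$x[, 1]))       # TRUE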
So how do we find these values \(\{ \phi_{p1}, \phi_{p2},...\}\)? There are a number of methods, and what is powerful about factor analysis (FA) or principal component analysis (PCA) is that all these methods tend to produce very similar results. In fact, factor analysis and principal component analysis aren’t quite the same thing, but rather than worrying about that, we can trust in the fact that fundamentally they tend to produce the same dimensions.
The most common estimation of principal components is using the covariance matrix of the variables. That is, if \(X_1\) and \(X_3\) are highly correlated (and thus have a high covariance), then since an observation’s scores on \(X_1\) and \(X_3\) are very similar, we would likewise expect \(X_1\) and \(X_3\) to have similar scores on any given component. The covariance matrix presents us with all the covariances between every pair of variables, and what we want to do is use this matrix to find which variables are jointly both highly correlated, and have a high degree of variance. One flaw with this approach as described is that the variance of a variable can depend on its units: measured in $10K units, the variance of incomes in the US is one thing, but measured in $1 units, the variance is much larger. So generally we need to standardize (rescale) everything to have a mean of 0 and sd of 1 before we conduct our analysis – just as one does in standardized regression to allow comparison of \(\beta\) sizes between X variables.
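A quick sketch of why the rescaling matters, using made-up variables: after scale(), the covariance matrix of the rescaled data is just the correlation matrix of the original, so the units no longer dominate.

set.seed(3)
X <- cbind(income = rnorm(100, mean = 50, sd = 10),   # made-up income, big units
           age    = rnorm(100, mean = 40, sd = 12))   # made-up age, small units
cov(X)                            # variances depend on the units chosen
all.equal(cov(scale(X)), cor(X))  # TRUE: after standardizing, cov = cor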
Once we’ve got our covariance matrix, we then need to find the component (direction) with the highest variance. This is often done using the eigenvectors of the covariance matrix. Explaining this in perfect detail would take us too far afield, but the eigenvectors of a matrix are much like the principal components of some data set: a \(p\) x \(p\) matrix has exactly \(p\) eigenvectors, where each eigenvector is a vector of length \(p\). An eigenvector is defined as that vector such that, when multiplied by its associated matrix (using matrix multiplication; see Module 1.2), gives as a result the exact same vector, except stretched or shrunk by some value known as the eigenvalue. The equation is
\[M v_p = \lambda_p v_p\]
That is, matrix M times vector \(v_p\) equals that vector \(v_p\) times some real number \(\lambda_p\). Each eigenvector \(v_p\) has its own associated eigenvalue \(\lambda_p\) that defines how stretched or shrunk it is when multiplied by M. And crucially, there are no other vectors besides the eigenvectors that, when multiplied by M, give you a vector pointing in the same direction; every other vector ends up pointing in some different direction after multiplying by M.
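A small sketch to verify the defining equation with R’s eigen(), on a toy symmetric matrix (covariance matrices are always symmetric):

M <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)   # toy symmetric "covariance" matrix
e <- eigen(M)

v1      <- e$vectors[, 1]   # first eigenvector
lambda1 <- e$values[1]      # its eigenvalue

M %*% v1        # multiplying by M ...
lambda1 * v1    # ... just stretches v1 by lambda1 (same direction)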
So why do we care about eigenvectors? Well, it turns out that eigenvectors turn up all over mathematics and science, from quantum physics to social science to image recognition and self-driving cars. Like principal components, they give you a reduced version of the data that retains much of the original information, but in a smaller form. And for temporal processes, they often describe where complex temporal structures with lots of feedback end up in equilibrium over time. If you haven’t heard of them, they are probably the most important mathematical construct you’ve never heard of.
So getting back to the math, every matrix M has \(p\) eigenvectors and \(p\) eigenvalues associated with it: \(\{ v_1,\lambda_1\},\{ v_2,\lambda_2\},\{ v_3,\lambda_3\}...\). We order them by the size of \(\lambda_p\): the eigenvector with the largest \(\lambda_p\) is the first eigenvector (and thus the first principal component), and so on for all the rest. And because eigenvector-based analysis is so ubiquitous, there are a lot of standard methods for finding the \(p\) pairs \(\{ v_1,\lambda_1\},\{ v_2,\lambda_2\},\{ v_3,\lambda_3\}...\) for any given matrix M.
So to summarize, one classic procedure for finding the principal components (ie, the values \(\{ \phi_{p1}, \phi_{p2},...\}\) and the projections \(C_1, C_2...\)) is:

1. Standardize (rescale) each variable to have a mean of 0 and sd of 1.
2. Calculate the covariance matrix of the standardized variables.
3. Find the eigenvectors and eigenvalues of that matrix.
4. Order the eigenvectors by the size of their eigenvalues: these are the components, and the entries of each eigenvector are the loadings \(\{ \phi_{p1}, \phi_{p2},...\}\).
5. Project the observations onto each component to get the scores \(C_1, C_2...\).
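Sketched in code, with a generic made-up matrix X standing in for whatever data you have, the procedure looks roughly like this:

X  <- matrix(rnorm(100 * 3), ncol = 3)   # placeholder data: 100 observations, 3 variables
Xs <- scale(X)                           # 1. standardize (mean 0, sd 1)
S  <- cov(Xs)                            # 2. covariance (= correlation) matrix
e  <- eigen(S)                           # 3. eigenvectors and eigenvalues

phi    <- e$vectors                      # 4. loadings: each column is a component
scores <- Xs %*% phi                     # 5. projections of the observations
e$values                                 #    eigenvalues, largest first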
To put this into more concrete terms, let’s look at some real data. Let’s use the psych package, a package of useful functions and example datasets for analyzing psychological data. Inside, it has the “Motivational State Questionnaire” dataset, a set of 3896 individual responses to a battery of mood questions. The subset we’re interested in here is 71 questions that were asked of each subject about their mood at the time of the survey: how happy, anxious, alert, sleepy, surprised, etc, etc, they were at that moment on a 0-3 scale. The substantive question is: there are a lot of mood questions out there, but do people really have 71 different moods at any given time? Or are there a few underlying emotional factors that explain most of how you are at any given moment, and the myriad specific moods are just different ways of measuring the simpler, underlying factors? For instance, in the study that collected these data, the researchers started with the common assumption that one of the dominant underlying factors is whether you are happy or sad at any given moment; this level of happiness vs sadness shapes many of the other moods people report at any given moment, and is sort of like a basic thermometer reading for anyone at any time. At least, that is the common theory these scientists were testing.
So let’s get the data, clean it up a bit, and take a quick look at it:
#install.packages("psych")
library(psych)
msq1 <- psychTools::msq[,2:72]
msq1[msq1=="9"] = NA
msq1 <- data.frame(msq1)
msq2 <- na.omit(msq1)
names(msq2)
[1] "afraid" "alert" "angry" "anxious"
[5] "aroused" "ashamed" "astonished" "at.ease"
[9] "at.rest" "attentive" "blue" "bored"
[13] "calm" "cheerful" "clutched.up" "confident"
[17] "content" "delighted" "depressed" "determined"
[21] "distressed" "drowsy" "dull" "elated"
[25] "energetic" "enthusiastic" "excited" "fearful"
[29] "frustrated" "full.of.pep" "gloomy" "grouchy"
[33] "guilty" "happy" "hostile" "idle"
[37] "inactive" "inspired" "intense" "interested"
[41] "irritable" "jittery" "lively" "lonely"
[45] "nervous" "placid" "pleased" "proud"
[49] "quiescent" "quiet" "relaxed" "sad"
[53] "satisfied" "scared" "serene" "sleepy"
[57] "sluggish" "sociable" "sorry" "still"
[61] "strong" "surprised" "tense" "tired"
[65] "tranquil" "unhappy" "upset" "vigorous"
[69] "wakeful" "warmhearted" "wide.awake"
dim(msq2)
[1] 1747 71
msq2[1:5,1:5]
afraid alert angry anxious aroused
1 1 1 0 1 1
2 0 1 0 0 0
3 0 0 0 0 0
4 0 1 0 1 1
103 0 0 1 0 0
That’s a lot of columns and a lot of variables. Let’s do some factor analysis on it to see what, if any, are the underlying dimensions.
The psych package also has a very robust factor analysis function, fa(). It actually does much more than simply finding the eigenvectors of the covariance matrix. One of the most important tweaks is that it rotates the principal components a little bit to nudge things so that each component loads only on a few variables, and not on any of the rest. Without the rotation, we might get for the first principal component that it is 40% X1, 30% X2, 20% X3, and 10% X4; after the rotation, we might get 60% X1 and 40% X2, and none of the rest. But as we will see, often this extra rotation doesn’t make much difference, and ultimately we end up with the same eigenvectors as ever.
fact <- fa(msq2,nfactors=2)
fact1 <- fact$loadings[,1]
Here we are only calculating the first two factors, but that’s enough for our purposes for now. The second line extracts from the output the loadings, or the projections of the variables onto the factors, \(\{ \phi_{11}, \phi_{12},...\}\). These are essential for interpreting the factors: if the first factor loads highly positive on all the happy variables and highly negative on all the unhappy variables, then we can know this factor is probably a happiness-unhappiness dimension. But if it loads on some other set of variables, maybe some other underlying factor explains the biggest variance in mood.
The best way to read the factor, then, is to look not at the loadings as ordered by the variables (ie, \(\{ \phi_{11}, \phi_{12},...\}\)), but rather to order them from largest to smallest, so we can see what variables load most positively, and what variables load most negatively: ie, what are the two poles along which the first latent dimension varies.
fact1[order(fact1)]
sluggish tired dull gloomy sleepy
-0.57105799 -0.54881685 -0.54750768 -0.53205954 -0.50569655
grouchy drowsy depressed unhappy irritable
-0.50203289 -0.49831594 -0.49528362 -0.48878774 -0.47579067
inactive blue sad upset frustrated
-0.46837928 -0.46244573 -0.41256029 -0.41080632 -0.40251644
bored hostile distressed lonely angry
-0.37434355 -0.35435492 -0.35003093 -0.34831253 -0.32835989
quiet idle sorry ashamed fearful
-0.29948657 -0.27307778 -0.26551905 -0.23102291 -0.20674325
tense afraid clutched.up guilty scared
-0.20531701 -0.19916658 -0.19462196 -0.17660533 -0.17244643
still nervous placid anxious quiescent
-0.14494122 -0.10041209 -0.04945854 0.01036319 0.07525211
jittery astonished tranquil surprised intense
0.09560217 0.14366556 0.17480685 0.22752845 0.23026932
serene calm at.rest relaxed at.ease
0.24368851 0.26822107 0.30673400 0.38998166 0.47890593
determined strong inspired proud aroused
0.49890726 0.54461269 0.56684424 0.57895328 0.58282944
interested warmhearted confident delighted vigorous
0.59668718 0.61225943 0.62979990 0.64521495 0.64831274
wakeful attentive satisfied elated wide.awake
0.65002080 0.66440296 0.66796561 0.66890261 0.67136906
sociable content alert excited pleased
0.67158295 0.67492386 0.68666737 0.69832570 0.72470873
enthusiastic full.of.pep happy lively energetic
0.75267586 0.77269967 0.77920857 0.78151569 0.78457320
cheerful
0.78675772
This is a lot to look at, but we can immediately tell what variables go into the first factor by looking at the lowest- and highest-scoring variables (the sign, ie, which direction is positive vs negative, is arbitrary and meaningless). What we see is that while one end is happy, enthusiastic, cheerful, full.of.pep, lively, and energetic, the other side is not unhappy so much as sluggish, tired, dull, sleepy, etc. Ie, the most important factor underlying all these 71 moods is not happy vs unhappy, but more like energetic (and happy) vs tired, sleepy, and depressed. An important discovery!
So what is the second factor? Is it happy/unhappy? Or some other dimension of mood? We can again examine it from the fa() output:
fact2 <- fact$loadings[,2]
fact2[order(fact2)]
calm relaxed at.ease tranquil serene
-0.352885876 -0.329328504 -0.314019286 -0.296146699 -0.239425721
at.rest still inactive placid idle
-0.220366409 -0.210805035 -0.181192225 -0.158556289 -0.130717864
content tired drowsy sleepy satisfied
-0.092566444 -0.085640657 -0.084743296 -0.074466172 -0.052861670
sluggish quiet bored quiescent dull
-0.051250126 -0.047187091 -0.037158459 0.008404442 0.017418666
warmhearted happy confident pleased sociable
0.030302370 0.049778277 0.062769206 0.103344356 0.105867790
delighted cheerful proud interested wakeful
0.144332820 0.145129497 0.191750307 0.205150300 0.236644249
enthusiastic attentive wide.awake strong alert
0.245216122 0.251307405 0.267979328 0.275439171 0.276002665
elated full.of.pep lively energetic vigorous
0.277534264 0.302719139 0.306414361 0.334074266 0.354623629
aroused grouchy surprised excited inspired
0.374395194 0.383309096 0.384160442 0.387856349 0.389399539
lonely astonished irritable determined gloomy
0.413086287 0.430430321 0.434064713 0.440353948 0.446182126
hostile blue jittery depressed guilty
0.450331361 0.473217869 0.482896170 0.487026145 0.490809488
unhappy ashamed sorry sad intense
0.515707520 0.527920666 0.543121600 0.549980128 0.555921589
clutched.up angry anxious afraid fearful
0.573135634 0.575964542 0.585729159 0.598757261 0.599046272
upset scared frustrated nervous distressed
0.602792835 0.616661476 0.619454282 0.629479344 0.637585044
tense
0.674806887
Now we see a second dimension that is interestingly different from the first: not energetic vs sleepy, but relaxed and calm versus tense and frustrated. Here too, the factor is not quite what we expected, but at the same time familiar and plausible.
Within the fa() output are also the projections of individuals onto each of these dimensions (ie, how energetic-vs-tired someone seems to be, and how calm-vs-tense they seem to be, and so on for all the other factors we might estimate), and a large number of other diagnostics and outputs (too many to show here!). There’s a whole world to explore in factor analysis, but the basic truth is that the fundamental factors are the same however we estimate them.
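As a quick illustration, those individual projections can be pulled straight from the fa() object (assuming the default scores="regression" and that raw data, rather than a correlation matrix, were passed in):

head(fact$scores)   # each respondent's position on the two estimated factors
cor(fact$scores)    # the factors may correlate, since fa()'s default rotation is oblique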
For instance, here are two other methods for estimating principal components, both built into base R, but using different approaches. The first uses singular value decomposition (another unsupervised approach similar to factor analysis), and the second uses the covariance/eigenvector approach.
# prcomp method using SVD
pcaA <- prcomp(msq2)
pcaA1 <- pcaA$rotation[,1]
# Princomp method using eigen of cov
pcaB <- princomp(msq2)
pcaB1 <- pcaB$loadings[,1]
Finally, let’s do it manually ourselves using cov() to estimate the covariance matrix and eigen() to get the eigenvectors:
# Direct eigen of cov
covm <- cov(msq2)
eigenm <- eigen(covm)
eigen1 <- eigenm$vectors[,1]
How similar are these methods?
f1mat <- cbind(fact1,pcaA1,pcaB1,eigen1)
cor(f1mat)
fact1 pcaA1 pcaB1 eigen1
fact1 1.0000000 0.9836389 -0.9836389 -0.9836389
pcaA1 0.9836389 1.0000000 -1.0000000 -1.0000000
pcaB1 -0.9836389 -1.0000000 1.0000000 1.0000000
eigen1 -0.9836389 -1.0000000 1.0000000 1.0000000
Essentially identical, with fa() being slightly different due to the rotation – but not very different.
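If you want to see how much of that small difference is due to the rotation itself, you can re-run fa() with the rotation turned off and redo the comparison (a quick check, assuming fa()'s rotate argument; "none" skips the rotation step):

fact_unrot <- fa(msq2, nfactors=2, rotate="none")   # same model, no rotation
cor(fact_unrot$loadings[,1], pcaA1)                 # compare to the SVD-based loadings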
One final method for estimating the components directly from the data is perhaps a bit surprising. Because eigenvectors are, in a sense, both the underlying truth of the data and the equilibrium of any process that the data describes, one additional way to find the first eigenvector / principal component is to take any random vector \(v_a\) of length \(p\) and multiply it by the data matrix; then take the resulting vector \(v_b\) (now of length \(n\)) and multiply it back through the matrix to get another vector \(v_a'\), and so on back and forth. If you keep doing this a few times, ultimately the process converges so that the final \(v_a\) is almost identical to the first eigenvector. (This is basically how singular value decomposition works.) It’s a crude but very effective way to get the first component (and can be adapted to get the others), and works both quickly and well on even very large datasets.
tm.msq2 <- t(as.matrix(msq2))   # transposed data matrix (p x n)
m.msq2 <- as.matrix(msq2)       # data matrix (n x p)
va <- rnorm(ncol(msq2))         # start from a random vector of length p
for(i in 1:10){
  vb <- m.msq2 %*% scale(va)    # project forward through the data: length-n vector
  va <- tm.msq2 %*% scale(vb)   # and back again: length-p vector
}
(Note how we have to rescale va and vb each time, otherwise they increase because the eigenvalue stretches the eigenvector each time it’s multiplied through the matrix. We only care about the direction of v (ie, the relative sizes of the loadings), not the absolute size, so we rescale it to keep it from getting unwieldy big.)
So does the final va match the first component?
cor(va,pcaA1)
[,1]
[1,] -0.999497
Yes it does.
Factor analysis can theoretically produce as many factors as there were variables originally (\(p\)), but since one of the purposes of factor analysis is to reduce a high-\(p\) dataset to a smaller set of factors, one might naturally wonder: how many factors do we have to keep? Can a 100-variable dataset be reduced to just three factors that explain most of the variation? How do we know how many to keep?
It turns out this question is a vexed and ongoing one, and has been for decades. Ideally we would want a nice statistical test to tell us that the first k factors are statistically significant in some sense, and the remaining ones are just picking up on random noise. But despite many such tests having been suggested, the community has not really settled on a single best one.
Perhaps one of the oldest and most established methods is looking at the eigenvalues, which in a sense determine the relative importance of each of the factors: the bigger the eigenvalue, the more of the variation in the original data that eigenvector/factor explains.
One common way of examining these values is to plot them, ordered by size, in something called a “scree plot” (because it looks like the scree on the side of a mountain). All of our estimation methods above include these values as well as the factors, and here is a simple scree plot of them from our manual eigenvector method.
plot(eigenm$values,type="b")
There are two common criteria to use for choosing the number of factors based on an examination of these values. First, one looks for the “elbow” in the curve – where it goes from the steep decline to the flat area, the presumption being that the flat area contains the factors that are just noise. So in this figure, counting from the left, it might be 5 or so factors that are significant, and the rest are just noise. Another method is to keep only those factors with values above 1 (because that distinguishes eigenvectors that grow vs those that shrink), which in this case yields 6 factors to retain – so the two methods are in fairly close agreement.
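The second rule is a one-liner to check from the eigenvalues we already computed:

sum(eigenm$values > 1)   # number of eigenvalues above 1, ie factors retained by this rule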
One final way to see this is to plot the proportion of the total variance in the data explained by the factors as you add them together. Since the eigenvalues are equivalent to the amount of variance each component explains in the original data, we can just do the same plot as before, but use the cumsum function to plot the cumulative amount (cumsum just turns a vector \(v\) of \(i\) values into a vector \(c\) where each value \(c_i\) is the sum of all the \(v_j\) up to the \(i\)th), and normalize it by dividing by the total.
plot(cumsum(eigenm$values)/sum(eigenm$values),ylim=c(0,1))
We can see that this is a fairly smooth curve, with a bit of an elbow at around the fourth or fifth value – about the same as before. More interestingly, over 50% of the total variance in the data is explained by just those first 4 factors, so only 4 out of 71 factors already explain over half of the variation. This, plus the previous tests, suggests that 4-6 factors do explain quite a lot of what’s going on in the variation of mood among people over time. We may not be 70-dimensional emotional thinkers, but 4-6 seems like a plausibly complex psychological model.