Overview
This lesson introduces cluster analysis: the k-means algorithm, hierarchical clustering, and the problem of choosing the number of clusters.
Objectives
After completing this module, students should be able to:
Explain how the k-means algorithm iterates between calculating centroids and reassigning observations.
Estimate k-means clusters in R with kmeans(), using multiple random starts, and interpret the resulting centroids.
Perform hierarchical clustering with hclust(), read a dendrogram, and extract cluster assignments with cutree().
Discuss the difficulty of choosing the number of clusters k.
Reading
Lander, 22.1, 22.3.
As we saw in the introduction, cluster analysis is an alternative form of unsupervised learning, with perhaps as long and complex a lineage as factor analysis. Unlike factor analysis, cluster analysis generally targets the observations rather than the variables: we often want to know what natural groups the observations fall into, and which group each observation belongs to (although in more complex versions, an observation can belong to multiple clusters to greater or lesser degrees). The challenge, then, is to find what natural clusters, if any, the data fall into. But when we seek to characterize the clusters themselves (ie, interpret what they mean, as we tried to do with factors), then we are doing something more like variable projection: we want to know where the clusters are located in the same variable space as the observations.
For instance, consider the following variant on our original simulated example.
x1 <- rnorm(100,0,.5)
y1 <- rnorm(100,3,.5)
x2 <- rnorm(100,3,.5)
y2 <- rnorm(100,0,.5)
factor <- as.factor(c(rep("A",100),rep("B",100)))
sd1 <- data.frame(var1=x1,var2=y1,cluster=factor[1:100])
sd2 <- data.frame(var1=x2,var2=y2,cluster=factor[101:200])
simdat1 <- rbind(sd1,sd2)
# plot(simdat1$var1,simdat1$var2)
Here our original data are on the left, and our guess about the clusters that best fit the data is at the center. Red observations are assigned to group 1, and blue to group 2. But we can also say that group 1 loads heavily on variable 1 (the x axis) and group 2 loads heavily on variable 2 (the y axis).
A more precise way to characterize that might be to define the groups by their centers: the black dots in the figure at the right. The center of the red group is (3,0) and the center of the blue group is (0,3), making more precise our notion of how the cluster identities are connected to the variables (axes). And those centers even allow us to define who belongs to which group: those nearest the (0,3) center belong to the blue group, and those nearest the (3,0) center belong to the red group. In a sense, the group defines the center point, and the center point defines the group (ie, those closest to that point). The natural next question is how we define these groups and center points – and the answer lies in the way they reciprocally define each other.
There are literally hundreds of ways to do cluster analysis, but perhaps the most long-established is the k-means algorithm. This is a surprisingly simple algorithm, with only one tricky part: choosing the number of clusters to divide the data into. This is similar to the question of the number of factors to keep, and as we did there, let us put that question off briefly.
The k-means algorithm starts by randomly assigning each observation to one of the k groups (step 1), and then just repeats two steps:
2(a) For each group, calculate the mean of all the points currently in that group (ie, the mean of each variable across those points; ie, the “centroid” of those points).
2(b) After calculating all the group centroids, reassign each point to the group of the nearest centroid. Go back to 2(a) and repeat until the group assignments stop changing (ie, the process has converged).
That’s it. But it works pretty nicely. Let’s try it on our simulated data. Usually you can’t know ahead of time how many groups to use, but for now let’s cheat and use two.
Here’s our simulated data, which contains the known clusters:
head(simdat1)
var1 var2 cluster
1 0.570914646 2.610354 A
2 0.134137850 3.002990 A
3 -0.767015417 2.636657 A
4 -0.102933579 3.875686 A
5 0.920837599 2.834753 A
6 -0.006542993 2.610651 A
tail(simdat1)
var1 var2 cluster
195 3.655312 0.5181315 B
196 3.118843 0.2610610 B
197 2.733543 1.0524464 B
198 2.694174 0.3511974 B
199 3.252162 0.3774145 B
200 2.234124 -0.4496955 B
Let’s try to assign our two clusters using k-means, based on a random initialization:
set.seed(1)
cat <- as.factor(round(runif(200)))
simdat2 <- cbind(simdat1,cat)
head(simdat2)
var1 var2 cluster cat
1 0.570914646 2.610354 A 0
2 0.134137850 3.002990 A 0
3 -0.767015417 2.636657 A 1
4 -0.102933579 3.875686 A 1
5 0.920837599 2.834753 A 0
6 -0.006542993 2.610651 A 1
Now let’s iterate our 2(a) and 2(b) steps:
for(i in 1:2){
# 2(a): get centroids of two groups
centroids <- aggregate(simdat2[,1:2],by=list(cat=simdat2$cat),FUN=mean)
print(centroids)
# 2(b): calculate distances of each point to centroid 1 (d1) and centroid 2 (d2)
d1 <- sqrt( (simdat2[,1]-centroids[1,2])^2 + (simdat2[,2]-centroids[1,3])^2 )
d2 <- sqrt( (simdat2[,1]-centroids[2,2])^2 + (simdat2[,2]-centroids[2,3])^2 )
# then reassign the category variable depending on which centroid is closer
# (note: as.numeric(d1 < d2) gives 1 for points closer to the first centroid, so the 0/1
# labels can swap between iterations, but the partition itself still converges)
simdat2$cat <- as.factor(as.numeric(d1 < d2))
}
cat var1 var2
1 0 1.329780 1.64664
2 1 1.659056 1.43930
cat var1 var2
1 0 2.94335944 0.1057073
2 1 0.05206146 2.9760860
As can be seen, in this case it only takes two steps for the centroids to converge to (approximately) (3,0) and (0,3). It will obviously take longer with higher-dimensional data, more groups, and less well-separated observations. But even with large data, it is a very fast procedure. Its only drawbacks are that (a) you have to guess k ahead of time, and (b) depending on the random initialization, you can get different groupings if the data are not well separated.
R of course has its own routines for estimating k-means clusters:
kout <- kmeans(simdat2[,1:2],centers=2)
as.vector(kout$cluster)
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
kout$centers
var1 var2
1 2.94335944 0.1057073
2 0.05206146 2.9760860
As we can see, it gets the categories exactly right.
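As a quick check (a sketch using only objects already defined above), we can cross-tabulate the known cluster labels against the k-means assignments; each known group should map onto exactly one estimated cluster (the 1/2 labels themselves are arbitrary and may be swapped).
table(known=simdat2$cluster, estimated=kout$cluster)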
But as we said, for more complex data, the exact category assignments and centroids can vary from run to run. Ideally we want the best assignment, where the best assignment is the placement of centroids that minimizes the total variation within the groups. That is, we add up the squared distances between each member of a group and that group’s centroid, and sum this across the groups; we want the placement of centroids that makes this total (in our example, the sum of d1^2 over the points assigned to group 1 plus the sum of d2^2 over the points assigned to group 2) as small as possible. That is, we want the lowest level of within-cluster variation we can get – each cluster should be as tight a group as possible.
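This total is what kmeans() reports as tot.withinss. As a quick sketch (assuming the kout object and simdat2 from the example above are still in memory), we can compute it by hand and compare:
# total within-cluster sum of squares, computed by hand
wss <- sum(sapply(1:2, function(g){
  members <- simdat2[kout$cluster==g, 1:2]
  sum( (members$var1 - kout$centers[g,"var1"])^2 + (members$var2 - kout$centers[g,"var2"])^2 )
}))
wss
kout$tot.withinss  # should match (up to rounding)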
Since the result of the k-means algorithm can depend on the starting conditions, the kmeans function can automatically take a bunch of initial conditions, run the whole thing with each initial condition, and report the best of the runs. In the example below, we run it first with one random start, then with 25, assigning 5 categories to our original data:
set.seed(100)
kout <- kmeans(simdat2[,1:2],centers=5,nstart=1)
kout$tot.withinss
[1] 54.92029
kout <- kmeans(simdat2[,1:2],centers=5,nstart=25)
kout$tot.withinss
[1] 53.13394
Note how the total within-cluster sum of squares for the second run is better (lower) than for the first, thanks to running it 25 times rather than just once.
We can also try it using our previous Motivational State Questionnaire data.
kout <- kmeans(msq2,centers=2,nstart=25)
We can inspect the group memberships, but that’s a little hard to parse with 71 variables and thousands of observations. Easier to interpret is the location of the centroids in the 71 dimensions. This might seem like a lot, but there are only two centroids, each of which is in 71-dimensional space (just as the two centroids in the previous example were in 2-dimensional space). What we might want to know is, for each centroid, what dimensions does it score particularly highly on (how far out is it on those axes, like (3,0) is high on x and low on y), and do we see any pattern among those variables that centroid X is high on? We can just sort each centroid by its scores on all 71 variables, and display the top scoring variables:
centroids <- kout$centers
topvars_centroid1 <- centroids[1,order(centroids[1,])]
topvars_centroid2 <- centroids[2,order(centroids[2,])]
tail(topvars_centroid1)
drowsy sluggish bored quiet sleepy tired
1.578125 1.589844 1.593750 1.625977 1.694336 1.858398
tail(topvars_centroid2)
satisfied happy sociable confident content warmhearted
1.889350 1.950207 1.957123 1.976487 1.993084 1.994467
Look familiar? The first cluster seems to be the sleep/tired bunch, and the second cluster is the happy bunch, although it seems more about content and happy than about energetic and happy, as we saw in the first principal component.
We often see a strong overlap between clusters and factors, but a key difference is that factors are inherently dimensional and oppositional: there are two directions for every factor, and we often see clear oppositions at either end, such as energetic-vs-tired (component 1) and relaxed-vs-anxious (component 2). Clusters are less oppositional: we can talk about variables that score highly (as we just did above), but it is less illuminating to look at the variables that score weakly, since those are just things that aren’t near the cluster – more of a grab-bag when the cluster itself is some specific set of variables like quiet/sleepy/tired.
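To see this concretely (a quick sketch using the sorted centroid scores computed above), we can look at the other end of each sorted vector; these lowest-scoring variables are simply whatever is far from that centroid rather than a coherent opposite pole.
head(topvars_centroid1)  # variables the sleepy/tired centroid scores lowest on
head(topvars_centroid2)  # variables the content/happy centroid scores lowest on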
We never answered the question of how many clusters to choose, of course. And running the process multiple times with nstart doesn’t solve the problem, since that just finds the best cluster assignment given a choice of \(k\) clusters. We can come up with rules of thumb, as we did in looking for the “elbow” in the scree plot with factor analysis: eg, with “Hartigan’s Rule,” we can just add clusters one at a time, see what the total within-cluster variance is (as we did with different runs of the same cluster size), and then only accept the additional cluster if it lowers the total variance enough. (In a sense, this is like the adjusted \(R^2\), which penalizes you for adding new variables on the grounds that any additional variable, even a random one, can only raise \(R^2\), and thus a variable should only be added if it improves the fit above and beyond what we would expect by chance.) But this is a fairly ad hoc and not universally accepted approach.
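As a rough sketch of the elbow-style approach (not Hartigan’s Rule itself, just the same underlying idea), we can run kmeans() for a range of values of k on our simulated data and look for the point where adding clusters stops paying off:
set.seed(100)
# total within-cluster sum of squares for k = 1 to 8
wss_by_k <- sapply(1:8, function(k) kmeans(simdat2[,1:2], centers=k, nstart=25)$tot.withinss)
plot(1:8, wss_by_k, type="b", xlab="number of clusters k", ylab="total within-cluster SS")
# the curve should drop sharply from k=1 to k=2 and flatten out afterward,
# suggesting k=2 for these data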
One partial way around this problem is hierarchical clustering. Unlike k-means, hierarchical clustering assembles clusters piece by piece: first a few similar observations are joined into a multitude of small clusters, then similar small clusters are attached together to form larger ones, and so on upward until everything is joined into a single cluster. The user can then decide where in this process to draw the line and take the clusters at that level of aggregation.
As you might have noticed, the key idea here is “similarity” – we join similar observations, and then similar clusters, until we’ve joined everything. Similarity between observations is simple enough: if we just take the distance between two observations (ie, the Euclidean distance), that’s a pretty good measure of similarity. But how do we measure the similarity between two small clusters, or between a cluster and a single observation we are considering adding to it?
As usual, it turns out there are multiple answers to this question. It’s easiest to define them in terms of dissimilarity, the flip side of similarity (ie, we will join the two clusters with the lowest dissimilarity).
Complete: the dissimilarity between two clusters A and B is equal to the largest distance between a member of group A and a member of group B.
Single: like Complete, but sets the dissimilarity between A and B to the smallest distance between a member of A and a member of B.
Average: the average dissimilarity between every member of A and every member of B.
Centroid: the distance between the centroid of A and the centroid of B.
At this point we could write our own algorithm in R easily enough, once we’ve chosen which similarity measure we prefer. At each step, we just compare every existing cluster or singleton observation with every other one, and join together the two most similar; then we repeat that until we’ve hierarchically joined everything up. But as usual, there’s no need to code this ourselves: there’s a simple R function to do it, hclust().
Here’s hclust run on our simulated data. We’ll use the “average” method as a reasonable middle ground among the four distance methods. First, though, we need to calculate the complete matrix of distances between every observation and every other one. Luckily, there is another R function to do this, dist().
hout <- hclust(dist(simdat1[,1:2]),method="average")
The easiest way to visualize the results is via a plot of the dendrogram:
plot(hout)
Observations and clusters are joined together going from bottom to top, where the height at which any two clusters join is equal to the dissimilarity between those two clusters. As one can see from this plot, there are a bunch of observations and small clusters that are moderately different from each other, but two major clusters that are very different (height of about 4). This reflects the fact that these data are tightly grouped into two main clusters that are much more different from each other than any subsets within them are from each other.
From this hierarchical clustering, we can just choose the height at which to slice the tree (eg, a horizontal cut at height = 3 will divide the tree into two branches), or alternatively the number of clusters we want (which just causes the algorithm to progressively lower the cut line until the tree, cut there, divides into k branches).
plot(hout)
abline(a=3,b=0,col="red")
To get the cluster assignments, we just apply cutree to the hclust output, choosing either the height to cut it at or the number of clusters:
as.vector(cutree(hout,2))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
as.vector(cutree(hout,h=3))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
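Before turning to where to cut, a quick aside (a sketch, not from the reading): we can check how much the choice of linkage method matters here by refitting with the other methods and comparing the two-cluster assignments; for groups this well separated, they should essentially agree.
d <- dist(simdat1[,1:2])
h_single   <- hclust(d, method="single")
h_complete <- hclust(d, method="complete")
# cross-tabulate each two-cluster solution against the "average" solution above
table(cutree(h_single,2),   cutree(hout,2))
table(cutree(h_complete,2), cutree(hout,2))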
So how do we choose where to make the cut – ie, the old question: how do we choose k? Well, there’s still no good answer, but a good rule of thumb is to make the cut in a wide stretch like the one between heights 2 and 4 in the previous dendrogram: where there’s a big jump in distance before the next set of clusters is grouped together. But again, while that’s clear for this artificial data, it’s much less clear in many real cases – suggesting that maybe the theory is a bit mistaken and there is no “real” answer to determine, or that perhaps we should think of clusters not as exclusive things, but as something in which observations can have multiple, partial memberships. But that’s going beyond what we can cover here.
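One quick way to see that jump numerically (a rough sketch, using the hout object from above) is to look at the merge heights that hclust() stores:
tail(hout$height)        # heights of the last few merges
diff(tail(hout$height))  # the final gap, just before everything merges into one cluster, should dwarf the rest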
To conclude, here’s hclust applied to our MSQ data:
hout2 <- hclust(dist(msq2),method="complete")
plot(hout2,labels=FALSE)
abline(a=16.5,b=0,col="red")
abline(a=21,b=0,col="blue")
Two plausible places to make the cut are shown with the blue and red lines, which divide the tree into 2 and 5 clusters respectively. Whether we prefer 2 or 5, or something in between, or something more numerous, is ultimately up to the judgment of the researcher, based on her substantive knowledge of the results.
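As a final sketch, we could pull out the actual assignments from each cut with cutree and look at the resulting cluster sizes (per the dendrogram above, the blue cut at 21 should give 2 clusters and the red cut at 16.5 roughly 5):
table(cutree(hout2, h=21))    # blue line: 2 clusters
table(cutree(hout2, h=16.5))  # red line: about 5 clusters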