Overview
This lesson introduces the concepts of factor analysis and clustering.
Objectives
After completing this module, students should be able to:
Reading
NA
So far we have been dealing with what we might call “standard” data: a few categorical or numerical variables, where we wished to know whether a set of means were different (t or F test), or whether two categorical variables were independent (chi-squared test), or how to model some dependent variable \(y\) as a linear function of a small number of independent variables \(x_1\), \(x_2\), etc. (multiple regression). These are all very good and even powerful tools, but they were largely developed in an era when both computation and data were less plentiful than they are now.
What is “big data”? No one can agree on that, and for good reason – “big data” is more a marketing term than a statistical one. But it does capture something true, which is that modern data problems can be larger and more complex than the traditional toolset developed in the middle of the 20th century can quite handle.
There are a number of ways in which data can be too “big” for the standard approaches. First, it can simply have too large an \(n\). “Big” in the sense of “long” data is actually not a very big problem in the modern era, and computers are perfectly happy to run a regression with even millions of observations, assuming the software has been well designed. Nor is this a very interesting problem, since the methods are generally the same; the only real difference is that you will often find that p values are significant even when effect sizes are very small, simply because large numbers of observations make it easy to capture even the slightest correlations. The only real effect in that case is to underscore the point that statistical significance and substantive significance are not the same thing: a variable can be statistically significant while its substantive effect on your dependent variable is negligible, and long data often just makes it easier to mistake the one for the other.
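To see how easily this can happen, here is a minimal sketch (in Python, with numbers made up purely for illustration) of a regression on a million observations where the true effect is tiny but the p value is nevertheless far below any conventional threshold:

```python
# Minimal illustrative sketch: with a very large n, even a negligible
# effect shows up as "statistically significant."
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(size=n)
y = 0.005 * x + rng.normal(size=n)   # true slope is tiny (0.005)

result = stats.linregress(x, y)
print(f"slope = {result.slope:.4f}, r^2 = {result.rvalue**2:.6f}, p = {result.pvalue:.1e}")
# The slope explains a vanishing share of the variance in y, yet the p value
# is essentially zero: statistically significant, substantively negligible.
```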
The second sense is that the “big” data can be very wide, with a large number of variables (columns). In this case, modeling becomes much more challenging, both computationally and conceptually. If we have an \(n\) of 100 and 1000 variables, we can’t run a meaningful regression: with so many \(x\) variables we can fit \(y\) perfectly, but our fit will be entirely arbitrary and will have no predictive power outside of our 100-item sample. (In fact, we would run into other technical problems even running the regression, as you can verify for yourself if you simulate and then regress some data that is wider than it is long.) It is this second problem that we will be focusing on in the next couple of modules – how to deal with “big” data in the sense of data with lots of variables.
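For the curious, here is one way that check might look (a sketch in Python; the data are pure noise, so any “fit” is meaningless by construction):

```python
# Minimal illustrative sketch: with more variables than observations, a
# linear fit can be made perfect in-sample, yet it has no predictive power.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                       # y is unrelated to X by construction

# Ordinary OLS fails outright (X'X is singular); the minimum-norm least
# squares solution "fits" anyway.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("in-sample residual variance:", (y - X @ beta).var())              # essentially 0

# But the same coefficients do nothing for fresh data drawn the same way.
X_new, y_new = rng.normal(size=(n, p)), rng.normal(size=n)
print("out-of-sample residual variance:", (y_new - X_new @ beta).var())  # ~ var(y_new) or worse
```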
Finally, there is arguably a third way that data can be “big,” although it is related to the other two: the data could have a very complex internal structure, as our temporal data in the last module did. In this case, the \(n\) might not be very large, and the \(p\) (number of variables) might also not be very large, but the amount of computation needed to, for instance, test all the possible ARIMA variants can go up very quickly. This becomes even more complex in the case of vector autoregression, for instance, where there is no Y vs X, but just a collection of X variables, all of which may be influencing each other mutually over time. Or it can be very complex with spatial data, such as a set of countries that are all affecting each other; or hierarchical data, where you have countries, states, counties, schools, and children (for instance) all nested within each other. Or, in the one example we will briefly look at in the final module, network data, where individuals (or other items) are connected to each other in a complex web of interconnections that determines who affects whom. In all of these cases, the complexity of the model can be very high even when \(n\) and \(p\) are relatively low, and of course each is worthy of at least a course to itself.
Returning to the second sense of big, as in wide (large numbers of variables), there have basically been two strands of research dealing with this. The first strand looks at how we extend the modeling approach taken in regression to large numbers of independent variables. If we want to model Y as a function of a lot of X’s, what can we do to handle so many independent variables, or to select a smaller, more manageable subset, or perhaps to aggregate our X’s into a smaller number of groups? The second approach is less interested in modeling Y as a function of the X’s, and more interested in how we understand, summarize, or group our variables in a way that allows us to detect whether there are a few underlying groups or factors that shape the myriad variables we observe: in effect, whether underneath it all, perhaps there are fewer true variables than we think. This second approach can then be fed back into the first, taking those underlying factors as a new, smaller set of variables to regress Y on, but it need not be: it can be interesting and useful in and of itself to discover a small set of factors underlying a large set, in the way that a single IQ number purports to underlie a large set of mental skills.
In the next two modules we will tackle these two approaches in reverse order. Using the terminology developed in the “machine learning” branch of computer science, in this module we will start with “unsupervised learning” – where we take a large set of variables and observations and try to detect a smaller set of underlying groups or factors beneath them. In the next module we will turn to “supervised learning,” which is any technique where the goal is to use a large number of variables to model and predict Y – although usually in machine learning the interest is more in predicting Y than in understanding the causes of Y per se. In some ways, the unsupervised approaches are a bit simpler, so we will begin with them here.
The two approaches we will examine in this module are factor analysis and clustering. With factor analysis, we look for a smaller set of underlying dimensions (new variables) that may explain a large number of observed variables. With clustering, we look for a small number of groups of observations that may capture the underlying categories beneath more complex data.
Both approaches may result in similar insights. For instance, consider the following simulated data.
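The figure is built from simulated data; a minimal sketch of how data with roughly this structure could be produced (in Python, with made-up numbers purely for illustration) is:

```python
# Minimal illustrative sketch: two observed variables driven by one latent
# position along a diagonal, with two groups so the middle is sparse.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, size=n)                         # e.g., study-group membership
latent = rng.normal(loc=np.where(group == 1, 1.5, -1.5), scale=0.7)

x = latent + rng.normal(scale=0.5, size=n)                 # e.g., hours on homework
y = latent + rng.normal(scale=0.5, size=n)                 # e.g., course grade

plt.scatter(x, y, c=group, cmap="coolwarm")
plt.xlabel("variable 1 (x)")
plt.ylabel("variable 2 (y)")
plt.show()
```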
On the left we see a plot of two variables against each other. They could be hours spent on homework on the x axis and course grade on the y, or responses on a political survey, with individuals’ left-right ideology on social issues (e.g., approval of gay marriage) on the x axis and economic issues (e.g., tax rates) on the y axis.
On the one hand, we might think that the hours spent on homework cause the course grade, and this is probably to some extent true; but we might also think that there are underlying factors that affect both, such as work ethic, interest in the material, or maybe even membership in a study group. One version presents the underlying cause as a single latent dimension, such as interest in the subject material, that determines how far out on the diagonal line any individual is (middle figure); the position along this line is reflected in two different measurements that are basically capturing this one underlying thing. A different version of the underlying cause is that there are two groups, e.g., those in a study group and those who aren’t, and membership in the study group (right figure, blue) determines where you are on both hours spent and overall grade.
The first version is a factor – an underlying dimension (or several) that accounts for most of the variation we see. The second version is clustering, which is another way of reducing complex, multidimensional data to a simpler structure. Of course, the truth might be some combination of the two.
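As a rough illustration of the two views (a sketch using the same kind of simulated data as above, assuming scikit-learn is available, and with the first principal component standing in for the underlying factor):

```python
# Minimal illustrative sketch: the same two-variable data summarized either by
# one underlying dimension (first principal component, standing in for a factor)
# or by two groups (k-means clusters).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, size=n)
latent = rng.normal(loc=np.where(group == 1, 1.5, -1.5), scale=0.7)
data = np.column_stack([latent + rng.normal(scale=0.5, size=n),
                        latent + rng.normal(scale=0.5, size=n)])

# "Factor" view: one dimension accounts for most of the variation
pca = PCA(n_components=1).fit(data)
print("variance captured by the first component:", pca.explained_variance_ratio_[0])

# "Cluster" view: two groups along the same diagonal
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
agreement = max(np.mean(km.labels_ == group), np.mean(km.labels_ != group))
print("k-means agreement with the true groups:", agreement)
```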
For instance, in the political example, we know that if we plot people’s answers to a left-right social politics question and a left-right economic policy question, we will get much the pattern we see in this figure: there’s mostly just a single left-right dimension that underlies both social and economic political opinions, but there’s also a tendency for people not just to be spread out along the diagonal line, but to be missing in the middle, with partisanship drawing them into one of two camps. Whether the deeper truth is ideology or partisanship is a fundamental and difficult question to answer, though, if it is answerable at all.