Overview
This lesson introduces regression with two variables.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 16.1. Schumacker, Chapter 16.
A linear regression is similar to determining the covariance or correlation between variables, but it also allows us to quantify how changes in X affect Y and to predict values of Y for given values of X.
In this module we will consider bivariate regression, where there are just two variables, a single X and a single Y. In the next module we will examine multiple regression with many X variables.
The example below plots CO2 against GDP, illustrating how a country’s per-capita CO2 output varies with its economic size as measured by per-capita GDP. We might want to explain the world’s growing CO2 output, to predict how much it will change as GDPs grow, or simply to investigate how individual countries’ CO2 output meets or exceeds what we might expect based on their economic size. Based on the scatter plot, we might guess that the two variables are correlated, but we need linear regression to tackle these more specific questions.
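To give a concrete sense of how such a plot might be produced in R, here is a minimal sketch; the `countries` data frame and its numbers are invented for illustration and are not the actual dataset behind the figure.

```r
# A sketch of a CO2-vs-GDP scatter plot with a fitted line; the data frame
# `countries` and its values are invented for illustration, not the real data.
countries <- data.frame(
  gdp = c(5, 12, 20, 28, 35, 39.7, 55),  # per-capita GDP, thousands of dollars
  co2 = c(2, 4, 6, 9, 11, 19.5, 25)      # per-capita CO2, tons
)

plot(countries$gdp, countries$co2,
     xlab = "GDP per capita (thousands of dollars)",
     ylab = "CO2 per capita (tons)")

fit <- lm(co2 ~ gdp, data = countries)   # the best-fit line discussed below
abline(fit)                              # overlay the line on the scatter plot
```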
As you may already know, a linear regression finds the “best fit” line (or plane for multiple regression) through the cloud of points. This is the line that “best explains” the variation in Y as a function of X, in a sense we will make more precise in a moment.
As you might recall from middle school, the equation of a line is \(y = mx+b\), where \(m\) is the slope of the line and \(b\) is the intercept (where the line crosses the y axis when x equals 0). More often we write the equation as \(y = a + bx\), or more commonly \(y = \beta_{0} + \beta_{1}x\), which allows us to add more x variables with additional \(\beta_i\) values, as we will discuss in the next module. Again, with a single X, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope.
Setting aside for a moment the question of how we find this best-fit line, once we have it we can immediately discuss the effect of changes in X on Y and predict values of Y based on various values of X.
If we want to know the best guess for a value of y based on x, we can just plug an x value into our equation to get a value of y:
\[\hat{y}_i = \beta_0 + \beta_1 x_i\]
\(\hat{y}_i\) is the predicted value of \(y_i\) associated with some \(x_i\). For example, in the above illustration, say we want to know the predicted amount of CO2 in tons per person for the US, with a per-capita GDP of $39,700.
The equation for our line above is \(y = 0.42 + 0.31x\), where again, y is recorded in tons and x in thousands of dollars. Unlike correlations, for regression it matters quite a lot what our units are. If we had recorded GDP in dollars rather than thousands of dollars, we would have gotten a \(\beta_1\) that was 1000 times smaller. One must always be careful to note exactly what units each variable is recorded in.
So given a \(\beta_0\) of 0.42 and \(\beta_1\) of 0.31, if we plug 39.7 into this we get: \(\hat{y}_{US} = 0.42 + 0.31 \cdot 39.7 = 12.727\), or 12.73 tons per person. So based on this regression, for a country around the size of the US, we would expect about 12.7 tons per person, the blue circle on the figure. The actual datapoint for the US is the yellow circle, which illustrates that this is a statistical procedure, and never an exact fit.
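As a quick sketch of that calculation in R (the object names here are just illustrative):

```r
# Prediction by hand, using the coefficients reported above
b0 <- 0.42    # intercept, in tons
b1 <- 0.31    # slope, in tons per thousand dollars of per-capita GDP

gdp_us <- 39.7                 # US per-capita GDP, in thousands of dollars
co2_hat <- b0 + b1 * gdp_us    # predicted CO2 for a country this size
co2_hat                        # 12.727, i.e. about 12.73 tons per person
```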
It also illustrates the fact that the US produces far more CO2 than one might expect from its size alone. Perhaps more anomalous is the point in the upper-right corner, Luxembourg. This is what’s known as an outlier, and one might suspect that something is very different in that tiny country and exclude it from the dataset. It is always good to plot your data before running any regressions to detect outliers, but it is a dangerous practice to eliminate them without very good reasons. As you can probably visualize, without that outlier the best-fit line would be somewhat different, sloping upward more rapidly, with the effect that the US datapoint would probably be closer to the best-fit line. Thus someone who wanted the US to look better might be motivated to drop Luxembourg from the dataset, illustrating the dangers of dropping outliers. But if there really is some different mechanism in that country, dropping it may still be the best thing to do – it just requires careful substantive analysis and explanation.
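To get a feel for how much leverage a single outlier can have, here is a small sketch with made-up data (not the real CO2 figures): one far-right point sitting below the overall trend flattens the fitted slope, and dropping it makes the line steeper.

```r
# Made-up data (not the real CO2 figures) showing the leverage of one outlier
set.seed(1)
x <- runif(30, 5, 45)                   # hypothetical per-capita GDPs
y <- 0.4 + 0.5 * x + rnorm(30, sd = 2)  # hypothetical CO2 values around a line
x_all <- c(x, 75)                       # add one far-right country...
y_all <- c(y, 22)                       # ...whose CO2 sits well below the trend

coef(lm(y_all ~ x_all))   # slope flattened by the outlier
coef(lm(y ~ x))           # steeper slope once that point is dropped
```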
We can actually interpret a lot about the relationship between X and Y without doing specific predictions. The most important piece of information is the \(\beta_1\) coefficient associated with X. As you might recall, the slope \(m\) measures the amount y changes for a given change in x. That is, if you increase x by 1 unit (in whatever unit X is measured in, like thousands of dollars), y increases by \(\beta_1\) units:
That is, if \(y = \beta_0 + \beta_1 x\), then if we increase x by 1 to (x+1), we get \(y = \beta_0 + \beta_1 (x+1) = \beta_0 + \beta_1 x + \beta_1\). That is, y has increased by \(\beta_1\). Returning to our example, if we increase GDP by 1 (thousand dollars per capita), CO2 output is expected to increase by 0.31 tons. Thus we can read off directly from the coefficient what it means: for a 1-unit increase in x, \(\beta_1\) is how much y increases. And if we want to know, say, the effect of a 10-unit increase in x on y, we just multiply \(\beta_1\) by 10. Whether we think \(\beta_1\) is a lot or a little, though, is something that the statistics can’t answer – it depends on our substantive interpretation of the numbers.
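As a quick worked check of that arithmetic, using the slope from the CO2 regression above, a 10-unit increase in per-capita GDP (that is, ten thousand dollars) corresponds to a predicted increase of

\[\Delta \hat{y} = \beta_1 \cdot 10 = 0.31 \cdot 10 = 3.1 \textrm{ tons per person.}\]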
\(\beta_0\), on the other hand, is the best prediction for \(y\) if \(x = 0\). We have to be careful interpreting this though. If we are trying to predict car exhaust CO2 based on car speed, it makes sense to talk about a car that is going 0 mph, since the engine is still running. But to talk about CO2 production for a country with a GDP of 0 doesn’t make any sense – it’s extrapolating beyond the substantively relevant part of our data. So though the equation predicts a CO2 output of 0.42 tons for a country with 0 GDP, that doesn’t make much substantive sense in this case, and we need to be careful of the limitations of our data.
\[y = \beta_{0} + \beta_{1}x\]
So where do we get \(\beta_0\) and \(\beta_1\) from?
As you might guess, the equation for \(\beta_1\) is similar to the equation for the covariance between y and x:
\[\beta_{1} = \frac{ \textrm{Cov}(x,y)}{\textrm{Var}(x)} = \frac{ \sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i} (x_{i} - \bar{x})^{2}}\]
\(\beta_{1}\) can also be put in terms of the correlation coefficient:
\[r = \frac{\textrm{Cov}(x,y)}{ s_{x} s_{y}} = \beta_{1} \frac{s_{x}}{s_{y}}\]
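To see where this comes from, note that the formula for \(\beta_1\) implies \(\textrm{Cov}(x,y) = \beta_1 \textrm{Var}(x) = \beta_1 s_x^2\); substituting into the definition of \(r\) gives

\[r = \frac{\beta_{1} s_{x}^{2}}{s_{x} s_{y}} = \beta_{1} \frac{s_{x}}{s_{y}}\]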
Compared to the formula for \(\beta_1\), the one for \(\beta_0\) is simple:
\[\beta_{0} = \bar{y} - \beta_{1} \bar{x}\]
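As a quick check that these formulas are what R computes under the hood, here is a sketch with simulated data (the true line used to simulate is \(y = 0.4 + 0.3x\) plus noise):

```r
# Checking the formulas against lm(), using simulated data
set.seed(2)
x <- rnorm(100, mean = 20, sd = 5)      # fake GDP-like variable
y <- 0.4 + 0.3 * x + rnorm(100)         # fake CO2-like variable with noise

b1 <- cov(x, y) / var(x)                # beta_1 = Cov(x, y) / Var(x)
b0 <- mean(y) - b1 * mean(x)            # beta_0 = ybar - beta_1 * xbar
c(intercept = b0, slope = b1)

coef(lm(y ~ x))                         # the same two numbers from lm()
```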
How did we get the equations for \(\beta_1\) and \(\beta_0\)? Formally, the equation for Y as a function of X is
\[Y = \beta_0 + \beta_1 X + \epsilon\]
where \(\epsilon\) is the error term. That is, each x predicts y imperfectly, and for each \(x_i\), our predicted \(\hat{y}_i\) will be different from the true \(y_i\), and that difference is \(\epsilon_i\). In our example, it is the difference between the yellow circle (the true US CO2 production) and the predicted blue circle. The goal in choosing \(\beta_0\) and \(\beta_1\) is to choose them such that the sum of all these errors is minimized – that is, you want the “best fit” line that minimizes the total error between the actual datapoints and the line.
In practice, we minimize the squares of these errors, since that makes them all positive. Thus if \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), or \(\epsilon_i = y_i - \beta_0 - \beta_1 x_i\), and we want to minimize the sum of the \(\epsilon_i^2\) over all \(n\) of our observations, then we want to choose the \(\beta_0\) and \(\beta_1\) that minimize:
\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]
This is a simple problem to solve using a little calculus. If you recall from your calculus (and omitting a few details), if we want to find the minimum for a function, we can just take the derivative and find where that equals 0. That’s basically all we do: we take the (partial) derivative of the above equation with respect to \(\beta_0\) and \(\beta_1\) and set each derivative equal to 0. This gives us two equations with two unknowns (\(\beta_0\) and \(\beta_1\)), which is solvable with some basic algebra:
\[-2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0\] \[-2 \sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0\]
The solution to these two equations leads directly to the equations in the previous slide. But you don’t have to memorize this! Just have a sense for where this magical best-fit line is coming from. (And of course, things are a lot more complicated, though fundamentally the same, with multiple X variables.)
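One way to get that sense without any calculus is to let the computer search numerically for the \(\beta_0\) and \(\beta_1\) that minimize the sum of squared errors, and confirm that they match what lm() reports. A sketch with simulated data:

```r
# Minimizing the sum of squared errors numerically (simulated data)
set.seed(3)
x <- rnorm(50, mean = 20, sd = 5)
y <- 0.4 + 0.3 * x + rnorm(50)

sse <- function(b) sum((y - b[1] - b[2] * x)^2)  # b = c(beta0, beta1)
optim(c(0, 0), sse)$par    # numerical minimizer, roughly (0.4, 0.3)

coef(lm(y ~ x))            # lm() finds essentially the same line
```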
More details for the curious can be found here: [http://isites.harvard.edu/fs/docs/icb.topic515975.files/OLSDerivation.pdf]