Linear Regression — The Basics

What is Linear Regression?

The basics

Yeah. It’s not a good sign if I’m starting out already repeating myself. But that’s how things seem to be with linear regression, so I guess it’s fitting. It seems like every day one of my professors will talk about linear regression, and it’s not due to laziness or lack of coordination. Indeed, it’s an intentional part of the curriculum here at New College of Florida because of how ubiquitous linear regression is. Not only is it an extremely simple yet expressive formulation, it’s also the theoretical basis of a whole slew of other tactics. Let’s just get right into it, shall we?

Linear Regression:

Let’s say you have some data from the real world (and hence riddled with real-world error). A basic example for us to start with is this one:


There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data-point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as:

\frac{1}{N}\sum\limits_{i=1}^N\left(\hat Y_i - Y_i\right)^2

where \hat Y_i is our predicted value of Y_i for a give X_i. Being as how we want a linear model (for simplicity and extensibility), we can write the above equation as,

\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2

for some \alpha, \beta that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for \alpha, \beta! Let’s go ahead and do that! We want to minimize

\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2

We can start by finding the \hat\alpha such that, \frac{d}{d\alpha}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0. And as long as we don’t forget the chain rule, we’ll be alright…

\sum\limits_{i=1}^N2\left(\alpha + \beta X_i - Y_i\right) = 0\Longrightarrow
\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right) = 0 \Longrightarrow
N\alpha + N\beta\bar X - N\bar Y = 0\Longrightarrow
\alpha = \bar Y - \beta\bar X

and we’ll find the \beta such that \frac{d}{d\beta}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0

And following a similar pattern we find (sorry for the editing… isn’t the greatest therein):

\sum\limits_{i=1}^N2X_i\left(\alpha + \beta X_i - Y_i\right) = 0\Longrightarrow
\alpha\sum\limits_{i=1}^NX_i + \beta\sum\limits_{i=1}^N X_i^2 - \sum\limits_{i=1}^NY_iX_i = 0\Longrightarrow
(\bar Y -\beta\bar X)N\bar X + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i = 0
N\bar Y\bar X -N\beta(\bar X)^2 + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i = 0
\beta\left(\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2\right) = \sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X
\beta = \frac{\sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X}{\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2}

But note: \frac{1}{N}\sum\limits_{i=1}^NX_i^2 - (\bar X)^2 = \text{VAR}(X)
So, \sum\limits_{i=1}^NX_i^2 - N(\bar X)^2 = N\cdot\text{VAR}(X)
And: \sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X = \sum\limits_{i=1}^NY_iX_i - N(\frac{1}{N}\sum\limits_{i=1}^NY_i)\bar X
= \sum\limits_{i=1}^NY_iX_i - \sum\limits_{i=1}^N(Y_i\bar X)
= \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right)
= \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right) - \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right) + \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right)
= \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) + \bar Y\sum\limits_{i=1}^N\left(X_i - \bar X\right)
= \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) = N\cdot \text{COV}(X, Y)

So, \beta = \frac{\text{COV}(X,Y)}{\text{VAR}(X)}.

And then we can find \alpha by substituting in our approximation of \beta. Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation.


Now we have it

Okay, so now we have our line of “best fit”, but what does it mean? Well, it means that this line predicts the data we gave it with the least error. That’s really all it means. And sometimes, as we’ll see later, reading too much into that can really get you into trouble.

But using this model we can now predict other data outside the model. So, for instance, in the model pictured above, if we were to try and predict Y when X=2, we wouldn’t do so bad by picking something around 10 for Y.

An example, perhaps?

So I feel at this point, it’s probably best to give an example. Let’s say we’re trying to predict stock price given the total market price. Well, in practice this model is used to assess volatility, but that’s neither here nor there. Right now, we’re really only interested in the model itself. But without further ado, I present you with, the CAPM (Capital Asset Pricing Model):

r = \alpha + \beta r_m + \epsilon (where \epsilon is the error in our predictions).

And you can fit this using historical data or what-have-you. There are a bunch of downsides to fitting it with historical data though, like the fact that data from 3 days ago really doesn’t have much to say about the future anymore. There are plenty of cool things you can do therein, but sadly, those are out of the scope of this post.

For now, we move on to

Multiple Regression

What is this anyway?

Well, multiple regression is really just a new name for the same thing: how do we fit a linear model to our data given some set of predictors and a single response variable? The only difference is that this time our linear model doesn’t have to be one dimensional. Let’s get right into it, shall we?

So let’s say you have k many predictors arranged in a vector (in other words, our predictor is a vector in \mathbb{R}^n). Well, I wonder if a similar formula would work… Let’s figure it out…

Firstly, we need to know what a derivative is in \mathbb{R}^n. Well, if f:\mathbb{R}^n\to\mathbb{R}^m is a differentiable function, then for any x in the domain, f'(x) is the linear map A:\mathbb{R}^n\to\mathbb{R}^m such that \text{lim}_{h\to0}\frac{||f(x+h) - f(x) - Ah||}{||h||} = 0. Basically, f'(x) is the tangent plane.

So, now that we got that out of the way, let’s use it! We want to find the linear function that minimizes the Euclidean norm of the error terms (just like before). But note: the error term is \epsilon = Y - \hat Y = Y - \alpha -\beta X, for some vector \alpha and some matrix \beta. Now, since it’s easier and it’ll give us the same answer, we’re going to minimize the squared error term instead of just the error term (like we did in the one dimensional version). We’re also going to make one more simplification: That \alpha=0. We can do this safely by simply appending (or prepending) a 1 to the rows of our data (thereby creating a constant term). So for the following, assume we’ve done that.

\langle\epsilon, \epsilon\rangle = (Y - X\beta)^T(Y - X\beta)
\langle\epsilon, \epsilon\rangle = (Y^T - \beta^TX^T)(Y - X\beta)
\langle\epsilon, \epsilon\rangle = Y^TY - 2Y^TX\beta + \beta^TX^TX\beta

So, let’s find the \beta that minimizes that.

0=\lim\limits_{h\to0}\frac{||(Y^TY - 2Y^TX(\beta+h) + (\beta+h)^TX^TX(\beta+h)) - (Y^TY - 2Y^TX\beta + \beta^TX^TX\beta) - Ah||}{||h||}
0=\lim\limits_{h\to0}\frac{||- 2Y^TXh + 2\beta^TX^TXh + h^TX^TXh - Ah||}{||h||}
0=\lim\limits_{h\to0}||- 2Y^TX + 2\beta^TX^TX + h^TX^TX - A||\frac{||h||}{||h||}
0=\lim\limits_{h\to0}||- 2Y^TX + 2\beta^TX^TX - A||

So, now we see that the derivative is -2Y^TX + 2\beta^TX^TX and we want to find where our error is minimized, so we want to set that derivative to zero:

0=- 2Y^TX + 2\beta^TX^TX
X^TX\beta = X^TY
\beta = (X^TX)^{-1}X^TY

And there we have it. That’s called the normal equation for linear regression.

Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof.

Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests.

Linear Regression — The Basics

Probability — A Measure Theoretic Approach

What is probability?

The Traditional Definition:

Consider a set \Omega (called the sample space), and a function X:\Omega\rightarrow\mathbb{R} (called a random variable.

If \Omega is countable (or finite), a function \mathbb{P}:\Omega\rightarrow\mathbb{R} is called a probability distribution if it satisfies the following 2 conditions:

  1. For each x \in \Omega, \mathbb{P}(x) \geq 0
  2. If A_i\cap A_j = \emptyset, then \mathbb{P}(\bigcup\limits_0^\infty A_i) = \sum\limits_0^\infty\mathbb{P}(A_i)

And if \Omega is uncountable, a function F:\mathbb{R}\rightarrow\mathbb{R} is called a probability distribution or a cumulative distribution function if it satisfies the following 3 conditions:

  1. For each a,b\in\mathbb{R}, a < b \rightarrow F(a)\leq F(b)
  2. \lim\limits_{x\to -\infty}F(x) = 0
  3. \lim\limits_{x\to\infty}F(x) = 1

The Intuition:

What idea are we even trying to capture with these seemingly disparate definitions for the same thing? Well, with the two cases taken separately it's somewhat obvious, but they don't seem to marry very well. The discrete case is giving us a pointwise estimation of something akin to the proportion of observations that should correspond to a value (in a perfect world). The continuous case is the same thing, but instead of corresponding to that particular value (which doesn't really even make sense in this case), the proportion corresponds to the point in question and everything less than it. The shaded region in the top picture below and the curve in the picture directly below it denote the cumulative density function of a standard normal distribution (don't worry too much about what that means for this post, but if you're doing anything with statistics, you should probably know a bit about that).


Another way to define a continuous probability distribution is through something called a probability density function, which is closer to the discrete case definition of a probability distribution (or probability mass function). A probability density function is a function f:\mathbb{R}\rightarrow\mathbb{R}_+ such that \int_{-\infty}^xf(t)dt = F(x). In other words, \frac{dF}{dX} = f. This new function has some properties of our discrete case probability function, but lacks some others. On the one hand, they’re both defined pointwise, but on the other, this one can be greater than one in some places — meaning the value of the probability density function isn’t really the probability of an event, but rather (as the name “suggests”) the density therein.

Does it measure up?

Now let’s check out the measure theoretic approach…

Let \Omega be our sample space, S be the \sigma-algebra on \Omega (so S is the collection of measurable subsets of \Omega), and \mu:S\to\mathbb{R} a measure on that measure space. Let X:\Omega\rightarrow\mathbb{E} be a random variable (\mathbb{E} is generally taken to be \mathbb{R} or \mathbb{R}^n). We define the function \mathbb{P}:\mathcal{P}(\mathbb{E})\rightarrow\mathbb{R} (where \mathcal{P}(\mathbb{E}) is the powerset of \mathbb{E} — the set of all subsets) such that if A\subseteq\mathbb{E}, we have \mathbb{P}(A)=\mu(X^{-1}(A)). We call \mathbb{P} a probability distribution if the following conditions hold:

  1. \mu(\Omega) = 1
  2. for each A\subseteq\mathbb{E} we have X^{-1}(A)\in S.

Why do this?

Well, right off the bat we have a serious benefit: we no longer have two disparate definitions of our probability distributions. Furthermore, there is the added benefit of having a natural separation of concerns: the measure \mu determines the what we might intuitively consider to be the probability distribution while the random variable is used to encode the aspects of the events that we care about.

To further illustrate this

The Examples

A fair die

All even

Let’s consider a fair die. Our sample space will be \{1,2,3,4,5,6\}. Since our die is fair, we’ll define our measure fairly: for any x in our sample space, \mu(\{x\}) = \frac{1}{6}. If we want to know, for instance, what the probability of getting each number is, we could use a very intuitive random variable X(a) = a (so X(1)=1, etc.). Then we see that \mathbb{P}(\{1\}) = \mu(\{1\}) = \frac{1}{6}, and the rest are found similarly.

Odds and Evens?

What if we want to consider the fair die of yester-paragraph, but we only care if the face of the die shows an odd or an even number? Well, since the actual distribution of the die hasn’t changed, we won’t have to change our measure. Instead we’ll change our random variable to capture just those aspects we care about. In particular, X(a) = 0 if a is even, and X(a) = 1 if a is odd. We then see \mathbb{P}(1) = \mu(\{1,3,5\}) = \frac{1}{2} and \mathbb{P}(0) = \mu(\{2,4,6\}) = \frac{1}{2}

Getting loaded

All even

Now let’s consider the same scenario of wanting to know the probability of getting each number, but now our die is loaded. Being as how we’re changing the distribution itself and not just the aspects we’re choosing to care about, we’re going to want to change the measure this time. For simplicity, let’s consider a kind of degenerate case scenario. Let our measure be: \mu(A) = 1 if 1\in A and \mu(A)=0 if 1\notin A. Basically, we’re defining our probability to be such that the only possible outcome is a roll of 1. So since we are concerned with the same things we were concerned with last time, we can take that same random variable. We note \mathbb{P}(1) = 1 and \mathbb{P}(a) = 0 for any a \neq 1.

Odds or evens

Try to do this one yourself. I’m going to go get some sleep now. Please feel free to contact me with any questions. I love doing this stuff, so don’t be shy!

Probability — A Measure Theoretic Approach