Yeah. It’s not a good sign if I’m starting out already repeating myself. But that’s how things seem to be with linear regression, so I guess it’s fitting. It seems like every day one of my professors will talk about linear regression, and it’s not due to laziness or lack of coordination. Indeed, it’s an intentional part of the curriculum here at New College of Florida because of how ubiquitous linear regression is. Not only is it an extremely simple yet expressive formulation, it’s also the theoretical basis of a whole slew of other tactics. Let’s just get right into it, shall we?

Let’s say you have some data from the real world (and hence riddled with real-world error). A basic example for us to start with is this one:

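There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ is our predicted value of $y_i$ for a given $x_i$. Being as how we want a linear model (for simplicity and extensibility), we can write the above equation as,

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$

for some $\beta_0, \beta_1$ that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for $\beta_0$ and $\beta_1$! Let’s go ahead and do that! We want to minimize

$$E(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

We can start by finding the $\beta_0$ such that $\frac{\partial E}{\partial \beta_0} = 0$. And as long as we don’t forget the chain rule, we’ll be alright… Writing $\bar{x}$ and $\bar{y}$ for the averages of the $x_i$ and the $y_i$,

$$\frac{\partial E}{\partial \beta_0} = -\frac{2}{n}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = -2(\bar{y} - \beta_0 - \beta_1 \bar{x})$$

and we’ll find the $\beta_0$ such that

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

And following a similar pattern we find $\beta_1$ (sorry for the editing… WordPress.com isn’t the greatest therein). Setting $\frac{\partial E}{\partial \beta_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$ and substituting the $\beta_0$ from above gives:

$$\sum_{i=1}^{n} x_i \big((y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\big) = 0$$

But note (since $\sum_i (y_i - \bar{y}) = 0$ and $\sum_i (x_i - \bar{x}) = 0$):

$$\sum_{i=1}^{n} \bar{x} \big((y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\big) = 0$$

So, subtracting the second sum from the first,

$$\sum_{i=1}^{n} (x_i - \bar{x}) \big((y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\big) = 0$$

And:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2$$

So, $\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$.

And then we can find $\beta_0$ by substituting in our approximation of $\beta_1$. Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation.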
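
The closed-form result of the derivation above can be tried out numerically. What follows is only a minimal sketch in plain Python: the function names `fit_line` and `mse`, the variable names `beta0` and `beta1`, and the sample data are all made up for illustration and are not from the original post. It computes the slope as the sum of products of deviations divided by the sum of squared deviations of the predictor, then recovers the intercept from the two means.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) minimizing the mean squared error of beta0 + beta1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar  # intercept from substituting the slope back in
    return beta0, beta1


def mse(xs, ys, beta0, beta1):
    """Mean squared error of the line beta0 + beta1 * x on the given data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy data lying roughly along a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_line(xs, ys)
print("intercept:", beta0, "slope:", beta1)
print("MSE at the fit:", mse(xs, ys, beta0, beta1))
```

Nudging either coefficient away from the fitted value and recomputing `mse` should never give a smaller error, which is a quick sanity check on the algebra.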

Okay, so now we have our line of “best fit”, but what does it mean? Well, it means that this line predicts the data we gave it with the least error. That’s really all it means. And sometimes, as we’ll see later, reading too much into that can really get you into trouble.

But using this model we can now predict other data outside the model. So, for instance, in the model pictured above, if we were to try and predict when , we wouldn’t do so bad by picking something around 10 for .

So I feel at this point, it’s probably best to give an example. Let’s say we’re trying to predict a stock’s price given the total market price. Well, in practice this model is used to assess volatility, but that’s neither here nor there. Right now, we’re really only interested in the model itself. So without further ado, I present the CAPM (Capital Asset Pricing Model):

$$y_{\text{stock}} = \alpha + \beta\, x_{\text{market}} + \epsilon$$

(where $\epsilon$ is the error in our predictions).

And you can fit this using historical data or what-have-you. There are a bunch of downsides to fitting it with historical data though, like the fact that data from 3 days ago really doesn’t have much to say about the future anymore. There are plenty of cool things you can do therein, but sadly, those are out of the scope of this post.

For now, we move on to multiple regression.

Well, multiple regression is really just a new name for the same thing: how do we fit a linear model to our data given some set of predictors and a single response variable? The only difference is that this time our linear model doesn’t have to be one dimensional. Let’s get right into it, shall we?

So let’s say you have many predictors arranged in a vector $x = (x_1, \dots, x_n)$ (in other words, our predictor is a vector in $\mathbb{R}^n$). Well, I wonder if a similar formula would work… Let’s figure it out…

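Firstly, we need to know what a derivative is in $\mathbb{R}^n$. Well, if $f: \mathbb{R}^n \to \mathbb{R}^m$ is a differentiable function, then for any $x$ in the domain, $Df(x)$ is the linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that $\lim_{h \to 0} \frac{\|f(x + h) - f(x) - Df(x)h\|}{\|h\|} = 0$. Basically, $Df(x)$ gives us the tangent plane.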

So, now that we got that out of the way, let’s use it! We want to find the linear function that minimizes the Euclidean norm of the error terms (just like before). But note: the error term is $y - (X\beta + b)$, for some vector $\beta$ and constant offset $b$, where $X$ is the matrix whose rows are our data points. Now, since it’s easier and it’ll give us the same answer, we’re going to minimize the squared error term instead of just the error term (like we did in the one dimensional version). We’re also going to make one more simplification: that $b = 0$. We can do this safely by simply appending (or prepending) a 1 to the rows of our data (thereby creating a constant term). So for the following, assume we’ve done that.

So, let’s find the $\beta$ that minimizes $\|y - X\beta\|^2$.

$$\|y - X\beta\|^2 = (y - X\beta)^T (y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$

So, now we see that the derivative (written as a gradient with respect to $\beta$) is $-2X^T y + 2X^T X \beta$ and we want to find where our error is minimized, so we want to set that derivative to zero:

$$X^T X \beta = X^T y \quad\Longrightarrow\quad \beta = (X^T X)^{-1} X^T y$$

(provided $X^T X$ is invertible). And there we have it. That’s called the **normal equation** for linear regression.
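
To make the normal equation concrete, here is a rough sketch in plain Python (no libraries) that builds $X^T X$ and $X^T y$ and solves $X^T X \beta = X^T y$ by Gaussian elimination instead of forming the inverse explicitly. The helper names (`transpose`, `matmul`, `solve`, `normal_equation`) and the toy data are assumptions made up for this illustration. A real project would more likely reach for a linear algebra library.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(a, b):
    """Solve a x = b (a is n x n, b has length n) by Gaussian elimination
    with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def normal_equation(rows, y):
    """Least-squares coefficients; the first entry is the constant term."""
    X = [[1.0] + list(r) for r in rows]  # prepend a 1 to every row of the data
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated exactly from y = 1 + 2*a + 3*b.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
beta = normal_equation(rows, y)
print(beta)
```

Solving the linear system directly is a deliberate choice here: it avoids computing $(X^T X)^{-1}$, which is usually less stable numerically than elimination.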

Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof.

Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests.

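A deterministic finite automaton (DFA) consists of:

- A set $Q$ called “states”
- A set $\Sigma$ called “symbols”
- A function $\delta: Q \times \Sigma \to Q$
- A designated state $q_0 \in Q$ called the start point
- A subset $F \subseteq Q$ called the “accepting states”.

The DFA is then often referred to as the ordered quintuple $(Q, \Sigma, \delta, q_0, F)$.

Given a DFA, $D = (Q, \Sigma, \delta, q_0, F)$, a state $q \in Q$, and a string $w \in \Sigma^*$, we can define $\hat{\delta}(q, w)$ like so:

- If $w$ only has one symbol, we can consider $w$ to be the symbol and define $\hat{\delta}(q, w)$ to be the same as if we considered $w$ as the symbol: $\hat{\delta}(q, w) = \delta(q, w)$.
- If $w = xa$, where $x$ is a nonempty string in $\Sigma^*$ and $a \in \Sigma$, then $\hat{\delta}(q, w) = \delta(\hat{\delta}(q, x), a)$.

And in this way, we have defined how DFAs can interpret strings of symbols rather than just single symbols.

Given a DFA, $D$, we can define “the language of $D$”, denoted $L(D)$, as $L(D) = \{w \in \Sigma^* : \hat{\delta}(q_0, w) \in F\}$.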

Let’s construct a DFA, $D$, that accepts only strings beginning with a 1 that, when interpreted as binary numbers, are multiples of 5. So some examples of strings that would be in $L(D)$ are 101, 1010, and 1111.
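
One possible construction (a sketch, not necessarily the one the original post drew) tracks the remainder modulo 5 of the number read so far: reading a bit `b` from remainder `r` moves to `(2 * r + b) % 5`. The state names `"start"` and `"dead"`, the dictionary encoding of the transition function, and the function names are assumptions made up for this illustration.

```python
def make_multiple_of_5_dfa():
    """Build (states, symbols, delta, start, accepting) for binary multiples of 5
    that begin with a 1."""
    states = {"start", "dead", 0, 1, 2, 3, 4}
    symbols = {"0", "1"}
    delta = {}
    # The first symbol must be a 1; a leading 0 sends us to a trap state.
    delta[("start", "1")] = 1
    delta[("start", "0")] = "dead"
    for s in symbols:
        delta[("dead", s)] = "dead"
    # Integer states are remainders mod 5; appending a bit doubles the number
    # read so far and adds the bit.
    for r in range(5):
        for s in symbols:
            delta[(r, s)] = (2 * r + int(s)) % 5
    return states, symbols, delta, "start", {0}


def extended_delta(delta, state, string):
    """Apply the transition function one symbol at a time, left to right."""
    for symbol in string:
        state = delta[(state, symbol)]
    return state


def accepts(dfa, string):
    """True when the string drives the DFA from its start state to an accepting state."""
    _, _, delta, start, accepting = dfa
    return extended_delta(delta, start, string) in accepting


dfa = make_multiple_of_5_dfa()
for w in ["101", "1010", "1111", "11", "0101"]:
    print(w, accepts(dfa, w))
```

The loop in `extended_delta` is the iterative version of the recursive definition: processing `xa` means processing `x` first and then taking one more step on `a`.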

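A nondeterministic finite automaton (NFA) consists of:

- A set $Q$ called “states”
- A set $\Sigma$ called “symbols”
- A function $\delta: Q \times \Sigma \to \mathcal{P}(Q)$ (so each state and symbol is sent to a set of states)
- A designated state $q_0 \in Q$ called the start point
- A subset $F \subseteq Q$ called the “accepting states”.

The NFA is then often referred to as the ordered quintuple $(Q, \Sigma, \delta, q_0, F)$.

Given an NFA, $N = (Q, \Sigma, \delta, q_0, F)$, a collection of states $S \subseteq Q$, and a string $w \in \Sigma^*$, we can define $\hat{\delta}(S, w)$ like so:

- If $w$ only has one symbol, then we can consider $w$ to be the symbol and define $\hat{\delta}(S, w) = \bigcup_{q \in S} \delta(q, w)$.
- If $w = xa$, where $x$ is a nonempty string in $\Sigma^*$ and $a \in \Sigma$, then $\hat{\delta}(S, w) = \bigcup_{q \in \hat{\delta}(S, x)} \delta(q, a)$.

And in this way, we have defined how NFAs can interpret strings of symbols rather than just single symbols.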
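
The set-of-states definition above translates almost directly into code. The sketch below is only an illustration under a few assumptions: the example NFA (strings over 0 and 1 whose second-to-last symbol is a 1), its state names, the dictionary encoding of the transition function, and the acceptance rule (accept when at least one reachable state is accepting, which is the usual convention but is not spelled out in the text above) are all choices made here.

```python
def step(delta, states, symbol):
    """Union of delta(q, symbol) over every state q in the current set."""
    result = set()
    for q in states:
        result |= delta.get((q, symbol), set())  # a missing entry means no move
    return result


def extended_delta(delta, states, string):
    """Track the whole set of possible states while reading the string."""
    current = set(states)
    for symbol in string:
        current = step(delta, current, symbol)
    return current


def accepts(nfa, string):
    """Assumed convention: accept if some reachable state is accepting."""
    _, _, delta, start, accepting = nfa
    return bool(extended_delta(delta, {start}, string) & accepting)


# Hypothetical NFA: accepts strings whose second-to-last symbol is a 1.
delta = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},  # guess that this 1 is the second-to-last symbol
    ("q1", "0"): {"q2"},
    ("q1", "1"): {"q2"},
}
nfa = ({"q0", "q1", "q2"}, {"0", "1"}, delta, "q0", {"q2"})

for w in ["10", "11", "01", "0110", "0101"]:
    print(w, accepts(nfa, w))
```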

Consider a set $\Omega$ (called the **sample space**), and a function $X: \Omega \to \mathbb{R}$ (called a **random variable**).

If is countable (or finite), a function is called a **probability distribution** if it satisfies the following 2 conditions:

- For each ,
- If , then

And if is uncountable, a function is called a **probability distribution** or a **cumulative distribution function** if it satisfies the following 3 conditions:

- For each ,

What idea are we even trying to capture with these seemingly disparate definitions for the same thing? Well, with the two cases taken separately it's somewhat obvious, but they don't seem to marry very well. The discrete case is giving us a pointwise estimation of something akin to the proportion of observations that should correspond to a value (in a perfect world). The continuous case is the same thing, but instead of corresponding to that particular value (which doesn't really even make sense in this case), the proportion corresponds to the point in question and everything less than it. The shaded region in the top picture below and the curve in the picture directly below it denote the cumulative distribution function of a standard normal distribution (don't worry too much about what that means for this post, but if you're doing anything with statistics, you should probably know a bit about that).

Another way to define a continuous probability distribution is through something called a probability density function, which is closer to the discrete case definition of a probability distribution (or **probability mass function**). A **probability density function** is a function $f$ such that $F(x) = \int_{-\infty}^{x} f(t)\,dt$, where $F$ is the cumulative distribution function. In other words, $f(x) = F'(x)$ wherever $F$ is differentiable. This new function has some properties of our discrete case probability function, but lacks some others. On the one hand, they’re both defined pointwise, but on the other, this one can be greater than one in some places, meaning the value of the probability density function isn’t really the probability of an event, but rather (as the name “suggests”) the density therein.
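
To see a density exceed one while the probabilities stay sensible, consider a uniform distribution on the interval from 0 to 0.5. The sketch below is a small illustration in plain Python. The choice of distribution, the function names, and the midpoint-rule integrator are assumptions picked for this example and are not from the original post.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x, a=0.0, b=0.5):
    """Cumulative distribution function of the uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# The density is 2 everywhere on [0, 0.5], which is greater than one...
print(uniform_pdf(0.25))
# ...yet the total area under it is still (approximately) 1.
print(integrate(uniform_pdf, 0.0, 0.5))
# Integrating the density up to a point recovers the CDF at that point.
# The density is zero below 0, so starting the integral at 0 loses nothing.
print(integrate(uniform_pdf, 0.0, 0.3), uniform_cdf(0.3))
```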

Now let’s check out the measure theoretic approach…

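Let $\Omega$ be our sample space, $\mathcal{F}$ be the $\sigma$-algebra on $\Omega$ (so $\mathcal{F}$ is the collection of measurable subsets of $\Omega$), and $\mu$ a measure on that measure space. Let $X: \Omega \to A$ be a random variable ($A$ is generally taken to be $\mathbb{R}$ or $\mathbb{R}^n$). We define the function $P: \mathcal{P}(A) \to \mathbb{R}$ (where $\mathcal{P}(A)$ is the powerset of $A$, that is, the set of all subsets) such that if $S \subseteq A$, we have $P(S) = \mu(X^{-1}(S))$. We call $P$ a **probability distribution** if the following conditions hold: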

- for each we have .

Well, right off the bat we have a serious benefit: we no longer have two disparate definitions of our probability distributions. Furthermore, there is the added benefit of having a natural separation of concerns: the measure determines what we might intuitively consider to be the probability distribution, while the random variable is used to encode the aspects of the events that we care about.

To further illustrate this, let’s work through a few examples.

Let’s consider a fair die. Our sample space will be $\Omega = \{1, 2, 3, 4, 5, 6\}$. Since our die is fair, we’ll define our measure fairly: for any $\omega$ in our sample space, $\mu(\{\omega\}) = \frac{1}{6}$. If we want to know, for instance, what the probability of getting each number is, we could use a very intuitive random variable $X(\omega) = \omega$ (so $X(1) = 1$, etc.). Then we see that $P(\{1\}) = \mu(X^{-1}(\{1\})) = \mu(\{1\}) = \frac{1}{6}$, and the rest are found similarly.

What if we want to consider the fair die of yester-paragraph, but we only care if the face of the die shows an odd or an even number? Well, since the actual distribution of the die hasn’t changed, we won’t have to change our measure. Instead we’ll change our random variable to capture just those aspects we care about. In particular, $X(\omega) = 0$ if $\omega$ is even, and $X(\omega) = 1$ if $\omega$ is odd. We then see $P(\{0\}) = \mu(\{2, 4, 6\}) = \frac{1}{2}$ and $P(\{1\}) = \mu(\{1, 3, 5\}) = \frac{1}{2}$.

Now let’s consider the same scenario of wanting to know the probability of getting each number, but now our die is loaded. Being as how we’re changing the distribution itself and not just the aspects we’re choosing to care about, we’re going to want to change the measure this time. For simplicity, let’s consider a kind of degenerate case scenario. Let our measure be: $\mu(\{\omega\}) = 1$ if $\omega = 1$ and $\mu(\{\omega\}) = 0$ if $\omega \neq 1$. Basically, we’re defining our probability to be such that the only possible outcome is a roll of 1. So since we are concerned with the same things we were concerned with last time, we can take that same random variable, $X(\omega) = \omega$. We note $P(\{1\}) = 1$ and $P(\{k\}) = 0$ for any $k \neq 1$.
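
The three die scenarios can be replayed in a few lines of code. This is a hedged sketch for a finite sample space only: the measure is stored as a dictionary of point masses, and the names `pushforward`, `fair`, `loaded`, `identity`, and `parity` are made up for this illustration. The encoding of even as 0 and odd as 1 is likewise just one convenient choice.

```python
from fractions import Fraction


def pushforward(mu, X):
    """Given point masses mu[omega] and a random variable X, return a dict
    mapping each value v of X to the total mass of the outcomes sent to v."""
    P = {}
    for omega, mass in mu.items():
        value = X(omega)
        P[value] = P.get(value, Fraction(0)) + mass
    return P


sample_space = range(1, 7)

# Fair die: every face carries mass 1/6.
fair = {omega: Fraction(1, 6) for omega in sample_space}
# Degenerate loaded die: all of the mass sits on the face 1.
loaded = {omega: Fraction(1 if omega == 1 else 0) for omega in sample_space}


def identity(omega):
    """Random variable that reports the face itself."""
    return omega


def parity(omega):
    """Random variable that only reports parity: 0 for even, 1 for odd."""
    return omega % 2


print(pushforward(fair, identity))    # same measure, full detail
print(pushforward(fair, parity))      # same measure, coarser random variable
print(pushforward(loaded, identity))  # different measure, same random variable
```

Changing only `X` or only `mu` between calls mirrors the separation of concerns described above: the measure carries the distribution, and the random variable decides which aspects get reported.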

Try to do this one yourself. I’m going to go get some sleep now. Please feel free to contact me with any questions. I love doing this stuff, so don’t be shy!

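Let $\Sigma$ be an alphabet and let $\Sigma^k$ be the set of all strings of length $k$ over that alphabet. Then, we define $\Sigma^*$ to be $\bigcup_{k \in \mathbb{N}} \Sigma^k$ (the union of $\Sigma^k$ over all natural numbers $k$). If $L \subseteq \Sigma^*$, we call $L$ a language.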

Consider an alphabet (some finite set of characters). For example, we can consider the letters of the English language, the ASCII symbols, or the symbols $\{0, 1\}$ (otherwise known as binary). We can then construct the infinite list of all the different ways we can arrange those characters (e.g. $0$, $01$, or $1101$, etc. if we’re using binary). We call these arrangements “strings”. Once we have all that machinery built up, a language is just some subset of that infinite collection of strings. The language may itself be infinite.

- The alphabet: $\{0, 1\}$

  The language: $\{w \in \{0, 1\}^* : w \text{ is a prime number written in binary}\}$ (all prime numbers in binary)

  Some strings from the language: $10$, $11$, $101$, $111$
- The alphabet: ASCII characters

  The language: All syntactically correct Clojure programs (the source code)
- The alphabet: All Clojure functions, operators, etc, and list

  The language: All syntactically correct Clojure programs (the source code)
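
As a small illustration of “a language is a subset of all strings over an alphabet”, the sketch below enumerates strings over a binary alphabet in order of length and keeps the ones that are prime numbers in binary. The function names are hypothetical. The decision to exclude strings with leading zeros (so that `011` does not count as the prime 3) is an assumption made here, not something the definition above settles.

```python
from itertools import count, islice, product
from math import isqrt


def strings_of_length(alphabet, k):
    """All strings of length k over the alphabet (the set Sigma^k)."""
    return ("".join(chars) for chars in product(alphabet, repeat=k))


def all_strings(alphabet):
    """Every string over the alphabet, shortest first. This generator never ends."""
    for k in count(0):
        yield from strings_of_length(alphabet, k)


def is_binary_prime(s):
    """Membership test for the language of primes written in binary.
    Assumption: the empty string and strings with leading zeros are excluded."""
    if not s or s[0] != "1":
        return False
    n = int(s, 2)
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))


# The first few strings over the alphabet {0, 1}.
print(list(islice(all_strings("01"), 7)))

# The language is the subset of those strings passing the membership test.
primes = (s for s in all_strings("01") if is_binary_prime(s))
print(list(islice(primes, 6)))
```

Because the collection of all strings is infinite, the code uses lazy generators and only ever materializes a finite prefix with `islice`.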

You see that we need to have an alphabet before we can have a Formal Language. Also, different alphabets may result in equivalent languages — by equivalent, we mean that both languages contain the same strings.

Well, there are two ways to look at this. On the one hand, linguists would like to study language in its abstract essence. For this, Formal Languages may come in handy (if endowed with a Grammar and possibly more). That is not the reason I will be studying Formal Languages. On the other hand, there is computing, and I’m learning about formal languages to find their applications there.

Apparently, with the help of a little mathematical thinking, we can assign semantics to the strings in a language and somehow correlate them with real world problems such as computability, the P = NP problem, cryptography, and more!

This is my first blog, so I guess I should start off by explaining my motives behind writing what will probably be a sporadically updated blog. Basically, as I learn things that I find pretty difficult to find online, I’ll try to explain them as best I can here. Also, since I enjoy learning math, I’m going to try to keep up a (semi) regular stream of math posts. If you have any questions, feel free to contact me. My up-to-date contact info can be found on my website aaron.niskin.org.
