Overview of Linear Regression
In this section of the course we've focused on supervised learning algorithms and for the first half of this section we've focused specifically on classification algorithms.

Now we're going to start walking through some regression algorithms. If you remember back to the introduction, a regression algorithm is an algorithm that answers the question, "What should it be?" Our classification problems ask the question, "What is it?"

But with regression, we're looking to answer a slightly different question: we're trying to figure out what something should be. Linear regression is one of the most fundamental algorithms in the regression category. At a high level, what a linear regression algorithm does is establish associations and help make predictions.


I know that's very high-level because in a sense that is what the majority of machine learning algorithms do. But once we get into the case study you're going to see that this fits very nicely.

The definition provided by Yale University for linear regression says that it attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.


Now that is a little bit wordy for my taste. So let's walk through a couple of those key terms. We're going to talk about what variables are, what an explanatory variable is, and then also what a dependent variable is.


Because if you can understand those three concepts then the entire idea of linear regression is going to make much more sense to you.

First and foremost, variables are simply data points. In the definition, you may have noticed that the variables are the observed data that the linear equation gets fit to. So right here, our variables are simply the data that we're analyzing: they help us first make our observations, and then they help us make our prediction. Explanatory variables are also referred to as independent variables. They are the inputs, the data points you already know, that you use to explain, and ultimately predict, the rest of your data.


So in our case study, we're going to have a fun one where we analyze baseball player salaries. When we do that, we're going to treat the salary as the dependent variable: it is the value we're trying to predict, and we're going to compare it against the rest of the historical data to generate our prediction.

Now, the dependent variable is the outcome: the value whose behavior we're trying to explain and predict using the independent variables.


Now, this concept is one of the most confusing parts of learning linear regression, and because of that, I think the case study is really going to clear it up. It's going to make very clear what the dependent variable is and what the independent variables are. I honestly hesitated to even describe what these represent before showing you an example, just because they can be a little bit confusing, but the more times you do it, the easier it's going to be.

And part of the reason why I did want to give you a preview and talk about the definitions is what happens when you start implementing these algorithms in code. When you start picking out your data points and you pipe them into a linear regression function, you need to know which are your dependent and which are your independent variables. I didn't want the very first time you saw one of these algorithms, specifically in code, to leave you thinking, "What in the world are these two variable types, and where do I put one versus the other?" That is one of the primary focuses of this entire guide.
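To make that concrete, here is a minimal sketch of what piping your variables into a linear regression function can look like with scikit-learn. This assumes scikit-learn is installed, and the player numbers are made up purely for illustration:

```python
# Minimal sketch using scikit-learn (the numbers below are made up).
from sklearn.linear_model import LinearRegression

# Independent variable (X): the stat we predict FROM, one row per player.
X = [[10.0], [20.0], [30.0], [40.0]]   # e.g. a performance stat

# Dependent variable (y): the value we want to PREDICT -- salary in millions.
y = [2.5, 5.0, 7.5, 10.0]

model = LinearRegression()
model.fit(X, y)                          # X (independent) first, y (dependent) second
estimate = model.predict([[42.0]])[0]    # salary estimate for a new player
```

Mixing up which list goes in the `X` slot and which goes in the `y` slot is exactly the confusion this guide is trying to head off.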

Now, linear regression has been around for a long time. The algorithm has been used in code for decades, but the underlying statistical formula has been around much longer than that: over 200 years. It is one of the older machine learning algorithms and processes we have available to us, and you're going to see why. It's one of the more straightforward ones to work with, and technically you don't even need a full machine learning setup in order to implement it.


And because of that, it is very popular for determining things such as home prices. Say you took in variables such as location and square footage, and you wanted to see how those related to the price of a home, because you wanted to put your house on the market and needed to decide how much to sell it for. This is a very real-world scenario.

I can remember back to when I was putting my house on the market: I didn't just throw a random number up there. I looked at the comps, all of the other homes around my neighborhood, their square footage, how nice they were, all of these different variables. Based on those, I came up with the price I wanted to list my home at, so that is a very common use case for linear regression. It's also useful when you're trying to determine industry salaries: say you're working at a company and you want to know how much to offer an individual you're looking to hire, linear regression is a great tool for that. Our case study specifically is going to walk through how we can do this, along with a few different options.

And lastly, it is also popular for medical diagnosis associations. Now, I will give a caveat: linear regression by itself is typically not used for medical diagnosis. Instead, it is one of the tools you can add into an entire machine learning system, so you may use linear regression either at the very front, or at the very end after you've called a number of other algorithms, and it can help refine that process. Once you see the case study, I think that is going to give you some good clarification.

Now let's talk about the pros of linear regression.


One of the top reasons to learn this algorithm is that, out of all of the algorithms we're going to walk through, it is one of the most straightforward. Like I said, this is an algorithm that has been used for centuries, so it can't be that complicated: you don't even need a computer in order to make it work. And when you get into building it into actual code, you're going to see that many times a linear regression algorithm can be implemented with just a couple lines of code, so it is definitely easy to implement as well.
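To back up the "couple lines of code" claim, here is a sketch of the whole fitting step in plain Python, with no machine learning library at all, using the classic closed-form least-squares formulas (the data is hypothetical):

```python
# Ordinary least squares for one predictor, in plain Python.
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# slope -> 2.0, intercept -> 1.0
```

This is the same 200-year-old math the section describes; a library like scikit-learn just wraps it with conveniences.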

It has solid performance as long as the data is formulated properly, and it can be as accurate as much larger algorithms, even neural networks, as long as you're working with linear data. We will walk through what linear data looks like here in a little bit.

And now, moving on to the cons.


As great as the linear regression algorithm is, there are a number of drawbacks whenever you're implementing it. One: the accuracy can be poor if the data is not associated properly, and in fact, we're going to take a little bit different approach with our case study. We're actually going to look at a number of poor implementations of linear regression before we get to a good one, because I want you to understand that if you implement this algorithm improperly, you can generate some very poor predictions that could lead to disastrous business decisions.

Next, it does not work for non-linear relationships. The word "linear" is right in the algorithm's name, so it should be pretty clear that when it comes to implementing this type of algorithm, you need to be working with linear data. What I mean is this: imagine a graph, which is what we're going to build here in a minute. If you can plot your data on an X and Y axis, map each X coordinate to a Y coordinate, and draw a line that follows the trend of that data, then that is a problem that can be solved with linear regression.

However, if your data points don't trend along that kind of line, then you may mistakenly think you have created an accurate prediction when reality would not match those expectations. And that is what we're going to walk through when we get to our case study.
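One way to catch that mistake numerically is the R² score (the coefficient of determination): fit the line, then measure how much of the variation it actually explains. Here is a plain-Python sketch with made-up data:

```python
# R^2 for a straight-line fit: 1.0 means the line explains the data
# perfectly; values near 0 mean it explains almost nothing.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))          # linear data: 1.0
print(r_squared([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x^2 data: 0.0
```

The second call shows the trap: a line fit happily runs through curved data, but the score reveals it predicts nothing.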

Now, another issue is that the linear regression algorithm is sensitive to outliers. Take, for example, the case study we're going to go through on baseball player salaries: if you have some poorly performing athletes who are getting huge salaries, that can throw off your data. That's an outlier, someone who is getting paid too much for the kind of performance they're putting out. It's something to keep in mind, and it is actually an issue with the case study we're going to walk through.

You're going to see that the players' salaries are not a perfect line. There's not a one-to-one ratio where you can say a player with this kind of performance should be paid X and a player with lower performance should be paid Y. So we're going to analyze the importance of domain expertise, and I'm going to go through a number of different examples, both good and bad, so that you can learn how to do that.

The last drawback I'm going to mention is that the data for a linear regression algorithm has to be independent. When we worked through some of the classification algorithms, you noticed how there were all kinds of correlations between one data point and another; that is how we were able to generate our classes. But when it comes to linear regression, you need to perform some preprocessing to make sure you are looking at your data in the right way, and that all of the data points are independent, so you can tell whether one piece of data truly affects the outcome of another, because that is how you're going to generate your prediction.
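A common preprocessing check here is to compute the correlation between each pair of candidate predictors; if two of them move in lockstep, they are not independent, and one of them should probably be dropped. A plain-Python sketch of the Pearson correlation coefficient, with made-up data:

```python
# Pearson correlation: +1 or -1 means two variables move in lockstep;
# values near 0 mean no linear relationship between them.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two made-up stats that are perfectly correlated -- keep only one of them.
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))   # 1.0
```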

So now that we've talked about the high-level definition of the linear regression algorithm, let's get into the fun stuff: our case study, where we learn how to estimate baseball players' salaries. We're going to look specifically at how we can use data to help make our prediction.

So here we have a table


and in it we have seven baseball players. For a number of reasons, I'm not using real baseball player names, though I am using realistic stats and salary figures. Part of the reason is that the linear regression algorithm has been around for 200 years, and I want this guide to be around for quite a while as well, so I didn't want to use any names that would be outdated to someone going through this guide ten years from now. The names I chose, if you are a baseball fan or a baseball movie fan, might look a little familiar to you.

I picked them from the movie Major League, which was pretty popular a few decades ago. So what I have here is a table: in the left-hand column is the name, then the salary, what that player makes. The next column is their batting average, then their OPS, a slightly more modern, more advanced statistic that combines a couple of different data parameters. You do not have to know baseball statistics to understand this. Just know that for all three of these items, batting average, OPS, and the last one, WAR, the lower the value, the worse it is, and the higher the value, the better.

The reason I'm pulling in three different items is that I want to show you that if you pick the wrong data, you're going to end up making a wildly inaccurate prediction. Remember early on in this course when I talked about the importance of domain expertise? What that means is that you need some level of expertise in whatever field you're building your machine learning algorithms in; you need to have some knowledge about the data you're working with.

For example, if you know nothing about baseball or baseball statistics, you are probably not going to be the best choice for a baseball statistician, even if you are the best in the world at data science, machine learning algorithms, and all of those kinds of skills. Without some level of domain expertise, you're going to end up picking the wrong data points to look at, especially when it comes to algorithms like linear regression: you're going to end up picking something that doesn't actually have a correlation with the value you're trying to predict.

So here is what we have: the salary values, all the way from 500 thousand to 28 million, make up our dependent variable, the value we want to predict. Everything else, the batting average, the OPS, and the WAR, are our candidate independent variables. We're going to see how this plays out when we get into the different linear regression options we could pick from.

So what we're going to do is take our data and grab our trusty filter.


On the left-hand side here, imagine that you have been hired as a baseball data scientist, and your very first task is this: you are handed a sheet of paper and told, "This is a baseball player, and these are his stats. He has a batting average of .242, an OPS of .800, and a WAR of 42." To give you a very high-level view of what WAR means: it is one of the newest and most popular baseball statistics. It combines pretty much every type of skill a player has and expresses how many wins he helped contribute versus an average replacement player, someone who would be called up from the minor leagues, who serves as a kind of control group.

What WAR says is, "OK, this player's contribution allowed the team to win X number of games." In our guy's case, that's 42, so it's a pretty significant number. And if you cross-reference it with the other players in the table, you'll see it fits in right around the middle.

And so with linear regression, we're going to take all of that data and pipe it through our linear regression filter,


and with our training data we're going to generate a graph. This is where we're going to see the straightforward nature of linear regression. We're simply going to take our dependent variable, the salary, pair it with each one of the candidate variables, and treat each pair like a traditional x-y coordinate graph. That is going to help us see exactly where all of the data points lie in relation to each other.


So we're going to do that, and from there, we're going to create a slope of the means. What that means is we're going to create a straight line that fits through our data points as closely as possible. Then, based on that slope, we're going to take our new data point, our John Doe baseball player, and line him up: we're going to say, "Okay, this was his batting average, so this is how much salary we should give him." And then we're going to do that for each one of the stats.
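The "slope of the means" step can be sketched in code: fit a best-fit line through the (stat, salary) pairs, then read the new player's stat off that line. The numbers below are made up for illustration; they are not the table from the guide:

```python
# Fit a straight line through (stat, salary) points, then read a new
# player's stat off the fitted line to get a salary estimate.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical batting averages and salaries (in millions of dollars).
avgs = [0.220, 0.250, 0.280, 0.310]
salaries = [1.0, 4.0, 7.0, 10.0]

slope, intercept = fit_line(avgs, salaries)
john_doe_estimate = slope * 0.242 + intercept   # slide .242 onto the line
# john_doe_estimate is roughly 3.2 (million) for this made-up data
```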

So we're going to take a look at batting average first. I'm following the convention where the batting average runs along the vertical axis and the salary runs along the horizontal axis, and I'll keep that same layout for each graph. Each one of these dots represents a player's salary paired with his batting average. So if you look all the way to the right, you can see that our player with the high .300 batting average is also the one with somewhere around 28 million dollars in salary.

If you go all the way to the left, you can see part of where we have a little bit of an issue. Notice that we have a player with a one and a half million dollar salary who actually has the lowest batting average of all the data points, while a couple of other players make much less and have significantly higher batting averages. So we create our slope, generated by fitting a line through all of the data we have, and we try to slide John Doe in by taking his batting average. Based purely on that slope, we are saying that John Doe should be paid, it looks like, around five hundred thousand dollars: with his batting average of .242, that would be his salary.


And I'm sorry, but I have some bad news for us: we have been fired from the baseball team, because we just gave an incredibly good baseball player the lowest offer of his entire career. That is not going to be the right prediction. It comes down to picking the correct versus the incorrect predictor: batting average is not the best item to pick when we are trying to dictate what the salary is.

So let's try again.


Now, we're going to look at the OPS values. Right here you can see we're getting closer, and this is really what I wanted to focus in on: the importance of domain knowledge. Notice that we're starting to get a little more linear with our data. Our two players with the lowest salaries do have the lowest OPS, and moving down to the right a little, you can see that our player at one point five million has a significantly higher OPS value. This still isn't perfect, though, because if you look all the way to the right, we have two players right around 21 million, which is nearly 20 million more dollars per year, which I think is pretty significant. I'm not sure how much more you have in your bank account, but mine doesn't quite have that.

And if you look at their OPS, it's not that much higher than the player making one and a half million. That should be a little bit of a red flag to you. This is part of the entire goal of this guide, and part of the reason why I didn't want to go straight into the perfect example of linear regression. That's actually one problem I have with a number of the guides I have previously found on linear regression: they instantly went to the example that had perfect linear data, and that's nice, and that's fun.

However, it's been my experience when building out real-world implementations that I usually have to go through a number of different candidate variables until I find one that truly fits the scenario. So in this case, OPS looks closer than where we were a little while ago with batting average, and if you look at the two lowest salaries and the two highest salaries, you can see they are perfectly linear. We are getting closer, and I think our next data point might have what we're looking for, but let's see what this prediction is first.


So if we have OPS determine our prediction, you can see that with an OPS of .800, we should be giving a salary of about 15 or 16 million dollars. And I'm sorry to say that once again we've been fired from our job, because we just gave an above-average player a much higher than above-average salary. So this is not the best predictor for what we're looking for. OPS can be a great predictive tool for a number of other key elements, but not for salary; we still have more data to take into account. So let's move on to our last item.

Now we're going to look at WAR. Once again, WAR is our most wide-reaching statistic for a baseball player. It can tell you, with shockingly high accuracy, how much a player contributes to the team: everything from how many runs they help drive in, to how good their defense is, to batting average, OPS, and RBIs. If you're not familiar with baseball and it sounds like I just spouted off a lot of gibberish, just think of it this way: this statistic shows how important a player is to their team winning, which, at the end of the day, is really the most critical factor in determining salary.

Now, if you look at these data points, you can see that we have come a long way from our batting average graph, where the data points were all over the place and there seemed to be no true correlation or association between the batting average and the salary. Now you can see that we have almost a straight line.


Now, if we look at the data, we can see that our two lowest points and our two highest points both match up nicely on the line, and so do the other points, the ones that used to be on the same level. Remember, when we looked at the OPS, those players making 21 million each were on pretty much the same level as a player making 20 million less. That makes no logical sense, and it's one of the reasons why OPS wasn't a great data point to look at for generating this prediction.

And now, if we use the graph, we see that a player with about a forty-two point WAR should be getting a little over a 10 million dollar salary, and if you put all of the data points together for the current state of Major League Baseball, this actually fits in perfectly. So WAR, as of the time of filming, is one of the best predictive data points for seeing how much a player is valued and how much they should be paid.

So, in summary, that is a high-level overview of the linear regression algorithm. It allows you to take a linear approach to analyzing data and to find which kinds of data points will determine a prediction, and hopefully I've also been able to stress the importance of picking the right data points. This is not an algorithm you can just throw a lot of data at and have it figure out the probabilities for you; it is not like naive Bayes, and it is not like a neural network. It is a much more mathematical and statistical approach to analyzing correlation.