Overview of Logistic Regression
In this lesson, we're going to walk through the logistic regression algorithm. In the machine learning space, logistic regression has been one of the favorite algorithms that many machine learning developers and data scientists turn to when they're looking to model out some type of system.
Guide Tasks
  • Read Tutorial
  • Watch Guide Video
Video locked
This video is viewable to users with a Bottega Bootcamp license

And one of the key reasons for that is because logistic regression has been around for quite a while and it has very deep roots in the statistical analysis space and what it allows you to do is to generate probability between two potential outcomes.

large

So this is very different from linear regression. And the reason why I placed both of these algorithms right next to each other in the course is because I don't want you to get them confused. I know they have very similar names and even when you see it on a graph you may think that they're similar. However, the key takeaway that I want you to remember from this guide is that in logistic regression you really focus on those two potential outcomes.

And when we get into our case study this is going to be much more clear when we're working with linear regression we set up a graph and then we set up data points on the graph and then we leveraged those data points to be able to draw a straight line that allowed us to create a prediction based on historical data. But that prediction really was only able to take us so far and it depended on our data being linear.

In logistic regression, it's going to be much different, where we're going to have one state and then a second state and then the data is going to fall only into those two camps and so it's going to make our prediction much more accurate. And before we can jump into the definition I want to make one point of clarification and this can be confusing to new developers who have never heard of logistic regression or linear regression before and that is that technically logistic regression suffers from being named improperly.

Even though it has the name regression in the actual algorithm name it's technically categorized more as a classification algorithm and I think this makes sense because inside of classification typically you're trying to ask yourself what something is what category something falls into and that is exactly what logistic regression does. And if you want more information on the roots of why logistic regression has the name regression in it I highly recommend for you to go and search for it online and you'll find a plethora of historical information on why that is. I want to focus on the actual behavior of it but that is something that is helpful to understand.

Now let's dive into the definition. So Medcalc says that logistic regression is a statistical method for analyzing data set in which there are one or more independent variables that determine an outcome the outcome is measured with a dichotomous variable.

large

And if you've never heard of what a dichotomous variable is that word may sound very intimidating. However, we already discussed it when we're talking about the primary goal for logistic regression

large

A dichotomous variable, all that means is that it is something that can have a maximum of only two outcomes. The further that you go along in your machine learning journey you'll also hear other terms such as binary classification that are also very similar to this. What this means is that you have two different types of classes that you're trying to place some data point on in you're trying to create a prediction and we're trying to leverage historical data in order to perform that classifications so that's exactly what we're doing with logistic regression.

So let's discuss what some of the primary use cases for this algorithm are

large

logistic regression is very good at image classification so it allows us to perform tasks such as helping with handwriting recognition with telling if you are tagged in an image and different features like that. It also excels in medical diagnosis systems and that's going to actually be our case study that we're going to walk through. We're going to see how we can leverage logistic regression as a component of a medical diagnosis application that checks to see if someone has a higher risk of having cancer versus not having cancer.

It has been used for other use cases such as political predictions sales predictions. And then one of the most popular uses for it is that it can work as a component inside of neural networks. So because of the way logistic regression works because of how fast it is and because of how good it is at the task of binary classification. It is very helpful, not only by itself but typically where you will see logistic regression use the most is when it is layered on and integrated into other algorithms.

So take for example the image classification use case there at the very top logistic regression by itself will not be able to tell you what a particular mark on a piece of paper is. It's not going to be able to tell you if it is a, b, c, d, e, etc. But what it will do is it's going to allow you to perform a final analysis of your hypothesis. So inside of a neural net or even just another set of algorithms say that you do have a letter and you've analyzed it you've used a tool such as a support vector machine or something like that and then you're pretty sure that you have a good idea of what this letter on this piece of paper is representing whether it's a letter number or any type of handwriting.

Then after you have your initial prediction you can run logistic regression to see if that is accurate or not. And it'll give you a very good prediction between two values. So imagine that you have a final type of prediction in your support vector machine has taken in this one letter and you're not sure if it's an I or an L and so you have a pretty decent idea of what it is and the probability is high on the support vector machine side.

However, you still want to make your prediction even more accurate. You can run logistic regression at the very end to see Okay where does it fall between these two classes? Is it an I or is it an L? So that is a very common use case for this algorithm. So now that we've discussed some of the common use cases. Let's talk about the pros and cons.

large

One of the top reasons why people love to utilize logistic regression is it's very fast because computers are so good at performing complex mathematical computations and logistic regression is really just that it is a set of mathematical formula that receives an input and then runs these calculations and it gives you an output it's just a statistical analysis tool. And so computers are very fast at running those types of systems.

The next item on the list of pros is that logistic regression generates very intuitive predictions. And the reason for that is because it focuses on just giving you a choice between two different outcomes and a nice benefit to this type of behavior is that it makes the data and the prediction much easier to read than many other algorithms that are out there.

And the last pro is that this is something that we've already discussed and that is that logistic regression works very well with other algorithms and is one of the most common algorithms that you're going to find in pretty much every neural network that you run into. Neural networks are so massive and there are so many different processes and algorithms that need to occur and being able to classify quickly between two different elements is something that a neural net has to do quite often. And so it makes sense that any time it has to perform that type of tasks that it reaches for logistic regression in order to help it.

Next on the list is the set of cons.

large

And there really are just two cons and I'm not even going to say that these are negatives in terms of some of the other cons we've talked about with other algorithms such as poor performance or anything related to accuracy or anything like that. And so what these cons really are related to is more of simply skill as a machine learning developer or data scientist.

First is that it requires many preprocessing tasks. So in order to get the data formulated in a way where logistic regression can work with it there are many different data set so that simply will not be able to be piped directly into logistic regression because it needs to know certain things so it needs to know two different classes that it looks for when it builds out its training set and also because logistic regression is essentially just a set of mathematical formula. It requires more understanding of how the math works in order to be implemented properly. And like I said I don't really view these as much as cons as simply as tasks that you need to be able to understand and work with whenever you are trying to implement this algorithm.

So now that we have walked through an overview the primary purpose the definition the pros and the cons. Now let's walk through our case study where we are going to see how logistic regression can be used in predicting cancer in medical patients and this may sound like a little bit of a darker kind of case study.

However, I thought it was important to include in the course because this is a real-world example of how powerful algorithms like this can be that it can actually help in saving and preserving people's lives. And depending on what field of study that you want to go into and what part of the industry that you're going to be building these types of machine learning algorithms for you may be able to do things like this which I think is incredible. There are very few times in the entire history of the world where we're able to leverage tools like this in the pursuit of being able to help people.

So let's start by looking at the data set

large

and this data has already been filtered down and preprocessed to be the kind of data that you would need to have in order to implement a logistic regression type of algorithm. And as you can see here on the left-hand side we have white blood cell count and then on the right-hand side, we have a status of whether a set of historical patients had cancer or if they didn't.

This is taken from an actual medical research study that I went through. I am not in this field, I'm not a domain expert whatsoever in the field of medical diagnosis or anything like that. So I'm keeping everything from a data perspective here a very high level but I've found this type of data fascinating and this was a key example and case study that has been published on how logistic regression can aid in the medical community.

And so what our system is looking to do is it's looking to give a prediction on if a patient will have cancer or if they won't all determined by the white blood cell count. So we have a count there that ranges from 1 to 10. And then on the right-hand side, we have our dichotomous variable remember that means that we have either a, essentially a not cancer or a cancer variable. There are only two different statuses that we can look at. And so what's going to happen here is we're going to take all of that data. We're going to run it through our logistic regression funnel and then that is going to generate that medical diagnosis.

large

And so the processes that are going to be run here

large

are we're first going to translate our cancer or not cancer states to 1 to 0 as you will see when we get further in and actually start building these types of programs out in Python programs do not typically work with string bass data. So they like to work with numerical data such as a 0 or a 1. So when we're working with logistic regression those states of cancer or not cancer are going to be translated to numerical values. And then there is going to take all of the data and it's going to run it through that mathematical formula and it's going to generate the prediction.

large

So our training data is going to look something like this. And as you might notice this is a very different kind of grouping than we've seen in many of the other algorithms that we've graphed so far. Notice how all of the data points are either on the zero line they either have a zero value for the status or they all have a one value for the status and for the case of this example we are going to say that 1.0 so the value 1 means that the patient did not have cancer and 0 means that they did. And then there on the bottom access, we have the numbers going from 0 to 10 and that represents the white blood cell count. So now that we've trained our data it's time to make a prediction.

Now if we were to use a tool like linear regression and draw a line up from 0 and 0 all the way up and if we were told that we had a value of say a white blood cell count of three we might mistakenly place our prediction here

large

which means that there is a 75 percent chance that the individual has cancer and that is not a good prediction because if you look at where the data is clustering you could see that that is most likely not the case but if we're to use something like linear regression we might end up with data like this. And so what we do with logistic regression is we don't have to worry about having a straight line and seeing where that straight line should fall on the graph.

Instead we use tools and functions a very common one that you're going to hear when we get into the implementation is the sigmoid function and what that allows us to do instead of creating a straight line we can actually create a curved line on the graph and then we can place our data point in our prediction in a much more accurate value that represents where it should be based on the historical data. So as you can see right here

large

if we have a patient who has a white blood cell count of 3 we are going to be placing them in around the ninetieth or so percentile of being healthy which if you look at the historical data that is much more accurate in terms of where they should be positioned on that graph and what the predictions should be.

And likewise, if we have a different patient who comes in and they have a white blood cell count of seven as you can see here on the graph

large

if we move that prediction all the way down the sloped line we can see that places them in a percentile where they have about 80 percent or so chance of having cancer. And that is the correct prediction and based on the papers that I went through that is actually relatively accurate in what doctors do when they're performing their own diagnosis.

So hopefully you can see how powerful this type of tool can be in many different industries because it allows you to look at a full set of historical data and then based off of that being able to tell between two different outcomes where you should place that prediction. And so this is going to be an algorithm that you hear quite a bit about. The further you go into the data science industry and it's most likely going to be one of the more common algorithms that you implement in yourself.