Guide to Hypothesis Testing for Data Scientists

The word hypothesis essentially means an educated guess. In the world of statistics, whenever you're analyzing data, you're forming your own ideas and assumptions, and therefore your own educated guesses, about the world. The entire reason you look at statistical data is to help you make decisions. That's the point of the entire industry.

So a hypothesis is just that: an educated guess. The second word, testing, means testing our hypothesis to see if our educated guess is accurate. That may seem like an oversimplification, and in a sense it is, because when you go through the case study that we're going to walk through, you will see that hypothesis testing actually provides an entire system of analysis. We're going to look at a case study, examine the data, and then build out a structure for judging whether a hypothesis is correct or incorrect. The case study we're going to examine is the question of where to place our defensive players on the field in a baseball game.

Now you do not have to be a baseball statistics expert in order to understand what's going on here. I have a very small sample set here: a baseball field where the green dots represent the number of times a hitter has hit the ball to that part of the field. As you can see, they have hit it to the left-hand side of the field nine times, up the middle four times, and to the right-hand side twice.
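
To make those counts a little more concrete, here is a minimal Python sketch that turns them into shares of the field. The counts come straight from the sample above; the names `spray_counts` and `spray_share` are just ones I made up for illustration.

```python
# Spray counts taken from the small sample described above.
# The zone names and variable names are illustrative assumptions.
spray_counts = {"left": 9, "middle": 4, "right": 2}

total_hits = sum(spray_counts.values())

# Share of batted balls hit to each part of the field.
spray_share = {zone: count / total_hits for zone, count in spray_counts.items()}

for zone, share in spray_share.items():
    print(f"{zone:>6}: {spray_counts[zone]:2d} of {total_hits} ({share:.0%})")
```

With only 15 batted balls this is a very small sample, but it already shows the pull tendency: 9 of 15, or 60 percent, went to the left-hand side.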

Now there's a concept in baseball called the defensive shift. If we look at a baseball field, this is what a shift looks like in action.

Instead of having all of the players evenly distributed on the field, a shift rearranges the defensive players, like you can see in the red circles, so that they are positioned where the team believes the hitter is most likely going to hit the ball. As you can tell from some of the keywords I just said, such as "most likely", this deals directly with probabilities and prediction, which are at the heart of any kind of statistical analysis.

So essentially what we're going to try to do with our hypothesis is see, by analyzing the data, whether we should reposition our players in order to take advantage of the shift and decrease the effectiveness of the batter.

We're going to create a table here. This is a formal hypothesis testing table that you will see all over mathematics and statistics textbooks.

I have the generic version right here, and then we are going to replace some of these values with our case study. But first let's walk through what these components are. On the top left-hand side you can see H₀, which stands for the null hypothesis. Whatever our hypothesis is, whatever our educated guess is, the null hypothesis says that the educated guess is wrong and that the behavior we're trying to affect is not affected whatsoever.

Now if that concept is a little bit difficult to wrap your head around, that is the reason why I have created this entire case study. If you simply look at each one of the symbols on this page, it might be a little bit intimidating if you've never seen them before. So we're going to dive right into the case study and see how our real-world data points map directly onto the table.

Our hypothesis is there at the very top of the table: putting a shift on a pull hitter decreases their effectiveness. Our hypothesis comes from the data on that very first page, which showed that the pull hitter hit to the left-hand side nine times, up the middle four times, and to the right twice. So as a baseball organization we're looking at that data, analyzing it, and coming up with the hypothesis that putting a shift on that hitter is going to decrease their effectiveness.

Now looking to the left-hand side, the null hypothesis is pretty much the opposite of that concept. The null hypothesis, in this case, states that putting a shift on a pull hitter has no effect on performance. Notice how the null hypothesis is pretty much the opposite of our hypothesis, our educated guess. From that point, we can now create our system for analyzing the kinds of acceptance and rejection decisions that come up whenever we're looking at data.
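
If it helps to see the two statements in symbols, here is one way they could be written. This is only a sketch under an assumption of mine: that "effectiveness" is measured by a single rate, such as the hitter's batting average, written as a hypothetical p with and without the shift. The case study itself does not fix a particular measure.

```latex
H_0 : p_{\text{shift}} = p_{\text{no shift}}
\qquad
H_1 : p_{\text{shift}} < p_{\text{no shift}}
```

Here H_0 is the null hypothesis (the shift changes nothing) and H_1 is our educated guess (the hitter is less effective against the shift).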

The two column headings at the top say "accept H₀", which means that we accept the null hypothesis, and "reject H₀", which means that we reject it. The row headers on the left-hand side describe reality: the first row is where the null hypothesis is true, and the last row is where the null hypothesis is false.

As you can see, one of the most challenging parts of working with hypothesis testing is that the terminology can often seem a little bit convoluted. We are working with the null hypothesis in all four quadrants. You may think the most intuitive approach would be to simply check whether our hypothesis is true, or whether we accept or reject our hypothesis, but that is not the way the formal hypothesis testing system operates.

If you're wondering why that is, or if you think that is a stupid way to look at the world, I want to give you a slightly different perspective, one that you will find in essentially every field of science: as a scientist, you should go into any type of testing scenario with the idea that your hypothesis very well could be wrong. So with our idea that putting a shift on a pull hitter decreases their effectiveness, we have to go in with an open mindset that assumes it is completely within the realm of possibility that we were wrong with our educated guess.

When we start with a null hypothesis, that allows us to look at the way the world is right now, look at reality, and then judge our hypothesis better from that perspective. Starting off with the top left-hand cell, that represents a situation where the null hypothesis is true and where we accept the null hypothesis. In the world of statistical analysis, that means we came up with this educated guess, but the educated guess was wrong.

We looked at the data and decided not to accept our own hypothesis, which means that we accepted the null hypothesis. That is the correct decision. In our scenario, it means that we had the idea to put a shift on a pull hitter in order to decrease their effectiveness, we looked at the data and saw that it actually did not work, and our hypothesis was wrong. We then decided to simply follow the data and make the correct decision by accepting the null hypothesis.

Now the other approach. Imagine a scenario where the null hypothesis was true, but after analyzing all the data we decided to reject it. That is the cell directly to the right. It means that the data didn't show any difference at all, so our hypothesis was wrong, but we decided to put the shift on anyway. We ignored what we saw in the data and simply went with our gut and our hypothesis. That is called a Type 1 error, and it's represented by the lowercase Greek letter alpha (α). So if you are ever looking at any type of statistical analysis documentation and you see the words Type 1 error or a small alpha (α), it means that the null hypothesis was true, which means your hypothesis was wrong, but you still decided to go with your own hypothesis. That is called making a Type 1 error.
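
One way a Type 1 error shows up in practice is when a sample looks convincing purely by chance. The sketch below is a small simulation of that, using only the Python standard library. Everything specific in it is an assumption I picked for illustration and not part of the case study: the hitter's success rate of 0.30, the 100 at-bats per sample, the alpha of 0.05, and the helper names `binomial_cdf`, `critical_hits`, and `simulate_type_1_rate`. It uses a simple one-sided binomial test as the decision rule.

```python
import math
import random


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def critical_hits(n, p0, alpha):
    """Largest hit count k with P(X <= k | H0) <= alpha, or -1 if none exists."""
    k = -1
    while k + 1 <= n and binomial_cdf(k + 1, n, p0) <= alpha:
        k += 1
    return k


def simulate_type_1_rate(n=100, p0=0.30, alpha=0.05, trials=2000, seed=0):
    """Simulate samples where H0 is TRUE and count how often we reject it anyway."""
    rng = random.Random(seed)
    cutoff = critical_hits(n, p0, alpha)
    false_rejections = 0
    for _ in range(trials):
        # The shift has no effect: the hitter still succeeds with probability p0.
        hits = sum(rng.random() < p0 for _ in range(n))
        if hits <= cutoff:  # few enough hits that the shift "looks" effective
            false_rejections += 1
    return false_rejections / trials


rate = simulate_type_1_rate()
print(f"Rejected a true null hypothesis in {rate:.1%} of simulated samples")
```

Every rejection counted here is a Type 1 error, because the simulation is built so that the shift never actually changes anything. The chosen alpha is the ceiling we are willing to tolerate for that kind of mistake.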

Now let's go down to the next row. This is where the null hypothesis was false. This is a situation where you came up with your educated guess and it turned out that your guess was correct, and whenever your hypothesis is correct, the null hypothesis H₀ is false. Starting in the bottom left-hand cell, if we accept the null hypothesis, that essentially means that we looked at the data and decided to completely ignore it and stay with the status quo. In our case study, it means that we ignored the data and decided that the position shift is not related to performance, even though all of the data showed that it was.

Moving over to the bottom right-hand cell, this is a situation where we looked at the data, analyzed it, and saw that the null hypothesis was false, which means that our hypothesis was correct, and then we decided to reject the null hypothesis. That is the correct decision.

Now, looking back at our table of terms, if we accept the null hypothesis when it is false, then we have committed what is called a Type 2 error, which is represented by the lowercase Greek letter beta (β). If you think that this is all incredibly elementary, then that's good. My goal was to make this entire concept as straightforward as possible. But it is not too simplistic, and it is an incredibly important concept to understand in the world of statistical analysis.
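
The four cells we just walked through can be sketched as a tiny lookup in Python. The function name `classify_decision` and its two boolean parameters are hypothetical names of my own, but the mapping itself follows the table exactly.

```python
def classify_decision(null_is_true: bool, reject_null: bool) -> str:
    """Return the table cell for a given reality (row) and decision (column)."""
    if null_is_true and reject_null:
        return "Type 1 error (alpha)"  # rejected a true null hypothesis
    if not null_is_true and not reject_null:
        return "Type 2 error (beta)"  # accepted a false null hypothesis
    return "correct decision"  # the decision matched reality


# Print the table row by row: reality first, then the decision we made.
for null_is_true in (True, False):
    for reject_null in (False, True):
        reality = "H0 true " if null_is_true else "H0 false"
        decision = "reject H0" if reject_null else "accept H0"
        print(f"{reality} | {decision} -> {classify_decision(null_is_true, reject_null)}")
```

In the case study, `classify_decision(True, True)` is the team putting the shift on even though it has no effect, and `classify_decision(False, False)` is the team leaving its players where they are even though the shift would have worked.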

Your first thought might be: why in the world is this even needed? Who in the world would look at all of the data and then decide not to go with it? If you believe that, then you should simply get out in the world a little bit more often, because Type 1 errors and Type 2 errors are happening around us every single day.

Imagine a scenario where you decide to date someone who has a very nasty track record with their past relationships. In that situation, you decided to reject the null hypothesis even though the null hypothesis was true, and you simply went with your gut, with your own educated guess or what you wanted to do. That is a very common type of decision that happens around us all the time.

You can look at individuals who smoke on a regular basis even though all of the data shows that smoking leads to lung cancer. They decided that it would not apply to them, so they rejected that hypothesis, made their own decision, and ignored the data.

It's because of those things that hypothesis testing is necessary. Part of the reason it's included in this course is that, as you go through your own data science journey, you're going to run into many different situations where someone will mention that a Type 1 or a Type 2 error is occurring. It's incredibly important to understand what that represents, because those are two very different types of errors. Feel free to reference this guide and this chart to help you clarify that whenever you run into it in your own machine learning journey.