Guide to Probability for Data Scientists
In this guide we're going to dive into probability. Probability is one of the key foundational concepts in data science and machine learning, because pretty much every algorithm has to deal with probability in some form or another.

Some algorithms are pretty much just a set of probabilities that are run against each other to generate a prediction. Others only use probability in certain cases or in certain functions. Either way, probability is one of the key items that you're going to have to understand, and that's the reason why I wanted to start off the entire course with this topic.

Now we're going to go through two different case studies. The first one is a base case. There are a number of different examples we could use for a basic introduction to probability, and flipping a coin is one of the most popular ones I've seen. However, I thought we'd do something a little bit more fun, so I created this baseball diamond, and we're going to talk about building a probability system.

This is a very basic example, just for seeing what the probability is for a hitter to hit the ball to one of three fields. You can see how I have separated the field into LF, CF, and RF, which stand for left field, center field, and right field.

Now it doesn't take a lot of arithmetic knowledge to see that we have one field separated into three elements, and each one of these has a weight of one. So the total number of options for where the hitter can hit the ball is three. There are only three locations on the field where the ball could go.


From a probability perspective, this is the syntax that you're going to use whenever you're dealing with probabilities or analyzing probability formulas in these algorithms: it starts with a capital P, and then in parentheses you place whatever event you want the probability of. So for example, right here we write the probability of a hitter hitting to left field, which is LF, as P(LF). The number of ways the condition can be met is one, and the total number of potential options is three.


We can transfer this directly into a mathematical formula: P(LF) = 1/3, which is about 33 percent. This assumes all other things are equal, so we're not getting into where a hitter usually likes to hit on the field or anything like that. This is just a basic example. With these three options, a player has a 33 percent probability of hitting to any one of the three fields.


So that is a basic example.
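
If you'd like to see the baseball calculation in code, here is a minimal Python sketch. The names `fields` and `uniform_probability` are made up for illustration, and the sketch assumes, just like the example, that every field is equally likely.

```python
from fractions import Fraction

# The three equally weighted regions of the field.
fields = ["LF", "CF", "RF"]

def uniform_probability(outcome, outcomes):
    """P(outcome) when every entry in `outcomes` is equally likely."""
    favorable = outcomes.count(outcome)      # ways the condition is met
    return Fraction(favorable, len(outcomes))  # over the total options

p_lf = uniform_probability("LF", fields)
print(p_lf)                   # 1/3
print(f"{float(p_lf):.0%}")   # 33%
```

Using `Fraction` keeps the exact 1 over 3 around, and the percentage is only produced at the end for display.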

Now let's dive into one that I think is a little bit more fun because there are some more things going on. Imagine that you are building out some type of restaurant application. Think of something kind of like Yelp, but you want to build out a cool new feature on the app. This feature is going to randomly go into the database of restaurants and recommend one to your users.

The way to read this chart is that it is set up in a tabular format. The columns at the top are four different types of restaurants: Italian, Mexican, American, and Chinese. The rows on the left-hand side are the number of stars. Just like Yelp or any of these restaurant rating systems, the restaurants are rated from one to four stars. So if you look, you can see that there is one Italian restaurant with one star. If you look down the American column, you can see that there are 11 American restaurants with three stars, and so on and so forth.


We can take the breakdown a little further so we can see all of the stats that are available to us. We have 20 Italian restaurants, 27 Mexican restaurants, 23 American restaurants, and 26 Chinese restaurants.


And when we look at the rows to see the ratings, we have five restaurants with one star, 29 restaurants with two stars, 51 restaurants with three stars, and 11 restaurants with four stars.


That takes us up to a total of 96 restaurants. You get that same total whether you tally up the rows or the columns.
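
As a quick sanity check on those tallies, here is a small Python sketch. It uses only the row and column totals stated above, since the full table is not reproduced in this text. The dictionary names are hypothetical.

```python
# Column totals: restaurants per cuisine, as stated in the guide.
cuisine_totals = {"Italian": 20, "Mexican": 27, "American": 23, "Chinese": 26}

# Row totals: restaurants per star rating, as stated in the guide.
star_totals = {1: 5, 2: 29, 3: 51, 4: 11}

total_by_cuisine = sum(cuisine_totals.values())
total_by_stars = sum(star_totals.values())

# Both ways of tallying the same table must agree.
assert total_by_cuisine == total_by_stars
print(total_by_cuisine)  # 96
```

If the two sums ever disagreed, that would be a sign that a cell was miscounted somewhere in the table.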

Now the nice thing about setting up our data in this kind of tabular format is that it makes it relatively straightforward to perform basic probability. On the top left-hand side here you can see the first equation that we want to fill out: what is the probability that a restaurant is Italian? To answer that, we can simply look at the Italian restaurants and analyze that entire column.


So we have one with one star, five with two stars, twelve with three stars, and two with four stars. The question we're asking is this: if we randomly reach into our database of restaurants, what is the probability that the system is going to generate an Italian restaurant?

Well, the way that we can figure that out is by tallying all of those up and then placing those over the total number of restaurants. So as you can see here we have 20 Italian restaurants divided by 96 because we have 96 in the entire database.


And if we run that calculation, you can see that if we just randomly reach into our database and pull out a restaurant, the probability of it being Italian is about 21 percent.
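
The Italian calculation can be sketched in Python like this. The variable names are only illustrative, and the counts are the ones read off the Italian column above.

```python
from fractions import Fraction

TOTAL_RESTAURANTS = 96

# Italian column of the table: star rating -> number of restaurants.
italian_by_stars = {1: 1, 2: 5, 3: 12, 4: 2}

italian_total = sum(italian_by_stars.values())        # 20
p_italian = Fraction(italian_total, TOTAL_RESTAURANTS)

print(p_italian)                   # 5/24 (20/96 in lowest terms)
print(f"{float(p_italian):.0%}")   # 21%
```

The exact value is 20/96, or roughly 20.8 percent, which rounds to the 21 percent quoted above.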


So that gives us a pretty clear estimate of how often a single genre of restaurant, like Italian, is going to come up. We could perform the same exact task on the Mexican, the American, and the Chinese restaurants, and each would give us a probability in the same way. So that is behaving as expected, and it is pretty similar to the first baseball example we looked at.

Now let's try to extend this a little bit and ask the question: if we randomly pick a restaurant from the database, what is the probability that it's going to be an Italian restaurant and that it's going to have four stars?


So notice how here we've added another condition to that probability. Whenever we do that, we're essentially saying that our probability equation has to be more specific. It has to filter out some of the other results. So notice that right here we no longer care about any of the Italian restaurants that have one through three stars. We have only highlighted the ones that are Italian and four stars.

And you can see right there that that leaves us with two restaurants. Following the same pattern, if we place the two in the numerator and the total number of restaurants, 96, in the denominator, we have 2 over 96, which if we run that through comes out to about 2 percent.
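
Here is one way that "Italian and four stars" lookup might look in Python. This is a sketch under a few assumptions: `known_cells` and `joint_probability` are hypothetical names, and the dictionary holds only the cells this guide states explicitly, not the whole table.

```python
from fractions import Fraction

TOTAL_RESTAURANTS = 96

# (cuisine, stars) -> count. Only cells mentioned in the text are listed;
# the rest of the table is deliberately left out rather than guessed.
known_cells = {
    ("Italian", 1): 1,
    ("Italian", 2): 5,
    ("Italian", 3): 12,
    ("Italian", 4): 2,
    ("American", 3): 11,
    ("Chinese", 3): 19,
}

def joint_probability(cuisine, stars):
    """P(cuisine AND stars): one cell of the table over the grand total."""
    return Fraction(known_cells[(cuisine, stars)], TOTAL_RESTAURANTS)

p_italian_four = joint_probability("Italian", 4)
print(p_italian_four)                   # 1/48 (2/96 in lowest terms)
print(f"{float(p_italian_four):.0%}")   # 2%
```

Adding the second condition narrows the numerator from a whole column down to a single cell, while the denominator stays at 96.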


So the probability of us reaching into the database and getting a four-star Italian restaurant is about 2 percent, which is very helpful to know.

So far we've looked at a couple of different examples. We took a broad view where we wanted to see all of the items within a certain column. Then we narrowed down our focus by adding a second condition. Now we're going to change it up and ask: what if the condition can be met by one set of data or by another set of data?

With that in mind, let's ask the question: what is the probability of picking a random restaurant out of the database that is Chinese or has three stars?


Right here I've highlighted this type of query, and as you can see it's quite a bit different from any of the other ones we've looked at so far. We do not care about just one of the data attributes. Instead we care about all of the three-star restaurants and all of the Chinese restaurants. This is going to change how we look at the data a little bit, and just in case you don't feel like breaking out your calculator, I've tallied it up. We have 51 total three-star restaurants and 26 total Chinese restaurants.


Now you may think that we can simply add these together, divide by 96, and we're done. However, that will not get you the output that you're looking for.


A Venn diagram helps to show exactly why that is. It's a very helpful visual, because you can see that I have our total list of three-star restaurants and our total list of Chinese restaurants.


However, notice that some Chinese restaurants are also three-star restaurants. If we simply added the two lists together, we would count those restaurants twice, we would have duplicate data inside of our probability, and our probability would be inaccurate. That overlap is this 19.


This is right where the Chinese restaurant list meets the 3-star restaurant list. And so if we added both of them together we would technically be counting the same 19 restaurants that fall right here.


We'd be counting them twice, and we do not want to duplicate them because that's going to throw off our probability. So instead we need to subtract one copy of those 19 restaurants. You tally up all of the three-star restaurants and all of the Chinese restaurants, and then you subtract 19. Those 19 restaurants are still counted. The only difference is that they are not counted twice.

And if you do not trust me, feel free to pull out your calculator and run it through: 51 plus 26 minus 19 is 58. There are 58 total restaurants that meet this condition, which means they are Chinese or they have three stars. And so if you take 58 divided by 96, you're going to get a probability of about 60 percent. That means there is a 60 percent chance that our random restaurant recommendation engine is going to return either a three-star restaurant or a Chinese restaurant. That is very helpful to know, because any time you're building out any kind of service like this you want to have some idea of what the output is going to be. And that is exactly what a probability system can do for you.
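
To see the double-counting problem in code, here is a short Python sketch that compares the naive sum with the corrected count. The counts come from the guide. The restaurant IDs in the second half are invented purely so that Python sets can play the role of the Venn diagram, and all variable names are hypothetical.

```python
from fractions import Fraction

TOTAL_RESTAURANTS = 96
three_star = 51   # all three-star restaurants
chinese = 26      # all Chinese restaurants
both = 19         # Chinese AND three-star (the overlap in the Venn diagram)

# Naive approach: the 19 overlapping restaurants are counted twice.
p_naive = Fraction(three_star + chinese, TOTAL_RESTAURANTS)

# Corrected approach: subtract the overlap once.
either_count = three_star + chinese - both
p_either = Fraction(either_count, TOTAL_RESTAURANTS)

print(either_count)                # 58
print(f"{float(p_naive):.0%}")     # 80%
print(f"{float(p_either):.0%}")    # 60%

# Same idea with sets. The IDs are made up: 0-50 are three-star,
# 32-57 are Chinese, so IDs 32-50 (19 of them) belong to both.
three_star_ids = set(range(0, 51))
chinese_ids = set(range(32, 58))
assert len(three_star_ids & chinese_ids) == both
assert len(three_star_ids | chinese_ids) == either_count
```

A set union never stores the same element twice, which is why its size matches the corrected count of 58 rather than the naive 77.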