Guide to Distribution for Data Scientists

We're going to analyze the different components that make up distribution so that when you hear the term, you'll know exactly what the person is referring to.

So right here what we have is a list of some of the popular forms of distribution. We have uniform distribution and we have normal distribution. And then one of the key terms that you're going to hear in all kinds of different machine learning programs and data science projects is the central tendency.

The central tendency is one of the key elements that you're looking for whenever you're building out some type of statistical analysis tool, and we'll talk about the different ways you can find it. I think you will be happy after you've gone through this, because some of these terms may sound a little intimidating, but they are all around us in the world. Even if you've never heard these terms, you have been working with the concepts for a long time. So we're just going to connect each term with your view of it, and that should help you as you go on your machine learning journey.

So the very first thing we're going to discuss is uniform distribution, which is also referred to as rectangular distribution. Statistics How To describes it by saying that "a uniform distribution, also called a rectangular distribution, is a probability distribution that has constant probability."

Now that is not my favorite definition. However, you are going to hear that kind of set of terms quite often so I want you to be familiar with it. But once you actually see exactly how it works you'll see it's relatively straightforward.

So when we talk about uniform distribution we mean that the probability is going to be the same across all inputs. So for a base case example, we have here the concept of a coin flip.

So we have a 50 percent chance that we're going to have heads and a 50 percent chance that we're going to have tails. Assuming that we do not have some kind of trick coin or biased coin, the probability of each of these two options is 50 percent. If you flip a coin 100 times, you should get around 50 heads and 50 tails. That is uniform distribution: the probability for all of the inputs is the same.
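To make the coin flip concrete, here is a minimal Python sketch that simulates fair flips and counts the outcomes. The function name `flip_coins` and the seed value are assumptions made for illustration; they are not from the original guide.

```python
import random
from collections import Counter


def flip_coins(n_flips, seed=0):
    """Simulate n_flips fair coin flips and count each outcome."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    # rng.choice picks each option with equal probability (uniform).
    return Counter(rng.choice(["heads", "tails"]) for _ in range(n_flips))


counts = flip_coins(100)
total = sum(counts.values())
for outcome in ("heads", "tails"):
    # Each share should land near 0.5, but rarely exactly on it.
    print(outcome, counts[outcome], counts[outcome] / total)
```

With only 100 flips the split will usually be close to 50/50 rather than exact; raising `n_flips` tends to pull both shares closer to 0.5.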

Now let's talk about normal distribution, because you will come across both types of distribution and it's very important to understand the key differences. A normal distribution is the bell-shaped curve. I'm sure you've seen it in a number of different examples, both in mathematical case studies and in general life. The bell-shaped curve is one of the most popular and most commonly occurring types of distribution in the real world, and we'll talk about that in the case study.

TechTarget says that a normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme.

Now I love this definition; it's a perfect way of describing how normal distribution works.

You can see right here that we have four different types of normal distribution. The reason why I wanted to show this graphic is that many times when you see a bell-shaped curve, it looks something like the one all the way to the left, where it scales up nice and easy and then goes right back down.

But if you look at each one of these curves, you may notice that they all fit within the definition of normal distribution, which is that the majority of the data is going to cluster in the center. That is why each one of these four examples is a perfect fit for a bell-shaped curve and for normal distribution.

One of my favorite examples of this is not even one that came from the mathematical or data science world. It simply came from a product manager, a PR product manager for Pepsi. When we were talking about how his press release system worked and how he would get reviews and ratings on what Pepsi was doing, he said something very interesting. He was describing what they've seen throughout the years, and since Pepsi's been around for a long time, this includes quite a bit of information. What he said is that they see a very pure bell-shaped curve when it comes to the response to their advertising, their marketing, and their press releases.

He said that around 5 percent of people are going to love Pepsi no matter what they release. For any kind of ad campaign or any new product news, that 5 percent of people are going to love them no matter what. The other 5 percent, on the other end of the spectrum, are going to hate anything that Pepsi puts out just because they don't like the brand or there's something about them they don't like. So what he said is that what he cares about is that middle 90 percent, and you can see that it would be represented right here by any one of these four bell-shaped curves.

That middle 90 percent is really who he's trying to speak to, because he doesn't have any control over the left-hand side or the right-hand side. You are going to hear this term quite a bit.

Now normal distribution is also referred to as Gaussian distribution. So if you are reading a data science textbook and it says something like "the trend was Gaussian," what they're saying is that the data looks something like this, where there was a natural clustering right in the center of the data and then there were outliers on the left-hand and the right-hand side. And so that is normal distribution.
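The clustering in the center can be checked with a short simulation. This is a minimal sketch that draws Gaussian samples with Python's standard `random.gauss` and measures how many fall near the mean. The helper name `fraction_within` and the mean and standard deviation values are made up for illustration.

```python
import random


def fraction_within(samples, mean, std_dev, k=1.0):
    """Return the share of samples within k standard deviations of the mean."""
    inside = [x for x in samples if abs(x - mean) <= k * std_dev]
    return len(inside) / len(samples)


rng = random.Random(42)        # seeded so repeated runs match
mean, std_dev = 50.0, 10.0     # hypothetical values, not from the guide
samples = [rng.gauss(mean, std_dev) for _ in range(10_000)]

# For a normal distribution, about 68% of values are expected within
# one standard deviation of the mean and about 95% within two.
print(fraction_within(samples, mean, std_dev, k=1))
print(fraction_within(samples, mean, std_dev, k=2))
```

The small share left outside two standard deviations corresponds to the tails on the left-hand and right-hand side of the bell-shaped curve.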

Now the formula for locating a value within a normal distribution is called the z-score, and the exact mathematical formula is right here on the screen, where it says that z equals x minus the mean (μ), that's what the little u-shaped symbol is, over the standard deviation (σ): z = (x - μ) / σ.

So what x represents here is the value that is being standardized, so this is the data point that we're looking at. Now the little μ is the mean, so this is the average of all the data points. We take x minus the mean, and then it all gets divided by the standard deviation (σ). If you look back at the product review example that I have right here, you may notice on the right-hand side we have those two elements, the mean and the standard deviation, and on the very bottom you can see x. So the way we figure out what our z-score is, and where a value sits on this bell-shaped curve, is by taking the score we're looking at, subtracting the mean (μ), and then dividing it all by the standard deviation (σ). That is the formula for finding the z-score.
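The z-score formula described above, z = (x - μ) / σ, can be sketched in a few lines of Python. The function name `z_score` and the example numbers (a mean of 70 and a standard deviation of 10) are assumptions chosen for illustration.

```python
def z_score(x, mean, std_dev):
    """Standardize x: how many standard deviations it sits from the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - mean) / std_dev


# Hypothetical scores measured against a mean of 70 and std dev of 10.
print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(70, 70, 10))  # 0.0  (exactly at the mean)
print(z_score(55, 70, 10))  # -1.5 (below the mean)
```

A positive z-score places the value to the right of the center of the curve, a negative one to the left, and zero right at the center.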

Now the last topic we're going to discuss is the central tendency. This is the center of the distribution, and as you're going to see, there are three ways that we can find the central tendency. I really like the way that Laerd defined the concept of the central tendency, because if you've never heard this term before it sounds a little intimidating, but really it's pretty straightforward.

"A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode."
-Laerd

So right here we can look at a few of the most common ways of getting the central tendency with a base case. Say that you had five data points and they were the numbers one, two, three, four, and five. If we add all of those up, we get 15, and if we divide that by the number of data points, which is 5, we will see that the mean, the average, is 3.
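As a quick check of that arithmetic, here is a minimal Python sketch that computes the mean by hand and compares it with `statistics.mean` from the standard library. The variable names are assumptions for illustration.

```python
from statistics import mean

data = [1, 2, 3, 4, 5]

# Add up the values, then divide by how many values there are.
manual_mean = sum(data) / len(data)
print(manual_mean)  # 3.0

# The standard library gives the same value.
print(mean(data) == manual_mean)  # True
```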

Now if we look at the entire data set right here and ask what value is right in the middle, in statistical analysis that is the median. It is the value that is at the very center of the sorted data set. There are a couple of reasons why the median is not picked as often as the mean. One is that if you're working with an even number of elements, you have to put in a few tricks in order to find the median. Right here it's easy to find the median because we have five elements, which means that the third element is going to be right in the middle, and so it's easy to pick out our median as being 3.

If we had six elements, then we technically wouldn't have a single middle value. We'd have to work with the third and the fourth elements, and the usual convention is to take the average of those two as the median. The median also is not always as good, because there are many times where, if you have data with a large number of small numbers and then a large number of large numbers, the median may not actually be a good indicator of what the central tendency is. The mean is typically the most common choice, just because getting the average of all of the values is usually a good way of finding the most central value in that collection.
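The odd and even cases can be seen side by side with Python's standard `statistics` module. This is a small sketch; the list names `odd_data` and `even_data` are assumptions for illustration.

```python
from statistics import median, median_high, median_low

odd_data = [1, 2, 3, 4, 5]
even_data = [1, 2, 3, 4, 5, 6]

# Odd count: the middle element is the median.
print(median(odd_data))  # 3

# Even count: median() averages the two middle values (3 and 4).
print(median(even_data))  # 3.5

# If an actual member of the data set is wanted instead of an average,
# median_low and median_high pick the lower or upper middle value.
print(median_low(even_data), median_high(even_data))  # 3 4
```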

And then lastly we have the mode. Now the mode is the least used option when you're looking for the central tendency. So right here, for example, we have 6 numbers: 1, 2, 2, 3, 4, and 5. The mode is the value that occurs the most often, and so in this case it's 2, as you may see if you look at the set of values. The mode is not always the best measure of central tendency, because imagine that we had a collection that was 1, 2, 2, 3, 100, and 200. The value 2 would still be the mode, but it definitely is not the central tendency.

It is not the value that best represents the central part of the data set. That's really where you'd be looking at one of the other options. So whenever someone says central tendency, what they mean is the value that best represents the most central part of a data set, and you're going to use either the mean, the median, or the mode in order to find that.
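To compare the three measures on the two collections from the mode example, here is a minimal Python sketch using the standard `statistics` module. The labels `balanced` and `skewed` are names chosen for illustration, not terms from the guide.

```python
from statistics import mean, median, mode

balanced = [1, 2, 2, 3, 4, 5]
skewed = [1, 2, 2, 3, 100, 200]

for name, values in (("balanced", balanced), ("skewed", skewed)):
    # Print the mode, the median, and the mean rounded to two decimals.
    print(name, mode(values), median(values), round(mean(values), 2))

# balanced 2 2.5 2.83
# skewed 2 2.5 51.33
```

The mode stays at 2 in both collections even though the second one contains much larger values, which is why it is worth looking at more than one measure before settling on a single value to describe a data set.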