Overview of Naive Bayes Algorithm
I'm excited to finally get into the algorithm so we can see how machine learning can allow you to build some pretty amazing and intelligent behavior into your own programs.

We're going to start off this section with one of my favorite algorithms which is the Naive Bayes algorithm.


And just to reiterate, Naive Bayes fits inside of supervised learning, and it also fits inside the set of classification algorithms. If you remember back to our introduction, classification allows us to answer the question of "What is it?" And one of the most popular uses for Naive Bayes is text classification.


And you're going to be able to see why, especially when we go through our use case example. I'm also going to discuss a number of other scenarios where Naive Bayes works quite nicely.

Now before we get into our case study, let's sit back and look through the definition. Also, just so you're aware, the pattern we're going to follow with Naive Bayes is pretty much the exact same learning pattern we're going to take with every one of the algorithms: first a high-level definition of when it's usually used; then the formal definition (and if the formal definition is too complex or might be a little bit confusing, I'll break it down into a few pieces); then the pros and cons of the algorithm; and then the case study. That way you have a sense of how we're going to walk through each one of these algorithms.

So with all that being said, let's talk about the definition of the Naive Bayes algorithm. Techopedia defines it as: "A naive Bayes classifier is an algorithm that uses Bayes theorem to classify objects. Naive Bayes classifiers assume strong or naive independence between attributes of data points."


And don't worry if that definition seems a little bit fuzzy. The entire reason I've structured this course the way I have is because of something I discovered when I was personally learning about machine learning: I hated these types of definitions, and they can get even more complex than this one right here. What I found most helpful was going through examples.

Before we even do that, let's dissect this definition just a little bit. Where it says "a naive Bayes classifier," it's talking about the algorithm itself; that's what we're going to be working through. And it even says it's an algorithm. But what does it do? It uses Bayes theorem. Now if you do not come from a stats or mathematical background, you may have never heard of Bayes theorem. It is hundreds of years old, and it is a mathematical formula that combines a set of probabilities. So essentially, without getting into the math (because I promised you in this course we're not going to go into the math), the high-level overview is that in Bayes theorem you combine a number of elements and their probabilities, and this is going to make much more sense once we get into the case study.

But essentially that's what it's talking about. It's saying that naive Bayes implements Bayes theorem, and Bayes theorem is completely focused on probabilities. So what does it mean when it says that it is naive? Why does it have that name? That's one of the most important things to know in order to understand how it works and why it's so fast, and when we get into the example you're going to see it with much more clarity.
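Even though we're staying out of the heavy math, the theorem itself fits on one line. Here is a minimal sketch in Python, where every probability is invented purely for illustration:

```python
# A minimal sketch of Bayes' theorem with made-up numbers:
#   P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.4              # prior: assume 40% of all e-mail is spam
p_word_given_spam = 0.25  # assume "viagra" appears in 25% of spam
p_word = 0.11             # assume "viagra" appears in 11% of all e-mail

# Combine the probabilities to get the probability the e-mail
# is spam, given that it contains the word "viagra".
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 2))  # 0.91
```

With these assumed numbers, a single word pushes the probability of spam from 40% up past 90%; naive Bayes combines many such per-word probabilities at once.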

But just to give you a little bit of a preview, the reason it's called naive is because of what it says in the second sentence: that there is an independence between attributes of data points. Naive Bayes looks through an entire set of data, say an email, a blog post, a book, or anything like that, and instead of trying to draw a full set of connections between everything it's looking at, it takes each one of the variables, so any data that you pass to it, and treats each one as completely disconnected from all of the rest of the data. That's the reason why it's naive, and you're going to see exactly how it's implemented here shortly.

So let's look now at some of the popular use cases.


First and foremost, one of the most popular use cases, and part of the reason naive Bayes has become one of the top machine learning algorithms out there, is search engines, specifically Google. Google is pretty much one of the kings of modern machine learning, and their search engine has a number of variations of the naive Bayes algorithm built directly into it.

So every time you go to Google and start typing something in, you are being passed through a naive Bayes classifier. That's part of the reason I really wanted to start off with this algorithm: it's something that you're actually using on a daily basis, even if you weren't aware of it before. And so I thought it would be a cool way to introduce the full set of algorithms, because this is something that you're using, and now you're going to be able to see how it works. Another very popular use case is a recommendation engine. So whenever you're looking at Amazon and it's giving you a set of recommended products, it could very well be using naive Bayes for that.

And part of the reason is because naive Bayes, out of all the machine learning algorithms that we're going to go through, is probably one of the best from a performance perspective. So if some feature of your system needs to happen almost automatically and very fast, then naive Bayes is a great option to look at. And then third is spam filtering, which is actually the case study we're going to go through in the rest of this guide. It's one of the most popular uses you'll ever see for naive Bayes.

So far we've discussed the primary usage, the definition, and the use cases. Now let's get into the pros and cons of using the naive Bayes algorithm.


First and foremost is something I've already mentioned: it is very fast. Part of the reason is that the mathematical formula used to generate the probabilities is one of the more basic ones out there, and because of that, computers can process those types of probabilities incredibly fast, pretty much at runtime. There are many types of machine learning algorithms that take so long to process that you can't even allow the user to sit there and wait, because they might take minutes or even hours to run.

Naive Bayes can run in a few milliseconds. It is that fast. And so that's one of the very big pros to using it. Also, it is incredibly effective even with small data sets. This is something I will point out as we go through this course: there are many machine learning algorithms that require a very large amount of data in order to work.

For example, when you're working with any type of neural network, you usually need hundreds of thousands, or even millions or hundreds of millions, of data points for the neural network to really work effectively, whereas, as you're going to see from our example, the naive Bayes algorithm can work on a very small data set just as well as a larger one. To give you some background on my experience with that, I've built out a few recommendation engines that utilize the naive Bayes algorithm.

And when I built them out, they worked just as well when I had a sample set of a hundred different data points as when I moved up to having 50,000 data points, and they were able to give the same type of accuracy when it came to the recommendations. And one of my favorite pros is that, out of all of the algorithms, naive Bayes is one of the more straightforward ones to understand, and whenever something, especially an algorithm, is more straightforward to understand, it makes it a little bit easier to implement. So I definitely put that in the pro category.

Now let's move on to some of the cons.


The first is actually in the name of the algorithm, and that is that it is naive, which means that you need to have a significant amount of domain knowledge on whatever type of problem you're attempting to solve. If you don't, then you're going to end up with predictions where naive Bayes is going to work, but it may not work in the way that you originally intended. And so the naive part is very important to keep in mind: you need to be able to structure and implement this in a way so that it functions as well as it needs to. Once we get to the example, this is also going to get some more clarification.

Now one of the other cons is that it assumes that all features are independent. This is connected to the naive part, so these two technically could just be one single con: it looks at every part of your entire data set as if it is not connected at all. That's good in some respects; it's part of the reason why it's so fast.

However, it can also be a negative, because you're going to run into many circumstances where you do have data that's connected, and it's important to look at those connections because the pieces have a relationship to each other. For example, naive Bayes, as great as it is, may not be the best algorithm to implement if you're trying to find relationships between people, such as in a social networking application like Facebook or Twitter, because it may not look at the relationships as well as it should. So we've looked at the pros and cons.

Now let's finally dive into the case study. For naive Bayes, we're going to take a very practical approach, because this is one of the top examples and one of the top implementations of the algorithm: building a spam filter.


And I wanted to go with this example because this is something that you can reach out and touch on a daily basis. So right here I have my Gmail browser open


and I can look and see all of the different e-mails that were categorized as spam. Gmail is phenomenal at this, and they are famous for using the naive Bayes algorithm. So everything that we're going to be talking about is not theoretical. This is exactly how the system is implemented in a product that you use every single day. So let's take a look at a couple of e-mails.

This first one is to us,


it is from an annoying salesperson, and the subject is "Viagra for cheap." And if you look through the body of the content, it is addressed to us, and you can see that it is a very, very nasty little piece of spam that is trying to sell us Viagra for cheap.

Now, this is one of the top case studies that you're going to find anytime you hear about or try to research the naive Bayes algorithm, and it's because it handles these types of emails very, very well. And one of the most important things, when you're looking at these types of case studies, is seeing not only the negative side. So this is a system that needs to be able to classify this email as spam.

However, if you have a legitimate e-mail like this one right here


we need to make sure that this one makes it into the inbox. And so how can we do that? Well, that's where naive Bayes comes in. The way it works is that all of the e-mails that come into your inbox, no matter what kind of e-mail system you use, whether Gmail, Outlook, or some type of proprietary one, go through a spam filtering system; pretty much every type of mail server uses one in some form or another. And if you want a visual


you can see right here we have all these e-mails, and they're going to go through this funnel. The funnel represents the entire algorithm, and the funnel's goal is to output its recommendation.

So, in this case, we should have each of the emails categorized as spam or not spam. Now, one of the common industry conventions in this type of scenario is that the categories are called spam or ham. I'm going to just keep on saying not spam, because I think ham doesn't really make sense. However, I did want to bring that up so that if you go and research any other guides on naive Bayes, where you're going to see the words spam and ham mentioned quite a bit, you'll know what they mean. So that is the visual.

However, that doesn't give us any detail, so let's look at what is inside of this funnel. Right now it's kind of a black box where we have all of these processes occurring. But what I want to do is rip open that box and see exactly what is happening inside of there that builds that recommendation.

Here are a few key tasks.


The first one is a task called tokenization, then the removal of stop words, and from there we implement the concept of stemming. After we do all of those processes, we can build our probability index, and nested inside of that process are our training step and then finally our prediction step. Now if you've never heard of any of those words, do not worry; we're going to take them line item by line item and see exactly what occurs during each of those steps.

Let's start off with tokenization. So I'm pulling up the spam e-mail right here.


So during the tokenization process, which is one of the very first steps that occurs with naive Bayes and is also common in a number of other algorithms, what tokenization allows us to do is take this e-mail right here, take this text content, and rip it apart. So we're going to take every piece of content that we're handed in this e-mail and split it into pieces.


So "Viagra" is one word, "for" is one word, "cheap" is another word, all the way through. And then in our system we can store these as their own independent variables, and that leads us directly into the next step right here, where we're actually counting all of them.


So we can see Viagra occurred twice, cheap occurred twice, for occurred twice, all the way down you can see each one of the words that is in that email. Each one of them gets counted and then we have a data structure that stores that.

Now I don't want you to worry about how that process happens. When we get into the actual implementation, and we see how this can occur in a programming language such as Python, then you'll see exactly how we can store these; for right now I don't even want you to worry about that. I simply want you to realize that during the tokenization process we rip apart the content and then count each one of the words, so each one of them gets grouped and counted.
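As a small preview of that implementation, here is a rough sketch of tokenizing and counting in Python. The e-mail text is invented, and a production tokenizer would handle punctuation and casing more carefully:

```python
from collections import Counter

# Invented sample text standing in for the spam e-mail body.
email = "Viagra for cheap! Buy Viagra for a cheap price and save."

# Tokenization: split the text into lowercase word tokens,
# stripping surrounding punctuation from each piece.
tokens = [word.strip(".,!?").lower() for word in email.split()]

# Counting: group identical tokens and tally how often each appears.
counts = Counter(tokens)

print(counts["viagra"])  # 2
print(counts["cheap"])   # 2
print(counts["for"])     # 2
```

`Counter` is just a dictionary mapping each word to its count, which is exactly the data structure the counting step described above needs.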

That's going to be one of the very first steps so that we can start building out our probability index. And to take a very quick aside, naive Bayes is used quite a bit in the actual Google search engine. So imagine a scenario where Google is trying to figure out what a website is all about; it's trying to see if website x y z should be listed first, or shouldn't even be listed, for some type of search keyword. What it does is the same thing we're doing with this e-mail: it goes through the entire set of content for a web page, and it tokenizes it.

It counts all of those different keywords, and that is how it's able to tell what the main focus of that web page is, just like we're doing right now with this email. The first step is simply trying to figure out the overall subject matter of this e-mail. Now, as helpful as the tokenization process is, it actually introduces a few elements that could be confusing when we're trying to discover that subject matter. For example, Viagra and cheap, those first two keywords right there, are good to know; they're going to give us an idea that someone is trying to sell us cheap Viagra. However, notice how we have words such as "for" listed twice and "and" listed a couple of times.

Those words do not give us any context for what the email is about. And so the next step that is implemented is the removal of stop words. Stop words are words that give us as humans the ability to read the content but don't actually help us discover what the content is about. And so what we're going to do is remove all those stop words, just like this


and in the show notes, I will give you a link to a page that contains all of the known stop words out there.

Notice how we have removed all of the conjunctions and prepositions and words that really don't offer us any context for the subject matter.
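In code, this step is just a filter. Here is a minimal sketch; the stop word set below is deliberately tiny, and a real system would use a much longer published list like the one linked in the show notes:

```python
# A deliberately tiny stop word set for illustration only.
stop_words = {"for", "a", "and", "the", "to", "of", "in", "is"}

# Tokens from the (invented) spam e-mail after tokenization.
tokens = ["viagra", "for", "cheap", "buy", "viagra", "for",
          "a", "cheap", "price", "and", "save"]

# Keep only the words that carry subject-matter context.
content_words = [t for t in tokens if t not in stop_words]
print(content_words)
# ['viagra', 'cheap', 'buy', 'viagra', 'cheap', 'price', 'save']
```

After this pass, every remaining token says something about what the e-mail is actually about.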


So now that we've removed those stop words, we're going to implement stemming. Think about how you understand language: when you're reading something or someone is talking with you, you can understand, for this example,


that discount, discounts, discounted, and discounting are all related. Well, the computer doesn't understand that; we have to tell the computer, and that is what we do when we implement the concept of stemming. What stemming does is take a root keyword, such as discount in this example, and then say that any of these other words, so discounts, discounted, discounting, all get grouped inside of the discount word.

The way the algorithm is going to look at it is that it's simply going to chop off all of the s's, the ed's, and the ing's from words. Stemming is known as a bit of a crude tool: it simply chops off those types of letters, and sometimes it's wrong; sometimes it's going to pull off letters that it shouldn't be pulling off.

However, the tradeoff is speed, and when you're implementing an algorithm like naive Bayes, speed and performance are typically a pretty big priority for you. So stemming does not look through a dictionary to see if discount is related to these other three words. Instead, it simply finds the most common suffixes, like s and ed, and removes them. And so it's going to group all of those together.

That's going to allow us to not have to count each form separately. In other words, say that you had the words discount, discounts, discounted, and discounting all in the same email. Instead of counting those as four separate words, stemming removes all the suffixes like ing or ed and says, you know what, this is all one word. So if you had all four of those contained in a single email, it would say discount is there, and it's there four times, and that gives us a much better idea on average of what the content is.
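The suffix-chopping idea above can be sketched in a few lines. This is a deliberately crude stemmer for illustration; real stemmers, such as the Porter stemmer shipped with NLTK, apply much more careful rules:

```python
# A deliberately crude suffix-chopping stemmer, mirroring the idea above.
# It does NOT consult a dictionary; it just strips common suffixes.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        # The length check avoids mangling very short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["discount", "discounts", "discounted", "discounting"]
stems = [crude_stem(w) for w in words]
print(stems)  # ['discount', 'discount', 'discount', 'discount']
```

All four forms collapse to a single root, so the counting step sees "discount" four times instead of four unrelated words. Being this crude is exactly the speed-versus-accuracy tradeoff described above: it will sometimes strip letters it shouldn't.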

So now that we have performed all those tasks, now that we have performed tokenization, removed stop words, and implemented stemming, we are on to the fun part, which is the training step.


In the training step, what we're going to do is run naive Bayes over the entire database of e-mails that we have. This could be a few thousand, it could be a million; it really depends. Typically, you will try to find your best set of training data so that your system is able to learn from it without having to go through 5 billion records.

So, for example, when Gmail receives an email, Google does not run the naive Bayes algorithm over every email in the entire history of Gmail; that would be billions upon billions of records to go through, and even though naive Bayes is fast, it is not that fast. So what you do is build a very typical type of example set. You build out the set of examples that gives the best predictions, and that requires some trial and error.

So if you're implementing something like this for your own programs, then you're going to test out a number of different training data sets to see what works out best for you. In the training set right here, I have in green all of the good e-mails, the e-mails that should be placed in the inbox, and in red I have the spam e-mails.

In the training step, what we're doing is finding all of those keywords, and that is where we're building our probability index. We're saying, okay, if these kinds of words are present, then it's spam, or at least the probability is going to be higher that it is spam.

Remember that in classification we're trying to answer the question of "What is it?" And right here we're trying to answer the question "Is this spam?" So we run the entire training dataset through the algorithm and then look at the e-mail in question. This could happen every time a new e-mail comes into the inbox.

We're running it through, and we're asking: does this e-mail, with all of its tokenized keywords and all of the different probabilities that naive Bayes has computed, resemble the e-mails that we've historically marked as spam, or does it resemble the ones that we've placed in the inbox?


And so what naive Bayes does is run the new email through that data, and then, based on the probability, it marks it and gives its recommendation: either put it in spam or put it in the inbox. So from a high-level overview, going all the way from the input, when we looked at what that e-mail looked like, through all of the different steps we had to take, that is the way naive Bayes works.
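To tie the training and prediction steps together, here is a toy end-to-end sketch in Python. The training e-mails, the 50/50 prior, and the word choices are all invented for illustration; a real filter would train on thousands of real messages:

```python
from collections import Counter

# Invented training data: already tokenized, stop-worded, and stemmed.
spam = ["viagra cheap discount", "cheap viagra deal", "discount deal now"]
ham  = ["meeting tomorrow morning", "project deadline tomorrow", "lunch meeting"]

# Build the probability index: word counts per category.
spam_counts = Counter(w for email in spam for w in email.split())
ham_counts  = Counter(w for email in ham  for w in email.split())
vocab = set(spam_counts) | set(ham_counts)

def score(counts, total, words):
    # Multiply per-word probabilities (the "naive" independence assumption),
    # with +1 Laplace smoothing so an unseen word doesn't zero everything out.
    p = 1.0
    for w in words:
        p *= (counts[w] + 1) / (total + len(vocab))
    return p

def classify(email):
    words = email.split()
    # Assume a 50/50 prior between spam and not spam for this toy example.
    p_spam = score(spam_counts, sum(spam_counts.values()), words) * 0.5
    p_ham  = score(ham_counts,  sum(ham_counts.values()),  words) * 0.5
    return "spam" if p_spam > p_ham else "not spam"

print(classify("cheap viagra discount"))  # spam
print(classify("meeting tomorrow"))       # not spam
```

Each incoming e-mail is scored against both historical groups, and whichever probability is higher becomes the recommendation, which is exactly the funnel from the diagram.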

Resources