Overview of Support Vector Machines
The next stop on our journey through classification algorithms in machine learning is the SVM algorithm, which is an acronym for support vector machines.

If that name sounds weird or intimidating, do not worry: we are going to dive into exactly what each of those elements represents, both in our definition and in our case study.

So what does an SVM do? Well, its primary usage is recognition.

Now that may seem a little over the top, because anything in the world of classification deals with recognition in some form or another. However, SVMs are very good at many different types of recognition, and that is why I've listed it as the primary usage, even though many other classification algorithms could also be lumped into this category.

Now let's dive into the definition.

The definition of an SVM is that it is "a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (which every supervised algorithm needs to have), the algorithm outputs an optimal hyperplane which categorizes new examples."

Wow! That had way too many big words. So let's rip that apart a little bit and dive into what it means. Specifically, the elements we're going to pull out are: discriminative classifier, separating hyperplane, and labeled training data. So what is a discriminative classifier?

Essentially, this means that the algorithm takes all of the training data in as input and can then separate and classify those elements according to their specific feature sets. When we get into the case study this will make even more sense, because the entire point of the SVM algorithm is to separate all of the elements and classify them so we can tell one class from another.

Now let's go into the next definition, which is the separating hyperplane. What this represents is a boundary that separates classes.

Now, this is very different from all of the other algorithms we've looked at, and it's even different from pretty much everything we're going to look at in the future. That is because the way an SVM works is that it creates visual boundaries that separate the classes. A hyperplane gives us a relatively easy-to-understand visual for how the classes are separated.

When we talked about naive Bayes, naive Bayes was very focused on classifying text into one category or another. With an SVM, the algorithm can look at all of the data points you have inside your system, whether they come from a database, text files, or anything like that, and it builds its own set of classifications. As you're going to see, the hyperplane creates boundaries where one type of data gets matched up with its own class and another type of data gets its own class. That is why you would use an SVM: when you want a very clear distinction between one type of data and another.
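
Since this course works toward Python implementations, here is a minimal sketch of that idea using scikit-learn's `SVC` class with a linear kernel. The six data points are invented purely for illustration; the only point is that the fitted hyperplane separates the two clusters.

```python
# Sketch: fitting a linear SVM that separates two classes with a hyperplane.
# The data points are invented purely for illustration.
from sklearn.svm import SVC

# Two features per point; class 0 clusters low, class 1 clusters high.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")  # a linear kernel gives a flat separating hyperplane
clf.fit(X, y)

# New points fall on one side of the hyperplane or the other.
print(clf.predict([[2, 2], [9, 9]]))
```

The prediction for each new point is simply which side of the learned boundary it lands on, which is exactly the "very clear distinction" described above.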

And I know that simply dissecting this term can be pretty challenging to follow if you've never heard it before, so once we get into the case study this will also make more sense. But I want to introduce these keywords now so that when you hear me say them they won't sound foreign.

Now the last element we're going to look at is labeled training data.

Now, thankfully, this is one of the easier elements to understand out of everything we've discussed so far, but I wanted to bring it up because labeled training data is one of the key prerequisites for the supervised learning family of algorithms. The main difference you're going to see between a supervised algorithm and an unsupervised algorithm is that supervised systems use labeled training data, which means that we know the names of the data classes.

So if you're using an unsupervised algorithm, which we will get into later in the course, we don't know what kind of data we're working with, whereas with any algorithm in the supervised learning category we actually know what the data represents. It's similar to working with an Excel spreadsheet where you have columns describing each one of the data points, so you can categorize the data that way. Any supervised learning algorithm falls into that category.
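
To make the distinction concrete, here is a tiny sketch of what labeled training data looks like in code. The feature rows and class names are invented purely for illustration.

```python
# Sketch of what "labeled training data" means in practice.
# The feature rows and class names below are invented for illustration.
features = [
    [0, 1, 1, 0],   # e.g. binary pixel data for one handwritten character
    [1, 0, 0, 1],   # pixel data for another character
]
labels = ["nine", "one"]  # known class names: this is what makes the data "labeled"

# A supervised algorithm trains on both pieces together; an unsupervised
# algorithm would receive only `features`, with no `labels` at all.
labeled_data = list(zip(features, labels))
print(labeled_data[0])
```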

So now that we've looked at the definition and ripped apart a few of its terms, let's take a look at the use cases. When would you want to use an SVM? Well, SVMs are very good at image recognition, they're fantastic at handwriting recognition, they're very good at stock market prediction, and they're great for medical diagnosis tools.

And so there are a number of other situations where an SVM works well; however, these should give you an idea of when you'd want to use one. In fact, our case study is going to deal with handwriting recognition. So now that we've talked about a few of the common use cases, what are some of the pros and cons of a support vector machine? Well, first some of the pros.

There is a very clear separation of classes. This is a great tool whenever you have a set of data points and want a very clear delineation between one type of data and another. Once we get into the visuals I think this will be very clear; I don't think any other machine learning algorithm gives as clear a distinction between classes as an SVM.

Another great rationale for using an SVM is that the predictions from this type of shallow learning algorithm can sometimes rival those of neural networks. That may not seem like a big deal, but if you work in industry and have started to build out these types of algorithm implementations, you will know that an SVM can run in a few seconds where a neural network may take minutes, days, or even weeks to build out.

And so there are many times when a support vector machine can be a fantastic choice: even though it's much simpler than a neural network, it has some pretty impressive performance statistics. It also has a number of advanced options. We're going to talk about those a little as we get through the case study; just know that the case study and examples we're going to discuss don't even touch the full breadth of everything an SVM offers. But I will give you an idea of what that looks like, and much later in the course, when we get into the code implementations, you're going to see all of the cool things you can do with an SVM.

Another great thing about support vector machines is that they work well with small sets of data. Very much like the naive Bayes algorithm, an SVM can take in a very limited data set and still give you a great recommendation. So far so good; those are the main reasons why you'd want to use an SVM.

Now let's look at a few of the cons.

One of the first is that an SVM can be a little challenging to implement. Obviously we're not going to get into that too soon, but I will give you a preview of why. In addition, it usually requires quite a bit of data preprocessing, which means that whenever you're implementing a support vector machine you need a pretty good set of quality data. Compare that with an algorithm such as naive Bayes, where you can throw nearly any kind of data at it: with naive Bayes, you could throw a million different HTML pages at it, and it could tear that data apart and give you a pretty good prediction on the category of the content. A support vector machine is much different.

With an SVM you need to be much more careful and focused with how you parse your data, so that you're giving it something it understands how to work with. That is something to keep in mind: you're probably going to have to spend a decent amount of time making sure your data is organized properly before you even pass it to the algorithm.
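
As a sketch of what that preprocessing step can look like in scikit-learn, one common issue is feature scale: SVMs rely on distances, so a feature measured in the thousands will drown out one measured in single digits unless the data is standardized first. The data values below are invented for illustration.

```python
# Sketch: bundling the preprocessing an SVM typically needs into a pipeline.
# The data values are invented for illustration.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One feature in single digits, one in the thousands. Unscaled, the large
# feature would dominate the distance calculations the SVM relies on.
X = [[1, 1000], [2, 1100], [8, 9000], [9, 9500]]
y = [0, 0, 1, 1]

# StandardScaler rescales each feature to zero mean and unit variance
# before the data ever reaches the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X, y)
print(model.predict([[1.5, 1050], [8.5, 9200]]))
```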

And lastly on the con list, and this may look a little confusing, there are multiple kernel options. So what in the world is a kernel option? Well, that is a great question, and it is one of the most important components in understanding a real-world support vector machine implementation. We're going to get into what it actually looks like, but to give you a high-level preview: a kernel option, also called a kernel trick, gives you a different way to look at the data.

In fact, it gives you several different ways. If you want to think of it in dimensional terms, imagine that you're looking at data in a linear sense, where it's just a flat set of values with an X and a Y axis. That's one way of looking at data, and many machine learning algorithms do just that.

Well, with a support vector machine you actually have the ability to add dimensions, so you can look at the data from a 3D, 4D, or even higher-dimensional perspective. As great as that is, it also adds complexity, so we're going to see what that looks like, and then you can decide whether it fits your scenario.

So far we've taken a very high-level overview of the algorithm, and we've talked about the pros, cons, and use cases. Now let's dive into our case study, and we have a pretty fun one. In this guide we're going to talk about a handwriting recognition engine and how a support vector machine can be utilized to build it.

So, as you may or may not know, whenever you write a letter and use your handwriting to write out the address, you are not writing for a human to read it. Instead, the U.S. Postal Service has a set of algorithms and robots that read exactly what you've written. They interpret all of this data using tools such as support vector machines in order to understand, as in this example right here, exactly the address you wrote down, to make sure it gets sent to the right recipient. So how does this work exactly? We're not going to look at every element inside the address. Instead, I simply want to take one specific element, because that is the way the process works: it's not going to look at the entire address and build out a digital version of it all at one time.

Instead, it's going to dissect the address, look at every element individually, and then transfer it. So, looking at our trusty machine learning funnel, it is going to take each one of the items that you wrote down. Let's take, for example, our handwritten 9. This is going to be passed through our funnel, which in this case is the support vector machine, and the output is going to be a digital 9: a 9 that the Postal Service's computer system actually understands.

So what exactly is going on in this funnel? Well, if we look inside of it, we're going to see that there are a number of processes. The first thing it does is convert that 9 into pixels, and we're going to look at what that actually means. Essentially, it creates a picture of that 9, and then it creates a number of very small boxes so that it can tell whether or not there is handwriting in each box; based on those values, it is able to generate its recommendation. That is where the data for any type of image recognition system resides, so that's the first process.

After that, it's going to take all of the historical data and separate it into classes. It's going to say: these are all of the 1's, here are all the 9's, here are all the 5's, here are all the letter A's. It's going to put those into classes so that we can create this classification, and it will know where every element fits in our system. That's the second step.

The third step is to compare the knowledge base, all that historical data, with our new input. So it's going to take all the pixel data, and as you can see when we zoom into that 9, every single element gets turned into a little tiny square. In this case I'm using color just to make it clearer, but the way the system treats it, each one of these little boxes is represented by a 1 or a 0. This is going very low level: we're talking about taking this 9 and turning it into binary pixel data. So the system analyzes the entire number and says: okay, in block 1 there is nothing, so this is a 0, and block 2 is the same, and it goes all the way through. Every time it recognizes that there is some handwriting, it throws a little flag and says: you know what, this is a 1, meaning there is handwriting in this section.

What that does is create a digital representation of the number, and that gives us real data we can use to see how this input, whether it's a letter or a number or anything else someone wrote down, compares to all of the other letters and numbers we have in our system. That comparison is how we're going to generate our recommendation for what this number is.

So let's put this on a graph, because using our graph data we can implement our hyperplanes. Like I told you, we're going to see a visual format for how hyperplanes work, and this is the easiest way to understand how the entire process operates. So we're going to separate the data with hyperplanes, just like this.

Now, I'm not showing you the finished product. What a hyperplane does is separate one class of data from another. Say that you have all of these data points for all of the ones in the database. The U.S. Postal Service, every time they get a letter, keeps track of what a 1 looks like and what a 9 looks like, and they have all of this pixel data in their database. Well, there's a clear separation between the ones and the nines.

But what we have to do with our support vector machine is figure out where that separation occurs. Each one of these lines, the blue, the purple, and the orange-ish yellow, represents a potential hyperplane for our support vector machine. However, not all hyperplanes are created equal, and we're going to see that as we try to pick out the best one. So let's look at the blue one first.

If we choose this blue line as the hyperplane, we're saying this is what separates the ones from the nines. This may not be the best choice. Notice our question mark right here: it represents a potential number we don't know. It's simply binary data that came in, and we need our system to estimate whether it's a 1 or a 9. Now imagine that the hyperplane isn't there and you're simply looking at the data.

This looks like a one; however, because we chose our hyperplane to be where it's at, it would be categorized as a 9.

Now, this is probably not the right call, because as we can clearly see, the data suggests this question mark should be in the one category. But if we draw our line where the blue line is, that is probably not accurate. So what if we look at that yellowish line?

Well, right here we can see another example where we're placing a recommendation inside the nine category when it's pretty clear it should be in the list of ones. This is also probably not the best choice.

But now let's look at that third example, where we have a purple hyperplane with the line drawn exactly where we have it right now.

When we place this question mark here, we can see that it falls inside the correct classification, and if we move it down, we can see that it still works.

So this is, at an incredibly high level, how a support vector machine works.

Essentially, what we're doing is trying to classify data by segmenting it with as wide a space as possible. The language I typically like to use is that the widest lane wins.

Now, lane is not the formal term. Instead, the formal term is actually margin.

And so if we look at our lanes, or margins, for all three of our potential hyperplanes, you can see right here that the purple lane is much wider. This margin separates our 1 and 9 classes much better than the blue or yellow lines, and the reason is that the margin is widest here. That's also why, when we put in our test data points, the blue line made some poor recommendations, and the same thing happened on the orange side.

But as you can see, if we use the purple line as our hyperplane, it is going to give us our best segmentation and our best classification.
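
For a linear SVM, the margin width actually has a closed form: if w is the weight vector of the fitted hyperplane, the margin is 2 / ||w||, and "widest lane wins" means the training procedure minimizes ||w||. Here is a sketch of measuring that width with scikit-learn; the data points are invented for illustration, and the large C value is there to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch: measuring the margin of a fitted linear SVM. For a linear SVM,
# the margin width is 2 / ||w||, where w is the hyperplane's weight vector.
# The data points are invented for illustration.
X = [[1, 1], [2, 1], [1, 2], [7, 7], [8, 7], [7, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1000)  # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector of the hyperplane
margin = 2 / np.linalg.norm(w)      # total width of the "lane"
print(round(margin, 3))
```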

Another formal term inside of a support vector machine is the distance. We say that we have a distance one (D1) and a distance two (D2). Distance one is everything on one side of the hyperplane, out to the first vector we find, and distance two is everything on the other side. So right here, as you can see, distance one goes from the hyperplane all the way to our first 1, and distance two goes from the hyperplane all the way to the very first 9 that it found.

Now, if you're wondering where the support vector name comes in: a support vector is exactly the first element that our hyperplane finds when it's trying to build that margin. That's why they're called support vectors, and I think it's actually a nice name, relatively straightforward to remember if you think about support vectors as supporting the margin. That's how I personally remember it: the support vector is the very first element that the margin hits, and since I think of the margin as a lane, I think of a support vector as supporting that lane.
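
Fitted scikit-learn models expose these boundary points directly through the `support_vectors_` attribute, which makes the idea easy to check: only the points the margin rests on show up, while points deep inside each cluster do not. The data below is invented for illustration.

```python
from sklearn.svm import SVC

# Sketch: inspecting which training points became support vectors.
# Data invented for illustration: the points closest to the boundary
# are the ones the margin "rests" on.
X = [[1, 1], [1, 2], [2, 1], [7, 7], [7, 8], [8, 7]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1000)
clf.fit(X, y)

# Points like [1, 1], which sit deep inside their cluster, do not appear
# here; only the margin-defining points do.
print(clf.support_vectors_.tolist())
```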

So now that we've talked about that let's give a very brief summary on the key terms.

We've talked about hyperplanes, which are the actual boundary between classes. We've talked about margins, which are the space between a hyperplane and the first vector, meaning the first data point. The distance is each side of the margin, so every support vector machine should have at least two distances, one on each side of the hyperplane. And lastly we have our support vectors: the first data elements that give a very clear boundary for where one class ends and another one starts. So that's the basic usage.

Now let's get into the advanced use of a support vector machine.

First and foremost, we have multiple dimensions. You may have noticed when we looked at our ones and nines that they are a very small segment of all the data points a system such as a handwriting recognition engine would need to process. It is possible for a support vector machine to have multiple dimensions, so we can look at ones, nines, letters, and all of those kinds of data elements, and we're going to take a visual look at what that represents. We also have C parameters. As much as I would love to tell you that every support vector machine you build is going to be perfect, with a very clear set of classes, that is not the case: there is going to be some overflow.

In the example that we gave, there are going to be some ones that fall into the nines category and some nines that fall into the ones category, and that is simply how the real world works: certain people write their ones very similarly to their nines, and there's no way of having an absolutely perfect system. The C parameter allows us to smooth that over, so we can ignore those types of data points. We also have multiple classes; this is similar to multiple dimensions, but you're going to see that it gives us the ability to have multiple classes of data points and analyze them all at one time. And then lastly we have the kernel trick.
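
Here is a sketch of the C parameter in action with scikit-learn. The data is invented for illustration, with one deliberate "one written like a nine": an outlier labeled class 0 sitting inside the class-1 cluster. A small C keeps the margin soft, so the boundary stays between the clusters instead of contorting around the outlier.

```python
from sklearn.svm import SVC

# Sketch: the C parameter controls how much overlap (a "one written like a
# nine") the margin tolerates. Data invented for illustration, with one
# deliberate outlier from class 0 sitting inside the class-1 cluster.
X = [[1, 1], [2, 1], [1, 2], [8, 8], [8, 9], [9, 8], [8.5, 8.5]]
y = [0, 0, 0, 1, 1, 1, 0]  # the last point is the mislabeled-looking outlier

# A small C gives a soft margin that effectively ignores the outlier,
# keeping the boundary between the two real clusters.
soft = SVC(kernel="linear", C=0.1).fit(X, y)
print(soft.predict([[9, 9]]))
```

With a very large C instead, the optimizer would pay a heavy price for misclassifying the outlier, which can drag the boundary into a much less sensible position.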

Now, the kernel trick is something I already previewed earlier in this guide. I'm going to give you some cool visuals so you can understand exactly what it represents, and it's very closely aligned with multiple dimensions.

So, as you can see right here, what we have is another graph, but this one is much different, and like I mentioned, it's related to the kernel trick. We don't always have a really nice, clear linear line; instead, we also have situations where you need to draw a line that pulls off of multiple dimensions. Because of that, a support vector machine is popular for more than just a straight, flat graph. It also works on a 3D graph, kind of like we have right here, where you can create a visual that takes in all kinds of different parameters and then builds its recommendation off of that.

As you can see in this example, that gray line is very similar to the purple hyperplane line we drew earlier, and by leveraging multiple dimensions we're able to slide our hyperplane across all kinds of different parameters, so that it gives a clear distinction not just between a set of ones and a set of nines, but across all kinds of different parameters. And that is a perfect lead-in to the next advanced use, which is multiple classes.

You will sometimes see a support vector machine perform more than binary classification. When you hear me say binary classification, and I've said it a few times in this course, what that means is that we have one answer or one other answer. That would make life very easy for us, but it is not always the way it works. Instead, there are going to be times when you need to perform multi-class classification, and this is where a support vector machine is a great tool, because it can do it.

So right here, as you can see, we have hyperplanes that segment our nines, our ones, and our fives. And last on the list of advanced usage is our kernel trick.
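
In scikit-learn this multi-class behavior comes for free: `SVC` combines pairwise binary SVMs internally when it sees more than two labels. Here is a sketch with three invented clusters standing in for ones, fives, and nines.

```python
from sklearn.svm import SVC

# Sketch: SVC handles more than binary classification out of the box by
# combining pairwise (one-vs-one) binary SVMs internally. The three
# clusters below are invented stand-ins for ones, fives, and nines.
X = [[1, 1], [1, 2], [5, 5], [5, 6], [9, 9], [9, 8]]
y = ["one", "one", "five", "five", "nine", "nine"]

clf = SVC(kernel="linear")
clf.fit(X, y)

# One new point near each cluster gets the matching class name.
print(clf.predict([[1.5, 1.5], [5.5, 5.5], [8.5, 8.5]]))
```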

Now, if you think that our multiple dimensions description and our kernel trick sound similar, they are. It really comes down to creating a different perspective on the data. On the left-hand side, imagine that you have all of these data points: you can see we have all these little red dots, and inside of those red dots we have all these blue dots. You can't really separate them with a straight line, and our hyperplane can't be a circle. But with the kernel trick, the data is converted into what you see on the right-hand side, where the red dots and the blue dots have been elevated into an added dimension, and now you can see there is a very clear hyperplane separating them.

If you have a system with multiple data points and multiple dimensions, the kernel trick is a very popular way to create a hyperplane that produces those kinds of classes. And thankfully, whenever you're using one of the popular machine learning libraries, whether in Python or R or the language of your choice, when you call the support vector machine class and pass your data to it, you can simply pass in the name of the kernel you want to use.
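
As a sketch of exactly that, scikit-learn's `make_circles` helper generates the circle-in-a-circle layout described above, and the kernel is just a string argument to `SVC`. A straight hyperplane can't do much better than guessing on this data, while the RBF kernel separates the rings almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Sketch: the circle-in-a-circle layout where no straight hyperplane works.
# make_circles generates one ring of points inside another.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # a straight hyperplane
rbf = SVC(kernel="rbf").fit(X, y)        # the kernel trick

# Training accuracy: the linear SVM hovers near chance on this layout,
# while the RBF kernel fits the rings almost perfectly.
print(round(linear.score(X, y), 2), round(rbf.score(X, y), 2))
```

Swapping in `kernel="poly"` or another option is the same one-argument change, which is what the guide means by simply passing in the kernel name you want to use.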

And when we get into the code portion of the course, I'll show you exactly how to do that, because each one of the kernels functions a little differently. The purpose of this guide is to give you a very high-level view, so you understand that a support vector machine focuses on classification, on recognition, and on creating a very clear boundary between different classes of data.