Overview of kNN Algorithm
In this guide, we're going to dive into the kNN machine learning algorithm. kNN stands for k-nearest neighbors, and by the end of this guide that name should make a lot of sense to you, because "nearest neighbors" is actually a summary of how every component of the algorithm works.

The primary usage of this algorithm is data classification.

Now, one of the first questions you should be asking is: if we're already in the supervised learning classification part of this course, why say that the primary usage is data classification? That seems like it's simply duplicating the description of the section we're already in. The answer is that the kNN algorithm is very good at this one task. It can take in a set of inputs and, from that set, generate a specific class, so data classification is exactly the specialty of this algorithm.

And as you're going to see, this algorithm is very good at analyzing all of the components in a data set and then telling you exactly where a new element, some new piece of data, can be placed inside the system. So let's dive into the formal definition, and please don't get intimidated by this definition from Kevin Zakka, because it may sound a little over the top.


What he said is that kNN is "non-parametric, instance-based, and used in a supervised learning setting." Unless you're coming from a machine learning background, half of that probably doesn't make any sense to you. We've already talked about supervised learning, so we don't have to cover that part: we give the program labeled examples because we want it to tell us what something is. But the first part of that definition, non-parametric and instance-based, needs some unpacking.

So let's dive into what each one of those means.

Non-parametric means the algorithm is not required to assume the data follows any particular distribution, such as a normal distribution.


Many algorithms do rely on distribution assumptions. They bring in elements such as standard deviation, or really any kind of formal statistical formula, to measure how alike or unlike the elements in a data set are; that is a very common process in machine learning algorithms. When something is described as non-parametric, it means it doesn't have that requirement.

So instead, kNN forces you, the machine learning developer, to supply that structure. You need to choose a value called k (that's where the k in kNN comes from) and a distance measure that determines how the elements are spread out relative to one another. If that doesn't make sense yet, don't worry unless you've already worked with the kNN algorithm before; it will once we work through the examples.

Just understand that it means we're not relying on a formal statistical assumption; we supply the measure of distance between our data points ourselves.
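Since kNN leaves the distance measure up to us, a common choice is plain Euclidean distance. Here is a minimal sketch in Python; the two-feature applicant tuples are made-up values purely for illustration:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical applicants described as (credit_score, income_in_thousands)
d = euclidean_distance((650, 50), (700, 60))
print(round(d, 2))  # → 50.99
```

Euclidean distance is just one option; Manhattan distance or a domain-specific measure works too, and the choice directly changes which neighbors count as "nearest."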

Moving on to the next part of the definition: instance-based.


What exactly does that mean? Well, in machine learning we have a couple of different types of algorithms we can work with. Some algorithms create a model of the world they know: they take all of the historical data, or at least a sampling that we provide, and build a model off of it. Then every time we send in a new data point, we say, "I want you to predict how to classify this new piece of data."

It will then take that in and give you its probability. kNN does not work like that. It does not build a model ahead of time; instead, every time you give it a new data point, it goes through all of the data we've provided and dynamically generates the prediction. Any time you hear that an algorithm is instance-based, that's what it means: each prediction is generated from the stored instances at the moment you ask for it.
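To make "instance-based" concrete, here is a hedged sketch of a lazy predictor in Python. Nothing is trained up front, and every call rescans the stored examples; the tiny data set and labels are invented for illustration:

```python
def knn_predict(training_data, new_point, k):
    """Instance-based ("lazy") prediction: no model is built ahead of
    time. Each call measures the distance from the new point to every
    stored example, which is why cost grows with the data set size."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Sort every stored (features, label) pair by distance, keep the k nearest
    nearest = sorted(training_data, key=lambda ex: dist(ex[0], new_point))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)  # majority vote

examples = [((1, 1), "approve"), ((2, 2), "approve"), ((9, 9), "reject")]
print(knn_predict(examples, (1.5, 1.5), k=3))  # → approve
```

Notice there is no `fit` step at all; the "training data" is just a list that gets scanned again on every call.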

Now let's dive into some of the popular use cases for the kNN algorithm. Some of the very popular ways it can be used are banking systems, weather prediction, and sports analysis.


Now, that's not the full list of ways you can use kNN; there are nearly infinite ways of implementing this type of algorithm. However, these are three of its most popular use cases.

Now, in discussing the pros and cons, let's first look at a few of the reasons why you may want to use the kNN algorithm.


First, out of all of the machine learning algorithms you're going to work with, it is probably the most straightforward to learn. It also follows a clearly defined set of patterns, and once we get into the case studies and examples, you're going to see how this works.

Whenever I'm working with a new student, kNN is one of the first algorithms I show them, because it's very easy to visualize and it shows exactly how you can take one set of inputs, run the data through a process, and generate an intelligent result. Now, as great as kNN is, let's also look at some of the reasons why you may not want to use this algorithm.


First, it has pretty poor performance, and the reason relates to something we already discussed: it is an instance-based algorithm, which means it doesn't create a stored model of the world. A neural network, for example, a much more advanced type of artificial intelligence, does create a model of the world; whenever you pass it a new piece of data, it runs that data through the model it has built over weeks, months, or even years and generates a recommendation from it. In some circumstances that works well.

However, kNN does not look at the historical data until you call it. Because of that, it can perform poorly if you give it too many examples at one time: it attempts to hold them all in memory and rerun the entire comparison every single time you call the algorithm.

Another potential con is that the numerical value for k is selected somewhat arbitrarily. If you've never worked with the kNN algorithm, don't worry; we're going to get into what k represents. What it means is that there is some level of guidance that you, as the developer, have to give the algorithm, and if you pick the wrong value for k, it can lead to very inaccurate results.

The next con on the list is that it requires a high level of domain knowledge, and this may or may not be a con depending on your own specialty. What matters is whether you're building an algorithm for a field you know a lot about.

For example, if you're a baseball statistician working for a baseball organization and you want to build out a machine learning system with kNN, you're perfectly fine. However, if you get hired to build this kind of system for an industry you know nothing about, you'll need quite a bit of guidance from someone who does know the industry; without it, you won't be able to properly classify the elements in the data set.

Hopefully, this will make more sense once we walk through the examples and see exactly how kNN works; just understand that a high level of domain knowledge is required to properly implement this type of system. Last on my list of cons is that kNN requires a large amount of data in order to be accurate. If you remember our walkthrough of the naive Bayes algorithm, naive Bayes works very well on a small or large amount of data because it is very good at generating probabilities, whether you pass it five examples or 5 million.

kNN, as you're about to see, requires more data in order to build out a proper system and generate the predictions we're looking for. So far we've taken a high-level overview, discussed the pros and cons, and walked through the definition and use cases for kNN.

Now let's walk through our case study. For this example, we're going to see how we could potentially build out a loan approval engine.


This is going to be a pretty common use case for an algorithm like kNN. So imagine that you got hired by lendingtree.com


and they're asking you to build out a machine learning algorithm that can take in all of a potential borrower's parameters. This could be everything from their name, their credit score, their annual income, and their current monthly obligations, the types of data points you'd expect someone who needs a loan to provide.

So in this example, we have a name, a credit score of 650, an annual income, and whatever they typically have to pay on a monthly basis.


So what our system can do is take in all of those data parameters, and the goal is that it will generate either a loan approval or a rejection, automatically.


And so when it comes to loans, our goal is to look at all of the data parameters and then classify someone as being worthy of a loan or not. One caveat I'll add to this specific case study: there is a U.S.-based rule that if you're building out a system like this, you need to provide a rationale whenever you reject somebody. So make sure you build the system in a way that logs why an applicant may be rejected, because a loan officer is required to provide a rationale for why they cannot provide a loan to someone.

That doesn't have anything to do with machine learning; it's simply a law you need to comply with if you're building out this specific type of system.

Now what goes into the funnel?


As I talked about in the naive Bayes example, our funnel is kind of like a black box that contains all of the different processes we need to implement to build out this type of application. Now, preprocessing can mean a number of different things; in fact, if you decided to go into the machine learning space and focus just on preprocessing, that could be a job in and of itself. What it means is that you take in data and change the values so that they will fit inside the algorithm.

That could be everything from converting strings into integers all the way through handling missing values. Imagine you're downloading or importing a spreadsheet, and you don't always have values for every column in that spreadsheet. Preprocessing could mean that when you have an empty value for income, you don't simply reject the person because they didn't submit one; you might pull in some kind of average and say that, based on a number of other characteristics, this person likely slides into this income value.
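As a hedged sketch of what those two preprocessing steps might look like, here is one way to convert string fields and impute a missing income. The records and field names are invented, not from a real data set:

```python
# Hypothetical applicant records imported from a spreadsheet;
# values arrive as strings, and income is sometimes missing.
applicants = [
    {"credit_score": "650", "income": "52000"},
    {"credit_score": "720", "income": ""},
    {"credit_score": "580", "income": "34000"},
]

# Step 1: convert strings to numbers, treating empty cells as missing
for a in applicants:
    a["credit_score"] = int(a["credit_score"])
    a["income"] = int(a["income"]) if a["income"] else None

# Step 2: fill missing incomes with the average of the known ones,
# instead of rejecting the applicant outright
known = [a["income"] for a in applicants if a["income"] is not None]
average_income = sum(known) / len(known)
for a in applicants:
    if a["income"] is None:
        a["income"] = average_income

print(applicants[1]["income"])  # → 43000.0
```

A mean is only one simple imputation strategy; in practice you might use a median or an estimate based on the applicant's other characteristics, as described above.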

After we've performed our preprocessing, we need to convert the data to a vector. If that doesn't make sense, just think of it as a process where we take each data element and generate a kind of score we can use internally. For a loan applicant, this may be a combination of their credit score, how much they're paying per month, and all kinds of other data elements like that.

We're simply going to combine all of that data into a single score we can use to compare our new loan application with all of the historical loans we've given out. From there, we run through the training data, and lastly we base our prediction on the most similar data points. If you're wondering, this is where the "nearest neighbors" naming convention comes into play.
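Here is one way that vector conversion might look. The scaling ranges (an 850-point credit ceiling plus assumed income and obligation caps) are illustrative assumptions, not prescribed values; the point is that each applicant collapses into one comparable tuple of numbers:

```python
def to_vector(applicant):
    """Combine an applicant's numeric fields into one feature vector.
    Each field is scaled to roughly the 0-1 range so that annual income
    (tens of thousands) doesn't drown out credit score (hundreds) when
    we later measure distances between applicants."""
    return (
        applicant["credit_score"] / 850,           # FICO-style scores top out near 850
        applicant["annual_income"] / 200_000,      # assumed income ceiling
        applicant["monthly_obligations"] / 5_000,  # assumed obligations ceiling
    )

new_applicant = {"credit_score": 650, "annual_income": 52_000, "monthly_obligations": 1_200}
print(to_vector(new_applicant))  # roughly (0.76, 0.26, 0.24)
```

Without some kind of scaling, whichever field has the largest raw numbers would dominate every distance calculation.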

So here we have our training data.


Our green thumbs-up are all of the loans that worked out well. If you're a bank, these are the individuals who took out a loan and paid it on time; they haven't foreclosed or anything like that. All of our red thumbs-down are individuals who have not paid their loan: they might have foreclosed, been late on payments, or gone through a short sale, all those kinds of parameters.

Now we have that historical data in the database to run through, and what we do next is say: okay, our new user, the loan applicant we already looked at, fits in this quadrant of the system.


Now we're finally going to talk about what the letter k represents. In kNN, k is simply the number of neighbors the algorithm considers: how many of the most similar historical data points it pulls in before making its decision.

And so the way kNN works is that you provide some arbitrary k. This ties back to both the pro and con lists for why you might or might not want to use the kNN algorithm: there is a level of arbitrariness to the system. In this example, we've chosen a value of k, which means our algorithm is going to go through all of our historical data points, combing through the database and comparing each one based on the score we talked about.

So it's going to take in all the different parameters and say: these are the five users closest to our new loan application. kNN is going to bring in those five, which you can see inside this circle, and ask, "Who are the closest five applicants?" In other words, who is closest in regard to their credit score, how much money they make per year, and their monthly expenses? Those are just some examples; you could imagine even more. We find which historical applicants are closest, and based on their outcomes, that is how we generate our recommendation.

So in this specific example, we had four green thumbs-up, meaning four of the closest past loan applicants turned out to be good. Because four out of the five closest were good, we're going to say that our new loan applicant should get approved.
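Putting the pieces together, here is a hedged end-to-end sketch of that vote. The history entries are made-up scaled vectors, with "paid" and "defaulted" outcomes standing in for the green and red thumbs:

```python
# Toy loan history: (scaled feature vector, outcome)
history = [
    ((0.80, 0.40, 0.20), "paid"),
    ((0.78, 0.35, 0.25), "paid"),
    ((0.75, 0.30, 0.22), "paid"),
    ((0.82, 0.45, 0.18), "paid"),
    ((0.50, 0.10, 0.60), "defaulted"),
    ((0.45, 0.12, 0.55), "defaulted"),
]

def classify(history, applicant, k):
    """Majority vote among the k nearest historical loans."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda ex: dist(ex[0], applicant))[:k]
    votes = [outcome for _, outcome in nearest]
    return "approve" if votes.count("paid") > votes.count("defaulted") else "reject"

# Our new applicant sits near the cluster of good loans: 4 of the
# 5 nearest neighbors paid on time, so the vote comes back approve.
print(classify(history, (0.76, 0.38, 0.21), k=5))  # → approve
```

A different k, or an applicant vector that sits among the defaulted loans, flips the vote to reject, which mirrors the k = 3 scenario that follows.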


Now, if we change up the data a little bit and say that the loan applicant actually fits inside this other set of data points; as you may have noticed, I also changed k to be three instead of five. That means the algorithm is going to analyze all the data points and bring back the three closest to this new loan applicant.


It pulls in these three, and as you can see from the visual, two of the three are red thumbs-down, which means our new loan applicant should not get approved: they are closest in proximity to other loan applicants who did not fulfill their obligations. So we're going to say, "I'm sorry, but you are rejected," because they fit inside this circle, whether that's right or not.


This is a machine learning algorithm that involves some arbitrary choices, so it doesn't mean it's always going to be right; it's simply going to fit within the parameters we provide. In this example, we provided data parameters such as credit score and income, and the algorithm classified this applicant as looking like they were going to be in the red category, the type of individual who was not going to be able to pay their loan, so we reject them.

A very important aside I want to bring up: whenever you're building out a system like this, it doesn't mean you're going to be right 100% of the time. This particular user could very well have paid all of their obligations, their loan, whatever their monthly payment was. There's no way to have 100% accuracy in this type of system. The entire field of probability is based on creating the best guess possible based, in this case, on historical evidence.

So all we're doing is taking all of our good loans and all of our bad loans and creating a grouping mechanism, where we try to predict as well as we possibly can, based on the data, who is going to pay their loan and who isn't.

So, in summary, this is the high-level overview and case study for working with the kNN algorithm.