Overview of K-Means Clustering
In this lesson, we're going to walk through a popular unsupervised learning algorithm called K-means clustering. This is going to give us a really nice view of how we can use unsupervised learning when we're building our applications, and I think you're going to be able to see there's a very clear distinction between an unsupervised learning algorithm and a supervised learning algorithm.

So we're going to walk through that in the case study. The primary usage of K-means clustering is to group data into a preset number of clusters.


So any time you have a situation where you have data and you may not know how to organize it, meaning you don't know what the x and y axes are going to be and you don't know how you should group the data, K-means clustering can look at the data and, based on how many clusters you say you want it to have, analyze the data and create its own set of clusters. That's how you're going to be able to organize that data.

Now let's look at the definition. DataScience.com says that K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data, meaning data without defined categories or groups. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.


I really like this definition, and I didn't find any terms in it that needed a deep dive, because it describes the exact process in a very straightforward manner, and it's what I said at the very beginning. The entire goal of K-means clustering is to do exactly what it says: cluster data. And as the end of the definition notes, K represents the number of clusters that you want the system to produce.
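To make the definition concrete, here is a minimal sketch using scikit-learn: we hand the algorithm some unlabeled points and a value for K, and it assigns each point to a cluster. The dataset here is made up purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small illustrative dataset: two loose groups of points, with no labels.
points = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # one natural group
    [8.0, 8.0], [8.5, 8.2], [7.8, 7.9],   # another natural group
])

# K = 2: we tell the algorithm how many clusters we want it to find.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(points)

print(labels)                  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(model.cluster_centers_)  # the two centroids the algorithm found
```

Notice that we never told the algorithm which group each point belongs to; it discovered the grouping on its own, which is exactly the unsupervised part.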

And so let's talk about some of the common use cases for K-means clustering.


Sales data is a great example of when you'd want to do this. Imagine a scenario where you have all of this historical sales data but you don't really know how to label it, and you want to see what makes a good or high-value customer versus a customer that you may not make much money on. K-means will allow you to group your entire customer base and all of your historical data so that you can start to analyze trends. So that's one example.

Another is what we're going to walk through in this case study, which is gauging interests. If you have a social network and you're trying to determine what types of interests to recommend to a user, the way Facebook recommends groups or pages to you, you can use a tool like K-means clustering to analyze all of a user's historical likes and interests, see what else they may be interested in, and group them that way.

Another popular use case for K-means clustering is in the manufacturing space. A common application is to take your work data, analyze how a factory is performing, and then see, by using clusters, which areas of the factory may be performing better than others and which ones need assistance.

Another is image recognition. One little aside I want to point out: you may have noticed that many of the algorithms we've walked through, whether classification, regression, supervised, or unsupervised, list image recognition among their use cases. That's because it is a very common use case across many different algorithms, and some handle it better than others. Testing whether an algorithm can pick out a face, or perform some type of classification on an image, is usually a good test of how accurate it is. That's why you see it come up quite a bit, and K-means is no different: it gives you a very good set of recommendations when you use it for image recognition.

Now, another use case that we haven't seen before is using this algorithm for motion sensors. Imagine a scenario where a security company has cameras all over the place, and they do not want to send out an alert for every single motion that passes in front of a camera. In other words, if a squirrel walks by, you don't want to send someone the same alert as if a truck rolled by. That distinction is very important, and K-means clustering can determine the difference between the two.

So it can take all of the different motion clusters and say, OK, this looks like a small animal, this looks like a huge truck. It can tell the difference and then give you a recommendation based on that.

So now that we've walked through the definition and the use cases, let's analyze the pros and the cons.


First, K-means is incredibly fast, especially when compared with other unsupervised clustering algorithms, which tend to be a little on the slower side. That's one of the reasons K-means is used so often. Second, it's very straightforward to implement. There are a number of different implementations and libraries for K-means where you can literally just pipe in a few pieces of data and the algorithm will do all of the work for you.

And even if you had to build this yourself, which is not something you're going to have to do very often, K-means is, from a mathematical-formula perspective, somewhat more straightforward to understand than some of the other algorithms out there.

Now for the cons, or what are some of the potential cons?


One, and this is going to apply to pretty much every clustering algorithm, is that it can be difficult to gauge the quality of the clusters. When we walked through other algorithms, such as decision trees or Naive Bayes, it was usually pretty easy to see whether our prediction was good or bad.

But with K-means, because it's unsupervised, it's much more difficult to tell whether the quality is there. Like many other algorithms, that means some level of domain expertise has to be present in order for you to build an accurate system. Whenever you're talking about clustering, you're usually talking about fuzzy accuracy levels. Take, for example, the case study we're going to walk through, where we're gauging interest level. This doesn't have to be as accurate as telling someone whether they can or can't get a mortgage, so our accuracy level doesn't have to be anywhere near as high. Usually, when you're using K-means clustering, you care less about accuracy and more about generalizations. That's important to keep in mind, and it is a con: if your program needs a high level of accuracy, then K-means clustering may not be the right choice right away.
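That said, practitioners do have rough numeric measures of cluster quality. One common one is the silhouette score, which ranges from -1 to 1, with higher values meaning tighter, better-separated clusters. A minimal sketch with scikit-learn, on a made-up dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated groups of points (illustrative data only).
points = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
    [9.0, 9.0], [9.2, 8.8], [8.9, 9.1],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Silhouette compares each point's distance to its own cluster vs. the
# nearest other cluster; well-separated groups score close to 1.
score = silhouette_score(points, labels)
print(round(score, 3))
```

A score like this is still only a geometric sanity check; it can't tell you whether the clusters are meaningful for your domain, which is why the domain expertise mentioned above still matters.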

You may be able to use it in tandem with other algorithms, for example as a filter, where you first group your data and then help the system make decisions based on that. The second con is that you are forced to select the value of K, the number of clusters. If you pick the wrong value, say too high a number for K, then you're going to end up with clusters that may not be very accurate, because the system is going to keep iterating, trying to find the right set of clusters and the right data to put in them, and the result may not make much sense.

Likewise, if you go with K being too low, you run the risk of creating clusters that really don't make any sense, because they're too general. That is also an issue you have to keep in mind. There are a number of ways to handle this, and when we get into the implementation side, we'll walk through what you need to do in order to make an intelligent decision for K.
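As a preview of making that decision, one standard heuristic is the "elbow method": fit K-means for several values of K and watch the inertia (the within-cluster sum of squared distances) fall. A hedged sketch on synthetic data, where the "right" answer is three clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs, so the natural choice is K = 3.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.3, size=(30, 2)),
])

# Fit K-means for K = 1..6 and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
# Inertia drops sharply up to K = 3, then flattens out; that bend
# (the "elbow") suggests K = 3 is a sensible choice.
```

The elbow is a judgment call rather than a formula, which fits the point above: choosing K well still takes some human analysis of your data.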

So we've walked through the definition, the pros and cons, and the popular use cases. Now let's dive into the case study. We're going to imagine that we've been hired by a social networking company. The company has all of these users, but it doesn't really have a way of grouping their interests. As soon as it can do that, it can start building a recommendation system: we see that you've been interested in these items, so maybe you'd like to join this group, or maybe you'd like to follow this page. Obviously, this could also apply to advertising and those kinds of elements.

So let's dive into what the data could look like.


So right here we're looking at a few different elements: age, location, and current interests. For example, in the top row the age is 19, the location is Newport Beach, California, and the current interests are surfing, paddleboarding, X Games, and snowboarding.

You can see that's very different from the user at the very end, who is 42 years old, lives in Seattle, Washington, and whose interests are wine, travel, and coffee, and then you have a little bit of everything in between. What K-means clustering lets you do is pipe this data into the algorithm.


Then, when it goes into that funnel, in this case we're going to say that we want two clusters: one for outdoor enthusiasts, and one for people interested in other kinds of things, like food, wine, and travel. K-means can take all of that data in and then give us recommendations on the interests for our groups of users.
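As a rough sketch of what piping this data in could look like, here is one hypothetical encoding: interests become one-hot features, age is scaled down so it doesn't dominate, and we ask for K = 2. The vocabulary, users, and encoding below are all invented for illustration; a real system would use richer features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical interest vocabulary and users (modeled on the table above).
interests_vocab = ["surfing", "paddleboarding", "snowboarding",
                   "wine", "travel", "coffee"]

users = [
    (19, ["surfing", "paddleboarding", "snowboarding"]),  # Newport Beach user
    (22, ["surfing", "snowboarding"]),
    (42, ["wine", "travel", "coffee"]),                   # Seattle user
    (38, ["wine", "coffee"]),
]

def encode(age, interests):
    # One-hot encode interests; scale age so it's comparable to 0/1 features.
    one_hot = [1.0 if i in interests else 0.0 for i in interests_vocab]
    return [age / 100.0] + one_hot

X = np.array([encode(age, ints) for age, ints in users])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two outdoor users share one cluster, the other two the other
```

The feature scaling choice matters: without dividing age by 100, a 23-year age gap would outweigh every interest feature combined.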

So inside of that funnel, there are a few steps that it's going to go through.


First, we need to select K, the number of clusters that we want the system to output. From there, this is where it gets a little different from pretty much any other algorithm we've looked at so far: the system randomly assigns what are called centroids. The centroid is the key concept to understand when it comes to K-means clustering: it is the central point of a cluster. This is very different from, say, a support vector machine, where we were looking for the outlying vectors, the outlying data points, and separating the data that way.

With K-means clustering we take a different approach: we actually create these clusters, and all of the data centers around these centroids. The way the system works, the very first version is typically very bad, because the centroids are assigned randomly and could land anywhere on the graph. From there, though, K-means starts to work through its mathematical formula. It iterates, looks at all of those data points, the interests, the age, the location, and then refines and moves the centroids so that they better represent where the clusters should be.

So let's see what this looks like on the graph.


So in the very beginning, this is the way it might look. We have data points represented by the blue dots, and the system randomly assigns the centroids: through some kind of random number generator, it places them on the graph, and you can see the green diamond and the red diamond sitting more or less anywhere. The system is then going to iterate, looking at the interests, location, and age of each of those points, and it's going to start to find similarities.

And as it does that, it continually refines the distance between the centroids and the points in the different clusters.


It looks at all of those data points and continues to loop until it has created what it believes to be the best fit: each centroid as close as possible to all of the data points it thinks belong with it.
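The loop just described can be sketched from scratch in a few lines: pick K random data points as initial centroids, assign every point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the centroids stop moving. This is a simplified teaching version, not a production implementation.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K of the data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated illustrative groups.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [6.0, 6.0], [6.2, 5.9], [5.8, 6.1]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

Even when the random initial centroids both land inside one group, the centroid that captures the far-away points gets pulled toward them on the next update, which is the self-correcting behavior the graph above illustrates.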

Now, this usually does not work out perfectly the first time. You really need the right kind of data, because as you can see from here, the data completely determines those centroids and clusters. If your data overlaps too much, say almost everyone shares similar interests, ages, and locations, then the system is not going to be able to give you a well-defined set of clusters, because all of the dots are going to be right next to each other.

This typically works best in a scenario where the data is spread out and forms natural clusters. One of the nice things about this specific case study is that if you are working for a social network or something like that, the data does fall into those kinds of categories. By nature, we are culturally very likely to fall into similar patterns: if we like one page on Facebook, we've probably also liked a few other similar pages, and we're likely to be friends with people who share our interests. So this is a very common use case for K-means clustering.