What we're going to do is create a miniature version of the Google PageRank algorithm. Now, if you remember, Google originally created their system in Python, so I thought this would be a pretty cool fit. We're going to build a web crawler that ranks pages.
So let's talk about the details and exactly what has to get done. I always like to break my programs down into the simplest possible set of features, so let's break this one down in a very easy way. We're going to have a set of inputs: around 100 or so websites. I'm going to leave it up to you to pick out whatever websites you want, and you can use more or fewer. However, you should have at least around 100 in order to get a fair score.
Now, that list is going to go into a system where you build a web crawler that goes out and scans all the way through those pages. What you need to do is find all of the links on each page. From there, the main thing is to start analyzing all of that data and create a set of data structures that categorize those links into two groups.
We're going to make PageRank a little more simplistic. In the very beginning Google implemented five different categories, where a page rank was 1 through 5; we're simply going to say low and high. All of these links are going to be valued by how many times they were listed. In other words, if you have 100 links and you find that one link was referenced 20 times across the entire set of pages, and that was the highest number, then that link should be in the top category.
Now, this is still a little vague. Let's break it down even further. I want to give you the tools you need and then you can go and build the project yourself.
So step one is to compile your list of links, which is going to be somewhere around a hundred or so. Now, if you want to create even more accurate scores, you could make a list of 1,000 links; that's completely up to you.
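To make that starting point concrete, here's a minimal sketch of what the input could look like. The URLs are just placeholders; swap in whatever sites you choose and grow the list toward 100 entries.

```python
# A hypothetical seed list; in the real project this would grow
# to roughly 100 (or even 1,000) URLs of your choosing.
seed_urls = [
    "https://example.com",
    "https://example.org",
    "https://example.net",
]

print(len(seed_urls))
```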
Now, from there, you're going to implement a web crawler. And if you're freaking out right now when I say "implement a web crawler," I'll give you a hint: you're not going to have to build your own web crawler from scratch. When we went through the section on Python modules, you saw that, with a little research, you can find a large number of web crawlers that you can import right into your system, and they'll do a lot of the low-level work for you.
So you're going to implement a web crawler and pull out all of the links on each of those pages. If you start with a hundred pages, each one may have 20, 30, or even 100 links inside of it, so you could end up with a few thousand links. Those crawled links are what you're actually going to work with, and I want to clarify that, because I know we're using the same word for both the starting list and the links you find inside each page.
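To show what "pulling out all of the links" means in practice, here's one possible sketch using only Python's standard-library `html.parser` module. The crawler module you import will likely handle this for you; this just illustrates the idea on a tiny stand-in page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link found in one page's HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# A tiny stand-in for one crawled page
page = '<p><a href="https://example.com">One</a> and <a href="https://example.org">Two</a></p>'
print(extract_links(page))  # ['https://example.com', 'https://example.org']
```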
So once you have this set of links, meaning all of the links you found on the pages you crawled, this is where the magic happens and you start building out your ranking system. What you need to do is create some way of looping over the crawled data, so you're going to loop over those couple thousand links, and you need to keep track of how many times each link appears. There will be a large number of links that only appear one time, but there may also be some links that show up 20 or 30 times, and you need to keep track of that.
Now, I'll give you a little bit of a hint. If I wanted to keep track of something like this and create a counter (and we worked through an example kind of similar to this), I might think about using a dictionary. A dictionary can use each unique URL as a key, and the value can be that URL's count. As you loop through, whenever you find a URL that already has a value, you simply increment that value. Now, we're not done yet. Once we have those counts, we need to extract them and essentially create a range; picture it in a number-line format.
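Here's a small sketch of that counting idea with a plain dictionary; the URLs are made up for illustration.

```python
# Links found while crawling, flattened into one list (made-up URLs)
crawled_links = [
    "https://example.com",
    "https://example.org",
    "https://example.com",
    "https://example.net",
    "https://example.com",
]

counts = {}
for url in crawled_links:
    # If the URL is already a key, increment it; otherwise start at 1
    counts[url] = counts.get(url, 0) + 1

print(counts["https://example.com"])  # 3
```

As a side note, Python's `collections.Counter` does exactly this for you, so `Counter(crawled_links)` is a one-line alternative.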
So we need to be able to see from, say, a count of one (we'll never have a count of zero, because if you found a link at all, it appears at least once) all the way up to maybe 30, if one link gets referenced 30 times. So we have this little counter range, and you're going to have all kinds of values along it: some links with a count of three, some with a count of five, and so on. You're going to keep this range just like this.
And what you're going to have to do is find the median value. Somewhere along that range there's going to be a sweet spot that is the median. If you're not familiar with what a median is from your stats class, you can research that, and I'll give that to you as a task. It's the middle value of a sorted set, and it's very important to understand if you're building out machine learning algorithms or anything like that.
Another hint I will give you: there is a median function inside of Python, so you can find whatever the median value is. Then what I want you to output is two groupings: all the links that are below the median, and all of the links that are at or above the median.
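The function that hint refers to is `statistics.median` from the standard library. A minimal sketch of the split, with made-up counts, might look like this:

```python
from statistics import median

# Hypothetical URL -> occurrence counts produced by the crawl
counts = {
    "https://a.example": 1,
    "https://b.example": 3,
    "https://c.example": 20,
    "https://d.example": 2,
    "https://e.example": 5,
}

mid = median(counts.values())  # median of [1, 3, 20, 2, 5] is 3

low = [url for url, n in counts.items() if n < mid]
high = [url for url, n in counts.items() if n >= mid]

print(low)   # links below the median count
print(high)  # links at or above the median count
```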
So this is what you should be giving back; this is the output. Remember how I said I always like to break things down into what the input is and what the output is? Our input is going to be about 100 links, and our output is going to be two groupings of the links you found while crawling, split at the median count, with the high group holding the most-referenced links and the low group holding the least-referenced ones. So you should have two lists as the output of this program. Now that you have a good overall idea of what we need to do on this project, I hope you have fun with it, and I'll see you in the solution.