Python Package Section Project Requirement

Note

Dailysmarty is currently not accessible.

Web scraping has recently become less common as a general practice, and many websites have set up security measures to prevent scraping of their data. It is not something we will use elsewhere in the course, but it is good to know about because it lets you pull information from a website that doesn't have an accessible API.

Watching the video will suffice for this lesson, as it is difficult to reference a site for scraping. Please reference the documentation to learn more about web scraping and the BeautifulSoup package: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html

Part of the reason I have structured this project the way that I have is that it will help you on the course capstone project. There are a few fundamental concepts that you'll learn while you go through this project, and when you watch me go through the solution, that are going to help you on the much larger and more challenging capstone. I want to give you some of the knowledge that you're going to need in order to do that.

Technically, you have learned everything that you need to know in order to build out this project, and also the entire course capstone project. But it also helps to be familiar with some of the libraries that can make your life a little bit easier and make your code more straightforward to implement. So what we're going to do is build out a web scraper. If you've never heard of a web scraper, it is a program that goes out to a website and extracts components from that site so that you can bring them into your own program.

Now, we've walked through how to communicate with an outside API. We saw how we could leverage the requests library in order to communicate with the DailySmarty API, like we did in this section. But there are going to be many other applications that you'll work with, especially if you're getting into the machine learning space, that don't have APIs. In those cases you need to build a web scraper: you bypass the entire concept of an API and essentially build your own version of it. That's what we're going to be building out in this project.

What I want you to do is go to the pure web page. Do not leverage the API. We will be able to see in your code if you do that, and that would be cheating, so please do not do that. Instead, I want you to go to this URL, which I'll provide in the show notes: http://www.dailysmarty.com/topics/python. These are the Python-related posts on DailySmarty. As you can see, we have all of these titles here and they are in link form. What I want you to build is a program that comes to this URL and then scrapes the HTML code from it.

Now technically, just like every other website, this is just HTML code. So if I right-click here and click View Page Source, this is what the website actually looks like to the browser.

If you leverage the requests library, you're going to be able to call the URL directly and then get access to all of this content. Now if this looks very confusing, do not worry. It is something that you're going to learn how to implement, and I'm also going to help you and give you a few hints on how you can parse pure HTML code. It's going to be with a few more packages and libraries.
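
As a rough illustration of what calling the URL directly involves, here is a minimal sketch. It uses the standard library's urllib as a stand-in so that it runs without installing anything; with the requests library, requests.get(url).text plays the same role. The function name fetch_page_source and the User-Agent value are hypothetical choices made for this example, and since DailySmarty may not be reachable, treat the URL in the comment as a placeholder.

```python
from urllib.request import Request, urlopen


def fetch_page_source(url, timeout=10):
    """Return the HTML of a page as text (hypothetical helper name)."""
    # Some sites reject requests that lack a browser-like User-Agent header;
    # this particular value is only an illustrative assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (course exercise)"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access and a reachable site):
# html = fetch_page_source("http://www.dailysmarty.com/topics/python")
```

The string that comes back is the same markup you see with View Page Source, which is what the parsing step then has to work through.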

But before we get into those, let me show you the full set of requirements. You're going to come to this URL: http://www.dailysmarty.com/topics/python. The program is going to parse through all of the data on there, and I want you to select all of the links that go to posts. So if I right-click on one of these and click Copy Link Address, then open up a text editor, say vim project.py, I can paste in what that URL looks like. These are the values you are going to get access to.

Now, what I want you to do is to only pull out the links that are related to posts. There are going to be links all over this page: links that go to the feed, to topics, to users, to a new post, and to other URLs. I want you to filter out the ones that you do not want and only grab the ones that go right to a post.

As you go through and get all of the URLs, you're going to notice some patterns, and those are going to help you decide which links you want and which links you do not want.
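
To make the idea of filtering by pattern concrete, here is a small sketch. It assumes that post links contain the segment /posts/, which is a guess about the site's URL scheme and not something confirmed here, and it runs against a made-up HTML snippet instead of the live page. It uses the standard library's html.parser as a stand-in; with BeautifulSoup, soup.find_all("a") together with link.get("href") would collect the same href values. The names LinkCollector, post_links, and SAMPLE_HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def post_links(html, marker="/posts/"):
    """Keep only the links whose URL contains the marker (assumed pattern)."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if marker in link]


# Made-up markup that imitates the mix of links on a topic page.
SAMPLE_HTML = """
<a href="/feed">Feed</a>
<a href="/topics/python">Python</a>
<a href="http://www.dailysmarty.com/posts/how-to-implement-fizzbuzz-in-python">A post</a>
<a href="/users/sign_in">Sign in</a>
"""

print(post_links(SAMPLE_HTML))
```

Only the href values are read here, not the link text, which matches the rule that the title has to come from the URL itself.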

The next thing that I want you to do is take the link right here in your text editor and convert that link into a page title.

I don't want you to go through and grab the page title element itself. If you click on the element here and look at the code, you'll see the text inside the a tag, but I don't want you to simply come here and grab that link text. That would kind of defeat the purpose of what I'm wanting you to do.

Instead, I want you to grab the URL only and then build a function that converts the title text, which is in the URL, into a readable title. So if I were to grab the URL here that says how to implement fizz buzz in Python, I would want the output to look something like this, even with capitalization and those types of components.

"How to Implement FizzBuzz in Python"

The final output for this project should be a list of titles, one per line: one title, then a second title, all the way down to whatever the last one is.
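
One possible shape for the conversion and the final output is sketched below using only the standard library. The URLs are hypothetical examples, the SMALL_WORDS set is an assumption about which words should stay lowercase, and the function names are made up for illustration. The inflection library recommended in this guide is worth researching for this kind of conversion. Note that a lowercase slug carries no information about inner capitals, so a spelling like "FizzBuzz" cannot be recovered by this approach.

```python
from urllib.parse import urlparse

# Words kept lowercase unless they start the title (an assumed style rule).
SMALL_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def title_from_url(url):
    """Turn the last path segment of a post URL into a readable title."""
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    words = slug.split("-")
    titled = [
        word if index > 0 and word in SMALL_WORDS else word.capitalize()
        for index, word in enumerate(words)
    ]
    return " ".join(titled)


def titles_from_urls(urls):
    """Convert every URL in the list, preserving order."""
    return [title_from_url(url) for url in urls]


# Hypothetical post URLs standing in for the scraped links.
SAMPLE_URLS = [
    "http://www.dailysmarty.com/posts/how-to-implement-fizzbuzz-in-python",
    "http://www.dailysmarty.com/posts/guide-to-python-dictionaries/",
]

for title in titles_from_urls(SAMPLE_URLS):
    print(title)
```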

Now, these are going to be slightly different depending on when you're taking this course, because new posts are being added to DailySmarty on a daily basis, so don't worry about the titles lining up perfectly.

I simply want you to be able to take a set of URLs and then convert them the way that I've done right here. I know that this may seem like a lot if you've never built out this type of behavior before, so I'm going to give you a few hints by listing the libraries to use. First, I recommend the requests library. Another one that I recommend is the inflection library, and I recommend that you go and research what it does. And then lastly is the beautifulsoup library. This is going to be a critical one for any type of parsing and web scraping that you do when building out these types of applications.

Now if you're using Python 3, then you can use the traditional pip install requests if you haven't installed it already, and you're going to do the same thing for the inflection library. But for beautifulsoup, you can't simply call pip install beautifulsoup. You have to use the latest version, so you're going to call beautifulsoup followed by 4, and that will install the version that you're going to need for this program.

pip install requests
pip install inflection
pip install beautifulsoup4
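
If you want to confirm that the installs worked before you start, a quick check along these lines can help. This is only a sketch, and the helper name missing_packages is made up for this example. The one detail worth knowing is that the beautifulsoup4 package is imported under the module name bs4, so the pip name and the import name differ.

```python
from importlib.util import find_spec

# Maps each pip package name to the module name you import in code.
REQUIRED = {
    "requests": "requests",
    "inflection": "inflection",
    "beautifulsoup4": "bs4",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Still to install: pip install " + " ".join(missing))
else:
    print("All three libraries are available.")
```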

So with those 3 libraries, you're going to be able to build out this entire system. Obviously I'm not going to tell you how to do that yet. These are the hints that you are going to work from. I recommend that you research each one of these libraries, see what features they offer, and work out how you can combine them to build out the type of functionality that I've walked through.

If you have any questions whatsoever about this project, feel free to reach out to your instructor and ask. They'll help you implement the solution. So good luck with the project, and I will see you in the next guide, where I walk through my own personal solution.

Resources

Link to DailySmarty Python Topic: http://www.dailysmarty.com/topics/python