This article is Part 1 of a two-part series on Web Scraping with Python.
In this tutorial series, we'll build several scrapers of increasing complexity. A scraper's complexity arises from the complexity of the target website: many large websites have teams dedicated to preventing scrapers from downloading their content. There are many things to consider when developing a scraper, and we'll visit these issues as we go along.
Let's look at the most basic version of a scraper you can imagine.

Imagine that our seed URL is a Google search results page, https://www.google.com/search?q=scraping, and we would like to extract the titles of the top 30 results. The scraper downloads the contents of that URL, extracts the titles from the page, and saves them. Websites don't show all results on a single page; they use pagination to help users navigate through the results. In most cases, clicking a button or link labelled Next will do, so the scraper should also extract the link to the next page. If it finds such a link, it starts over with the newly extracted URL; otherwise it exits.
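The loop described above can be sketched as follows, using two libraries introduced later in this article (requests and BeautifulSoup). The `h3` selector and the literal "Next" anchor text are assumptions about the target markup, not Google's actual (frequently changing) HTML:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_titles(html):
    # Assumption: result titles live in <h3> tags
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.select("h3")]

def find_next_link(html, base_url):
    # Return an absolute URL for the "Next" link, or None to stop
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", string="Next")
    return urljoin(base_url, link["href"]) if link else None

def scrape(seed_url, max_pages=3):
    url, titles = seed_url, []
    while url and max_pages:
        html = requests.get(url, timeout=10).text  # download the page
        titles.extend(extract_titles(html))        # extract and save titles
        url = find_next_link(html, url)            # follow pagination, or exit
        max_pages -= 1
    return titles
```

The `max_pages` cap is a safety net so the loop terminates even if the site paginates indefinitely.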
A scraper has, at minimum, three components: an HTTP client, an HTML parser, and some form of storage. You can assemble these and other necessary components yourself to build your own scraper, or use an existing library such as Scrapy, which does all the plumbing for you.
An HTTP client is a component used to construct HTTP requests and handle the HTTP responses from the server. You can think of it as a very basic browser that downloads the web page but does not render it or execute the scripts contained in it. In your browser, visit any page and view its source (Ctrl+U in most browsers) to see the raw HTML an HTTP client would receive.
Here are two commonly used HTTP libraries in Python. There are many others, including builtins like urllib.request.
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
pip install requests
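A small sketch of the "no manual query strings" claim: Requests encodes the parameters for you. Here the request is built but not sent, so this runs without a network connection:

```python
import requests

# Build (without sending) a request; Requests encodes the query string for us
req = requests.Request(
    "GET", "https://www.google.com/search", params={"q": "scraping"}
).prepare()
print(req.url)  # https://www.google.com/search?q=scraping

# Actually sending it is one more line (requires network access):
# resp = requests.Session().send(req)
```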
Asynchronous HTTP Client/Server for asyncio and Python.
pip install aiohttp
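With aiohttp, many pages can be downloaded concurrently instead of one after another. A minimal sketch (the URL list is illustrative; running it requires network access):

```python
import asyncio

import aiohttp

async def fetch(session, url):
    # One request; connections are pooled inside the shared session
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    # A single session shared across all concurrent requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# pages = asyncio.run(fetch_all(["https://example.com/"]))  # needs network
```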
The second component is an HTML parser. The most popular one is BeautifulSoup. It is quite easy to use and supports CSS selectors to find the required elements (BeautifulSoup itself does not support XPath; for that, use lxml directly). Underneath, BeautifulSoup delegates the actual parsing to a pluggable parser, and you can specify which one to use: html.parser, a builtin Python module, or lxml, which is fast and can handle messy HTML documents. lxml is recommended.
pip install beautifulsoup4
pip install lxml
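A short example of BeautifulSoup with CSS selectors, run against an inline HTML snippet (the `h3.title` class and the "Next" link are made up for illustration). It uses the builtin `"html.parser"`; swap in `"lxml"` if you have installed it:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h3 class="title">First result</h3>
  <h3 class="title">Second result</h3>
  <a href="/page2">Next</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # or "lxml" if installed

# CSS selector: every <h3> with class "title"
titles = [h3.get_text() for h3 in soup.select("h3.title")]
# Find the pagination link by its visible text
next_href = soup.find("a", string="Next")["href"]

print(titles)     # ['First result', 'Second result']
print(next_href)  # /page2
```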
After you scrape your data, you'll want to save it somewhere. It could be a CSV file, a JSON file, or a database such as MongoDB or MySQL.
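For flat files, Python's standard library already covers both formats. A sketch with hypothetical scraped rows (the filenames and fields are made up):

```python
import csv
import json

# Hypothetical scraped results, one dict per row
rows = [
    {"rank": 1, "title": "First result"},
    {"rank": 2, "title": "Second result"},
]

# CSV: header plus one line per result
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "title"])
    writer.writeheader()
    writer.writerows(rows)

# JSON: the whole list in a single document
with open("titles.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```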
We'll start by writing our first scraper using BeautifulSoup. We'll immediately observe some of its limitations and improve upon it. Finally, we'll use Scrapy to build a robust crawler.