Introduction

Web scraping is the process of collecting data from web pages. We download the web pages, parse the HTML and extract only what we need. In a web page, HTML lays out the content, CSS styles it, and JavaScript provides interactivity and other functionality. Depending on the complexity of the website, extracting the data can be quite easy or very difficult. For example, extracting data from Facebook is much harder than from this blog.

In this series of tutorials, we’ll build several scrapers of increasing complexity. The complexity of a scraper arises from the complexity of the target website. Many large websites have teams dedicated to preventing scrapers from downloading their content. There are many things to consider when we develop our scraper, and we’ll visit these issues as we go along.

Let’s look at the most basic version of a scraper you can imagine.

graph LR;
    init[Seed URL] --> a[Download Web Page]
    a --> b[Extract Content]
    b --> c[Save]
    b --> d[Extract links]
    d -- found more links? --> a
    d -- no links found? --> Exit

Let’s imagine that our seed URL is a Google search result page, https://www.google.com/search?q=scraping, and we would like to extract the titles of the top 30 results. The scraper downloads the contents of that URL, extracts the titles from the page and saves them. Websites don’t show all the results on a single page; they use pagination to help users navigate through the results. In most cases, clicking a button or a link labeled Next will do, so the scraper should also extract the link to the next page. If it finds such a link, it starts over from the beginning with the newly extracted link; otherwise it exits.
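To make the diagram concrete, here is a minimal sketch of that loop using requests and BeautifulSoup (both introduced below). The selectors, the User-Agent header and the way the Next link is resolved are assumptions for illustration only; Google’s real markup differs, changes often, and may block simple clients.

import requests
from bs4 import BeautifulSoup

# Hypothetical selectors: h3 for result titles, an anchor labeled "Next"
# for pagination. The real result-page markup will differ.
url = "https://www.google.com/search?q=scraping"
titles = []

while url and len(titles) < 30:
    # Download web page
    response = requests.get(url, headers={"User-Agent": "my-scraper/0.1"})
    soup = BeautifulSoup(response.text, "lxml")

    # Extract content: collect the result titles on this page
    titles.extend(h3.get_text() for h3 in soup.find_all("h3"))

    # Extract links: follow the "Next" link if there is one, otherwise exit
    next_link = soup.find("a", string="Next")
    url = "https://www.google.com" + next_link["href"] if next_link else None

print(titles[:30])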

Components

A scraper has at minimum three components. You can either assemble these and other necessary components yourself to build your own scraper, or use an existing library such as Scrapy, which does all the plumbing for you.

HTTP client

An HTTP client is a component that constructs HTTP requests and handles the HTTP responses from the server. You can think of it as a very basic browser that just downloads the web page but does not render it or execute the scripts contained in it. In your browser, visit any page and press Ctrl + U; it will show you the source of the page. This is what most HTTP clients will fetch from the server. Browsers, on the other hand, also download the other resources referenced by the page, such as images, CSS files and JavaScript files, so that they can render it as a complete web page.

Here are two commonly used HTTP libraries in Python. There are many others, including built-in modules like urllib.request.

requests

http://docs.python-requests.org/en/master/

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.

Installation

pip install requests
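A minimal sketch of fetching a page with requests; the timeout value and the params dict are illustrative choices, not requirements:

import requests

# Download a page; query strings can be passed as a dict instead of
# being appended to the URL by hand.
response = requests.get("https://www.google.com/search",
                        params={"q": "scraping"}, timeout=10)

print(response.status_code)              # 200 on success
print(response.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
html = response.text                     # raw HTML, ready for the parser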

AIOHTTP

https://aiohttp.readthedocs.io/en/stable/

Asynchronous HTTP Client/Server for asyncio and Python.

Installation

pip install aiohttp
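A minimal sketch of downloading several pages concurrently with aiohttp, which is where it pays off over a synchronous client; the URLs are placeholders:

import asyncio
import aiohttp

# Fetch one URL using a shared session.
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with aiohttp.ClientSession() as session:
        # Run the downloads concurrently instead of one after another.
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    print([len(html) for html in pages])

asyncio.run(main())

The trade-off is that all your code has to be written with async/await, which is one reason we start the series with requests.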

HTML parser

The most popular one is BeautifulSoup. It is quite easy to use and supports CSS selectors for finding the required elements (for XPath you can use lxml directly). You can also use html.parser, a built-in Python module, or lxml, which is fast and can handle messy HTML documents. BeautifulSoup uses these libraries underneath, and you can specify which one to use as the parser; lxml is recommended.

Installation

pip install beautifulsoup4

pip install lxml
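A quick sketch of parsing a small HTML snippet with BeautifulSoup on top of the lxml parser:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="title">Hello</h1>
  <a href="/next">Next</a>
</body></html>
"""

# The second argument tells BeautifulSoup which underlying parser to use.
soup = BeautifulSoup(html, "lxml")

print(soup.find("h1").get_text())        # Hello
print(soup.select_one("h1.title").text)  # CSS selector, same element
print(soup.find("a")["href"])            # /next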

Exporter

After you scrape your data, you want to save it somewhere. It could be a CSV file, a JSON file, or a database like MongoDB or MySQL.
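For instance, a minimal sketch of exporting scraped rows (hypothetical data here) to CSV and JSON using the standard library:

import csv
import json

# Hypothetical scraped rows -- in practice these come from the parser step.
rows = [
    {"title": "First result", "url": "https://example.com/a"},
    {"title": "Second result", "url": "https://example.com/b"},
]

# CSV export
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

# JSON export
with open("results.json", "w") as f:
    json.dump(rows, f, indent=2)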

Next Steps

We’ll start by writing our first scraper using requests and BeautifulSoup. We’ll immediately observe some limitations and improve upon it. Finally, we’ll use Scrapy to build a robust crawler.
