Introduction

In this post, we will build a simple scraper using the requests and BeautifulSoup libraries. We’ll use requests to send HTTP requests and receive the responses, and BeautifulSoup to parse the HTML content. Let’s define our scraper. It will scrape the title, image and excerpt of every article posted on this blog. If you navigate to the home page, you’ll see a list of articles. Not all articles fit on one page, so there is pagination at the bottom to help you navigate. The concepts you’ll learn here can be applied to any other website where the items are paginated.

The steps that our scraper will follow are shown below.

graph LR;
    init[Seed URL] --> a[Download Web Page]
    a --> b[Extract Content]
    b --> c[Extract links]
    c --found more links?--> a
    c --no links found? --> d[Save Extracted Contents]
    d --> Exit

Here, the seed URL is the link to the home page of this blog. The scraper downloads the page, extracts the title, image and excerpt of each article, and looks for the link to the “next” page. If a link to the “next” page exists, it repeats these steps; otherwise it saves the extracted contents and exits.

Code

First, import the required libraries.

import requests
from bs4 import BeautifulSoup
import json
from urllib.parse import urljoin

Now we’ll define our main loop of the scraper. Note that the functions in the main loop have not been defined yet.

seed_url = "https://sanjayasubedi.com.np"

if __name__ == "__main__":
    next_url = seed_url
    articles = []
    while next_url is not None:
        print("Downloading {}".format(next_url))
        html = download_page(next_url)
        # create a beautifulsoup object from html text
        soup = BeautifulSoup(html, "html.parser")
        
        
        extracted_articles = extract_articles(soup)
        print("Extracted {} articles".format(len(extracted_articles)))
        # concatenate the two lists of articles
        articles += extracted_articles
        
        next_url = extract_next_url(soup)

    save_articles(articles, "./articles.json")
    print("Finished")
    

First we assign a value to seed_url and also assign it to next_url. The while loop repeats as long as next_url contains a value, so the scraper exits once it can’t find another link to crawl. In each iteration, the scraper downloads the HTML content from the URL using the download_page function. Then we create a BeautifulSoup object and use it to extract the articles with extract_articles and the next URL with extract_next_url. The articles extracted from each page are added to the articles list, so that once the loop finishes, the list contains every article from all the pages that were scraped. Finally, we save the articles to a file using the save_articles function. For many simple, one-off scraping tasks this approach is good enough. There is no error handling here, but apart from that this is mostly what such a scraper looks like.

Now let’s implement the functions that we’ve used in the loop.

def download_page(url):
    """
        downloads a page and returns its contents
        :param url url to download
        :returns content received from the url
    """
    # download the web page 
    resp = requests.get(url)
    # make sure that http status code is 200
    assert resp.status_code == 200
    return resp.text

download_page is a simple function that sends an HTTP GET request to the given URL. We make sure that the HTTP status code is 200, i.e. everything is fine and there were no errors such as 404 Not Found. Finally, we return the downloaded content.
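
One caveat with the assert above is that Python skips assert statements when it runs with the -O flag. Below is a minimal sketch of an alternative that uses raise_for_status() from requests and adds a timeout. The timeout value and the get parameter are assumptions made for this sketch (get exists only so the function can be tried without network access), and note that raise_for_status() rejects 4xx and 5xx responses rather than requiring exactly 200.

```python
import requests


def download_page(url, timeout=10, get=requests.get):
    """
    Download a page and return its text.

    `timeout` and `get` are hypothetical additions for this sketch;
    they are not part of the original scraper.
    """
    # without a timeout, a stalled server could make the scraper hang
    resp = get(url, timeout=timeout)
    # raises requests.HTTPError for any 4xx or 5xx status code
    resp.raise_for_status()
    return resp.text
```

Called as download_page(next_url), it behaves like the original for successful responses, so the main loop would not need to change.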

def extract_articles(soup):
    """
        extracts article title, image, and excerpt
        :param soup beautifulsoup object
        :returns a list of dict
    """
    output = []
    article_elements = soup.select("div.archive article")
    for ele in article_elements:
        title = ele.select_one("h2").text
        excerpt = ele.select_one("p[itemprop=description]").text
        image = urljoin(seed_url, ele.select_one("div.archive__item-teaser > img").get("src"))
        
        output.append(dict(title=title, excerpt=excerpt, image=image))
        
    return output

Next is the extract_articles function, which takes a BeautifulSoup object and returns a list of dicts containing the title, excerpt and image. In BeautifulSoup you can find elements with methods such as find and find_all or with CSS selectors (XPath is not supported). Here I’m using CSS selectors. The .select function returns a list of all elements that match the given selector, whereas .select_one returns only one element. If there are multiple matches, it returns the first one. If there are no matches, it returns None.
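
To see the difference between .select and .select_one, here is a small sketch that runs on a made-up HTML snippet (the markup is an assumption for illustration, not the blog’s real page).

```python
from bs4 import BeautifulSoup

# made-up markup, only loosely modelled on the blog's article list
html = """
<div class="archive">
  <article><h2>First post</h2></article>
  <article><h2>Second post</h2></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_articles = soup.select("div.archive article")  # list of every match
first_title = soup.select_one("div.archive h2")    # first match only
missing = soup.select_one("div.archive h3")        # no match, so None

print(len(all_articles))  # 2
print(first_title.text)   # First post
print(missing)            # None
```

Because select_one can return None, calling .text on its result raises an AttributeError when the element is missing, so it is worth checking for None on pages whose structure you don’t control.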

In your browser, press Ctrl+Shift+I or right-click and choose Inspect Element to open the inspector. You’ll find that all the articles are inside a div with class archive and that each article is inside an article tag. So we create the selector div.archive article, which means: find all elements with tag article that are inside a div with class archive. For more details on CSS selectors, visit http://htmldog.com/references/css/selectors/.

Now, for each element that is supposed to contain an article, we extract the title, excerpt and image. Once again, after inspecting an article block, we can see that the title is inside an h2 tag, so we call select_one with the selector h2. The excerpt is in a p tag that has an attribute called itemprop whose value is description, so we create the selector p[itemprop=description], which matches p elements whose itemprop attribute is set to description. Other p elements without that attribute and value won’t be selected. The image is in an img tag that is a direct child of a div with class archive__item-teaser, so we create the selector div.archive__item-teaser > img (note the >, since img is a direct child of the div) and use the get function to retrieve the src attribute of the img. You might have noticed that the value in src is not an absolute link, so we make it absolute by calling the urljoin function.
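
The attribute selector, the child combinator and urljoin can be tried out on a single made-up article block. The markup and the relative src value below are assumptions for illustration; the real page may differ in detail.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://sanjayasubedi.com.np"
# made-up article block that mirrors the selectors described above
html = """
<article>
  <h2>Sample title</h2>
  <div class="archive__item-teaser"><img src="/assets/images/teaser.jpg"></div>
  <p>Not selected: this p has no itemprop attribute</p>
  <p itemprop="description">Sample excerpt</p>
</article>
"""
article = BeautifulSoup(html, "html.parser").select_one("article")

# attribute selector: only the p with itemprop="description" matches
excerpt = article.select_one("p[itemprop=description]").text
# child combinator: img must be a direct child of the teaser div
img = article.select_one("div.archive__item-teaser > img")
# urljoin turns the relative src into an absolute URL; guard against a missing image
image = urljoin(base_url, img.get("src")) if img is not None else None

print(excerpt)  # Sample excerpt
print(image)    # https://sanjayasubedi.com.np/assets/images/teaser.jpg
```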

Finally, we create a dictionary for each article, append it to output, and return output once all articles are processed.

def extract_next_url(soup):
    """
        extracts next url from the pagination if it exists
        :param soup beautifulsoup object
        :returns url of next page or None
    """
    a_elements = soup.select("nav.pagination a")

    for ele in a_elements:
        # classes assigned to this element, either empty list or actual classes
        classes = ele.get("class") or []
        # find the <a> with text "Next" that does not have the class "disabled"
        if ele.text == "Next" and "disabled" not in classes:
            return urljoin(seed_url, ele.get("href"))
    
    return None

The extract_next_url function extracts the next URL to scrape. Here, we look for a elements inside the nav with class pagination. For each a element, we check whether its text is Next and whether it lacks the class disabled. This logic is specific to my blog. For other websites it will be different, so you’ll have to come up with your own logic. Try going to the last page and see what happens to the Next button. Some websites remove the button completely, while others just hide it so users can’t see it even though it is still present. Make sure that you cover all the edge cases. Otherwise, the loop will run forever!
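
As an extra safety net against an endless loop, the crawl can remember which URLs it has already visited and stop after a maximum number of pages. The sketch below is not part of the original scraper: crawl, get_next_url and max_pages are hypothetical names, and the link structure is made up so that the example runs without network access.

```python
def crawl(seed_url, get_next_url, max_pages=100):
    """Follow "next" links from seed_url, stopping on a repeat or at max_pages."""
    visited = []  # URLs in the order they were processed
    seen = set()  # same URLs, for fast membership checks
    next_url = seed_url
    while next_url is not None and next_url not in seen and len(visited) < max_pages:
        seen.add(next_url)
        visited.append(next_url)
        next_url = get_next_url(next_url)
    return visited


# made-up link structure in which page 3 wrongly points back to page 2
links = {"/": "/page2/", "/page2/": "/page3/", "/page3/": "/page2/"}
print(crawl("/", links.get))  # ['/', '/page2/', '/page3/']
```

In the real scraper, the same two checks (a seen set and a page limit) could be added to the condition of the while loop in the main block.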

def save_articles(articles, filepath):
    """
        saves articles to a json file
        :param articles a list of dict
        :param filepath path of file to save
    """
    with open(filepath, "w") as f:
        json.dump(articles, f, indent=4)

Finally, the save_articles function just dumps the list of articles to a file in JSON format.
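
If the titles may contain non-ASCII characters, a possible variant is to open the file with an explicit encoding and pass ensure_ascii=False, so the characters stay readable in the file. The sketch below also reads the file back to check the round trip; load_articles and the sample data are hypothetical and not part of the original scraper.

```python
import json
import os
import tempfile


def save_articles(articles, filepath):
    # explicit encoding plus ensure_ascii=False keeps non-ASCII text readable in the file
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(articles, f, indent=4, ensure_ascii=False)


def load_articles(filepath):
    # hypothetical helper, used here only to verify what was written
    with open(filepath, encoding="utf-8") as f:
        return json.load(f)


# made-up sample data
articles = [{"title": "Café scraping", "excerpt": "Sample excerpt", "image": None}]
path = os.path.join(tempfile.mkdtemp(), "articles.json")
save_articles(articles, path)
print(load_articles(path) == articles)  # True
```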

Here is what I got when I ran the program.

Downloading https://sanjayasubedi.com.np
Extracted 12 articles
Downloading https://sanjayasubedi.com.np/page2/
Extracted 4 articles
Finished

And the file contains

[
    {
        "title": "\nPredicting if a headline is a clickbait or not\n\n",
        "excerpt": "Learn how to classify if a headline is a clickbait or not\n",
        "image": "https://sanjayasubedi.com.np/assets/images/teaser.jpg"
    },
    {
        "title": "\nShort Tutorial on Named Entity Recognition with spaCy\n\n",
        "excerpt": "A simple and minimal example showing how to detect named entities in an unstructured text\n",
        "image": "https://sanjayasubedi.com.np/assets/images/teaser.jpg"
    },
    {
        "title": "\nText Classification with sklearn\n\n",
        "excerpt": "A tutorial on text classification using sklearn\n",
        "image": "https://sanjayasubedi.com.np/assets/images/teaser.jpg"
    },
    {
        "title": "\nWord Embeddings in Keras\n\n",
        "excerpt": "A short tutorial on using word embedding layer in Keras\n",
        "image": "https://sanjayasubedi.com.np/assets/images/deep-learning/word-embeddings/glove_airbnb2_sbd_01.png"
    },
    {
        "title": "\nNepali License Plate Recognition with Deep Learning\n\n",
        "excerpt": "Train a neural network model to predict letters and numbers from a license plate\n",
        "image": "https://sanjayasubedi.com.np/assets/images/deep-learning/nepali-license-plate/samples.png"
    },
    {
        "title": "\nBlack and White to Color using Deep Learning\n\n",
        "excerpt": "A short tutorial on fully convolutional neural networks\n",
        "image": "https://sanjayasubedi.com.np/assets/images/deep-learning/bw-to-color/data-pair.png"
    },
    ...
]  

Note the \n characters: they are newline characters, and there might be other whitespace around the text you extracted. It can be removed by calling the strip() method on the string, e.g. new_title = title.strip().
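
For example, every string field can be cleaned in one pass after scraping. This is a minimal sketch that uses one entry from the output above as sample data; cleaned is a name chosen for this example.

```python
articles = [
    {
        "title": "\nText Classification with sklearn\n\n",
        "excerpt": "A tutorial on text classification using sklearn\n",
        "image": "https://sanjayasubedi.com.np/assets/images/teaser.jpg",
    },
]

# strip leading and trailing whitespace from every string value
cleaned = [
    {key: value.strip() if isinstance(value, str) else value
     for key, value in article.items()}
    for article in articles
]

print(repr(cleaned[0]["title"]))  # 'Text Classification with sklearn'
```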

Next Steps

We wrote a simple scraper to extract article contents, but many things are missing, such as error handling. What happens if we are scraping the 10th page and suddenly there is an error? Our program will crash and the articles already extracted will be lost. Should we retry? Should we save often? What if the server blocks our IP for making too many requests? What if you have 10,000 pages to scrape and each page takes 1 second? You’ll have to wait almost 3 hours. In the next post we’ll improve our scraper and solve some of these issues.
