No one likes clickbait headlines, yet they seem to be everywhere. It would be great if a computer model could figure out whether a headline is clickbait so that we don’t have to waste our time on that garbage. In this post, I’ll show you how we can build a simple but accurate model for detecting whether a headline is clickbait or not.
First, download the dataset from here. I’ll quote the original authors of this dataset:
The directory contains two files each consisting the headlines of 16,000 articles. Both files are compressed using gzip, and each line in the decompressed files contains one article headline. The clickbait corpus consists of article headlines from ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’. The non-clickbait article headlines are collected from ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’.
Extract the files from the two archives. You should have two files in your folder. I’ll be using Jupyter notebook to write and execute the code.
Let’s import some libraries:
import pandas as pd
import os
import re
import matplotlib.pyplot as plt
from matplotlib import style

style.use("ggplot")
%matplotlib inline
Next we’ll read those two files. Since the files contain blank lines as well, we’ll discard those, and we’ll strip the remaining lines, i.e. remove whitespace from the beginning and end of each line.
def read_lines(path):
    # return the non-blank lines of a UTF-8 text file, stripped of surrounding whitespace
    with open(path, encoding="utf8") as f:
        return [l.strip() for l in f if len(l.strip()) > 0]

clickbaits = read_lines("./clickbait_data")
nonclickbaits = read_lines("./non_clickbait_data")
len(clickbaits), len(nonclickbaits)
You should get the same number as I got.
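If you would rather skip the manual extraction step, the archives can probably be read directly with Python's gzip module. This is only a sketch: read_gz_lines is a helper name made up for this example, and the .gz file names in the comments are assumptions about what the download is called.

```python
import gzip

def read_gz_lines(path):
    """Read non-blank, stripped lines straight from a gzip-compressed text file."""
    # "rt" opens the archive in text mode so we get str lines, not bytes.
    with gzip.open(path, "rt", encoding="utf8") as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical usage; adjust the file names to match your download:
# clickbaits = read_gz_lines("./clickbait_data.gz")
# nonclickbaits = read_gz_lines("./non_clickbait_data.gz")
```

The filtering is the same as in read_lines, so both approaches should give identical lists.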
Now, to create a labeled dataset, we need to run the following. All we are doing is concatenating the two lists clickbaits and nonclickbaits into a single list, titles. In the second line, we create labels for our data: 1 indicates that a headline is clickbait and 0 indicates that it is not. So we generate a list with [1] * len(clickbaits), which creates a list of size len(clickbaits) where each element is 1. We do the same thing to generate a list of 0s and finally concatenate them into a single list called labels.
titles = clickbaits + nonclickbaits
labels = [1] * len(clickbaits) + [0] * len(nonclickbaits)
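As a quick illustration of the list arithmetic, here is a toy version with made-up headlines (the sample strings and the sample_ variable names are invented for this sketch). It also shows a cheap sanity check that titles and labels line up.

```python
from collections import Counter

# Made-up stand-ins for the real lists read from the dataset files.
sample_clickbaits = [
    "21 Things You Need To See",
    "Why You Should Do X",
    "You Won't Believe This",
]
sample_nonclickbaits = [
    "Parliament passes budget bill",
    "Floods hit coastal towns",
]

sample_titles = sample_clickbaits + sample_nonclickbaits
sample_labels = [1] * len(sample_clickbaits) + [0] * len(sample_nonclickbaits)

print(sample_labels)           # [1, 1, 1, 0, 0]
print(Counter(sample_labels))  # Counter({1: 3, 0: 2})

# Every title should have exactly one label.
assert len(sample_titles) == len(sample_labels)
```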
We also split the data into training and testing sets.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(titles, labels)
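Called without extra arguments, train_test_split holds out 25% of the data and shuffles differently on every run, which is presumably why the per-class support counts in the reports further down are not identical. If you want a repeatable split with the same class ratio in both parts, something like the following sketch should work. The toy_ lists are made up, and the seed value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real `titles` and `labels` lists.
toy_titles = [f"headline {i}" for i in range(8)]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_titles,
    toy_labels,
    test_size=0.25,       # same fraction as the default
    random_state=42,      # fixed seed so the split is repeatable
    stratify=toy_labels,  # keep the class ratio the same in both parts
)

print(len(x_tr), len(x_te))  # 6 2
print(sorted(y_te))          # [0, 1]
```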
Rule Based Model
Whenever I think of clickbait, these patterns usually come to mind:
- 45 facts about X you never knew. You won’t believe number 53!
- Why you should do X
- … and here is why
So let’s create a simple rule-based model. Usually, these headlines start with Wh words like Why, or with digits. So our first model will simply classify a headline as clickbait if it starts with a Wh word or a digit.
First let’s create a helper function to print prediction results.
from sklearn.metrics import classification_report, accuracy_score

def print_pred_report(y_true, y_pred):
    # print overall accuracy, then per-class precision, recall and F1
    print("accuracy = ", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))
We’ll use regular expressions to check whether the title starts with a digit or a Wh word. If it does, we’ll classify it as clickbait.
# simple model based on whether the title starts with "wh" words or digits
starts_wh_re = re.compile(r"^(What|How|Why|When|Which|Who|Where)", re.I)
starts_digit_re = re.compile(r"^\d+")

preds = [
    starts_wh_re.match(title) is not None or starts_digit_re.match(title) is not None
    for title in x_test
]
print_pred_report(y_test, preds)
accuracy = 0.7405
              precision    recall  f1-score   support

           0       0.67      0.97      0.79      4016
           1       0.94      0.51      0.66      3984

   micro avg       0.74      0.74      0.74      8000
   macro avg       0.80      0.74      0.73      8000
weighted avg       0.80      0.74      0.73      8000
Not bad for a simple rule-based model, but we can do better. The recall for class 1, i.e. clickbait, is quite low at 0.51, meaning the model missed about half of the clickbait headlines. Obviously, not all clickbait starts with a Wh word or a digit, so this is expected. Precision for non-clickbait is also quite low, at only 0.67.
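To see why missed clickbait hurts both numbers at once, here is a small sketch that computes precision and recall by hand. The eight labels are invented for illustration (they are not taken from the dataset), and precision_recall is a helper written for this example, not part of sklearn.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four clickbait titles, but the rule only catches two of them.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0]

p1, r1 = precision_recall(toy_true, toy_pred, positive=1)
p0, r0 = precision_recall(toy_true, toy_pred, positive=0)
print(f"class 1: precision={p1:.2f} recall={r1:.2f}")  # class 1: precision=1.00 recall=0.50
print(f"class 0: precision={p0:.2f} recall={r0:.2f}")  # class 0: precision=0.67 recall=1.00
```

The two missed clickbait titles lower the recall of class 1 and, because they are counted as class 0 predictions, the precision of class 0 as well. That is the same pattern the report above shows.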
Let’s improve this with a simple logistic regression model.
Logistic Regression Model
We’ll train a logistic regression model to classify the headlines. But first, we need to convert the headlines, which are text, into numeric data so that the model can be trained and queried. To create features from the text, we’ll use TfidfVectorizer. We also need to fit this vectorizer so that it can learn the vocabulary (the unique words in the given dataset), what score to assign to each word, and so on. Tf-Idf typically gives more weight to words that appear in few headlines than to words that appear in many. Check these links for more info
Note that we are not doing any pre-processing on the words like stop word removal, stemming etc. but you could definitely try it out.
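To make the weighting idea concrete, here is a rough sketch of the inverse document frequency part computed by hand on three made-up headlines. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it uses plain whitespace splitting rather than the vectorizer's own tokenizer, so treat the numbers as illustrative.

```python
import math
from collections import Counter

toy_docs = [
    "you won't believe these photos",
    "you need these things",
    "parliament passes budget",
]

n_docs = len(toy_docs)
# Document frequency: in how many headlines does each word appear?
df = Counter(word for doc in toy_docs for word in set(doc.split()))

# Smoothed inverse document frequency (assumed to match scikit-learn's default).
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

for word in ("you", "parliament"):
    print(word, round(idf[word], 3))
# you 1.288
# parliament 1.693
```

The word that appears in two headlines gets a smaller idf than the word that appears in only one, which is the "rarer words weigh more" behaviour described above.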
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(x_train)
Now that our vectorizer is fitted, we can create our model and train it. For this demo we’ll just use LogisticRegressionCV with default parameters. You can try other models as well.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

cls = LogisticRegressionCV()
cls.fit(tfidf.transform(x_train), y_train)

# bundle the vectorizer and the classifier so raw headlines can be passed in directly
pipe = make_pipeline(tfidf, cls)

preds = cls.predict(tfidf.transform(x_test))
print_pred_report(y_test, preds)
accuracy = 0.9745
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      4066
           1       0.98      0.97      0.97      3934

   micro avg       0.97      0.97      0.97      8000
   macro avg       0.97      0.97      0.97      8000
weighted avg       0.97      0.97      0.97      8000
97% accuracy! Nice. Precision and recall for both classes are very good. But what did the model learn? What can we learn from what the model learned?
To see what the model learned, we’ll use a library called eli5, which can help you visualize models built with sklearn and a few other libraries. Read its documentation at https://eli5.readthedocs.io/en/latest/ We just need to call the show_weights function, which will show the 50 most influential words for each category.
import eli5

eli5.show_weights(cls, top=50, vec=tfidf)
As we can see, the top words are you, this, these, your, buzzfeed, 21, people, 17, 2015, we, things, 2016, how, 19, 23, my, adorable. Notice how many of these are pronouns and numbers.
Now let’s see the predictions for each sentence and visualize the output.
from IPython.display import HTML, display_html

for idx in range(40):
    print(x_train[idx], y_train[idx])
    tbl = eli5.show_prediction(cls, x_train[idx], vec=tfidf)
    display_html(HTML(tbl.data))
For each word, green means the word has a positive impact on the predicted category and red means it has a negative impact. The intensity of the color indicates how much weight that word carries.
Many of the headlines from clickbait articles (y=1) contain words with great influence, such as 23, worst, relationship, anyone, these, and you, whereas normal headlines seem to have words like the names of countries.
Finally, to see frequently used words in clickbaits and normal titles, let’s plot a couple of charts.
from collections import Counter

def plot_freq(sentences, n=40, title="Word frequency"):
    # count lowercase, whitespace-separated words across all sentences
    words = [word.strip() for sentence in sentences for word in sentence.lower().split()]
    c = Counter(words)
    wc = pd.DataFrame(c.most_common(n), columns=["word", "count"])
    plt.figure(figsize=(20, 8))
    plt.title(title)
    plt.xticks(rotation=-45)
    plt.bar(wc["word"], wc["count"])

plot_freq(clickbaits, title="Word frequency of top words in clickbait titles")
plot_freq(nonclickbaits, title="Word frequency of top words in non-clickbait titles")
In normal headlines, many of the most frequent words are stopwords like in, the, etc. We can experiment with stopword removal to see what difference it makes. We can use the stop_words parameter of TfidfVectorizer, either with its built-in English list or with our own list of stopwords.
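As a rough idea of what stopword removal does to the frequency counts, here is a sketch on three made-up headlines. TOY_STOPWORDS is a deliberately tiny hand-picked list and top_words is a helper written for this example; a real experiment would use a fuller list.

```python
from collections import Counter

# A deliberately tiny, hand-picked stopword list for illustration only.
TOY_STOPWORDS = {"in", "the", "of", "to", "a", "for", "on"}

def top_words(sentences, n=3, stopwords=frozenset()):
    """Most common lowercase words, skipping anything in `stopwords`."""
    words = [
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in stopwords
    ]
    return Counter(words).most_common(n)

toy_headlines = [
    "Floods in the north of the country",
    "The country votes in the election",
    "Election results in the north",
]

print(top_words(toy_headlines, n=2))  # [('the', 5), ('in', 3)]
# With the stopwords filtered out, content words take over the top spots.
print(top_words(toy_headlines, stopwords=TOY_STOPWORDS))
```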
We were able to distinguish clickbait from normal headlines quite accurately using a very simple logistic regression. Although the model performed really well on this dataset, it may or may not work in the wild. But I’m pretty sure that article headlines from BuzzFeed and similar websites, from which this dataset was compiled, will be correctly classified as clickbait.
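One way to probe how the approach behaves on headlines from outside the dataset is to wrap the vectorizer and the classifier in a single pipeline and feed it raw strings. The sketch below does this on a tiny made-up training set, with plain LogisticRegression swapped in for LogisticRegressionCV because the toy set is too small for cross-validation. The headlines and the toy_ names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data; 1 = clickbait, 0 = not clickbait.
toy_titles = [
    "things you need to see",
    "you will love these photos",
    "parliament passes budget bill",
    "floods hit coastal towns",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one step.
toy_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipe.fit(toy_titles, toy_labels)

new_headlines = ["17 things you should see", "Senate passes budget bill"]
# Column 1 of predict_proba is the estimated probability of class 1 (clickbait).
probs = toy_pipe.predict_proba(new_headlines)[:, 1]
for headline, prob in zip(new_headlines, probs):
    print(f"{prob:.2f}  {headline}")
```

The same calls should work on the pipe object built earlier from the fitted tfidf and cls, so you can paste in headlines from sites that are not in the dataset and see how confident the model is.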