Objective
- Classify news articles into categories
- Learn the basics of natural language processing
- Build models using sklearn and choose the best one
- Use sklearn’s Pipeline class
In this post we’ll classify news articles into different categories. First, download the dataset from http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip and extract it. The dataset consists of 2225 documents and 5 categories: business, entertainment, politics, sport, and tech.
Let’s import necessary libraries and functions.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_files
DATA_DIR = "./bbc/"
We’ll use the load_files function, which loads text files using subfolder names as categories. Our dataset already has the articles organized into separate folders. After loading the data, we’ll also check how many articles there are per category.
data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace")
# calculate count of each category
labels, counts = np.unique(data.target, return_counts=True)
# convert data.target_names to np array for fancy indexing
labels_str = np.array(data.target_names)[labels]
print(dict(zip(labels_str, counts)))
> {'tech': 401, 'sport': 511, 'business': 510, 'entertainment': 386, 'politics': 417}
Each category has a different number of articles. However, the dataset does not look too imbalanced, so the model should be able to learn properly.
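Since matplotlib is already imported, we can also visualize this distribution with a quick bar chart (an optional, purely illustrative step):
# bar chart of the per-category counts computed above
plt.bar(labels_str, counts)
plt.xlabel("category")
plt.ylabel("number of articles")
plt.show()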
Data preparation
Now we’ll split the data into training and test sets, then print out the first 80 characters of a few samples.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
list(t[:80] for t in X_train[:10])
Before we go further, let’s quickly go through the steps of a typical natural language processing pipeline:
- Tokenize, i.e. split the text into words
- Convert all letters to either upper or lower case
- Remove stopwords, e.g. “the”, “an”, “with”
- Perform stemming or lemmatization to reduce inflected words to their stem, e.g. transportation -> transport, transported -> transport
- Vectorize (count, binary, or TF-IDF features)
Many libraries already exist to perform all of the steps mentioned above.
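As a rough illustration of the first four steps, here is a minimal sketch using NLTK (NLTK is not used elsewhere in this post, and you would need to install it and download its punkt and stopwords data first):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha()]          # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    return [stemmer.stem(t) for t in tokens]             # reduce each word to its stem
print(preprocess("The buses were transported to the new transportation hub."))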
The data is in textual format and we cannot use it as is. We need to convert it to a numerical format. A very common method, among others, is to calculate a TF-IDF matrix. TF stands for term frequency, in which we count how many times a term/word appears in a document. IDF stands for inverse document frequency, which measures how important a word is; in simple terms, it gives more weight to rare words than to common ones. Once we calculate both TF and IDF, we can simply multiply them together to obtain the TF-IDF value.
tfidf(t, d, D) = tf(t, d) * idf(t, D)
where,
- t is a term
- d is a document
- D is set of all documents
For details about TF-IDF, check:
- http://www.tfidf.com/
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
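As a toy example of the plain formula above (note that sklearn’s TfidfVectorizer uses a smoothed IDF and normalizes each row, so its exact numbers will differ):
import math
docs = [
    "stocks fell as markets reacted",
    "the team won the match",
    "stocks and shares rallied",
]
def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)           # term frequency within one document
def idf(term, docs):
    df = sum(1 for d in docs if term in d.split())  # number of documents containing the term
    return math.log(len(docs) / df)
def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
print(tfidf("stocks", docs[0], docs))   # appears in 2 of 3 documents -> lower weight
print(tfidf("markets", docs[0], docs))  # appears in 1 of 3 documents -> higher weight
In practice, we’ll let sklearn compute this for us: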
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000, decode_error="ignore")
vectorizer.fit(X_train)
We used TfidfVectorizer to calculate TF-IDF. When initializing the vectorizer, we passed stop_words="english", which tells sklearn to discard commonly occurring English words. We also set max_features to 1000, so the vectorizer builds a vocabulary of the top 1000 words (by frequency). This means that each text in our dataset will be converted to a vector of size 1000.
Next, we call the fit function to “train” the vectorizer on the training texts. To fit the vectorizer and convert the texts into a TF-IDF matrix in one step, we can use fit_transform, which is equivalent to:
vectorizer.fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
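As a quick sanity check (optional; get_feature_names_out is available in scikit-learn 1.0 and later, older versions call it get_feature_names), we can confirm the shape of the vectorized training data and peek at a few learned terms:
X_train_vectorized = vectorizer.transform(X_train)
print(X_train_vectorized.shape)                 # (number of training documents, 1000)
print(vectorizer.get_feature_names_out()[:10])  # first few words in the learned vocabulary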
Important: we should fit the vectorizer using only the training data; otherwise, information from the test set leaks into the model, which is cheating.
Model
We’ll create a simple naive Bayes model first.
from sklearn.naive_bayes import MultinomialNB
cls = MultinomialNB()
# transform the list of text to tf-idf before passing it to the model
cls.fit(vectorizer.transform(X_train), y_train)
from sklearn.metrics import classification_report, accuracy_score
y_pred = cls.predict(vectorizer.transform(X_test))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.958707360862
precision recall f1-score support
0 0.95 0.95 0.95 123
1 0.99 0.94 0.96 100
2 0.92 0.96 0.94 95
3 0.97 1.00 0.98 115
4 0.97 0.94 0.96 124
avg / total 0.96 0.96 0.96 557
About 96% accuracy! Not bad. Let’s see if we can find a better model. We’ll train several models using sklearn Pipelines. A Pipeline chains the steps a model needs to do its job: in our case, converting the raw texts into vectors and then passing them to the classifier. Grouping these related steps means we can treat the Pipeline object as a model itself, i.e. call its fit and predict functions.
For this demo, we’ll create four different pipelines that combine CountVectorizer and TfidfVectorizer for vectorization with SGDClassifier and SVC (support vector classifier) as models. Then, using the cross_val_score function with 2-fold cross-validation, we’ll record each model’s mean accuracy. We’ll pick the highest-performing model, train it on the full training set, and evaluate it on the test set.
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score
# four pipelines: SGD classifier and linear SVC,
# each with either pure count or tf-idf features
sgd = Pipeline([
    ("count_vectorizer", CountVectorizer(stop_words="english", max_features=3000)),
    ("sgd", SGDClassifier(loss="modified_huber"))
])
sgd_tfidf = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer(stop_words="english", max_features=3000)),
    ("sgd", SGDClassifier(loss="modified_huber"))
])
svc = Pipeline([
    ("count_vectorizer", CountVectorizer(stop_words="english", max_features=3000)),
    ("linear_svc", SVC(kernel="linear"))
])
svc_tfidf = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer(stop_words="english", max_features=3000)),
    ("linear_svc", SVC(kernel="linear"))
])
all_models = [
    ("sgd", sgd),
    ("sgd_tfidf", sgd_tfidf),
    ("svc", svc),
    ("svc_tfidf", svc_tfidf),
]
unsorted_scores = [(name, cross_val_score(model, X_train, y_train, cv=2).mean()) for name, model in all_models]
scores = sorted(unsorted_scores, key=lambda x: -x[1])
print(scores)
[('svc_tfidf', 0.973026575899821), ('svc', 0.95623710562069142), ('sgd_tfidf', 0.95384189603985314), ('sgd', 0.93645074796385619)]
The support vector machine with TF-IDF features scored the highest mean accuracy, about 97%. Let’s train it and evaluate it on the test dataset.
model = svc_tfidf
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.978456014363
precision recall f1-score support
0 0.99 0.94 0.97 141
1 0.98 1.00 0.99 96
2 0.96 0.99 0.98 99
3 0.97 1.00 0.99 114
4 0.98 0.97 0.98 107
avg / total 0.98 0.98 0.98 557
98% accuracy! Unlike before, we don’t have to vectorize the documents manually before passing them to the model, since the vectorization step is defined in the pipeline itself.
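For instance, we can now pass raw text straight to the model (an illustrative snippet; the sample sentence is made up and the exact prediction may vary):
sample = ["The striker scored twice as the home side won the cup final."]
pred = model.predict(sample)
print(np.array(data.target_names)[pred])  # map the numeric label back to a category name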