This article is Part 3 in a 4-Part Natural Language Processing with Python.
Clustering is a process of grouping similar items together. Each group, also called as a cluster, contains items that are similar to each other. Clustering algorithms are unsupervised learning algorithms i.e. we do not need to have labelled datasets. There are many clustering algorithms for clustering including KMeans, DBSCAN, Spectral clustering, hierarchical clustering etc and they have their own advantages and disadvantages. The choice of the algorithm mainly depends on whether or not you already know how many clusters to create. Some algorithms such as KMeans need you to specify number of clusters to create whereas DBSCAN does not need you to specify. Another consideration is whether you need the trained model to able to predict cluster for unseen dataset. KMeans can be used to predict the clusters for new dataset whereas DBSCAN cannot be used for new dataset.
In this post we’ll classify news articles into different categories. First download the dataset from http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip and extract. The dataset consists of 2225 documents and 5 categories: business, entertainment, politics, sport, and tech. For most part, we’ll ignore the labels but we’ll use them while evaluating the trained model since many of the evaluation metrics need the “true” labels.
Let’s import the required libraries.
import pandas as pd import numpy as np from sklearn.cluster import MiniBatchKMeans from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import PCA import matplotlib.pyplot as plt
Now we can read the data. If you look at the extracted zip, you’ll see there are 5 folders each containing articles. Each folder indicates a category and the articles contained in that folder belong to that category. As said before we aren’t concerned about the categories but it is there. To load data in this kind of format, sklearn has easy utility function called
load_files which load text files with categories as subfolder names.
from sklearn.datasets import load_files # for reproducibility random_state = 0 DATA_DIR = "./bbc/" data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace", random_state=random_state) df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label']) df.head()
|0||Tate & Lyle boss bags top award\\n\\nTate & Lyle...||0|
|1||Halo 2 sells five million copies\\n\\nMicrosoft ...||4|
|2||MSPs hear renewed climate warning\\n\\nClimate c...||2|
|3||Pavey focuses on indoor success\\n\\nJo Pavey wi...||3|
|4||Tories reject rethink on axed MP\\n\\nSacked MP ...||2|
For each article in our dataset, we’ll compute TF-IDF values. If you are not familiar with TF-IDF or feature extraction, you can read about them in the second part of this tutorial series called “Text Feature Extraction”.
vec = TfidfVectorizer(stop_words="english") vec.fit(df.text.values) features = vec.transform(df.text.values)
Now we have our feature matrix, we can feed to the model for training.
Let’s create an instance of
KMeans. I’ll choose 5 as the number of clusters since the dataset contains articles that belong to one of 5 categories. Obviously, if you do not have labels then you won’t exactly know how many clusters to create so you have to find the best one that fits your needs via running multiple experiements and using domain knowledge to guide you.
Creating a model is pretty simple.
cls = MiniBatchKMeans(n_clusters=5, random_state=random_state) cls.fit(features)
That is all it takes to create and train a clustering model. Now to predict the clusters, we can call
predict function of the model. Note that not all clustering algorithms can predit on new datasets. In that case, you can get the cluster labels of the data that you used when calling the
fit function using
labels_ attribute of the model.
# predict cluster labels for new dataset cls.predict(features) # to get cluster labels for the dataset used while # training the model (used for models that does not # support prediction on new dataset). cls.labels_
To visualize, we’ll plot the features in a 2D space. As we know the dimension of features that we obtained from TfIdfVectorizer is quite large ( > 10,000), we need to reduce the dimension before we can plot. For this, we’ll ues PCA to transform our high dimensional features into 2 dimensions.
# reduce the features to 2D pca = PCA(n_components=2, random_state=random_state) reduced_features = pca.fit_transform(features.toarray()) # reduce the cluster centers to 2D reduced_cluster_centers = pca.transform(cls.cluster_centers_)
Now that we have reduced our features and cluster centers into 2D, we can plot those points using a scatter plot. The first dimension will be used as X values and second dimension will be used as Y values in a XY plot. We also assign colors to each items using the predicted cluster labels so that items in a same cluster will be represented with same color. This is done by passing the labels to
c parameter in
scatter function. Again, if the clustering algorithm does not support
predict function or if you want to visualize in the training data itself, you can also use
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=cls.predict(features)) plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=150, c='b')
From the plot below we can see that apart from the rightmost cluster, others seems to be scattered all around and overlapping as well.
Evaluation for unsupervised learning algorithms is a bit difficult and requires human judgement but there are some metrics which you might use. There are two kinds of metrics you can use depending on whether or not you have the labels. For most of the clustering problems, you probably won’t have labels. If you had you’d do classification instead. But it is what it is. Here are a couple of them which I want to show you but you can read about other metrics on your own.
Evalauation with labelled dataset
If you have labelled dataset then you can use few metrics that give you an idea of how good your clustering model is. The one I’m going to show you here is
homogeneity_score but you can find and read about many other metrics in
sklearn.metrics module. As per the documentation, the score ranges between 0 and 1 where 1 stands for perfectly homogeneous labeling.
from sklearn.metrics import homogeneity_score homogeneity_score(df.label, cls.predict(features))
Evaluation with unlabelled dataset
If you don’t have labels for your dataset, then you can still evaluate your clustering model with some metrics. One of them is Silhouette Coefficient. From the sklearn’s documentation:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (
a) and the mean nearest-cluster distance (
b) for each sample. The Silhouette Coefficient for a sample is
(b - a) / max(a,b). To clarify,
b is the distance between a sample and the nearest cluster that the sample is not a part of.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
from sklearn.metrics import silhouette_score silhouette_score(features, labels=cls.predict(features))
So this value means that our clusters are overlapping. We can also see this in the plot above. Perhaps tuning different parameters for feature extractor and the clustering model will increase this score.
This post showed you how to cluster text using KMeans algorithm. You can cluster any kind of data, not just text and can be used for wide variety of problems. While the evaluation of clustering algorithms is not as easy compared to supervised learning models, it is still desirable to get an idea of how your model is performing. However, it is important to know what the metrics are measuring and if they are biased or have any limitations.
This article is Part 3 in a 4-Part Natural Language Processing with Python.