Introduction

Machine learning models cannot work with raw text directly; they expect numeric input. So we need a way to transform input text into numeric features in a meaningful way. There are several approaches to this, and we’ll briefly go through some of them.

Methods

This section presents some of the techniques to transform text into a numeric feature space. For this demonstration, I’ll use sklearn and spacy.

For the demo, let’s create some sample sentences.

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"
]

Binary Encoding

A simple way to convert text into numeric features is binary encoding. In this scheme, we create a vocabulary from the distinct words in the whole dataset (corpus). For each document, the output is a vector of size N, where N is the total number of words in the vocabulary. Initially all entries in the vector are 0. If a vocabulary word appears in the given document, the vector element at that word’s position is set to 1. Let’s implement this to understand it.

First we need to create the vocabulary.

vocab = sorted(set(word for sentence in texts for word in sentence.split()))
print(len(vocab), vocab)
12 ['and', 'black', 'blue', 'car', 'crow', 'i', 'in', 'my', 'reflection', 'see', 'the', 'window']

We have 12 distinct words in our entire corpus. So our vocabulary contains 12 words. After transforming, each document will be a vector of size 12.

import numpy as np
def binary_transform(text):
    # create a vector with all entries as 0
    output = np.zeros(len(vocab))
    # tokenize the input
    words = set(text.split())
    # for every word in vocab check if the doc contains it
    for i, v in enumerate(vocab):
        output[i] = v in words 
    return output

print(binary_transform("i saw crow"))
[0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

The method is pretty simple. We loop through each word in our vocabulary and set the vector entry corresponding to that word to 1 if the input document contains it. When we apply the function to our example input, it produces a vector of size 12 in which the two entries corresponding to the vocabulary words crow and i are set to 1 while the rest are zero. Note that the word saw is not in the vocabulary and is completely ignored. This is true for all the methods discussed below, so it is recommended to build the vocabulary from a sufficiently large corpus so that it contains as many words as possible.

The sklearn library already provides this functionality. We can use the CountVectorizer class to transform a collection of documents into a feature matrix.

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=True)
vec.fit(texts)
print([w for w in sorted(vec.vocabulary_.keys())])
['and', 'black', 'blue', 'car', 'crow', 'in', 'my', 'reflection', 'see', 'the', 'window']

The vocabulary does not contain the word i because sklearn by default ignores single-character tokens, but other than that it looks exactly the same as the one before. Let’s visualize the transformation in a table. The columns are the words in the vocabulary and the rows represent the documents.
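To see why i disappears, here is a minimal sketch using only Python's re module. It assumes the documented default token_pattern of CountVectorizer, r"(?u)\b\w\w+\b", which requires at least two word characters; the relaxed pattern and the variable names are illustrative, not part of the original post.

```python
import re

sentence = "i see my reflection in the window"

# Default pattern used by CountVectorizer (per the sklearn docs):
# a token must contain at least two word characters.
default_pattern = r"(?u)\b\w\w+\b"

# Relaxed pattern (an assumption for illustration): one or more word characters.
relaxed_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, sentence)
relaxed_tokens = re.findall(relaxed_pattern, sentence)

print(default_tokens)  # "i" is missing here
print(relaxed_tokens)  # "i" is kept here
```

Passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer should keep single-character tokens such as i in the vocabulary, if that is what your application needs.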

import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))
   and  black  blue  car  crow  in  my  reflection  see  the  window
0    1      0     1    1     0   0   0           0    0    0       1
1    0      1     0    0     1   1   0           0    0    1       1
2    0      0     0    0     0   1   1           1    1    1       1

As expected, we have a matrix of size 3 × 11 (sklearn dropped the token i, so the vocabulary has 11 words rather than 12), and the entries are set to 1 accordingly. This is a simple representation of text and can be used in different machine learning models. However, many models perform much better with other techniques, since this one captures nothing beyond whether a word is present.

For more information about CountVectorizer visit: CountVectorizer docs

Counting

Counting is another approach to representing text as numeric features. It is similar to the binary scheme we saw earlier, but instead of just checking whether a word exists, it also records how many times the word appears. In sklearn we can use CountVectorizer to transform the text.

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=False)  # binary=False is the default, so this argument could be omitted
vec.fit(texts)

import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))
   and  black  blue  car  crow  in  my  reflection  see  the  window
0    1      0     2    1     0   0   0           0    0    0       1
1    0      1     0    0     1   1   0           0    0    1       1
2    0      0     0    0     0   1   1           1    1    1       1

In the first sentence, “blue car and blue window”, the word blue appears twice, so in the table the entry for blue in document 0 has a value of 2. The output carries a bit more information about the sentence than the binary transformation, since we also learn how many times each word occurred in the document. Essentially, we are giving each token a weight based on its number of occurrences. But this weighting scheme is not that useful for practical applications: words that occur frequently, such as a, an, have, etc., will have higher weights than others. Later in this series of posts, I’ll demonstrate its limitations when building a search engine.
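For comparison with the hand-written binary_transform from the Binary Encoding section, here is a minimal sketch of the same idea for counting. The function name count_transform is hypothetical, and the sketch uses plain whitespace splitting like the earlier example, so its vocabulary has 12 words (including i) rather than the 11 that sklearn produces.

```python
from collections import Counter

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

# Same whitespace-based vocabulary as in the binary encoding example.
vocab = sorted({word for sentence in texts for word in sentence.split()})

def count_transform(text):
    # Count how often each token occurs in the document.
    counts = Counter(text.split())
    # Counter returns 0 for words the document does not contain.
    return [counts[word] for word in vocab]

print(count_transform("blue car and blue window"))
# [1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

The only difference from the binary version is that each entry stores the number of occurrences instead of a 0/1 flag.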

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. We saw that the Counting approach assigns weights to words based on their frequency, so frequently occurring words get higher weights. But these words might not be as important as other words. For example, consider one article about Travel and another about Politics. Both articles will frequently contain words like a and the. But words such as flight and holiday will occur mostly in the Travel article, while parliament, court, etc. will appear mostly in the Politics article. Even though these words appear less frequently than the others, they are more important. TF-IDF assigns more weight to words that occur in few documents than to words that occur in most of them. It is based on the assumption that rarer words are more informative.

TF-IDF consists of two parts:

  • Term frequency: This is the same as the Counting method we saw before.
  • Inverse document frequency: This is responsible for reducing the weights of words that occur frequently and increasing the weights of words that occur rarely.

The formula to calculate tf-idf is:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where,

  • t is a term (word)
  • d is a document that this term is in
  • D is a collection of all documents
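The formula above leaves idf(t, D) undefined. As a rough sketch, assuming the common textbook definition idf(t, D) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t, the idf of a few words from our sample sentences can be computed like this. The helper name idf and the choice of natural logarithm are assumptions for illustration; libraries often use smoothed variants.

```python
import math

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

# Represent each document as the set of words it contains.
docs = [set(text.split()) for text in texts]
n_docs = len(docs)

def idf(term):
    # Document frequency: number of documents that contain the term.
    # Assumes the term occurs in at least one document (otherwise df is 0).
    df = sum(term in doc for doc in docs)
    return math.log(n_docs / df)

for term in ["window", "in", "blue"]:
    print(term, round(idf(term), 4))
# window 0.0
# in 0.4055
# blue 1.0986
```

The word window appears in all three documents, so under this definition it carries no weight, while blue, which appears in only one document, gets the largest idf.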

For more details on TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In sklearn, it is pretty easy to compute tf-idf weights.

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
vec.fit(texts)

import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))
        and     black      blue       car      crow        in       my  reflection      see       the    window
0  0.396875  0.000000  0.793749  0.396875  0.000000  0.000000  0.00000     0.00000  0.00000  0.000000  0.234400
1  0.000000  0.534093  0.000000  0.000000  0.534093  0.406192  0.00000     0.00000  0.00000  0.406192  0.315444
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.358291  0.47111     0.47111  0.47111  0.358291  0.278245
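The numbers in the table can be reproduced by hand. The sketch below assumes the documented defaults of TfidfVectorizer: tokens of at least two word characters, a smoothed idf of ln((1 + N) / (1 + df(t))) + 1, raw counts as term frequency, and L2 normalization of each row. The helper name tfidf_row is hypothetical.

```python
import math
import re
from collections import Counter

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

# Mimic the default tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", text.lower()) for text in texts]
n_docs = len(tokenized)
vocab = sorted({word for doc in tokenized for word in doc})

# Document frequency of every vocabulary word.
df = {word: sum(word in doc for doc in tokenized) for word in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    # tf * smoothed idf, following the assumed sklearn defaults.
    raw = [
        counts[word] * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word in vocab
    ]
    # L2-normalize so that each document vector has unit length.
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row0 = tfidf_row(tokenized[0])
print({word: round(value, 4) for word, value in zip(vocab, row0) if value > 0})
```

The non-zero entries should agree with the first row of the table up to rounding. Note that window, which occurs in every document, still gets a small non-zero weight because of the "+ 1" in the smoothed idf.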

Conclusion

In this post we briefly went through different methods for transforming text into numeric features that can be fed to a machine learning model. In the next post, we’ll combine everything we went through in this series to create our first text classification model.
