This article is Part 1 of a 5-part series on Natural Language Processing with Python.
Introduction
To put it simply, Natural Language Processing (NLP) is a field concerned with making computers understand human language. NLP techniques are applied heavily in information retrieval (search engines), machine translation, document summarization, text classification, natural language generation, etc. In this series of posts, we’ll go through the basics of NLP and build some applications, including a search engine, a document classification system, a machine translation system and a chatbot.
A typical NLP application flows from raw text through pre-processing and feature extraction to a model. In this post, we’ll focus on pre-processing.
Pre-processing
In this section, I’ll introduce some of the common pre-processing steps. As input, we have a text. It could be a news article, a search query, instructions for a chatbot, etc. We feed this input to a pre-processing step where we extract the tokens, which could be words, phrases or even sentences, and clean the input text, i.e. fix spelling mistakes, remove useless words (stop-words), augment the words with their part of speech, and so on. What we do in this step depends on the problem we are trying to solve, but for many applications tokenization, stop-word removal and stemming are fairly common.
Let’s take an example input:
text = "This warning shouldn't be taken lightly."
I’ll use this example to demonstrate different pre-processing steps.
Tokenization
Tokenization is the process of splitting text into pieces. These pieces are called tokens. A token could be a word, a phrase or even a sentence. In many applications, tokenization refers to splitting the text into words, and word tokenization is what I’ll demonstrate here.
There are different tokenization strategies. A simple strategy is to treat the space character as a separator and discard punctuation characters from the text, which leaves us with the words. In Python, we can use the split function with a space as the separator to get a list of words from a text.
print(text.split(sep=" "))
# text.split() with no arguments also works; it splits on any whitespace
['This', 'warning', "shouldn't", 'be', 'taken', 'lightly.']
With just one function call, it seems we got the result. But look at it carefully: the tokens shouldn't and lightly. still contain punctuation characters. We need to remove them, and for that we can use regular expressions.
First we’ll install the regex library, since the builtin re module in Python does not support Unicode categories. Its API is the same as that of the re module, but it is more flexible. You can install it by running the following:
pip install regex
Note that the code below does not work with the builtin re module because it cannot recognize \p{P}, which means “match any punctuation character”.
import regex as re
clean_text = re.sub(r"\p{P}+", "", text)
print(clean_text.split())
['This', 'warning', 'shouldnt', 'be', 'taken', 'lightly']
Now it seems that the punctuation characters are gone. But there is a problem with the token shouldnt. Should it be further divided into should and not, or should and nt, or should we keep the punctuation and leave it as shouldn't? There are many other scenarios where this simple approach would not work. For example, in tweets it is common to use smileys like :) and :( and hashtags like #python. Our tokenizer would completely remove such characters and we would lose a lot of meaning from the text. Fortunately, there are libraries that implement advanced tokenizers that can deal with such scenarios. I’ll be using the spaCy library for the demonstration, but you can use others like NLTK.
Installation:
pip install spacy
python -m spacy download en_core_web_sm
This will install spaCy and also download an English language model. spaCy provides a number of models trained on different datasets and in different languages. Check them out at https://spacy.io/usage/models
Now we’ll load the English model and run spaCy on our text. To show how the tokenizer handles smileys and hashtags, we’ll also extend the example text with :) and #python.
import spacy
nlp = spacy.load('en_core_web_sm')
# extend the example text with a smiley and a hashtag
text = "This warning shouldn't be taken lightly :) #python."
doc = nlp(text)
print([token.text for token in doc])
['This', 'warning', 'should', "n't", 'be', 'taken', 'lightly', ':)', '#', 'python', '.']
Now the tokenization looks much better. The punctuation characters are still present, but we can easily remove them. Every token produced by spaCy is of type spacy.tokens.token.Token and has a number of properties. Among them are a few that start with is_*, e.g. is_digit, is_punct, is_stop, etc., which can be used to determine what kind of token it is.
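For example, here is a minimal sketch (building on the doc created above) that uses is_punct to drop the punctuation tokens:
# keep only the tokens that spaCy does not flag as punctuation
print([token.text for token in doc if not token.is_punct])
# note: depending on the model/version, tokens like ':)' or '#' may also be flagged
# the same pattern works with the other flags, e.g. token.is_digit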
Stopword Removal
Stop-words are words that occur frequently but don’t carry much meaning on their own. For example, a, an and the occur very frequently and can be discarded without any loss of meaning for most NLP tasks. Depending on the domain and language, there will be a different set of stop-words. In the case of the above example, we can easily figure out whether a word is a stop-word by checking the is_stop property of a spaCy Token.
print ([(token.text, token.is_stop) for token in doc])
[('This', False), ('warning', False), ('should', True), ("n't", False), ('be', True), ('taken', False), ('lightly', False), (':)', False), ('#', False), ('python', False), ('.', False)]
There are a couple of stop-words in our sentence: should and be. Stop-words are removed to reduce the size of the vocabulary (the set of unique words in our entire dataset) that we have to keep track of. This helps with faster computation and lower memory requirements, and most importantly it reduces noise.
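Putting the flags together, here is a small sketch that keeps only the tokens that are neither stop-words nor punctuation; exactly which tokens are dropped depends on the stop-word list shipped with the loaded model:
# keep tokens that are neither stop-words nor punctuation
content_tokens = [token.text for token in doc if not (token.is_stop or token.is_punct)]
print(content_tokens)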
Stemming
Stemming is the process of reducing words to their root form. For example, the stem of cats would be cat, and transportation would become transport. Again, this is done to reduce the size of the vocabulary, because for most applications the distinction between cats and cat is not important. For example, if a user searches for documents containing the word cats but we only have documents containing the word cat, the user would get zero results. But if we stem the user’s query, we would be able to retrieve some results.
A popular algorithm used for stemming is the Porter algorithm. spaCy does not provide stemming, but libraries like NLTK do.
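For illustration, here is a minimal sketch using NLTK’s PorterStemmer; it assumes NLTK is installed (e.g. via pip install nltk):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# reduce each word to its stem, e.g. 'cats' -> 'cat', 'transportation' -> 'transport'
print([stemmer.stem(word) for word in ["cats", "transportation", "warning"]])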
Stemming algorithms are mostly based on rules and the output is not always a valid word. Consider the following examples.
word | stem
--- | ---
meeting | meet
technology | technolog
In the first case the word meeting is stemmed to meet. If meeting is used as a verb in a sentence, e.g. “We are meeting tomorrow”, then this stemming is correct. But if meeting is used as a noun, e.g. in “I’m in a meeting now”, then we don’t want it altered. Stemming algorithms like Porter, however, don’t care about how the word is being used and produce the same output regardless, as the sketch below shows.
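We can check this by running NLTK’s Porter stemmer over two sentences like the ones above (again a rough sketch that assumes NLTK is installed):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# 'meeting' is stemmed to 'meet' in both sentences, even though it is
# a verb in the first one and a noun in the second one
print([stemmer.stem(word) for word in "we are meeting tomorrow".split()])
print([stemmer.stem(word) for word in "i am in a meeting now".split()])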
Lemmatisation
Lemmatisation is a more sophisticated version of stemming. The part of speech (POS) of each word is determined first, and then different rules are applied for different parts of speech. spaCy provides lemmatisation since it is much better than stemming, although it is a bit more computationally expensive. Let’s look at how we can get the lemma of a word.
print ([(token.text, token.lemma_) for token in nlp("we are meeting tomorrow")])
print ([(token.text, token.lemma_) for token in nlp("i am going to a meeting")])
[('we', '-PRON-'), ('are', 'be'), ('meeting', 'meet'), ('tomorrow', 'tomorrow')]
[('i', 'i'), ('am', 'be'), ('going', 'go'), ('to', 'to'), ('a', 'a'), ('meeting', 'meeting')]
We can see that the words have been reduced to their lemmas depending on their POS. In the first sentence, meeting is transformed into meet since it is used as a verb, but in the second sentence it is left unchanged since it is used as a noun. Similarly, are and am are both transformed into the same lemma be.
POS Tagging
Part-of-speech tagging is the process of determining the POS of each word in a text. POS tagging is a necessary step for many NLP applications like lemmatization, machine translation, sentiment analysis, etc. The techniques vary from a simple word-to-POS lookup table to deep-learning-based models. Check out the article at http://www.stat.columbia.edu/~madigan/DM08/hmm.pdf for an overview of different algorithms for POS tagging and this one for how spaCy works.
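To make the lookup-table idea concrete, here is a toy sketch; the table and the fallback tag below are made up for illustration, and a real lookup tagger would be derived from a tagged corpus:
# a hypothetical word-to-POS lookup table (illustrative only)
pos_table = {"this": "DET", "warning": "NOUN", "should": "VERB",
             "be": "VERB", "taken": "VERB", "lightly": "ADV"}

def lookup_tag(words, default="NOUN"):
    # tag each word by looking it up; unknown words fall back to a default tag
    return [(word, pos_table.get(word.lower(), default)) for word in words]

print(lookup_tag(["This", "warning", "should", "be", "taken", "lightly"]))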
Using our example document, we can print the POS of each token using the pos_ property as follows:
print ([(token.text, token.pos_) for token in doc])
[('This', 'DET'), ('warning', 'NOUN'), ('should', 'VERB'), ("n't", 'ADV'), ('be', 'VERB'), ('taken', 'VERB'), ('lightly', 'ADV'), (':)', 'NOUN'), ('#', 'NOUN'), ('python', 'NOUN'), ('.', 'PUNCT')]
Conclusion
There are many NLP libraries available in Python, including spaCy, NLTK, gensim, textblob, etc. Each of them focuses on a different aspect of NLP, but they can be used together to build a powerful NLP application. One thing to note is that these libraries use pre-trained models for many tasks, e.g. tokenization and POS tagging, and may not work as expected if the domain is different. For example, POS tagging might not work well for tweets, since words in tweets are often shortened on purpose and the model provided by the library might never have seen such words.
We briefly went through the methods commonly applied in the pre-processing step of an NLP application. In the next post, we’ll go through how we can convert words into features so that we can feed them to a model (chatbot, document classification) for training or inference.