In this guide, you will work through hands-on activities with Python's spaCy library and learn the basic NLP terminology around it.
What is NLP?
Natural Language Processing is the broad field of getting machines to process and analyze human-language text. An example is text categorization.
What is NLU?
Natural Language Understanding is a subfield of NLP focused on getting machines to understand the meaning of a text. An example is text summarization.
What is spaCy?
spaCy is an open-source NLP library used to analyze human language in a computer system.
How to install spaCy?
Go to this link: https://spacy.io/usage, and select the parameters that match your setup. I suggest you do not select GPU for Hardware: since you are just starting to learn spaCy, it is best to skip those advanced features for now. Also, because we will be working with English text, you must check the English language under the Trained pipelines parameter. Once you have selected the parameters, the page will show the commands you can run in the terminal (CMD/command prompt) to install spaCy.
The image below shows my selected parameters:
The last command downloads en_core_web_sm, a small English pipeline trained on written web text (blogs, news, etc.). We will learn more about this model later.
Throughout this guide, I will be using Jupyter Notebook. Now that we have installed spaCy, let’s import it using:
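A minimal check that the install worked (the version print is my addition, just to confirm everything imports cleanly):

```python
import spacy

# Print the installed version to confirm the install worked
print(spacy.__version__)
```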
Now that we have downloaded the small English model, let's load it using the spacy.load command.
The first thing we need to learn in spaCy is containers.
What are containers?
Containers are objects that hold text data together with the annotations spaCy produces for it. While using spaCy, we work with different containers. Here is a complete list of containers:
Let’s visualize the Doc object, so you can get a brief understanding of containers.
In a Doc object, each sentence is represented by a Span (accessed through sents), each word by a Token, and any slice of tokens by a Span; related spans can be collected in a SpanGroup.
Now that we have imported the spaCy library and loaded the small English model, we need some text data to work with.
I am using text data about the United States from the Wikipedia page.
Here is the link to that txt file: https://raw.githubusercontent.com/wjbmattingly/freecodecamp_spacy/main/data/wiki_us.txt
Let’s read that text file in a variable mytext:
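A minimal sketch of this step; the urlretrieve part is my addition so the snippet also works if you have not downloaded the file manually yet:

```python
import os
from urllib.request import urlretrieve

URL = "https://raw.githubusercontent.com/wjbmattingly/freecodecamp_spacy/main/data/wiki_us.txt"

# Download the file once if it is not already in the working directory
if not os.path.exists("wiki_us.txt"):
    urlretrieve(URL, "wiki_us.txt")

# Read the whole file into a single string
with open("wiki_us.txt", "r", encoding="utf-8") as f:
    mytext = f.read()

print(mytext[:120])  # peek at the first 120 characters
```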
The output of the mytext variable is a string:
Normally we clean our dataset as soon as we load it, removing unwanted characters from the text, but this guide is focused on the features of spaCy.
Now that we have imported our text data, let's create a Doc object from it using the small English model that we loaded earlier into the nlp variable.
In the previous code, spaCy creates a Doc object, i.e., it breaks our text into tokens, sentences, and so on. As mentioned earlier, the small English model is trained on news and blogs, which helps spaCy identify the sentences and words in the given text.
Let’s look at the tokens (separate words) of our doc object using for loop:
If we did the same task, i.e., breaking the text into separate words, with the regex library, it would be time-consuming, and it would be hard for us to decide what should be treated as a word. This is the first advantage of spaCy: it does the same task within a few seconds using the pre-trained small English model.
Similarly, we can look at the sentences of our doc object by looping over doc.sents:
These are the first two sentences of our doc object. spaCy used sentence boundary detection to break your text into sentences.
How well spaCy detects tokens (words) and sents (sentences) depends on the pipeline we loaded (en_core_web_sm, the small English model). Tokenization itself is rule-based, but sentence segmentation relies on the trained components, so a larger English model, trained on more data, can segment sentences more accurately.
If you want to grab a single sentence, you need to convert doc.sents into a list first, because doc.sents is a generator object, which is not subscriptable.
Let’s grab the first sentence of our doc object, and then select the second word of that sentence.
You may think that the second_token variable contains a plain string value, i.e., United, and nothing else, but this is not true; it carries more information with it.
By checking the type of the second_token variable, we can see that it is not a string but a Token object.
Now we will work with some Token attributes. To get the string value of the token, we use the .text attribute:
It will return a raw string value.
If a token belongs to a multi-word phrase, i.e., a group of words with a collective meaning, we can use the left_edge property to find the first token of that token's syntactic subtree, and the right_edge property to find the last one.
It returns the same word here because the second_token variable holds a single-word token with no children of its own; for a token that heads a multi-word phrase, left_edge and right_edge return different tokens.
To find the entity type of the token we can use:
ent_type_ is used to recognize entities such as locations, companies, etc.
.lemma_ is used to convert our selected token into its root form (its lemma).
The word United is already in its root form, but if our token were a verb like known, lemma_ would return know, i.e., the root form of the word.
Let’s apply some rules of grammar to our single-word token.
First, we use the pos_ attribute (to identify the part of speech):
It is considered a proper noun (PROPN). You may need to spend some time learning grammar rules, because NLP does require an understanding of grammar.
To check what grammatical role our selected token plays in the sentence, we can use the dep_ attribute:
To check the language of the doc from which this token was extracted, we can use the lang_ attribute:
It returns the language of the parent doc to which this token belongs; in our case, it returns en, i.e., the English language.