How to Create N-Grams From Text in Python
- Use the for Loop to Create N-Grams From Text in Python
- Use nltk to Create N-Grams From Text in Python
In computational linguistics, n-grams are important to language processing and contextual and semantic analysis. They are contiguous sequences of adjacent words (tokens) drawn from a string of text.
The most popular are unigrams, bigrams, and trigrams, and they are effective in practice; where n > 3, data sparsity can become a problem.
This article will discuss how to create n-grams in Python using features and libraries.
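To make the distinction between unigrams, bigrams, and trigrams concrete, here is a minimal sketch using plain Python slicing and zip (the example sentence and variable names are illustrative, not part of the methods covered below):

```python
tokens = "natural language processing is fun".split()

# unigrams: single tokens (n = 1)
unigrams = [(t,) for t in tokens]
# bigrams: pairs of adjacent tokens (n = 2)
bigrams = list(zip(tokens, tokens[1:]))
# trigrams: triples of adjacent tokens (n = 3)
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

print(unigrams)
print(bigrams)
print(trigrams)
```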
Use the for Loop to Create N-Grams From Text in Python
We can create an ngrams function that takes the text and the n value and returns a list containing the n-grams.
To build the function, we split the text and create an empty list (output) that will store the n-grams. We then use a for loop to iterate over the splitInput list, and each window of words (tokens) is appended to the output list.
def ngrams(input, num):
    # split the text into individual words (tokens)
    splitInput = input.split(" ")
    output = []
    # slide a window of size num across the token list
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i : i + num])
    return output
text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))
The output of the code:
[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
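The same result can be produced more compactly by zipping staggered slices of the token list. This is an illustrative alternative sketch (the ngrams_zip name is ours, not part of the code above):

```python
def ngrams_zip(text, num):
    tokens = text.split(" ")
    # zip the staggered slices tokens[0:], tokens[1:], ..., tokens[num-1:];
    # each tuple produced by zip is one n-gram
    return [list(gram) for gram in zip(*(tokens[i:] for i in range(num)))]

print(ngrams_zip("Welcome to the abode, and more importantly", 3))
```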
Use nltk to Create N-Grams From Text in Python
The NLTK library is a natural language toolkit that provides an easy-to-use interface to resources important for text processing and tokenization, among other things. To install nltk, we can use the pip command below.
pip install nltk
To show a potential issue before we move on to more detailed code, let's use the word_tokenize() method, which creates a tokenized copy of the text we pass to it using NLTK's recommended word tokenizer.
import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
The output of the code:
Traceback (most recent call last):
File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
tokens = nltk.word_tokenize(text)
File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
opened_resource = _open(resource_url)
File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
return find(path_, path + [""]).open()
File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/english.pickle
Searched in:
- 'C:\\Users\\akinl/nltk_data'
- 'C:\\Python310\\nltk_data'
- 'C:\\Python310\\share\\nltk_data'
- 'C:\\Python310\\lib\\nltk_data'
- 'C:\\Users\\akinl\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- ''
**********************************************************************
The reason for the above error message is that the NLTK library requires certain data files for some of its methods, and we have not yet downloaded them, which is typical on first use. Therefore, we need the NLTK downloader to download two data modules, punkt and averaged_perceptron_tagger.
The data is then available for use, for example, by methods such as word_tokenize() and pos_tag(). To download the data from a Python script, we use the download() method.
You could create a Python file and run the below code to solve the issue.
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
Or run the following commands through your command line interface:
python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger
Example Code:
import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print(list(textBigGrams), list(textTriGrams))
The output of the code:
[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]
Example Code:
import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print("The Bigrams of the Text are")
print(*map(" ".join, textBigGrams), sep=", ")
print("The Trigrams of the Text are")
print(*map(" ".join, textTriGrams), sep=", ")
The output of the code:
The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come
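Once you have bigram or trigram tuples, a common next step is counting how often each one occurs. Here is a short sketch using only the standard library (the sentence is illustrative, and we split on whitespace instead of using word_tokenize() to keep the example dependency-free):

```python
from collections import Counter

tokens = "well the money has finally come and the money is here".split()
# build bigram tuples with zip and count how often each one occurs
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(3))
```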
Olorunfemi is a lover of technology and computers who writes technology and coding content for developers and hobbyists. When not working, he learns design, among other things.