Create N-Grams From Text in Python

Olorunfemi Akinlua, 11 December 2023

In computational linguistics, n-grams are important for language processing and for contextual and semantic analysis. An n-gram is a contiguous sequence of n adjacent words taken from a string of tokens.

The most common are unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3), which work well in practice. For n > 3, data sparsity can become a problem, because longer sequences recur less often in a corpus.
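To make the sparsity point concrete, here is a minimal sketch that counts how many distinct n-grams appear only once in a small sample as n grows. The sample sentence and the helper name count_ngrams are assumptions made for illustration; they are not part of the original article.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative sample text, chosen only for this sketch.
sample = "the cat sat on the mat and the cat sat on the rug".split()

for n in range(1, 5):
    counts = count_ngrams(sample, n)
    # An n-gram seen exactly once gives a model very little evidence.
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```

On this sample, the number of n-grams seen only once rises from 3 for unigrams to 6 for 4-grams, even though the text itself does not change.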

This article discusses how to create n-grams in Python using functions and libraries.

Use the `for` Loop to Create N-Grams From Text in Python

We can write a function, ngrams, that takes the text and the value n and returns a list containing the n-grams.

To create the function, we split the text into words and create an empty list (output) that will store the n-grams. We then use a for loop to iterate over the splitInput list, stopping at the last position where a full n-gram still fits.

At each position, the next n words (tokens) are sliced out and appended to the output list.

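def ngrams(text, num):
    # Split on single spaces; punctuation stays attached to its word.
    splitInput = text.split(" ")
    output = []
    # Slide a window of num tokens across the list.
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i : i + num])
    return output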


text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))

Output:

[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
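Note that splitting on spaces leaves punctuation attached to words, as in 'abode,' and 'importantly,'. If that is not wanted, one possible variant is sketched below: it lowercases the text, extracts words with a regular expression, and builds the windows with zip() instead of an index loop. The name ngrams_clean and the regular expression are assumptions for illustration, and the pattern only covers ASCII letters and digits.

```python
import re


def ngrams_clean(text, num):
    """Return n-grams as tuples after lowercasing and dropping punctuation.

    Hypothetical variant of the article's ngrams(); the regular expression
    is an assumption about what counts as a word (ASCII only).
    """
    # Keep runs of letters/digits, allowing inner hyphens and apostrophes.
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    # zip() over num staggered views of the token list yields each window.
    return list(zip(*(tokens[i:] for i in range(num))))


text = "Welcome to the abode, and more importantly, our in-house service"
print(ngrams_clean(text, 3)[:3])
# [('welcome', 'to', 'the'), ('to', 'the', 'abode'), ('the', 'abode', 'and')]
```

Unlike the loop version, this sketch returns tuples rather than lists, which makes the n-grams hashable and therefore usable as dictionary keys.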

Use `nltk` to Create N-Grams From Text in Python

NLTK (the Natural Language Toolkit) is a library that provides an easy-to-use interface to resources for text processing, tokenization, and more. To install nltk, use the pip command below.

pip install nltk

Before writing more detailed code, let's look at a potential problem by calling word_tokenize(), which returns a tokenized copy of the text passed to it, using the word tokenizer recommended by NLTK.

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

Output:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
    tokens = nltk.word_tokenize(text)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\akinl/nltk_data'
    - 'C:\\Python310\\nltk_data'
    - 'C:\\Python310\\share\\nltk_data'
    - 'C:\\Python310\\lib\\nltk_data'
    - 'C:\\Users\\akinl\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************

The error occurs because some NLTK methods require data that has not been downloaded yet, which is typically the case the first time you use the library. We therefore use the NLTK downloader to fetch two data modules: punkt, which word_tokenize() depends on, and averaged_perceptron_tagger, which NLTK's part-of-speech tagger uses.

Once downloaded, the data is available to the methods that rely on it, such as words(). To download the data from within a Python script, we call the download() method.

You can create a Python file and run the following code to resolve the problem.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

Or run the following commands from your command-line interface:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger

Example code:

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)

print(list(textBigGrams), list(textTriGrams))

Output:

[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]
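nltk.bigrams() and nltk.trigrams() return iterators rather than lists, which is why the example wraps them in list() before printing. The sketch below imitates that behavior with a plain-Python generator to show that such an iterator can be consumed only once. lazy_bigrams is a hypothetical stand-in written for this illustration; it is not NLTK's implementation.

```python
def lazy_bigrams(tokens):
    """Yield adjacent token pairs one at a time (stand-in for an iterator API)."""
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right)


tokens = "well the money has finally come".split()
pairs = lazy_bigrams(tokens)

first_pass = list(pairs)   # consumes the generator
second_pass = list(pairs)  # nothing is left to yield

print(first_pass)
print(second_pass)  # an empty list
```

If the n-grams are needed more than once, storing the result of list() in a variable avoids recomputing them.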

Example code that prints each n-gram as a space-joined string:

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)

print("The Bigrams of the Text are")
print(*map(" ".join, textBigGrams), sep=", ")

print("The Trigrams of the Text are")
print(*map(" ".join, textTriGrams), sep=", ")

Output:

The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come

Author: Olorunfemi Akinlua

Olorunfemi is a lover of technology and computers. He also writes technology and coding content for developers and hobbyists. When not working, he learns to design, among other things.
