How to Detect Languages in Python
- Use Libraries and API for Language Detection in Python
- Use Language Models for Language Detection in Python
- Use Intersecting Sets for Language Detection in Python
- Conclusion
As humans, we know only a few languages, and that is not enough when we deal with datasets that mix languages: before we can process text or documents, we first have to identify the language they are written in. Adopting a language detection method helps in such situations.
Python has several language detection libraries, and we can pick whichever suits us best. These libraries recognize the characters and commonly used words in the content to identify its language.
We can also build language detection models using Natural Language Processing or Machine Learning alongside Python libraries. For instance, when Chrome detects that a web page's content is not in English, it pops up a box with a button to translate; behind that scenario, Chrome uses a model to predict the language of the text on the page.
Use Libraries and API for Language Detection in Python
The first method we can use in Python to detect languages is a library or an API. Let's look at the most widely used libraries for language detection in Python.
langdetect
langdetect
is a port of Google's language-detection library. It does not come with Python's standard utility modules, so it has to be installed separately.
This API is useful in text processing and linguistics and supports 55 languages.
Python 2.7 or 3.4+ is required to use this API. We can install langdetect
as below.
$ pip install langdetect
We can use the langdetect
API to detect languages after importing the detect
function. The code then prints the detected language of the given sentence.
Here we have provided three sentences as examples, and it displays their languages as English (en)
, Italian (it)
, and Chinese (zh-cn)
, respectively.
Code:
from langdetect import detect
print(detect("Hello World!"))
print(detect("Ciao mondoe!"))
print(detect("你好世界!"))
Output:
langid
langid
is another library for language identification, built with minimal dependencies. It is a standalone language identification tool that can detect 97 languages.
To install, we have to type the below command in the terminal.
$ pip install langid
Using the method below, we can detect the language with the langid
library. While looping, it identifies the language of each of the three sentences and prints the respective language code as English (en)
, Italian (it)
, and Chinese (zh)
.
Code:
import langid
T = ["Hello World!", "Ciao mondoe!", "你好世界!"]
for i in T:
    print(langid.classify(i))
Output:
textblob
textblob
is another API that uses Google Translate's language detector to work on textual data. It builds on the NLTK
(Natural Language Toolkit) and pattern
modules, considered giants in Python.
In addition to detecting the language, this simple API does sentiment analysis, noun phrase extraction, part-of-speech tagging, classification, and more.
To use this API, the Python version should be 2.7 or 3.5+, and an internet connection is required.
We have to install the package with the pip
command.
$ pip install textblob
After that, we can detect the language by importing the TextBlob
class. Here we have assigned three sentences in different languages to the list named T
.
While looping through a for
loop, it detects the language of each sentence and prints it out.
Code:
from textblob import TextBlob
T = ["Hello World!", "Bonjour le monde!", "你好世界!"]
for i in T:
    lang = TextBlob(i)
    print(lang.detect_language())
As the textblob
library's language detection is already deprecated, the above code raises an error instead of producing the expected output. So this method is no longer recommended; we can use the Google Translate API instead.
Learn more about TextBlob
in its official documentation.
In addition to the above APIs and libraries, we have googletrans
, FastText
, spaCy
, polyglot
, pycld
, chardet
, guess_language
, and many more. We can use them too, depending on the use case.
Among them, polyglot
and FastText
are the best libraries for long text and offer high accuracy. Also, polyglot
and pycld
can detect multiple languages within a single text.
googletrans
is a free Python library that allows us to make unlimited requests. It can auto-detect languages and is fast and reliable.
FastText
is a text classifier that can recognize 176 languages and delivers fast, accurate output. FastText
is the language identification library used by Facebook.
Apart from using libraries or APIs, we can detect languages by using language models or intersecting sets.
Use Language Models for Language Detection in Python
Here, a language model gives the probability of a sequence of words or characters. We build N
language models, one per language, score the text with each of them, and pick the language whose model gives the highest score.
These language models enable us to detect the language of the text even if it contains a diverse set of languages.
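The approach above can be sketched with a minimal character-bigram model built from scratch. This is an illustration only: the tiny hard-coded training samples stand in for the large corpora a real system would use, and character bigrams are chosen instead of words because they need far less training text.

```python
import math
from collections import Counter

# Tiny training samples; a real model would be trained on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "it": "il gatto nero salta sopra il cane pigro nella casa grande",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}

def char_bigrams(text):
    # Pad with spaces so word boundaries become part of the signal.
    text = f" {text.lower()} "
    return [text[i : i + 2] for i in range(len(text) - 1)]

# Train one bigram model per language: bigram -> relative frequency.
models = {}
for lang, sample in SAMPLES.items():
    counts = Counter(char_bigrams(sample))
    total = sum(counts.values())
    models[lang] = {bg: c / total for bg, c in counts.items()}

def detect(text, floor=1e-6):
    """Return the language whose model assigns the highest log-probability.

    Unseen bigrams get a small floor probability instead of zero.
    """
    scores = {
        lang: sum(math.log(model.get(bg, floor)) for bg in char_bigrams(text))
        for lang, model in models.items()
    }
    return max(scores, key=scores.get)

print(detect("the dog and the fox"))  # en
print(detect("il gatto nella casa"))  # it
```

Because each language scores every input, this scheme still picks the single best match when the text mixes languages; scoring sentence by sentence would recover the per-sentence languages.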
Use Intersecting Sets for Language Detection in Python
Another method we can use to detect languages is intersecting sets. Here we prepare N
sets containing the most frequent words of each language and intersect the words of the text with each set.
The detected language is the one whose set has the largest intersection with the text.
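The steps above can be sketched in a few lines of pure Python. The small word sets here are placeholders; a real implementation would use the few hundred most frequent words of each language.

```python
# Small frequent-word sets per language (placeholders for real word lists).
FREQUENT_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "it", "you"},
    "it": {"il", "la", "di", "e", "che", "un", "per", "non"},
    "fr": {"le", "la", "de", "et", "les", "des", "un", "une"},
}

def detect(text):
    """Pick the language whose frequent-word set overlaps the text most."""
    words = set(text.lower().split())
    overlaps = {lang: len(words & vocab) for lang, vocab in FREQUENT_WORDS.items()}
    return max(overlaps, key=overlaps.get)

print(detect("the cat is in the garden"))   # en
print(detect("il gatto non mangia"))        # it
```

This method is cheap and easy to extend with new languages, but it needs enough words in the input to find an overlap, so it works better on sentences than on single words.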
Conclusion
Overall, the systematic way to detect languages in Python is to use libraries and APIs. They differ in accuracy, language coverage, speed, and memory consumption.
We can choose suitable libraries and build models per the use case.
When a model targets only one language, text in other languages can be treated as noise. Language detection is a step in data cleaning; therefore, by detecting languages, we can obtain noise-free data.
Nimesha has been a full-stack software engineer for more than five years. He loves technology, as it has the power to solve many of our problems within minutes. He has been contributing to various projects over the last 5+ years, working with all three tiers (DB, middle tier, and client). Recently, he has started working with DevOps technologies such as Azure administration, Kubernetes, Terraform automation, and Bash scripting as well.