How to Convert Unicode Characters to ASCII String in Python
- Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python
- Use the unidecode Library to Convert Unicode to ASCII String in Python
- Conclusion
Unicode is the universal character encoding standard, covering the characters of virtually every written language. ASCII defines only 128 characters, each of which fits in a single byte, whereas Unicode defines far more code points; in the UTF-8 encoding, a single character takes between 1 and 4 bytes.
This tutorial will demonstrate how to convert Unicode characters into an ASCII string. The goal is either to remove the characters that ASCII does not support or to replace them with their closest ASCII equivalents.
Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python
The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make it significantly easier to look up, filter, and inspect these characters.
Normalizing Unicode
unicodedata has a function called normalize() that accepts two parameters: the name of the normalization form to apply and the string to normalize.
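There are 4 Unicode normalization forms: NFC, NFKC, NFD, and NFKD. NFD and NFKD decompose characters into base characters and combining marks, while NFC and NFKC recompose them afterward; the two K forms additionally replace compatibility characters with their plain equivalents. The official documentation explains each form in depth.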
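As a rough illustration of how the four forms differ, the sketch below normalizes two sample characters chosen only for this example (they do not appear in the tutorial's strings): the precomposed é and the ﬁ ligature.

```python
import unicodedata

# Sample inputs chosen for illustration only.
e_acute = "\u00e9"   # é as a single precomposed code point
ligature = "\ufb01"  # ﬁ ligature, a compatibility character

for form in ("NFC", "NFKC", "NFD", "NFKD"):
    accented = unicodedata.normalize(form, e_acute)
    lig = unicodedata.normalize(form, ligature)
    # Show the code points each form produces for both inputs.
    print(form, [hex(ord(ch)) for ch in accented], [hex(ord(ch)) for ch in lig])
```

Only the decomposed forms (NFD, NFKD) separate the accent from the letter, and only the K forms expand the ligature, which is why NFKD is a natural fit for ASCII conversion.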
The NFKD normalized form will be used throughout this tutorial.
Syntax:
unicodedata.normalize(form, unistr)
Parameters:
form: This specifies the Unicode normalization form to apply to the input string.
unistr: The input Unicode string that we want to normalize according to the chosen normalization form.
Now, let’s declare a string with multiple Unicode characters.
Code Example:
import unicodedata
stringVal = "Här är ett exempel på en svensk mening att ge dig."
print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore"))
In the code, we start by importing the unicodedata module, which allows us to work with Unicode characters. We define a Unicode string called stringVal with the value "Här är ett exempel på en svensk mening att ge dig."; this string contains several characters with diacritics.
We then use the unicodedata.normalize() function with the "NFKD" (Normalization Form KD) parameter to normalize stringVal. This normalization form decomposes characters with diacritics into their base characters and combining diacritic marks.
The result of the normalization is encoded using the "ascii" codec, and we specify "ignore" as the error handler. This means that any character that cannot be encoded as ASCII, including the separated diacritic marks, is dropped.
Output:
b'Har ar ett exempel pa en svensk mening att ge dig.'
The output displayed is a bytes literal (indicated by the b prefix) containing the normalized string with the non-ASCII characters reduced to their base ASCII letters.
In this case, the characters ä and å become a, because the combining marks split off by the normalization are dropped, and the result is b'Har ar ett exempel pa en svensk mening att ge dig.'. The bytes object can be decoded to obtain a plain ASCII string if needed.
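To see why only the accent disappears, it can help to inspect the decomposed string before it is encoded. The following sketch takes one word from the example sentence and uses unicodedata.combining() and unicodedata.name() to label each resulting code point; the variable names are just for illustration.

```python
import unicodedata

word = "på"  # a word from the example sentence
decomposed = unicodedata.normalize("NFKD", word)

for ch in decomposed:
    # combining() returns a non-zero class for combining marks.
    kind = "combining mark" if unicodedata.combining(ch) else "base character"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {kind}")
```

The base letters are plain ASCII, so they survive the encode() call, while the combining mark is not ASCII and is handed to the error handler.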
To remove the b prefix and the single quotes around the string, call decode() after encode() to convert the bytes back into a string.
import unicodedata
stringVal = "Här är ett exempel på en svensk mening att ge dig."
print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore").decode())
Output:
Har ar ett exempel pa en svensk mening att ge dig.
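If the conversion is needed in more than one place, the three calls can be wrapped in a small helper. This is a minimal sketch; the function name to_ascii and its errors parameter are assumptions made for this example, not part of unicodedata.

```python
import unicodedata


def to_ascii(text: str, errors: str = "ignore") -> str:
    """Decompose text with NFKD, then drop (or replace) what ASCII cannot hold."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() applies the chosen error handler; decode() turns bytes back into str.
    return decomposed.encode("ascii", errors).decode("ascii")


print(to_ascii("Här är ett exempel på en svensk mening att ge dig."))
```

Passing the error handler through as a parameter keeps the helper usable with other handlers accepted by encode().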
Handling Untranslatable Characters
Let’s try another example using "replace" as the second argument (the error handler) of the encode() function. For this example, we use a string containing characters that have no ASCII counterparts.
Code Example:
import unicodedata
stringVal = "áæãåāœčćęßßßわた"
print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "replace").decode())
In the code, we begin by importing the unicodedata module. Then, we define a Unicode string called stringVal, which contains a mix of characters: "áæãåāœčćęßßßわた".
Next, we call the unicodedata.normalize() function with the "NFKD" parameter to normalize stringVal. This normalization form decomposes characters into their base characters and diacritic marks, preparing them for conversion to ASCII.
The normalized string is then encoded using the "ascii" codec with the "replace" error handler. When a character in the string has no ASCII representation, the "replace" handler substitutes a question mark (?) for it.
Output:
a??a?a?a??c?c?e??????
The output is a string in which every non-ASCII character is replaced with a question mark: a??a?a?a??c?c?e??????. Note that the decomposed combining marks are replaced too, so each accented letter becomes its base letter followed by a ?.
This is a common way to handle characters that don’t have a direct ASCII equivalent, because it leaves a visible marker wherever something could not be converted.
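"ignore" and "replace" are only two of the error handlers that encode() accepts. The sketch below compares several of them on a single sample word and deliberately skips normalization so that only the handler's behavior is visible.

```python
text = "på"  # sample word; no normalization is applied here on purpose

# "strict" is the default handler and raises on the first unencodable character.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

results = {
    handler: text.encode("ascii", handler).decode("ascii")
    for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}
for handler, value in results.items():
    print(f"{handler}: {value}")
```

The last two handlers keep the information about which character was there, which can be useful when the conversion has to be reversible or auditable.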
To remove the ?, we will use "ignore" instead of "replace" on the same string:
import unicodedata
stringVal = "áæãåāœčćęßßßわた"
print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore").decode())
Output:
aaaacce
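As seen in the output, every character that would have become a question mark (?) is dropped because "ignore" is used instead of "replace", resulting in the output string aaaacce. Characters with no decomposition into an ASCII letter, such as æ, œ, ß, わ, and た, disappear entirely.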
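When silently losing letters such as æ or ß is not acceptable, one option is to substitute them by hand before normalizing. The sketch below assumes a tiny, hypothetical replacement table (FALLBACKS) and a hypothetical helper name; neither comes from unicodedata, and the table would need extending for real data.

```python
import unicodedata

# Hypothetical, deliberately tiny fallback table; extend it for your own data.
FALLBACKS = str.maketrans({"æ": "ae", "œ": "oe", "ß": "ss"})


def to_ascii_with_fallbacks(text: str) -> str:
    # Apply the hand-written replacements first, then NFKD plus "ignore".
    replaced = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_with_fallbacks("áæãåāœčćęßßßわた"))
```

Characters that are neither in the table nor decomposable, such as わ and た here, are still dropped.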
Use the unidecode Library to Convert Unicode to ASCII String in Python
The unidecode library is a third-party package that transliterates Unicode text to its closest ASCII representation. It is not part of the standard library, so we first need to install it with the pip install unidecode command.
Once the library is installed, we can use it to convert Unicode text to ASCII.
Basic Syntax:
from unidecode import unidecode
ascii_text = unidecode(unicode_text)
Parameter:
unicode_text: This is the Unicode text that we want to convert to its closest ASCII representation. We pass it as an argument to the function, which returns the corresponding ASCII string.
Code Example:
from unidecode import unidecode
stringVal = "Här är ett exempel på en svensk mening att ge dig."
ascii_str = unidecode(stringVal)
print(ascii_str)
In this code, we import the unidecode function from the unidecode library. Then, we pass our Unicode string, stringVal, to the unidecode function, which returns an ASCII representation of the string.
Finally, we print ascii_str, which contains the ASCII representation of the original Unicode string.
Output:
Har ar ett exempel pa en svensk mening att ge dig.
The unidecode library has transformed the original Unicode string "Här är ett exempel på en svensk mening att ge dig." into an ASCII representation while preserving the closest phonetic representation. In this case, it replaced characters like ä and å with their closest ASCII equivalents.
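Because unidecode is an optional dependency, code that may run where it is not installed can fall back to the standard-library approach. This is only a sketch under that assumption; the helper name to_ascii is made up for the example.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party: pip install unidecode
except ImportError:
    unidecode = None  # fall back to the standard library below


def to_ascii(text: str) -> str:
    if unidecode is not None:
        return unidecode(text)
    # Fallback: NFKD normalization followed by an ASCII encode with "ignore".
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Här är ett exempel på en svensk mening att ge dig."))
```

Both paths give the same result for this sentence, but they can disagree on characters that have no decomposition, so the fallback is best treated as a degraded mode, not an equivalent one.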
Conclusion
This article explores two methods for converting Unicode characters to ASCII strings in Python. It starts by demonstrating the unicodedata module, which provides precise normalization of Unicode characters but removes or replaces any character that has no ASCII decomposition.
Then, it introduces the unidecode library, a convenient tool that keeps a close phonetic representation while converting Unicode to ASCII. The choice of method depends on our specific requirements: the standard library needs no extra dependency, while unidecode handles a wider range of characters.