If you work with data in Python 2, you’ll probably run into an encoding or decoding issue. If you’re on Python 3, consider yourself lucky: text is unicode by default, and a few of these problems go away.
Here are my (probably oversimplified) notes from working with encoding and decoding data in Python 2. I highly recommend reading The Absolute Minimum Every Software Developer Absolutely Must Know About Unicode and Character Sets.
There’s Unicode, where every character maps to a code point (a theoretical concept, really just a number). The character A has one code point and the lower case a has a different one, while an italicized or bold A is still the same code point as a plain A, because styling is not part of the character. What’s cool is that letters from other languages, like the Arabic letter Ain, have their own code points too. You can see exactly what is mapped where at the Unicode web site.
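To make the code point idea concrete, here is a small sketch using only the standard library. The `chars` and `code_points` names are mine, and the `u''` literals are there so the snippet runs on Python 2 as well as Python 3.3+.

```python
import unicodedata

# A, lower case a, and the Arabic letter Ain (written as an escape so the
# source file itself stays plain ascii)
chars = (u'A', u'a', u'\u0639')

# ord() gives the code point number, unicodedata.name() its official name
code_points = [(ord(ch), unicodedata.name(ch)) for ch in chars]

for number, name in code_points:
    print('U+%04X %s' % (number, name))
```

Each character gets its own number, and nothing here says anything yet about how that number is stored as bytes.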
The idea is that there are character encodings, which define how code points get stored as bytes. Python 2’s default encoding is ascii, which covers only the bare minimum of English characters. We also have other encodings like utf-8, utf-16, latin-1, etc. They differ in size: latin-1 always uses 8 bits per character, utf-8 uses one to four bytes, and utf-16 uses two or four. You hopefully know which encoding your data is in (if not, I use the libraries below to help guess, but again this is only a guess).
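As a rough illustration of how the same character ends up as different bytes, here is a sketch that encodes é (U+00E9) a few ways. The variable names are just for this example, and I use `utf-16-le` rather than plain `utf-16` so no byte order mark gets prepended.

```python
text = u'\xe9'  # the single character é, U+00E9

utf8_bytes = text.encode('utf-8')       # two bytes
utf16_bytes = text.encode('utf-16-le')  # two bytes, little-endian, no byte order mark
latin1_bytes = text.encode('latin-1')   # one byte

# ascii has no slot for this character, so a strict encode raises
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(utf8_bytes), repr(utf16_bytes), repr(latin1_bytes), ascii_ok)
```

Same code point, three different byte sequences, which is why you need to know the encoding before you can read bytes back as text.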
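Encoding and decoding mean converting between formats: encoding turns unicode text into bytes in a particular encoding, and decoding turns bytes back into unicode.
An example would be:
my_str.encode('ascii', 'ignore')  # encode your unicode string as ascii, silently dropping any characters ascii can't represent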
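Decoding goes the other way. Here is a minimal sketch of a full round trip, plus what the `'ignore'` and `'replace'` error handlers do to a character ascii can’t hold. The sample string and variable names are my own.

```python
my_str = u'caf\xe9'  # café, with é written as an escape

# encode: unicode -> bytes
utf8_bytes = my_str.encode('utf-8')

# decode: bytes -> unicode, naming the same encoding explicitly
round_trip = utf8_bytes.decode('utf-8')

# 'ignore' drops what ascii can't represent, 'replace' substitutes a ?
ascii_ignored = my_str.encode('ascii', 'ignore')
ascii_replaced = my_str.encode('ascii', 'replace')

print(repr(utf8_bytes), round_trip == my_str)
print(repr(ascii_ignored), repr(ascii_replaced))
```

Note that `'ignore'` loses data without telling you, so it’s worth using only when you really don’t care about the dropped characters.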
In Python 2, we’re dealing with instances of either str (raw bytes, with no record of which encoding they are in) or unicode. In Python 3, str is always unicode and raw bytes get their own bytes type. If you know the encoding (e.g. latin-1, ascii, utf-8), then explicitly state it when you decode. Otherwise, you can try these libraries for a guess:
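# -*- coding: utf-8 -*-
from unidecode import unidecode
my_str = u'idzie wąż wąską dróżką'
ascii_str = unidecode(my_str)  # transliterates to the closest ascii text: 'idzie waz waska drozka'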
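import chardet
my_str = b'hello world'  # chardet works on raw bytes, not unicode
chardet_result = chardet.detect(my_str)  # a dict with the guessed 'encoding' and a 'confidence' score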
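If you’d rather not pull in a library, one alternative is to try a short list of candidate encodings in order. This is a different technique from what chardet does, and only a sketch: `decode_bytes` and its `fallbacks` parameter are hypothetical names of mine, not part of any library.

```python
def decode_bytes(raw, encoding=None, fallbacks=('utf-8', 'latin-1')):
    """Return (text, encoding_used) for a byte string.

    Tries the explicitly given encoding first, then each fallback in order.
    """
    candidates = ((encoding,) if encoding else ()) + tuple(fallbacks)
    for candidate in candidates:
        try:
            return raw.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # this candidate can't read the bytes, try the next one
    raise ValueError('none of the candidate encodings could decode the data')


print(decode_bytes(b'caf\xc3\xa9'))  # valid utf-8, so the first fallback is used
print(decode_bytes(b'caf\xe9'))      # not valid utf-8, so latin-1 is used
```

Keep latin-1 last in the list: every possible byte is valid latin-1, so it never fails, which also means a "successful" latin-1 decode tells you nothing about whether the guess was right.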
Besides just making sure your code works, make sure that your database character sets are correct.
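mysql> use some_database_name;
mysql> show variables like 'character_set_database';
+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| character_set_database | latin1 |
+------------------------+--------+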
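To see why a mismatched character set matters, here is a rough simulation in plain Python of utf-8 bytes being read back as latin-1. No database is involved, and Python’s `latin-1` codec is only a stand-in (my assumption for illustration; MySQL’s latin1 is not byte-for-byte identical to it).

```python
original = u'w\u0105\u017c'  # the word wąż, written with escapes

# what an application sends if it encodes its text as utf-8
stored = original.encode('utf-8')

# what comes back if those same bytes are interpreted as latin-1
misread = stored.decode('latin-1')

# the damage is reversible only if you know exactly which wrong turn was taken
repaired = misread.encode('latin-1').decode('utf-8')

print(len(original), len(misread), repaired == original)
```

Three characters turn into five garbage ones, and nothing raises an error along the way, which is why it pays to check the character set up front.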