Data types and variables are some of the Python language’s greatest strengths. The language typically makes it easy to manipulate data without worrying too much about specific types. However, there are some instances where similar data types are fundamentally incompatible with each other. One of the best examples of this dichotomy is Python’s strings and bytes-like objects. Thankfully, it’s fairly easy to convert a bytes-like object into a string.
A Closer Look at the Two Formats
The difference between bytes-like objects and strings is often extremely subtle. Python’s strings are essentially sequences of Unicode characters. Bytes-like objects, as the name suggests, are collections of raw bytes.
The main point of confusion stems from the fact that everything on a computer can ultimately be reduced to bytes. A sentence in Unicode within a text file consists of bytes. But Unicode sentences can also exist as strings within Python code. The difference between the two essentially comes down to whether Python treats the data as readable text.
Python’s strings are always presented as Unicode text. But bytes-like objects aren’t really considered as anything other than bytes. There might be plain text within those bytes. But Python’s interpreter won’t treat that text any differently than bytes signifying, say, an executable file. Bytes-like objects are essentially treated as a mystery box that’s inherently unintelligible. The interpreter won’t make any assumptions about what anything in a bytes-like object actually means, whereas it will always assume that a string consists of Unicode characters.
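The distinction can be sketched in a few lines. This is only an illustration, and the variable names text and data are assumptions made for the example rather than anything from the samples that follow. It shows that a string is measured and indexed in characters, a bytes object in raw bytes, and that Python refuses to mix the two.

```python
text = "ödd"                 # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: the raw UTF-8 encoding of that text

print(len(text))       # 3 characters
print(len(data))       # 4 bytes, because "ö" takes two bytes in UTF-8
print(type(text[0]))   # indexing a str yields another str
print(type(data[0]))   # indexing bytes yields an int

try:
    combined = text + data  # str and bytes cannot be concatenated
except TypeError as error:
    print("Cannot mix str and bytes:", error)
```

The TypeError is the interpreter declining to guess how the bytes should be read as text.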
Directly Working With Strings and Bytes-Like Objects
The difference between the two types can be more easily understood by looking at a small code sample. Consider the following code that uses both types.
initialString = 'Something ödd in this string.'
# Encode the string to bytes; encode() defaults to UTF-8
stringEncodedToBinUtf8 = initialString.encode()
stringEncodedToBinUtf16 = initialString.encode('UTF-16')
# Decode the bytes back into strings with the matching codecs
bin8Decoded2string = stringEncodedToBinUtf8.decode('UTF-8')
bin16Decoded2string = stringEncodedToBinUtf16.decode('UTF-16')
# Show the type and contents of each variable
print(type(initialString))
print('Original:', initialString)
print(type(stringEncodedToBinUtf8))
print('UTF8:', stringEncodedToBinUtf8)
print(type(stringEncodedToBinUtf16))
print('UTF16:', stringEncodedToBinUtf16)
We begin by creating one of Python’s strings as initialString. It’s fairly standard, except for the use of ö instead of o. This will demonstrate why bytes and strings aren’t as similar as they might seem. We then proceed to encode the string’s data into bytes. Note that we don’t pass any arguments on the initial encoding, so it defaults to UTF-8. Next we encode initialString into another variable, this time as UTF-16. We also decode both bytes objects back into strings using the matching codecs. We then proceed with a series of type and print calls. The type call shows the variable’s type, and the print call shows the variable’s contents.
In the initial sample’s output, initialString and stringEncodedToBinUtf8 seem quite similar when printed to screen. But note the difference with stringEncodedToBinUtf16. This demonstrates an important point. Bytes-like objects and strings can look quite similar, but Python’s interpreter will treat them very differently. There are situations where you can get away with treating them the same. But even in those cases it can lead to unpredictable behavior. Try running the initial example again, but this time change the initialString declaration on its first line to the following.
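The sample decodes both byte sequences but never prints the decoded results. As a minimal sketch, using snake_case names that are assumptions for this illustration, the following checks that decoding with the same codec used for encoding restores the original text, and that the two encodings are different byte sequences.

```python
initial_string = "Something ödd in this string."

utf8_bytes = initial_string.encode()           # UTF-8 is the default codec
utf16_bytes = initial_string.encode("UTF-16")

# Decoding with the codec that produced the bytes restores the original text.
print(utf8_bytes.decode("UTF-8") == initial_string)
print(utf16_bytes.decode("UTF-16") == initial_string)

# The same text yields different byte sequences under different codecs.
print(utf8_bytes == utf16_bytes)
print(len(initial_string), len(utf8_bytes), len(utf16_bytes))
```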
initialString = 'Something odd in this string.'
You’ll note that the UTF-8 bytes, when printed to screen, are now nearly identical to initialString, differing only in the b prefix and the surrounding quotes. This demonstrates just how quickly strings and bytes that seem to be at parity can diverge. A single character is all it takes. And that’s one of the most significant reasons why it’s important to convert bytes-like objects into strings if they’re going to be actively used within your code.
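A side-by-side comparison makes the point concrete. In this small sketch, the names plain and accented are assumptions chosen for illustration. The ASCII-only text looks almost unchanged once encoded, while the accented text picks up escape sequences, and even the lookalike bytes never compare equal to the string.

```python
plain = "Something odd in this string."
accented = "Something ödd in this string."

print(plain.encode("utf-8"))     # b'Something odd in this string.'
print(accented.encode("utf-8"))  # b'Something \xc3\xb6dd in this string.'

# Even when the printed forms look alike, bytes and str are never equal.
print(plain.encode("utf-8") == plain)  # False
```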
How To Convert Bytes-Like Objects to Strings
The actual conversion of bytes-like objects to strings is usually fairly simple. In fact, you’ve already seen it demonstrated. Think back to this line in the initial example.
bin8Decoded2string = stringEncodedToBinUtf8.decode('UTF-8')
The decode method converts a bytes object into a string using the codec passed to decode. Keep in mind how Python’s interpreter produced very different results with UTF-8 and UTF-16. Passing a codec to decode is essentially like telling someone which translation dictionary to pick up off the shelf. Note that this can easily go wrong if you’re unsure of what the bytes actually contain. Take a look at this variation on the initial example.
initialString = 'Something ödd in this string.'
stringEncodedToBinUtf16 = initialString.encode('UTF-16')
bin8Decoded2string = stringEncodedToBinUtf16.decode('UTF-8')
print(bin8Decoded2string)
We try the same conversion that worked before, but this time pass UTF-8 rather than the correct UTF-16. In this example, we do know what the correct encoding is. But that isn’t always going to be the case in real-world situations. And, as this example shows, incorrect assumptions when converting bytes-like objects to strings can cause serious problems. In this instance the script terminates with a UnicodeDecodeError when it can’t decode the bytes. However, it’s fairly easy to get around that problem. Try running this variation on the previous idea.
initialString = 'Something ödd in this string.'
stringEncodedToBinUtf16 = initialString.encode('UTF-16')
try:
    # decode() raises UnicodeDecodeError if the bytes are not valid UTF-8
    stringEncodedToBinUtf16.decode('utf-8')
    print("It is UTF-8")
except UnicodeError:
    print("It is not UTF-8")
If we’d encoded to UTF-8, the decode would have succeeded and we could proceed to apply whatever logic was needed, such as the conversion to a string. Since we didn’t, the except block catches the error and prints a clear message instead of letting the script crash, and further conditionals could act on that result if desired. We simply need to extend that test to cover any codecs we might be working with. There’s also the option of using third-party libraries, like chardet, which attempt to detect the encoding across many common codecs.
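One way to extend the test to several codecs is sketched below. The function name decode_with_candidates and its default codec list are assumptions made for illustration, not part of any library. It tries each candidate in order and returns the first decode that succeeds.

```python
def decode_with_candidates(data, codecs=("utf-8", "utf-16")):
    """Return (text, codec) for the first codec that decodes data.

    Hypothetical helper: returns (None, None) if every candidate fails.
    """
    for codec in codecs:
        try:
            return data.decode(codec), codec
        except UnicodeError:
            continue  # this codec cannot read the bytes, try the next one
    return None, None


original = "Something ödd in this string."
utf16_bytes = original.encode("UTF-16")

text, codec = decode_with_candidates(utf16_bytes)
print(codec)             # the codec that succeeded
print(text == original)  # whether the original text was recovered
```

The order of the candidates matters, and a successful decode doesn’t prove the guess was right. Some codecs, such as latin-1, accept any sequence of bytes, so a permissive codec placed early in the list would hide the correct one.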