Strings Bytes versus Unicode

In Python 2 there are two variants of string: those made of bytes with type (str) and those made of text with type (unicode).

In Python 2, an object of type str is always a byte sequence, but is commonly used for both text and binary data.

A string literal is interpreted as a byte string.

s = 'Cafe'    # type(s) == str

There are two exceptions: You can define a Unicode (text) literal explicitly by prefixing the literal with u:

s = u'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == str

Alternatively, you can specify that a whole module’s string literals should create Unicode (text) literals:

from __future__ import unicode_literals

s = 'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == unicode

In order to check whether your variable is a string (either Unicode or a byte string), you can use:

isinstance(s, basestring)

In Python 3, the str type is a Unicode text type.

s = 'Cafe'           # type(s) == str
s = 'Café'           # type(s) == str (note the accented trailing e)

Additionally, Python 3 added a bytes object, suitable for binary “blobs” or writing to encoding-independent files. To create a bytes object, you can prefix b to a string literal or call the string’s encode method:

# Or, if you really need a byte string:
s = b'Cafe'          # type(s) == bytes
s = 'Café'.encode()  # type(s) == bytes

To test whether a value is a string, use:

isinstance(s, str)

It is also possible to prefix string literals with a u prefix to ease compatibility between Python 2 and Python 3 code bases. Since, in Python 3, all strings are Unicode by default, prepending a string literal with u has no effect:

u'Cafe' == 'Cafe'

Python 2’s raw Unicode string prefix ur is not supported, however:

>>> ur'Café'
  File "<stdin>", line 1
    ur'Café'
           ^
SyntaxError: invalid syntax