Bytes and string handling in Python 3

Vojtěch Jelínek edited this page Aug 7, 2021 · 1 revision

Introduction

Python 3 handles strings and bytes very differently from Python 2 (you can read more about this in this article or in this Pragmatic Unicode talk). Basically, strings (str) in Python 3 are Unicode by default, and bytes (bytes) are immutable sequences of integers from 0 to 255 (that is, sequences of 8-bit values). There is no implicit conversion between str and bytes in Python 3, so any conversion needs to be done explicitly using the encode (str → bytes) and decode (bytes → str) methods.
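As a minimal sketch of the difference between the two types (the sample text is arbitrary and chosen only because it contains a non-ASCII character):

```python
text = 'Vojtěch'                 # str: a sequence of Unicode code points
data = text.encode('utf-8')      # bytes: a sequence of integers from 0 to 255

# 'ě' needs two bytes in UTF-8, so the two lengths differ.
print(len(text), len(data))      # 7 8

# Indexing a str gives a str; indexing bytes gives an int.
print(type(text[0]), type(data[0]))

# Mixing the two types raises TypeError instead of converting silently.
try:
    text + data
except TypeError as error:
    print('TypeError:', error)

# Decoding with the same encoding restores the original string.
restored = data.decode('utf-8')
print(restored == text)          # True
```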

Throughout Oppia, we typically use strings. However, you may come across bytes in places where there is an interaction with some outside library or API — for example, when standard input or output is read or written, or when data is read from or written to files. Some standard Python libraries also only accept bytes.

Rules for handling strings and bytes

Bytes outside, strings inside

The general rule you should follow is to keep all text in Oppia as strings, where possible. If a conversion to bytes is necessary, that conversion should happen as close to the “edges” of the app as possible. So, for example:

  • When you receive bytes from some library, immediately convert them to a string using decode.
  • If you need to call a function that requires bytes, use encode to convert the string to bytes immediately before the call.
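The two rules above could look roughly like the following sketch. The helper names compute_hash and decode_base64_text are hypothetical and are not taken from the Oppia codebase; hashlib and base64 stand in for any library that requires or returns bytes.

```python
import base64
import hashlib


def compute_hash(text: str) -> str:
    """Hypothetical helper: returns the SHA-256 hex digest of a string."""
    # hashlib only accepts bytes, so encode immediately before the call.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def decode_base64_text(encoded_text: str) -> str:
    """Hypothetical helper: decodes base64-encoded UTF-8 text."""
    # b64decode returns bytes; decode right away so callers only see str.
    return base64.b64decode(encoded_text).decode('utf-8')


# Callers pass and receive str only; bytes never leave the helpers.
digest = compute_hash('héllo')
encoded = base64.b64encode('héllo'.encode('utf-8')).decode('ascii')
print(decode_base64_text(encoded))
```

Keeping the conversion inside small helpers like these means the rest of the code never has to track whether a value is str or bytes.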

Use utf-8 (or ascii)

In the Oppia codebase, all data whose encoding we control should be encoded and decoded using utf-8 (for example, encode('utf-8')). If you find a case where utf-8 cannot be used, please raise this with the Core Maintainers team.

If, in some case, an external source returns or receives data with a different encoding, it is fine to use that encoding only for that source. However, please first be sure to investigate whether that source can be configured to use utf-8 instead.
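A short sketch of what this can look like in practice. The file name and the latin-1 source are made up for illustration; the point is only that the encoding is always named explicitly and that a non-utf-8 encoding stays confined to the one source that needs it.

```python
import os
import tempfile

text = 'Příliš žluťoučký kůň'

with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, 'example.txt')
    # Name the encoding explicitly instead of relying on the default,
    # which can depend on the platform and locale.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        round_tripped = f.read()

print(round_tripped == text)  # True

# Assumption: a hypothetical external source that can only produce latin-1.
# Use latin-1 at that boundary only, and work with str everywhere else.
legacy_bytes = 'café'.encode('latin-1')
legacy_text = legacy_bytes.decode('latin-1')
print(legacy_text)
```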
