SQLite: UTF-8 or UTF-16?

Submitted by mimec on 2011-04-22

SQLite stores all text as Unicode characters encoded using either UTF-8 or UTF-16 encoding. The API contains functions which allow passing and retrieving text using either UTF-8 or UTF-16 encoding. These functions can be freely mixed and proper conversions are performed transparently when necessary.

The SQLite driver for Qt uses the UTF-16 version of most functions, because that encoding is used internally by the QString class. However SQLite uses the default UTF-8 encoding internally, so it needs to convert all text back and forth when reading and writing data. Usually such conversions does not affect performance very much. The benefit of using UTF-8 encoding internally is that it requires less storage than UTF-16. According to this email from the SQLite mailing list, the performance gain resulting from reduced size of the database is much bigger than the cost of the conversion.

However there is one case when the cost of conversion may become significant. In the previous post I showed how to implement custom collation based on QString::localeAwareCompare. I mentioned that the function expects text to be encoded using UTF-16 encoding because that allows passing it directly to QString. The use of fromRawData ensures that the data is not even copied to a separate buffer. However SQLite still needs to copy both compared strings to a temporary buffer and convert them from UTF-8 to UTF-16 and this happens every time the strings are compared, which is O(N·logN) when sorting query results by a text column.

I made a little experiment to see which storage format is better in such case and the results are a bit surprising. It turns out that UTF-16 is 4% faster than UTF-8 when sorting 1000 rows and 7% faster when sorting 10,000 rows. However UTF-8 becomes faster with large amounts of data - it is 27% faster than UTF-16 when sorting 100,000 rows. At this point the size of the database (which is 4.3 MB for UTF-8 and 5.7 MB for UTF-16) becomes more significant than the cost of conversion.

So the answer to the question which encoding to use is, as usual, "it depends". In my case the database will be rather small and there will rarely be more than a few thousand rows per query, so I will use UTF-16 encoding to avoid the conversions. There are two ways to achieve that. We can execute the following statement directly after opening the database, before any tables are created:

database.exec( "PRAGMA encoding = \"UTF-16\"" );

Another way is to modify the SQLite driver (since we're already using a copy of it) and use sqlite3_open16 instead of sqlite3_open_v2 to open the database - it enables UTF-16 encoding automatically. In that case we won't be able to open the database in read only mode, but if we're not using the advanced connection options then it's not a problem.