SQL Server 2019 Introduces UTF-8 Support

Microsoft in the News

As part of Microsoft’s commitment to sustainability, their newest tech campus will come with a Net-Zero Water certification.

This is an important feature for buildings in California. Largely due to agriculture, California uses more water than any other state. As droughts become more severe due to climate change, water is becoming more precious. Already, most of the water flowing through the state is used once before it reaches the ocean. Many rivers are dry by the time they reach the ocean. To make matters worse, the population of California is predicted to grow by 30% by 2050.

Microsoft’s new campus covers 15 acres and will handle more than 2,000 employees. Drinking fountains and sinks will still use municipal water, but landscape, toilet, and any other item that uses water will make use of rainfall and/or on-site recycled water. Wastewater from kitchens and bathrooms is treated and reused for irrigation and toilets so the same water can be used over and over.

Even the parking lots will act as collection areas for water with the excess being directed toward the local creek.

This is the sort of thing good corporate citizens do!

SQL Server 2019 Introduces UTF-8 Support

Variable Character Field, or as it colloquially known, “VARCHAR” is a set of character data that can consist of letters and numbers. A VARCHAR field can be any size, but by default is limited by SQL to 8,000 characters. If VARCHAR (MAX) is used, it is limited to 2 gigabytes. VARCHAR is normally 1 byte per character.

For accented letters and other non-VARCHAR characters there is NVARCHAR. NVARCHAR can represent all the characters in the Unicode list, but each character is now 2 bytes.

Unfortunately, we don’t always know in advance what input will be entered into the fields we create. If we know for certain that only ASCII will be entered, we can use VARCHAR to define those fields. But how often are you positive that for the foreseeable future, nobody will stray from the ASCII keys? So, we tend to play it safe and use NVARCHAR. That, as I just pointed out, takes up twice the amount of space. But hard drives are cheap and we would all like to avoid problems down the road, especially if we can offload that problem onto the sysadmin and their budget.

The solution to this problem was introduced in SQL 2005. SQLCLR was introduced and it allowed you to use .NET and access GZip/Deflate. With this function, we could compress data into VARBINARY(MAX) when saving and decompress it back to NVARCHAR(MAX) when we needed it. Unfortunately, in order to search this data, you had to decompress it first.

With SQL Server Enterprise 2008 (not regular SQL), Microsoft solved the problems introduced by SQLCLR by introducing Data Compression. Unfortunately, Data Compression did not work with row overflow and MAX values that can’t fit in-row. So Microsoft brought us Unicode Compression in SQL Server 2008 R2, but that ended up not solving the problems.

As of SQL Server 2016, Microsoft brought us Compression for all their SQL editions, but the problem with off-row values and in-row NVARCHAR(MAX) still persisted.

The built in functions of COMPRESS and DECOMPRESS were introduced with SQL Server 2016. Many users were still not happy as these functions were not useful in all situations.

With SQL Server 2017 support for MAX data types in Clustered Columnstore Indexes was unveiled. But since this affects the whole table, again, it did not fit everyone’s needs.

Finally, all those previous patches have been rendered redundant. Almost begrudgingly, Microsoft has brought us UTF-8 for SQL Server 2019. As explained in my previous blog, UTF-8 is a variable byte encoder that uses a single byte for all ASCII characters and 2 bytes for the next 1,920 characters covering the majority of languages and characters you will likely ever deal with, and more if needed.

Using UTF-8 solves:

  1. The storage issue introduced with NVARCHAR,
  2. The search issue introduced with compression, and
  3. No more need to us SQLCLR or COMPRESS.

All this could result in significant storage savings. If you manage to achieve significant storage savings, there is an add on bonus that there will then be a very real potential for an improvement in the performance of your database.

The caveat to all this is that, as I just discovered, according to docs.microsoft.com as of today (Jan 20, 2019) we may have to wait for the next version to see the space savings. According to Microsoft, the current version of SQL Server 2019 preview does not include UTF-8 support for all character sets.

SQL Server 2019 preview does not change the behavior of previously existing Unicode (UTF-16) data types (nchar, nvarchar, and ntext).

In my next blog, I will discuss the limitations to your potential storage savings.

Leave a Reply

Your email address will not be published. Required fields are marked *