"The Properties and Promises of UTF-8", by Martin J. Dürst (1997).
Presented by Marvin Humphrey.
Created by Ken Thompson one night in September 1992 on a placemat in a New Jersey diner, UTF-8 is a spectacular quintuple bank shot of design: optimal across so many criteria that it's hard to believe anybody could pull it off.
Over the years, UTF-8 has increasingly eclipsed all of the alternatives. UTF-1 and UTF-7 are obsolete and only of historical interest; UTF-32 has limited practical utility; UCS-2 can't express all of Unicode. The only serious competitor remaining is UTF-16, and UTF-8 has continued to gain market share against it.
In 1997, Martin J. Dürst gave a talk at the 11th International Unicode Conference in San Jose on "The Properties and Promises of UTF-8". Two decades later at Papers We Love Seattle, we will review those marvelous properties and contemplate all of the promises that UTF-8 has fulfilled.
Here is a Rob Pike email from 2003 describing the scene in the diner, along with historical emails from Ken Thompson in 1992 articulating some of the design criteria for UTF-8:
SOME ADVICE ON READING THE PAPER:
Dürst's "paper" is really the transcription of a talk. Each page consists of a slide at the top, accompanied by text transcribing the presenter's narration over that slide. As a result, while the paper is 26 pages, it is not dense.
A significant portion of the paper is dedicated to heuristic detection of UTF-8. We will simply accept that heuristic detection of UTF-8 works very well — so feel free to gloss over the details of the heuristic algorithm.
If you don't have time to read the paper, contemplate the six design criteria enumerated by Ken Thompson in his 1992 email:
1) Compatibility with historical file systems: Historical file systems disallow the null byte and the ASCII slash character as a part of the file name.
2) Compatibility with existing programs: The existing model for multibyte processing is that ASCII does not occur anywhere in a multibyte encoding. There should be no ASCII code values for any part of a transformation format representation of a character that was not in the ASCII character set in the UCS representation of the character.
3) Ease of conversion from/to UCS.
4) The first byte should indicate the number of bytes to follow in a multibyte sequence.
5) The transformation format should not be extravagant in terms of number of bytes used for encoding.
6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream.
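Criteria 2, 4 and 6 are easy to see in code. What follows is a minimal Python sketch, not Thompson's or Dürst's implementation: the function names are made up for illustration, and it assumes the modern 1-to-4-byte form of UTF-8 (the 1992 design allowed longer sequences). It does not validate its input.

```python
def sequence_length(lead: int) -> int:
    """Total length of the sequence that starts with this lead byte (criterion 4)."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII
    if (lead >> 5) == 0b110:
        return 2                      # 110xxxxx
    if (lead >> 4) == 0b1110:
        return 3                      # 1110xxxx
    if (lead >> 3) == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a lead byte: 0x%02X" % lead)


def is_continuation(byte: int) -> bool:
    """Continuation bytes always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the start of its character (criterion 6)."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


data = "a€z".encode("utf-8")          # b'a\xe2\x82\xacz'
print(sequence_length(data[1]))       # lead byte of the euro sign
print(char_start(data, 3))            # offset 3 is in the middle of the euro sign
print(all(b >= 0x80 for b in "€".encode("utf-8")))  # criterion 2: no ASCII bytes inside
```

Because every byte of a multibyte sequence has its high bit set, a program scanning for the slash or the null byte can never be fooled by part of a non-ASCII character, which is what criteria 1 and 2 ask for.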
In addition consider the following three properties of UTF-8:
A) Although UTF-8 is less compact than UTF-16 for CJK (Chinese, Japanese, Korean) text, for CJK HTML documents that inefficiency is offset by UTF-8's efficiency in encoding HTML markup, which is ASCII. This characteristic has been key to UTF-8's dominance as a web encoding.
B) Unlike UTF-16 (and UTF-32), UTF-8 has no big-endian and little-endian variants, and thus does not need a Byte Order Mark.
C) Sorting UTF-8 byte sequences byte by byte gives the same order as sorting the corresponding decoded strings by Unicode code point.
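All three properties can be checked with a few lines of Python. This is only an illustrative sketch: the helper names and the sample strings are invented here, and the size comparison for A depends entirely on how much markup surrounds the text in a given document.

```python
def utf8_sort(strings):
    """Sort by raw UTF-8 bytes."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def utf16_sort(strings):
    """Sort by raw UTF-16 (big-endian) bytes, for comparison."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a Byte Order Mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# A) Bare CJK text is smaller in UTF-16, but ASCII markup tips the balance.
print(encoded_sizes("日本語"))          # (9, 6)
print(encoded_sizes("<p>日本語</p>"))   # (16, 20)

# B) UTF-16 has two byte orders; UTF-8 has only one serialization.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"), "A".encode("utf-8"))

# C) Python compares str values by code point, so sorted() is the reference order.
samples = ["z", "é", "\uFF5E", "日", "\U0001F600"]
print(utf8_sort(samples) == sorted(samples))    # True
print(utf16_sort(samples) == sorted(samples))   # False: surrogates sort before U+FF5E
```

The last line shows why property C is not a given: in UTF-16, characters outside the Basic Multilingual Plane are encoded with surrogates in the D800 to DFFF range, so they sort ahead of code points such as U+FF5E even though their code point values are higher.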
Big ups to Comcast for hosting this month!
As a chapter of Papers We Love we abide by and enforce the PWL Code of Conduct (https://github.com/papers-we-love/seattle/blob/master/code-of-conduct.md) at our events. Please give it a read, plan on acting like an adult, and involve one of the organizers if you need help.
Stop slacking and join us in the #seattle channel at https://papersweloveslack.herokuapp.com!
If you have a paper you'd like to present, or even just a mini, please hit up one of the organizers :) We're always looking for more presenters.