The Properties and Promises of UTF-8

Every 1st Thursday of the month

"The Properties and Promises of UTF-8", by Martin J. Dürst (1997).

Presented by Marvin Humphrey.

OVERVIEW:

Created by Ken Thompson one night in September 1992 on a placemat in a New Jersey diner, UTF-8 is a spectacular quintuple bank shot of design: optimal across so many criteria that it's hard to believe anybody could pull it off.

Over the years, UTF-8 has increasingly eclipsed all of the alternatives. UTF-1 and UTF-7 are obsolete and only of historical interest; UTF-32 has limited practical utility; UCS-2 can't express all of Unicode. The only serious competitor remaining is UTF-16, but UTF-8 has continued to gain market share.

In 1997, Martin J. Dürst gave a talk at the 11th International Unicode Conference in San Jose on "The Properties and Promises of UTF-8". Two decades later at Papers We Love San Diego, we will review those marvelous properties and contemplate all of the promises that UTF-8 has fulfilled.

DOWNLOAD:

https://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf

SLIDES:

http://www.rectangular.com/preso-utf8-props-promises.pdf

AUXILIARY MATERIALS:

Here is a Rob Pike email from 2003 describing the scene in the diner, along with historical emails from Ken Thompson in 1992 articulating some of the design criteria for UTF-8:

http://doc.cat-v.org/bell_labs/utf-8_history

SOME ADVICE ON READING THE PAPER:

Dürst's "paper" is really the transcription of a talk. Each page consists of a slide at the top, accompanied by text transcribing the presenter's narration over that slide. As a result, while the paper is 26 pages, it is not dense.

A significant portion of the paper is dedicated to heuristic detection of UTF-8. We will simply accept that heuristic detection of UTF-8 works very well, so feel free to gloss over the details of the heuristic algorithm.
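
For readers who want a feel for why detection works so well, here is a minimal sketch in Python. It is not Dürst's algorithm; it only assumes that a strict UTF-8 decoder is available and treats "decodes without error" as the heuristic. The function name looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Rough heuristic: does the byte string decode as strict UTF-8?

    Text in a legacy 8-bit encoding rarely happens to form valid UTF-8
    multibyte sequences, so a successful strict decode is strong evidence.
    Pure ASCII also passes, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")      # é becomes the two bytes C3 A9
latin1_bytes = "café".encode("latin-1")  # é becomes the single byte E9

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
```

Very short inputs can still pass by accident, so a check like this is more trustworthy on longer texts.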

If you don't have time to read the paper, contemplate the six design criteria enumerated by Ken Thompson in his 1992 email:

1) Compatibility with historical file systems: Historical file systems disallow the null byte and the ASCII slash character as a part of the file name.

2) Compatibility with existing programs: The existing model for multibyte processing is that ASCII does not occur anywhere in a multibyte encoding. There should be no ASCII code values for any part of a transformation format representation of a character that was not in the ASCII character set in the UCS representation of the character.

3) Ease of conversion from/to UCS.

4) The first byte should indicate the number of bytes to follow in a multibyte sequence.

5) The transformation format should not be extravagant in terms of number of bytes used for encoding.

6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream.
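
Criteria 2, 4 and 6 can be illustrated with a short Python sketch. It assumes the modern four-byte limit on UTF-8 sequences (the 1992 design allowed longer ones), and it does not reject every invalid lead byte. The helper names sequence_length and char_start are hypothetical.

```python
def sequence_length(first_byte: int) -> int:
    """Criterion 4: read the sequence length off the first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")


def char_start(data: bytes, pos: int) -> int:
    """Criterion 6: back up from pos to the first byte of its character."""
    # Continuation bytes always have the form 10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a/é中😀".encode("utf-8")

# Criterion 2: bytes of non-ASCII characters never look like ASCII.
non_ascii_bytes = "é中😀".encode("utf-8")
print(all(b >= 0x80 for b in non_ascii_bytes))

# Walk the string one character at a time using only the first bytes.
i = 0
while i < len(data):
    n = sequence_length(data[i])
    print(i, n, data[i:i + n].decode("utf-8"))
    i += n
```

Because no byte of a multibyte sequence falls in the ASCII range, a null byte or slash in the encoded data can only be a real null or slash, which is what criterion 1 requires.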

In addition, consider the following three properties of UTF-8:

A) UTF-8 is less compact than UTF-16 for CJK (Chinese, Japanese, Korean) text: most CJK characters take three bytes in UTF-8 versus two in UTF-16. In CJK HTML documents, however, that overhead is offset by UTF-8's one-byte encoding of the ASCII markup. This characteristic has been key to UTF-8's dominance as a web encoding.

B) Unlike UTF-16 (and UTF-32), UTF-8 has no big-endian and little-endian variants, and thus does not need a Byte Order Mark.

C) Sorting UTF-8 strings byte by byte yields the same order as sorting the decoded strings by Unicode code point.
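
The three properties can be checked with a small Python sketch. The sample strings and the markup are made up for illustration, and the comparison with UTF-16 in part C is an added contrast, not something claimed in the list above.

```python
def encoded_size(s: str, encoding: str) -> int:
    """Number of bytes s occupies in the given encoding."""
    return len(s.encode(encoding))


# A) CJK text alone versus the same text wrapped in ASCII markup.
# "utf-16-le" is used so that no byte order mark is counted.
cjk_text = "日本語のテキスト"  # hypothetical sample
cjk_html = '<p class="note"><a href="/index.html">' + cjk_text + "</a></p>"

text_utf8 = encoded_size(cjk_text, "utf-8")
text_utf16 = encoded_size(cjk_text, "utf-16-le")
html_utf8 = encoded_size(cjk_html, "utf-8")
html_utf16 = encoded_size(cjk_html, "utf-16-le")
print("text:", text_utf8, "vs", text_utf16)
print("html:", html_utf8, "vs", html_utf16)

# B) UTF-16 has two byte orders; UTF-8 has a single byte sequence.
e_acute = "\u00e9"
utf16_be = e_acute.encode("utf-16-be")
utf16_le = e_acute.encode("utf-16-le")
utf8_only = e_acute.encode("utf-8")
print(utf16_be, utf16_le, utf8_only)

# C) Bytewise UTF-8 order matches code point order.
# Python compares str values by code point, so sorted(words) is the reference.
words = ["z", "\U0001f600", "é", "\uff21", "中", "a"]
by_code_point = sorted(words)
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
print(by_utf8 == by_code_point)
print(by_utf16 == by_code_point)
```

In part C, bytewise UTF-16 order diverges because characters above U+FFFF are stored as surrogate pairs beginning with D8 to DB, which sort below code units such as FF21.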

ADMINISTRIVIA:

Street parking on 6th, 7th & 8th Avenues north of B Street is usually easy at that hour. Meters nearby are free after 6. Read signage before you park on A Street.

If you're interested in presenting a paper, please fill out this form (https://docs.google.com/forms/d/e/1FAIpQLScaI-fWdys27-ByT_HdtsJ73V4AxZr0hf1GSqLsQ1IwAaPdIQ/viewform) or talk to us in person at the meetup.