Demystifying Unicode: the Fundamentals of Encoding a Script



Networking 6:30-7:00 PM, and after the talk as time allows.
Parking: Park in our usual garage under the East Tower. (The West Tower garage will be closed to visitors that evening.) Walk along S. Almaden Blvd. to the Almaden Tower next door. Adobe security guards in the garage and lobbies can provide further directions.
Webcast: RSVP, then contact us to register for remote access.

The ability to internationalize, localize, and translate content requires support for languages and scripts at the most fundamental software layer. Today, this layer is the Unicode standard. Encompassing more than one hundred scripts and substantial locale data, Unicode enables user communities around the world to create and consume content in numerous languages. But how can the orthographic requirements of a script be supported when the basic technology that enables such support does not yet exist for it? The first step, of course, is to implement support for the script in Unicode.

This talk demystifies the process of transforming an analog script into a digitized encoding. The Unicode code chart that presents the encoded repertoire of a script is the result of a standardization process for which best practices or packaged solutions do not exist. We will learn how this can be done by evaluating two cases. The first is a historical southern Indic script known as ‘Pallava’, which travelled to Southeast Asia, Indonesia, and the Philippines, and is the ancestor of the major historical and modern scripts of these regions. The second is Lampung, a modern minority script used in Sumatra, Indonesia, which is a descendant of Pallava. These two scripts are not yet encoded in Unicode, and they present several issues for standardization.

The talk draws upon analyses of Pallava and Lampung sources to explain the decisions that inform the definition of encoding models, character repertoires, and representative glyphs for scripts. It also presents the technical issues involved in encoding these two scripts to illuminate various Unicode principles, such as the character-glyph model and unification.

Anshuman Pandey is a technology analyst, Asia specialist, linguist, and historian. Most recently, he was a post-doctoral researcher in Linguistics at UC Berkeley, where he was funded by a Google Research Award. He earned a Ph.D. in History from the University of Michigan. Anshuman has encoded more than twenty scripts in Unicode on behalf of the Script Encoding Initiative and is pursuing twenty more. He was awarded the 'Bulldog Award' by the Unicode Consortium in 2011 for his contributions to Unicode.

Admission is free for IMUG members and Adobe employees, and $5 for non-members. IMUG membership is only $20 for the first year, $15 for annual renewal, or $100 for lifetime membership. (At this time we are not charging for remote webcast attendance, but please support IMUG if you find these events useful.) You can join, renew, or pay a single non-member event fee via PayPal. Cash and checks are also accepted at our events.

Please RSVP via Meetup by 4:30 PM two days before the event to help IMUG and our hosts prepare badges for you in advance. After that time it's still OK to RSVP right up to the last minute, as that will help us ensure enough seats for everyone!

Adobe® Connect™ webcast: If you can't make it to downtown San Jose for this event, please join us via your browser. The URL is different every time. RSVP and then contact us to register. We are not charging for webcast access at this time. (Recording is subject to speaker approval, and will be announced after the event if available.)

Many thanks to Dr. Ken Lunde, Janice Campbell, and Adobe Globalization for hosting IMUG!

The hashtag for IMUG events is #IMUG408, honoring Silicon Valley's main area code. Follow @i18n_mug.