What would it take to localize interfaces into thousands of dialects and standard languages? Or to let a speaker of Zulgo-Gemzek in Cameroon search for “mázlə̀rpə́pa” and retrieve an image tagged with “исленг” by a speaker of Tati in Azerbaijan? In other words, how could we translate instantly across 50 million language pairs?
A critical enabler would be a corpus of lexical translations between arbitrary pairs of languages. PanLex, a project of The Long Now Foundation in San Francisco, is building such a corpus out of thousands of bilingual dictionaries, glossaries, wordlists, thesauri, standards, and other lexical resources, paper and digital. It now documents about 22 million lexemes in about 10,000 languages and dialects. It can supply a billion attested translation pairs, plus 30 billion distance-2 (bridged) translation pairs.
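To make the difference between attested and distance-2 (bridged) pairs concrete, here is a minimal sketch. It is not PanLex's actual data model or code: the tiny wordlist, the use of three-letter language codes, and the function names `build_graph` and `bridged_pairs` are all assumptions made for illustration. It treats each attested translation as an edge between two expressions and reports pairs that share an intermediate expression without being directly attested.

```python
from itertools import combinations

# Toy data invented for this sketch; each expression is (language code, text).
ATTESTED = [
    (("eng", "house"), ("spa", "casa")),
    (("eng", "house"), ("fra", "maison")),
    (("fra", "maison"), ("deu", "Haus")),
]


def build_graph(pairs):
    """Map every expression to the set of expressions it is attested with."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bridged_pairs(graph):
    """Return cross-language pairs linked through one shared intermediate
    expression but not attested directly (distance 2 in the graph)."""
    result = set()
    for neighbours in graph.values():
        # Any two neighbours of the same expression are at most 2 steps apart.
        for a, b in combinations(sorted(neighbours), 2):
            if a[0] != b[0] and b not in graph[a]:
                result.add((a, b))
    return result


if __name__ == "__main__":
    for a, b in sorted(bridged_pairs(build_graph(ATTESTED))):
        print(a, "<->", b)
```

In this toy data, "casa" and "maison" are bridged through "house", and "Haus" and "house" are bridged through "maison". Bridged pairs are inferred rather than attested, so they can be wrong whenever the intermediate expression is ambiguous.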
The PanLex team will show how it acquires resources, extracts data from them, and provides access to the data. The team will discuss its decisions on database design, operationalizing “word” and “language”, text encoding, polysemy and ambiguity, attribution and provenance, and API and human-interface design. You will also learn how you can help with the effort.
David Kamholz is PanLex’s Lexical Data Specialist. He has a Ph.D. in linguistics from the University of California, Berkeley. His research focuses on Austronesian languages, computational lexicography, historical linguistics, and language typology.
Alexander DelPriore, Gary Krug, and Benjamin Yang are Source Analysts, and Julie Anderson is a Source Acquisition Specialist, in the PanLex project. DelPriore is completing an MLIS at Rutgers with an emphasis in digital libraries and has worked on electronic publishing and cataloging. Krug is a computational linguist who has worked as a programmer and technical support engineer. Yang has worked as a linguistic engineer and voice interface designer. Anderson has an MA in linguistics from the University of Hawaii and has worked on language and software documentation and nonprofit management.
Jonathan Pool directs the PanLex project. He has taught at SUNY/Stony Brook and the University of Washington in Seattle and has published on the politics and economics of language and on artificial and controlled languages.
6:30-7:00 Social time with snacks
7:00-8:00 Presentation and discussion
8:00-8:30 Social time
Disclosure policies: Adobe requires all visitors to its office to sign a non-disclosure agreement in case they see or hear anything Adobe-confidential while there. Note, however, that all information shared at the meet-up itself is considered public and may be used by anyone at the meet-up with no restrictions. Therefore, please do not share proprietary information or intellectual property that you or your organization would not want to become public knowledge.
This meeting is in a branch office of Adobe Systems that is only one block from the SF Caltrain station, with all its rail and bus lines, and two blocks closer than Adobe’s main SF office.