The Token Wars


Details
The MLAI Meetup is a community for AI researchers and professionals which hosts monthly talks on exciting research. Our format is:
- 6:00 - 6:20: Socializing
- 6:20 - 6:40: Announcements and AI news
- 6:40 - 7:40: Talk(s) and Q&A
- 7:40 - 8:00: Networking
- 8:00: Head to the nearest pub for dinner
Speaker: Kathy Reid
Talk Title: The Token Wars: why not all our content should be open
Abstract: In recent years, there has been an explosion in generative AI. Most of us are now familiar with tools like ChatGPT, Midjourney, Sora, and others. At the heart of generative AI is a machine learning architecture called the "transformer", which is fed by huge datasets - text, images and videos. Those datasets are "tokenised" - cut up into chunks which the transformer can ingest. Those actors who can obtain the most tokens can generally train the best models (for various values of "best").
We are now witnessing a battle between the creators of generative AI models, who seek to obtain as much data as possible for tokenisation, and their targets, who try to stop them. The social ramifications of this resource conflict are widespread, resulting in "alateral damage" - a term I am coining to point to the unforeseen, unintended, distal consequences of a seemingly innocuous technology.
These are the Token Wars.
And they're the reason not all our content should be openly available.
In this three-part talk, I first provide a technical grounding on transformers, tokens and how they're used to build text-based generative AI. In the second part, I draw on economics to ask, "why are tokens so valuable?", showing that as the internet becomes filled with AI slop, human-created data is becoming more scarce - and so more expensive. In the third part I explore how you might approach guarding your token treasure, from data poisoning to alternative licensing models and data sovereignty.
You'll leave this talk never looking at data or ChatGPT the same way again.
Speaker Bio: Kathy Reid works at the intersection of open source, ML-enabled speech and language technologies and the people served - or not served - by them.
Her 25-year career spans leadership positions such as Digital Platforms and Operations Manager at Deakin University, Director of Developer Relations at Mycroft.AI and President of Linux Australia. More recently she has consulted to organisations such as NVIDIA in the area of speech data.
In 2019, she was one of 16 people from across the world selected to co-create a Masters Program in Applied Cybernetics at the Australian National University in Canberra, where she is now a PhD candidate researching voice data and bias in technologies like speech recognition. As part of her Masters program, she developed a world-first sensing mastectomy prosthetic called SenseBreast.
She currently contracts for Mozilla Common Voice as a linguistic engineer.
She *will* finish her PhD this year, she promises :-)
