Training Speech Recognition and Generation Models at Scale

Name: Training Speech Recognition and Generation Models at Scale
Start: 2025-03-11T18:00:00-04:00
End: 2025-03-11T20:00:00-04:00
Location: Venue

Hosted By

Jeremy M.

Training Speech Recognition and Generation Models at Scale

Details

After an almost five-year hiatus, we're thrilled to relaunch PyData Ann Arbor with an exceptional talk on speech AI using Python!

Talk Context
Speech technology is revolutionizing how we interact with computers and automate communication. From voice assistants helping us navigate our daily lives to real-time transcription enabling better accessibility in virtual meetings, speech-to-text (STT) and text-to-speech (TTS) technologies have become fundamental building blocks of modern applications. These technologies power everything from customer service voice agents and automated meeting notes to audiobook creation and voice cloning for content creators.

Join us as we welcome Matthew Lightman, a Senior Machine Learning Engineer from Deepgram, a leader in speech AI technology right here in Ann Arbor. Deepgram has pushed the boundaries of speech recognition accuracy and efficiency, making them a cornerstone of the speech AI ecosystem. Their state-of-the-art models are used by companies worldwide for everything from call center analytics to media subtitle generation.

This talk will dive deep into the fascinating world of training speech models at scale, exploring the unique challenges and considerations that set speech AI apart from traditional language models. Whether you're interested in machine learning, audio processing, or the future of human-computer interaction, you won't want to miss this insightful presentation from one of the leading companies in the field.

Talk abstract:
In the last few years, researchers have been able to create effective Large language models (LLMs) through self-supervised training on large quantities of text data from the internet. There are well studied scaling laws for the performance of LLMs as a function of the amount of training data, compute budget, and model size. Similar scaling laws apply in the domain of speech, including speech-to-text and text-to-speech model training. However, there are distinct considerations for training on speech data. For example: How are the ground truth transcripts for the speech produced? How do we make use of noisy versus clean speech? In this talk I will discuss such considerations, and how they impact training of speech-to-text and of text-to-speech models at scale.

P.S. They're hiring!

Events in Ann Arbor, MI AI Algorithms Python

Cloud Computing Data Science using Python