BISH Bash hosted by Adobe


Details
Live URL
Join us via Teams during the talks:
Join the meeting
Description
Adobe in San Francisco will be hosting the next BISH Bash on Thursday, August 1st! Please join us for some talks, networking, and bites. This time, we'll have a special Research Internship Talk Jam, given the number of interns in the Bay Area during the summer!
The event is currently sold out.
Agenda
6pm: Networking + Drinks + Pizza
6:30pm: Welcome Remarks (Oriol Nieto & Camille Noufi)
---Main Talks---
6:40pm: Gautham Mysore
7:00pm: Janne Spijkervet
---Research Interns Jam I---
7:20pm: Hugo Flores García
7:30pm: Margarita Geleta
7:40pm: Zachary Novack
7:50pm: Interns Q&A I
---Research Interns Jam II---
8:00pm: Lisa Dunlap
8:10pm: Ziyang Chen
8:20pm: Justin Lovelace
8:30pm: Interns Q&A II
---Networking---
8:40pm: Networking + Drinks + Pizza
9pm: End of event
Abstracts
- Audio AI Research at Adobe (Gautham Mysore, Head of Audio and Video AI Research at Adobe)
Generative AI is reducing the gap between idea and produced video. People will have all the creative control they desire to tell their story. They will co-create with AI. This is the future that we aim to create at Adobe and we have only scratched the surface of what will be possible. I will discuss our approach to the research area and a sampling of our technology.
- StemGen: A music generation model that listens (Janne Spijkervet, Machine Learning Researcher at TikTok)
We’ll dive deeper into generative modeling of musical audio using recent advances in deep learning. We’ll discuss an alternative paradigm for producing music generation models that can listen and respond to musical context.
- VampNet - The Voice As the Interface (and Other Techniques) for Masked Acoustic Token Models (Hugo Flores García, Adobe Intern)
We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training, which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference.
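For readers unfamiliar with masked token modeling, the sketch below illustrates the prompting-via-masks idea under toy assumptions: the embedding/linear pair stands in for a pretrained masked acoustic token model, and the confidence-based unmasking loop is a simplification for illustration, not VampNet's actual sampler.

```python
# Toy sketch of masked-token generation via mask "prompts":
# keep a subset of acoustic tokens (the prompt), mask the rest,
# and let a frozen token model iteratively fill them in.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, MASK_ID, SEQ_LEN = 1024, 1024, 32  # MASK_ID is an extra token (illustrative sizes)

# Stand-in for a pretrained masked acoustic token model.
embed = nn.Embedding(VOCAB + 1, 64)
head = nn.Linear(64, VOCAB)

tokens = torch.randint(0, VOCAB, (SEQ_LEN,))        # tokens of an input clip
keep = torch.rand(SEQ_LEN) < 0.25                   # the "prompt": 25% of tokens kept
x = torch.where(keep, tokens, torch.full_like(tokens, MASK_ID))

# Iteratively unmask: each round, predict all masked positions and commit
# the most confident predictions (a simplified confidence-based schedule).
for _ in range(4):
    with torch.no_grad():
        probs = head(embed(x)).softmax(-1)          # (SEQ_LEN, VOCAB)
        conf, pred = probs.max(-1)
    masked = x == MASK_ID
    if not masked.any():
        break
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    commit = conf.topk(k=max(1, int(masked.sum()) // 2)).indices
    x[commit] = pred[commit]

print("remaining masked positions:", int((x == MASK_ID).sum()))
```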
- Image-in-Audio Deep Steganography (Margarita Geleta, Dolby Intern)
This talk introduces an end-to-end deep steganographic method for embedding color images within audio waveforms. This technique opens new avenues for covert communication and multimedia applications.
- Describing Differences in Image Sets through Automated Data Science (Lisa Dunlap, Adobe Research Intern)
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, I'll introduce the task of automatically describing the differences between two sets of images in natural language using a combination of LLMs and VLMs.
- Images that Sound: Composing Images and Sounds on a Single Canvas (Ziyang Chen, Adobe Research Intern)
We use diffusion models to generate spectrograms that look like images but can also be played as sound. Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound.
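As a small illustration of the shared canvas the abstract refers to, the sketch below treats a 2D array both as a grayscale image and as an STFT magnitude that can be inverted to a waveform. The toy canvas and zero-phase inversion are illustrative assumptions only; the paper's method composes image and audio diffusion models on this canvas rather than hand-drawing it.

```python
# One canvas, two readings: any non-negative 2D array can be viewed as a
# grayscale image and, given a phase estimate, also rendered as audio via
# the inverse STFT. Here we only illustrate the rendering step.
import torch

n_fft, hop = 510, 128
freq_bins, frames = n_fft // 2 + 1, 64              # 256 x 64 canvas

# Toy "image": a bright diagonal stripe on the spectrogram canvas.
canvas = torch.zeros(freq_bins, frames)
for t in range(frames):
    canvas[min(freq_bins - 1, 4 * t), t] = 1.0

# Read the canvas as audio: treat values as STFT magnitudes, pick a phase
# (zero here; Griffin-Lim or a vocoder would sound better), and invert.
complex_spec = canvas.to(torch.complex64)            # magnitude with zero phase
audio = torch.istft(complex_spec, n_fft=n_fft, hop_length=hop,
                    window=torch.hann_window(n_fft))
print("waveform samples:", audio.shape[0])
```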
- Simple-TTS: Sample-Efficient Diffusion for Text-To-Speech Synthesis (Justin Lovelace, Adobe Research Intern)
We introduce Simple-TTS, a latent diffusion model that offers a simple, sample-efficient alternative to pipelined approaches for TTS synthesis. Operating in the latent space of a pre-trained audio autoencoder, Simple-TTS utilizes a single unified model for the entire generative process, producing audio representations conditioned only on the transcript. Unlike many prior systems, Simple-TTS does not require separate modules such as phoneme duration predictors or alignment techniques like monotonic alignment search.
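A rough sketch of what a single training step for a transcript-conditioned latent diffusion model might look like, assuming toy stand-ins for the frozen audio autoencoder and text encoder; the shapes, noise schedule, and module names are illustrative, not Simple-TTS's actual configuration.

```python
# Toy training step for transcript-conditioned latent diffusion:
# encode audio to latents, add noise, predict the noise given the transcript.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
LATENT_DIM, TEXT_DIM, T = 32, 16, 1000

# Frozen stand-ins for the pretrained audio autoencoder and text encoder.
audio_encoder = nn.Linear(256, LATENT_DIM).requires_grad_(False)
text_encoder = nn.Embedding(100, TEXT_DIM).requires_grad_(False)

# A single unified denoiser conditioned only on the transcript embedding
# and timestep: no duration predictor, no alignment module.
denoiser = nn.Sequential(nn.Linear(LATENT_DIM + TEXT_DIM + 1, 128),
                         nn.SiLU(), nn.Linear(128, LATENT_DIM))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

waveform = torch.randn(8, 256)                       # toy "audio" batch
transcript = torch.randint(0, 100, (8,))             # toy token per utterance

with torch.no_grad():
    z0 = audio_encoder(waveform)                     # clean audio latents
    text = text_encoder(transcript)                  # transcript conditioning

t = torch.randint(0, T, (8, 1)).float() / T          # random diffusion timestep
alpha = torch.cos(t * torch.pi / 2) ** 2             # toy cosine schedule
noise = torch.randn_like(z0)
zt = alpha.sqrt() * z0 + (1 - alpha).sqrt() * noise  # noised latent

pred = denoiser(torch.cat([zt, text, t], dim=-1))
loss = F.mse_loss(pred, noise)                       # standard epsilon objective
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```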
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation (Zachary Novack, Adobe Research Intern)
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained diffusion models at inference time by optimizing initial noise latents. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model.
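To make the "optimizing initial noise latents" idea concrete, here is a minimal, self-contained PyTorch sketch. The frozen sampler is a toy stand-in for a pretrained music diffusion model and the control objective is a hypothetical target-matching loss; only the overall shape of the procedure (freeze the model, backpropagate through sampling into the starting noise) mirrors what the abstract describes.

```python
# Toy sketch of inference-time optimization of an initial noise latent z0.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a frozen, pretrained differentiable sampler: maps an initial
# noise latent z0 to a generated output x (here just a vector, not audio).
sampler = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 16))
for p in sampler.parameters():
    p.requires_grad_(False)        # the model stays frozen; only z0 is optimized

def control_loss(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical control objective, e.g. matching a target intensity curve.
    target = torch.linspace(-1.0, 1.0, x.shape[-1])
    return torch.mean((x - target) ** 2)

# Optimize the initial noise latent at inference time.
z0 = torch.randn(1, 16, requires_grad=True)
opt = torch.optim.Adam([z0], lr=0.05)
for step in range(200):
    opt.zero_grad()
    x = sampler(z0)                # "sample" from the frozen model
    loss = control_loss(x)         # how far is the output from the target?
    loss.backward()                # gradients flow back to z0 only
    opt.step()

print(f"final control loss: {control_loss(sampler(z0)).item():.4f}")
```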
