Paper Discussion: SAM Audio: Segment Anything in Audio
Details
Recent advances in foundation models are reshaping how we understand perception across vision, audio, and language. Following the success of the Segment Anything models in computer vision (SAM 1, 2, and 3), Meta’s SAM Audio (SAM A) extends promptable segmentation to the audio domain, enabling open-ended sound separation driven by text prompts, visual cues, or temporal spans.
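To make the prompting idea concrete, here is a minimal Python sketch of what a promptable separation interface can look like: an audio mixture plus a text, visual, or time-span prompt goes in, and the prompted source comes out. The names `AudioPrompt` and `separate` are illustrative placeholders rather than the actual SAM Audio API, and the time-span gating merely stands in for a real separation model.

```python
# Hypothetical sketch of a promptable audio-separation interface, loosely
# mirroring the "prompt in, segment out" pattern of SAM-style models.
# All names here are illustrative, not the real SAM Audio API.

from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class AudioPrompt:
    """A prompt can be a text query, a reference image/frame, or a time span."""
    text: Optional[str] = None                  # e.g. "dog barking"
    visual: Optional[np.ndarray] = None         # e.g. a video frame of shape (H, W, 3)
    span: Optional[Tuple[float, float]] = None  # (start_sec, end_sec)


def separate(audio: np.ndarray, sample_rate: int, prompt: AudioPrompt) -> np.ndarray:
    """Return the waveform of the prompted source (placeholder implementation).

    A real promptable separator would encode the mixture and the prompt into a
    shared embedding space (e.g. via a multimodal perception encoder) and decode
    a source-specific waveform or mask. Here we simply gate the signal by the
    time span, if one is given, to illustrate the interface shape.
    """
    out = np.zeros_like(audio)
    if prompt.span is not None:
        start = int(prompt.span[0] * sample_rate)
        end = int(prompt.span[1] * sample_rate)
        out[start:end] = audio[start:end]
    return out


if __name__ == "__main__":
    sr = 16_000
    mixture = np.random.randn(sr * 5).astype(np.float32)  # 5 s of dummy audio
    barking = separate(mixture, sr, AudioPrompt(text="dog barking", span=(1.0, 3.0)))
    print(barking.shape, float(np.abs(barking).sum()) > 0)
```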
This research-focused meetup introduces SAM Audio and situates it within the broader trajectory of Perception Encoder, Meta’s NeurIPS Oral work on unified multimodal perception. Rather than viewing SAM Audio as a standalone system, we explore how promptability, audio–visual grounding, and general-purpose perception encoders come together to support flexible, interactive audio segmentation.
We will discuss the design motivations behind SAM Audio, the role of Perception Encoder / PE-AV as a shared multimodal backbone, and how prior work in audio–visual learning, contrastive audio–language alignment, and generative modeling influenced its architecture.
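As background for the contrastive audio–language alignment thread, the sketch below shows a CLAP/CLIP-style symmetric InfoNCE loss over paired audio and text embeddings. The embedding dimension, batch size, and temperature are illustrative assumptions, not values taken from the SAM Audio or Perception Encoder papers.

```python
# Minimal sketch of contrastive audio-text alignment (symmetric InfoNCE),
# one of the prior ideas the discussion situates SAM Audio's backbone in.

import torch
import torch.nn.functional as F


def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    a = torch.randn(8, 512)  # dummy audio embeddings
    t = torch.randn(8, 512)  # dummy text embeddings
    print(contrastive_alignment_loss(a, t).item())
```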
Related Resources:
- SAM Audio Blog: https://ai.meta.com/blog/sam-audio/
- SAM Audio demo: https://ai.meta.com/samaudio/
- SAM Audio paper: https://ai.meta.com/research/publications/sam-audio-segment-anything-in-audio/
- Perception Encoder paper: https://arxiv.org/abs/2504.13181
