In the 67th session of Multimodal Weekly, we have three exciting presentations on multimodal benchmarks, video prediction, and multimodal video models.

✅ Jieyu Zhang and Zixian Ma from the University of Washington will discuss Task-Me-Anything, a benchmark generation engine that produces a benchmark tailored to a user’s needs.

✅ Yiqi Zhong from Microsoft will discuss her paper Motion Graph Unleashed, a novel approach to video prediction that forecasts future video frames from limited past data.

✅ Mu Cai from the University of Wisconsin-Madison will present his recent work challenging and advancing the capabilities of multimodal video models.

  1. First, he will introduce two new benchmarks, TemporalBench and Vinoground, which evaluate the temporal dynamics and counterfactual reasoning capabilities of existing models. Both benchmarks focus on scenarios where questions cannot be answered from a single frame. Spoiler alert: existing models aren't great at them (to put it mildly).
  2. Second, he will present a novel approach, inspired by the Matryoshka doll, to improve the efficiency of multimodal models. It learns to compress the visual tokens in a nested fashion, significantly reducing the number of tokens that the subsequent language model needs to process (see the sketch after this list).
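
To make the nested-compression idea concrete, here is a minimal PyTorch sketch. It is an illustration rather than Mu Cai's implementation: the function name nested_visual_tokens, the 24x24 grid size, and the average-pooling scheme are all assumptions for the sake of the example. The key property it demonstrates is that each coarser token set is a pooled summary of the finer one, so the language model can consume as few or as many visual tokens as the compute budget allows.

```python
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens: torch.Tensor, grid: int = 24) -> list[torch.Tensor]:
    """Pool a square grid of visual tokens at progressively coarser
    scales, producing nested (Matryoshka-style) token sets.

    tokens: (batch, grid*grid, dim) features from a vision encoder.
    Returns a list of tensors, finest scale first.
    """
    b, n, d = tokens.shape
    assert n == grid * grid, "tokens must form a square grid"
    # Rearrange to (batch, dim, grid, grid) for 2D spatial pooling.
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
    scales = []
    side = grid
    while side >= 1:
        pooled = F.adaptive_avg_pool2d(x, side)            # (b, d, side, side)
        scales.append(pooled.flatten(2).transpose(1, 2))   # (b, side*side, d)
        side //= 2
    return scales

# Example: 576 tokens (a 24x24 grid) compress into nested sets of
# 576, 144, 36, 9, and 1 tokens.
features = torch.randn(2, 576, 1024)
for s in nested_visual_tokens(features):
    print(s.shape)
```

At inference time, one would feed the language model only the token set at the chosen scale, trading visual detail for a shorter sequence.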

Join the Multimodal Minds community to connect with fellow Twelve Labs users!

Multimodal Weekly is organized by Twelve Labs, a startup building multimodal foundation models for video understanding. Learn more about Twelve Labs here: https://twelvelabs.io/
