Hugo Laurençon | What Matters When Building Vision-Language Models?


Details
Virtual London Machine Learning Meetup.
Title: What matters when building vision-language models?
Speaker: Hugo Laurençon (Meta AI)
Paper: https://arxiv.org/abs/2405.02246
Abstract: The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of the models Idefics2 and Idefics3.
Bio: I spent three years at Hugging Face during my PhD, where my research focused on developing vision-language models and creating datasets for their training. I have just started a new role as an AI Research Scientist at Meta.
Agenda:
- 18:25: Virtual doors open
- 18:30: Talk
- 19:10: Q&A session
- 19:30: Close
Sponsor: Evolution AI - Generative AI-powered data extraction from financial documents.