

Vision-language models (VLMs) such as LLaVA, Qwen-VL, Nemotron VL, and GPT-4o have unlocked a new era in AI where language and sight converge into a single reasoning system. These models can describe images, answer visual questions, read documents, and interpret charts — but their true power emerges when you fine-tune them on domain-specific data, teaching them to see and reason about the visual world that matters to you.

In this hands-on session, we walk through the full fine-tuning pipeline for vision-language models. We begin by understanding how these architectures work — how a visual encoder like CLIP or SigLIP is coupled with a large language model through a projection layer, and how the interplay between image tokens and text tokens enables multimodal reasoning. We then move into the practical side: preparing image-text datasets, choosing the right fine-tuning strategy (full fine-tuning vs. LoRA vs. QLoRA), configuring training with frameworks like Hugging Face Transformers and PEFT, and evaluating results on real-world benchmarks.
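To make the coupling concrete, here is a minimal sketch of the projection step described above: patch embeddings from a (frozen) vision encoder are mapped into the language model's embedding space, so image tokens and text tokens share one input sequence. The two-layer MLP mirrors the LLaVA-style projector; the dimensions and token counts are illustrative, not taken from any specific checkpoint.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: CLIP/SigLIP-style encoder width -> LLM hidden size.
VISION_DIM, LLM_DIM = 1024, 4096

class Projector(nn.Module):
    """Two-layer MLP that lifts vision features into the LLM token space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Stand-in encoder output: batch of 2 images, 576 patch tokens each.
image_features = torch.randn(2, 576, VISION_DIM)
image_tokens = Projector(VISION_DIM, LLM_DIM)(image_features)

# Stand-in text token embeddings from the LLM's embedding table.
text_tokens = torch.randn(2, 32, LLM_DIM)

# Multimodal reasoning starts here: image and text tokens are simply
# concatenated along the sequence axis before entering the LLM.
multimodal_input = torch.cat([image_tokens, text_tokens], dim=1)
print(multimodal_input.shape)  # torch.Size([2, 608, 4096])
```

In real models the projector is often the only bridge between the two modalities, which is why it is a common target during fine-tuning.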
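The LoRA strategy mentioned above can also be sketched by hand (in practice the PEFT library wires this up for you): the pretrained weight is frozen, and a trainable low-rank update scaled by `alpha / r` is added alongside it. Ranks and layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        # Low-rank factors: only A and B are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so the adapted layer initially matches the base layer.
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # well under 1% of the layer
```

This is why LoRA (and its quantized variant QLoRA) makes fine-tuning large VLMs feasible on modest hardware: only the small `A` and `B` matrices receive gradients and optimizer state.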

We will also discuss when and why fine-tuning a VLM is the right approach compared to alternatives like prompt engineering or retrieval-augmented generation, and how to avoid common pitfalls such as catastrophic forgetting, overfitting on small visual datasets, and misaligned image-text pairs.

Bring your laptop, your curiosity, and — if you have one — a dataset you would like to try. By the end of the session you will have a clear, reproducible workflow for adapting any open-source vision-language model to a custom task.

**📅 Open to all levels — no prerequisites, just curiosity.**
