[PDG 484] Revisiting Multimodal Positional Encoding In Vision–Language Models
Details
Link to article: https://arxiv.org/pdf/2510.23095
Title: Revisiting Multimodal Positional Encoding In Vision–Language Models
Content: This paper systematically studies multimodal Rotary Positional Embedding (RoPE) for vision-language models. It finds that a good multimodal position encoding should maintain positional coherence, make full use of the available frequencies, and preserve the text-side positional priors inherited from the pre-trained language model. Based on these principles, the authors propose two simple drop-in methods, MHRoPE and MRoPE-I, that change how text and visual positions are encoded without changing the model architecture. Across a range of benchmarks, these variants consistently outperform prior multimodal RoPE approaches, with gains in both general understanding and fine-grained vision-language tasks.
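For participants who are new to RoPE, the following is a minimal sketch of standard one-dimensional rotary embedding, the text-only starting point that multimodal variants build on. It is not the paper's MHRoPE or MRoPE-I. The function names, the toy vectors, the base of 10000, and the choice to rotate consecutive pairs of dimensions are assumptions made for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by pos * theta_i (standard 1D RoPE).

    Assumption: pairs are (0, 1), (2, 3), ...; some implementations pair
    dimension i with i + d/2 instead.
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)  # frequency of the i-th pair
        angle = pos * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# The same offset (3) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The attention score between a rotated query and key depends only on the difference of their positions. Each pair of dimensions rotates at its own frequency, so a multimodal scheme has to decide how those frequencies are shared among the text, height, width, and time positions.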
Slack link: ml-ka.slack.com, channel: #pdg. Please join us. If you cannot join, message us here or email mlpaperdiscussiongroupka@gmail.com.
In the Paper Discussion Group (PDG) we discuss recent and fundamental papers in the area of machine learning on a weekly basis. If you are interested, please read the paper beforehand and join us for the discussion. If you have not fully understood the paper, you can still participate – everyone is welcome! You can join the discussion or simply listen in. The discussion is in German or English depending on the participants.
