Gemini Robotics: Bringing AI into the Physical World

Hosted By
Rakshak T.
Details

This will be a journal club event

Gemini Robotics: Bringing AI into the Physical World

This talk covers the research paper introducing the new Gemini Robotics-ER (embodied reasoning) model.

Speaker
Rakshak Talwar

Abstract
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. Generally useful robots need to be able to make sense of the physical world around them, and interact with it competently and safely. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks like folding an origami fox or playing a game of cards, learning new short-horizon tasks from as few as 100 demonstrations, adapting to completely novel robot embodiments including a bi-arm platform and a high degrees-of-freedom humanoid. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini’s multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as 3D understanding in the form of multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications, e.g., zero-shot (via robot code generation), or few-shot (via in-context learning). 
We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realize AI’s potential in the physical world.
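The abstract mentions supporting robotics applications zero-shot via robot code generation: the embodied-reasoning model is prompted with perception results and an instruction, and replies with a program that calls a simple robot API. The sketch below illustrates that pattern only in outline; `MockRobot` and `generate_robot_code` are hypothetical stand-ins, not the actual Gemini Robotics-ER API, and the "model" here is faked with a hand-written reply.

```python
# Illustrative sketch of the zero-shot "robot code generation" pattern.
# All names here (MockRobot, generate_robot_code) are made up for this
# example; they are not part of any real Gemini API.

class MockRobot:
    """Toy robot API the generated code is allowed to call."""
    def __init__(self):
        self.log = []

    def move_to(self, x, y):
        self.log.append(("move_to", x, y))

    def grasp(self):
        self.log.append(("grasp",))

def generate_robot_code(instruction, detections):
    # Stand-in for a call to an embodied-reasoning model. A real system
    # would send the instruction plus perception context (object points,
    # bounding boxes) to the model; here we fake its code reply.
    x, y = detections["banana"]
    return f"robot.move_to({x}, {y})\nrobot.grasp()"

# Perception output, e.g. 2D points from the model's detection/pointing.
detections = {"banana": (0.42, 0.17)}
code = generate_robot_code("pick up the banana", detections)

robot = MockRobot()
exec(code, {"robot": robot})  # execute the model-written program
print(robot.log)
```

The few-shot variant described in the abstract would instead place demonstration trajectories in the model's context rather than executing generated code.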

Info
Austin Deep Learning Journal Club is a group for committed machine learning practitioners and researchers alike. The group typically meets on the first Tuesday of each month to discuss research publications. The publications are usually ones that laid the foundation of ML/DL or explore novel, promising ideas, and they are selected by a vote. Participants are expected to read the publications so they can contribute to the discussion and learn from others. This is also a great opportunity to showcase your implementations and get feedback from other experts.

Sponsors:
Thank you to Capital Factory for sponsoring Austin Deep Learning. Capital Factory is the center of gravity for entrepreneurs in Texas. They meet the best entrepreneurs in Texas and introduce them to their first investors, employees, mentors, and customers. To sign up for a Capital Factory membership, click here.

Antler: Meet your co-founders, join our global community, and access capital to build and scale your company faster.

Austin Deep Learning

Every 1st Tuesday of the month until December 2, 2025

Capital Factory, Voltron Room (1st floor)
701 Brazos Street · Austin, TX
FREE