
About us

🖖 This group is for data scientists, machine learning engineers, and open source enthusiasts.

Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.

  • Are you interested in speaking at a future Meetup?
  • Is your company interested in sponsoring a Meetup?

Send me a DM on LinkedIn.

This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.

Upcoming events

  • Network event
    April 30 - Best of WACV 2026
    Online
    79 attendees from 48 groups

    Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.

    Date, Time and Location

    Apr 30, 2026
    9 AM - 11 AM Pacific
    Online.
    Register for the Zoom!

    Zero-Shot Coreset Selection via Iterative Subspace Sampling

    Deep learning's reliance on massive datasets leads to significant costs in storage, annotation, and training. Although coreset selection aims to mitigate these costs by finding performant data subsets, state-of-the-art methods typically require expensive ground-truth labels and dataset-specific training. To overcome these scalability issues, ZCore introduces a zero-shot approach that functions without labels or prior training on candidate data. Instead, ZCore uses foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. On ImageNet, ZCore outperforms previous label-based methods at a 90% prune rate while eliminating the need to annotate over one million images.
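The selection idea reads naturally as greedy sampling in embedding space. Below is a minimal, illustrative sketch (not the authors' ZCore implementation, which uses iterative subspace sampling): given precomputed foundation-model embeddings, it repeatedly picks the example least similar to anything already selected, rewarding coverage and penalizing redundancy, with no labels required.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_coreset(embeddings, budget):
    """Greedy zero-shot selection: repeatedly pick the example least
    similar to anything already selected (coverage), which also
    skips near-duplicates (redundancy). No labels needed."""
    selected = [0]  # seed with an arbitrary first example
    while len(selected) < budget:
        best_idx, best_score = None, None
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # redundancy = similarity to the closest already-selected example
            redundancy = max(cosine(embeddings[i], embeddings[j]) for j in selected)
            score = -redundancy  # low redundancy -> high coverage gain
            if best_score is None or score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

# Toy embedding space: indices 0 and 1 are near-duplicates
embs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [-1.0, 0.1]]
print(select_coreset(embs, 3))  # → [0, 3, 2]: the duplicate is pruned
```

The greedy loop is quadratic and purely for illustration; the point is that importance falls out of the embedding geometry alone.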

    About the Speaker

    Brent Griffin is a Principal Machine Learning Scientist at Voxel51 specializing in low-cost machine learning on unstructured data. Previously, he was the Perception Lead at Agility Robotics and an assistant research scientist at the University of Michigan conducting research at the intersection of computer vision, control, and robot learning. He is lead author on publications in all of the top IEEE conferences for computer vision, robotics, and control, and his work has been featured in Popular Science, in IEEE Spectrum, and on the Big Ten Network.

    ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks

    We present ENCORE, a post-hoc out-of-distribution (OOD) detection method grounded in the geometric properties of neural collapse in deep neural networks. By leveraging the observation that in-distribution features align with class means while OOD features tend to be misaligned or orthogonal, ENCORE modifies inference through cosine-based scoring and adaptive feature scaling to enhance separation between known and unknown inputs. The method approximates neural collapse behavior at test time without requiring retraining, enabling more reliable uncertainty estimation. It is lightweight, memory-efficient, and compatible with a wide range of architectures, including convolutional networks and vision transformers. Extensive experiments on standard benchmarks demonstrate consistent improvements over existing OOD detection approaches in both near- and far-distribution shifts.
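The core scoring idea, as described, fits in a few lines. The following is an illustrative toy, not the ENCORE code, and it omits the adaptive feature scaling: score each test feature by its maximum cosine similarity to the class-mean directions, and treat low scores as likely OOD.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def class_means(features, labels):
    """Unit-normalized mean feature per class: the directions that
    in-distribution features collapse toward under neural collapse."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: unit([x / counts[y] for x in s]) for y, s in sums.items()}

def ood_score(feature, means):
    """Max cosine similarity to any class mean; a low value means the
    feature is misaligned with every class direction, i.e. likely OOD."""
    f = unit(feature)
    return max(sum(a * b for a, b in zip(f, m)) for m in means.values())

# Two tight in-distribution clusters along the axes
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]
means = class_means(train_feats, train_labels)

print(round(ood_score([0.95, 0.05], means), 2))  # aligned with class 0 -> high
print(round(ood_score([-1.0, -1.0], means), 2))  # opposed to both -> low
```

Being post hoc, nothing here touches the network's weights; the method only reinterprets features already produced at inference time.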

    About the Speaker

    A.Q.M. Sazzad Sayyed is a Ph.D. candidate in Electrical and Computer Engineering at Northeastern University, focusing on robust, secure, and efficient deep learning. His research centers on out-of-distribution detection, uncertainty modeling, and machine learning reliability for safety-critical and edge AI systems.

    Synthesizing Compositional Videos from Text Description

    Existing pre-trained text-to-video diffusion models can generate high-quality videos, but often struggle with misalignment between the generated content and the input text, particularly when composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR for test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, enabling the generation of multiple objects in a scene while modeling their actions and interactions.

    Additionally, we extend our approach to the Multi-Action video generation setting, where only the specified action should vary across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent interpolation strategy.

    About the Speaker

    Shanmuganathan Raman is a prominent academic and researcher in the fields of computer vision, deep learning, computational photography, and computer graphics. He is a Professor at the Indian Institute of Technology Gandhinagar (IIT Gandhinagar), where he holds a joint appointment in the Departments of Electrical Engineering and Computer Science and Engineering. He serves as the Head of the Department of Computer Science and Engineering at IIT Gandhinagar.

    The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs

    Multimodal large language models can answer impressively complex visual questions, but do they truly understand what they see? We present The Perceptual Observatory, a framework for characterizing robustness and grounding in MLLMs beyond standard leaderboard scores. We evaluate models on interpretable tasks such as image matching, grid pointing game, and attribute localization across pixel-level corruptions and diffusion-based stylized illusions. Our analysis reveals that scaling the language model alone does not guarantee better perceptual grounding, uncovering systematic weaknesses in robustness, spatial invariance, fairness, and reasoning-based perception. The Perceptual Observatory offers a more principled way to study multimodal perception and provides actionable insights for building future MLLMs that are reliable and truly grounded in visual evidence.

    About the Speaker

    Fenil Bardoliya is a Researcher at the Complex Data Reasoning & Analysis Lab (CORAL) at Arizona State University. His research revolves around Multimodal Model Evaluation and Benchmarking, Machine Unlearning, and Structured Reasoning.

    4 attendees from this group
  • Network event
    May 1 - Best of WACV (Day 2)
    Online
    61 attendees from 48 groups

    Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.

    Date, Time and Location

    May 01, 2026
    9 AM - 11 AM Pacific
    Online.
    Register for the Zoom!

    Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality

    Recent advances in 3D Gaussian Splatting (3DGS) enable high-quality rendering but fall short of mixed reality's demanding requirements for high refresh rates, stereo viewing, and limited compute budgets. We propose a perception-guided, continuous level-of-detail framework that exploits human visual system limitations through a lightweight, gaze-contingent model to predict and adaptively modulate rendering quality across the visual field, maximizing perceived quality under compute constraints.

    Combined with an edge-cloud collaborative rendering framework for untethered MR devices, our method achieves superior computational efficiency with minimal perceptual quality loss compared to vanilla and foveated baselines, validated through objective metrics and user studies.
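The gaze-contingent idea can be illustrated with a toy level-of-detail policy. This is purely a sketch: the paper uses a learned perceptual model, whereas the falloff function and the 2.5-degree constant below are arbitrary illustrative choices.

```python
def detail_budget(eccentricity_deg, max_lod=8):
    """Gaze-contingent level of detail: full quality at the fovea,
    decaying with angular distance from the gaze point. A crude
    stand-in for a learned model of peripheral acuity falloff."""
    falloff = 2.0 ** (-eccentricity_deg / 2.5)  # illustrative constant
    return max(1, round(max_lod * falloff))

print(detail_budget(0.0))   # full detail at the gaze point
print(detail_budget(10.0))  # minimal detail far in the periphery
```

The practical payoff is that most Gaussians land in low-acuity regions, so compute can be concentrated where the viewer can actually perceive the difference.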

    About the Speaker

    Sai Harsha Mupparaju is an M.S. Computer Science student at NYU working in the Immersive Computing Lab with Prof. Qi Sun, where he focuses on 3D Gaussian Splatting, neural rendering, and perceptual VR/MR systems. He previously earned his undergraduate degree from BITS Pilani and conducted research at the Indian Institute of Science (IISc). His research has been published at IEEE WACV 2026 and ACM SIGGRAPH 2024, and in ACM Transactions on Graphics.

    SAVIOR: Sample-efficient Adaptation of Vision-Language Models for OCR Representation

    OCR pipelines and vision-language models systematically underperform on document patterns critical to financial workflows, such as vertical text, logo-embedded vendor names, degraded scans, and complex multi-column layouts. While underrepresented in public datasets, these patterns constitute a substantial portion of real-world failure cases.

    We introduce SAVIOR, a sample-efficient data curation methodology that targets such high-impact failure scenarios to adapt vision-language models for robust financial OCR, and PaIRS, a structure-aware evaluation metric that measures layout fidelity by comparing pairwise spatial relationships between tokens. When fine-tuned with SAVIOR-Train, Qwen2.5-VL-Instruct demonstrates robust financial OCR performance, outperforming both open and closed-source baselines including GPT-4o, Mistral-OCR, PaddleOCR-VL, and DeepSeek-OCR.
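The paper's PaIRS details aren't reproduced here, but the idea of scoring layout fidelity from pairwise spatial relationships between tokens can be illustrated with a toy metric. The function names and the coarse sign-based relation encoding below are assumptions for illustration only.

```python
from itertools import combinations

def relation(box_a, box_b):
    """Coarse spatial relation between two token boxes (x0, y0, x1, y1):
    the sign of the horizontal and vertical offset between centers."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    sign = lambda d: (d > 0) - (d < 0)
    return sign(bx - ax), sign(by - ay)

def pairwise_layout_fidelity(gt_boxes, pred_boxes):
    """Fraction of token pairs whose spatial relation is preserved in
    the prediction -- an illustrative PaIRS-style layout score."""
    pairs = list(combinations(range(len(gt_boxes)), 2))
    if not pairs:
        return 1.0
    kept = sum(
        relation(gt_boxes[i], gt_boxes[j]) == relation(pred_boxes[i], pred_boxes[j])
        for i, j in pairs
    )
    return kept / len(pairs)

# Three tokens; the "prediction" flips the third token above the others
gt   = [(0, 0, 10, 5), (20, 0, 30, 5), (0, 10, 10, 15)]
pred = [(0, 0, 10, 5), (20, 0, 30, 5), (0, -10, 10, -5)]
print(pairwise_layout_fidelity(gt, gt))    # perfect layout -> 1.0
print(pairwise_layout_fidelity(gt, pred))  # only 1 of 3 relations preserved
```

Unlike plain string metrics, a pairwise score like this penalizes a transcription that reads correctly but scrambles the multi-column structure.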

    About the Speaker

    Akshata Bhat is an AI/ML Research Engineer at Hyprbots Inc. Her research interests include multimodal learning, vision-language models, and large-scale document understanding systems.

    SynthForm: Towards a DLA-free E2E Form understanding model

    We present SynthForm-3k, the first large-scale publicly available dataset of synthetically perturbed forms, comprising 3,417 samples across six domains: taxation, immigration, finance, healthcare, dental, and insurance. Ground-truth Markdown is constructed via an intermediate HTML representation generated by GPT-5 under high-reasoning inference, followed by deterministic HTML-to-Markdown conversion and scan-like perturbations (dust, scan lines, blur, rotation) that simulate real-world faxed and scanned documents.

    We further introduce SynthForm-VL, a family of 2B, 4B, and 8B models obtained via full-parameter supervised fine-tuning of Qwen3-VL on this dataset. All three variants outperform their respective baselines, yielding ANLS improvements of +5.8, +9.3, and +10.3, with the fine-tuned 2B model exceeding the performance of the 4× larger Qwen3-VL-8B baseline — demonstrating that targeted domain adaptation on perturbation-robust data offers a more favorable cost–performance tradeoff than scale alone for structured form understanding.
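For context on the reported gains, ANLS (Average Normalized Levenshtein Similarity) is the standard string-similarity metric in document understanding. A minimal implementation looks like this; it is an illustrative sketch using the conventional 0.5 threshold.

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """Average Normalized Levenshtein Similarity: 1 minus the normalized
    edit distance, zeroed out below the threshold tau."""
    scores = []
    for p, r in zip(predictions, references):
        nl = levenshtein(p, r) / max(len(p), len(r), 1)
        s = 1.0 - nl
        scores.append(s if s >= tau else 0.0)
    return sum(scores) / len(scores)

print(round(anls(["Invoice", "Totl"], ["Invoice", "Total"]), 3))  # → 0.9
```

ANLS is usually reported scaled by 100, so an improvement of "+5.8" corresponds to roughly 0.058 on this 0-to-1 scale.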

    About the Speaker

    Andre Fu is an ML researcher and founder whose work spans multimodal learning, GPU inference infrastructure, and document understanding, with prior publications at NeurIPS, ICCV, CVPR, and WACV. He has worked in document processing and ML for the last 4 years, specializing in InsurTech, FinTech & HealthTech use cases.

    4 attendees from this group
  • Network event
    May 6 - Building Composable Computer Vision Workflows in FiftyOne
    Online
    75 attendees from 48 groups

    This workshop explores the FiftyOne plugin framework to build custom computer vision applications. You’ll learn to extend the open source FiftyOne App with Python-based panels and server-side operators, as well as integrate external tools for labeling, vector search, and model inference into your dataset views.

    Date, Time and Location

    May 6, 2026
    10 AM - 11 AM Pacific
    Online. Register for the Zoom!

    What You'll Learn

    • Build Python plugins. Define plugin manifests and directory structures to register custom functionality within the FiftyOne ecosystem.
    • Develop server-side operators. Write functions to execute model inference, data cleaning, or metadata updates from the App interface.
    • Build interactive panels. Create custom UI dashboards to visualize model metrics or specialized dataset distributions.
    • Manage operator execution contexts. Pass data between the App front end and your backend to build dynamic user workflows.
    • Implement delegated execution. Configure background workers to handle long-running data processing tasks without blocking the user interface.
    • Build labeling integrations. Streamline the flow of data between FiftyOne and annotation platforms through custom triggers and ingestion scripts.
    • Extend vector database support. Program custom connectors for external vector stores to enable semantic search across large sample datasets.
    • Package and share plugins. Distribute your extensions internally and externally.
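As a taste of the first bullet, a FiftyOne plugin is described by a fiftyone.yml manifest at the plugin root. The sketch below is hypothetical (the plugin and operator names are invented, and field names should be checked against the FiftyOne plugin documentation):

```yaml
# fiftyone.yml -- hypothetical plugin manifest (names invented for illustration)
name: "@my-org/hello-world"
description: "Example plugin with two server-side operators"
version: "0.1.0"
operators:
  - run_inference
  - clean_metadata
```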
    2 attendees from this group
  • Network event
    May 7 - Visual AI in Healthcare
    Online
    239 attendees from 48 groups

    Join us to hear experts on cutting-edge topics at the intersection of AI, ML, computer vision and healthcare.

    Date, Time, and Location

    May 07, 2026
    9 AM Pacific
    Online.
    Register for the Zoom!

    Representation Learning Under Weak Supervision in Computational Pathology

    Computational pathology has advanced rapidly with deep learning and, more recently, pathology foundation models that provide strong transferable representations from whole-slide images. Yet important gaps remain: pretrained features often retain domain shift relative to downstream clinical datasets, and most existing pipelines do not explicitly model the geometric organization of tissue architecture that underlies disease progression.

    In this talk, I will present our work on weak- and semi-supervised representation learning methods designed to address these challenges, including adaptive stain separation for contrastive learning, bag-label-aware contrastive pretraining for multiple-instance learning, and distance-aware spatial modeling that injects tissue geometry into slide-level prediction. These methods reduce dependence on dense annotations while improving the quality, robustness, and clinical relevance of learned representations in histopathology. Across kidney and prostate cancer studies, they produce stronger downstream performance than standard self-supervised, semi-supervised, and MIL baselines, including improved classification on ccRCC datasets and more accurate prediction of metastatic risk from diagnostic prostate biopsies.

    About the Speaker

    Dr. Tolga Tasdizen is Professor and Associate Chair of Electrical and Computer Engineering and a faculty member of the Scientific Computing and Imaging Institute at the University of Utah, where he works on AI and machine learning for image analysis with applications in biomedical imaging, public health, and materials science. His research spans self- and semi-supervised learning, domain adaptation, and interpretability.

    Efficient and Reliable AI for Real-World Healthcare Deployment

    Healthcare is one of the highest-impact domains for AI, yet reliable deployment at scale remains difficult. To truly improve patient care and clinical workflows, AI must operate under real clinical constraints, not just in ideal lab settings. In practice, deployment is limited by high compute and memory costs, scarce labeled data, and distribution shifts across sites and time. Many clinically important findings are also rare and long-tailed, which makes generalization especially challenging. My research makes deployability a design objective by developing methods that stay accurate under strict resource and data constraints.

    In this talk, I will first discuss high-performance lightweight deep learning architectures built by redesigning core building blocks. I will then present training-time generative supervision strategies that improve data efficiency and generalization to rare and long-tailed cases with no inference overhead. I will conclude with a forward-looking direction toward real-time perception for surgical assistance, where reliable performance under strict constraints is non-negotiable.

    About the Speaker

    Md Mostafijur Rahman is a Ph.D. candidate at The University of Texas at Austin, advised by Radu Marculescu. His research sits at the intersection of AI, biomedical imaging, and computer vision, with a focus on building efficient, reliable, and scalable AI systems for deployment in healthcare under real-world constraints. His work has been translated to practice through research internships at GE Healthcare, the National Institutes of Health (NIH), and Bosch Research.

    VIGIL: Vectors of Intelligent Guidance in Long-Reach Rural Healthcare

    VIGIL (Vectors of Intelligent Guidance in Long-Reach Rural Healthcare) is an AI-driven system designed to support generalist clinicians through interactive, multimodal guidance. The system combines perception, language understanding, and tool use to assist with tasks such as ultrasound acquisition and interpretation in real time. In this talk, we focus on the overall system architecture, highlighting how individual components—ranging from visual models to medical reasoning agents—interact to produce coherent guidance. We also discuss key challenges we have encountered, including tool orchestration, latency, and robustness across components. This presentation aims to provide a systems-level perspective on building embodied AI agents for real-world healthcare settings.

    About the Speaker

    Andrew Krikorian is a Ph.D. student in Robotics at the University of Michigan, where he is a member of the Corso Group (COG). His research focuses on building physically grounded AI agents that combine perception, tool use, and planning to operate effectively in real-world environments, with a particular emphasis on healthcare applications. He is actively involved in the ARPA-H PARADIGM program, developing intelligent systems for rural clinical settings.

    Scaling Healthcare AI with Synthetic Data and World Models

    The scarcity of labeled, privacy-compliant medical imaging data remains one of the biggest bottlenecks in healthcare AI development. Emerging world models are changing this landscape by generating high-fidelity synthetic data — from radiology scans to surgical scene simulations — that can augment real-world datasets without compromising patient privacy. However, synthetic data is only as valuable as your ability to curate, validate, and evaluate it alongside real clinical data. In this talk, we explore how teams are using FiftyOne to build rigorous quality pipelines around synthetic medical imagery, enabling them to detect distribution gaps, measure model performance across rare pathologies, and ensure that generated samples meaningfully improve downstream diagnostics. We'll walk through practical workflows that combine world model outputs with real-world medical datasets to accelerate Visual AI in healthcare — responsibly and at scale.

    About the Speaker

    Daniel Gural is an expert in Physical AI and has been working in the field for over 8 years. Across healthcare, he has experience both with operational use cases and with applying Visual AI as an aid in psychology applications.

    7 attendees from this group

Members

5,089