Details

Abstract
In this talk, I'll share how I tackled the challenge of filtering dead webpages at petabyte scale by combining AI, machine learning, and strategic preprocessing techniques. I'll walk through my approach to separating pages with meaningful content from empty or dead pages, starting with data science techniques for exploratory analysis and then using AI to automate the labeling process.

You'll see how I arrived at a production-grade solution that operates at massive scale, along with the key architectural decisions that made it work in a real-world, high-volume environment. Whether you're dealing with large-scale data pipelines or are interested in practical applications of AI to data quality problems, you'll learn how to approach similar challenges in your own infrastructure.

About the Speaker
Yair is a dedicated leader in data science and machine learning, with a deep commitment to driving innovation through advanced analytics and scalable infrastructure. His career spans various industries, where he has specialized in building and leading high-performing teams that bridge the gap between data science and engineering.

With a track record at companies such as Anaplan and Oracle, he has led the delivery of impactful machine learning projects, automated complex workflows, and championed best practices in model development and infrastructure scaling. His focus is on enabling organizations to harness the power of machine learning by building solutions that support business intelligence, analytics, and data-driven decision-making.

Yair thrives in environments that prioritize collaboration, innovation, and continuous learning. Whether working on cloud-based architectures (Azure, AWS), mentoring teams to excel in their roles, or contributing to AI-driven initiatives, he is passionate about advancing technology and empowering teams to deliver impactful results.
