How Uber Uses Open Source Big Data Technologies


Details
Uber Engineering is excited to welcome you to our first Uber Open Source meetup in Seattle. The event will highlight how we use Hadoop, Spark, and other open source big data technologies to power our backend ecosystem. During this meetup, we will present how we use an operator framework to distribute Uber’s machine learning platform, profile Spark applications at scale with the Uber Hadoop Profiler, and create log observability for Hadoop workloads.
Join us to learn more about how Uber embraces open source technology and culture to create an open and innovative engineering organization.
Agenda
6:00pm - 6:25pm - Doors open, food, and drinks
6:25pm - 6:30pm - Welcome
6:30pm - 6:50pm - Mingshi Wang - Distributing Uber's Machine Learning Platform with Operator Frameworks
6:50pm - 7:10pm - Bo Yang - Profiling Spark Applications at Scale
7:10pm - 7:30pm - Nan Zhu - LogX: Creating Log Observability in Hadoop Workloads at Uber
7:30pm - 8:00pm - Q&A and Networking
More information about the talks:
Distributing Uber's Machine Learning Platform with Operator Frameworks
Speaker: Mingshi Wang
Uber is committed to developing technologies that create seamless, impactful experiences for our customers. We are increasingly investing in artificial intelligence (AI) and machine learning (ML) to fulfill this vision. We built Michelangelo, our ML-as-a-service platform, to democratize machine learning and make it easy to scale AI to meet the needs of our business. In this presentation, we will introduce the architecture of Michelangelo and dive deep into how we use the operator framework to scale the system. We will also talk about the application of the operator framework to two specific problems: partitioned model training and hyperparameter searches that allow users to train accurate models with large-scale data sets.
Profiling Spark Applications at Scale
Speaker: Bo Yang
The Uber Hadoop Profiler provides a Java Agent that collects metrics and stack traces from Hadoop/Spark JVM processes in a distributed way. Among other features, it provides advanced profiling capabilities: it can trace arbitrary Java methods and their arguments in user code without requiring any code changes, trace HDFS NameNode call latency for each Spark application, and identify NameNode bottlenecks. In this presentation, we will discuss how the Uber Hadoop Profiler enables us to profile Spark applications at scale across our microservices.
LogX: Creating Log Observability in Hadoop Workloads at Uber
Speaker: Nan Zhu
At Uber, log files are one of the most important sources supporting our ability to debug and tune data analysis applications and cluster operations. While manual log analysis is adopted by many organizations, Uber’s size makes this analysis difficult to scale. To automatically analyze the massive volume of logs produced in our data analysis clusters, we developed LogX, a tool that persists and analyzes logs and derives insights ranging from high-level operational metrics to fine-grained root cause analysis for failed applications. In this talk, we will discuss how we built LogX and share lessons learned from developing this data analysis pipeline.
