At this month's WePay engineering meetup, we'll host presenters from LinkedIn and PayPal. The focus will be on managing your entire data ecosystem (frameworks, data sources, access, etc.) as your organization grows.
We'll provide pizza and drinks. The presentations will also be live streamed and recorded. The live stream link will be posted here at 6:30pm (Pacific).
--- Schedule ---
6:00-6:30 Meet, greet, eat and drink
8:00-9:00 Community open-mic, announcements, discussion, etc.
--- Talk 1 ---
Title: Exploiting the Data/Code Duality: Applying Modern Software Development Practices to Data with Dali
Presenter: Carl Steinbach @ LinkedIn
Summary: Most large software projects in existence today are the result of the collaborative efforts of hundreds or even thousands of developers. These projects consist of millions of lines of code and leverage a plethora of reusable libraries and services provided by third parties. Projects of this scale would not be possible without the tools and processes that now define the practice of modern software development: language support for decoupling the interface from the implementation, version control, semantic versioning of artifacts, dependency management, issue tracking, peer review of code, integration testing, and the ability to tie all of these things together with comprehensive code search and dependency tracking mechanisms.
We have observed similar forces at play in the world of big data. At LinkedIn the number of people who produce and consume data, the number of datasets they need to manage, and the rate at which these datasets change are all growing at an exponential rate. This has resulted in a host of problems: rampant duplication of business logic and data, increasingly fragile and hard-to-maintain data pipelines, and schemas that are littered with deprecated fields due to the prohibitive costs of making backward-incompatible changes. To cope with these challenges we built Dali, a unified data abstraction layer for offline (Hadoop, Spark, Presto, etc.) and nearline (Kafka, Samza) systems that enables data engineers to benefit from the same processes and infrastructure that are already used by LinkedIn’s software engineers.
In this talk I will explain how Dali employs virtual SQL views to decouple the API of a dataset from the details of its implementation, describe how view versioning and dependency tracking allow us to make backward-incompatible changes without breaking downstream consumers, and review the ways we have integrated Dali with the rest of LinkedIn’s software development ecosystem. Finally, I will discuss how we leveraged Dali in several company-wide initiatives, including GDPR and the redesign of the LinkedIn mobile app.
--- Talk 2 ---
Title: Gimel: PayPal’s Analytics Data Platform
Presenters: Romit Mehta @ PayPal, Deepak (DC) Chandramouli @ PayPal
Summary: At PayPal, data engineers, analysts and data scientists work with a variety of data sources (Messaging, NoSQL, RDBMS, Documents, TSDB), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL) and execution models (stream, batch, interactive).
Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc., which lengthens time-to-market (TTM). To solve this problem and to make product development more effective, PayPal Data Platform developed "Gimel", a unified analytics data platform that provides access to any storage through a single unified data API and SQL, both powered by a centralized data catalog.
In this session, we will introduce you to the various components of Gimel: Compute Platform, Data API, PCatalog, GSQL and Notebooks. We will provide a demo showing how Gimel reduces TTM by letting our engineers write a single line of code to access any storage without needing to know the complexity behind the scenes.