BDAM 01/31: Self Service Data Lakes, Apache Spark & Apache Ignite, and more


Details
Welcome to the first BDAM of 2018 - Shoutout to GridGain for kindly sponsoring this meetup!
GridGain will also be giving away a CanaKit Raspberry Pi 3 Starter Kit in a raffle! Enter the raffle on the day of the event for a chance to win.
AGENDA
6:00 - 6:30 - Socialize over food and beverages
6:30 - 8:00 - Talks
TALKS
Talk 1: Building a Self-Service Data Lake on Google Cloud Platform, by Ali Anwar, Cask
Talk 2: Apache Spark and Apache Ignite: Where Fast Data Meets the IoT, by Denis Magda, GridGain
Talk 3: Scalable Clusters on Demand, by Gustavo Torres and Bogdan Kyryliuk, Opendoor
ABSTRACTS
Talk 1: Building a Self-Service Data Lake on Google Cloud Platform, by Ali Anwar, Cask
With the latest technology options for big data processing, storage, and resource management easily accessible in the cloud, more and more organizations are ready to build their data lake in the cloud. But as in the on-premises world, challenges remain with respect to integrating data, operationalizing, securing and governing the data lake, and enabling self-service access to data with “IT guardrails”.
In this talk, Ali Anwar will demonstrate how Cask Data Application Platform (CDAP) helps architects, developers and data scientists avoid the complexities and inefficiencies of the messy and diverse nature of big data, and how to use its comprehensive platform capabilities, frameworks and self-service tools to go from data prep to a fully operational data lake on the Google Cloud Platform (GCP). Ali will highlight GCP-specific integrations in CDAP, and describe popular use cases such as Change Data Capture, cloud migration and machine learning/AI.
Talk 2: Apache Spark and Apache Ignite: Where Fast Data Meets the IoT, by Denis Magda, GridGain
It is not enough to build a mesh of sensors or embedded devices to obtain more insights about the surrounding environment and optimize your production systems. Usually, your IoT solution needs to be capable of transferring enormous amounts of data to storage or the cloud, where the data has to be processed further. Quite often, the processing of the endless streams of data has to be done in real time so that you can react to the IoT subsystem's state accordingly.
This session will show attendees how to build a Fast Data solution that receives endless streams from the IoT side and processes them in real time using Apache Ignite's cluster resources.
Talk 3: Scalable Clusters on Demand, by Gustavo Torres and Bogdan Kyryliuk, Opendoor
At Opendoor, we do a lot of big data processing, and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory and CPU needs, and the load is not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition, we need to enable data scientists and engineers to run their code without having to upgrade the cluster for every request or deal with dependency hell.
To solve these problems, we introduce a lightweight integration across popular tools like Kubernetes, Docker, Airflow and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down the cost using autoscaling and spot pricing, and unify DAGs across many teams with different stacks on a single Airflow instance.
SPEAKERS
- Ali Anwar is a software engineer at Cask, where he is working on Cask Data Application Platform (CDAP). Prior to Cask, Ali attained his undergraduate degree in Computer Science from the University of California, Berkeley.
- Denis Magda is a Director of Product Management at GridGain Systems and Apache Ignite PMC Chair. He is an expert in distributed systems and platforms. Before joining GridGain and becoming a part of the Apache Ignite community, he worked for Oracle, where he led the Java ME Embedded Porting Team, helping Java cross new boundaries by entering the IoT market.
- Gustavo Torres is a software engineer at Opendoor, where he is working on scaling Opendoor’s pricing infrastructure. Prior to Opendoor he worked at Google on Search Ads and App Engine serving infrastructure.
- Bogdan Kyryliuk is the data infrastructure team lead at Opendoor, working on building data ingestion pipelines and providing infrastructure for other teams to run and orchestrate their ETL. Prior to Opendoor he worked at Airbnb on A/B testing and building Superset (https://superset.incubator.apache.org/), and at Google on YouTube views and revenue processing.
ARRIVAL AND PARKING
Cask HQ is only a few minutes' walk from the California Avenue Caltrain Station.
Cask HQ also has its own parking lot, but it will not accommodate all guests. Please use the parking lots available nearby:
https://secure.meetupstatic.com/photos/event/5/b/2/f/600_438983343.jpeg
