• Meetup #4 - Productionizing Machine Learning with Delta Lake, Koalas, and MLflow

    Daniel Arrizza: www.linkedin.com/in/danielarrizza

    Daniel is a Customer Success Engineer at Databricks.

    For many data scientists, building and tuning machine learning models is only a small portion of their daily work. The vast majority of their time is spent on the less-than-glamorous (but crucial) work of performing ETL, building data pipelines, and putting models into production. In this session, we'll walk through building a production data science pipeline step by step. Using open-source tools, we will:
    - Query a data lake with Apache Spark™ and Delta Lake
    - Transform the data with Koalas (distributed PySpark using the pandas API)
    - Run machine learning experiments with hyperparameter tuning (Hyperopt)
    - Log our experiment results to MLflow

    (A minimal code sketch of these four steps follows this listing.)

    ____________
    Schedule:
    6:00pm - Check-in, Socialize & Eat Pizza
    6:30pm - Productionizing Machine Learning with Delta Lake, Koalas, and MLflow
    7:30pm - Q&A
    7:55pm - Meetup Conclusion
    ____________
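    The four steps above compose into a short PySpark program. Below is a minimal sketch, assuming a hypothetical Delta table at /data/loans with amount_usd and label columns, and assuming a SparkSession and MLflow tracking are already configured (as they are on Databricks):

      import databricks.koalas as ks
      import mlflow
      from hyperopt import fmin, tpe, hp, STATUS_OK
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import train_test_split

      # 1. Query the data lake: load a Delta table as a Koalas DataFrame.
      kdf = ks.read_delta("/data/loans")            # hypothetical table path

      # 2. Transform with the pandas API; steps run as distributed Spark jobs.
      kdf = kdf.dropna()
      kdf["amount_cad"] = kdf["amount_usd"] * 1.33  # hypothetical columns

      # Bring the (assumed small) modelling set to pandas for scikit-learn.
      pdf = kdf.to_pandas()
      X_train, X_test, y_train, y_test = train_test_split(
          pdf[["amount_cad"]], pdf["label"], random_state=42)

      # 3 & 4. Tune with Hyperopt; log each trial's params and metric to MLflow.
      def objective(params):
          with mlflow.start_run(nested=True):
              model = RandomForestRegressor(
                  n_estimators=int(params["n_estimators"]),
                  max_depth=int(params["max_depth"]))
              model.fit(X_train, y_train)
              mse = mean_squared_error(y_test, model.predict(X_test))
              mlflow.log_params(params)
              mlflow.log_metric("mse", mse)
              return {"loss": mse, "status": STATUS_OK}

      search_space = {
          "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
          "max_depth": hp.quniform("max_depth", 2, 10, 1),
      }
      with mlflow.start_run(run_name="tuning"):
          best = fmin(fn=objective, space=search_space,
                      algo=tpe.suggest, max_evals=20)
      print(best)

    Each nested run appears as its own row in the MLflow experiment UI, so trials can be compared and the best model promoted from there.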

  • Meetup #3 - Making Apache Spark™ Better with Delta Lake

    Rubikloud Technologies Inc.

    Mladen Kovacevic: www.linkedin.com/in/mladenkovacevic

    Mladen is a Solutions Architect at Databricks who has helped dozens of clients, spanning data engineers, data scientists, and data analysts, fully realize the potential of Apache Spark, MLflow, and Delta Lake on the cloud by delivering robust engineering and AI solutions. Mladen has been building solutions with Apache Spark since 2014 and has contributed to several open-source Apache projects in the big data space. He is a published O'Reilly author who speaks at various events, and over his career he has worked as a software developer, performance analyst, consultant, and solutions architect.

    Apache Spark™ is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data. This talk will cover the use of Delta Lake to enhance data reliability for Spark environments.

    Topics:
    - The role of Apache Spark in big data processing
    - Use of data lakes as an important part of the data architecture
    - Data lake reliability challenges
    - How Delta Lake helps provide reliable data for Spark processing
    - Specific improvements that Delta Lake adds
    - The ease of adopting Delta Lake for powering your data lake

    (A short sketch of Delta Lake's reliability features follows this listing.)

    ____________
    Schedule:
    6:00pm - Check-in, Socialize & Eat Pizza
    6:30pm - Making Apache Spark™ Better with Delta Lake
    7:30pm - Q&A
    7:55pm - Meetup Conclusion
    ____________
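    For a sense of what "adding reliability" means in practice, here is a minimal sketch. It assumes a Spark session built with the Delta Lake package (e.g. io.delta:delta-core via --packages), and the /tmp/delta/events path is hypothetical. Swapping the "parquet" format for "delta" makes writes atomic, appends transactional, and earlier table versions queryable:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("delta-demo").getOrCreate()

      path = "/tmp/delta/events"  # hypothetical location in the data lake

      # Atomic write: readers never observe a half-written table.
      spark.range(0, 1000).write.format("delta").mode("overwrite").save(path)

      # Transactional append: a failed job leaves the table untouched.
      spark.range(1000, 2000).write.format("delta").mode("append").save(path)

      # Time travel: read the table as it was before the append (version 0).
      v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
      print(v0.count())                                      # 1000
      print(spark.read.format("delta").load(path).count())   # 2000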

  • Meetup #2 - Uken Games

    Uken Games

    We will have a talk from Gihad Murad, Chief Architect at Uken Games: https://www.linkedin.com/in/engineering/

    Last year Gihad led an ambitious project involving Uken Games, a company in Nova Scotia, and Sony Pictures in California. The scale of the project pushed technology boundaries!

    There is a HOT new genre of entertainment experience that mixes live TV with mobile gaming, where players and the show host interact in real time (e.g., HQ Trivia). To develop a solution in this category, Uken faced several technical challenges, such as sending 2 million requests to players in 1 second via TCP. This talk will be a high-level technical overview of the different parts of the technology that power this product, including the stacks for video streaming, the service-oriented architecture for game backends, bidirectional real-time communication, and Uken's data platform, which has Apache Spark at its centre.

    ____________
    Schedule:
    6:00pm - Check-in, Socialize & Eat Pizza
    6:30pm - Interactive Mobile Trivia Game Talk
    7:15pm - Q&A
    7:45pm - Meetup Conclusion
    ____________

  • Toronto Apache Spark 2.0 (TAS 2.0)

    1 Richmond St W

    *RESCHEDULED* Toronto Apache Spark 2.0 (TAS 2.0) will be having our FIRST Meetup of 2019!

    Topic: Lessons Learned: Building a high-volume, reliable data lake based on Apache Spark

    Summary: Paytm's business generates a multitude of raw data, stored across a variety of sources such as relational databases (MySQL), messaging queues (Kafka), SaaS apps, NoSQL stores, and object storage. Ingesting many millions of records daily into the data lake (for business reporting, ad hoc queries, analytics, and ML apps) while guaranteeing SLOs for freshness (timely data), completeness (data quality), and schema evolution (structure changes) presents unique challenges at scale. This talk will explore these challenges and the lessons learned using Apache Spark as Paytm's data processing engine in their data lake. (A sketch of one ingestion pattern follows this listing.)

    ____________
    Schedule:
    6:00pm - Check-in, Socialize & Eat Pizza
    6:15pm - Talk #1
    6:45pm - Q&A
    7:00pm - Break + More Pizza/Socialization
    7:15pm - Talk #2
    7:45pm - Q&A
    8:00pm - Meetup Conclusion
    ____________
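    One ingestion pattern suggested by those challenges is to declare the expected schema up front, so that structure changes fail loudly instead of silently corrupting downstream reports, and to stamp each record for freshness tracking. Below is a minimal Structured Streaming sketch of that idea; the broker address, topic name, schema fields, and paths are all hypothetical:

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, from_json, current_timestamp
      from pyspark.sql.types import StructType, StructField, StringType, LongType

      spark = SparkSession.builder.appName("datalake-ingest").getOrCreate()

      # Expected record structure (hypothetical fields); records that do not
      # parse become NULLs and can be routed to a quarantine table, which is
      # one way to enforce the completeness SLO.
      schema = StructType([
          StructField("order_id", StringType()),
          StructField("amount_paise", LongType()),
          StructField("event_time", StringType()),
      ])

      raw = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
             .option("subscribe", "orders")                     # hypothetical topic
             .load())

      parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
                .select("r.*")
                .withColumn("ingested_at", current_timestamp()))  # freshness stamp

      # Land partitioned files in the lake; the checkpoint makes the file sink
      # exactly-once across restarts.
      query = (parsed.writeStream.format("parquet")
               .option("path", "/datalake/orders")               # hypothetical path
               .option("checkpointLocation", "/datalake/_chk/orders")
               .start())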