For July we have Snowflake from their local office and Russell Jurney, author of Agile Data Science 2.0!
Thank you, Snowflake, for feeding us, and ATLAS Workbase for hosting this event!
6:00-6:30 Social, food/drink
6:35 Snowflake makes Apache Spark faster
7:05 Agile Data Science 2.0
Snowflake makes Apache Spark faster - Torsten Grabs
With machine learning and data science in Spark, efficiently processing large data sets is essential. The same is true for data warehousing workloads. Snowflake's connector for Spark targets the synergies at the intersection of these two workloads. For any data stored in Snowflake, the connector transparently maps data processing operations in Spark, such as transformations over DataFrames or RDDs, to highly efficient relational queries in Snowflake. As a result, performance for Spark workloads automatically improves once the data is stored in Snowflake. Large workloads with several terabytes of data see speed-ups by a factor of 10 or more when using Snowflake for storage, compared to file-based storage in AWS S3 using Parquet or gzipped JSON. These performance benefits are made possible by a range of optimizations in Snowflake, such as JSON-optimized storage and automatic partition pruning. In addition, the distributed processing architecture of Spark is a natural fit for the highly parallel, scaled-out processing performed by Snowflake's query processor. By growing your Spark cluster and Snowflake warehouse in tandem, you can achieve virtually unlimited bandwidth and performance across Spark and Snowflake to cover today's most demanding data processing workloads.
Torsten currently serves as Director of Product Management for Snowflake, where he oversees product management for its newly founded office in Bellevue, WA. A focus area of Torsten's work is Snowflake's developer platform and its integration with developer ecosystems such as Spark. Another key priority for Torsten is growing the Snowflake team. Torsten also teaches cloud databases at the University of Washington in Seattle. Before joining Snowflake, Torsten spent more than a decade at Microsoft in the SQL Server product group in Redmond, WA, serving in different roles in software development and product management. Considering himself a “database person”, Torsten holds a PhD in computer science from the Swiss Federal Institute of Technology (ETH) in Zurich, Switzerland.
Agile Data Science 2.0 - Russell Jurney
Agile Data Science 2.0 (O’Reilly 2017) defines a methodology and a software stack with which to apply it. The methodology seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. The stack is just one example that meets the requirements of being utterly scalable and utterly efficient to use for application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, Elasticsearch, Apache Parquet, Python/Flask, and jQuery.
This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science with this full stack to build better analytics applications.
My name is Russell Jurney. I am principal consultant at Data Syndrome, a product analytics consultancy dedicated to advancing the adoption of the development methodology Agile Data Science, as outlined in the book Agile Data Science 2.0. I’ve worked as a data scientist building data products for over a decade, starting in interactive web visualization and then segueing into data products, machine learning and artificial intelligence at companies such as Ning, LinkedIn and Hortonworks. I am a self-taught visualization software engineer, data engineer, data scientist and writer, and most recently I’m becoming a teacher. In addition to applied work building analytics products, Data Syndrome offers live and video training courses.