- Online Workshop: Introduction to Apache Spark
Our affiliated Data + AI Online meetup is hosting an upcoming workshop: Introduction to Apache Spark. This is a cross-post; please RSVP here: https://www.meetup.com/data-ai-online/events/270166620/ Join us for Part 4 of our online learning series: Introduction to Data Analysis for Aspiring Data Scientists. This is the final online workshop in this series for anyone and everyone interested in learning about data analysis. Part 4: Introduction to Apache Spark Abstract: This workshop covers the fundamentals of Apache Spark, the most popular big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. We will be using data released by the NY Times (https://github.com/nytimes/covid-19-data). No prior knowledge of Spark is required, but Python experience is highly recommended. - Watch Part 1: Intro to Python - https://youtu.be/HBVQAlv8MRQ - Watch Part 2: Data Analysis with pandas - https://youtu.be/riSgfbq3jpY - Watch Part 3: Machine Learning - https://youtu.be/g103iO-izoI For more details and to RSVP: https://www.meetup.com/data-ai-online/events/270166620/
- Machine Learning Lessons Learned from the Field: Interview with Brooke Wenig
Our affiliated Online Spark Meetup is hosting an online meetup to discuss Delta Lake and Apache Spark. Developer Advocate Denny Lee will interview Brooke Wenig, Machine Learning Practice Lead, on best practices and patterns for developing, training, and deploying Machine Learning algorithms in production. This is a cross-post; please RSVP HERE: https://www.meetup.com/spark-online/events/268966416/ Agenda: 10AM PST - 11AM PST (GMT-8) 10AM - 10:40AM - Interview with Brooke 10:40 - 11:00AM - Q&A Speakers: Brooke Wenig is the Machine Learning Practice Lead at Databricks. She guides and assists customers in implementing machine learning pipelines and teaches Distributed Machine Learning & Deep Learning courses. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling. Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise Healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.
- The Genesis of Delta Lake - An Interview with Burak Yavuz
Our affiliated Online Spark Meetup is hosting an online meetup to discuss Delta Lake and Apache Spark. Join us online to learn how Delta Lake and Apache Spark enable you to build reliable data pipelines. This is a cross-post, so please RSVP HERE: https://www.meetup.com/spark-online/events/268335912/ New decade, new start! Let's kick off 2020 with our first online meetup of the year featuring Burak Yavuz, Software Engineer at Databricks, for a talk about the genesis of Delta Lake. Developer Advocate Denny Lee will interview Burak Yavuz to learn about the Delta Lake team's decision-making process and why they designed and implemented the architecture as it is today. Understand the technical challenges that the team faced, how those challenges were solved, and learn about the plans for the future. Agenda: 10AM PST - 11AM PST (GMT-8) 10AM - 10:40AM - Interview with Burak 10:40 - 11:00AM - Q&A Speakers: Burak Yavuz is a Software Engineer at Databricks. He has been contributing to Spark since Spark 1.1 and is the maintainer of Spark Packages. Burak received his BS in Mechanical Engineering at Bogazici University, Istanbul, and his MS in Management Science & Engineering at Stanford. Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise Healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.
- Bay Area Apache Spark Meetup @ LinkedIn, Sunnyvale
Join us for an evening featuring tech-talks about Apache Spark and Delta Lake at scale from LinkedIn and Databricks. This meetup is hosted and sponsored by LinkedIn. Agenda: 6:00 - 6:30 pm: Social Hour with Food & Drinks 6:30 - 6:35 pm: Introduction & Announcements 6:35 - 6:55 pm: Tech Talk-1 from LinkedIn 6:55 - 7:15 pm: Tech Talk-2 from LinkedIn 7:35 - 8:15 pm: Databricks Talk Talk 1 Title: Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility Presenter: Adwait Tumbde Abstract: At LinkedIn, changes to our complex data ecosystem can place steep costs on application developers. We have designed Dali to insulate developers from physical attributes of data like storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover * What Dali is and how it simplifies a complex data ecosystem * Dali as a unified data access layer at LinkedIn * DaliSpark Architecture * Roadmap including plans to open source Dali Bio: Adwait Tumbde is an engineering manager at LinkedIn and leads a team focused on simplifying data management for big data. He has also contributed to the development of Apache Pinot and Presto at LinkedIn. Before joining LinkedIn, he was one of the original developers of Sherpa, a large scale key-value store at Yahoo!. His interests include large scale distributed systems and databases. Talk 2 Title: Optimizing Apache Spark SQL at LinkedIn Presenter: Fangshi Li Abstract: Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn’s Spark team.
In this talk, we will present the Spark SQL ecosystem and roadmaps at LinkedIn, and introduce the highlighted projects we are working on, such as: * Improving Dataset performance with automated column pruning * Bringing an efficient 2d join algorithm to Spark SQL * Fixing join skewness with adaptive execution * Enhancing the cost-optimizer with a history-based learning approach Bio: Fangshi Li is a software engineer at LinkedIn. He has been working on Spark core infrastructure, user libraries, AI solutions, and Spark SQL engine optimizations. He was one of the original developers of Dr. Elephant, the performance tuning tool for Hadoop/Spark. Talk 3 Title: Open Source Reliability for Data Lake with Apache Spark Presenter: Michael Armbrust (https://databricks.com/speaker/michael-armbrust) Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved! Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.
NOTE: You may need a government-issued ID to enter the premises or the conference room.
- Koalas: pandas APIs on Apache Spark/Magenta: Exploring the role of ML
This is a cross-posting event organized by our friends at SF PyData Meetup. PLEASE RSVP HERE: https://www.meetup.com/San-Francisco-PyData/events/262203749/ ## Agenda: 6:00 - 6:45pm: Mingling 6:45 - 6:50pm: Opening remarks 6:50 - 7:35pm: Tech-Talk-1: Koalas: pandas APIs on Apache Spark 7:35 - 8:05pm: Tech-Talk-2: Magenta: Exploring the role of machine learning in creativity 8:05 - 8:30pm: Mingling 8:30: Event over! #### Abstract In this talk, Reynold will present Koalas, a new open source project that was announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework. Reynold will demonstrate the functionality added to Koalas since its initial release, discuss its roadmap, and explain how he envisions Koalas becoming the standard API for large scale data science. #### Speaker Bio Reynold Xin is a cofounder and Chief Architect at Databricks. In the open source community, Reynold is known as a top contributor to the Apache Spark project, having designed many of its core user-facing APIs and execution engine features. Reynold received a PhD in Computer Science from UC Berkeley, where he worked on large-scale data processing systems. ## Magenta: Exploring the role of machine learning in creativity #### Abstract: Hanoi will be giving an overview of Magenta, an open source team at Google led by Doug Eck, exploring the role of machine learning in the creative process. The talk will cover exciting and recent developments in the world of AI-generated art as well as live musical demos! #### Speaker Bio: Lamtharn (Hanoi) Hantrakul - AI Resident, Google Brain Hanoi - like the towers? No. Like the city? Yes! Hanoi's Thai parents fell in love there and nicknamed him after the charming city.
Born and raised in Bangkok, Thailand, he is currently an AI Resident with Google Brain. His research focuses on real-time Neural Audio Synthesis: rendering sound directly using deep neural networks. Hanoi strives for technologies that are transcultural at heart, diversifying software, hardware and AI to encompass underrepresented cultures such as those from his home region of Thailand and beyond. He is most proud of fidular, a modular fiddle system he designed and engineered that enables components like resonators and strings to be swapped across cultures. The system is currently on display at the Musical Instrument Museum in Phoenix, AZ, and has been recognized internationally by the A’ Design Award and Core77 Design Award. Hanoi holds degrees in Applied Physics and Music Composition, both with Distinction from Yale University. In his MSc thesis at Georgia Tech, he developed machine learning models for an ultrasound sensor that enables amputees to perform high-dexterity tasks like playing piano, an impossible feat using today’s sensors on conventional prosthetics. He also writes music under the moniker "yaboi hanoi". Find his tunes on Instagram and Spotify!
- Bay Area Apache Spark Meetup @ Salesforce SF
Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark, Machine Learning, and Delta Lake at scale from Salesforce and Databricks. This meetup is hosted and sponsored by Salesforce. Agenda: 6:00 - 6:30 pm: Social Hour with Food & Drinks 6:30 - 6:35 pm: Introduction & Announcements 6:35 - 7:15 pm: Tech Talk from Salesforce Einstein 7:15 - 7:55 pm: Tech Talk from Databricks: Delta Lake 7:55 - 8:15 pm: Additional Networking, Q&A Talk 1 Title: Apache Spark at Salesforce Einstein for Marketing Cloud Presenters: Peter Krmpotic and Kexin Xie Abstract: Salesforce Einstein makes more than 6.5B predictions per day to deliver highly personalized customer experiences on behalf of some of the biggest brands in the world, and its engineering and product teams have an extensive background in using Apache Spark in production. This talk will cover some of their biggest use cases, their successful transition from Apache Hadoop to Apache Spark, valuable insights about using Spark in production, and an architecture review for one of their core capabilities. Bio: Peter Krmpotic Peter Krmpotic is a director of product management for Salesforce Marketing Cloud Einstein and former product leader at Adobe Experience Cloud, BrightEdge and Boost Media. This career experience has given him a comprehensive view of what brands must develop to provide unified customer experiences at scale, the bedrock of any digitally transformed organization. He works with customers to instill customer-centricity to set them up as AI-first companies and leaders of the technology-based emergence of the fourth industrial revolution. In his current role at Salesforce, Peter focuses on democratizing artificial intelligence, specifically deep learning and natural language processing, for the purpose of personalizing customer experiences at scale.
He holds an MSc in Computer Science from Karlsruhe Institute of Technology (KIT) in Germany. Bio: Kexin Xie At Salesforce, Kexin Xie is responsible for researching and designing the core distributed data processing and machine learning architecture for Marketing Cloud Einstein and Salesforce DMP. He leads a team of data science engineers with a strong focus on continuously improving operational aspects such as performance, fault tolerance, scale, automation, and cost. Before Salesforce, Kexin worked for Krux, BigCommerce, NICTA, Brandscreen, Freelancer and Microsoft Research building large-scale machine learning, data mining, real-time bidding, intelligent marketing, anti-fraud and anti-money laundering software systems. Kexin also holds a Ph.D. degree in computer science. Talk 2 Title: Open Source Reliability for Data Lake with Apache Spark Presenter: Michael Armbrust Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover * What data quality problems Delta helps address * How to convert your existing application to Delta * How the Delta transaction protocol works internally * The Delta roadmap for the next few releases * How to get involved! Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.
His interests broadly include distributed systems, large-scale structured storage, and query optimization.
- Spark + AI Summit Meetup @ Moscone Center West, San Francisco
Join us for an evening of Bay Area Apache Spark Meetup at the Spark + AI Summit (https://databricks.com/sparkaisummit/north-america) featuring tech-talks from Google Inc and Workday Inc on Machine Learning. Thanks to #WomenInUnifiedAnalytics & Diversity Team at Databricks for sponsoring this meetup. (Note: This meetup is open to everyone. You don’t have to be registered for Spark + AI Summit.) Agenda: 6:00 - 6:30 pm Happy Hour: Mingling & Refreshments 6:30 - 6:40 pm Opening Remarks (Jules Damji, Databricks) 6:40 - 7:25 pm Talk-1: Paige Bailey, Google 7:25 - 8:05 pm Talk-2: Madhura Dudhgaonkar, Workday 8:05 - 9:00 pm More Mingling & Networking Talk 1: Details coming soon... Abstract: Bio: Paige Bailey (https://www.linkedin.com/in/dynamicwebpaige) is a Senior TensorFlow Developer Advocate at Google, based in Mountain View, CA. Prior to joining Google, Paige worked as a senior software engineer in the office of the Azure CTO; as a Cloud Developer Advocate for machine learning at Microsoft; and as a data scientist for Chevron in Houston, TX. Paige has over a decade of experience using Python for data analysis and five years of experience doing machine learning, and can't wait to show you the new capabilities in TensorFlow 2.0. Talk 2: Machine Learning Products - How do you begin and when do you scale? Abstract: So you have heard all the hype around how Machine Learning is going to change the world. But within your business context, where do you start? How do you choose the right use cases to begin with? How do you get leadership buy-in for more investment? And when do you start thinking about scaling your ML Services? In this session, you will walk away with an actionable framework to bootstrap and scale an applied machine learning services function. You will see the framework in action through an actual 0 to 1 product journey involving deep learning where we delivered value in record speed in spite of not having a dataset when we started.
You will get practical tips on how to decide when and how to scale your ML Services and platform, and on how to get leadership buy-in for more investment. You will go back with some counter-intuitive tips that we discovered as a result of productizing ML services over the last 5+ years using a diverse range of technologies: Vision, Language, Graph, Anomaly Detection, Search Relevance, Personalization. Bio: Madhura Dudhgaonkar (https://www.linkedin.com/in/madhurad) is a Machine Learning leader at Workday passionate about modernizing the future of work. Her team, a pioneer in the Enterprise Machine Learning space, has spent 6+ years building ML products leveraging Vision, Natural Language Processing, Recommendations, Anomaly Detection and more. Madhura’s career journey goes from being a hands-on engineer to leading large organizations across Sun Microsystems, Adobe and now Workday. Her background covers building both consumer and enterprise products, the latest of them involving multiple 0 to 1 product journeys leveraging Machine Learning. She is considered a thought leader in building ML products and is frequently invited to speak at AI conferences. Madhura holds a master’s degree in Math and Computer Science. She is also passionate about building diverse teams and leads diversity and inclusion work through a Women at Workday chapter.
- Bay Area Apache Spark Meetup @ Unravel Data in SF
Let's kick off the New Year 2019 with our first BASM Meetup! Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark at scale from Unravel Data and Databricks. Agenda: 6:30 - 7:00 pm: Social Hour with Food, Drinks, Beer & Wine 7:00 - 7:05 pm: Jules Introduction & Announcements 7:05 - 7:50 pm: Tech Talk from Unravel Data 8:00 - 8:45 pm: Tech Talk from Databricks 8:45 - 9:00 pm: Unravel Raffle and Additional Networking, Q&A Tech Talk 1: Putting AI to Work on Apache Spark Presenter: Shivnath Babu Abstract: Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems. This talk will discuss algorithms that run real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application, (ii) auto tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for SparkSQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud; and more. Bio: Shivnath Babu is the CTO and Co-Founder at Unravel Data Systems and an adjunct professor of computer science at Duke University. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
Tech Talk 2: Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark Presenter: Lu Wang Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference. In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is based on the Horovod framework developed by Uber and on Databricks Runtime 5.0 for Machine Learning. We will also share our experience and performance tips on how to combine Pandas UDF from Spark and AI frameworks to scale complex model inference workloads. Bio: Lu Wang is a software engineer at Databricks. His main research interests are developing high-performance parallel algorithms for scientific computing and applications. He has been actively involved in the development of Project Hydrogen, Spark Deep Learning pipelines, and Spark MLlib since he joined Databricks. Before Databricks, he was working on parallel multigrid linear solvers on exascale parallel machines for solving the linear systems from reservoir simulations at Lawrence Livermore National Laboratory. He received his Ph.D. from the Pennsylvania State University in 2014.
- Bay Area Apache Spark Meetup @ Adobe in San Jose, CA
Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark from Adobe (https://www.adobe.com/) and an Apache Spark committer from Databricks (https://databricks.com). Agenda: 6:00 - 6:30 pm Mingling & Refreshments 6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions (Jules Damji + Adobe) 6:40 - 7:20 pm Apache Spark at Adobe 7:20 - 8:00 pm Upcoming Apache Spark 2.4: What’s New & Why Should You Care 8:00 - 8:30 pm More Mingling & Networking Tech-Talk 1: Apache Spark at Adobe Abstract: The Adobe Cloud Platform is a multi-tenant, big data stack as a service on the cloud which provides the modern foundation for all the various parts of the Adobe Experience Cloud. From a data processing perspective, Adobe uses Apache Spark in a variety of scenarios. We will talk about the high-level data architecture, briefly touching on the infrastructure/scale/challenges, and lastly, we will cover how we are leveraging Spark. As part of the Cloud Platform, we have also built a Query Engine leveraging Spark SQL for ad-hoc data querying. The Query Engine has implemented a PostgreSQL protocol and leverages Akka Streams and the Presto Parser as an abstraction layer around Spark SQL. We will talk about the high-level architecture and the various patches made to Spark SQL, such as support for nested column pruning, that are critical to our performance needs when accessing data with thousands of nested columns. Bio: Yogesh Natarajan is a senior software engineer in the Data Platform group at Adobe. His interests include building server-side web applications and big data technologies. He has previously worked at Chegg and Yahoo and graduated with a master's from UC Irvine. Andrew is a senior software engineer in the Data Platform group at Adobe. He specializes in building modern, scalable, cloud-based Scala applications.
Tech-Talk 2: Upcoming Apache Spark 2.4: What’s New & Why Should You Care Abstract: The upcoming Apache Spark 2.4 release is the fifth release in the 2.x series. This talk will provide an overview of the major features and enhancements in this upcoming release. * A new scheduling model (Barrier Scheduling) to enable users to properly embed distributed Deep Learning training as a Spark stage to simplify the distributed training workflow. * 35 higher-order functions are added for manipulating arrays/maps in Spark SQL. * A new native Avro data source, based on Databricks' spark-avro module. * PySpark also introduces eager evaluation mode on all operations for teaching and debuggability. * Spark on K8S adds PySpark and R support and client-mode support. * Various enhancements in Structured Streaming, e.g., stateful operators in continuous processing. * Various performance improvements in built-in data sources, e.g., Parquet nested schema pruning. Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer and PMC member. PARKING: All visitors attending the Adobe/Spark Meetup on ET 01 Park Conference room will need to park in the East Tower basement level 1.