Light Speed your Spark Deployments and Best Practices
Details
Spark Structured Streaming, plus observability and optimization of your Spark environment
#### Join us in person for the November Seattle Spark + AI Meetup at Blueprint HQ with lively discussions, industry-leading presentations, delicious food and networking opportunities.
#### FEATURED TOPICS:
Karthik Ramasamy
Bio: Karthik Ramasamy is the Head of Streaming at Databricks. Before joining Databricks, he was a Senior Director of Engineering managing the Pulsar team at Splunk. Before Splunk, he was the co-founder and CEO of Streamlio, which focused on building next-generation event processing infrastructure using Apache Pulsar, and he led the acquisition of Streamlio by Splunk. Before Streamlio, he was the engineering manager and technical lead for real-time infrastructure at Twitter, where he co-created Twitter Heron, which was open-sourced and used by several companies. He has two decades of experience working with companies such as Teradata, Greenplum and Juniper in their rapid growth stages, building parallel databases, big data infrastructure and networking. He co-founded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Karthik has a Ph.D. in computer science from the University of Wisconsin–Madison, with a focus on big data and databases. During his time there, several of the research projects he participated in were later spun off as a company acquired by Teradata. Karthik is the author of several publications, patents and a popular book, Network Routing: Algorithms, Protocols and Architectures.
Co-Speaker: Praveen Gattu
Bio: Praveen currently manages the Structured Streaming team at Databricks. Prior to Databricks, Praveen worked for 16 years at AWS. He was one of the early members of the S3 team, developing features such as Versioning, Multipart Upload, and Lifecycle in S3, and he then bootstrapped and built the AWS Kinesis Data Analytics (serverless Apache Flink) service from the ground up into a multi-million-dollar business.
Topic: Project Lightspeed - Next-generation Apache Spark Structured Streaming
Streaming data is a critical area of computing today. Stream processing handles data as it moves from source to destination in real time and facilitates quick insights. To meet these stream processing needs, Structured Streaming was introduced in Apache Spark™ 2.0. Spark Structured Streaming has experienced over 150% year-over-year growth and is widely adopted across thousands of organizations, processing more than 1 PB of compressed data per day on the Databricks platform alone. As adoption accelerated and the diversity of applications moving into streaming increased, new requirements emerged. Project Lightspeed is a new initiative that will take Spark Structured Streaming to the next generation. In this talk, we will give an overview of the features, performance improvements and functionality proposed in Project Lightspeed.
Nan Zhu
Bio: Nan Zhu is the Engineering Lead of the platform team at SafeGraph. He leads the effort to build SafeGraph's data platform from scratch and to support the rapidly growing business in an efficient and sustainable way. He is an expert in multiple data infrastructure technologies, such as Spark and Iceberg.
Topic: Avoid Burning Money with Spark!
SafeGraph is a geospatial data company providing comprehensive and accurate information on tens of millions of global places and how people interact with these locations. We build our data processing stack on top of Spark to transform and generate massive datasets.
As user demand and the complexity of our data processing pipelines and algorithms grew rapidly, our Spark computing costs increased at an undesirable pace until we took action. In this talk, Nan will share his experience building, operating and optimizing Spark infrastructure to save hundreds of thousands of dollars per year. Specifically, he will cover building an observability stack that makes money burners easy to detect, optimizing resource provisioning, and examples of business-logic optimization in Spark applications.
