Apache Spark Structured Streaming Architecture Walkthrough

Hosted by Arivoli Tirouvingadame

Details

Apache Spark Structured Streaming offers almost the same APIs as ordinary Spark SQL queries, but the underlying processing logic and infrastructure differ because of the special requirements of stream processing. In this talk, we will walk the audience through the actual Structured Streaming infrastructure. We will start with how streaming differs from batch processing, delve into the intuition behind how Structured Streaming tackles these differences, and finish with an architecture walkthrough showing how they are addressed.

SPEAKERS:
Speaker #1: Karthik Ramasamy, Head of Streaming, Databricks
LinkedIn: https://www.linkedin.com/in/kramasamy/

Karthik Ramasamy is the Head of Streaming at Databricks. Before joining Databricks, he was a Senior Director of Engineering, managing the Pulsar team at Splunk. Before Splunk, he was the co-founder and CEO of Streamlio, which focused on building next-generation event processing infrastructure using Apache Pulsar, and he led the acquisition of Streamlio by Splunk. Before Streamlio, he was the engineering manager and technical lead for real-time infrastructure at Twitter, where he co-created Twitter Heron, which was open sourced and used by several companies. He has two decades of experience working with companies such as Teradata, Greenplum and Juniper in their rapid growth stages, building parallel databases, big data infrastructure and networking. He co-founded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Karthik has a Ph.D. in computer science from the University of Wisconsin, Madison, with a focus on big data and databases. During his college tenure, several of the research projects he participated in were later spun off as a company acquired by Teradata. Karthik is the author of several publications, patents and a popular book, Network Routing: Algorithms, Protocols and Architectures.

Speaker #2: Wei Liu
LinkedIn: https://www.linkedin.com/in/wweil233/

Wei is a software engineer on the Structured Streaming team. He graduated from the University of Chicago with an M.S. in Computer Science in 2022, and from the University of Illinois at Urbana-Champaign with a B.S. in Mathematics + Computer Science in 2021. During his undergraduate years, he interned at Meta and Amazon and published short conference papers on information retrieval at SIGIR and ECIR.

Speaker #3: Jerry Peng
LinkedIn: https://www.linkedin.com/in/boyang-jerry-peng/

Boyang Jerry Peng is currently a Staff Engineer at Databricks, working extensively on Apache Spark Structured Streaming. Before joining Databricks, he was a Principal Software Engineer at Splunk working on streaming and messaging projects, especially Apache Pulsar. Jerry is a committer and PMC member of the Apache Pulsar, Apache Storm, and Apache Heron projects. Before Splunk, he worked at Streamlio (acquired by Splunk), Citadel, and Yahoo on distributed systems and stream processing. Jerry has been working in the area of distributed systems and stream processing since his days in grad school at the University of Illinois, Urbana-Champaign.

Swag and food are sponsored by Databricks.

Data Riders
Hacker Dojo
855 Maude Ave · Mountain View, CA