Arbitrary Stateful Stream Processing in PySpark

Hosted By
Arivoli T.

Details

ABSTRACT:
PySpark is a popular choice for implementing Spark applications. Users need custom stateful processing to implement, for example, custom session windowing, cached lookups, and exponentially weighted moving averages. To meet these needs, we have added support for arbitrary stateful processing in PySpark, bringing it on par with the Scala and Java APIs. In this talk, we will give a high-level overview of how PySpark works, explain how we implemented arbitrary stateful processing for PySpark, and discuss the challenges we encountered and the choices we made to integrate with Python-native features such as pandas. We will also cover some of the interesting use cases that can be implemented with the arbitrary stateful processing API in Spark.
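
As a rough illustration of the exponentially weighted moving average use case mentioned above, the sketch below keeps one running value per key and updates it batch by batch. It is plain Python and does not use Spark; the names `update_ewma`, `process_batch`, and `ALPHA`, and the dictionary standing in for a state store, are assumptions made for this example and are not taken from the talk. In PySpark itself, per-group state of this kind is handled through `applyInPandasWithState` (available from Spark 3.4).

```python
from typing import Dict, Iterable, List, Optional, Tuple

ALPHA = 0.5  # hypothetical smoothing factor, chosen only for illustration


def update_ewma(state: Optional[float], values: Iterable[float],
                alpha: float = ALPHA) -> Optional[float]:
    """Fold new values into the previous EWMA (None means no state yet)."""
    for value in values:
        if state is None:
            state = value  # the first observation seeds the average
        else:
            state = alpha * value + (1 - alpha) * state
    return state


def process_batch(store: Dict[str, float],
                  batch: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group one micro-batch by key, update each key's state, emit results."""
    grouped: Dict[str, List[float]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    output: Dict[str, float] = {}
    for key, values in grouped.items():
        # The dict plays the role of a per-key state store in this sketch.
        store[key] = update_ewma(store.get(key), values)
        output[key] = store[key]
    return output


store: Dict[str, float] = {}
first = process_batch(store, [("a", 10.0), ("b", 4.0), ("a", 20.0)])
second = process_batch(store, [("a", 30.0)])
print(first, second)
```

The point of the sketch is that state for key "a" survives from the first batch to the second, which is what distinguishes stateful processing from a per-batch aggregation.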

SPEAKER BIO(S):

Speaker #1: Dr. Karthik Ramasamy, Head of Streaming, Databricks
LinkedIn: https://www.linkedin.com/in/kramasamy/

Karthik Ramasamy is the Head of Streaming at Databricks. Before joining Databricks, he was a Senior Director of Engineering, managing the Pulsar team at Splunk. Before Splunk, he was the co-founder and CEO of Streamlio, which focused on building next-generation event processing infrastructure using Apache Pulsar, and he led the acquisition of Streamlio by Splunk. Before Streamlio, he was the engineering manager and technical lead for real-time infrastructure at Twitter, where he co-created Twitter Heron, which was open-sourced and is used by several companies. He has two decades of experience working with companies such as Teradata, Greenplum, and Juniper in their rapid growth stages, building parallel databases, big data infrastructure, and networking. He co-founded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter.

Karthik has a Ph.D. in computer science from the University of Wisconsin, Madison, with a focus on big data and databases. During his college tenure, several of the research projects he participated in were later spun off as a company acquired by Teradata. Karthik is the author of several publications, patents and a popular book, Network Routing: Algorithms, Protocols and Architectures.

Speaker #2: Jerry Peng, Staff Software Engineer, Databricks
LinkedIn: https://www.linkedin.com/in/boyang-jerry-peng/

Boyang Jerry Peng is currently a Staff Engineer at Databricks, working extensively on Apache Spark Structured Streaming. Before joining Databricks, he was a Principal Software Engineer at Splunk, working on streaming and messaging projects, especially Apache Pulsar. Jerry is a committer and PMC member of the Apache Pulsar, Apache Storm, and Apache Heron projects. Before Splunk, he worked at Streamlio (acquired by Splunk), Citadel, and Yahoo on distributed systems and stream processing. Jerry has been working in the area of distributed systems and stream processing since his days in grad school at the University of Illinois, Urbana-Champaign.

Speaker #3: Hyukjin Kwon, Tech Lead Software Engineer, Databricks
LinkedIn: https://www.linkedin.com/in/hyukjin-kwon-25045412b/

Hyukjin is a tech-lead software engineer on the PySpark team at Databricks and an Apache Spark PMC member and committer, working on many different areas of Apache Spark such as PySpark, Spark SQL, and SparkR. He is also one of the top contributors to both Apache Spark and Koalas (pandas API on Spark), and the maintainer of Py4J. Hyukjin holds an MS from University College London.

Speaker #4: Jungtaek Lim, Senior Software Engineer, Databricks
LinkedIn: https://www.linkedin.com/in/heartsavior/

Jungtaek is a software engineer at Databricks working on Spark Structured Streaming. He is a committer of Apache Spark and has been maintaining the Structured Streaming component for more than 3 years. Before contributing to Apache Spark, he focused on contributing to Apache Storm, where he is one of the PMC members. Jungtaek holds a BS in Computer Science from Kookmin University.

SWAG and FOOD will be provided by Databricks.

Data Riders
Hacker Dojo
855 Maude Ave · Mountain View, CA