[In-Person] 2 Talks About Workflow Orchestration and Distributed Compute

Hosted By
Kevin K. and 2 others

Details

We are back to in-person events at the Melrose Center in the Orlando Public Library (Downtown Branch)! The plan is to drive together to Lazy Moon for pizza and beer. Prefect's Pizza Patrol program, which provides pizza to any open-source Meetup regardless of topic, will cover the pizza with a gift card. Feedback on the chosen venue (Lazy Moon) and other suggestions are welcome in the comments section.

We will have 2 mini-talks about scaling and productionizing data pipelines. Neither requires any prior knowledge beyond basic Python and Pandas.

Talk 1:
Next Generation Workflow Orchestration with Prefect

Workflow orchestration has traditionally been closely coupled to the concept of Directed Acyclic Graphs (DAGs). Building data pipelines involved registering a static graph containing all the tasks and their respective dependencies. During workflow execution, this graph would be traversed and executed. The orchestration engine would then be responsible for determining which tasks to trigger based on the success and failure of upstream tasks.
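
To make the static-graph model above concrete, here is a minimal sketch in plain Python. It is not taken from the talk or from any orchestration library: the task names, the `DAG` mapping, and the `run_dag` helper are all hypothetical, and the only real API used is `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The graph is registered up front, traversed in dependency order, and each task is triggered only if all of its upstream tasks succeeded.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each returns True on success, False on failure.
def extract():
    return True

def clean():
    return False  # simulate a failing task

def load():
    return True

def report():
    return True

TASKS = {"extract": extract, "clean": clean, "load": load, "report": report}

# Static graph registered before execution: task name -> upstream task names.
DAG = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"extract"},
}

def run_dag(dag, tasks):
    """Traverse the graph in dependency order and record one state per task."""
    states = {}
    for name in TopologicalSorter(dag).static_order():
        if all(states[up] == "success" for up in dag[name]):
            states[name] = "success" if tasks[name]() else "failed"
        else:
            # An upstream task failed or was skipped, so do not trigger this one.
            states[name] = "skipped"
    return states

states = run_dag(DAG, TASKS)
print(states)
```

In this sketch `load` is skipped because `clean` fails, while `report` still runs because it depends only on `extract`. Note that the graph is fixed before anything executes.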

This system was sufficient for standard batch-oriented data engineering pipelines but proved constraining for a growing number of common use cases. Data professionals would have to compromise their vision to get their workflow to fit in a DAG. For example,

  1. How do I re-run a part of my workflow based on a downstream condition?
  2. How do I execute a long-running workflow?
  3. How do I dynamically add tasks to the DAG during runtime?

This has led to the development of Prefect Orion (Prefect 2.0), a DAG-less workflow orchestration system that emphasizes runtime flexibility and an enhanced developer experience. By removing the DAG constraint, Orion offers an interface to workflow orchestration that feels more Pythonic than ever. Developers can wrap as little code as they want and get observability into just those tasks of a workflow.
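
The idea of wrapping only selected functions and using ordinary Python control flow can be illustrated with a toy sketch. This is not Prefect's API: the `task` decorator, `RUN_LOG`, and the pipeline functions below are hypothetical stand-ins written with the standard library only. It shows one way the first question above (re-running part of a workflow based on a downstream condition) becomes a plain `while` loop once no static graph is required.

```python
import functools

RUN_LOG = []  # (task name, state) pairs; stands in for an orchestrator's backend

def task(fn):
    """Toy decorator: record the outcome of each call, then return the result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            result = fn(*args, **kwargs)
        except Exception:
            RUN_LOG.append((fn.__name__, "failed"))
            raise
        RUN_LOG.append((fn.__name__, "completed"))
        return result
    return wrapper

@task
def fetch_batch(attempt):
    # Hypothetical source that only has enough data on the third attempt.
    return list(range(attempt))

@task
def validate(batch):
    return len(batch) >= 3

def plain_helper(batch):
    # Not wrapped, so it runs without being tracked.
    return sum(batch)

def pipeline():
    attempt = 1
    batch = fetch_batch(attempt)
    # Re-run the upstream step until the downstream check passes.
    while not validate(batch):
        attempt += 1
        batch = fetch_batch(attempt)
    return plain_helper(batch)

total = pipeline()
print(total, RUN_LOG)
```

Only the two decorated functions appear in `RUN_LOG`, and the number of `fetch_batch` runs is decided at runtime rather than declared in advance.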

Talk 2:
Comparing the Different Ways to Scale Python and Pandas Code

Fugue is an open-source unified interface for Pandas, Spark, and Dask that aims to let data practitioners define their compute workflows in a scale-agnostic manner. By decoupling logic and execution, users can code in a language that they are familiar with (Python, Pandas or SQL), and then choose an execution engine to run it on (Pandas, Spark or Dask). In this talk, we cover the `transform()` function, which lets a user execute a single function in a distributed setting. This simple interface can be incrementally adopted and allows data practitioners to be productive with distributed computing very quickly.
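
The separation of logic from execution described above can be sketched without any distributed framework. The example below is not Fugue and does not use its API: `toy_transform`, the `"local"` and `"threads"` engine names, and the sample rows are assumptions made for illustration, using only the standard library. The point is that `add_total` is written once and the engine is chosen at call time.

```python
from concurrent.futures import ThreadPoolExecutor

def add_total(row):
    """Business logic written once, with no knowledge of the engine."""
    return {**row, "total": row["price"] * row["qty"]}

def toy_transform(rows, fn, engine="local"):
    """Apply fn to every row on the chosen (toy) engine."""
    if engine == "local":
        return [fn(r) for r in rows]
    if engine == "threads":
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(fn, rows))  # map preserves input order
    raise ValueError(f"unknown engine: {engine}")

rows = [{"price": 2.0, "qty": 3}, {"price": 5.0, "qty": 1}]
local_result = toy_transform(rows, add_total, engine="local")
threaded_result = toy_transform(rows, add_total, engine="threads")
print(local_result == threaded_result)
```

Both calls produce the same rows; only the execution strategy differs, which mirrors the kind of incremental adoption the abstract describes.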

About the Speaker:

Kevin Kho is an Open Source Community Engineer at Prefect, an open-source workflow orchestration system. Previously, he was a data scientist at Paylocity, where he worked on adding machine learning features to their Human Capital Management (HCM) Suite. Outside of work, he is a contributor to Fugue, an abstraction layer for Pandas, Spark, and Dask. He also organizes the Orlando Machine Learning and Data Science Meetup.

https://www.linkedin.com/in/kvnkho/

COVID-19 safety measures

Event will be indoors
Orlando Machine Learning and Data Science
Orlando Public Library
101 E Central Blvd · Orlando, FL