Big Data Orchestration using Airflow, Big Data Infrastructure and Data Science


Details
Guest Speaker, Joshua Robinson, Pure Storage
Joshua Robinson (https://www.linkedin.com/in/joshuarobinson80/) will give a talk on data science.
This talk will overview Pure Storage’s streaming big data analytics pipeline, which uses open source technologies like Spark and Kafka to process over 30 billion events per day and provide real-time feedback in under five seconds. This pipeline is supported by Pure Storage’s FlashBlade as a shared storage solution, which enables a streaming use case as well as on-demand batch analytics.
This pipeline illustrates the use case for big data analytics technologies, the lessons learned from this project, and the underlying elastic infrastructure that provides flexible scaling, agility, and simplicity across multiple application clusters.
Joshua is a founding engineer of FlashBlade at Pure Storage.
He has spearheaded the big data strategy for FlashBlade.
FlashBlade is an enterprise, hardware-backed storage product that provides extremely high IOPS (reads and writes).
Guest Speaker, TBD, Aon Centre for Innovation and Analytics (ACIA)
ACIA will speak about how they leverage cloud-based infrastructure to support their growing data science needs.
More details will be released tomorrow!
Paul Foran, Organizer of Meetup
I will talk about how I use Apache Airflow (a Python-based data pipeline and scheduling system) to interact with big data systems in the cloud (such as AWS S3, EMR, Spark and Redshift).
Apache Airflow can be used to schedule pretty much anything, from jobs that train models right through to data ingestion.
I will go through the various elements within Airflow, such as stabilizing the environment, building dynamic DAGs, and interacting with custom or generic RESTful APIs (such as a metadata API system) to help onboard new ingestion systems into an ETL pipeline.
Beer and pizza will be provided by ACIA!
