From a little Spark may burst a flame 🔥


Details
Apache Spark is often mentioned in all kinds of Reactive architecture examples as a highly trusted data processing engine, and I've been wanting to do a meetup about it for a long time. Spark is the de facto standard for large-scale data processing... or is it? Competition is fierce 🍿, and innovation is crucial in order to keep the throne. Let's gather to discover how a couple of companies make use of its most advanced capabilities. The title is a quote from Dante Alighieri (https://www.poetryfoundation.org/poets/dante-alighieri). This meetup is made possible 🙏🏻 by XITE (https://xite.nl).
17:30: doors open
17:45: pizza and drinks
18:15: intro
18:30: talk #1: Structured Streaming at XITE
19:10: short break
19:15: talk #2: Affordable automatic deployment of Spark and HDFS with Kubernetes and Gitlab CI/CD
20:00: wrap-up and additional drinks until closing
Talk #1: Structured Streaming at XITE
The latest Apache Spark Structured Streaming API promises to make building streaming applications easy while hiding complex streaming semantics. This talk gives an overview of what Structured Streaming is all about, covers the most important aspects of the framework, and closes with real-life use cases of how Structured Streaming is used at XITE.
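To give a taste of the API ahead of the talk, here is a minimal, hypothetical sketch of a streaming word count written against the Structured Streaming DataFrame API. The socket source, host and port are illustrative placeholders, not XITE's actual pipeline.

// Minimal Structured Streaming sketch (Scala). Assumes a text stream
// on localhost:9999, e.g. started with `nc -lk 9999`.
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // Read an unbounded stream of lines from a socket (placeholder source).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Classic word count, expressed with the same DataFrame API as batch jobs.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Continuously print the updated counts; Spark manages the streaming
    // semantics (state, incremental execution) behind the scenes.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}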
Speaker's bio: Natalia Grybovska is a Scala Developer and aspiring Data Scientist/Engineer at XITE International in Amsterdam. She is interested in Big Data and Machine Learning, as well as streaming tools and frameworks such as Apache Spark and Kafka.
Talk #2: Affordable automatic deployment of Spark and HDFS with Kubernetes and Gitlab CI/CD
Running an application on Spark with external dependencies, such as R and Python packages, requires installing these dependencies on all the workers. To automate this tedious process, a continuous deployment workflow has been developed using Gitlab CI/CD. The workflow builds HDFS and Spark Docker images with the required Python and R dependencies for the workers and the master, and then deploys the images on a Kubernetes cluster built from affordable mini PCs. We will demonstrate that this cluster is fully operational: the Spark cluster is accessible through the Spark UI, Zeppelin and RStudio, and HDFS is fully integrated with Kubernetes.
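As a rough illustration of what such a workflow can look like, here is a hypothetical .gitlab-ci.yml sketch. The Dockerfile paths, image names and manifest locations are assumptions for illustration, not the speaker's actual setup; only the $CI_* variables are standard ones provided by Gitlab.

# Hypothetical two-stage pipeline: build a Spark image with the
# Python/R dependencies baked in, then roll it out to Kubernetes.
stages:
  - build
  - deploy

build-spark-image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    # Dockerfile path, image name and tag are placeholders.
    - docker build -t $CI_REGISTRY_IMAGE/spark:$CI_COMMIT_SHORT_SHA -f docker/spark/Dockerfile .
    - docker push $CI_REGISTRY_IMAGE/spark:$CI_COMMIT_SHORT_SHA

deploy-to-k8s:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    # Assumes Kubernetes manifests live under k8s/ in the repo and a
    # deployment named spark-worker with a container named spark.
    - kubectl apply -f k8s/
    - kubectl set image deployment/spark-worker spark=$CI_REGISTRY_IMAGE/spark:$CI_COMMIT_SHORT_SHA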
Speaker's bio: Angel Sevilla Camins is a data scientist at Anchormen (https://anchormen.nl) with a strong affinity for big data technologies (Spark and Hadoop) and deployment automation (Kubernetes, Docker and Gitlab CI/CD).
