Spark on Azure + Spark Streaming


Agenda
• Spark On Azure (Nathan Bijnens & Wesley Backelant)
• Spark Streaming (Gerard Maas)
Spark On Azure – Bringing the power of Big Data to the Cloud
Microsoft has a rich history of embracing Big Data technologies in the Azure platform. We offer templates in our IaaS platform for Hortonworks, Cloudera and MapR, along with other Big Data services such as Cassandra. We also offer HDInsight, our 100% Apache Hadoop-based service in the cloud. With HDInsight you are up and running in minutes with your own Big Data cluster. These clusters come in different types, such as Hadoop (Hortonworks), HBase, Storm and, more recently, Spark. Join us for an introduction to the world of Big Data and Spark on Azure. We are going to use Spark notebooks (Jupyter and Zeppelin), which are available on Azure HDInsight, to demonstrate the ideal ad-hoc data analytics environment, right from within your browser. Once the data is processed, we will connect Power BI to Apache Spark interactively to build a dashboard and visualize our insights.
Spark Streaming
Apache Spark is a distributed computing framework that enables scalable, high-throughput, and fault-tolerant processing of data. Spark Streaming delivers the power of Spark to process streams of data in near real-time.
After a quick introduction, we will discuss the Spark Streaming "micro-batch" model, which allows Spark to be reused as a data processing engine for in-flight data.
In particular, we will place emphasis on:
• the different stream consumption approaches
• the performance characteristics of each, and
• the new Kafka "direct" (receiver-less) approach and its improved reliability.
Through several examples, we will explore the Spark Streaming API and see how streaming jobs can be combined with other Spark libraries to create data products that extract value from data in real time.
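As a rough illustration of the "micro-batch" idea mentioned above, the following plain-Python sketch cuts a stream of timestamped events into fixed-width batches and applies an ordinary batch function to each one. It is a conceptual toy, not Spark Streaming's API: the names `micro_batches` and `word_count`, the sample events, and the 1-second batch interval are all assumptions made for this example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Each event lands in the batch whose time window contains its timestamp.
        batches[int(timestamp // batch_interval)].append(value)
    # Return the non-empty batches in time order: a stream of small, finite datasets.
    return [batches[index] for index in sorted(batches)]


def word_count(batch):
    """An ordinary batch computation, applied unchanged to every micro-batch."""
    counts = defaultdict(int)
    for line in batch:
        for word in line.split():
            counts[word] += 1
    return dict(counts)


# Hypothetical in-flight data: (arrival time in seconds, line of text).
events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.4, "kafka spark"),
    (2.7, "kafka"),
]

# One result per batch interval that received data.
results = [word_count(batch) for batch in micro_batches(events, batch_interval=1.0)]
print(results)
```

The point of the sketch is that `word_count` knows nothing about streaming: the same batch logic is reused on each slice of the stream. This toy skips intervals that received no data, which is a simplification chosen for brevity.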
