Spark at Microsoft Extravaganza


Details
We have three awesome sessions for the next installment of the Spark at Microsoft Extravaganza!
Agenda
• 5:30 pm Doors Open
• 5:30 to 6:00 pm Check-in, Food+Drinks, Networking
• 6:00 to 8:00 pm Three Sessions (30 to 40 minutes each)
• 8:00 to 8:30 pm Networking
The sessions are:
Temporal Operators For Spark Streaming And Its Application For Office365 Service Monitoring
While building an intelligent monitoring and alerting system for Office365 service quality and user experience on top of Spark Streaming, we needed to use event application time for the majority of our monitoring logic (mostly aggregates and temporal joins over windows of different event types) to get repeatability and cross-signal correlation. Native Spark Streaming only supports wall-clock windowing operators, which is insufficient for most of our scenarios. The Office365 team and the Azure Streaming Analytics team have therefore been working together to create a set of temporal operators (e.g. reorder, aggregate, and temporal joins, all by event application time) on top of Spark Streaming to support our complex monitoring logic at scale. The Azure Streaming Analytics team has worked for years on advanced streaming programming models and implementations, while the Office365 team has a strong need to scale its monitoring and alerting infrastructure for service quality and user experience by leveraging an open source stack (Kafka/Spark/Cassandra). At Spark Summit 2016, we presented the core concepts and streaming programming model of the temporal operators. In this talk, we will go one level deeper and analyze two different approaches to processing out-of-order events: reorder then process, versus handling out-of-order events inside the operators. We will enumerate the problems of in-memory state size and amount of computation performed, as well as the dry shard problem that arises when a high water mark is used to move the timeline forward. The more detailed analysis of these problems will be covered in a future talk.
Speaker: Zhong Chen, Microsoft
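To make the "reorder then process" approach from this abstract more concrete, here is a minimal plain-Python sketch. It is not the speakers' implementation and does not use Spark; the ReorderBuffer class, the allowed_lateness parameter, the tumbling_counts helper, and the sample events are all hypothetical names chosen for illustration. The sketch assumes the high water mark is simply the largest event time seen so far minus a fixed lateness allowance, and that events arriving at or below the high water mark are dropped.

```python
import heapq
from collections import defaultdict


class ReorderBuffer:
    """Hypothetical buffer that releases out-of-order events in event-time order."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None  # largest event time seen so far
        self._heap = []             # min-heap ordered by event time
        self._seq = 0               # tie-breaker so payloads are never compared

    def high_water_mark(self):
        """Assumed rule: events at or below this time are considered complete."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def push(self, event_time, payload):
        """Buffer one event and return the events that are now safe to emit."""
        hwm = self.high_water_mark()
        if hwm is not None and event_time <= hwm:
            return []  # too late: this sketch drops the event
        heapq.heappush(self._heap, (event_time, self._seq, payload))
        self._seq += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        hwm = self.high_water_mark()
        released = []
        while self._heap and self._heap[0][0] <= hwm:
            t, _, p = heapq.heappop(self._heap)
            released.append((t, p))
        return released

    def flush(self):
        """Release everything still buffered, e.g. at end of stream."""
        released = [(t, p) for t, _, p in sorted(self._heap)]
        self._heap = []
        return released


def tumbling_counts(ordered_events, window_size):
    """Count events per tumbling window, keyed by window start (event time)."""
    counts = defaultdict(int)
    for event_time, _ in ordered_events:
        counts[(event_time // window_size) * window_size] += 1
    return dict(counts)


# Events in arrival order as (event_time, payload); event times are out of order.
arrivals = [(1, "a"), (4, "b"), (2, "c"), (11, "d"), (7, "e"), (3, "f"), (15, "g")]

buf = ReorderBuffer(allowed_lateness=5)
ordered = []
for event_time, payload in arrivals:
    ordered.extend(buf.push(event_time, payload))
ordered.extend(buf.flush())

counts = tumbling_counts(ordered, window_size=5)
print(ordered)
print(counts)
```

The buffer illustrates the in-memory state cost mentioned in the abstract: every event must be held until the high water mark passes it. The alternative approach, handling disorder inside each operator, would instead keep open window state and update it as late events arrive.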
Spark in YARN-managed multi-tenant clusters
Spark’s YARN support allows scheduling Spark workloads on Hadoop alongside a variety of other data-processing frameworks. We will take a deep dive into how Spark works on YARN and why we chose YARN as our preferred cluster manager. We will share our insights on how we achieved multi-tenancy and maximized cluster resource utilization while ensuring minimum resources for each application, using Spark dynamic executor allocation and YARN schedulers on Spark HDI clusters.
Speakers: Pravin Mittal, Rajesh Iyer, Microsoft
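As a rough illustration of the knobs this abstract refers to, the sketch below assembles standard Spark configuration properties for dynamic executor allocation on YARN. The helper functions, the queue name "analytics", and the executor bounds are hypothetical values for illustration, not settings from the talk; the property names themselves are regular Spark configuration keys. The minimum-resource guarantee per tenant would come from the YARN scheduler's queue configuration, which is not shown here.

```python
def dynamic_allocation_conf(queue, min_executors, max_executors, idle_timeout_s=60):
    """Build Spark properties for one tenant application (hypothetical helper)."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.master": "yarn",
        # YARN queue the application is submitted to; queue capacities are
        # defined on the YARN side, not in Spark.
        "spark.yarn.queue": queue,
        # Let Spark grow and shrink the executor count with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # External shuffle service so executors can be removed safely.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Idle executors are released after this long.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


def to_spark_submit_args(conf):
    """Render the properties as --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Assumed example values: a queue named "analytics" with 2 to 20 executors.
conf = dynamic_allocation_conf("analytics", min_executors=2, max_executors=20)
print(" ".join(to_spark_submit_args(conf)))
```

Capping maxExecutors keeps one application from taking the whole cluster, while minExecutors keeps a floor of executors allocated to the application even when it is idle.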
Five Lessons Learned In Building Streaming Applications At Microsoft Bing Scale
Hundreds of millions of search queries hit Bing.com every day. To enable teams in Bing to monitor and analyze user engagement and act upon revenue opportunities in markets around the world, the Shared Data Team must collect the logs and signals associated with every single search query, then process and enrich the data in near real time. Apache Spark Streaming is the solution that empowers us to fulfill this mission. In this presentation, we will walk through the top five lessons we learned in building and running large-scale streaming applications successfully in production.
Speaker: Renyi Xiong, Microsoft
