IoT with Spark Streaming & Hadoop with Kudu


Details
Please join us for two tech talks at our first Austin Meetup.
Agenda:
• 6:00-6:30 Networking, food & drinks
• 6:30-7:10 Presentation #1: IoT with Spark Streaming: Practical Lessons from Real-World Use Cases
• 7:15-8:00 Presentation #2: Resolving Transactional Access/Analytic Performance Trade-offs in Hadoop with Kudu
Presentation #1 - IoT with Spark Streaming: Practical Lessons from Real-World Use Cases
Over the past year, Spark Streaming has emerged as a leading platform for implementing IoT and similar real-time use cases. There are successful implementations across a diverse spectrum of industries, from consumer internet and mobile to healthcare to traditional manufacturing.
We will start with a brief introduction to Spark Streaming’s micro-batch architecture for real-time stream processing. The primary focus of the talk, however, will be end-to-end architectures and use cases. We will give a walkthrough, and a live demo, of an example use case that processes and alerts on time-series data (such as sensor data), all the way from ingestion of the data streams with Kafka, through processing in Spark Streaming to identify egregious conditions, to sending alerts via Kafka events.
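The ingestion-to-alerting flow described above can be sketched conceptually. The snippet below is a plain-Python stand-in for the micro-batch pattern, not actual Spark Streaming or Kafka API calls: readings arrive as a stream, are discretized into small batches, and each batch is filtered for threshold violations that would become alert events. The names (`THRESHOLD`, `micro_batches`, `alerts_for`) and the sample data are purely illustrative.

```python
from itertools import islice

THRESHOLD = 100.0  # illustrative sensor limit; a real pipeline would configure this


def micro_batches(stream, batch_size):
    """Group an unbounded iterator of readings into fixed-size micro-batches,
    mimicking how Spark Streaming discretizes a stream into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def alerts_for(batch):
    """Per-batch processing step: flag egregious readings for alerting."""
    return [(sensor, value) for sensor, value in batch if value > THRESHOLD]


# Simulated time-series stream of (sensor_id, value) pairs.
readings = [("s1", 42.0), ("s2", 130.5), ("s1", 99.9), ("s3", 250.0), ("s2", 7.1)]

all_alerts = []
for batch in micro_batches(readings, batch_size=2):
    # In the real use case this step would publish alert events back to Kafka.
    all_alerts.extend(alerts_for(batch))

print(all_alerts)  # → [('s2', 130.5), ('s3', 250.0)]
```

In actual Spark Streaming, the batching is handled by the framework (each micro-batch arrives as a distributed dataset), and ingestion/alerting would go through Kafka topics rather than in-memory lists.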
Speaker Bio:
Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for real-time streaming, Apache Spark, and tools for data ingestion into the Hadoop platform. Before joining Cloudera, he worked as an engineer at LinkedIn, where he applied machine learning techniques to improve the relevance and personalization of LinkedIn’s Feed. Anand has extensive experience in leveraging big data platforms to deliver products that delight customers. He has a master’s in computer science from Stanford and a bachelor’s from the University of Arizona.
Presentation #2 - Resolving Transactional Access/Analytic Performance Trade-offs in Hadoop with Kudu
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
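The trade-off described above can be illustrated with a toy model in plain Python (no Parquet or HBase involved): a column-oriented layout makes whole-column scans and aggregations cheap, while a row-oriented, key-indexed layout makes point lookups and in-place modifications cheap, and each layout is awkward at the other job. The table contents here are invented for illustration.

```python
# Toy illustration of the storage trade-off: the same table held two ways.
rows = [
    {"id": 1, "region": "US", "sales": 100},
    {"id": 2, "region": "EU", "sales": 250},
    {"id": 3, "region": "US", "sales": 175},
]

# Columnar layout (Parquet-like): one contiguous list per column.
# Aggregating a single column touches only that column's data.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total_sales = sum(columns["sales"])  # fast analytic scan

# Row layout indexed by key (HBase-like): fast random access by row key.
by_id = {r["id"]: r for r in rows}
point_lookup = by_id[2]              # fast single-row read
by_id[2]["sales"] = 260              # row-by-row modification is natural here,
                                     # but would force a rewrite in a columnar file

print(total_sales, point_lookup["region"])
```

Kudu's goal, as the talk describes, is to offer both access patterns behind a single API rather than forcing applications to pick one layout and pay the other's cost.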
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu (http://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/), the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
Speaker Bio:
David Alves is a software engineer at Cloudera, where he has worked on Kudu for the past 2.5 years, and a PhD student at UT Austin. He is a committer at the Apache Software Foundation and has contributed to several open source projects, including Apache Cassandra and Apache Drill.