Real-time Big Data Analytics with Spark and Solr


Details
Join us for the inaugural Austin Lucene/Solr Meetup! Solr committer Timothy Potter will give a presentation on using Solr & Spark for real-time big data analytics. Food & drinks will be provided. Hope to see you there!
6:00-6:30pm: Networking, Food & Drinks
6:30-7:30pm: Presentation & Questions
7:30-8:00pm: Wrap-up
Abstract: Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to handle even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this talk, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Mr. Potter covers how to populate Solr from a Spark streaming job, as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation across large result sets in Spark. All of the concepts presented in this talk are implemented in an open source project released and supported by Lucidworks; see https://github.com/LucidWorks/spark-solr
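To make the "deep paging cursors" idea concrete: rather than offset-based paging, where the server must re-scan and discard all earlier rows on every request, the client sorts on a unique key and passes back an opaque cursor mark so each page resumes exactly where the last one ended. The stdlib-only Python sketch below simulates that mechanism; the function names and the in-memory document store are illustrative, not the Solr or spark-solr API.

```python
# Illustrative, stdlib-only simulation of cursor-based deep paging.
# The "cursor mark" here is simply the last unique key seen, so each
# page resumes where the previous one ended instead of re-scanning.

def fetch_page(docs, cursor_mark, page_size):
    """Simulate one Solr-style cursor query against docs sorted by id.

    Returns (page, next_cursor_mark).
    """
    if cursor_mark is None:
        remaining = docs
    else:
        remaining = [d for d in docs if d["id"] > cursor_mark]
    page = remaining[:page_size]
    next_mark = page[-1]["id"] if page else cursor_mark
    return page, next_mark

def read_all(docs, page_size):
    """Drain the full result set page by page, as one RDD partition might."""
    results, mark = [], None
    while True:
        page, next_mark = fetch_page(docs, mark, page_size)
        if not page:
            break
        results.extend(page)
        mark = next_mark
    return results

docs = [{"id": i, "body": f"doc {i}"} for i in range(10)]
pages = read_all(docs, page_size=3)  # four requests: 3 + 3 + 3 + 1 docs
```

In the real project, the parallelism comes from combining this with SolrCloud sharding: each shard's result set can be drained by its own Spark partition concurrently, rather than funneling everything through a single paging loop.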
After covering the basic indexing and query-execution use cases, Tim digs a little deeper to show how to bring Solr result sets into interactive analysis sessions using Spark SQL. Lastly, Tim will touch on using MLlib to enrich documents before indexing them in Solr, with examples such as sentiment analysis (logistic regression), language detection, and topic modeling (LDA).
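The sentiment-analysis enrichment step mentioned above boils down to training a binary classifier and writing its prediction into each document before indexing. MLlib's logistic regression does this at scale; the toy stdlib-only Python sketch below shows the underlying technique on an invented four-example dataset with one-hot bag-of-words features (the data, feature scheme, and function names are all illustrative assumptions, not the talk's actual pipeline).

```python
import math

# Toy logistic-regression sentiment classifier, trained by simple
# per-example gradient descent on a hand-made dataset (1 = positive).
train = [
    ("great product love it", 1),
    ("terrible awful hate it", 0),
    ("love this great service", 1),
    ("awful experience hate this", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    """One-hot bag-of-words vector over the training vocabulary."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(data, epochs=200, lr=0.5):
    """Fit weights and bias by stochastic gradient descent."""
    w, b = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, y in data:
            x = featurize(text)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, text):
    """Probability that the text is positive sentiment."""
    x = featurize(text)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = train_lr(train)
```

In the pipeline the talk describes, each incoming document would be scored this way (at scale, via MLlib) and the predicted sentiment stored as an extra field in Solr, so it becomes facetable and filterable at query time.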
When discussing big data, especially search on big data, it’s important to establish performance metrics. For instance, how many docs per second can be indexed from Spark to Solr using this framework? Or, how many rows can be processed per second when reading data from Solr into Spark? Tim concludes his presentation by showing read/write performance metrics achieved using a 10-node Spark / SolrCloud cluster running on YARN in EC2.
Attendees will come away with a solid understanding of common use cases, access to open source code, and performance metrics to help them develop their own large-scale search and discovery solution with Spark and Solr.
Speaker Bio: Timothy Potter is a senior member of the engineering team at Lucidworks and a committer on the Apache Solr project. Tim focuses on scalability and hardening the distributed features in SolrCloud. Previously, Tim was an architect on the Big Data team at Dachis Group, where he worked on large-scale machine learning, text mining, and social network analysis problems using Hadoop, Cassandra, and Storm. Tim is the co-author of Solr in Action, a comprehensive guide to using Solr 4. He lives with his two Shiba Inus in the mountains outside Denver, CO.