Apache Solr powers search and navigation for many of the world's largest websites. Solr is widely admired for its rock-solid full-text search and its ability to scale to massive workloads. But it has moved beyond its roots as just a full-text search engine. Today, people use Solr for aggregating data, powering dashboards, geospatial search, even building knowledge graphs!
In fact, Solr is so powerful, it's the standard engine for big data search on major data analytics platforms, including Hadoop and Cassandra. Critical data is being accessed through Solr's rich query interface, and big data engineers now include Solr as one more data store in the analytics processing chain.
But, as we expand the data pipeline to include diverse data stores, we need consistent ways of working across different data access patterns and representations.
Enter Apache Spark. Apache Spark has seen a meteoric rise as the tool for big data processing. Spark makes distributed computing as simple as running a SQL query. Well, almost!
Spark's core abstraction, the Resilient Distributed Dataset (RDD), is capable of representing pretty much any data store, including Solr. So, let's see how we can integrate Apache Solr into our data processing pipeline using Apache Spark.
We'll talk about the implications and opportunities of treating Solr as just another RDD. And, we'll review some use cases for running Spark jobs over Solr, ranging from "comparing data sets" to "calculating Flesch–Kincaid reading levels in text".
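As a taste of that last use case: the Flesch–Kincaid grade level is a simple formula over sentence, word, and syllable counts, which makes it a natural per-document Spark map step. A minimal pure-Python version is sketched below; the vowel-group syllable counter is a naive heuristic chosen for illustration, not the method used in the talk.

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per run of consecutive vowels,
    # with a minimum of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Flesch-Kincaid grade level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

print(round(flesch_kincaid_grade("The cat sat on the mat. It was happy."), 2))
```

In a Spark job, this function would simply be mapped over a field of each Solr document.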
To top it off, we'll demonstrate how to use your existing SQL skills with SparkSQL, declarative programming over distributed datasets in real time! Finally, we'll tie it all together with a new Apache project that marries the best of IPython Notebook (the favorite tool of data scientists) and the best of distributed computing (Apache Spark and SparkSQL). Apache Zeppelin is the interactive computational environment for data analytics. Just like IPython Notebook, Zeppelin supports collaboration, data exploration and discovery, and rich graphs and visualizations. But its deep integration with Spark means Apache Zeppelin is the "interactive analytics notebook" for Big Data.
About Eric Pugh
Fascinated by the “craft” of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past 10 years. He is a member of the Apache Software Foundation and lately has been mulling over how we move our expectations for search from “10 blue links” to “remarkably right”. In biotech, financial services and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software.
As a speaker, he has advocated the advantages of Agile practices in software development. Eric became involved in Solr in 2007 when he submitted SOLR-284, the patch for Parsing Rich Documents such as PDF and MS Office, which became the single most popular patch as measured by community votes! The patch was later cleaned up and enhanced by three other individuals, demonstrating the power of the Free/Open Source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr 1.4. Eric co-authored "Solr Enterprise Search Server", the first book on Solr.
Check out his blog at http://www.opensourceconnections.com/blog