
Solr, Spark and Zeppelin: The Analytics Toolkit for Distributed Big Data

Hosted By
Bob L. and Peter F.

Details

Apache Solr (http://lucene.apache.org/solr/) powers search and navigation for many of the world's largest websites. Solr is widely admired for its rock-solid full-text search and its ability to scale up to massive workloads. But Solr has moved beyond its roots as just a full-text search engine. Today, people use Solr for aggregating data, powering dashboards, geolocation, even building knowledge graphs! In fact, Solr is so powerful that it's the standard engine for big data search on major data analytics platforms, including Hadoop (https://hadoop.apache.org/) and Cassandra (http://cassandra.apache.org/). Critical data is being accessed through Solr's rich query interface, and big data engineers are now including Solr as one more data store in the analytics processing chain. But as we expand the data pipeline to include diverse data stores, we need consistent ways of working across different data access patterns and representations.
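To give a flavor of that rich query interface: aggregation in Solr is driven by simple query parameters such as `facet=true`. Here's a minimal Python sketch that builds a faceted Solr query URL; the host, collection name (`products`), and field names are hypothetical placeholders, not from any particular install.

```python
from urllib.parse import urlencode

# Hypothetical Solr host and collection -- adjust for your own install.
SOLR_URL = "http://localhost:8983/solr/products/select"

# Standard Solr query parameters: a full-text query plus a facet
# aggregation (document counts per field value), returned as JSON.
params = {
    "q": "title:laptop",       # full-text query
    "rows": 10,                # page size
    "facet": "true",           # enable faceting, Solr's aggregation feature
    "facet.field": "category", # count matching documents per category value
    "wt": "json",              # response format
}

query_url = SOLR_URL + "?" + urlencode(params)
print(query_url)
```

Pointing a browser (or `curl`) at a URL like this returns both the matching documents and the per-category counts in one round trip, which is what makes Solr useful for dashboards as well as search boxes.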

Enter Apache Spark (http://spark.apache.org/). Apache Spark has seen a meteoric rise as the tool for big data processing. Spark makes distributed computing as simple as running a SQL query. Well, almost! Spark's core abstraction, the Resilient Distributed Dataset (https://en.wikipedia.org/wiki/Resilient_distributed_dataset) (RDD), can represent pretty much any data store, including Solr. So let's see how we can integrate Apache Solr into our data processing pipeline using Apache Spark.
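The RDD idea itself is easy to picture: a collection split into partitions, with operations like `map` applied per record and `reduce` combining partial results across partitions. Here's a toy, non-distributed Python sketch of that shape (real RDDs add laziness, fault tolerance, and cluster scheduling, and connectors such as Lucidworks' spark-solr typically map Solr shards onto partitions; this mimic is only illustrative):

```python
from functools import reduce

class ToyRDD:
    """A toy stand-in for Spark's RDD: a list of partitions,
    where each partition is just a list of records."""

    def __init__(self, partitions):
        self.partitions = partitions

    def map(self, fn):
        # Apply fn independently to every record in every partition.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def reduce(self, fn):
        # Reduce within each partition, then combine the partial results.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)

# Pretend these two partitions came from two Solr shards.
rdd = ToyRDD([[1, 2, 3], [4, 5]])
total = rdd.map(lambda x: x * 10).reduce(lambda a, b: a + b)
print(total)  # 150
```

The point of the abstraction is that the `map`/`reduce` calls look the same whether the partitions live in one Python list or across a hundred machines.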

We'll talk about the implications and opportunities of treating Solr as just another RDD, and we'll review some use cases for running Spark jobs over Solr, ranging from comparing data sets to calculating Flesch–Kincaid reading levels in text. To top it off, we'll demonstrate how to use your existing SQL skills with Spark SQL (http://spark.apache.org/sql/) -- declarative programming over distributed datasets in real time!
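As a taste of the kind of per-document computation a Spark `map` could run over text pulled from Solr, here's a small self-contained Python sketch of the Flesch–Kincaid grade-level formula. The syllable counter is a rough heuristic of our own, not taken from any particular library:

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels, with a
    # small correction for a trailing silent 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text):
    # Flesch-Kincaid grade level:
    #   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

score = flesch_kincaid_grade("The cat sat on the mat. It was warm.")
print(score)  # very simple text scores a low (here negative) grade level
```

Because the function only needs one document's text at a time, it parallelizes trivially: in Spark it would be one `map` over an RDD of documents.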

Finally, we'll tie it all together with a new Apache project that marries the best of IPython Notebook (http://ipython.org/notebook.html) (the favorite tool of data scientists) and the best of distributed computing (Apache Spark and Spark SQL). Apache Zeppelin (https://zeppelin.incubator.apache.org/) is the interactive computational environment for data analytics. Just like IPython Notebook, Zeppelin supports collaboration, data exploration and discovery, and rich graphs and visualizations. But its deep integration with Spark means Apache Zeppelin is the "interactive analytics notebook" for Big Data.

About Eric

Fascinated by the “craft” of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past 10 years. He is a member of the Apache Software Foundation and lately has been mulling over how we move our expectations for search from “10 blue links” to “remarkably right”. In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker he has advocated the advantages of Agile practices in software development. Eric became involved in Solr in 2007 when he submitted SOLR-284, the patch for parsing rich documents such as PDF and MS Office files, which went on to become the single most popular patch as measured by community votes! The patch was subsequently cleaned up and enhanced by three other contributors, demonstrating the power of the Free/Open Source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. Eric co-authored "Solr Enterprise Search Server", the first book on Solr. He blogs at http://www.opensourceconnections.com/blog/.

About Data Hackers

RVA Data Hackers is a community of programmers who meet regularly to develop skills and learn about the tools and techniques of Big Data. We discuss how to find, organize, understand and serve data sets large and small. We'll cover anything related to 'big data' -- machine learning, artificial intelligence and architectures to scale Big Data for the Internet. If you're a programmer interested in machine learning algorithms and managing big data, this group is for you. Topics vary from basic concepts to demonstrations of real-world implementations and everything in between. Our mission is to foster a local community of experienced, practicing experts. We're here to have fun, share and learn about an exciting field of computer science.

Our Sponsors:

RVA Data Hackers is sponsored by UpJump (http://upjump.com), CapTech Consulting (http://captechconsulting.com), Richmond Analytics (http://www.richmondanalytics.com) and 804RVA (http://www.804rva.com).

RVA Data Hackers
WORK Studios
1657 West Broad Street · Richmond, VA