In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provide a highly flexible approach to its users for entity linkage. This flexibility is due to Lucene's powerful query language that is able to combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
Lastly and time permitting, I will present ShapeLuceneRDD which enhances LuceneRDD with geospatial queries.
Presenter: Anastasios Zouzias, twitter: @zouzias, email: Anastasios Zouzias
Senior Data Scientists @ Sqooba