
Data Science in Practice: Building a Graph-Based Search Engine

Hosted by Alex Leeds

Details

Mark Ibrahim will walk us through his work on a side project called Knowledge Search.

Knowledge Search (http://knowledgesearch.us/) is a graph-based search engine powered by Wikipedia. The engine builds a weighted graph from each article's first link and its page views to conceptually link events, objects, people, and places. The project is inspired by research done in collaboration with Peter S. Dodds and Christopher M. Danforth.
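
To make the idea of a first-link graph concrete, here is a minimal sketch in plain Python. It is not the project's code: the toy articles, the page-view counts, the `build_graph` helper, and the choice to weight each edge by the source article's page views are all assumptions made for illustration. The talk description does not say how the weights are actually computed.

```python
from collections import defaultdict

# Hypothetical toy data: each article mapped to the first link in its text.
FIRST_LINK = {
    "Pencil": "Writing implement",
    "Pen": "Writing implement",
    "Charcoal pencil": "Pencil",
    "Writing implement": "Tool",
}

# Hypothetical page-view counts per article.
PAGE_VIEWS = {
    "Pencil": 1200,
    "Pen": 3000,
    "Charcoal pencil": 150,
    "Writing implement": 400,
    "Tool": 900,
}


def build_graph(first_link, page_views):
    """Return {target: {source: weight}} for every first-link edge.

    Assumption: an edge's weight is the page-view count of its source
    article (0 if the count is unknown).
    """
    graph = defaultdict(dict)
    for source, target in first_link.items():
        graph[target][source] = page_views.get(source, 0)
    return dict(graph)


graph = build_graph(FIRST_LINK, PAGE_VIEWS)
print(graph["Writing implement"])
```

Indexing edges by their target makes it cheap to list everything that points at a concept, which is the lookup a "types of X" query needs.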

A search for “pencil” categorizes the term as a “writing implement,” identifies types of pencils such as “HB” or “charcoal,” and discovers other writing implements such as “pens.” The engine processes all 11 million articles in the English Wikipedia dump (50-110 GB, updated monthly) using Spark, Neo4j (a graph database), and Elasticsearch. It accepts misspelled and multi-word queries, then renders a d3 visualization of the resulting subgraph.
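
The "pencil" example can be sketched as a small lookup over first-link edges. This is an illustration under stated assumptions, not the engine itself: the toy edges and the `resolve` and `search` helpers are hypothetical, and Python's standard `difflib` stands in for the fuzzy matching that Elasticsearch provides in the real system.

```python
import difflib

# Hypothetical toy edges: article -> target of its first link (lowercased).
FIRST_LINK = {
    "pencil": "writing implement",
    "pen": "writing implement",
    "hb pencil": "pencil",
    "charcoal pencil": "pencil",
    "writing implement": "tool",
}


def resolve(query, titles):
    """Map a possibly misspelled query to the closest known title, or None."""
    matches = difflib.get_close_matches(
        query.lower().strip(), titles, n=1, cutoff=0.6
    )
    return matches[0] if matches else None


def search(query, first_link=FIRST_LINK):
    """Return the category, types, and related concepts for a query."""
    titles = sorted(set(first_link) | set(first_link.values()))
    title = resolve(query, titles)
    if title is None:
        return None
    # The category is where the article's first link points.
    category = first_link.get(title)
    # Types are the articles whose first link points at this article.
    types = sorted(a for a, t in first_link.items() if t == title)
    # Related concepts share the same category.
    related = []
    if category is not None:
        related = sorted(
            a for a, t in first_link.items() if t == category and a != title
        )
    return {"title": title, "category": category, "types": types, "related": related}


result = search("pencl")  # misspelled on purpose
print(result)
```

The returned dictionary is roughly the subgraph a front end would need in order to draw the term, its parent category, its children, and its siblings.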

Mark Ibrahim is a Fellow at Insight Data Engineering. Previously, he built fixed income risk models for UBS and developed applications to tag Twitter and Facebook posts as a freelancer for Condé Nast. He enjoys using math and technology to explore human behavior over a good cup of joe.

Directions:
ThoughtWorks is on the 15th floor of 99 Madison Ave!

AI Professionals - NYC AI Hackers
ThoughtWorks
99 Madison Ave · New York, NY