Data Science in Practice: Building a Graph-Based Search Engine

NYC Data Wranglers


Mark Ibrahim will walk us through his work on a side project called Knowledge Search.

Knowledge Search is a graph-based search engine powered by Wikipedia. The engine builds a weighted graph from each article's first link and its page views to conceptually link events, objects, people, and places. The project is inspired by research done in collaboration with Peter S. Dodds and Christopher M. Danforth.
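For a rough sense of what a first-link graph weighted by page views could look like, here is a minimal Python sketch. The article titles, view counts, the `build_graph` helper, and the choice to weight each edge by the source article's page views are all assumptions made for illustration; the talk may describe a different weighting.

```python
from collections import defaultdict

# Hypothetical toy data: each article's first linked article, plus made-up page views.
first_link = {
    "HB pencil": "Pencil",
    "Charcoal pencil": "Pencil",
    "Pencil": "Writing implement",
    "Pen": "Writing implement",
}
page_views = {
    "HB pencil": 1200,
    "Charcoal pencil": 800,
    "Pencil": 50000,
    "Pen": 65000,
    "Writing implement": 9000,
}


def build_graph(first_link, page_views):
    """Return {target: {source: weight}} with one edge per first link."""
    graph = defaultdict(dict)
    for source, target in first_link.items():
        # Assumption: an edge is weighted by the page views of the article it starts from.
        graph[target][source] = page_views.get(source, 0)
    return dict(graph)


graph = build_graph(first_link, page_views)
```

Storing edges under their target makes it cheap to ask "which articles point here first?", which is the question a category-style lookup needs to answer.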

A search for “pencil” categorizes the term as a “writing implement,” identifies types of pencils such as “HB” or “charcoal,” and discovers other writing implements such as “pens.” The engine processes all 11 million articles in the English dump of Wikipedia ([masked]GB, updated monthly) using Spark, Neo4j (a graph database), and Elasticsearch. The search engine tolerates misspelled and multi-word queries, then renders a D3 visualization of the matching subgraph.
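The “pencil” lookup can be imitated on a toy scale. The sketch below is not the project's code: it uses Python's standard `difflib` as a stand-in for Elasticsearch's fuzzy matching and a plain dictionary as a stand-in for Neo4j, and the `lookup` function and sample titles are hypothetical.

```python
import difflib

# Hypothetical first-link data: article -> the article its first link points to.
first_link = {
    "HB pencil": "Pencil",
    "Charcoal pencil": "Pencil",
    "Pencil": "Writing implement",
    "Pen": "Writing implement",
}


def lookup(query, first_link, cutoff=0.6):
    """Resolve a possibly misspelled query and return its local subgraph."""
    titles = set(first_link) | set(first_link.values())
    by_lower = {title.lower(): title for title in titles}
    # Stand-in for fuzzy search: pick the closest known title, if any is close enough.
    matches = difflib.get_close_matches(query.lower(), list(by_lower), n=1, cutoff=cutoff)
    if not matches:
        return None
    title = by_lower[matches[0]]
    category = first_link.get(title)  # where this article's first link points
    types = sorted(a for a, target in first_link.items() if target == title)
    related = []
    if category is not None:
        # Other articles whose first link points to the same category.
        related = sorted(
            a for a, target in first_link.items() if target == category and a != title
        )
    return {"title": title, "category": category, "types": types, "related": related}


result = lookup("pencl", first_link)
```

With this toy data, the misspelled query "pencl" resolves to "Pencil", whose category is "Writing implement", whose types are the two pencil articles, and whose only related article is "Pen".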

Mark Ibrahim is a Fellow at Insight Data Engineering. Previously, he built fixed income risk models for UBS and developed applications to tag Twitter and Facebook posts as a freelancer for Condé Nast. He enjoys using math and technology to explore human behavior over a good cup of joe.

ThoughtWorks is on the 15th floor of 99 Madison Ave!