Search Query Categorization at Scale
Hosted by Analytics.Club NYC Search & Discovery
Details
Classifying short text into a predefined hierarchy of categories is a challenge. The need to categorize short texts arises in multiple domains: keywords and queries in online advertising, improvement of search engine results, analysis of tweets or messages in social networks, etc. We leverage community-moderated, freely available data sets (Wikipedia, DBpedia, Freebase) and open-source tools (Hadoop, Solr) to build a flexible and extensible classification model.
Magnetic is an online advertising company specializing in search retargeting and applying data science to online search behavior. We create custom real-time audience segments based on what users have searched for across the web. Targeting individual keywords found in user search history is a great way to build an audience, but selecting those keywords manually can present an operational challenge. The ability to classify queries and keywords helps create larger audiences with less effort and better accuracy. Other use cases for keyword classification in online advertising include reporting on the size of inventory available by category and campaign performance optimization.
We will share our experiences building a real-world data science system that scales to production data volumes of more than 20 million keyword classifications per hour. We will also touch on some aspects of knowledge discovery, such as language detection, n-gram extraction, and entity recognition.
About the speaker: Alex Dorman, CTO at Magnetic.
Alex has used Hadoop technologies since 2007. Before joining Magnetic, Alex built big data platforms and teams at Proclivity Media and ContextWeb/PulsePoint.
