Anomaly Detection and Natural Language Processing using Hadoop


Details
We are excited to have Ofer and Casey from Hortonworks (http://hortonworks.com) and Jiwon from Stanford give presentations on Anomaly Detection and Natural Language Processing at our pre-Hadoop Summit (http://2015.hadoopsummit.org/san-jose/) Meetup.
Talk 1: Using PageRank for Fraud Detection in Healthcare Data
Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.
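For readers unfamiliar with PageRank, here is a minimal sketch of the standard algorithm in plain Python. It is not the variant presented in the talk, and it uses neither SociaLite nor the Medicare-B data. The toy graph, the node names, and the `pagerank` helper are all hypothetical, chosen only to show how rank flows along edges.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Standard power-iteration PageRank over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every node passes its rank in equal shares to its successors.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph; in a real setting nodes might be providers
# and edges some measure of similarity between them (an assumption).
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C"), ("E", "C")]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

In an anomaly detection setting, one would then look for nodes whose score is unexpectedly high or low relative to their peers. How that comparison is made depends on how the graph is built.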
Ofer Mendelevitch (http://www.linkedin.com/in/ofermend), Hortonworks
Ofer Mendelevitch is Director of Data Sciences at Hortonworks (http://hortonworks.com), where he is responsible for professional services involving data science with Hadoop. Prior to joining Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, where he developed an investment strategy around big data. Before XSeed, Ofer served as VP of Engineering at Nor1, and before that he was Director of Engineering at Yahoo!, where he led multiple engineering and data science teams responsible for R&D of large-scale computational advertising projects, including CTR prediction (with Hadoop), a new front-end ad-serving system, and sales tools.
Jiwon Seo, Stanford University
Jiwon Seo is a PhD candidate in computer science at Stanford, working with Professor Monica Lam. His research interests include distributed systems, large-scale graph analysis, and query languages. With his advisor he designed and implemented SociaLite, a query language and distributed system for graph analysis. His work on SociaLite has been presented at various academic conferences, including ICDE, VLDB, and SIGMOD, as well as industry conferences, including Hadoop Summit and Python Conference.
Talk 2: Using Natural Language Processing on Non-Textual Data with MLlib
Word2Vec (https://code.google.com/p/word2vec/) is an unsupervised method for constructing vector representations of words, which can serve as features for downstream algorithms or as a basis for similarity searches. We look at using the Spark implementation of Word2Vec shipped in MLlib to help us organize and make sense of some non-textual data by treating discrete clinical events (e.g., diagnoses, drugs prescribed) in a medical dataset as non-textual "words".
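To illustrate the idea of treating clinical events as "words", here is a minimal sketch of one way to turn event records into the token sequences that Word2Vec expects. The record layout, the event codes, and the `events_to_sentences` helper are hypothetical and not taken from the talk. The sketch only prepares the input and does not train a model.

```python
from collections import defaultdict


def events_to_sentences(events):
    """Group (patient_id, timestamp, event_code) records into one
    time-ordered list of event codes per patient."""
    by_patient = defaultdict(list)
    for patient_id, timestamp, code in events:
        by_patient[patient_id].append((timestamp, code))
    # Sorting the (timestamp, code) pairs orders each patient's events in time.
    return {pid: [code for _, code in sorted(items)]
            for pid, items in by_patient.items()}


# Hypothetical records: each event code plays the role of a "word".
events = [
    ("p1", 2, "RX:metformin"),
    ("p1", 1, "DX:diabetes"),
    ("p2", 1, "DX:hypertension"),
    ("p2", 3, "RX:lisinopril"),
    ("p1", 3, "DX:neuropathy"),
]
sentences = events_to_sentences(events)
for patient_id, tokens in sorted(sentences.items()):
    print(patient_id, tokens)
```

Each per-patient list plays the role of a sentence. In Spark, such lists could be placed in an RDD and passed to the `fit` method of `pyspark.mllib.feature.Word2Vec`, assuming a Spark installation is available. Events that tend to occur in similar contexts would then be expected to receive similar vectors.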
Casey Stella (http://www.linkedin.com/pub/casey-stella/1/9a1/84b), Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks (http://hortonworks.com). In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist in the Oil & Gas industry. Before that, I was a poor graduate student in Math at Texas A&M. I specialize in writing software and solving problems where there are either scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems or anything vaguely mathematical.
Schedule
5:00 – 6:00pm : Reception at the San Jose Ballroom at the San Jose Marriott (next to the Convention Center). No summit registration required.
6:30 – 7:30pm : Talks
7:30 – 8:00pm : Social
Sponsors
Many thanks to Hortonworks for making this meetup possible. Thanks to Hakka Labs for videography support.
