PySpark, IPython Notebook, and Spark SQL as an environment for data science


Details
Abstract: Data science on Hadoop can be a daunting journey, since you are generally spanning multiple tools and different interfaces. Furthermore, while there are people out there doing data science, worked examples are few and far between.
As part of the Social Security Act, the Centers for Medicare & Medicaid Services has begun to publish data detailing the relationship between physicians and medical institutions. This data has been analyzed cursorily in the press, but an in-depth outlier and Benford's law analysis hasn't been attempted (to my knowledge).
Casey will present a demo using Spark and Hive to do the above analysis without leaving IPython notebook.
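For readers unfamiliar with Benford's law, it predicts that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. The following is a minimal sketch of that check in plain Python, not the code from the talk. The function names and the sample amounts are hypothetical, and in the demo's setting the values would presumably come from a Spark or Hive query rather than a hard-coded list.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of x, or None for zero."""
    if x == 0:
        return None
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(x):.15e}"[0])


def chi_square(values):
    """Chi-square statistic of observed leading digits vs. Benford's law."""
    digits = [leading_digit(v) for v in values]
    counts = Counter(d for d in digits if d is not None)
    n = sum(counts.values())
    stat = 0.0
    for d, p in benford_expected().items():
        expected = n * p
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Hypothetical payment amounts, made up purely for illustration.
sample_amounts = [12.50, 130.00, 1875.25, 24.99, 310.00, 1.75, 19.00, 45.10, 980.00]
print(chi_square(sample_amounts))
```

A larger statistic means the observed digits deviate more from the Benford distribution. Judging significance would require comparing it against a chi-square distribution with 8 degrees of freedom, which this sketch leaves out.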
Speaker: Casey Stella is a Principal Architect at Hortonworks and focuses on issues around data science, especially natural language processing at scale. He has domain knowledge in medical/clinical informatics and oil/gas data analysis and signal processing at scale.
Food: Pizza and drinks, first come, first served, starting at 6:30 PM.
Map: http://bit.ly/RCtaTI
