As new data sets become available through municipal Open Data initiatives, how can these be leveraged to reveal insights and build services for communities? This talk by Paco Nathan shows an open source project based on the City of Palo Alto "Open Data Platform", demonstrating how to work with public GIS data available there for parks, roads, trees, etc.
Starting from the raw data, we review simple techniques for discovery and modeling, then use Cascalog, Hadoop, and R to structure the GIS export into data products and accompanying visualizations. The end result is a data service for a mobile app: "Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo." Extensions to the app incorporate other data sources to provide insights for the community: for example, monitoring invasive vs. endangered species, or the proximity of toxin-producing species near day-care centers.
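To give a rough feel for the "shady spot" idea, here is a minimal sketch in plain Python rather than Cascalog. It is not the project's code: the record fields (`lat`, `lon`, `height_m`, `name`), the 25-meter radius, and the scoring rule (sum of nearby tree heights) are all assumptions made up for illustration, and the sample coordinates are invented.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def shade_score(segment, trees, radius_m=25.0):
    """Hypothetical score: total height of trees within radius_m of the segment point."""
    return sum(
        t["height_m"] for t in trees
        if haversine_m(segment["lat"], segment["lon"], t["lat"], t["lon"]) <= radius_m
    )

def rank_segments(segments, trees, radius_m=25.0):
    """Order road segments from most to least shaded under the assumed score."""
    return sorted(segments, key=lambda s: shade_score(s, trees, radius_m), reverse=True)

# Invented sample records, standing in for rows parsed from a GIS export.
trees = [
    {"lat": 37.4450, "lon": -122.1610, "height_m": 12.0},
    {"lat": 37.4451, "lon": -122.1611, "height_m": 8.0},
    {"lat": 37.4300, "lon": -122.1400, "height_m": 20.0},
]
segments = [
    {"name": "segment-a", "lat": 37.4450, "lon": -122.1610},
    {"name": "segment-b", "lat": 37.4400, "lon": -122.1500},
]

ranked = rank_segments(segments, trees)
print([s["name"] for s in ranked])
```

This brute-force version compares every segment against every tree, which is fine for a handful of records; at city scale one would likely bucket points spatially first so that only nearby trees are compared.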
Cascalog is an open source project from Twitter authored by Nathan Marz, Sam Ritchie, et al., which integrates the Cascading open source API into the Clojure language. Contributions, sample apps, and case studies have been published by a number of organizations including Climate Corp, REDD Metrics, YieldBot, Nokia Maps, Factual, Harvard School of Public Health, etc. This talk includes code and data, but also explores the process of approaching Open Data from the perspective of developing a data product -- from start to finish. Cascalog functions are typically only a few lines long, so the code involved is brief and simple to grasp.
The talk also explores simple practices for test-driven development with large-scale data workflows based on unique features in Cascalog. We also touch on CS theory, going all the way back to the original "relational model" paper by Edgar Codd to discuss some of the unique properties of Cascalog. These aspects are useful for a wide range of data-driven apps.
This example app project began as a seminar at CMU West, showing students examples of how to work with the Palo Alto Open Data initiative, plus how to leverage open source tools for Big Data. The intended audience needs some exposure to programming, but the focus is mostly on process: understanding how to approach large-scale data. This project is also used as a case study in the O'Reilly book "Enterprise Data Workflows with Cascading".
GitHub repo for the open source project (code + wiki): https://github.com/Cascading/CoPA
Paco Nathan is the Director of Data Science at Concurrent in SF and a committer on the Cascading open source project (http://cascading.org/). He has expertise in Hadoop, R, AWS, machine learning, and predictive analytics, with 25+ years in the tech industry overall. For the past 10+ years Paco has led innovative Data Science teams, building large-scale apps. He is the author of the O'Reilly book "Enterprise Data Workflows with Cascading". http://liber118.com/pxn/ @pacoid
Parking and Directions:
We're in the Gateway West building, next to the Westfield Century City mall on the east side. It's the only building with big red "Westfield" signs on top.
The best place for attendees to park is at the Westfield Century City Mall. Please note that they no longer offer free parking for the first 3 hours. Parking rates are as follows:
0 to 3 hours: $1.00 per hour
3 to 5 hours: $1.00 every 15 minutes
5 hours or more: $24.00
Daily maximum: $24.00
About the Venue:
Factual (http://www.factual.com) is a location platform that enables personalized and contextually relevant mobile experiences by enriching mobile location signals with definitive global data. Factual’s real-time data stack builds and maintains data on a global scale, with Factual's core Global Places data covering over 65 million local businesses and points of interest in 50 countries. Factual’s platform also informs location with contextual demographic and commercial data, and offers cleaning and mapping services for business listings and points of interest.