Big Data Crunchathon


Details
"I hear Clojure is good for processing big data. How does that work?"
This conversation during our last fearless adventure spawned the topic of our next meetup. In one corner we have large data sets from the National Renewable Energy Laboratory (NREL) http://www.nrel.gov/. In the other corner we have parallel functions, multi-version concurrency control, lazy evaluation and the reducers framework; all the goodness that Clojure brings to the table. We will start this meetup by quickly reviewing Clojure's concurrency features. Then we will split into groups and see who can come up with the best solutions. After group work we will come back together and pick the best of the best based on correctness of answers, speed of execution, clarity of code and use of parallelization. We will have an eight-core MacBook Pro with Java 7, the latest Leiningen and the latest Clojure installed. Bring your laptops and come ready to code. And bring your A-game. It's on.
Thanks to Scott Crowder from NREL for providing the data sets. I'll make either the full data sets or samples available as soon as I get them. We would appreciate it if folks would volunteer to go over:
Multi-version Concurrency Control (MVCC) - Atoms, Refs, Agents
Parallel functions - pmap and friends
Lazy evaluation, paging through large data sets
The Reducers framework
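To get a feel for a couple of these topics before the meetup, here is a minimal sketch comparing pmap with a reducers fold. The squared-numbers workload is just a stand-in, not anything from the NREL data:

```clojure
(require '[clojure.core.reducers :as r])

;; pmap: a drop-in parallel version of map for expensive per-item work.
;; It is semi-lazy, staying a few elements ahead of consumption.
(def squares (pmap #(* % %) (range 10)))

;; Reducers: r/map builds no intermediate sequence, and r/fold splits a
;; foldable collection (e.g. a vector) into chunks, reducing them in
;; parallel on the fork/join pool and combining the partial sums.
(def sum-of-squares
  (r/fold + (r/map #(* % %) (vec (range 1000)))))

(println (take 5 squares))   ; (0 1 4 9 16)
(println sum-of-squares)     ; 332833500
```

Note that pmap pays per-element coordination overhead, so it only wins when each call is expensive; fold amortizes that cost over whole chunks, which is part of why choosing the boundaries of parallelization matters.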
Thanks to Cody and Tek Systems for your continued support, we appreciate the pizza. Beer to share is welcomed by all. Below are some resources to help you get up to speed before the meeting.
The answers are a little weak, but there is a really important point here: you must choose the right boundaries of parallelization:
http://www.michaelharrison.ws/weblog/?p=387
All of the Clojure books out there have great sections on the MVCC concurrency primitives.
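If you want a head start on those primitives, here is a tiny sketch of each. The counters and transfer amounts are made up for illustration:

```clojure
;; Atom: uncoordinated, synchronous updates via compare-and-swap.
(def hits (atom 0))
(swap! hits inc)

;; Refs: coordinated, transactional updates across multiple identities,
;; retried automatically on conflict (this is the MVCC part).
(def from (ref 100))
(def to   (ref 0))
(dosync
  (alter from - 25)
  (alter to   + 25))

;; Agent: uncoordinated, asynchronous updates applied on a thread pool.
(def logger (agent []))
(send logger conj "started")
(await logger) ; block until the queued action has run

(println @hits @from @to @logger) ; 1 75 25 [started]
```

The rule of thumb: one independent value, use an atom; several values that must change together, use refs in a transaction; fire-and-forget updates, use an agent.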
Extra credit:

