Past Meetup

Data Exploration in Spark

150 people went


Many thanks to Radius Intelligence for volunteering to host our next gathering. Doors open at 6:30 and presentations start at 7. See you there!

Main Talk: "Data Exploration in Spark"

Distributed frameworks can be hard to break into; there is a set of patterns that your code must fit in order to run. Bridging the gap between the patterns you're used to writing and the patterns of distributed frameworks can seem like an insurmountable effort. Fortunately, the gap is now much smaller with Apache Spark. We'll go through some examples of exploring and working with arbitrary datasets we found on the internet. Typical exploration patterns that you might try locally work very similarly in Spark. Once you've specified your logic with Spark, you can be confident that it will scale to large datasets. Attendees are encouraged to follow along on their own machines; example code and setup instructions will be provided.

Speaker: Nimbus Goehausen is a data engineer at Radius. During his 4 years at Radius he has worked on projects ranging from an unstructured text geotagger to a Hadoop pipeline that de-duplicates business records. Currently he spends most of his time trying to achieve novel and scalable solutions to data problems using Apache Spark. Prior to working at Radius, Nimbus was a research assistant at UC Berkeley, working on autonomous helicopter control and flying cyborg beetle control. You can find him on Twitter @nimbusgo.

Lightning Talk #1: "Exploring the Small Business Economy using Spark"

Together, we'll blaze through some interesting insights related to the small business economy of the United States using the small business index and the power of Spark!

Speaker: Shaun Swanson is currently applying machine learning and complexity economics theory to a database of 30 million small businesses in the United States in order to piece together insights and models that drive economic success across the nation. He has previously studied the complexities of city redevelopment and entrepreneurship, both as a community organizer in the emerging tech industry of Las Vegas and as the founder of several failed tech startups. ;) In his free time, Shaun is researching potential generalizations of classical Boltzmann-Gibbs thermostatistics. He hopes to discover a version that better fits modern observations of complex system behavior.

Lightning Talk #2: "Union Find in MapReduce: Finding Islands in Disjoint Graph"

Working with graphs is a common way to deal with many of today's big data problems. Often graphs can be used to discover hidden relationships between many disparate pieces of data. We'll cover basic techniques for finding related groups within a graph, and discuss how these naive approaches break down as the size of the graph increases. Finally, we'll explore how to scale these techniques to billion-node graphs.
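For readers new to the idea, here is a minimal single-machine sketch of union-find, one basic technique for grouping connected nodes. It is not the speaker's MapReduce method, only an illustration of the kind of naive approach that works while the whole graph fits in memory. The function names `find` and `connected_groups` and the sample edge list are hypothetical, chosen for this example.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def connected_groups(edges):
    """Group nodes into connected components ("islands") from an edge list."""
    parent = {}
    for a, b in edges:
        # Register unseen nodes as their own single-node group.
        for node in (a, b):
            parent.setdefault(node, node)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return list(groups.values())


# Hypothetical sample data: two islands plus one isolated node (a self-loop).
sample_edges = [(1, 2), (2, 3), (4, 5), (6, 6)]
islands = connected_groups(sample_edges)
```

This version keeps the entire `parent` map in one process, which is exactly the assumption that stops holding once a graph is too large for a single machine.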

Speaker: Adrian Druzgalski is the Chief Technology Officer and Co-Founder of Radius Intelligence.