
Data Science at scale and graph analysis with Dean Wampler and Paco Nathan

Hosted by Friso van Vollenhoven

Details

For the next meetup, we welcome speakers Paco Nathan and Dean Wampler, who are in town to represent Apache Spark at the Scala Days conference. Luckily they have found some time in their schedules to pay our group a visit.

We thank the kind people at Booking.com (https://workingatbooking.com) for hosting us this evening and providing us with food and drinks.

Agenda:

• 18.30: Arrive, eat, drink.

• 19.00: Introduction from your humble organisers and Booking.com

• 19.10: Talks:

Data Science at Scale with Spark, by Dean Wampler, Big Data Architect at Typesafe

Apache Spark has been blessed as the replacement for MapReduce in Hadoop environments. It also runs in other deployment modes. Spark provides better performance, better developer productivity, and it supports a wider range of application scenarios than MapReduce, including event stream processing, ad hoc queries, graphs, and iterative algorithms. Graphs are a natural way to represent many data sets, such as social media networks, and iterative algorithms are important for Machine Learning, such as model training with gradient descent.

This talk discusses Spark from a Data Science perspective: its strengths and weaknesses, the Scala, Java, Python, and R APIs it offers for common analytics problems, what's missing, and what's planned. We'll look at support for ad hoc queries over large data sets, machine learning algorithms, graph processing, the programmer experience, and the pragmatic concerns of running applications.

GraphX: Graph analytics for insights about developer communities, by Paco Nathan, O'Reilly author of the upcoming Spark book

Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities, based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are fed back to the respective developer communities, for example as leaderboards, and are being used to help refine developer certification exams. As an example, we will examine analysis of the Spark developer community itself.

• 21.00: Eat + drink some more

• ??.??: Everybody out!

Data Council Amsterdam - NL Data Engineering & Science