Potentially the Most Significant Open Source Project of the Next Decade


Details
IBM recently declared that Apache Spark is "Potentially the Most Significant Open Source Project of the Next Decade". Therefore it would be rude not to invite them to Manchester to explain why. A member of IBM's European Big Data team is coming to tell us what all the hype is about. We also have Doctor Matthew Rowe from Lancaster University explaining his use of Spark in social data research, and we welcome back Christopher Batey from DataStax to dive into Spark Streaming with Kafka and Cassandra.
If excellent content is not enough, Rental Cars have kindly asked us to use their shiny new central Manchester offices to hold the event - there is plenty of free on-street parking only a few minutes' walk away.
Agenda
6.30pm Networking over a beer & pizza
6.55pm Welcome
7.00pm Use machine learning when buying a car
By: Willem Hendriks - IBM EMEA
How to use Spark for machine learning plus the latest IBM announcements around Spark and Hadoop and the Open Data Platform
7.40pm Social Computing Research with Apache Spark
By: Dr Matthew Rowe, Lancaster University
In this talk I will explain how we use Apache Spark, and also elements of the Hadoop stack, for performing social computing research within my team of researchers. I will focus on two example areas: characterising UK ISP filters, and the diffusion of innovation in language across social media. For the former, I will explain how we use Spark to compute pseudo-classifiers in a parallelised manner, and in doing so allow the accuracy of web filters to be gauged. For the latter, I will explain how we use Spark to mine signals of language innovation (neologisms, word blends) across social media and track their diffusion over time. In both cases, we are able to use data to answer social questions; something that we could not have done 5 years ago.
8.00pm Real time stream processing with batch analytics
By: Christopher Batey - Software Engineer at DataStax
Combining near real time stream processing with batch analytics is the goal for so many companies. Why? Well, rather than getting new insights the next day after a nightly batch job, you can start to get them within seconds with stream processing. The result? Fast results that are up to date but also take into account vast amounts of historical data. Typically this is two technology stacks, e.g. Storm for stream processing and Hadoop for batch analytics. This talk will show you how to do it all with the same stack: Spark running on Cassandra.
This talk will cover:
• As we've already had a session on Apache Cassandra in April, there will only be a very short introduction to how Cassandra distributes data - consistent hashing + replication
• Hooking up Spark stream processing to do on-the-fly aggregates when ingesting data into Cassandra
• Running Spark batch jobs
• An in-depth look at how we can build Spark partitions from data that is on a single Cassandra host
8.45 - 9.15pm Networking
As you can see it's a full agenda so please try and arrive in good time.
Biographies
Christopher Batey (@chbatey) is a Software Engineer by trade and is currently employed by DataStax as a Technical Evangelist for Apache Cassandra. Previously he was a Senior Software Engineer at BSkyB, where he spent his time designing and developing the next-generation platform that backs Sky Go, Now TV, etc. He is a keen blogger, tweeter and open source advocate.
Dr. Matthew Rowe is a lecturer in social computing at Lancaster University, where he is also director of the M.Sc. in Data Science. Matthew has published over 60 peer-reviewed publications to date at a range of international conferences and workshops, and has run the Making Sense of Microposts workshop series at the World Wide Web conference for the past 5 years. His research focuses on examining how people engage with one another on social media, and more generally the Web, for which he explores the use of data mining, machine learning, and user modelling techniques.
Web: http://www.lancaster.ac.uk/staff/rowem/
Twitter: @mrowebot
Willem Hendriks studied mathematics at TU Delft, a technical university in the Netherlands. He spent some years at a small process improvement company, teaching statistics and analyzing data for various projects. He has a passion for programming in Python, peanut butter, and pizza.
