November 2014 Hadoop Meetup


Details
Dear HUG UK members,
I am pleased to announce our November Meetup. This Meetup is sponsored by Cloudera, who are providing the venue and refreshments and flying two speakers over to the UK before the Strata + Hadoop World event in Barcelona. Thanks!
If you want to give a short five-minute talk about your real-world Hadoop setup, please get in touch with the team at meetups@huguk.org.
Note: The time has been changed to make it half an hour earlier - in line with most past meetups.
Alex McLintock
co-organiser of HUGUK
Time: Monday November 17th - Doors open 6:30pm.
Refreshments from 6:30pm to 7:00pm.
Presentations from 7:00pm to 9:00pm.
Location: Offices of Ketchum, 35-41 Folgate St, London, E1 6BX.
AGENDA
Session 1: Architectural considerations for Hadoop applications
Speaker: Mark Grover, Software Engineer, Cloudera
Speaker: Ted Malaska, Sr Solution Architect, Cloudera
Description:
In this talk we’ll walk through an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. We’ll use this example to illustrate important topics such as:
• Modeling data in Hadoop and selecting optimal storage formats for data stored in Hadoop
• Moving data between Hadoop and external data management systems such as relational databases
• Moving event-based data such as logs and machine generated data into Hadoop
• Accessing and processing data in Hadoop
• Orchestrating and scheduling workflows on Hadoop
Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This talk will be valuable for developers, architects, or project leads who are already knowledgeable about Hadoop, and are now looking for more insight into how it can be leveraged to implement real-world applications.
Session 2: Clickstream & Social Media Analysis with Apache Spark
Speaker: Michael Cutler, CTO of TUMRA
Description:
One of the most common use-cases for ‘Big Data’ tools is analysing Clickstream and Social Media data. The conventional way to achieve this involved using Hadoop Map/Reduce jobs to ‘extract, transform, and load’ (ETL) the raw data into a useful form, such as Hive tables, which analysts can use to answer questions. These ETL jobs were generally slow batch processes (introducing latency) and required Java developers to maintain them and to make any changes to the data schema.
Enter Apache Spark, a data processing engine designed for both batch and streaming workloads. The entire process of transforming the raw input data through to asking questions using a SQL-like interface can be achieved in one step, and it’s up to 100x faster than Map/Reduce & Hive. Did I forget to mention that it also has an interactive shell, machine learning and graph analysis baked in?
This talk will introduce Apache Spark and some of the use-cases we’ve successfully deployed with it, and will then dive into real example clickstream & social media analysis tasks illustrating how they can be achieved simply and quickly using Spark.
Bios
Mark Grover, Software Engineer, Cloudera
Mark Grover is a committer on Apache Bigtop, a committer and PMC member on Apache Sentry (incubating) and a contributor to Apache Hadoop, Apache Spark, Apache Hive, Apache Sqoop and Apache Flume. He is currently co-authoring O’Reilly’s Hadoop Application Architectures title and is a section author of O’Reilly’s book on Apache Hive – Programming Hive. He has written a few guest blog posts and spoken at many conferences about technologies in the Hadoop ecosystem.
Ted Malaska, Sr Solution Architect, Cloudera
Ted has worked on close to 60 clusters across two to three dozen clients, covering hundreds of use cases. He has 18 years of professional experience working for start-ups, the US government, a number of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the US’s largest non-profit financial regulator. He has architecture experience across topics such as Hadoop, Web 2.0, Mobile, SOA (ESB, BPM), and Big Data. Ted is a regular committer to Flume, Avro, Pig and YARN.
Michael Cutler, CTO of TUMRA
Michael pioneered the use of Big Data at BSkyB from 2008 and was a guest speaker at Hadoop World 2011 (New York), where he presented one of the first talks on machine learning at scale. Since co-founding TUMRA in 2012, he has applied Apache Spark, Cassandra and machine learning algorithms to build real-time personalisation and recommendation solutions delivering increases of £1,000,000+ in revenue for leading ecommerce businesses.
