Big Data Mining and Graph Processing


Details
Jimmy is passing through town and wants to share some of his experiences at Twitter, so we are having a last-minute July meetup! We will also have Carlos back to give a brief overview of Apache Giraph (https://giraph.apache.org/).
We are very pleased to have NICTA (http://nicta.com.au/) sponsor the event: beer, pizza and venue!
Note the venue: the NICTA office in Redfern (right near the train station). We hope you can all make it at short notice.
Scaling Big Data Mining Infrastructure: The Twitter Experience
Jimmy Lin (http://www.umiacs.umd.edu/~jimmylin/) - University of Maryland, USA and Twitter
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, I'll discuss the evolution of Twitter's infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms, and is followed by substantial effort to turn preliminary models into robust solutions. In this context, I'll discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they're insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as "plumbing".
Apache Giraph Essentials
Carlos Piva - Cloudera
Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms that runs on Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies such as Facebook and PayPal to help represent and analyse the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.
