Detailed agenda and summaries to follow. General agenda:
- 6:00 - 6:30 - Socialize over food and beer(s)
- 6:30 - 7:00 - HCatalog Overview
- 7:00 - 7:30 - Rhadoop, Hadoop for R
- 7:30 - 8:00 - Storm: distributed and fault-tolerant realtime computation
HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is that of polyglot persistence. Basically, what that comes down to is that you pick the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing - you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. And which one to use might depend on the user, or on the type of query you're interested in, or the type of job we want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source. As a end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
Presenter: Sushanth Sowmyan, Hortonworks
Rhadoop, Hadoop for R
RHadoop is an open source project aiming to combine two rising star in the analytics firmament: R and Hadoop. With more than 2M users, R is arguably the dominant language to express complex statistical computations. Hadoop needs no introduction at HUG. With RHadoop we are trying to combine the expressiveness of R with the scalability of Hadoop and to pave the way for the statistical community to tackle big data with the tools they are familiar with. At this time RHadoop is a collection of three packages that interface with HDFS, HBase and mapreduce, respectively. For mapreduce, the package is called rmr and we tried to give it a simple, high level interface that's true to the mapreduce model and integrated with the rest of the language. We will cover the API and provide some examples.
Presenter: Antonio Piccolboni, Revolution Analytics
Storm: distributed and fault-tolerant realtime computation
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Storm was open-sourced by Twitter in September of 2011 and has since been adopted by many companies around the world.
Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I'll introduce Storm and show how easy it is to use for realtime computation.
Presenter: Nathan Marz, Twitter
Yahoo Campus Map:
Location on Wikimapia: