Bay Area Hadoop User Group (HUG) May Meetup


Details
Hello Hadoopers,
The agenda for the May 19th meeting is now available:
- 6:00 - 6:15 - Socializing and Beers (Gates open at 5:45)
- 6:15 - 6:30 - What's new with Pig? Alan Gates, Yahoo!
- 6:30 - 7:00 - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
- 7:00 - 7:30 - Extraordinarily rapid and robust data analysis with Cascalog, Nathan Marz, BackType
- 7:30 - 7:45 - Apache Hadoop Release Plans for 0.21.0, Tom White, Cloudera
- QnA, Open Discussion, and a small surprise
Session details are available below.
Looking forward to seeing you there!
Register today for Hadoop Summit 2010 (https://hadoopsummit2010.eventbrite.com/), to be held June 29th at the Hyatt, Santa Clara, CA.
Dekel
HBase and Pig: The Hadoop ecosystem at Twitter
Twitter makes extensive use of Hadoop, HBase, and Pig to power its analytics infrastructure. In this talk, we will describe our data flow pipeline, go over the new Pig-HBase integration, and introduce Elephant Bird, the recently open-sourced collection of libraries we use for working with Protocol Buffers, Hadoop, HBase, Pig, and Thrift.
Dmitriy Ryaboy is an engineer at Twitter and a Pig committer; he previously worked at Lawrence Berkeley National Laboratory and at Ask.com. Dmitriy holds a bachelor's degree in Computer Science from UC Berkeley and a master's in Very Large Information Systems (it's a real thing) from Carnegie Mellon University. You can follow him on Twitter, where he goes by @squarecog.
Extraordinarily rapid and robust data analysis with Cascalog, Nathan Marz, BackType
Cascalog is an interactive query language for Hadoop with a focus on simplicity, expressiveness, and flexibility. It is intended for use by analysts and developers alike.
Cascalog eschews SQL syntax in favor of a simpler and more expressive syntax based on Datalog. With this added expressiveness, Cascalog can query existing data stores out of the box, with no data importing or under-the-hood configuration required. Because Cascalog sits on top of Clojure, a powerful JVM-based language and interactive shell, adding a new operation to a query is as simple as defining a new function.
In this presentation, Nathan will introduce Cascalog and describe how it is used at BackType. He will show how the Datalog syntax provides more robustness and flexibility than SQL-based languages. Finally, he will demonstrate how advanced users can leverage the Cascalog, Clojure, and Cascading stack to build more complex queries and libraries in Java and Clojure for data processing, data mining, and machine learning.
Nathan is the Lead Engineer at BackType, where he is building technology for real-time search and analytics of online social media. He has used Hadoop extensively since 2008, both for data warehousing and as the basis for scalable, data-intensive applications. Nathan relies heavily on technologies like Cascading and Clojure to simplify the development of complex applications on top of Hadoop. He writes a blog at https://nathanmarz.com/blog
Apache Hadoop Release Plans for 0.21.0
Tom will give a short update on the progress of the release, and explain the work that has been done on compatibility with 0.20.
