
Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez

Hosted By
Ed Kohlwey and 2 others

Details

Exploring Enron Email Dataset with Kiji and Hive

Lee Sheng, WibiData

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL-based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. Integrating the two offers the best of both worlds: ad hoc SQL-based querying over datasets whose evolvable schemas contain complex objects. This talk will present example queries that use this integration for exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.

Bio:

Lee is an engineer at WibiData who builds tools for developing Big Data applications. He holds a BS in Computer Science from Carnegie Mellon University. Previous stints include developing systems for making strategic buying decisions at Amazon.com as well as distributed simulation frameworks for the Department of Defense.

Apache YARN & Apache Tez

Tom McCuch, Technical Director, Hortonworks

Apache Hadoop has become synonymous with Big Data and powers large-scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN - the new Hadoop compute framework. YARN - Yet Another Resource Negotiator - is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed in Hadoop instead of outside Hadoop, thus improving efficiency, performance, and data sharing while lowering operational costs. The Big Data ecosystem is already converging on YARN, with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low-latency queries and heavyweight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing on YARN and Tez and covering architecture, motivating use cases, and the roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated by running interactive queries with both Hive on Tez and Hive on MapReduce, and comparing their performance side by side on the same Hadoop 2 cluster.

Bio:

Tom McCuch drives field architecture and engineering for Hortonworks in the Northeast region. Tom has over twenty-five years of experience in software engineering. At Hortonworks, Tom helps guide enterprise customers through their adoption of Apache Hadoop. He has deep experience across the Financial Services, Insurance, Life Sciences, Retail, and Telecommunications industries. Before coming to Hortonworks, Tom served in many different roles across Enterprise Architecture, Product Engineering, Professional Services, and Sales Engineering of mission-critical solutions based on Java and open source software.

Schedule

6:00-7:00 - Networking

7:00-7:15 - Announcements

7:15-8:00 - Lee Sheng on Kiji

8:00-8:15 - Break

8:15-9:00 - Tom McCuch on YARN and Tez

Hadoop-DC
Neustar (Room: Neuview)
21575 Ridgetop Circle · Sterling, VA