Advanced Hadoop Architectures and Unstructured Data Mining


Details
Hadoop.Next, HDFS Federation and High Availability
Owen O'Malley
Cofounder and Senior Architect, Hortonworks
Join Hortonworks cofounder and Apache Hadoop committer Owen O'Malley as he outlines Hadoop.Next and the approach and current status of the HDFS improvements. Apache Hadoop is the de facto Big Data platform for data storage and processing. The current stable, production release of Hadoop is Hadoop 1.0. The Apache Hadoop community is actively working on Hadoop 0.23, the next major version of Hadoop, with several notable improvements including HDFS Federation, High Availability and NextGen MapReduce. The HDFS NameNode is a robust and reliable service, as seen in production at Yahoo, Facebook and other enterprises. However, the NameNode does not have automatic failover. A hot failover solution called HA NameNode is under active development (HDFS-1623) and making excellent progress.
Owen contributed patches to Hadoop before it became an independent Apache project. He was the first committer added and remains one of the most active contributors to Apache Hadoop. He was also the founding chair of the Apache Hadoop Project Management Committee. Prior to co-founding Hortonworks, Owen worked on Yahoo! Search’s WebMap project, which built and performed heuristic analyses over a graph of the known web. Once ported to Apache Hadoop, it became the single largest known Hadoop application. He has a PhD in Software Engineering from the University of California, Irvine. Owen may be followed on Twitter: @owen_omalley.
New Architectural Possibilities for Hadoop
Ted Dunning
Chief Application Architect, MapR
Using standard Hadoop comes with a number of assumptions that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. An important cluster of these assumptions consists essentially of work-arounds for the limitations of HDFS. By augmenting HDFS-compatible access with access to files across multiple clusters using standard protocols like NFS, MapR makes many of these work-arounds unnecessary. I will describe the underlying architecture that MapR uses to enable these advances and show how this can simplify systems or, in some cases, make certain classes of programs run orders of magnitude faster.
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado, an MS degree in computer science from New Mexico State University, and a Ph.D. in computing science from Sheffield University in the United Kingdom.
Making Sense of the Data Chaos
Adam Gugliciello
Solution Engineer, Datameer
Until recently, data analysis by companies and government agencies has typically been based on structured datasets. This session will demonstrate how new insights can be gained from large amounts of text data, such as company documents, emails, and Twitter data, that traditionally could not be mined or analyzed. Through specific use cases and interesting examples, we will demonstrate how to take very large unstructured text documents and easily extract useful business insight from them. This talk will discuss uncovering and retrieving new insights from volumes of data, gleaning value from unstructured and unused sources, and enhancing customer intelligence.
Adam Gugliciello is a 15-year veteran of software engineering and systems architecture who specializes in highly available, parallel systems. Most recently he has developed grid computing solutions to enable deep analyses and intelligence gathering on huge software systems for technical debt and functional mapping. Adam is a Solution Engineer at Datameer, where he brings financial and telco application expertise to the use of the Datameer business intelligence suite.
Agenda
6:00-6:30 pm - Networking
6:30-7:00 pm - First presentation
7:05-7:35 pm - Second presentation
7:40-8:10 pm - Third presentation
