41st Bay Area Hadoop User Group (HUG) Monthly Meetup


Details
Agenda
6:00 - 6:30 - Socialize over food and beer
6:30 - 7:00 - In-memory data grid & MapReduce engine for real-time analytics
7:00 - 7:30 - Machine learning for cyber security using Hadoop
7:30 - 8:00 - Computing Capacity Calculator (C3) for Hadoop
Session I (6:30 - 7:00 PM) - In-memory data grid & MapReduce Engine for real-time analytics
Hadoop MapReduce offers a powerful data-parallel programming model that traditionally has been used to analyze large, static data sets with a multi-tenant, batch scheduling implementation that provides results in minutes to hours. Many real-time applications in financial services, e-commerce, logistics, and other areas can benefit from MapReduce’s parallel speedup but must analyze fast-changing data within milliseconds to seconds.
This talk will describe how ScaleOut Software implemented a MapReduce engine within a distributed, in-memory data grid to meet the needs of real-time analytics. It will describe the key architectural decisions and tradeoffs that were made to accelerate execution time, and it will compare this approach to other real-time Hadoop implementations, such as Spark. The performance benefits will be illustrated using a financial services application.
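To make the data-parallel model concrete, here is a minimal, hypothetical sketch in plain Python (not ScaleOut's actual API) of a single map/reduce pass over an in-memory collection of fast-changing trade records; the record layout and the map_trade/reduce_exposure functions are illustrative assumptions, and the point is that keeping the data in memory lets such a pass run in milliseconds rather than as a batch job.

from collections import defaultdict

# Hypothetical in-memory "grid": a dict of trade records keyed by trade id,
# standing in for objects held in a distributed in-memory data grid.
trades = {
    1: {"symbol": "AAPL", "quantity": 100, "price": 190.0},
    2: {"symbol": "MSFT", "quantity": 50,  "price": 410.0},
    3: {"symbol": "AAPL", "quantity": 200, "price": 191.5},
}

def map_trade(trade):
    """Map phase: emit (symbol, notional value) for one trade."""
    return trade["symbol"], trade["quantity"] * trade["price"]

def reduce_exposure(values):
    """Reduce phase: total exposure per symbol."""
    return sum(values)

# Shuffle/group intermediate pairs by key, then reduce each group.
grouped = defaultdict(list)
for key, value in map(map_trade, trades.values()):
    grouped[key].append(value)

exposure = {symbol: reduce_exposure(vals) for symbol, vals in grouped.items()}
print(exposure)  # {'AAPL': 57300.0, 'MSFT': 20500.0}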
Speaker: William L. Bain, Founder & CEO, ScaleOut Software
Bio:
Dr. William L. Bain is the founder of ScaleOut Software. Bill has a Ph.D. (1978) in electrical engineering/parallel computing from Rice University, and he has worked at Bell Labs research, Intel, and Microsoft. Bill founded and ran three start-up companies prior to joining Microsoft. In the most recent company (Valence Research), he developed a distributed Web load-balancing software solution that was acquired by Microsoft and is now called Network Load Balancing within the Windows Server operating system. Dr. Bain holds several patents in computer architecture and distributed computing. As a member of the screening committee for the Seattle-based Alliance of Angels, Dr. Bain is actively involved in entrepreneurship and the angel community.
Session II (7:00 - 7:30 PM) - Machine learning for cyber security using Hadoop
As the Internet grows more complex and dynamic, information security organizations face enormous volumes of data, users connected at tremendous speeds, and ever-changing dynamics as new devices and applications enter the network all the time. The only way for organizations to stay ahead is to perform analysis on every piece of data that flows across the network. And they must understand that data in the context of everything else that is happening in the world.
That is why the future of cybersecurity requires a new approach: drawing from the richly layered semantic web to enable machine-to-machine analysis and automated machine learning to bring deep new meaning to network activity and behavior.
This presentation will provide an overview of Narus's integrated approach to cybersecurity. We will describe the unique technical challenges posed by the huge data volumes and the need to support different levels of usage and deployment scenarios. We will also discuss the open source and commercial technologies, including Hadoop, that Narus leverages to address these challenges.
Speaker: Padmanabh Dabke, VP of Analytics and Visualization, Narus Inc.
Bio:
Padmanabh Dabke is a serial innovator and entrepreneur with over 22 years of experience in the defense, financial, and utility sectors. He currently works at Narus, a Boeing company, as the VP of Analytics and Visualization. He is responsible for building Narus's next-generation cybersecurity analytics solutions, leveraging Hadoop, for both government and Fortune 1000 enterprises. Prior to joining Narus, Padmanabh led the team at Social Lair, where he built a SaaS platform offering an array of enterprise social solutions. His other accomplishments include Spigit's Social Innovation platform, the MoneylineConnect family of financial products, and Stringbeans, an open-source portal. Padmanabh has a B.S. from IIT Mumbai, an M.S. from Clemson University, and a Ph.D. from Stanford University.
Session III (7:30 - 8:00 PM) – Computing Capacity Calculator (C3) for Hadoop
A self-service hosted tool, built on Hadoop Vaidya, that estimates compute capacity in terms of number of nodes, given a user's SLA. The tool analyzes job histories from test runs and accounts for several factors when predicting the capacity required to onboard a project: speedup differences between test and production cluster nodes, partial versus full data sets, high-memory jobs requiring multiple slots or containers per map/reduce task, and task re-execution scenarios. The tool has also been extended to compute capacity requirements for Pig scripts, and work is in progress to predict capacity for complex Oozie workflows built around Pig scripts. It has been used to onboard more than 100 projects at Yahoo! and has served more than 2,300 requests from Grid project users.
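As a rough illustration of the kind of arithmetic such a calculator performs, here is a hypothetical Python sketch (not the actual C3 or Hadoop Vaidya logic; all parameter names and the 40 slot-hour example are assumptions) that scales resource usage observed on a test cluster by the factors listed above to estimate a production node count:

import math

def estimate_nodes(test_slot_hours, data_scale_factor, node_speedup,
                   slots_per_task, reexec_factor, slots_per_node, sla_hours):
    """Estimate the number of production nodes needed to meet an SLA.

    test_slot_hours   -- total task slot-hours measured for the test run
    data_scale_factor -- full production data size / test data size
    node_speedup      -- how much faster a production node is than a test node
    slots_per_task    -- slots (or containers) each map/reduce task occupies
                         (>1 for high-memory jobs)
    reexec_factor     -- multiplier for failed/speculative task re-execution (e.g. 1.1)
    slots_per_node    -- concurrent task slots available on one production node
    sla_hours         -- time budget within which the job must finish
    """
    # Scale the measured work up to production data volume, then adjust for
    # node speed, per-task slot usage, and re-execution overhead.
    prod_slot_hours = (test_slot_hours * data_scale_factor / node_speedup
                       * slots_per_task * reexec_factor)
    # Spread that work across nodes so it completes within the SLA.
    return math.ceil(prod_slot_hours / (slots_per_node * sla_hours))

# Example: a test run consumed 40 slot-hours on 5% of the production data.
print(estimate_nodes(test_slot_hours=40, data_scale_factor=20, node_speedup=2.0,
                     slots_per_task=1, reexec_factor=1.1, slots_per_node=8,
                     sla_hours=4))  # -> 14 nodes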
Speaker: Viraj Bhat, Principal Engineer, Yahoo!, Inc.
Bio:
Viraj is a Principal Engineer at Yahoo!, Inc., where he builds, ports, and parallelizes big data applications on Yahoo! Grids based on Hadoop. He built Hadoop Vaidya, a performance diagnostic tool for Hadoop jobs, and is an Apache contributor to Pig, HCatalog, and Hive. He received a Yahoo! award in 2008 for evangelizing Grid technologies and profiling and optimizing Hadoop applications, and the 2012 excellence award for “Gridifying” the Genome project. Viraj Bhat graduated with a Ph.D. from Rutgers University and has been involved in several research projects and publications at PPPL, LBNL, and ORNL.
Yahoo Campus Map:
Detail map (http://photos4.meetupstatic.com/photos/event/2/8/e/d/600_21370477.jpeg)
Location on Wikimapia:
http://www.wikimapia.org/#lat=37.4181633&lon=-122.0250607&z=18&l=0&m=b&search=yahoo
