
51st Bay Area Hadoop User Group (HUG) Monthly Meetup

Hosted by Yahoo! HUG O.

Details

Agenda:

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 7:00 - Apache Apex (incubating): Stream Processing Architecture and Applications

7:00 - 7:30 - Apache Kudu (incubating): New Apache Hadoop Storage for Fast Analytics on Fast Data

7:30 - 8:00 - Running Spark Clusters in Containers with Docker

Sessions:

Session 1 (6:30 - 7:00 PM) - Apache Apex (incubating): Stream Processing Architecture and Applications

A presentation on Apache Apex, an enterprise-grade big data analytics platform, and how it is used in production. In this talk you will learn about:

• Architecture highlights: high throughput, low latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling, etc.

• Application development model: a unified approach for real-time and batch use cases

• Tools for ease of use, ease of operability and ease of management

• How customers use Apache Apex in production

Speaker:

Pramod Immaneni is an Apache Apex (incubating) PPMC member and committer, and a senior architect at DataTorrent Inc., where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc., where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Prior to that he was a technical co-founder of a mobile startup, where he was an architect of a dynamic content rendering engine for mobile devices.

Session 2 (7:00 - 7:30 PM) - Apache Kudu (incubating): New Apache Hadoop Storage for Fast Analytics on Fast Data

Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data warehousing workloads.

This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box Apache Spark integration, which fills the gap described above by offering both fast scans and fast random access from a single API.

Speaker:

David Alves is a software engineer at Cloudera working on the Kudu team, and a PhD student at UT Austin. David is a committer at the Apache Software Foundation and has contributed to several open source projects, including Apache Cassandra and Apache Drill.

Session 3 (7:30 - 8:00 PM) - Running Spark Clusters in Containers with Docker

This session will examine the many options data scientists have for running Spark clusters in public and private clouds. We will discuss various environments employing AWS, Mesos, containers, Docker, and BlueData EPIC technologies, and the benefits and challenges of each.

Speaker:

Tom Phelan, Co-founder and Chief Architect - BlueData Inc. Tom has spent the last 25 years as a senior architect, developer, and team lead in the computer software industry in Silicon Valley. Prior to co-founding BlueData, Tom spent 10 years at VMware as a senior architect and team lead in the core R&D Storage and Availability group. Most recently, Tom led one of the key projects – vFlash, focusing on integration of server-based Flash into the vSphere core hypervisor. Prior to VMware, Tom was part of the early team at Silicon Graphics that developed XFS, one of the most successful open source file systems. Earlier in his career, he was a key member of the Stratus team that ported the Unix operating system to their highly available computing platform. Tom received his Computer Science degree from the University of California, Berkeley.
