SF Spark+SF Hadoop Joint Meetup at Strata: Apache Spot and Kudu


Details
Strata + Hadoop World | March 13–16, 2017 | San Jose, CA "One of the most valuable events to advance my career."
Strata + Hadoop World is a rich learning experience at the intersection of data science and business. Thousands of innovators, leaders, and practitioners gather to develop new skills, share best practices, and discover how tools and technologies are evolving to meet new challenges. Find out how big data, machine learning, and analytics are changing not only business, but society itself at Strata + Hadoop World. Save 20% on most passes with discount code PCSFSPARK. Check out the program and register by midnight PT on January 20 to save even more.
http://www.oreilly.com/pub/cpc/41876
The meetup will be held at the conference venue but is free and open to the public, with food and drinks provided by the gracious conference hosts.
Talk #1 A Community Approach to Fighting Cyber Threats
Apache Spot is a community-driven cybersecurity project, built from the ground up, to bring advanced analytics to all IT telemetry data on an open, scalable platform. Spot expedites threat detection, investigation, and remediation via machine learning and consolidates all enterprise security data into a comprehensive IT telemetry hub based on open data models. Spot’s scalability and machine learning capabilities support an ecosystem of ML-based applications that can run simultaneously on a single, shared, enriched data set to provide organizations with maximum analytic flexibility. Spot harnesses a diverse community of expertise from Centrify, Cloudera, Cybraics, Endgame, Intel, Jask, RICH IT, Semantix, Streamsets, and Webroot.
Mark Grover is a committer on Apache Bigtop and a committer and PMC member on Apache Sentry (incubating). He has contributed code to Apache Hadoop, Apache Hive, Apache Spark, Apache Pig, Apache Sqoop and Apache Flume. He is a co-author of O'Reilly's Hadoop Application Architectures title and has authored a chapter in O'Reilly's Programming Hive title. He is a software engineer at Cloudera working on Spark.
Talk #2 Up and Running with Apache Kudu
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu (http://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/), the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
David Alves is a software engineer at Cloudera and a PhD student at UT Austin. He is a committer at the Apache Software Foundation and has previously contributed to several open source projects, such as Apache Cassandra and Apache Drill.
