
Spark in Production and Spark HBase

Hosted By
Future of D. and 2 others

Details

This event will be held at the Hortonworks SF office, an 8-minute walk from the Hilton Union Square, where Spark Summit 2016 is taking place.

We have three speakers from Hortonworks and Bloomberg, all talking about Spark!

Running Spark in Production

As more Spark projects move into production, getting the most out of Spark in a production environment is becoming critical. A production deployment raises many concerns:

1. What is the best way to get performance out of Spark? Configuring the right task parallelism and choosing the best number, size, and cores per executor require a deeper understanding of Spark.
2. How do you secure a Spark deployment? What is the right way to integrate Spark with Kerberos authentication and authorization?
3. How do you run Spark on YARN in the best way? How do you share resources across all YARN workloads efficiently?

The session will summarize the experience gained from many customer engagements, address common pitfalls, and provide concrete ways to make your production deployment of Spark a success.

Vinay Shukla, Hortonworks
Vinay Shukla is the Director of Product Management for Spark & Data Science at Hortonworks. Vinay is a veteran of enterprise software and previously worked as a Product Manager, Developer, and Security Architect. When not in front of a computer, Vinay enjoys being on a yoga mat or on a hiking trail.

Spark and Online Analytics

Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg's Spark Server as a server runtime for online analytics. Our framework implements useful patterns for online query processing and is centered on the idea of "Managed" DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting, where Spark is required to respond to queries under strict SLAs. In this talk, we try to identify the specific areas where slowdowns or failures can cause the largest hits to online query performance, and we discuss potential solutions to address them.

Shubham Chopra, Bloomberg

Shubham is a software engineer at Bloomberg. He has been a long-time user of Hadoop and Spark. He previously contributed to Apache Pig and helped develop the 'illustrate' function. He is currently working on improving Spark's reliability for online analytics.

Bringing HBase Data Efficiently into Spark with DataFrame Support

HBase has become the de facto standard in Hadoop for online access to data, and Spark is now the most popular choice for processing that data. In this session, we walk through the current offering of the HBase-Spark module in HBase, focusing on HBase as an external data source for Spark. It leverages the Spark Catalyst engine and supports complex SQL queries (e.g., joins and aggregations) within the DataFrame abstraction. We first introduce the generic rules for implementing external data sources within the Spark Catalyst engine and the design choices of the HBase-Spark connector. Specifically, we introduce its internal architecture and how data locality, predicate pushdown, partition pruning, and bulkGet are achieved, as well as the support for composite keys, primitive data types, Avro, and customized data types in the current offering. Last but not least, we want to receive feedback and gather input and requests from the community to prioritize our future work.

Zhan Zhang, Hortonworks
Zhan Zhang is a member of technical staff at Hortonworks, where he works on Apache Spark and the Hadoop ecosystem. He received his BS/MS degree from Fudan University in China and his Ph.D. in Computer & Information Science & Engineering from the University of Florida. His research interests include distributed systems and large-scale machine learning platforms, with results published in top journals and conferences such as MobiCom and INFOCOM.

Future of Data: San Francisco