24. Exploring Complex Types for Analytic Workloads

Details

Agenda

• 17.45: drink, socialize

• 18.00: first talk: Exploring Complex Types for Analytic Workloads

Speaker: Marcel Kornacker is a tech lead at Cloudera, and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. Marcel has a PhD in databases from UC Berkeley.

Abstract: Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data. However, over the past few years, they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.

In this talk, attendees will learn how using nested data structures increases analytic productivity. The well-known TPC-H schema will serve as an example to demonstrate how to simplify analytic workloads with nested schemas. Attendees will also learn best practices for converting flat relational schemas into nested schemas and explore examples of data science-style analysis utilizing Apache Impala’s (incubating) support for complex types in SQL.

• 18.45: eat, drink, socialize (more)

• 19.00: second talk: Application of Locality Sensitive Hashing at Spotify

Speaker: Boxun Zhang is a data scientist at Spotify. His work focuses mainly on measuring and modeling user retention behavior. Before joining Spotify, he obtained his PhD from TU Delft, the Netherlands, where he studied operations and user behavior in BitTorrent.

Abstract: In this talk, I will first introduce locality-sensitive hashing (LSH), a widely used technique for nearest neighbor search, hierarchical clustering, audio fingerprinting, and more. Then I will explain how LSH is used in Spotify's recommender systems and describe our implementation. In addition, I will briefly talk about another popular implementation of LSH that works well with Euclidean distance.

• 19.45: drink, socialize (even more)

Follow SHUG on Twitter (https://twitter.com/shug_meetup)!

Stockholm Hadoop User Group
Spotify Office
Birger Jarlsgatan 61 (11tr) · Stockholm