1. Talk: Parallel processing for natural texts with Apache Spark
Processing massive amounts of text written by humans is a non-trivial task due to the computational complexity of the underlying algorithms.
We present our first insights from using Spark to tackle this task with different parallelization approaches.
Since many observations cannot be reproduced across the boundaries of linguistic units, we have to employ basic NLP techniques to extract the necessary features from texts.
Finally, we show how these features can help to build a text classification system.
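As a flavor of the kind of basic NLP feature extraction the abstract mentions, here is a minimal, hypothetical sketch: tokenizing a text into lowercase word tokens and counting term frequencies, which could then feed a text classifier. The function name and regex are illustrative and not taken from the talk itself.

```python
from collections import Counter
import re

def extract_features(text):
    """Tokenize a text into lowercase word tokens and count term frequencies.

    Hypothetical sketch of basic NLP feature extraction; the talk's actual
    pipeline (and its Spark parallelization) is not shown here.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

features = extract_features("Spark makes parallel text processing practical. Spark scales.")
```

In a Spark setting, a function like this could be mapped over a distributed collection of documents, with the per-document counts aggregated afterwards.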
Andrei Beliankou works as a Data Engineer at Comsysto Reply GmbH. He mainly focuses on data pipelines and automation for massive sensor data processing.
2. Talk: Skewed data - the silent killer of parallelism in Spark
We often tune our Spark/Hadoop environment and our Spark code for performance, but sometimes forget the (changing) structure of the data we are processing. Imbalance in datasets (such as skewed join keys) can lead to massive performance issues. In this talk I will show how to diagnose such issues and walk through some solution strategies.
Dieter Kling has been a Data Engineer at Comsysto Reply for the last three years, working on big data topics such as building data lakes and implementing Spark data pipelines.