20th Swiss Big Data User Group Meeting


Details
Agenda
18:00 Welcome & Intro
18:10
Title: Data Lake in general and on Azure
Abstract:
Microsoft’s Azure Data Lake PaaS offering suffers slightly from its name, which invites confusion with the method / concept of the same name.
When people talk about Data Lakes, they refer to a collection of methods, functions and concepts. Together these make up a company's analytics, data models, storage, ingestion and processing efforts, which aim to provide not only insight but also advanced analytics and predictive capabilities. To keep up in the modern, digitally disrupted business world, time is a key factor, as are ease of use and the capability to integrate with other tools, storage types, formats and technologies.
This session will give you an overview of a modern analytical approach in comparison to the monolithic way Business Intelligence and Data Warehouse projects used to be run.
Bio:
Patrik Borosch, Technical Solution Professional Data Platform, Microsoft Schweiz
18:55
Title: (Big) Data Science on Hadoop
Abstract:
We are entering the golden age of machine learning, and it’s all about the data. As the quantity of data grows and the costs of compute and storage continue to drop, the opportunity to solve the world’s biggest problems has never been greater. Open Source tools are emerging with advanced machine learning to build self-driving cars, provide better care to newborns in the hospital, stop financial crimes and combat cyber threats. But this is clearly just the beginning. Data scientists strive to take their work beyond simple research, but bridging the gap between the language of the data scientist and that of distributed systems is proving increasingly difficult. Factor in a fast-evolving ecosystem of tools and libraries, many being delivered weekly, and you have a recipe for distraction.
This talk shows how to accelerate analytics projects from exploration to production using Python (incl. pySpark), R (incl. sparklyr) or Scala to process data and build analytical models using the storage capacity and processing power of Hadoop.
Bio:
Guido Oswald, Sales Engineer at Cloudera
19:35
Title: LuceneRDD for (Geospatial) Search and Entity Linkage
Abstract:
In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provides its users with a highly flexible approach to entity linkage. This flexibility is due to Lucene's powerful query language, which can combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
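The design described in the abstract (one inverted index per executor, with per-executor results aggregated on the driver) can be illustrated with a minimal pure-Python sketch. This is not LuceneRDD's API and does not use Spark or Lucene: `PartitionIndex` and `distributed_search` are hypothetical names, plain Python objects stand in for Spark partitions, and raw term-frequency counts stand in for Lucene's scoring.

```python
import heapq
from collections import Counter, defaultdict


class PartitionIndex:
    """Toy inverted index over one partition of (doc_id, text) records.

    Hypothetical stand-in for the per-executor Lucene index.
    """

    def __init__(self, records):
        # term -> Counter mapping doc_id to term frequency
        self.postings = defaultdict(Counter)
        for doc_id, text in records:
            for term in text.lower().split():
                self.postings[term][doc_id] += 1

    def search(self, query, k):
        """Return the partition-local top-k hits as (score, doc_id) pairs."""
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf  # assumption: score = summed term frequency
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


def distributed_search(partitions, query, k):
    """Query every partition, then merge the local top-k lists ("driver" step)."""
    local_hits = [p.search(query, k) for p in partitions]
    return heapq.nlargest(k, (hit for hits in local_hits for hit in hits))


# Illustrative records split across two "executors".
docs = [
    (1, "Swiss Re Zurich"),
    (2, "Swiss Reinsurance Company Zurich"),
    (3, "IBM Research Zurich"),
    (4, "Swisscom Bern"),
]
partitions = [PartitionIndex(docs[:2]), PartitionIndex(docs[2:])]
top_hits = distributed_search(partitions, "swiss re zurich", k=2)
print(top_hits)
```

Because each partition returns only its own top-k hits, the merge step handles at most k results per partition, regardless of how many records each partition holds.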
Bio:
Anastasios Zouzias earned a PhD in Computer Science from the University of Toronto. His research is focused on randomized algorithms for solving linear algebra problems. Currently, Anastasios is working as a Senior Data Scientist at Swiss Re in Zurich. Before that, he worked for Swisscom and IBM Research on various big data and machine learning projects.
