On 14 January, the first Data Science Northeast Netherlands meetup will take place!
We want to bring together new and experienced developers and researchers in the area of data science in an informal setting.
Talks will be in English. Please sign up so we can estimate how much catering is needed.
18:00 - 19:00 Pizzas, drinks, networking
19:00 - 19:30 Matching product data using Elasticsearch (Dolf Trieschnigg)
20:00 - 20:30 Managing uncertainty in data: the key to effective management of data quality problems (Maurice van Keulen)
20:30 - 21:30 More drinks & networking
Matching product data using Elasticsearch (Dolf Trieschnigg, Mydatafactory)
Product data is everywhere, ranging from size and colour information about products on e-commerce websites to specifications of spare parts in enterprise databases. Finding the desired product in such a database is difficult because of mismatches between product descriptions. Product metadata might be described or spelled differently, the same description might have multiple meanings, or vital information might be missing. In this talk I will discuss the challenges of product search and how we deal with these issues at Mydatafactory. I will talk about how we use and adapt Elasticsearch, an open-source search engine, to deal with some of these problems in the context of industrial product data.
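The talk does not give implementation details, but a flavour of how a search engine like Elasticsearch can tolerate spelling variation in product descriptions is easy to sketch. The query body below is a hypothetical example (the field names "description" and "colour" and the sample text are assumptions, not the actual Mydatafactory schema):

```python
# Minimal sketch of an Elasticsearch query body that tolerates spelling
# variation in product descriptions. Field names ("description", "colour")
# and the search text are hypothetical examples.
query = {
    "query": {
        "bool": {
            "should": [
                # "fuzziness": "AUTO" lets Elasticsearch match terms within
                # a small edit distance, catching typos like "stainles steel".
                {"match": {"description": {"query": "stainless steel bolt M8",
                                           "fuzziness": "AUTO"}}},
                # Exact match on a structured attribute, when it is present.
                {"term": {"colour": "silver"}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```

Such a query would be sent to the `_search` endpoint of an index; documents matching either clause are returned, ranked by relevance.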
Dolf Trieschnigg received an MSc in computer science and a PhD in information retrieval from the University of Twente. In his PhD project he investigated various techniques for dealing with semantic mismatch in searching biomedical literature. He then worked for five years as a postdoctoral researcher on various information retrieval and text mining topics, ranging from federated web search and social media analysis to language identification and keyword extraction. Since April 2015 he has been working as a data scientist at Mydatafactory in Meppel, where he is responsible for the matching and extraction algorithms used in the Mydatafactory data cleansing application.
Managing uncertainty in data: the key to effective management of data quality problems (Maurice van Keulen, University of Twente)
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers on data semantics and data quality. We have developed a generic method based on modeling such problems as uncertainty *in* the data. A recently conceived kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, for example, postpone the resolution of data problems and assess their influence on analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.
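To make the idea of uncertainty *in* the data concrete, here is a toy sketch (my own illustration, not the speakers' actual system): each attribute value carries alternatives with probabilities, and a query returns probability-annotated answers instead of hard yes/no results.

```python
# Toy illustration of uncertain data: a "colour" attribute stores
# alternative values with probabilities instead of a single value.
# This is a hypothetical example, not the UDBMS implementation.
products = [
    {"id": 1, "colour": [("silver", 0.7), ("grey", 0.3)]},
    {"id": 2, "colour": [("silver", 1.0)]},
]

def prob_colour(product, value):
    """Probability that a product's colour equals the given value."""
    return sum(p for v, p in product["colour"] if v == value)

# Query "colour = silver": each answer comes with a confidence,
# so an unresolved data quality problem is visible in the result.
answers = [(p["id"], prob_colour(p, "silver")) for p in products]
# answers == [(1, 0.7), (2, 1.0)]
```

The point is that the conflict in product 1's colour need not be resolved up front; it is carried through the query and surfaces as a confidence in the answer.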
Maurice van Keulen studied computer science at the University of Twente and received his MSc in 1992. His work as a research assistant in the ESPRIT project IMPRESS gradually shifted towards PhD research, which culminated in November 1997 in a PhD thesis entitled “Formal operation definition in object-oriented databases”. From August 1997 until April 1999, he worked at the company Ordina Utopics Front Office B.V. as an information architect. His subsequent assistant professorship at the University of Twente was interrupted for nine months in 2002/2003 for a sabbatical with the DBIS group at the University of Konstanz, Germany. On 1 January 2010, he was promoted to associate professor.
His research centers on data quality and data interoperability. Humans naturally interpret, absorb, combine, and communicate information of possibly poor data quality and with varying structure and presentation. Humans doubt and distrust data, they handle opinions and subjective data, and they maintain information on the likelihood of correctness of certain pieces of data, on where information came from (data provenance), and on the fact that they may be ignorant about certain things. The scientific challenge is to sufficiently mimic these human capabilities to achieve near- or fully autonomous data interoperability. His vision is that the key lies in explicitly dealing with the inherent semantic uncertainty throughout the entire process. He envisions a two-phase process where (1) with minimal effort one strives for a “good enough” initial solution that can be meaningfully used, and (2) one gradually and continuously improves data quality during use. His goal is to develop generic and scalable data management technology with the aforementioned capabilities to support this two-phase data interoperability process.