Let's get together again in May! We scheduled another "sparky" meetup for you. :-)
In this talk, Frank will discuss the topic of data quality problems when developing data science applications.
Looking forward to seeing you soon.
Have a nice day!
In real-world scenarios, data comes from different sources, may be transformed by complex ETL processes, and is owned by different stakeholders. Before the data can be used for modeling, it has to be cleaned and preprocessed. Oftentimes, data scientists build these steps based on technical and domain-specific assumptions about the data. You will see how explicitly specifying these assumptions and monitoring the actual situation from the first data delivery onwards enables an efficient transition from a prototype to a product. Frank will use Apache Spark, Drunken Data Quality (DDQ) (https://github.com/FRosner/drunken-data-quality), Apache Zeppelin and the ELK stack to give a practical example of this approach.
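The core idea of the approach can be sketched as small, named checks that encode assumptions about the data and run against every delivery. The following minimal Python illustration uses hypothetical helper names and plain dictionaries rather than Spark DataFrames; it is not DDQ's actual API, just the underlying concept:

```python
# Sketch: encode data assumptions as explicit, named checks that
# run against each data delivery. Helper names are illustrative,
# not DDQ's actual API.

def check_not_null(rows, column):
    """Assumption: every row has a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)

def check_unique_key(rows, column):
    """Assumption: `column` is a unique key."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def run_checks(rows, checks):
    """Run each (name, check) pair and report pass/fail per assumption."""
    return {name: check(rows) for name, check in checks}

# A delivery that violates both assumptions:
customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 2, "name": None},  # duplicate id, missing name
]

results = run_checks(customers, [
    ("name is not null", lambda r: check_not_null(r, "name")),
    ("id is a unique key", lambda r: check_unique_key(r, "id")),
])
```

In the setting of the talk, the same assumptions are expressed declaratively against Spark DataFrames with DDQ, and the check results can be logged and monitored over time, which is where the ELK stack comes in.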
Frank Rosner works as a Data Scientist in the Global Data and Analytics Competence Center of Allianz SE. As a data nerd and open source developer, he contributes to open source projects like Apache Spark, Apache Mahout, Spark Notebook and Apache Zeppelin (incubating). His research interests lie in probabilistic topic models and the integration of data science and data architecture. If there is a problem but no tool to solve it, Frank does not hesitate to build one.