Data is MESSY. By common estimates, about 80% of data science work goes into preparing, cleaning, wrangling and munging data. Curating the right data sources is only one part of the operation. How NAs and other missing elements are handled has a big impact on algorithm accuracy. Factor expansion, feature engineering and derived fields occupy a good portion of the data science life cycle. In this talk we review tools, techniques and common methods for data munging on small and big data. sed, awk, Python, R, data.table and plyr are only some of the tools. word2vec, deep features and other new ML methods are transforming this space.
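To make two of the steps named above concrete, here is a minimal Python sketch of NA handling (mean imputation) followed by factor expansion (one-hot encoding). It is an illustration only, not material from the talk: the toy records, the field names and the helper functions `impute_mean` and `expand_factor` are all hypothetical, and mean imputation is just one of several possible strategies for missing values.

```python
from statistics import mean

# Hypothetical toy records; None stands in for a missing value (NA).
rows = [
    {"age": 34, "city": "London"},
    {"age": None, "city": "Paris"},
    {"age": 46, "city": None},
]

def impute_mean(rows, field):
    """Replace missing numeric values in `field` with the mean of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def expand_factor(rows, field, missing_label="NA"):
    """One-hot encode a categorical field, treating missing as its own level."""
    levels = sorted(
        {r[field] if r[field] is not None else missing_label for r in rows}
    )
    out = []
    for r in rows:
        value = r[field] if r[field] is not None else missing_label
        # Copy every other field, then add one 0/1 indicator per level.
        expanded = {k: v for k, v in r.items() if k != field}
        for level in levels:
            expanded[f"{field}_{level}"] = int(value == level)
        out.append(expanded)
    return out

# Impute the numeric column first, then expand the categorical one.
clean = expand_factor(impute_mean(rows, "age"), "city")
```

Treating a missing category as its own level keeps the row rather than dropping it. Whether that is appropriate depends on why the value is missing, which is part of why NA handling affects accuracy.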
Speaker: Matt Dowle is the main author of the data.table package in R. He has worked for some of the world’s largest financial organizations: Lehman Brothers, Salomon Brothers, Citigroup, Concordia Advisors and Winton Capital. He is particularly pleased that data.table is also used outside finance, for example in genomics, where large, ordered datasets are also studied.