Aug 22, 2012 · 6:15 PM
** See the Data Scientist Seminar Series flyer, which describes all 6 seminars / meet-ups in more detail.
Cleansing the Data
Data scientists must be concerned with data quality. They must continually strive to ensure that data is accurate, complete, and uniform, has an appropriate level of integrity, is valid per organizational thresholds, and adheres to the density levels offered by data providers. Addressing these challenges involves serious trade-offs, though: How much does it cost? How do the solutions affect security, ownership, and distribution? What if some data is inadvertently lost? What tools are available to facilitate data cleansing? Today’s seminar addresses these questions.
Tom Morris is an independent software engineering and product management consultant with strengths in big data, modeling, open source, and intellectual property issues. He is a contributor to multiple open source projects including the Google Refine data cleaning power tool.
• Data Auditing
• Workflow specification
• Workflow execution
• Post-processing and verifying correctness
• Data Validation
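
As a rough illustration of the auditing and validation topics above, here is a minimal Python sketch. It is not taken from the seminar materials; the sample records, the field names (`id`, `email`, `age`), and the age threshold are all hypothetical and stand in for whatever an organization would actually define.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
    {"id": 3, "email": "c@example.com", "age": 41},
]

REQUIRED_FIELDS = ("id", "email", "age")
AGE_RANGE = (0, 120)  # assumed organizational threshold, not from the source


def audit(records):
    """Count missing values, out-of-range ages, and duplicate ids."""
    issues = Counter()
    seen_ids = set()
    for rec in records:
        # Completeness: flag required fields that are absent or empty.
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                issues["missing_" + field] += 1
        # Validity: flag ages outside the assumed threshold.
        age = rec.get("age")
        if isinstance(age, int) and not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            issues["age_out_of_range"] += 1
        # Integrity: flag ids that have already been seen.
        if rec.get("id") in seen_ids:
            issues["duplicate_id"] += 1
        seen_ids.add(rec.get("id"))
    return dict(issues)


def completeness(records):
    """Fraction of required cells that are populated."""
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(
        1
        for rec in records
        for field in REQUIRED_FIELDS
        if rec.get(field) not in (None, "")
    )
    return filled / total if total else 1.0


if __name__ == "__main__":
    print(audit(RECORDS))
    print(round(completeness(RECORDS), 3))
```

An audit like this only reports problems. Deciding whether to repair, drop, or keep a flagged record is part of the cost and data-loss trade-offs the seminar description raises.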