
Seminar 3 - Cleansing the Data

  • Aug 22, 2012 · 6:15 PM

** See the Data Scientist Seminar Series flyer which describes all 6 seminars / meet-ups in more detail.

Cleansing the Data

Data scientists must be concerned with data quality. They must continually strive to ensure that the data is accurate, complete, uniform, and valid per organizational thresholds, that it has appropriate levels of integrity, and that it adheres to the density levels offered by data providers. Addressing these challenges involves some serious trade-offs, though: How much cost is involved? How do these solutions affect security, ownership and distribution? What if some data is inadvertently lost? What tools are available to facilitate data cleansing? These challenges will be addressed in today’s seminar.


Tom Morris is an independent software engineering and product management consultant with strengths in big data, modeling, open source, and intellectual property issues. He is a contributor to multiple open source projects including the Google Refine data cleaning power tool.


Cleansing Process

• Data Auditing

• Workflow spec’ing

• Workflow execution

• Post-processing and verifying correctness
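
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not material from the seminar: the sample records, the field names, and the helper functions `audit` and `execute` are all hypothetical. It audits the data, describes the workflow as an ordered list of steps, executes it on a copy, and then re-audits to verify the result.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"name": " Alice ", "city": "boston"},
    {"name": "Bob", "city": ""},
    {"name": "carol", "city": "Cambridge "},
]

def audit(rows):
    """Data auditing: count empty and untrimmed values per column."""
    issues = Counter()
    for row in rows:
        for column, value in row.items():
            if value.strip() == "":
                issues[(column, "empty")] += 1
            elif value != value.strip():
                issues[(column, "untrimmed")] += 1
    return issues

# Workflow specification: an ordered list of (column, transformation) steps.
workflow = [
    ("name", str.strip),
    ("name", str.title),
    ("city", str.strip),
    ("city", str.title),
]

def execute(rows, steps):
    """Workflow execution: apply each step to a copy of every row."""
    cleaned = [dict(row) for row in rows]
    for column, transform in steps:
        for row in cleaned:
            row[column] = transform(row[column])
    return cleaned

before = audit(records)
cleaned = execute(records, workflow)
# Post-processing and verification: re-run the audit on the cleaned copy.
after = audit(cleaned)
```

Comparing `before` and `after` shows which issues the workflow fixed and which remain (here, the empty city value, which no transformation can fill in). Keeping the workflow as data also leaves a record of every step that was applied.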

Methods Used:

• De-duping

• ETL’ing

• Data Validation

• Tools
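
As a rough illustration of two of the methods listed above, de-duping and data validation, here is a minimal Python sketch. It is an assumption-laden example rather than how any particular tool works: the `company` and `zip` fields, the five-digit rule, and the helpers `normalize`, `dedupe`, and `validate` are all hypothetical.

```python
import re

def normalize(value):
    """Build a comparison key: lowercase, drop punctuation, collapse whitespace."""
    value = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(value.split())

def dedupe(rows, key_field):
    """De-duping: keep the first row seen for each normalized key."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize(row[key_field])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def validate(row):
    """Data validation: return rule violations (an empty list means valid)."""
    errors = []
    if not row["company"].strip():
        errors.append("company is blank")
    if not re.fullmatch(r"\d{5}", row["zip"]):
        errors.append("zip is not five digits")
    return errors

# Hypothetical input rows.
rows = [
    {"company": "Acme, Inc.", "zip": "02139"},
    {"company": "ACME Inc", "zip": "02139"},
    {"company": "Widget Co", "zip": "2139"},
]
unique_rows = dedupe(rows, "company")
problems = {row["company"]: validate(row) for row in unique_rows}
```

Here the two spellings of the same company collapse to one row because they share a normalized key, and the validation pass flags the row whose zip value lost its leading zero.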


  • John V.

    Very cool that Google Refine can link to Freebase.

    August 23, 2012

  • A former member

    I really liked this meetup. The content was interesting. Google Refine is a good tool; I would say it could replace several of the smaller tools I use to clean up data. Tom Morris was informative and engaging.

    August 23, 2012

  • A former member

    It was difficult to see what was being done on screen from the back, so the seminar was not very useful. It would have been a great presentation otherwise.

    August 23, 2012

  • Juan O.

    It was a good example of a Google tool for working with data, but I think it was a bit extensive for the scope of the meeting.

    August 22, 2012

  • Roberta G.

    The topic was interesting but interactive tools aren't very useful in the highly regulated health research that I do. I need a record of everything in case I get audited.

    August 22, 2012

  • Joshua N.

    The Google Refine tool presented at the seminar looks like a great tool for data cleanup and quick analysis.

    August 22, 2012

  • Jim T.

    Great introduction to the capabilities of Google Refine and its use as a tool.

    August 22, 2012

Our Sponsors

  • Vertica

    Food and Drinks

  • Oracle

    Speaker & Food. Thanks Oracle!

  • Tableau Software

    Thank you so much for the food and drinks, Tableau!

  • O'Reilly

    Thank you so much @AudraMontenegro for all the great O'Reilly books!

  • EMC

    Thank you so much EMC for offering us food and drinks!

  • MIT Sloan Data Analytics Club

    Venue at MIT
