[Talk] Using Dataset Versioning to Address Data Science Challenges + Workshop
Details
Hi Everyone,
We are going to have a talk by a guest speaker, followed by a workshop on a Kaggle problem.
Note: Please bring your laptop for the workshop!
Talk: Using Dataset Versioning to Address Some Practical Data Science Challenges
Outcome: Beginners will learn what data science looks like in practice, and everybody will learn about some best practices and a tool that could be useful.
Summary: Data science in practice is less structured and more complicated than we realize. Data tends to be messy, and the process tends to be iterative, laborious, and error-prone.
In this context, serious decision makers are asking harder questions about the robustness of the process:
(a) Lineage/Auditability: Where did the numbers come from?
(b) Reproducibility/Replicability: Is this an accident? Does it hold now?
(c) Efficiency/Automation: Can you do it faster, cheaper, better?
In this talk, the speaker will discuss the current state of the data science process, some of its failure points, and why it needs to improve to address these new asks.
The talk will then introduce dgit, an open source git wrapper for managing dataset versions, and discuss why dgit was developed and how it can give more structure to the data science process. dgit is still a work in progress, and collaboration from interested people is welcome.
Tool: https://github.com/pingali/dgit
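For anyone who wants a feel for the idea before the talk, here is a rough sketch of what versioning a dataset can mean: hash every file, derive one version id from the hashes, and compare two snapshots to see what changed. This is not dgit's interface or implementation. The function names (file_digest, build_manifest, dataset_version, changed_files) are hypothetical and only illustrate the lineage and reproducibility questions above, using the Python standard library.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's path (relative to data_dir) to its content hash."""
    return {
        p.relative_to(data_dir).as_posix(): file_digest(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def dataset_version(manifest: dict) -> str:
    """Derive a short version id from the whole manifest (hypothetical scheme)."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def changed_files(old: dict, new: dict) -> list:
    """List files that were added, removed, or modified between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


if __name__ == "__main__":
    # Demo on a throwaway directory; the file names are made up.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp)
        (data / "train.csv").write_text("label,pixel0\n1,0\n")
        before = build_manifest(data)
        (data / "train.csv").write_text("label,pixel0\n1,0\n7,0\n")
        after = build_manifest(data)
        print("version before:", dataset_version(before))
        print("version after: ", dataset_version(after))
        print("changed:", changed_files(before, after))
```

Recording the version id next to every result is one simple way to answer "where did the numbers come from?" later. A real tool also has to handle metadata, large files, and history, which this sketch ignores.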
Speaker: Dr. Venkata Pingali is the Founder of Scribble Data, a data science automation company. He was formerly VP, Analytics at FourthLion Technologies, where he led analytics work for large political campaigns and business customers. Before that, he was Founder and CEO of an energy analytics company, eLuminos. He has a BTech from IIT Mumbai and a PhD from the University of Southern California, Los Angeles, in systems.
Workshop: Classify handwritten digits using the famous MNIST data [Kaggle Problem]
Please work on this problem before coming to the meetup so that we can discuss it, or bring your laptop along to follow the workshop.
https://www.kaggle.com/c/digit-recognizer
I will set up a GitHub repository so you can upload your solutions for discussion during the workshop session.
Agenda:
11:00 - 11:10 Networking
11:10 - 11:40 Talk
11:40 - 12:40 Workshop
12:40 - 01:00 Networking
