Data analysis in Python: a survey of pandas and scikit-learn

Python has long been a great platform for munging and analyzing numeric data (numpy, scipy) as well as textual data (NLTK). For general data analysis, however, it left something to be desired. The advent of pandas has made importing and handling data, especially heterogeneous data types, much easier, and has made data munging faster and more efficient.
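As a rough illustration of the heterogeneous-data handling described above, here is a minimal sketch. The CSV content and column names are invented for this example and are not taken from the talk.

```python
import io

import pandas as pd

# Hypothetical CSV content, invented for illustration only.
raw = io.StringIO(
    "site,visit_date,age,weight_kg\n"
    "A,2013-10-01,34,70.5\n"
    "B,2013-10-02,51,\n"
    "A,2013-10-03,29,82.0\n"
)

# One call yields string, datetime, integer, and float columns;
# the empty weight field is read as a missing value (NaN).
df = pd.read_csv(raw, parse_dates=["visit_date"])
print(df.dtypes)

# Mean weight per site; missing values are skipped by default.
mean_weight = df.groupby("site")["weight_kg"].mean()
print(mean_weight)
```

Each column keeps its own type, so dates, counts, and measurements can be filtered and aggregated without manual conversion.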

Data modeling in Python is still not as comprehensive as in other software ecosystems, but great strides have been made toward a very good ecosystem. I'll survey two of the most useful packages: scikit-learn and statsmodels.
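To give a feel for how the two packages differ, here is a small sketch that fits the same straight line with each. The data are synthetic and the variable names are made up for this example; it is not code from the talk.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

# scikit-learn: an estimator object with fit/predict, geared toward prediction.
sk_model = LinearRegression().fit(df[["x"]], df["y"])

# statsmodels: an R-style formula string, geared toward statistical inference.
sm_result = smf.ols("y ~ x", data=df).fit()

print("scikit-learn:", sk_model.intercept_, sk_model.coef_[0])
print("statsmodels: ", sm_result.params["Intercept"], sm_result.params["x"])
```

Both calls solve the same least-squares problem, so the estimates should agree. The statsmodels result additionally exposes standard errors and p-values through `sm_result.summary()`.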

AUTHOR INFO:

Abhijit Dasgupta is a data scientist and biostatistician in the DC metro area. He is a high-level consultant for NIH with several years' experience in bioinformatics as well as over 40 peer-reviewed articles. He also works with multiple local companies on their data science needs. He organizes Statistical Programming DC, a meetup dedicated to statistical programming issues in R, Python and other platforms, and sits on the board of Data Community DC, a local non-profit dedicated to creating and enhancing links between data-oriented individuals, groups and companies in the greater DC area.

(This talk was postponed from the October DCPython meeting.)



  • Tony O.

    For those that found this interesting and want to learn more, Abhijit is teaching a couple Python Data Analysis workshops on Feb. 22nd. You can get more details and sign up at http://bit.ly/19ePwzw

    1 · February 10, 2014

  • Curtis N.

    Excellent meetup, lots of good information for someone new to Python.

    1 · November 11, 2013

  • Abhijit

    Someone at beer after the meetup wondered if pypy might speed up things in pandas. I just looked into it, and numpy, which is the basis for pandas and the pydata ecosystem, is not fully ported to pypy yet. So the short answer I could find is, no, pypy won't be of much help. Please, someone, I'd be so happy if my research is wrong.

    1 · November 10, 2013

    • Peter W.

      Pypy is building a light "clone" of NumPy purely in Python, but this will never be compatible with Pandas, because Pandas uses Cython and the NumPy C-API, which is beyond the scope of what Pypy is trying to do with it. Pandas itself is already quite fast for most things, and Pypy is not really designed to speed up the kinds of things that are slow in Pandas.

      1 · November 11, 2013

    • Chad R.

      You really did an outstanding job with your presentation yesterday. Looking forward to doing more with pandas now.

      2 · November 6, 2013

    • Alex P.

      Linda, also try here http://stackoverflow....

      1 · November 6, 2013

  • Chris G.

    Great introduction to Python data analysis tools...great use of IPython as presentation + demonstration tool.

    3 · November 6, 2013

    • Jeffrey C. J.

      Sorry to interrupt but IPython *like* *like* *like*!!! I gotta get me some of that for this lecture on turning reStructuredText into screenplays that I keep thinking about.

      November 6, 2013

  • Abhijit

    Many people were asking about the IPython Notebook. The presentations by Fernando Perez and colleagues at http://pyvideo.org/video/1652/ipython-in-depth-high-productivity-interactive-a-0 are probably worth looking at. I looked at them last night and realized I could've set things up better.

    1 · November 6, 2013

    • Peter W.

      Anyone who wants to run the notebook, without needing to install anything, can just view it at Wakari.io:
      https://www.wakari.io/...

      (You will need to create a free account, and then you can run IPython Notebook on a free Amazon EC2 instance.)

      1 · November 6, 2013

    • Abhijit

      Peter, Wakari says "A WebSocket connection to could not be established." How do I fix that?

      November 6, 2013

  • Bob K.

    I thought the presentation was superb. I'm definitely going to give pandas/IPython a tryout. I love the meeting place and Maddy's for afterward. All-in-all, a really fun and enlightening evening. Thanks!

    1 · November 6, 2013

  • Jackie K.

    Can you post slides?

    2 · November 5, 2013

    • Iva

      Any webinar software should do http://en.wikipedia.o...

      1 · November 5, 2013

    • Abhijit

      I'll put my talk in a Gist and forward the link to the group

      1 · November 6, 2013

  • Jackie K.

    :-( Ugh. I wanted to attend this, but I don't think I am going to make it to the location in time. Didn't realize it was so far north. :-(

    November 5, 2013

    • Alex C.

      "So far north"??? We generally always meet in Dupont Circle if we can help it :-)

      1 · November 5, 2013

  • Alex C.

    At kramer's bar for about 5 min

    1 · November 5, 2013

  • Heidi M

    I am looking to become familiar with Python, since my (large) employer is using it for several applications. Would this meeting be way too advanced for me?
    My usual tools are SAS and R (less descriptive names?) but I can just about spell statistics.

    1 · November 5, 2013

    • eddie w.

      Without hearing the presentation, I think it should be fine. I'd guess it'd be more of an introduction to the software, rather than an exploration of the finer points of what it does.

      1 · November 5, 2013

  • Jennifer A S.

    Also, is there wi-fi available? I have to multi-task, unfortunately....

    1 · November 5, 2013

  • Andy N.

    Something came up. Won't be able to make this event. Next time.

    November 5, 2013

  • Peter W.

    Abhijit, as long as you are talking about modeling, you might also want to mention Patsy, which is what (I believe) powers some of the linear modeling expressions in statsmodels. https://github.com/pydata/patsy

    And for statistical plotting, the new ggplot-for-Python package, as well as Seaborn:
    https://github.com/yhat/ggplot/
    https://github.com/mwaskom/seaborn

    2 · November 4, 2013

    • Abhijit

      Peter, great suggestions. My enemy here will be time. I'll actually show a bit of the modeling semantics from patsy.

      1 · November 5, 2013

  • Jennifer A S.

    are there snacks? Or do I need to factor in foraging between work and meetup? :D

    1 · November 4, 2013

    • Robert D.

      Typically food isn't offered and I don't see it mentioned in the description. I'll be eating ahead of time just in case. Chipotle is right across the street.

      1 · November 4, 2013

  • A former member

    I actually might not be able to make it due to this being the day of the gubernatorial elections in Virginia, but I thought the group might be interested in this from Sunlight labs:

    https://github.com/sunlightlabs/census?source=c

    1 · November 4, 2013

    • Alex P.

      nice link

      1 · November 4, 2013

    • Peter W.

      That's a great link, Starred!

      1 · November 4, 2013

  • Davis S.

    I think I'll be able to make it! This is my first time, hopefully I'll be able to find my way via metro. Hope to see you all there!

    1 · November 4, 2013

  • Matthew M.

    Would love to be there guys but this conflicts with DevOpsDC's Ansible meetup the same night.

    1 · November 1, 2013
