Python Data Analysis I Workshop

Hosted By
Abhijit

Details

Overview:

Data Community DC (http://datacommunitydc.org/) and District Data Labs (http://www.districtdatalabs.com/) are excited to be offering two Python Data Analysis workshops to kick off 2014.

Python is probably the most popular general-purpose scripting language in use today. It comes with "batteries included" (a large standard library) and has an ecosystem of over 38,000 packages.

Several user-contributed packages have been developed over the years to provide scientific computing capabilities in line with Matlab. These include Numpy, Scipy, Sympy and Matplotlib. They are wonderful packages. However, until the advent of pandas, it was not possible to easily ingest, transform and clean data as you could with domain-specific languages like R and SAS.

Today, the scientific stack of pandas, Numpy, Scipy and Matplotlib, along with IPython's ability to "glue" different languages and allow parallel computing, forms a sound platform for using Python as a primary data analytics tool. Python's capabilities are quickly moving forward to include functionality comparable to specialized data processing languages. It is also attractive for integrating data analytic processing into existing Python-based web frameworks like Django and Flask, as well as other Python-based software development. According to a recent KDNuggets poll (http://www.kdnuggets.com/polls/2013/languages-analytics-data-mining-data-science.html), Python is the second most commonly used computer language for data analysis.

This workshop will provide an introduction to working with Python in a data analysis context. You will learn how to use Python and its packages to read data from different data sources, how to munge and summarize data, and how to visualize data.

The price per attendee for this workshop is $150.

What to Bring:

We will use the Python distribution Anaconda (https://store.continuum.io/cshop/anaconda/) provided by Continuum Analytics. Anaconda is free to use, works on Windows, Mac OS X and Linux, and includes the data analysis stack. Anaconda installation does not require administrative privileges, so it has the lowest barrier to use among the available scientific Python distributions that include pandas, Numpy, Scipy and Matplotlib. Several other packages for data analysis, visualization and scientific computing are also included as part of this distribution. Installation instructions can be found here (http://docs.continuum.io/anaconda/install.html), and all the packages we will use in this workshop are installed by default.

Please bring a laptop with Anaconda already installed. We will provide you with a link to the GitHub site where all code for the workshop will be available, as well as the workshop presentation, which will be provided as an IPython notebook (if you don't know what this is, you will find out at the workshop).

Outline:

Python primer

  • "Hello World!"
  • Python as a calculator
  • Object types
  • List comprehensions
  • Basic data manipulation
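
As a rough illustration of the primer topics above (not taken from the workshop materials), the following sketch touches each item. It assumes Python 3 division semantics; all variable names are hypothetical.

```python
# "Hello World!"
greeting = "Hello World!"
print(greeting)

# Python as a calculator (Python 3 semantics assumed)
total = 2 + 3 * 4      # multiplication binds tighter than addition
ratio = 7 / 2          # true division
quotient = 7 // 2      # floor division

# Object types: list, tuple and dict
values = [1, 2, 3, 4, 5]
point = (38.88, -77.11)
ages = {"ann": 34, "bob": 29}

# List comprehension: squares of the even values only
even_squares = [v ** 2 for v in values if v % 2 == 0]

# Basic data manipulation: key whose value is largest
oldest = max(ages, key=ages.get)
```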

Python tools to use

  • IPython
  • Pandas
  • Numpy

Importing data

  • Text files (tab-delimited, comma-delimited)
  • SQL databases
  • Web pages
  • Idiosyncratic data
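
A minimal sketch of the first two import routes listed above, assuming pandas is installed (as it is in Anaconda). The sample data, table name and column names are invented for illustration; an in-memory SQLite database stands in for a real SQL server, and web pages are not covered here.

```python
import io
import sqlite3

import pandas as pd

# Comma-delimited text; io.StringIO stands in for a file on disk
csv_text = "name,score\nann,90\nbob,85\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# Tab-delimited text uses the same reader with a different separator
tsv_text = "name\tscore\nann\t90\nbob\t85\n"
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# SQL database: build a throwaway in-memory table, then query it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)", [("ann", 90), ("bob", 85)]
)
conn.commit()
sql_df = pd.read_sql_query("SELECT name, score FROM scores", conn)
conn.close()
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by a file path.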

Storing data in Python

  • Pandas (Series and DataFrame)
  • Numpy arrays
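
To give a flavor of the containers named above, here is a small sketch with made-up values. It assumes a reasonably recent pandas (for `to_numpy`), which is newer than the versions available when this workshop was announced.

```python
import numpy as np
import pandas as pd

# Numpy array: homogeneous, n-dimensional, supports vectorized math
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
col_means = arr.mean(axis=0)   # mean of each column

# pandas Series: a one-dimensional array with labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# pandas DataFrame: a table whose columns may have different types
df = pd.DataFrame({"city": ["Arlington", "Washington"], "visits": [3, 5]})

# A DataFrame column can be handed back to Numpy when needed
visits_array = df["visits"].to_numpy()
```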

Cleaning data

  • Missing data
  • Data summaries
  • Data imputation
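
One possible way to approach the cleaning steps above with pandas is sketched below. The data are invented, and mean imputation is used only because it is the simplest choice to show, not because it is necessarily what the workshop teaches.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "income": [50.0, 60.0, np.nan],
})

# Missing data: count the missing values in each column
missing_counts = df.isna().sum()

# Data summaries: count, mean, std, quartiles (NaN values are skipped)
summary = df.describe()

# Data imputation: fill each gap with that column's mean
imputed = df.fillna(df.mean())
```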

Data manipulation and munging

  • Merging datasets
  • Subsetting data
  • Grouping and summarizing
  • Split-apply-combine
  • Pivot tables
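
The operations listed above map fairly directly onto pandas methods. The sketch below uses two hypothetical tables (`sales` and `stores`) to show one way each might look.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 200, 250],
})
stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Merging datasets: attach each store's region to its sales rows
merged = sales.merge(stores, on="store", how="left")

# Subsetting data: keep rows that satisfy a condition
east = merged[merged["region"] == "East"]

# Grouping and summarizing (split-apply-combine): split by region,
# apply a sum to revenue, combine the results into one Series
by_region = merged.groupby("region")["revenue"].sum()

# Pivot table: stores as rows, quarters as columns
pivot = merged.pivot_table(
    index="store", columns="quarter", values="revenue", aggfunc="sum"
)
```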

Basic graphics

  • Histograms and bar plots
  • 2D plots
  • Visualizing bivariate patterns
  • Boxplots
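
A hedged sketch of some of the plot types above using Matplotlib directly. The data are randomly generated for illustration, and the non-interactive "Agg" backend is selected only so the example runs without a display; in an IPython notebook that line would normally be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram of a single variable
axes[0].hist(x, bins=20)
axes[0].set_title("Histogram of x")

# Scatter plot to visualize a bivariate pattern
axes[1].scatter(x, y, s=10)
axes[1].set_title("y versus x")

# Boxplots of both variables side by side
axes[2].boxplot([x, y])
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(["x", "y"])
axes[2].set_title("Boxplots")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice would write the figure to disk.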

Instructor:
Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years of experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting.

He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly R Users DC).

Other Info:

District Data Labs (http://www.districtdatalabs.com/) is composed of several Data Community DC members focused on providing data science educational offerings to help others in our community enhance and expand their existing technical and analytical skills.

For those who are driving, the best parking option we have found in the area is the garage behind the SunTrust building on the southeast corner of Glebe Rd. and Fairfax Dr.

Data Community DC (DC2)
Metro Offices - Ballston Office Center
4601 N Fairfax Drive · Arlington, VA