Analyze US Government Survey Data with R

Statistical Programming DC

Please join us for our January meetup where Anthony Damico will talk about analyzing US Gov't survey data with R.

The United States Government spends over a billion dollars annually to collect and distribute information about its population. While much of this data can be downloaded at no cost, researchers have historically either relied on inflexible online query tools (like American FactFinder) or needed to purchase expensive, proprietary statistical software (like SAS, SUDAAN, or Stata) in order to correctly account for complex sample survey designs. A new website hosts easy-to-use, obsessively documented syntax to analyze government survey data with free, open-source software (the R language and MonetDB) using reproducible techniques.

Since the repository's inception in mid-2012, a new data set has been added every few weeks. It currently includes: Area Resource File, American Community Survey, Basic Standalone Medicare Claims Public Use Files, Behavioral Risk Factor Surveillance System, Consumer Expenditure Survey, Current Population Survey, General Social Survey, Medical Expenditure Panel Survey, National Health and Nutrition Examination Survey, National Health Interview Survey, and National Survey on Drug Use and Health.

Each data set posted follows a basic rubric with these core components:

1) Download Automation - no-changes-necessary programs to download every microdata file from every survey year as an R data file onto your local disk.

2) Publication Replication - match published numbers exactly to show that R produces the same results as other statistical languages.

3) Analysis Examples - fully-commented, easy-to-modify examples of how to load, clean, configure, and analyze the most current data sets available.

This presentation will outline why the R language is well-positioned to become the "lingua statistica" of survey methodologists, how the R survey and sqlsurvey packages work, and how to get started using one of the government survey data sets available in the repository. There will also be a brief introduction to the column-oriented database MonetDB and a new method for communicating with it from the R language. In speed tests on a regular desktop computer, MonetDB analyzed the 67-million-record Medicare Claims Public Use Files in about 120 seconds. For large data sets, MonetDB has been integrated into all survey analysis commands with minimal hassle for the user.
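
For a taste of what "accounting for a complex sample survey design" looks like in R, here is a minimal sketch using the survey package. It is not taken from the repository's syntax: it assumes the survey package is installed and uses the small California school example data (api) that ships with that package, not one of the government data sets listed above. The object names api_design, design_mean, and naive_se are made up for illustration.

```r
# Sketch only: uses the example data bundled with the survey package,
# not one of the government data sets from the repository.
library(survey)

# Load the bundled Academic Performance Index example data.
# apiclus1 is a one-stage cluster sample of school districts.
data(api)

# Declare the complex sample design once:
#   id      = sampling cluster (district number)
#   weights = sampling weight attached to each school
#   fpc     = number of districts in the population
#             (used for the finite population correction)
api_design <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Design-based estimate of the mean api00 score, with a standard
# error that reflects the clustering in the sample.
design_mean <- svymean(~api00, api_design)
print(design_mean)

# For comparison: a naive standard error that wrongly treats the
# schools as a simple random sample.
naive_se <- sd(apiclus1$api00) / sqrt(nrow(apiclus1))
print(naive_se)
```

Once the design object exists, the analysis commands take it in place of a plain data frame, so the weights and clustering are applied on every call. Comparing the two printed standard errors shows how much the design information can matter.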

Anthony Damico is a Statistical Analyst at the Henry J. Kaiser Family Foundation, where he conducts data analysis for Marketplace, Medicare, and Medicaid health care policy reports. He has published in peer-reviewed policy and methods journals using the R, SAS, Stata, and SUDAAN statistical programming languages. Prior to joining the Kaiser Family Foundation, Anthony worked as a survey researcher at the Center for the Study of Services in Washington, D.C. Anthony holds a Bachelor's degree in Mathematics from Oberlin College and a Master's in Health Policy from Johns Hopkins University.

Tentative Agenda:

6:30 - 7:00: Networking and food/drinks

7:00 - 7:10: Introduction and announcements

7:10 - 7:25: Warm-up act: VOLUNTEERS NEEDED

7:25 - 8:30: Anthony Damico discussing Government Survey Data

8:45ish: Off to beeR.