DMV-RSE meetup #1: Data pipelines


Details
Come join us for the first in-person DMV Research Software Engineer meetup in Washington DC! This is a great opportunity to learn about a new RSE topic, connect with fellow RSE professionals in the area, and make friends!
For our first meeting, we'll be talking about data pipelines. We'll hear from Jennifer Melot (Georgetown's Center for Security and Emerging Technology) and Kristijan Armeni (Johns Hopkins University), who will share how they build and think about pipelines in their work: whether using Apache Airflow and Beam for data retrieval and augmentation to study security implications of emerging technologies, or using the Python scientific stack and high-performance computing (HPC) clusters to run analyses of neuroscience experiments.
See schedule and program information below.
DMV-RSE is a regional group of the US-RSE association.
Schedule
- 5:30-6:00 PM: Check in, hang out, introductions, networking
- 6:00-7:00 PM: Talks, Q&A, informal discussion
- 7:00 PM onwards: Happy hour at a nearby bar (TBD, for those interested)
Abstracts and speaker info
Jennifer Melot. Scaling text ETL and model deployment for data-driven analysis: what problems do we have and how do pipelines solve them (and what problems do we still have?)
Abstract: The Center for Security and Emerging Technology (CSET) is a think tank at Georgetown University that studies security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe how CSET builds data pipelines using Apache Airflow, Apache Beam, and the Google Cloud Platform to automate data retrieval and augmentation, model deployment, and manually curated data integration. We will also discuss some challenges and lessons learned from developing these pipelines in a research environment.
Bio: Jennifer Melot is the Technical Lead for CSET's Emerging Technology Observatory initiative, managing a small engineering team and building data pipelines and web applications.
Kristijan Armeni. Reproducible pipelines in scientific computing: What problems do I have and would pipelines solve them?
Abstract: In this talk, I will outline an analysis workflow in a (fairly typical) neuroimaging experiment. It starts with ingesting the raw data, involves various post-processing stages (with some interactive manual curation), and ends with final visualizations. The workflow relies on the use of high-performance computing (HPC) clusters and more or less standard tools from the Python scientific stack. I will evaluate the elements of this workflow against the idea of a Reproducible Analysis Pipeline and see what elements are currently lacking (e.g. data and pipeline versioning, experiment tracking, packaging/containerization) and why. Finally, I will conclude by reviewing recent initiatives that in my opinion are paving the way towards reproducible pipelines in neuroimaging and scientific computing broadly.
Bio: Kristijan Armeni is an Assistant Research Scientist at the Johns Hopkins University, Department of Psychological and Brain Sciences, working at the intersection of computational cognitive neuroscience, AI, and language.