Data Engineering with Airflow, R and Postgres at Education Analytics


Education Analytics (EA) partners with the CORE Districts—a consortium of eight California school districts serving more than 1 million students across roughly 1,500 schools—to provide actionable metrics to district partners and stakeholders. To deliver timely data, our team at EA has built a data pipeline that combines the Python package Apache Airflow, the statistical programming language R, and PostgreSQL databases: Airflow schedules runs of the system and determines which new data to process, R processes the data and calculates metrics, and PostgreSQL stores the results in a custom longitudinal research data warehouse. This data feeds a custom, user-centered dashboard as well as other analytics and reports oriented around continuous improvement for the CORE Districts. The pipeline has become an integral part of the work the CORE Districts do in their improvement communities.
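As a rough sketch of how a pipeline like this can be wired together, the DAG below schedules a nightly run in which R scripts process new data and load metrics into Postgres. The task IDs, script names, and schedule are hypothetical illustrations (Airflow 2.4+ syntax assumed), not EA's actual code:

```python
# Hypothetical Airflow DAG: a nightly run that processes new data with R
# and loads metrics into a Postgres warehouse. Names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="core_metrics_nightly",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each step shells out to Rscript; a nonzero exit code fails the task,
    # so Airflow's retry and alerting machinery handles failures.
    process = BashOperator(
        task_id="process_data",
        bash_command="Rscript process_data.R {{ ds }}",
    )
    load = BashOperator(
        task_id="load_warehouse",
        bash_command="Rscript load_warehouse.R {{ ds }}",
    )
    process >> load  # load the warehouse only after processing succeeds
```

Keeping each R step as its own task means the scheduler, not the R code, owns ordering, retries, and backfills.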

Some of the challenges we faced in building this system include (1) passing information between Python and R for logging, conditional execution, and error handling; (2) automating complex statistical methods, such as causal estimates of school effects on student outcomes and long-term predictive models; and (3) designing robust quality-control processes for automated systems. In this discussion, we share lessons learned about the solutions we have arrived at and preview some challenges we are still working to solve.
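One common way to pass information between a Python scheduler and R, per challenge (1), is to run the R script as a subprocess, treat a nonzero exit code as failure, and have the script print a one-line JSON status that downstream logic can branch on. The helper below is a minimal sketch of that pattern; the function name and the JSON-status convention are our illustration, not EA's actual implementation:

```python
import json
import subprocess


def run_r_task(script_path, args=(), runner="Rscript"):
    """Run an R script as a subprocess and surface its status to the caller.

    Raises RuntimeError on a nonzero exit code, surfacing R's stderr so a
    scheduler task (e.g. in Airflow) can log it and mark the run failed.
    Otherwise returns the parsed JSON status the script prints as its final
    stdout line (a hypothetical convention for conditional execution).
    """
    proc = subprocess.run(
        [runner, str(script_path), *args],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Bubble R's stderr up into the Python-side logs.
        raise RuntimeError(f"{script_path} failed: {proc.stderr.strip()}")
    return json.loads(proc.stdout.strip().splitlines()[-1])
```

Wrapped in an Airflow PythonOperator, the raised exception propagates into the scheduler's retry and alerting machinery, and the returned status dictionary can drive conditional downstream tasks.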

I would like to thank American Family for the food and Cloudera for an after meetup round of drinks.