Data Engineering Sustainability!


How can we make our data systems and pipelines sustainable?
We're celebrating Earth Day with a look at sustainable data processing: how we can reduce emissions and save money in our data engineering work!
---
Data Engineers DC is a professional group that meets monthly to discuss all things data engineering, including open data, data gathering, data munging, and the creation, storage, and maintenance of datasets. We combine presentations with hands-on workshops, always seeking to make our data munging lives easier.
---
Location: Excella Consulting - 2300 Wilson Blvd, #600, Arlington, VA, 22209
Come up to the 6th floor! Excella is across the street from the Courthouse metro, and there is paid street parking and a paid garage on N Adams St.
Agenda:
5:30-6:30pm: Food & Networking
6:30-6:45pm: Introductions
6:45-8:00pm: Presentations & Discussion
Talk: Mapping and locating slums in Africa for sustainable development - Akhil Bharadwaj Mateti
This presentation covers a geospatial data science project that applies data engineering and machine learning techniques to identify urban deprivation. Focusing on African cities such as Lagos, Nigeria and Nairobi, Kenya, the project supports the IDEAMAPS initiative, which aims to standardize the mapping of slum areas using open geospatial data.
The talk covers data engineering practices and data pipelines tailored to the geospatial domain. The project integrates diverse datasets, including Sentinel-2 satellite images, covariate features, and contextual features such as population density, building quality, and climate risk indicators. The pipeline uses Python scripts to automate the extraction, resampling, and merging of raster and vector data, and open source tools such as GeoPandas, Rasterio, and GeoWombat handle geospatial file formats and automate preprocessing tasks.
Machine learning plays a key role: supervised learning techniques classify slum versus non-slum areas based on the integrated feature set. The modeling process included handling imbalanced data and evaluating performance with precision, recall, and confusion matrices.
The final outputs were visualized using QGIS, a powerful open-source GIS platform. The integration of predictions with spatial data allowed for effective mapping of classified slum areas in Lagos, demonstrating the project’s real-world application in urban planning and development. These visual outputs help stakeholders better understand patterns of deprivation and the effectiveness of model predictions. Overall, the project exemplifies a robust intersection of geospatial data engineering, machine learning, and spatial visualization for social impact.
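For attendees less familiar with the evaluation metrics named in the abstract above, here is a minimal sketch of a confusion matrix, precision, and recall for a binary slum versus non-slum classifier. It is not the speaker's code: the function names and the label values are made up for illustration, and the labels are deliberately imbalanced to show why accuracy alone can mislead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted slum, actually slum
        elif pred == positive:
            fp += 1  # predicted slum, actually non-slum
        elif truth == positive:
            fn += 1  # predicted non-slum, actually slum
        else:
            tn += 1  # predicted non-slum, actually non-slum
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = slum, 0 = non-slum); only 2 of 10 cells are slum.
y_true = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy example accuracy is 0.80 while precision and recall are both 0.50, which is why imbalanced problems are usually judged on the latter two.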
Talk: Maximizing Concurrency while Minimizing Cost - Mike Pankiewicz
The presentation explores how Apache Airflow can be leveraged in a serverless architecture to efficiently orchestrate data workflows at scale. By maximizing task concurrency and minimizing idle resource usage, teams can significantly reduce operational costs while maintaining performance. It highlights best practices, architecture patterns, and real-world use cases for integrating Airflow with services like AWS Lambda and Google Cloud Functions. The session is designed for data professionals looking to optimize their orchestration strategy in a cloud-native environment.
---
Data Engineers DC is a program of DC2. Learn more at www.dc2.org
