Building a Production Data Lake in the Cloud using Apache Spark


Details
Summary (Starts 6:30PM)
Apache Spark is a distributed processing engine that enterprises can use for large-scale data migration. This talk covers key concepts from a recent production-level ETL (extract, transform, load) effort, which used Java Spark connectors to convert millions of rows of legacy mainframe data to MongoDB collections and Cassandra tables. Additional topics include using Spring in a Spark project for dependency injection, using the Spark REST API for job submission and job monitoring, and thoughts on using Spark to build a data lake in S3 using Parquet.
Bio
Bryan Der has taken a path less traveled into software engineering. He studied biochemistry and molecular biology as an undergraduate at the University of Richmond, then earned a PhD in biochemistry, biophysics, and computational protein design at UNC Chapel Hill. His postdoctoral work at MIT focused on the automated design of genetically encoded Boolean logic circuits. Last year, Bryan joined Notch and has turned his focus to data engineering, data science, and casual chess.
Happy Hour
Socialize with the Richmond Spark community afterwards at Station 2, which is a three-block walk from Notch! (This is not a sponsored happy hour.)
Parking Information
https://secure.meetupstatic.com/photos/event/6/4/8/600_464881608.jpeg
Street parking is the best option for this location. However, if you are not comfortable with street parking, you may park at the Farm Fresh and walk a few blocks to Notch. Afterwards, we will walk to Station 2 for Happy Hour.
Interactive Map (https://drive.google.com/open?id=15jFYzFYtHzcarbDnyen4I_Sx_qU&usp=sharing)
