Aug '18-Using dplyr for ETL, because SQL is so last decade (Derek Slone-Zhen)


Details
We look forward to seeing you at our August meetup
Arrive from 5:45pm for a 6pm talk.
Using dplyr for your ETL processes, because SQL is sooo last decade
TALK OUTLINE:
SQL is the go-to language for databases and Enterprise Data Warehouses, including columnar and MPP databases such as Amazon Redshift, and is even finding traction as a top-layer API on Hadoop via Impala, HiveQL, and others.
It's also incredibly verbose, with far too much repetition, and nothing done "for you".
Seriously, I never want to hand-write SQL again. In the past, I've used both Python and Ruby to write the SQL on my behalf.
Tonight we will look at how to replace an ETL-style data transformation that would traditionally be coded in SQL with R and dplyr, taking the drudgery and repetition out of your SQL coding.
We will also include some data quality checks and data profiling, which are essential before any "serious" data work and often help determine the physical data layout.
We will use RStudio and R Notebooks to host the code and make the processes repeatable, with Postgres and Redshift as our databases.
BIO:
Derek Slone-Zhen is a seasoned software engineer, data engineer, and data scientist, with a predilection for the optimisation of code performance, data volumes, and code size.
He is also passionate about the human experience, human perception, and how people assimilate information from visuals and respond to their environment.
He strives hard to deliver less, believing that "more" more often delivers complexity and obfuscation rather than any real value.
He has worked for IBM, Microsoft, Quantium, and Data Republic, and currently works for PwC in their analytics practice in Sydney.
