Feb'19 - You don't need Spark for medium data (Zhuo Jia Dai)


Details
We look forward to seeing you at our first 2019 meetup, on 13th February:
Arrive from 5:45 pm for a 6 pm talk.
"You don't need Spark for medium data"
TALK OUTLINE:
Medium data is an important segment of datasets sandwiched between small data (datasets that can be manipulated in memory with R or Python/pandas) and big data (datasets that require distributing the data over many computers, e.g. with Hadoop/Spark). This segment matters because it is difficult to analyse without proper tooling, yet it is the predominant form of data in many industries, including banking. The canonical tools for dealing with medium data include Dask, JuliaDB.jl, SAS, and Spark; there are no comparable options in R. In this talk, we will present disk.frame (https://github.com/xiaodaigh/disk.frame), a new R package for medium-data manipulation that is simple, fast, and (hopefully) intuitive to use. We will showcase how to summarise 1.8 billion data points on a laptop within minutes using disk.frame.
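For readers unfamiliar with disk.frame, the following is a minimal sketch of the kind of workflow the talk covers. The file name, column names, and grouping variable are hypothetical, and the exact semantics of the dplyr verbs can vary between disk.frame versions; see the GitHub repository above for the authoritative API.

# Minimal sketch (hypothetical data); not the speaker's exact code.
library(disk.frame)
library(dplyr)

setup_disk.frame()  # start local workers so chunks are processed in parallel

# Convert a large CSV into a disk.frame: the data is split into chunks stored on disk
flights.df <- csv_to_disk.frame("flights.csv", outdir = "flights.df")

# dplyr-style manipulation over the chunks; collect() brings the (small) result into memory.
# Note: depending on the disk.frame version, group_by may aggregate within each chunk,
# in which case a second in-memory aggregation is needed after collect().
result <- flights.df %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()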
BIO:
ZJ has more than 10 years of experience in credit risk modelling, analytics, and data science, and has recently become an independent consultant. He has a maths background and runs the Sydney competitive programming and Julia (JuliaLang) meetups.
