Feb'19 - You don't need Spark for medium data (Zhuo Jia Dai)

Hosted by Eugene Dubossarsky and 2 others

Details

We look forward to seeing you at our first meetup of 2019, on 13th February:

Arrive from 5:45 pm for a 6 pm talk.

"You don't need Spark for medium data"

TALK OUTLINE:

Medium data is an important segment of datasets sandwiched between small data (datasets that can be manipulated in R or Python/Pandas) and big data (datasets that require distributing data over many computers to be processed effectively, e.g. with Hadoop/Spark). This segment is important because it is difficult to analyse without proper tools, yet it is the predominant form of data in many industries, including banking. The canonical tools for dealing with medium data include Dask, JuliaDB.jl, SAS, and Spark, but there aren't any good options in R. In this talk, we will present disk.frame (https://github.com/xiaodaigh/disk.frame), a new R package for medium data manipulation that is simple, fast, and (hopefully) intuitive to use. We will showcase how to summarise 1.8 billion data points on a laptop within minutes using disk.frame.

BIO:

ZJ has more than 10 years of experience in credit risk modelling, analytics, and data science, and has recently become an independent consultant. He has a maths background and runs the Sydney Competitive Programming meetup and the Julia (JuliaLang) meetup.

Sydney Users of R Forum (SURF)
SMSA
280 Pitt St · Sydney