Bit Manipulation Hacks in the Wild & Fast clustering & visualization of big data
Details
Hello all,
I would like to welcome you all to the next meetup of KW Intersections on November 13th at 7 pm. This will be our 50th meetup. Incredible how time flies. We will have two talks. The first talk, titled "Bit Manipulation Hacks in the Wild; Time Series and Granger Causality", will be given by Avishalom (Vish) Shalit, Head of Data Science at Kik. The second talk, titled "Fast clustering and visualization of big data", will be given by Daniel Ashlock, Professor of Mathematics at the University of Guelph.
I am looking forward to seeing you all on November 13th.
&&&&&&
Title: Bit Manipulation Hacks in the Wild; Time Series and Granger Causality. (Featuring SQL and math)
Speaker: Avishalom (Vish) Shalit
Head of Data Science
Kik
Bits are back! A favourite interview topic from a decade ago turns out to have been important all along. When your terabytes of data are stored in the cloud, you can push a lot of the data crunching right into the query retrieving your data with MATH(!!!). In this session we will go over some representation and manipulation methods and their uses in time series analyses. We will discuss real-world applications, including calculating the Granger causality between various time series very efficiently; several orders of magnitude better. We will also explore fast alternatives to windowing aggregate functions. Cooler than 0x5F3759DF.
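To give a flavour of the kind of bit-level representation the abstract alludes to, here is a minimal sketch. It is an illustration only, not the speaker's method: it assumes each time series is a sequence of 0/1 event flags, and the helper names `pack_bits` and `lagged_overlap` are hypothetical. It packs a series into a single integer and counts lagged co-occurrences with a shift, an AND, and a bit count. This is a simple building block for lead/lag analysis, not a Granger causality test.

```python
def pack_bits(events):
    """Pack a sequence of 0/1 flags into an int; the flag at index t goes to bit t."""
    value = 0
    for t, flag in enumerate(events):
        if flag:
            value |= 1 << t
    return value


def lagged_overlap(x_bits, y_bits, lag):
    """Count time steps t where x fired at t and y fired at t + lag."""
    # Shifting x left by `lag` aligns x's bit t with y's bit t + lag;
    # the AND keeps only aligned pairs, and counting set bits tallies them.
    return bin((x_bits << lag) & y_bits).count("1")


# Hypothetical toy series: y fires exactly one step after x.
x = pack_bits([1, 0, 1, 0, 0, 1, 0, 0])
y = pack_bits([0, 1, 0, 1, 0, 0, 1, 0])

for lag in range(3):
    print(lag, lagged_overlap(x, y, lag))
```

The whole comparison for one lag is a handful of integer operations, regardless of how many time steps are packed into the word, which is where the appeal of this style of representation comes from.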
&&&&&&
Title: Fast clustering and visualization of big data.
Speaker: Daniel Ashlock
Professor of Mathematics,
University of Guelph
The talk presents an off-line/on-line technique for clustering data sets that scales well to big data sets and can work transparently with high-dimensional data. The technique is based on point packing, a method that arises from the theory of error-correcting codes. A set of data points of the sort being clustered is selected so that the points are well spaced out in the data space; this selection process, point packing in the data space, is the off-line part of the process. These points are called the cluster centers. Clustering is then performed by binning each data item according to which of the selected points it is nearest to. The resulting clustering is of moderate quality but requires only linear time to generate, making it very fast; this is the on-line portion of the process.
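The two phases described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: it assumes Euclidean distance, uses a simple greedy rule for the packing step (keep a point only if it is at least `min_dist` from every point kept so far), and the names `pack_centers`, `assign_clusters`, and `min_dist` are hypothetical.

```python
import math
import random


def pack_centers(data, min_dist):
    """Off-line step: greedily keep points at least min_dist from every kept point."""
    centers = []
    for p in data:
        if all(math.dist(p, c) >= min_dist for c in centers):
            centers.append(p)
    return centers


def assign_clusters(data, centers):
    """On-line step: label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in data
    ]


# Hypothetical toy data: three Gaussian blobs in the plane.
random.seed(0)
data = [
    (random.gauss(cx, 0.3), random.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(50)
]

centers = pack_centers(data, min_dist=2.0)
labels = assign_clusters(data, centers)
print(len(centers), "centers for", len(data), "points")
```

Once the centers are fixed, each new data item costs one pass over the centers, so the on-line step grows linearly with the number of items.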
Clustering of this sort can then be used to generate a simple visualization of the data. Clusters become nodes in a network, with links generated between each cluster center and a small number of its nearest neighbors among the cluster centers. The cluster centers are placed in the plane via non-linear projection or any of a variety of other algorithms. The network then forms a 2D picture of the data. These techniques are based on sophisticated mathematics but are accessible to anyone familiar with simple coding. This technique not only visualizes the data but also highlights density anomalies in the data.
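The network-building step can be sketched as follows. This is a rough illustration, not the speaker's code: it assumes Euclidean distance and undirected links, and the function name `knn_edges` and the example centers are hypothetical. It does not show the placement of nodes in the plane, which the abstract leaves to a non-linear projection or another layout algorithm.

```python
import math


def knn_edges(centers, k):
    """Link each center to its k nearest other centers; return sorted undirected edges."""
    edges = set()
    for i, c in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(c, centers[j]),
        )
        for j in others[:k]:
            # Store each edge with the smaller index first so (i, j) and (j, i) coincide.
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical cluster centers in a 3-dimensional data space.
centers = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.2, 0.0),
    (5.0, 5.0, 5.0),
    (5.0, 6.0, 5.0),
]

edges = knn_edges(centers, k=1)
print(edges)
```

With `k=1` the two distant centers link only to each other, so the network splits into separate components, which is one way well-separated regions of the data can show up in the picture.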
