This is a FREE session hosted at the Strata London Conference. Please note the venue address and the name of the room.
Talks start at 18:45.
(Talk to be announced) by a Sr. Data Scientist @Pivotal
"Particles mining: Turning a mountain of data into a molehill" by Ellie Dobson, Application Engineer at Mathworks
The LHC experiment, based at CERN in Geneva, produces about a petabyte of data per second, yet the front end of the Higgs discovery analysis was performed on a dataset that could fit on most laptops. In this talk I shall walk through the strategies employed by particle physics experiments to search for complex patterns in an initial dataset too big to fit in any data warehouse. The first line of defense is a hardware-based trigger, backed up by a farm of software triggers running in real time. After the trigger reduces the initial input rate, the recorded data is replicated to a worldwide computing grid, where automatic software processing aggregates the binary data into fewer elements. The next stage is to run algorithms, trained on Monte Carlo simulation, to mine the data for promising-looking signatures. Only after these steps are statistical analyses performed, which qualify whether the signatures seen can indeed be attributed to a new physics signal. (A toy sketch of this staged reduction follows the speaker bio.)

Ellie spent most of her early life planning to be a musician but made a rather unexpected U-turn at the age of 18 and, after a brief foray into teaching, ended up reading physics at Oxford University. She was awarded her PhD in particle physics in 2009 and spent many happy days hunting particles in the ATLAS detector as part of the LHC project. After embarking on a Marie Curie fellowship in affiliation with University College London, she recently took the plunge into the private sector and now works as an Application Engineer at MathWorks, specialising in parallel computing and data science.
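As a rough illustration of the staged reduction Ellie describes, here is a minimal Python sketch. All event fields, thresholds, and rates are invented for illustration; real LHC triggers are vastly more sophisticated.

```python
import random

def hardware_trigger(event):
    """Cheap first-stage cut: keep only events with enough deposited energy."""
    return event["total_energy"] > 40.0  # illustrative threshold

def software_trigger(event):
    """More expensive second stage, run only on survivors of the first."""
    return event["n_tracks"] >= 2 and event["max_pt"] > 20.0

def aggregate(event):
    """Reduce the raw readout to a compact summary kept for offline analysis."""
    return {"energy": event["total_energy"], "max_pt": event["max_pt"]}

def pipeline(events):
    # Each stage discards most of its input, mirroring the staged
    # reduction from hardware trigger to aggregated offline data.
    survivors = (e for e in events if hardware_trigger(e))
    survivors = (e for e in survivors if software_trigger(e))
    return [aggregate(e) for e in survivors]

events = [{"total_energy": random.expovariate(1 / 15.0),
           "n_tracks": random.randint(0, 6),
           "max_pt": random.expovariate(1 / 8.0)}
          for _ in range(100_000)]
summary = pipeline(events)
print(f"kept {len(summary)} of {len(events)} events")
```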
"Topological Data Analysis: visual presentation of multidimensional data sets" by Edward Kibardin, Head of Data and Analytics @Base79.
Topological data analysis (TDA) is an unsupervised approach which may revolutionise the way data can be mined and eventually drive the next generation of analytical tools. The idea behind TDA is to "measure" the shape of data and find a compressed combinatorial representation of that shape. These combinatorial representations compress high-dimensional data sets while retaining information about the geometric relationships between data points. TDA can also be used as a very powerful clustering technique. Edward will present a comparison of TDA with other dimensionality reduction algorithms such as PCA, LLE, Isomap, MDS, and Spectral Embedding.
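For readers who want to try the baseline methods Edward mentions, all of them are available in scikit-learn (TDA itself is not, so it is omitted here). A minimal sketch comparing their 2-D embeddings of a toy manifold; the dataset and parameters are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, manifold

# Embed a 3-D "swiss roll" into 2-D with each baseline method and plot
# the embeddings side by side for visual comparison.
X, color = datasets.make_swiss_roll(n_samples=800, random_state=0)

methods = {
    "PCA": decomposition.PCA(n_components=2),
    "LLE": manifold.LocallyLinearEmbedding(n_neighbors=12, n_components=2),
    "Isomap": manifold.Isomap(n_neighbors=12, n_components=2),
    "MDS": manifold.MDS(n_components=2, random_state=0),
    "Spectral": manifold.SpectralEmbedding(n_components=2, random_state=0),
}

fig, axes = plt.subplots(1, len(methods), figsize=(16, 3))
for ax, (name, method) in zip(axes, methods.items()):
    Y = method.fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=color, s=5)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```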
Break, community update, and data science book giveaway.
"Item Similarity Revisited" by Mark Levy, Sr. Data Scientist @Mendeley
The announcement of the Netflix Prize back in 2006 fired the starting pistol on a race to develop methods to predict preference based on collaborative filtering of ratings, a race which is still in progress, at least in academic circles. Netflix themselves commented on their tech blog in 2012 that predicted ratings form only a single, relatively uninfluential input feature to the model which they actually use to generate recommendations. Meanwhile several other industry players, particularly those whose datasets contain only implicit feedback and not ratings, are known still to use simple item similarity methods as the basis of their recommender systems.
Item similarity methods offer fast computation at recommendation time and natural explanations of why particular items are being recommended. Even so, they have not been a focus of academic research, except as benchmarks that can apparently be easily beaten by more complex algorithms, perhaps because item similarity tends to give high-quality recommendations only when carefully tuned for a particular dataset. An interesting paper from 2012 bucked the trend by introducing Sparse Linear Methods (SLIM), and showing that they easily outperformed more complex preference prediction models for top-N recommendation, but at a rather high computational cost compared to traditional item similarity methods when applied to large datasets.
In this talk Mark will present experimental results which suggest that a simple relaxation of the problem constraints solved by SLIM can lead to an item similarity method which outperforms model-based algorithms at reasonable computational cost. He will put this in the context of some reflections on the reality of running large-scale industrial recommender systems, based on his experience at Last.fm and Mendeley, and will also introduce a new open source Python software package implementing this version of SLIM and some other useful methods for working with implicit feedback data.
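For context on what SLIM solves: it learns a sparse item-to-item weight matrix W by regressing each item's column of the user-item matrix on all the other columns, with the diagonal fixed to zero (and, in the original formulation, non-negative weights). Below is a minimal dense sketch using scikit-learn's ElasticNet. The matrix, sizes, and hyperparameters are invented, and dropping non-negativity here is just one possible relaxation, not necessarily the one Mark describes:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
# Hypothetical implicit-feedback matrix: 500 users x 100 items, 1 = interaction.
A = (rng.random((500, 100)) < 0.05).astype(float)

n_items = A.shape[1]
W = np.zeros((n_items, n_items))
for j in range(n_items):
    X = A.copy()
    X[:, j] = 0  # enforce diag(W) = 0: item j may not predict itself
    # positive=True would restore SLIM's non-negativity constraint;
    # leaving it False is the relaxation illustrated here.
    model = ElasticNet(alpha=0.001, l1_ratio=0.5, positive=False, max_iter=500)
    model.fit(X, A[:, j])
    W[:, j] = model.coef_

# Recommend for one user: score items by the user's row times W,
# mask items the user has already seen, and take the top 5.
user = 0
scores = A[user] @ W
scores[A[user] > 0] = -np.inf
print("top-5 items for user 0:", np.argsort(scores)[::-1][:5])
```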
Special thanks to O’Reilly Strata Conference London for hosting our meetup and supporting our community.
Thanks to MongoDB, Cloudera, and Pivotal for supporting our community.