April 23, 2014 · 6:30 PM
Here is the schedule for our 3rd meetup. Looking forward to it! And many thanks to Google for hosting us this time.
• Scalable Probabilistic Entity-Topic Linking
Massimiliano Ciaramita, Research Scientist at Google Zürich
Entity linking involves labeling phrases in text with their referent entity Id, e.g., from Wikipedia or Freebase. This task is challenging due to the number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate this problem in terms of probabilistic inference within a topic model (LDA), where each topic is associated with an entity Id. To scale, we propose an efficient Gibbs sampling scheme. This conceptually simple approach achieves state-of-the-art performance on a popular benchmark and can be easily extended to a distributed learning framework.
• Introduction to Speech Recognition
Paul R. Dixon, Research Scientist at Yandex
Speech recognition is a difficult problem that requires knowledge from areas including signal processing, machine learning, algorithms and linguistics. The aim of this presentation is to give an accessible introduction to the algorithms and techniques used for creating state-of-the-art spoken language systems. Recently, a large number of high-quality open-source toolkits for speech recognition have become available. These will be of interest to the larger community because many of the algorithms and tools used for speech recognition can be applied to other machine learning and language processing tasks.
• Randomized Linear Regression: A Brief Overview and Recent Results
Brian McWilliams, Postdoc at ETH Zürich
Linear regression is an important and ubiquitous tool in machine learning, statistics and data analysis. However, standard solvers for ordinary least squares scale poorly to large datasets. Recently, randomized algorithms based on subsampling the dataset have been proposed which recover good approximate solutions much faster than standard LAPACK routines. I will give a brief overview of the ideas behind these techniques.
At the same time, the assumptions underlying linear regression are known to be unrealistic. We introduce a new statistical model which assumes that some of the datapoints we observe are corrupted. We propose a random subsampling algorithm which is able to identify the corrupted datapoints so that they are sampled with low probability. We show an application of our algorithm to the problem of predicting flight delay time.
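As a rough illustration of the subsampling idea mentioned in this abstract (not the speaker's algorithm), here is a minimal sketch that fits a least-squares line on a uniform random subsample of the rows and compares it with the fit on the full dataset. The synthetic data, the helper names `fit_line` and `subsampled_fit`, and the choice of uniform sampling are all assumptions made for this example; the methods covered in the talk may sample rows differently.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def subsampled_fit(points, sample_size, rng):
    """Fit on a uniform random subsample of the rows.

    Uniform sampling is an assumption of this sketch; it is the simplest
    possible scheme, not necessarily the one presented in the talk.
    """
    return fit_line(rng.sample(points, sample_size))


rng = random.Random(0)

# Hypothetical synthetic data: y = 2x + 1 plus Gaussian noise.
data = []
for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))

full = fit_line(data)                    # uses all 20000 rows
approx = subsampled_fit(data, 500, rng)  # uses only 500 rows

print("full fit   (slope, intercept):", full)
print("subsampled (slope, intercept):", approx)
```

On clean data like this, the subsampled fit lands close to the full fit while touching only a small fraction of the rows. With corrupted datapoints, as described in the abstract, the choice of which rows to sample would matter much more.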
- Important: Please update your RSVP if you change your plans! Places are limited to 111 this time.
- Please get in touch if you would like to present something at one of the future meetups, or if you have ideas about topics, speakers, or locations! And we're still urgently searching for companies willing to sponsor a small apéro at future events from May on :-)