What we're about

We're a small group that meets once or twice a month to review recent research in machine learning. We started in Northern Virginia, but we've moved permanently online. Each meeting has a specific topic and one or more research papers that we read beforehand (as much as we can) so that we can all contribute to the conversation. We welcome new active participants, and we really value the history of reading we've developed as a group. This group is aimed at people who have enough background in machine learning to read and (mostly) understand research in the field, or who are serious about building that understanding. Everyone who is interested in learning more about machine learning is welcome. We love lively discussions!

Upcoming events (3)

Multi-scale Feature Learning Dynamics: Insights For Double Descent

Multi-scale Feature Learning Dynamics: Insights For Double Descent
https://arxiv.org/pdf/2112.03215.pdf

Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie

Mila

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of “double descent” of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.
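
If you want to play with the idea before we meet, here's a rough numpy sketch of a linear teacher-student setup with two feature scales, so some directions are learned quickly and others slowly. All of the numbers (dimensions, learning rate, noise level) are made up for illustration; this is a toy to tinker with, not the paper's analytical setup.

```python
# Toy linear teacher-student setup with two feature scales (illustrative only;
# all parameter values are guesses, not taken from the paper).
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 100, 1000, 50
fast_scale, slow_scale = 1.0, 0.1   # two feature scales -> two learning speeds
noise_std = 0.5

# Half the input directions are "fast" (large variance), half are "slow".
scales = np.concatenate([np.full(d // 2, fast_scale),
                         np.full(d - d // 2, slow_scale)])

X_train = rng.normal(size=(n_train, d)) * scales
X_test = rng.normal(size=(n_test, d)) * scales

w_teacher = rng.normal(size=d) / np.sqrt(d)
y_train = X_train @ w_teacher + noise_std * rng.normal(size=n_train)
y_test = X_test @ w_teacher                    # noiseless test targets

w = np.zeros(d)                                 # student starts at zero
lr, epochs = 0.01, 20000
test_errors = []
for _ in range(epochs):
    grad = X_train.T @ (X_train @ w - y_train) / n_train
    w -= lr * grad
    test_errors.append(np.mean((X_test @ w - y_test) ** 2))

# With suitable scales and noise, the test error can dip, rise as the fast
# directions overfit the label noise, then descend again as slow directions fit.
print(min(test_errors), test_errors[-1])
```

Try plotting test_errors on a log x-axis and varying the scales and noise; that's a nice warm-up for the discussion.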

Perceptrons are All You Need

Online event

Pay Attention to MLPs
https://arxiv.org/abs/2105.08050

From "The Batch" https://read.deeplearning.ai/the-batch/perceptrons-are-all-you-need/

The paper that introduced the transformer famously declared, “Attention is all you need.” To the contrary, new work shows you may not need transformer-style attention at all.

What’s new: Hanxiao Liu and colleagues at Google Brain developed the gated multi-layer perceptron (gMLP), a simple architecture that performed some language and vision tasks as well as transformers.

Key insight: A transformer processes input sequences using both a vanilla neural network, often called a multi-layer perceptron, and a self-attention mechanism. The vanilla neural network works on relationships between each element within the vector representation of a given token — say, a word in text or pixel in an image — while self-attention learns the relationships between each token in a sequence. However, the vanilla neural network also can do this job if the sequence length is fixed. The authors reassigned attention’s role to the vanilla neural network by fixing the sequence length and adding a gating unit to filter out the least important parts of the sequence.

Why it matters: This model, along with other recent work from Google Brain, bolsters the idea that alternatives based on old-school architectures can approach or exceed the performance of newfangled techniques like self-attention.

We’re thinking: When someone invents a model that does away with attention, we pay attention!
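
If you'd like something concrete to poke at before the discussion, here's a rough PyTorch sketch of a gMLP-style block with a spatial gating unit, based on our reading of the paper. The layer names and sizes are our own choices, not the authors' reference code.

```python
# Rough sketch of a gMLP-style block as we understand it from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """Split channels in half; gate one half with a linear map applied
    across the (fixed-length) sequence dimension of the other half."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)   # start close to an identity gate
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                  # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                       # elementwise gating

class gMLPBlock(nn.Module):
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        residual = x
        x = F.gelu(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        return residual + self.proj_out(x)

# Quick shape check: a batch of 2 sequences of length 128 with 256 channels.
x = torch.randn(2, 128, 256)
print(gMLPBlock(d_model=256, d_ffn=512, seq_len=128)(x).shape)
```

Note how the spatial projection replaces self-attention: it mixes information across token positions, but only works because the sequence length is fixed.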

AIBC: Solving Problems by Searching

Online event

Join us as we work through Russell and Norvig's newest edition over the coming months. We'll start with Chapter 3 and organize ourselves on our Slack channel -- hoping lots of you will be willing to present an example or two to our friendly, virtual group.
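
If you're wondering what presenting an example might look like, here's the flavor we have in mind: a plain breadth-first search over a simplified, one-way slice of the chapter's Romania road map. The graph below is hand-made for illustration.

```python
# Breadth-first search on a small hand-made graph (a simplified, one-directional
# slice of the Romania map from Chapter 3).
from collections import deque

def bfs(start, goal, neighbors):
    """Return a list of states from start to goal, or None if unreachable."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:       # walk parents back to the start
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:          # skip already-reached states
                parent[nxt] = state
                frontier.append(nxt)
    return None

graph = {"Arad": ["Zerind", "Sibiu", "Timisoara"],
         "Sibiu": ["Fagaras", "Rimnicu Vilcea"],
         "Fagaras": ["Bucharest"],
         "Rimnicu Vilcea": ["Pitesti"],
         "Pitesti": ["Bucharest"]}

print(bfs("Arad", "Bucharest", lambda s: graph.get(s, [])))
```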

More details to come. Message me if you'd like to join our Slack group.

Past events (39)

Masked Autoencoders Are Scalable Vision Learners

Online event
