What we're about
Upcoming events (3)
Multi-scale Feature Learning Dynamics: Insights For Double Descent
Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie
A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of “double descent” of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent, where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonous transitions, or descents, as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.
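For anyone who wants to experiment before the session, here is a minimal sketch of a linear teacher-student setup with features at two scales, trained by plain gradient descent. It is not the authors' code or their closed-form analysis. The function name `simulate`, the two-group feature split, and every default value (dimensions, `slow_scale`, `noise_std`, `lr`) are assumptions made for illustration, and whether a second descent shows up in the test curve depends on the values chosen.

```python
import numpy as np


def simulate(n_train=100, d_fast=50, d_slow=50, slow_scale=0.1,
             noise_std=0.5, lr=0.05, steps=5000, seed=0):
    """Train a linear student on noisy labels from a linear teacher.

    All names and defaults are illustrative assumptions, not values from the paper.
    Returns the per-step training error and excess test error.
    """
    rng = np.random.default_rng(seed)
    d = d_fast + d_slow

    # Two feature groups: unit-scale ("fast") and small-scale ("slow") inputs.
    scales = np.concatenate([np.ones(d_fast), slow_scale * np.ones(d_slow)])

    # Hypothetical teacher weights and a noisy training set.
    teacher = rng.standard_normal(d)
    X = rng.standard_normal((n_train, d)) * scales
    y = X @ teacher + noise_std * rng.standard_normal(n_train)

    w = np.zeros(d)  # student starts at zero
    train_err, test_err = [], []
    for _ in range(steps):
        residual = X @ w - y
        train_err.append(np.mean(residual ** 2))
        # Excess test error for fresh inputs x ~ N(0, diag(scales**2)):
        # E[(x . (w - teacher))^2] = sum_i scales_i^2 * (w_i - teacher_i)^2
        test_err.append(np.sum((scales * (w - teacher)) ** 2))
        # Gradient step on the half mean squared error.
        w -= lr * X.T @ residual / n_train

    return np.array(train_err), np.array(test_err)


if __name__ == "__main__":
    train_err, test_err = simulate()
    print("step with lowest test error:", int(np.argmin(test_err)))
```

Plotting `test_err` against the step index on a log-scaled x-axis is a convenient way to look for non-monotonic behavior; varying `slow_scale` and `noise_std` changes how far apart the two learning time scales sit.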
Pay Attention to MLPs
From "The Batch" https://read.deeplearning.ai/the-batch/perceptrons-are-all-you-need/
The paper that introduced the transformer famously declared, “Attention is all you need.” To the contrary, new work shows you may not need transformer-style attention at all.
What’s new: Hanxiao Liu and colleagues at Google Brain developed the gated multi-layer perceptron (gMLP), a simple architecture that performed some language and vision tasks as well as transformers.
Key insight: A transformer processes input sequences using both a vanilla neural network, often called a multi-layer perceptron, and a self-attention mechanism. The vanilla neural network works on relationships among the elements within the vector representation of a given token (say, a word in text or a pixel in an image), while self-attention learns the relationships among the tokens in a sequence. However, the vanilla neural network can also do this job if the sequence length is fixed. The authors reassigned attention’s role to the vanilla neural network by fixing the sequence length and adding a gating unit to filter out the least important parts of the sequence.
Why it matters: This model, along with other recent work from Google Brain, bolsters the idea that alternatives based on old-school architectures can approach or exceed the performance of newfangled techniques like self-attention.
We’re thinking: When someone invents a model that does away with attention, we pay attention!
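To make the key insight above concrete, here is a rough sketch of a gating unit that mixes information across a fixed-length sequence with an ordinary weight matrix instead of attention. It loosely follows the idea described above and is not the authors' implementation. The names `spatial_gating`, `w_seq`, and `b_seq`, the choice to split the channels in half, and the normalization step are all assumptions for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def spatial_gating(z, w_seq, b_seq):
    """Gate one half of the channels with a cross-token projection of the other half.

    z:     (seq_len, channels) token representations, channels must be even
    w_seq: (seq_len, seq_len) weights that mix information across tokens
    b_seq: (seq_len,) per-token bias
    The sequence length is baked into w_seq, which is why it must be fixed.
    """
    u, v = np.split(z, 2, axis=-1)       # two halves along the channel axis
    v = layer_norm(v)
    v = w_seq @ v + b_seq[:, None]       # each output token sees every input token
    return u * v                         # elementwise gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, channels = 8, 6
    z = rng.standard_normal((seq_len, channels))
    # Small random mixing weights and a bias of ones (illustrative initialization).
    w_seq = 0.01 * rng.standard_normal((seq_len, seq_len))
    out = spatial_gating(z, w_seq, np.ones(seq_len))
    print(out.shape)
```

With `w_seq` set to zeros and `b_seq` set to ones, the gate passes the first half of the channels through unchanged, so the unit starts out as a plain per-token layer and learns any cross-token mixing during training.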
Join us as we work through the newest edition of Russell and Norvig's Artificial Intelligence: A Modern Approach over the coming months. We'll start with Chapter 3 and organize ourselves on our Slack channel. We hope many of you will be willing to present an example or two to our friendly virtual group.
More details to come. Message me if you'd like to join our Slack group.