What we're about

We know that reinforcement learning (RL) scales when stitched together with deep nets. This is exciting for anyone keen on understanding and building autonomous agents that solve hard real-world problems.

But so far we have mostly seen early results driven by excellent engineering, with only a few fundamental revisits to RL itself, e.g. the distributional perspective on RL. Encouragingly, the latter has shown even greater promise.

We need more revisits like these. We need to take a step back and examine the building blocks: RL, deep nets, training regimens, and the nature of dynamical systems, to see how and when they complement each other. This may lead to novel learning schemes and substrates.

In so doing, we will find ways to make RL work orders of magnitude better than it does today. We want RL to be data efficient, to generalise, and to go wild.

This meetup is a grad-school-style reading group for those interested in this exercise. One volunteer leads the discussion of a recent piece of work, while the others chip in to take it apart.

The discussion is at a deep technical level. We like rich intuitive explanations, math, and code. We also like applications on occasion, but we are more keen on generality, basic methods, and findings; as a thought exercise, and when relevant, we entertain applications of the work under discussion. Beginners are welcome, especially students and hobbyists who want to play a meaningful role in understanding and advancing the state of the art.

Do join our Slack workspace, where we keep the discussion going and plan our future meetings:

https://join.slack.com/t/rlrg/shared_invite/enQtNDkyNjM1NTE3NjMyLWVhZWUyNzUxZjdmYjEzOTlmMDBjMWFmOWI3YTNkNWZhM2ZmYmE5ZmJjMWQ3ODdlNWI5ZTkwYTMyZjcyYzkyZmM

Upcoming events (3)

RUDDER: Return Decomposition for Delayed Rewards

Location: TBD

Assigning credit to action sequences when rewards are delayed is a fundamental challenge for RL algorithms, and very long delays can make learning with standard algorithms prohibitively slow, if it happens at all. The problem can be tackled with smarter exploration strategies, by bootstrapping on credit/rewards extracted from demonstrations, by decomposing the problem around bottlenecks, and so on.

We will have Rasmus Munter lead us through an alternative approach, presented in the paper below. It calls upon ideas such as attention and interpretability, currently being explored by the deep learning community, to decompose the delayed reward (in this case the return) predicted by an LSTM model into contributions along the state-action sequence experienced by the agent. Rasmus is currently studying AI and applied mathematics at NTNU.

Paper details: https://arxiv.org/abs/1806.07857

"RUDDER: Return Decomposition for Delayed Rewards", by Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Sepp Hochreiter

Abstract: "We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, biases of temporal difference (TD) estimates are proved to be corrected only exponentially slowly in the number of delay steps. Furthermore, variances of Monte Carlo (MC) estimates are proved to increase the variance of other estimates, the number of which can exponentially grow in the number of delay steps. We introduce RUDDER, a return decomposition method, which creates a new MDP with same optimal policies as the original MDP but with redistributed rewards that have largely reduced delays. If the return decomposition is optimal, then the new MDP does not have delayed rewards and TD estimates are unbiased. In this case, the rewards track Q-values so that the future expected reward is always zero. We experimentally confirm our theoretical results on bias and variance of TD and MC estimates. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and MC Tree Search (MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, Dueling DDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venture in only a fraction of the learning time. RUDDER considerably improves the state-of-the-art on the delayed reward Atari game Bowling in much less learning time. Source code is available at https://github.com/ml-jku/baselines-rudder and demonstration videos at https://goo.gl/EQerZV."
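To make the core trick concrete ahead of the session, here is a minimal, hypothetical sketch (in PyTorch, not the authors' code) of return decomposition: an LSTM is trained so that its final prediction matches the episode return, and the step-wise differences of its predictions are used as redistributed rewards. The module names, shapes, and network sizes below are illustrative assumptions.

```python
# Minimal sketch of RUDDER-style reward redistribution (illustrative, not the authors' code):
# an LSTM predicts the episode return from the state-action sequence, and the differences
# of its per-step predictions become the redistributed rewards.
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs, acts):
        # obs: (batch, T, obs_dim), acts: (batch, T, act_dim)
        h, _ = self.lstm(torch.cat([obs, acts], dim=-1))
        return self.head(h).squeeze(-1)            # per-step return predictions, (batch, T)

def redistribute_rewards(model, obs, acts):
    """Credit for each step = change in the predicted return at that step."""
    with torch.no_grad():
        g = model(obs, acts)                       # (batch, T)
    first = g[:, :1]                               # credit assigned to the first step
    diffs = g[:, 1:] - g[:, :-1]                   # credit for each later step
    return torch.cat([first, diffs], dim=1)        # redistributed rewards, same shape as g

def loss_fn(model, obs, acts, episode_return):
    # Training target: the final prediction should match the observed episode return.
    g = model(obs, acts)
    return ((g[:, -1] - episode_return) ** 2).mean()
```

The replay buffer, the contribution analysis used in the paper, and the integration with an RL algorithm (as in the linked source code) are all omitted here.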

Distributional reinforcement learning with deep neural nets

A risk-taking agent may want to bet on actions with a chance of a high but uncertain return rather than a low but more certain one, and vice versa for a risk-averse agent. Enabling such behaviour by modelling the aleatoric uncertainty in action values sounds very reasonable indeed: if we can model the full distribution of returns in a stochastic environment, an agent can behave in accordance with its preferences towards risk. Recent work in this direction models these distributions with deep neural nets. Beyond building risk-aware agents, the work argues for the value of learning return distributions in and of itself, i.e. even when risk sensitivity is not a concern. Algorithms resulting from the distributional approach have shown orders of magnitude better sample efficiency than a host of other deep RL methods that do not take distributions into account.

We will have Olav Nymoen lead us through a few of these works (listed below) as they evolved over the past year. Olav is currently finishing his studies in AI at NTNU, while also working as a data scientist at Finn.no. He was one of the early employees at Comoyo/Telenor Digital, where he built browser-based operating systems for mobile phones and low-powered IoT networks.

"A Distributional Perspective on Reinforcement Learning"
Marc G. Bellemare, Will Dabney, Rémi Munos
https://arxiv.org/abs/1707.06887
Related talk: https://vimeo.com/235922311

"Distributional Reinforcement Learning with Quantile Regression"
Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos
https://arxiv.org/abs/1710.10044
Related talk: https://www.youtube.com/watch?v=LzIWBb2FhZU&list=PLe5rNUydzV9Q01vWCP9BV7NhJG3j7mz62&index=14&t=0s

"Implicit Quantile Networks for Distributional Reinforcement Learning"
Will Dabney, Georg Ostrovski, David Silver, Rémi Munos
https://arxiv.org/abs/1806.06923
4th talk in this video: https://www.facebook.com/icml.imls/videos/429611280886726/?t=2499
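As a warm-up for the quantile-based papers, the sketch below shows one way such a quantile-regression Huber loss can be written (an illustrative assumption, not the papers' code): the network outputs N return quantiles per action instead of a single expected value, and an asymmetric weight pulls each output towards its quantile fraction.

```python
# Minimal sketch of a quantile-regression Huber loss in the spirit of QR-DQN / IQN
# (illustrative only; names and shapes are assumptions).
import torch

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """
    pred_quantiles: (batch, N) predicted return quantiles for the chosen action
    target_samples: (batch, M) samples/quantiles of the Bellman target distribution
    """
    batch, n = pred_quantiles.shape
    tau = (torch.arange(n, dtype=pred_quantiles.dtype) + 0.5) / n   # quantile midpoints
    # Pairwise TD errors u[b, i, j] = target_j - predicted_quantile_i
    u = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)   # (batch, N, M)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{u < 0}| pulls each output toward its quantile fraction
    weight = (tau.view(1, n, 1) - (u.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean(dim=2).sum(dim=1).mean()

# Example: 32 transitions, 51 predicted quantiles, 51 target samples
loss = quantile_huber_loss(torch.randn(32, 51), torch.randn(32, 51))
```

In a full agent these quantile predictions replace the scalar Q-values, and the greedy action is chosen by the mean of the predicted quantiles, or by a risk-adjusted statistic of them.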

Learning to search with MCTSnets

Location: TBD

A key direction in deep reinforcement learning (and in deep learning more generally) is to empower the function approximator to express more of the algorithmic complexity that would otherwise be hand-crafted (see http://www.deeplearningindaba.com/uploads/1/0/2/6/102657286/principles_of_deep_rl.pdf). We can thus let an approximator learn the rules that enable learning and/or planning, in addition to using them to solve tasks, thereby automating part of the algorithm design process.

We will have Truls Edvard Stokke lead us through the following paper from DeepMind, presented at ICML 2018, which exemplifies this idea in the domain of game tree search, automating to some degree how knowledge about the game is efficiently acquired. Truls is currently studying AI, applied mathematics, and physics at NTNU.

Paper title: "Learning to search with MCTSnets", Arthur Guez, Theophane Weber, et al.

Abstract: "Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines."

Paper: http://proceedings.mlr.press/v80/guez18a.html
Talk from ICML 2018 (from 23:20 onwards): https://www.facebook.com/icml.imls/videos/429607650887089/?t=1402
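For orientation before the session, here is a highly simplified, hypothetical sketch of the shape of an MCTSnet-style model in PyTorch. It keeps a vector embedding per visited state, runs fixed-depth simulations with a learned simulation policy, backs the leaf embedding up through a learned backup network, and reads out action logits from the root embedding. It deliberately ignores the persistent search tree and the paper's scheme for training the simulation decisions; env_model, module names, and sizes are assumptions.

```python
# Highly simplified sketch of the MCTSnet idea (illustrative only): embed states,
# simulate with a learned policy, back up embeddings with a learned network,
# and read out action logits from the root embedding.
import torch
import torch.nn as nn

class MCTSnetSketch(nn.Module):
    def __init__(self, obs_dim, n_actions, emb=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, emb), nn.ReLU())  # state embedding
        self.policy = nn.Linear(emb, n_actions)                         # simulation policy
        self.backup = nn.GRUCell(emb + n_actions, emb)                  # learned back-up
        self.readout = nn.Linear(emb, n_actions)                        # root readout

    def forward(self, obs, env_model, n_simulations=4, depth=3):
        # obs: (batch, obs_dim); env_model(state, action) -> next state (assumed simulator)
        root = self.embed(obs)
        for _ in range(n_simulations):
            h, state, path = root, obs, []
            for _ in range(depth):                                      # expand a simulation
                logits = self.policy(h)
                action = torch.distributions.Categorical(logits=logits).sample()
                state = env_model(state, action)
                path.append((h, action))
                h = self.embed(state)                                   # evaluate new state
            for parent, action in reversed(path):                       # back up the path
                a_onehot = nn.functional.one_hot(
                    action, self.readout.out_features).float()
                h = self.backup(torch.cat([h, a_onehot], dim=-1), parent)
            root = h                                                    # updated root embedding
        return self.readout(root)                                       # action logits
```

Note that sampling actions inside the search makes the simulation policy non-differentiable by plain backpropagation, which is exactly the difficulty the paper's training scheme addresses.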

Past events (2)

Revisiting the Uncertainty Bellman Equation

The Study, Mesh
