
Going beyond average for reinforcement learning

Consider the commuter who toils backwards and forwards each day on a train. Most mornings, her train runs on time and she reaches her first meeting relaxed and ready. But she knows that once in a while the unexpected happens: a mechanical problem, a signal failure, or even just a particularly rainy day. Invariably, these hiccups disrupt her routine, leaving her late and flustered.

Randomness is something we encounter every day, and it has a profound effect on how we experience the world. The same is true in reinforcement learning (RL) applications: systems that learn by trial and error and are motivated by rewards. Typically, an RL algorithm predicts the average reward it will receive from multiple attempts at a task, and uses this prediction to decide how to act. But random perturbations in the environment can alter its behaviour by changing the exact amount of reward the system receives.

In a new paper, we show it is possible to model not only the average but also the full variation of this reward, what we call the value distribution. This results in RL systems that are more accurate and faster to train than previous models, and more importantly opens up the possibility of rethinking the whole of reinforcement learning.

Returning to the example of our commuter, let’s consider a journey composed of three segments of 5 minutes each, except that once a week (one workday in five) the train breaks down, adding another 15 minutes to the trip. A simple calculation shows that the average commute time is (3 × 5) + (15 / 5) = 18 minutes: the regular 15-minute journey plus, on average, 3 extra minutes for the occasional breakdown.
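To make the arithmetic concrete, here is a minimal sketch in Python. The numbers follow the example above; the variable names and the use of NumPy are our own illustration, not part of the original article or the paper's code.

```python
import numpy as np

# Commute example: three 5-minute segments, with a 15-minute breakdown
# on one workday out of five.
outcomes = np.array([15.0, 30.0])         # minutes: normal trip, trip with breakdown
probabilities = np.array([4 / 5, 1 / 5])  # the breakdown happens once a week

# The usual RL quantity: the expected (average) commute time.
mean_commute = np.dot(probabilities, outcomes)
print(f"average commute: {mean_commute} minutes")  # 18.0

# The value distribution keeps the whole picture, not just the mean.
for minutes, p in zip(outcomes, probabilities):
    print(f"{minutes:.0f} minutes with probability {p:.0%}")
```

The average alone (18 minutes) describes a trip the commuter never actually takes; the distribution shows the two outcomes she really experiences, which is the distinction the paper builds on.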

Source: https://deepmind.com/blog/article/going-beyond-average-reinforcement-learning
