I was rather surprised by a prediction in a research report I recently read:
“By 2021, 15% of customer experience applications will be continuously hyper-personalized by combining a variety of data and newer reinforcement learning algorithms”
What caught my interest wasn’t that I had no real idea what was meant by “continuously hyper-personalized”, but that apparently, in just over 18 months’ time, roughly 1 in 7 customer experience applications will use reinforcement learning algorithms. With my curiosity piqued, I decided to read further, discovering that reinforcement learning will allow vendors to “personalize recommendations, offers, and even products tailored to an individual’s specific needs.”
However, my curiosity gave way to confusion, for surely recommendation engines already work rather well using established machine learning methods? So I wondered, what is reinforcement learning, what types of problems is it good at answering, and will it really enjoy mainstream adoption or is it just hype?
Types of machine learning
Supervised and unsupervised machine learning methods have been around for a while, so most people are probably already somewhat familiar with them. In summary, supervised learning methods “teach” a model by providing example inputs along with their associated outputs (called a labelled training dataset), so that when new input arrives the model can accurately predict the expected output. On the other hand, with unsupervised learning methods, the objective is not to determine desired outputs but to uncover or “learn” patterns and structures otherwise hidden in the data.
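To make the distinction concrete, here is a toy sketch of my own (the numbers are made up, and real work would use a proper library): supervised learning fits a model to labelled input/output pairs, while unsupervised learning finds structure in unlabelled data.

```python
# Supervised: labelled pairs (input, output) teach a model to predict outputs.
# Here we fit y = a*x + b by ordinary least squares on a tiny labelled dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]            # labels: roughly y = 2x
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
prediction = a * 5.0 + b             # predict the output for an unseen input, 5.0

# Unsupervised: no labels at all; instead, uncover hidden structure.
# Here, group points into 2 clusters by nearest-centre assignment
# (a single, crude k-means-style step).
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centres = [min(points), max(points)]
clusters = [[p for p in points if abs(p - centres[0]) <= abs(p - centres[1])],
            [p for p in points if abs(p - centres[0]) > abs(p - centres[1])]]
```

In the first half the “teaching” comes from the labels in `ys`; in the second half there are no labels, and the grouping emerges purely from the shape of the data.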
Reinforcement learning is a different beast entirely. Fundamentally it is a framework for teaching a model (commonly referred to as an agent) to perform a task or to react to its environment through a process of trial-and-error decision making, typically teaching the model to prefer “positive decisions” over “negative decisions”. The objective is to find a series of decisions that maximises cumulative reward and minimises penalties.
The textbook example of reinforcement learning is that of training a robot that has been placed in the centre of a trap-filled maze, and which has to navigate its way safely to the exit by avoiding the traps on its journey. Very simplistically, each time the maze is played, the robot is rewarded when it makes a correct decision (say, by finding a quick and safe route to the exit) and is penalised when it makes a wrong decision (e.g. when it gets too close to a trap or it gets lost). The robot is trained by solving the maze thousands and thousands of times, learning from its previous mistakes and previous rewards.
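The trial-and-error loop the robot follows can be sketched with tabular Q-learning, one classic reinforcement learning algorithm. This is a toy illustration of my own, shrinking the maze down to a one-dimensional corridor with a trap at one end and the exit at the other:

```python
import random

# Toy "maze": a corridor of 6 cells. The robot starts in cell 2,
# the trap is in cell 0, and the exit is in cell 5.
# Actions: 0 = step left, 1 = step right.
TRAP, EXIT, START = 0, 5, 2
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

random.seed(0)
q = [[0.0, 0.0] for _ in range(6)]       # q[state][action]: learned value estimates

for episode in range(500):               # "solve the maze thousands of times"
    state = START
    while state not in (TRAP, EXIT):
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        next_state = state - 1 if action == 0 else state + 1
        # reward: +1 for reaching the exit, -1 for the trap, 0 otherwise
        reward = 1 if next_state == EXIT else (-1 if next_state == TRAP else 0)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

# After training, the greedy policy from the start cell should head right, to the exit.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(6)]
```

The reward and penalty numbers here are arbitrary, but the shape of the loop is the essence of the approach: act, observe the consequence, and update the value estimates so that decisions leading towards reward are preferred next time.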
The successes and difficulties of reinforcement learning
A much-lauded success story of reinforcement learning is Google’s AlphaGo, which beat the world’s best Go player (Lee Sedol) 4 games to 1 in 2016. The significance of this achievement cannot be overstated – Go is a highly complex game with an estimated 10^170 possible board positions.
Certainly very impressive, but other than playing games and escaping mazes, reinforcement learning has not found widespread adoption or real-world success. So, why not?
Well, with every machine learning approach, there are of course pros and cons. For instance, with supervised learning it is often difficult to provide a labelled training dataset, whereas for unsupervised learning methods, the usefulness of the results is difficult to confirm since no pre-defined outputs exist!
Reinforcement learning faces a number of problems similar in nature to those of supervised and unsupervised methods, but it also has its own unique and highly complex challenges, including difficult training/design set-up, the balance of exploration vs. exploitation, and the so-called long-term credit assignment problem. Indeed, even for relatively simple problems, reinforcement learning requires a huge amount of training, taking anywhere from hours to days or even weeks. As a case in point, it apparently took 6 weeks to train AlphaGo.
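The exploration-vs-exploitation tension is easiest to see in miniature with a two-armed bandit, the standard textbook setting for the problem. The sketch below is my own, with made-up payout probabilities; it shows why an agent that only ever exploits its current best estimate can get stuck, and why a small exploration rate is needed:

```python
import random

random.seed(1)

# Two slot-machine arms with hidden payout probabilities (made-up numbers).
TRUE_PAYOUT = [0.3, 0.7]

def pull(arm):
    """Simulate one pull: pay out 1 with the arm's hidden probability, else 0."""
    return 1 if random.random() < TRUE_PAYOUT[arm] else 0

counts = [0, 0]      # how often each arm has been tried
values = [0.0, 0.0]  # running estimate of each arm's payout

epsilon = 0.1        # exploration rate: 10% random pulls, 90% greedy
for _ in range(2000):
    # exploit: pick the arm with the best estimate so far...
    arm = values.index(max(values))
    # ...but occasionally explore, otherwise a poor early estimate
    # of the truly better arm might never be revised
    if random.random() < epsilon:
        arm = random.randrange(2)
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best_arm = values.index(max(values))
```

Even this toy version needs thousands of pulls to settle on the better arm; in a real environment with many states and delayed rewards, the same trade-off is what makes reinforcement learning training so expensive.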
While reinforcement learning may ultimately have promise, it is important not to overstate its current achievements or its current applicability. For instance, while there has been much focus on the role reinforcement learning had in the development of AlphaGo, what is less well known is that AlphaGo’s training started with Monte Carlo methods and deep neural networks, during which time it learnt from 30 million moves played by expert human players. Reinforcement learning was only applied after this extensive initial training. Further, after the reinforcement learning phase, moves from those games were then fed into a second neural network. In other words, reinforcement learning played only a part (albeit an important part) in the success of AlphaGo – it was not the entire solution.
Data scientists have a tendency to apply new methods to every problem they encounter, simply because they are fascinated by them and often without stopping to think whether the new method should be applied to their particular problem. This is actually a form of bias known as “Maslow’s hammer”.
In the case of reinforcement learning, there are several blogs that explain how it could be applied to recommendation engines. However, such blogs tend not to explain why reinforcement learning should be used for this task (as opposed to tried and tested machine learning methods), nor do they discuss the challenges of productionising such a solution in a real-world system.
Only time will tell whether reinforcement learning becomes as mainstream as some predict, or whether it is best suited only to niche problems such as game solving and robotics. However, does that then mean it doesn’t warrant investment, research or learning about in the meantime? Absolutely not! When asked why he wanted to scale Mount Everest, George Mallory famously replied “Because it’s there” – it provided a focus, a challenge, and a reason, and sometimes that is all that is needed. With reinforcement learning, we have both a challenge and a reward, and so I for one will continue to learn about this fascinating approach, but I won’t be applying it to problems I encounter, at least not until its challenges have been overcome.
Calum Chalmers is a senior AI consultant in the Insights & Data practice, with over 20 years’ analytical and machine learning experience in the financial and energy sectors. He studied mathematics and data science at Warwick University, focusing on streaming algorithms for big data.