## Applied Reinforcement Learning

## The Machine Learning Architecture Behind AlphaGO Explained

In 2016, AlphaGo beat one of the world’s best players at the game of Go. At the time, the feat seemed impossible; now, it is remembered as a key milestone in the history of machine learning.

## Game of Go and Machine Learning

Games, whether they are board games or video-games, are the perfect platform to test, evaluate and improve machine learning models. Games often have a very clear scoring system, and therefore present a clear and effective way to quantify and measure progress.

In the past, there have been key milestones in the history of technology marked by other board games. When DeepBlue beat the world chess champion Garry Kasparov in 1997, this was seen as an incredible achievement in the world of computing. At the time, many thought it would take many more decades, and some thought it would never happen.

DeepBlue worked thanks to human input (from other chess grandmasters) and raw computing power: it could look many moves ahead, and that was enough to beat humans. Some criticised DeepBlue’s approach. Michio Kaku, in his 2008 book “Physics of the Impossible”, called the event a victory of “raw computer power, but the experiment taught us nothing about intelligence”, adding, “I used to think chess required *thought*”. He attributed the victory of machine over human to the fact that DeepBlue could calculate positions far into the future, without *creativity* or *intelligence*.

The game of Go is much more complex than chess. At any given position in chess, the expected number of legal moves is around 35; in Go, it is around 250. In fact, there are more possible configurations of a Go board than there are atoms in the observable universe. Brute-force search into the future is not a strategy that can work, so to solve the game of Go, real *thought*, *creativity* and *intelligence* are required.

This is why defeating the world’s best humans at the game of Go was such an important milestone in the world of machine learning. In this article, I want to go through how Google’s DeepMind was able to master this game and beat some of the best players in history.

## The Rules of Go

Before I get into the ML, a brief description of the game of Go. The game itself is quite simple. Players take turns placing a stone on the board, one at a time. The goal is to surround empty space with your stones: you win points for the empty territory you enclose. If you surround your opponent’s stones, they are captured and removed from the board. At the end of the game, the player that controls the most territory wins.

I tried playing Go online, and did terribly. The good news is you don’t need to understand Go or be any good at it to understand the machine learning behind AlphaGo.

## Reinforcement Learning, the Basics

“Even if you took all the computers in the world, and ran them for a million years, it wouldn’t be enough to compute all possible variations [in the game of Go]” — Demis Hassabis, co-founder and CEO of DeepMind.

Reinforcement learning is the category of machine learning models that learn by playing. These models learn a bit like a human would: they play the game over many iterations, improving as they win or lose. I have another article explaining how reinforcement learning works in much more detail. You can check it out here.

For a quick recap, the core of reinforcement learning can be understood through the following definitions:

**The agent** is the algorithm that is playing the game.

**The environment** is the platform on which the game is played; in the case of Go, it is the board.

**A state** is the current position of all the pieces in the environment.

**An action** is the move that the agent may take at a given state of the environment.

**The value** indicates the likelihood of winning the game given a state or a state/action pair.

**The policy** is the method by which the agent chooses the next action, based on the predicted values of the next states (do you always go for what you think is the best action according to the value function? Should you explore from time to time to learn something new?).

The goal of an RL algorithm is to learn the optimal value function, which allows it to determine, at any given state, the action most likely to lead to a win.
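To make these definitions concrete, here is a minimal sketch in plain Python of an epsilon-greedy policy. The move names and value estimates are made up for illustration; the idea is simply that the agent usually exploits the action whose estimated value is highest, but occasionally explores a random one.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """Pick the action with the highest estimated value,
    but explore a random action with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))  # explore
    # exploit: the action whose successor state looks most winnable
    return max(action_values, key=action_values.get)

# Toy value estimates for three candidate moves (made-up numbers).
values = {"A": 0.62, "B": 0.48, "C": 0.55}
print(epsilon_greedy(values, epsilon=0.0))  # with no exploration: "A"
```

With `epsilon=0` the policy is purely greedy; raising it trades off exploiting the value function against discovering moves the value estimates currently undervalue.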

## Mastering the game of Go with Deep Neural Networks and Tree Search

Paper by David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis.

Published in Nature, 2016.

Here I’ll go through AlphaGo’s architecture. Whilst reading this section, feel free to refer back to the definitions above.

AlphaGo consists of two types of neural networks: *policy* networks and a *value* network.

## Policy Networks

**BLUF:** The goal of the policy network is to determine the probabilities of the next action given a state. Its training is broken down into two steps. First, a policy network is trained through supervised learning (the SL policy network) on professional game data. Then, the weights of this model are used as the initialisation for the policy network trained with reinforcement learning (the RL policy network).

**Supervised Learning (SL) Policy Network:** The input of the policy network is the *state*. The state, as discussed above, represents the positions of the pieces. The board is 19×19. To enrich the input, it is expanded into 48 feature planes, giving it a 19×19×48 shape. These feature planes help the algorithm capture the structure of the position, for example by highlighting the empty spaces around stones, or clusters of stones. This expansion helps the algorithm learn.
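For illustration, the input tensor can be assembled by stacking binary planes over the 19×19 grid. The three planes below are simplified stand-ins I chose for the sketch; the real model uses 48 planes encoding richer features such as liberties and capture information.

```python
import numpy as np

BOARD = 19
board = np.zeros((BOARD, BOARD), dtype=np.int8)  # 0 empty, 1 black, -1 white
board[3, 3], board[15, 15] = 1, -1               # two made-up stones

# A few illustrative binary planes (the real model stacks 48 of them).
planes = np.stack([
    (board == 1),    # my stones
    (board == -1),   # opponent stones
    (board == 0),    # empty points
]).astype(np.float32)

assert planes.shape == (3, BOARD, BOARD)
```

Each board point lands in exactly one of these three planes, which is what makes such one-hot-style encodings easy for a convolutional network to consume.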

The SL policy network is a supervised classification model (so far, no reinforcement learning). It is trained on games played by professional players: given a state, its task is to predict the most likely next *action*, *p(a|s)*. The network is a Convolutional Neural Network (CNN), and it outputs a probability for each point where the next stone could be placed (a 19×19 output).

The weights *(σ)* of the policy network *(p)* are updated in proportion to the gradient of the log-likelihood of the expert move, Δσ ∝ ∂ log pσ(a|s) / ∂σ. This is essentially cross-entropy loss in a supervised classification setting: we update the weights to maximise the likelihood of the human move *(a)* given the state *(s)*.
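A toy numeric sketch of that update, using made-up logits over four candidate moves rather than a real network: the gradient of cross-entropy with respect to the logits is simply (predicted probabilities − one-hot target), and stepping against it raises the probability of the expert's move.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy example: 4 candidate moves, the human expert played move 2.
logits = np.array([0.5, -0.2, 1.0, 0.1])
probs = softmax(logits)
target = np.eye(4)[2]              # one-hot encoding of the expert move

# Cross-entropy gradient w.r.t. the logits is (probs - target);
# a gradient-descent step increases log p(a|s) for the expert move.
grad = probs - target
logits_updated = logits - 0.5 * grad
assert softmax(logits_updated)[2] > probs[2]
```

In the real network the same gradient is backpropagated through the convolutional layers, but the direction of the update is exactly this.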

**The rollout policy** is a much smaller network, trained in the same way. It is a single linear layer, trained on less data. Because it is much faster to evaluate future positions with it, it becomes useful later on when performing the tree search.

**RL Policy Network:** So far the policy networks have been supervised learning models, and all they do is imitate professional players from their training data. They have not yet learnt to play the game on their own. To do so, the SL policy network is improved through reinforcement learning.

The RL policy network is initialised with the weights of the SL policy network. It is then trained through self-play: a pool of these models play against each other. At the end of a game, the probability of the winning model’s actions is increased, and that of the losing model’s actions is decreased. By having the model play itself, we are now teaching it to win games, rather than just to imitate grandmasters.

The update of the weights is similar to that of the SL policy network, except that we wait until the end of the game to perform it: Δρ ∝ ∂ log pρ(a|s) / ∂ρ · *z*, where *z* is the reward based on the outcome of the game, equal to +1 if the agent won and -1 if it lost. This makes the model prefer moves it played when it won, and avoid moves it played when it lost.
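A hedged sketch of that idea (my own toy code, not DeepMind's): it is the same log-likelihood gradient as in the supervised case, but scaled by the outcome z, so winning games reinforce their moves and losing games suppress them.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    return np.exp(z) / np.exp(z).sum()

def rl_policy_update(logits, action, z, lr=0.5):
    """REINFORCE-style update: scale the log-likelihood gradient
    by the game outcome z (+1 for a win, -1 for a loss)."""
    probs = softmax(logits)
    onehot = np.eye(len(logits))[action]
    # gradient ascent on z * log p(action | state)
    return logits + lr * z * (onehot - probs)

logits = np.zeros(3)                    # start indifferent between 3 moves
p_before = softmax(logits)[1]
after_win = softmax(rl_policy_update(logits, action=1, z=+1))[1]
after_loss = softmax(rl_policy_update(logits, action=1, z=-1))[1]
assert after_win > p_before > after_loss
```

The only difference from the SL update is the factor z: with z = +1 it is identical to imitating the move, with z = -1 it pushes probability away from it.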

## The Value Network

**BLUF:** The value network predicts the probability of the agent winning the game given the state. With a perfect value network, you could determine the best move in each position by choosing the action that leads to the state with the highest value. In practice, the authors used a tree search to choose the next action, guided by this approximate value function.

To train the value network, 30 million positions are sampled from self-play games (the RL policy network playing against itself), one position per game to avoid highly correlated samples. The value network is then trained as a regression on these positions, predicting whether each one is winning or losing. This is once again a supervised learning task, although this time a regression task.

The gradient update of the value function is Δθ ∝ ∂vθ(s)/∂θ · (z − vθ(s)). The scaling term (z − vθ(s)) compares the outcome *z* with the value network’s prediction *vθ(s)*. If, given a state, the value function predicts a win and the agent does win, the term is small and positive. If it predicts a win but the agent ends up losing, the term is negative and the weights are updated in the opposite direction. The magnitude of the term grows with how wrong the prediction of the likelihood of winning was. This is essentially the gradient of an L2 loss in a supervised regression task.
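As a toy illustration (a tiny linear value model on made-up features, nothing like the real network), one step of the update Δθ ∝ (z − v(s)) ∂v/∂θ nudges the prediction towards the observed outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(4)                 # weights of a tiny linear value model
features = rng.normal(size=4)       # made-up state features
z = 1.0                             # suppose the agent went on to win this game

def v(theta, s):
    return np.tanh(theta @ s)       # squash into (-1, 1), like a win score

# One gradient step on the L2 loss (z - v)^2:
error = z - v(theta, features)
grad_v = (1 - np.tanh(theta @ features) ** 2) * features  # dv/dtheta
theta_new = theta + 0.1 * error * grad_v

# The prediction error for this position shrinks after the step.
assert abs(z - v(theta_new, features)) < abs(z - v(theta, features))
```

Repeating this over millions of (position, outcome) pairs is what lets the value network estimate win probabilities from a single board state.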

## Tree Search

In order to determine the next best move, a tree search is performed.

The tree search begins at a given state *s*. An action is selected based on *Q* and *u(s, a)*.

*Q* is the *action value function*, which determines the value of an action rather than a state.

*u(s, a)* is an exploration bonus used by the tree search: it is proportional to a prior probability *P(s, a)* (taken from the SL policy network) and inversely proportional to the number of visits to that edge of the tree, *N(s, a)*. At the beginning, the algorithm tends to explore new moves and trusts the *Q* value function less (the *u* part dominates the argmax, since *N(s, a)* is small); but once the search has traversed the same edges many times, it relies more on *Q*, allowing it to explore deeper into promising branches that have been visited often.
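A minimal sketch of this selection rule (the move names and exploration constant are mine): *u* is proportional to the prior and decays as the visit count grows, so rarely-visited moves get tried first and well-visited moves are judged by *Q*.

```python
def select_action(Q, P, N, c_puct=1.0):
    """Pick the action maximising Q(s, a) + u(s, a),
    where u is proportional to the prior and shrinks with visits."""
    def score(a):
        u = c_puct * P[a] / (1 + N[a])
        return Q[a] + u
    return max(Q, key=score)

P = {"a": 0.8, "b": 0.2}            # priors from the SL policy network

# Two moves with equal Q: the one with the higher prior is explored first.
Q = {"a": 0.0, "b": 0.0}
N = {"a": 0, "b": 0}
assert select_action(Q, P, N) == "a"

# After many visits, u fades and the empirically better Q wins out.
Q = {"a": 0.1, "b": 0.4}
N = {"a": 500, "b": 500}
assert select_action(Q, P, N) == "b"
```

This is the mechanism that shifts the search from prior-driven exploration early on to value-driven exploitation once an edge has been visited many times.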

To calculate *Q* (the action value function), each leaf is first evaluated using *V*, a weighted average of the value function *v* of the leaf state and the rollout outcome *z*. The rollout outcome is the result of the game (win or lose) when the rollout policy network plays the game out from that leaf state. Because the rollout policy network is small, it is quick to evaluate, so the tree search can play out the whole game many times for each leaf and take those outcomes into account when picking the final action. Finally, *Q* is the mean of the leaf evaluations *V* over all visits to that edge.
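Sketching the leaf evaluation with made-up numbers (λ is the mixing weight between the value network's estimate and the rollout result): each visit through an edge contributes one blended evaluation, and *Q* is their mean.

```python
def leaf_evaluation(v_s, z_rollout, lam=0.5):
    """Blend the value network's estimate v(s) with the rollout outcome z."""
    return (1 - lam) * v_s + lam * z_rollout

# Three simulated visits through the same edge (made-up numbers):
evaluations = [
    leaf_evaluation(0.6, +1),   # value net optimistic, rollout won
    leaf_evaluation(0.6, -1),   # same estimate, but this rollout lost
    leaf_evaluation(0.5, +1),
]
N = len(evaluations)
Q = sum(evaluations) / N        # Q is the mean leaf evaluation over visits
```

Averaging over many noisy rollouts is what makes the cheap, weak rollout policy useful: individually its games are unreliable, but their mean, combined with the value network, is a strong signal.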

## Results and Further Research

The AlphaGo models were in reality a series of models, each new iteration improving on the previous. Two papers were published, the first in 2016, “Mastering the game of Go with Deep Neural Networks and Tree Search” (the one I went through), and a second one in 2017, “Mastering the Game of Go Without Human Knowledge”. This second paper presents AlphaGo Zero, which uses no prior human knowledge (no professional-game training data) and relies solely on reinforcement learning: policy and value networks trained purely by self-play, combined with the tree search. A later generalisation of this approach, AlphaZero, learned to play other board games such as chess, where it was able to beat Stockfish, one of the strongest chess engines.

AlphaGo Lee was the model that famously defeated Lee Sedol, one of the strongest players in the history of Go, 4 to 1 in 2016. AlphaGo Master then went on to defeat a series of top Go professionals, scoring 60 wins to 0. Finally, AlphaGo Zero was an even stronger iteration that used no professional training data at all.

## Conclusion

In this article, I described the architecture of AlphaGo, the machine learning model that defeated some of the top Go players of all time. I first went through some of the basics of reinforcement learning, and then I broke down the architecture of the model.

Now you may be wondering, why put so much effort into solving a board game? After all, this isn’t really helping anybody. Research is about furthering our knowledge as a whole, and games like Go allow us to quantify progress very easily. The progress we make on these board games can then be applied to greater challenges. DeepMind works on models to save energy, identify disease and accelerate science across the globe. This kind of research is critical, as it furthers our knowledge and understanding of AI, and in the future it is likely to act as the catalyst for many life-changing technologies.

## Support me

Hopefully, this helped you, if you enjoyed it you can **follow me!**

You can also become a **Medium member** using my referral link, and get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership

## References

[1] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis, “Mastering the game of Go with deep neural networks and tree search”, Nature, 2016. Available: https://www.nature.com/articles/nature16961

[2] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel & Demis Hassabis, “Mastering the game of Go without human knowledge”, Nature, 2017. Available: https://www.nature.com/articles/nature24270
