Mastering the game of Go with deep neural networks and tree search

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.


All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80)1 and especially Go (b ≈ 250, d ≈ 150)1, exhaustive search is infeasible2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to superhuman performance in chess4, checkers5 and othello6, but it was believed to be intractable in Go due to the complexity of the game7. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts8 search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon8 and Scrabble9, and weak amateur level play in Go10.
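To give a sense of scale for the b^d estimate above, the following short sketch computes the order of magnitude of the search tree for the chess and Go figures quoted in the text. The function name `log10_tree_size` is hypothetical and chosen only for illustration; the calculation works in logarithms because b^d itself is too large to be useful as a plain number.

```python
import math


def log10_tree_size(breadth: int, depth: int) -> float:
    """Return log10(b**d), the order of magnitude of the number of move sequences."""
    return depth * math.log10(breadth)


# Breadth and depth values are the approximate figures quoted in the text.
chess_exp = log10_tree_size(35, 80)
go_exp = log10_tree_size(250, 150)

print(f"chess: roughly 10^{chess_exp:.1f} move sequences")
print(f"Go:    roughly 10^{go_exp:.1f} move sequences")
```

Under these figures the Go tree is larger than the chess tree by well over two hundred orders of magnitude, which is why both depth reduction and breadth reduction are needed.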


Monte Carlo tree search (MCTS)11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function12. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves13. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play13–15. However, prior work has been limited to shallow policies13–15 or value functions16 based on a linear combination of input features.
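The four phases of MCTS described above (selection, expansion, rollout, backup) can be illustrated on a toy game. The sketch below is not Go and is not the search used by AlphaGo: it assumes a simple subtraction game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins), uniformly random rollouts, and the common UCB1 selection rule with an assumed exploration constant. All names (`Node`, `rollout`, `uct_child`, `mcts`) are hypothetical.

```python
import math
import random

MOVES = (1, 2, 3)  # a player may remove 1, 2 or 3 stones


def legal_moves(stones):
    return [m for m in MOVES if m <= stones]


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones                 # stones left for the player to move
        self.parent = parent
        self.move = move                     # move that led from parent to here
        self.children = []
        self.untried = legal_moves(stones)   # moves not yet expanded
        self.visits = 0
        self.wins = 0.0                      # wins for the player who moved INTO this node


def rollout(stones, rng):
    """Play uniformly random moves; True if the player to move at `stones` wins."""
    turn = 0  # 0 = player to move at the start of the rollout
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return turn == 0
        turn ^= 1
    return False  # no stones left: the previous player already took the last one


def uct_child(node, c=1.4):
    """UCB1: mean value plus an exploration bonus (c is an assumed constant)."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts(root_stones, n_sim=2000, seed=0):
    rng = random.Random(seed)
    root = Node(root_stones)
    for _ in range(n_sim):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one previously untried child.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random rollout from the new node.
        mover_wins = not rollout(node.stones, rng)
        # 4. Backup: update statistics, flipping perspective at each level.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_wins else 0.0
            mover_wins = not mover_wins
            node = node.parent
    # Recommend the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move


best = mcts(5)
print("recommended move from 5 stones:", best)
```

In this toy game, leaving the opponent a multiple of four stones is a winning strategy under perfect play, so from five stones the search is expected to concentrate its visits on removing one stone as the number of simulations grows.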


Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification17, face recognition18, and playing Atari games19. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image20. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
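As a rough illustration of treating the board as a 19 × 19 image, the sketch below encodes a position into binary feature planes and applies a single 3 × 3 convolutional filter followed by a rectifier. The three-plane encoding (own stones, opponent stones, empty points), the hand-set kernel and all function names are assumptions made for this example; they are not the feature set or the learned weights of the networks described here.

```python
SIZE = 19
EMPTY, BLACK, WHITE = 0, 1, 2


def encode(board, to_play):
    """Return three 19x19 binary planes: own stones, opponent stones, empty points."""
    opponent = BLACK if to_play == WHITE else WHITE
    return [
        [[1.0 if board[r][c] == target else 0.0 for c in range(SIZE)]
         for r in range(SIZE)]
        for target in (to_play, opponent, EMPTY)
    ]


def conv3x3_relu(planes, kernels, bias=0.0):
    """Apply one 3x3 filter (one kernel per input plane, zero padding), then ReLU."""
    out = [[0.0] * SIZE for _ in range(SIZE)]
    for r in range(SIZE):
        for c in range(SIZE):
            total = bias
            for plane, kernel in zip(planes, kernels):
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < SIZE and 0 <= cc < SIZE:
                            total += kernel[dr + 1][dc + 1] * plane[rr][cc]
            out[r][c] = max(0.0, total)  # rectifier nonlinearity
    return out


# A nearly empty board with one black and one white stone.
board = [[EMPTY] * SIZE for _ in range(SIZE)]
board[3][3] = BLACK
board[3][4] = WHITE
planes = encode(board, to_play=BLACK)

# Hand-set filter: count own stones in each 3x3 neighbourhood, ignore other planes.
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]
own_neighbourhood = conv3x3_relu(planes, [ones, zeros, zeros])
print(own_neighbourhood[3][3], own_neighbourhood[3][5])
```

A trained network would stack many such filters and layers with learned weights; the point of the sketch is only that each output value depends on a small local window of the board.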

We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network pσ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work13,15, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network pρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.


Supervised learning of policy networks
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning13,21–24. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s.
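The gradient-ascent step described above can be written compactly. This is a sketch in the paragraph's own notation: the weights σ move in the direction that increases the log-likelihood of the expert move. The step size α and the minibatch size m in the second expression are symbols assumed here for illustration.

```latex
\Delta\sigma \;\propto\; \frac{\partial \log p_{\sigma}(a \mid s)}{\partial \sigma},
\qquad
\sigma \;\leftarrow\; \sigma + \frac{\alpha}{m} \sum_{k=1}^{m}
  \frac{\partial \log p_{\sigma}(a^{k} \mid s^{k})}{\partial \sigma}
```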



We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission24 (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.
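A linear softmax policy of the kind mentioned above can be sketched in a few lines. The example below assumes each legal move is described by a small feature vector and trains the weights by stochastic gradient ascent on the log-likelihood of an expert move. The feature vectors, learning rate and function names are made up for illustration and do not correspond to the pattern features in Extended Data Table 4.

```python
import math


def softmax_policy(weights, features_per_move):
    """p(a|s) proportional to exp(weights . features(a)) over the given legal moves."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in features_per_move]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def ascent_step(weights, features_per_move, expert_index, lr=0.1):
    """One gradient ascent step on log p(expert move | s).

    For a linear softmax the gradient is the expert move's features minus the
    expected features under the current policy.
    """
    probs = softmax_policy(weights, features_per_move)
    expected = [
        sum(p * feats[i] for p, feats in zip(probs, features_per_move))
        for i in range(len(weights))
    ]
    chosen = features_per_move[expert_index]
    return [w + lr * (chosen[i] - expected[i]) for i, w in enumerate(weights)]


# Three legal moves, two hypothetical binary pattern features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.0, 0.0]

before = softmax_policy(weights, features)[0]
for _ in range(50):
    weights = ascent_step(weights, features, expert_index=0)
after = softmax_policy(weights, features)[0]

print(f"p(expert move) before={before:.3f} after={after:.3f}")
```

Because selecting a move needs only one dot product per legal move, a policy of this form is cheap to evaluate, which is consistent with the large speed gap between the rollout policy and the policy network reported above.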




