Mastering the game of Go with deep neural networks and tree search


The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

Abstract
The game of Go has long been regarded as the classic game that is hardest for artificial intelligence to master, because of its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses value networks to evaluate board positions and policy networks to select moves. These deep neural networks are trained by a novel combination of supervised learning from the games of human experts and reinforcement learning from games of self-play. Even without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with the value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the European Go champion Fan Hui by 5 games to 0. This is the first time a computer program has defeated a professional human player in the full-sized game of Go, a feat previously believed to be at least a decade away.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80)1 and especially Go (b ≈ 250, d ≈ 150)1, exhaustive search is infeasible2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to superhuman performance in chess4, checkers5 and Othello6, but it was believed to be intractable in Go due to the complexity of the game7. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts8 search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon8 and Scrabble9, and weak amateur level play in Go10.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game from every board position or state s under perfect play by all players. Such games can be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (the number of legal moves per position) and d is its depth (the length of the game). In large games such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible, but the effective search space can be reduced in two general ways. First, the depth of the search can be reduced by position evaluation: truncate the search tree at state s and replace the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from s. This approach has produced superhuman play in chess, checkers and Othello, but it was long believed to be intractable in Go because of the game's complexity. Second, the breadth of the search can be reduced by sampling actions from a policy p(a|s), a probability distribution over the possible moves a in position s. For example, Monte Carlo rollouts search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts provides an effective position evaluation; it has achieved superhuman play in backgammon and Scrabble, and weak amateur-level play in Go.
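To make the breadth-reduction idea concrete, here is a minimal sketch of rollout-based position evaluation. It is illustrative only: the game interface (copy, legal_moves, play, is_terminal, winner, to_play) and the policy function are hypothetical stand-ins for this sketch, not an API from the paper.

```python
import random

def rollout_value(state, policy, num_rollouts=100):
    """Estimate v(s) by averaging the outcomes of sampled playouts.

    `state` is assumed to expose copy(), legal_moves(), play(move),
    is_terminal(), winner() and to_play (a hypothetical game interface).
    `policy(s)` returns a dict mapping each legal move to a probability.
    """
    total = 0.0
    for _ in range(num_rollouts):
        s = state.copy()
        # Search to maximum depth without branching: sample one move per step.
        while not s.is_terminal():
            moves = s.legal_moves()
            probs = policy(s)
            move = random.choices(moves, weights=[probs[m] for m in moves])[0]
            s = s.play(move)
        # Score +1 if the side to move in the root position wins, -1 otherwise.
        total += 1.0 if s.winner() == state.to_play else -1.0
    return total / num_rollouts
```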

Monte Carlo tree search (MCTS)11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function12. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves13. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play13–15. However, prior work has been limited to shallow policies13–15 or value functions16 based on a linear combination of input features.

Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are run, the search tree grows larger and the estimated values become more accurate. The policy used to select actions during search also improves over time, by preferring children with higher values. Asymptotically, this policy converges to optimal play and the evaluations converge to the optimal value function. The strongest current Go programs are based on MCTS, enhanced by policies trained to predict the moves of human experts. These policies are used to narrow the search to a beam of high-probability actions and to sample actions during rollouts. This approach has reached strong amateur level. However, earlier work was limited to shallow policies, or to value functions based on a linear combination of input features.
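The sketch below shows the basic MCTS loop described above (selection by an upper-confidence score, expansion, rollout evaluation, and backup). It reuses the hypothetical game interface from the previous sketch, and it is a deliberately simplified single-perspective version, not the paper's search algorithm.

```python
import math
import random

class Node:
    """Search-tree node tracking visit counts and accumulated values."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}                        # move -> Node
        self.untried = list(state.legal_moves())  # moves not yet expanded
        self.visits = 0
        self.value_sum = 0.0

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def ucb_child(node, c=1.4):
    """Select the child maximizing its mean value plus an exploration bonus."""
    return max(node.children.values(),
               key=lambda ch: ch.mean_value()
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, num_simulations=1000):
    root = Node(root_state)
    for _ in range(num_simulations):
        node = root
        # 1. Selection: descend through fully expanded nodes.
        while not node.untried and node.children:
            node = ucb_child(node)
        # 2. Expansion: add one child for an untried move.
        if node.untried and not node.state.is_terminal():
            move = node.untried.pop()
            child = Node(node.state.play(move), parent=node)
            node.children[move] = child
            node = child
        # 3. Evaluation: random rollout to the end of the game.
        s = node.state
        while not s.is_terminal():
            s = s.play(random.choice(s.legal_moves()))
        # Simplified: outcomes are scored from the root player's perspective;
        # a full two-player search would flip the sign at alternating plies.
        outcome = 1.0 if s.winner() == root_state.to_play else -1.0
        # 4. Backup: propagate the outcome along the visited path.
        while node is not None:
            node.visits += 1
            node.value_sum += outcome
            node = node.parent
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```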

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification17, face recognition18, and playing Atari games19. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image20. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains, for example image classification, face recognition, and playing Atari games. They use many layers of neurons, each arranged in overlapping tiles, to build increasingly abstract, localized representations of an image. We employ a similar architecture for the game of Go: we pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We then use these neural networks to reduce the effective depth and breadth of the search tree, evaluating positions with a value network and sampling actions with a policy network.
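As an illustration of treating the 19 × 19 board as an image, here is a small convolutional policy network written with PyTorch. The choice of framework, the number of input feature planes and the filter counts are assumptions for this sketch; it only loosely mirrors the network described in the paper.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Illustrative convolutional policy network for a 19x19 board.

    The 48 input feature planes and 192 filters are placeholders chosen for
    the sketch; they follow the paper's description only loosely.
    """
    def __init__(self, in_planes=48, filters=192, layers=5):
        super().__init__()
        blocks = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(layers - 2):
            blocks += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        # Final 1x1 convolution maps each board point to a single logit.
        blocks += [nn.Conv2d(filters, 1, kernel_size=1)]
        self.body = nn.Sequential(*blocks)

    def forward(self, x):                      # x: (batch, in_planes, 19, 19)
        logits = self.body(x).flatten(1)       # (batch, 361)
        return torch.softmax(logits, dim=1)    # probability over all board points

# Example: a batch of one position with random features.
net = PolicyNet()
probs = net(torch.randn(1, 48, 19, 19))
print(probs.shape)  # torch.Size([1, 361])
```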

We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network pσ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work13,15, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network pρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network pσ directly from the moves of human experts; this provides fast, efficient learning updates with immediate feedback and high-quality gradients. As in earlier work, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network pρ that improves the SL policy network by optimizing the final outcome of games of self-play; this adjusts the policy towards the true goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.
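To illustrate how optimizing the final outcome differs from maximizing move-prediction accuracy, the sketch below shows a REINFORCE-style policy-gradient step in which the only learning signal is whether the game was won. It is a simplified stand-in for the RL stage, not the paper's exact procedure; the policy network, tensors and optimizer are assumed to be supplied by the caller.

```python
import torch

def reinforce_update(policy, optimizer, states, moves, outcome):
    """One policy-gradient step over the moves of a single self-play game.

    states:  board feature tensor, shape (T, planes, 19, 19)
    moves:   chosen move indices, shape (T,), values in [0, 361)
    outcome: +1.0 if the updated policy's side won the game, -1.0 otherwise
    """
    probs = policy(states)                                   # (T, 361)
    chosen = probs.gather(1, moves.unsqueeze(1)).squeeze(1)  # p(a_t | s_t)
    log_probs = torch.log(chosen)
    # Ascend the outcome-weighted log-likelihood (descend its negative):
    # won games reinforce the moves played, lost games suppress them.
    loss = -(outcome * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```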

Supervised learning of policy networks
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning13,21–24. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s

For the first stage of the training pipeline, we build on earlier work that predicts expert moves in the game of Go using supervised learning. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ and rectifier nonlinearities; a final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s:

Δσ ∝ ∂ log pσ(a|s) / ∂σ    (1)
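A minimal sketch of this update, assuming a PyTorch policy network like the earlier PolicyNet sketch whose forward pass returns a softmax distribution over the 361 board points: gradient ascent on log pσ(a|s) is realised as gradient descent on the negative log-likelihood of the expert move.

```python
import torch
import torch.nn.functional as F

def sl_policy_step(policy_net, optimizer, boards, expert_moves):
    """One stochastic-gradient step on a minibatch of (state, expert move) pairs.

    boards:       (batch, planes, 19, 19) input feature planes
    expert_moves: (batch,) indices of the human moves, values in [0, 361)
    """
    probs = policy_net(boards)                   # softmax output, (batch, 361)
    # Equation (1): step in the direction of d log p_sigma(a|s) / d sigma,
    # implemented here by minimising the mean negative log-likelihood.
    nll = F.nll_loss(torch.log(probs), expert_moves)
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()
```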

We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission24 (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.


Source: CSDN blog. Author: myfather103. Original article: Mastering the game of Go with deep neural networks and tree search.
