A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets are available, they may impose a ceiling on the performance of systems trained in this manner. By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games, such as Atari and 3D virtual environments. However, the most challenging domains in terms of human intellect—such as the game of Go, widely viewed as a grand challenge for artificial intelligence—require a precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.
AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.
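To make the combination of value network and rollouts concrete, the following minimal sketch blends the two kinds of leaf evaluation with a mixing weight. The helper functions and the weight `lam` are illustrative assumptions for this sketch, not values taken from this paper.

```python
import random

def evaluate_leaf_alphago_lee(position, value_net, rollout_policy, lam=0.5):
    """Hypothetical sketch: blend the value-network evaluation of a leaf
    position with the outcome of one fast rollout, as in AlphaGo Fan/Lee.
    The mixing weight `lam` is an assumption here."""
    v = value_net(position)        # scalar evaluation from the value network
    z = rollout_policy(position)   # +/-1 outcome of a fast rollout to the end of the game
    return (1 - lam) * v + lam * z

# Toy stand-ins so the sketch runs; the real components are deep networks.
toy_value_net = lambda pos: 0.1
toy_rollout = lambda pos: random.choice([-1, 1])

print(evaluate_leaf_alphago_lee("some position", toy_value_net, toy_rollout))
```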
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.
Reinforcement learning in AlphaGo Zero
Our new method uses a deep neural network f_θ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = f_θ(s). The vector of move probabilities p represents the probability of selecting each move (including pass), and the value v is a scalar evaluation estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network into a single architecture, and consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).
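A minimal sketch of such a combined policy-and-value network is given below (in PyTorch). It follows the description above, a tower of residual convolutional blocks with batch normalization and rectifier nonlinearities feeding one policy head and one value head, but the channel count, number of blocks and input encoding are illustrative assumptions rather than the configuration used by AlphaGo Zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19        # Go board size
IN_PLANES = 17    # assumed input planes (stone history + colour to play)
CHANNELS = 64     # illustrative; far smaller than the real network

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.b1(self.c1(x)))
        y = self.b2(self.c2(y))
        return F.relu(x + y)   # skip connection

class PolicyValueNet(nn.Module):
    """f_theta(s) -> (log p, v): move log-probabilities and a scalar value."""
    def __init__(self, blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, CHANNELS, 3, padding=1, bias=False),
            nn.BatchNorm2d(CHANNELS), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(CHANNELS) for _ in range(blocks)])
        self.policy_head = nn.Sequential(
            nn.Conv2d(CHANNELS, 2, 1), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))   # +1 for the pass move
        self.value_head = nn.Sequential(
            nn.Conv2d(CHANNELS, 1, 1), nn.Flatten(),
            nn.Linear(BOARD * BOARD, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())                       # value in [-1, 1]

    def forward(self, s):
        x = self.tower(self.stem(s))
        return F.log_softmax(self.policy_head(x), dim=1), self.value_head(x).squeeze(1)

# Shape check with a random batch of two positions.
log_p, v = PolicyValueNet()(torch.zeros(2, IN_PLANES, BOARD, BOARD))
print(log_p.shape, v.shape)   # torch.Size([2, 362]) torch.Size([2])
```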
The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network f_θ(s); MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search, using the improved MCTS-based policy to select each move and the game winner z as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value (p, v) = f_θ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger.
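The resulting policy iteration can be summarized by the sketch below. The helper functions are toy stand-ins for the MCTS and network components described in this section, and the iteration counts and termination rule are assumptions; the sketch is meant only to show the structure of the loop (self-play with search, then training on (s, π, z)).

```python
import random

# Toy stand-ins so the control flow runs; the real versions are the MCTS and
# neural-network components described in this section.
def mcts_search(net, position):      # returns search probabilities pi
    return [1.0 / 362] * 362
def sample_move(pi):                 # play a move drawn from pi
    return random.randrange(len(pi))
def game_over(position, t):          # assumed cap on game length for the sketch
    return t >= 10
def train_step(net, batch):          # gradient step on loss (1); placeholder here
    return net

def self_play_reinforcement_learning(net, iterations=3, games_per_iter=2):
    """Policy iteration: MCTS improves the policy, game outcomes evaluate it,
    and the network is trained to match both (a sketch, not the real schedule)."""
    for _ in range(iterations):
        data = []
        for _ in range(games_per_iter):
            position, history, t = "empty board", [], 0
            while not game_over(position, t):
                pi = mcts_search(net, position)            # improved policy
                history.append((position, pi))
                position, t = sample_move(pi), t + 1
            z = random.choice([-1, +1])                    # game winner
            data += [(s, p, z) for (s, p) in history]      # training targets (s, pi, z)
        net = train_step(net, data)                        # match (p, v) to (pi, z)
    return net

self_play_reinforcement_learning(net=None)
```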
The MCTS uses the neural network f_θ to guide its simulations (see Fig. 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a) and an action value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)), until a leaf node s′ is encountered. The leaf position is expanded and evaluated only once by the network to generate both prior probabilities and an evaluation, (P(s′, ·), V(s′)) = f_θ(s′). Each edge (s, a) traversed in the simulation is then updated to increment its visit count N(s, a) and to set its action value Q(s, a) to the mean evaluation over these simulations.
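A compact sketch of these search statistics is shown below. It keeps only the quantities named above (P, N, Q and an upper confidence bound with U(s, a) ∝ P(s, a)/(1 + N(s, a))); the exploration constant `c_puct` and the square-root term are assumptions, and the sign handling between the two players is omitted.

```python
import math

class Edge:
    def __init__(self, prior):
        self.P, self.N, self.W = prior, 0, 0.0   # prior, visit count, total value
    @property
    def Q(self):                                 # mean action value
        return self.W / self.N if self.N else 0.0

def select_move(edges, c_puct=1.0):
    """Pick the move maximizing Q(s,a) + U(s,a), with U proportional to
    P(s,a) / (1 + N(s,a)) (c_puct and the sqrt term are assumptions)."""
    total_n = sum(e.N for e in edges.values())
    def ucb(e):
        return e.Q + c_puct * e.P * math.sqrt(total_n + 1) / (1 + e.N)
    return max(edges, key=lambda a: ucb(edges[a]))

def backup(path, leaf_value):
    """Update every edge traversed in the simulation with the leaf evaluation."""
    for edge in path:
        edge.N += 1
        edge.W += leaf_value   # Q becomes the mean of the evaluations

# Tiny usage example with three candidate moves and made-up priors.
edges = {"a": Edge(0.5), "b": Edge(0.3), "c": Edge(0.2)}
move = select_move(edges)
backup([edges[move]], leaf_value=+1.0)
print(move, edges[move].N, round(edges[move].Q, 2))
```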
MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
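The search probabilities π_a ∝ N(s, a)^(1/τ) can be computed as in the following small sketch; the move labels and visit counts are made up for illustration.

```python
def search_probabilities(visit_counts, tau=1.0):
    """pi_a proportional to N(s, a)^(1/tau); a small tau concentrates the
    distribution on the most-visited move (tau -> 0 approaches a greedy choice)."""
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: x / total for a, x in powered.items()}

print(search_probabilities({"a": 30, "b": 10, "c": 2}, tau=1.0))
print(search_probabilities({"a": 30, "b": 10, "c": 2}, tau=0.25))
```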
The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights θ_0. At each subsequent iteration, games of self-play are generated: at each time-step t, an MCTS search is executed using the neural network from the previous iteration, and a move is played by sampling the resulting search probabilities π_t. When a game terminates it is scored to give a final reward of −1 or +1, and the data for each time-step t is stored as (s_t, π_t, z_t), where z_t is the game winner from the perspective of the current player at step t. In parallel, new network parameters θ are trained from data (s, π, z) sampled uniformly among the time-steps of the most recent self-play games. The neural network (p, v) = f_θ(s) is adjusted to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean-squared error and cross-entropy losses, respectively:

(p, v) = f_θ(s)  and  l = (z − v)² − π^T log p + c‖θ‖²   (1)

where c is a parameter controlling the level of L2 weight regularization (to prevent overfitting).
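Equation (1) translates almost directly into code. The sketch below assumes the network outputs log-probabilities for p and adds the c‖θ‖² term through the optimizer's weight decay rather than in the loss itself; the batch contents are random placeholders.

```python
import torch

def alphago_zero_loss(log_p, v, pi, z):
    """l = (z - v)^2 - pi^T log p; the c*||theta||^2 term is typically applied
    via the optimizer's weight_decay rather than inside the loss."""
    value_loss = (z - v).pow(2).mean()
    policy_loss = -(pi * log_p).sum(dim=1).mean()
    return value_loss + policy_loss

# Toy batch: 2 positions, 362 moves (19*19 + pass).
log_p = torch.log_softmax(torch.randn(2, 362), dim=1)
v = torch.tanh(torch.randn(2))
pi = torch.softmax(torch.randn(2, 362), dim=1)
z = torch.tensor([1.0, -1.0])
print(alphago_zero_loss(log_p, v, pi, z))

# The regularization constant c corresponds to weight_decay in the optimizer, e.g.
# torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# (the value 1e-4 is an illustrative choice, not taken from this text).
```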
Empirical analysis of AlphaGo Zero training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately three days.
Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).
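For reference, the headline settings of this run can be collected into a small configuration sketch; the values are those quoted in the paragraph above, and everything else about the pipeline is omitted.

```python
# Training settings quoted above for the 3-day AlphaGo Zero run (20-block network).
ALPHAGO_ZERO_20_BLOCK_CONFIG = {
    "self_play_games": 4_900_000,        # games generated during training
    "mcts_simulations_per_move": 1_600,  # ~0.4 s of thinking time per move
    "mini_batches": 700_000,
    "mini_batch_size": 2_048,            # positions per mini-batch
    "residual_blocks": 20,
    "training_duration_days": 3,         # approximate, with no human intervention
}
```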
Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting that have been suggested in previous literature. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs), whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).
To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved state-of-the-art prediction accuracy compared to previous work (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.
To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as were used in AlphaGo Lee, or combined policy and value networks, as used in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)), using a fixed dataset of self-play games generated by AlphaGo Zero after 72 h of self-play training. Using a residual network was more accurate, achieved lower error and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.
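To make the first comparison concrete, the sketch below shows a plain convolutional block of the kind used in AlphaGo Lee; the residual block sketched earlier differs only by its skip connection. The four networks of Fig. 4 are then the combinations of a plain or residual tower with separate or combined policy/value heads, each trained on the same fixed self-play dataset with the loss of equation (1). The block size here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainConvBlock(nn.Module):
    """AlphaGo Lee-style block: convolution + batch norm + ReLU, with no skip
    connection (contrast with the residual block sketched earlier)."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

# Quick shape check on a toy feature map.
print(PlainConvBlock(8)(torch.zeros(1, 8, 19, 19)).shape)
```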
Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge.
Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast self-play games played at different stages of training (see Supplementary Information). Tournament-length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.
Final performance of AlphaGo Zero
We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.
Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information.
We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January 2017. In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.
Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.
Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game match with 2 h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).
Conclusion
Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin.
Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
Further topics: Batch Normalization and Rectifier Nonlinearity.