
Excerpted from "Reinforcement Learning: An Introduction", 2nd edition, 2016 draft, Chapter 2

https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf

The book's introduction leads into Chapter 2 as follows:

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.

One of the challenges in reinforcement learning, absent from other kinds of learning, is how to handle the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent tends to prefer actions it has tried in the past and found to yield high reward. But to discover such actions, it must try actions it has never selected before. In other words, the agent has to exploit what it already knows in order to obtain reward, while also exploring actively so that it can make better action selections in the future. The dilemma is that pursuing either exploration or exploitation exclusively leads to failure at the task. The agent should therefore try a variety of actions while progressively favoring those that currently appear best. On a stochastic task, each action must be tried many times to obtain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been studied intensively by mathematicians for decades (see Chapter 2). For now, it is enough to note that the whole issue of balancing exploration and exploitation does not even arise in supervised or unsupervised learning, at least in their purest forms.
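Chapter 2 makes "a reliable estimate of its expected reward" concrete with the sample-average method: the value of an action is estimated as the average of the rewards received on the steps when that action was taken. In the book's notation (A_i is the action selected at step i, R_i the corresponding reward, and 1 the indicator function), the estimate at time t is

Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}

By the law of large numbers, as an action is taken more and more often, Q_t(a) converges to the action's true expected reward, which is why the quoted passage insists that each action be tried many times.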

Chapter 2 itself opens as follows:

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. There are also interesting intermediate cases in which evaluation and instruction blend together.

What distinguishes reinforcement learning from other kinds of learning is that its training information evaluates the actions taken (the reward obtained by taking an action) rather than instructing by giving the correct actions. This creates the need for active exploration, an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how much reward a chosen action yields, but not whether it is the best or the worst action possible. Purely instructive feedback, by contrast, indicates only the correct action to take, independently of the action actually taken; this kind of feedback is the basis of supervised learning. The two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of it. There are also interesting cases that fall in between.

In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case will enable us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.

This chapter studies the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. Most prior work on evaluative feedback has been done in this nonassociative setting, which avoids much of the complexity of the full reinforcement learning problem. Studying this case helps us see clearly how evaluative feedback differs from, and can be combined with, instructive feedback.

The particular nonassociative, evaluative feedback problem that we explore is a simple version of the k-armed bandit problem. We can use this problem to introduce a number of basic learning methods which we extend in later chapters to apply to the full reinforcement learning problem. At the end of this chapter, we take a step closer to the full reinforcement learning problem by discussing what happens when the bandit problem becomes associative, that is, when actions are taken in more than one situation.

The particular nonassociative, evaluative feedback problem we will explore is a simple version of the k-armed bandit problem. This problem is used to introduce a number of basic learning methods that later chapters extend to the full reinforcement learning problem. At the end of the chapter, the bandit problem is made associative, that is, actions are taken in more than one situation, which brings us a step closer to the full reinforcement learning problem.
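Before the summary, a minimal Python sketch may help make the k-armed bandit and the exploration-exploitation trade-off concrete. It implements the epsilon-greedy action-value method with sample-average estimates from Chapter 2, run on a stationary testbed with Gaussian rewards in the style of the book's 10-armed testbed; the function name run_bandit and the parameter choices (k=10, epsilon=0.1, 1000 steps, fixed seed) are illustrative, not prescribed by the text.

import random

def run_bandit(k=10, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy with sample-average estimates on a stationary k-armed bandit."""
    rng = random.Random(seed)
    # True action values q*(a): drawn once from a unit Gaussian; each observed reward
    # is q*(a) plus unit-Gaussian noise, as in the book's 10-armed testbed.
    q_true = [rng.gauss(0.0, 1.0) for _ in range(k)]
    Q = [0.0] * k  # sample-average value estimates Q(a)
    N = [0] * k    # how many times each action has been taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit: current best estimate
        r = rng.gauss(q_true[a], 1.0)              # sample a reward for the chosen action
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental sample-average update
        total += r
    return total / steps

print("average reward per step:", round(run_bandit(), 3))

With epsilon = 0 the agent never explores and can lock onto a mediocre arm forever; with epsilon near 1 it explores almost all the time and wastes reward, which is exactly the trade-off described in the quoted passages.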

Summary:

The multi-armed bandit problem (also called the k-armed bandit problem) is not full reinforcement learning, but a simplified version of it. The book therefore uses the bandit problem as a lead-in to the reinforcement learning problem, and several concepts in reinforcement learning are extensions of concepts introduced here.
