
Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning

Study notes for:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

If the mathematical notation is hard to follow, start with the notation reference.

Temporal-Difference Learning in Brief

Temporal-difference learning combines ideas from dynamic programming and Monte Carlo methods, and is a central idea of reinforcement learning.

The term "temporal difference" is not very intuitive. It may help to read it as "learning from the current difference": the method learns from the difference (error) observed at the current time step.

A Monte Carlo method simulates (or experiences) an episode and, only after the episode has ended, uses the values observed along the episode to estimate the state values.
Temporal-difference learning also simulates (or experiences) an episode, but after each step (or each few steps) it updates the estimate of the preceding state from the value of the new state.
A Monte Carlo method can thus be viewed as temporal-difference learning with the maximum number of steps.
This chapter considers only one-step temporal-difference learning; multi-step temporal-difference learning is covered in the next chapter.

Mathematical Formulation
From what we already know: if we can compute the value of a policy \(\pi\) (the state value \(v_{\pi}(s)\) or the action value \(q_{\pi}(s, a)\)), we can improve the policy.
In Monte Carlo methods, evaluating a policy requires completing an episode and using the episode's return \(G_t\) as the target for the state value. The update formula is:
Formula Monte Carlo
\[
V(S_t) \gets V(S_t) + \alpha \delta_t \\
\delta_t = [G_t - V(S_t)] \\
where \\
\delta_t \text{ - Monte Carlo error} \\
\alpha \text{ - learning step size}
\]

The idea of temporal difference is to estimate a state's value from the value of the next state, which yields an iterative (bootstrapping) update formula:
Formula TD(0)
\[
V(S_t) \gets V(S_t) + \alpha \delta_t \\
\delta_t = [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \\
where \\
\delta_t \text{ - TD error} \\
\alpha \text{ - learning step size} \\
\gamma \text{ - reward discount rate}
\]

Note: the book points out that the TD error is not exact, whereas the Monte Carlo error is exact. This is worth being aware of, but it is not elaborated here.
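
To make the two updates concrete, here is a minimal Python sketch comparing a single Monte Carlo update with a single TD(0) update; all numbers are made-up toy values, not from the book:

    # Toy comparison of the two updates above (made-up numbers).
    alpha, gamma = 0.1, 0.9

    V_s = 0.5                      # current estimate V(S_t)

    # Monte Carlo: move V(S_t) toward the full return G_t observed at episode end.
    G_t = 2.0                      # return observed after the episode finished
    mc_error = G_t - V_s           # Monte Carlo error
    V_mc = V_s + alpha * mc_error  # -> 0.65

    # TD(0): move V(S_t) toward R_{t+1} + gamma * V(S_{t+1}) after a single step.
    R_next, V_s_next = 1.0, 0.8
    td_error = R_next + gamma * V_s_next - V_s   # TD error
    V_td = V_s + alpha * td_error                # -> 0.622

    print(V_mc, V_td)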

Temporal-Difference Learning Methods

This chapter presents the one-step temporal-difference learning methods. The multi-step methods are introduced in the next chapter.

  • TD learning of the policy's state value \(v_{\pi}\) (one-step / multi-step)
  • On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa (one-step / multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning (one-step)
  • Double Q-learning (one-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\) with importance sampling: Sarsa (multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\) without importance sampling: Tree Backup Algorithm (multi-step)
  • Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\) (multi-step)

TD learning of the policy's state value \(v_{\pi}\)

One-step TD learning: TD(0)

  • Flowchart

    [Flowchart: TD(0) backup. From state \(S\), take action \(A\), observe \(R\) and \(S'\); \(V(S)\) is updated from \(R\) and \(V(S')\).]

  • Algorithm

    Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}^+\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
       \(A \gets\) action given by \(\pi\) for \(S\)
       Take action \(A\), observe \(R, S'\)
       \(V(S) \gets V(S) + \alpha [R + \gamma V(S') - V(S)]\)
       \(S \gets S'\)
      Until S is terminal
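
A minimal Python sketch of this TD(0) prediction loop. The `env` object with a `reset()`/`step(action)` interface returning `(next_state, reward, done)` and the policy function `pi(state)` are assumptions for illustration, not part of the book's pseudocode:

    import collections

    def td0_prediction(env, pi, num_episodes=1000, alpha=0.1, gamma=1.0):
        """Tabular TD(0) policy evaluation: estimate V(s) for a fixed policy pi.

        Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
        """
        V = collections.defaultdict(float)        # V(s) initialized to 0 for every s
        for _ in range(num_episodes):
            S = env.reset()
            done = False
            while not done:
                A = pi(S)                         # action given by pi for S
                S_next, R, done = env.step(A)     # take action A, observe R, S'
                target = R if done else R + gamma * V[S_next]   # V(terminal) = 0
                V[S] += alpha * (target - V[S])   # TD(0) update
                S = S_next
        return V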

Multi-step TD learning

  • Flowchart

    [Flowchart: n-step TD backup. From state \(S\), \(n\) steps of actions and rewards lead to \(S_n\); \(V(S)\) is updated toward the n-step return, which bootstraps from \(V(S_n)\).]

  • Algorithm

    Input: the policy \(\pi\) to be evaluated
    Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}\)
    Parameters: step size \(\alpha \in (0, 1]\), a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne terminal\)
      \(T \gets \infty\)
      For \(t = 0,1,2,\cdots\):
       If \(t < T\), then:
        Take an action according to \(\pi(\cdot | S_t)\)
        Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
        If \(S_{t+1}\) is terminal, then \(T \gets t+1\)
       \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
       If \(\tau \ge 0\):
        \(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
        if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n}V(S_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
        \(V(S_{\tau}) \gets V(S_{\tau}) + \alpha [G - V(S_{\tau})]\)
      Until \(\tau = T - 1\)

Note that the updated \(V(S_0)\) is computed from the old \(V(S_0)\) and \(V(S_n)\) (through the n-step return); similarly, \(V(S_1)\) is computed from \(V(S_1)\) and \(V(S_{n+1})\), and so on.
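
A minimal sketch of the n-step return \(G_{\tau}^{(n)}\) at the heart of this update. The container names (`rewards`, `states`, `V`) and the 1-based indexing follow the pseudocode above and are assumptions for illustration:

    def n_step_return(rewards, states, V, tau, n, T, gamma):
        """n-step return G used to update V(S_tau).

        rewards[i] is R_i and states[i] is S_i (rewards[0] unused),
        T is the terminal time step, V maps states to value estimates.
        """
        G = sum(gamma ** (i - tau - 1) * rewards[i]
                for i in range(tau + 1, min(tau + n, T) + 1))
        if tau + n < T:                 # bootstrap only from a non-terminal state
            G += gamma ** n * V[states[tau + n]]
        return G                        # then: V[S_tau] += alpha * (G - V[S_tau])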

On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa

One-step TD learning

  • Flowchart

    [Flowchart: Sarsa backup. From \((S, A)\), observe \(R\) and \(S'\), choose \(A'\); \(Q(S, A)\) is updated from \(R\) and \(Q(S', A')\).]

  • Algorithm

    Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal-state}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
      Repeat (for each step of episode):
       Take action \(A\), observe \(R, S'\)
       Choose \(A'\) from \(S'\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
       \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\)
       \(S \gets S'; A \gets A';\)
      Until S is terminal
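
A minimal Python sketch of one-step Sarsa control, reusing the assumed Gym-style `env` interface from the TD(0) sketch plus a finite list `actions` (illustrative names only):

    import random
    import collections

    def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
        """Tabular one-step Sarsa (on-policy TD control)."""
        Q = collections.defaultdict(float)            # Q(s, a); terminal pairs stay 0

        def epsilon_greedy(S):                        # policy derived from Q
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(S, a)])

        for _ in range(num_episodes):
            S = env.reset()
            A = epsilon_greedy(S)
            done = False
            while not done:
                S_next, R, done = env.step(A)
                if done:
                    target = R                        # Q(terminal, .) = 0
                else:
                    A_next = epsilon_greedy(S_next)   # choose A' from S'
                    target = R + gamma * Q[(S_next, A_next)]
                Q[(S, A)] += alpha * (target - Q[(S, A)])
                if not done:
                    S, A = S_next, A_next
        return Q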

Multi-step TD learning

  • Flowchart

    [Flowchart: n-step Sarsa backup. From \((S, A)\), \(n\) steps of actions and rewards lead to \((S_n, A_n)\); \(Q(S, A)\) is updated toward the n-step return, which bootstraps from \(Q(S_n, A_n)\).]

  • Algorithm

    Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne terminal\)
      Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
      \(T \gets \infty\)
      For \(t = 0,1,2,\cdots\):
       If \(t < T\), then:
        Take an action \(A_t\)
        Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
        If \(S_{t+1}\) is terminal, then:
         \(T \gets t+1\)
        Else:
         Select and store an action \(A_{t+1} \sim \pi(\cdot | S_{t+1})\)
       \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
       If \(\tau \ge 0\):
        \(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
        if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
        \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
        If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
      Until \(\tau = T - 1\)
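
Compared with the n-step state-value return sketched earlier, the only change is the bootstrap term, which now uses the stored action as well (same illustrative names, plus assumed `actions_taken` and `Q` containers):

    # n-step Sarsa: bootstrap with Q(S_{tau+n}, A_{tau+n}) instead of V(S_{tau+n}).
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions_taken[tau + n])]
    Q[(states[tau], actions_taken[tau])] += alpha * (G - Q[(states[tau], actions_taken[tau])])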

Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning

Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning using the following update:
\[
Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
\]

One-step TD learning

  • Algorithm

    Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal-state}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
       Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
       Take action \(A\), observe \(R, S'\)
       \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]\)
       \(S \gets S';\)
      Until S is terminal
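
A minimal Python sketch of one-step Q-learning with the same assumed `env`/`actions` interface as in the Sarsa sketch; the behavior policy is \(\epsilon\)-greedy while the update target is greedy (max):

    import random
    import collections

    def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
        """Tabular one-step Q-learning (off-policy TD control)."""
        Q = collections.defaultdict(float)

        def epsilon_greedy(S):                        # behavior policy
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(S, a)])

        for _ in range(num_episodes):
            S = env.reset()
            done = False
            while not done:
                A = epsilon_greedy(S)
                S_next, R, done = env.step(A)
                best_next = 0.0 if done else max(Q[(S_next, a)] for a in actions)
                Q[(S, A)] += alpha * (R + gamma * best_next - Q[(S, A)])   # greedy target
                S = S_next
        return Q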

  • Q-learning uses the max operator, which introduces the maximization bias problem.
    For details, see Example 6.7 in the book.
    Double Q-learning eliminates this problem.

Double Q-learning

One-step TD learning

  • Algorithm

    Initialize \(Q_1(s, a)\) and \(Q_2(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily
    Initialize \(Q_1(\text{terminal-state}, \cdot) = Q_2(\text{terminal-state}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
       Choose \(A\) from \(S\) using policy derived from \(Q_1\) and \(Q_2\) (e.g. \(\epsilon\)-greedy)
       Take action \(A\), observe \(R, S'\)
       With 0.5 probability:
        \(Q_1(S, A) \gets Q_1(S, A) + \alpha [R + \gamma Q_2(S', \underset{a}{argmax} \ Q_1(S', a)) - Q_1(S, A)]\)
       Else:
        \(Q_2(S, A) \gets Q_2(S, A) + \alpha [R + \gamma Q_1(S', \underset{a}{argmax} \ Q_2(S', a)) - Q_2(S, A)]\)
       \(S \gets S';\)
      Until S is terminal
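
A minimal Python sketch of Double Q-learning with the same assumed interface; the key point is that on each step one table selects the argmax action while the other evaluates it:

    import random
    import collections

    def double_q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
        """Tabular Double Q-learning: two tables decorrelate selection and evaluation."""
        Q1 = collections.defaultdict(float)
        Q2 = collections.defaultdict(float)

        def epsilon_greedy(S):                         # act on Q1 + Q2
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q1[(S, a)] + Q2[(S, a)])

        for _ in range(num_episodes):
            S = env.reset()
            done = False
            while not done:
                A = epsilon_greedy(S)
                S_next, R, done = env.step(A)
                Qa, Qb = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
                if done:
                    target = R
                else:
                    a_star = max(actions, key=lambda a: Qa[(S_next, a)])  # Qa selects
                    target = R + gamma * Qb[(S_next, a_star)]             # Qb evaluates
                Qa[(S, A)] += alpha * (target - Qa[(S, A)])
                S = S_next
        return Q1, Q2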

Off-policy TD learning of the policy's action value \(q_{\pi}\) (with importance sampling): Sarsa

By taking importance sampling into account and substituting \(\rho\) into the Sarsa algorithm, we obtain an off-policy method.
\(\rho\) - the importance sampling ratio:
\[
\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})
\]
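
A minimal sketch of computing \(\rho\) and applying it in the update; `pi_prob(a, s)` and `mu_prob(a, s)` are assumed helpers returning the action probabilities under the target policy \(\pi\) and the behavior policy \(\mu\):

    def importance_ratio(pi_prob, mu_prob, states, actions_taken, tau, n, T):
        """rho = product over i of pi(A_i | S_i) / mu(A_i | S_i), as in the formula above."""
        rho = 1.0
        for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
            rho *= pi_prob(actions_taken[i], states[i]) / mu_prob(actions_taken[i], states[i])
        return rho

    # The off-policy n-step Sarsa update then scales the correction by rho:
    #   Q[(S_tau, A_tau)] += alpha * rho * (G - Q[(S_tau, A_tau)])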

Multi-step TD learning

  • Algorithm

    Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
    Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne terminal\)
      Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
      \(T \gets \infty\)
      For \(t = 0,1,2,\cdots\):
       If \(t < T\), then:
        Take an action \(A_t\)
        Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
        If \(S_{t+1}\) is terminal, then:
         \(T \gets t+1\)
        Else:
         Select and store an action \(A_{t+1} \sim \mu(\cdot | S_{t+1})\)
       \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
       If \(\tau \ge 0\):
        \(\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})\)
        \(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
        if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
        \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
        If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
      Until \(\tau = T - 1\)

Expected Sarsa

  • Flowchart

    [Flowchart: Expected Sarsa backup. Same as the n-step Sarsa backup, except that the final bootstrap term is the expectation \(\sum_a \pi(a|S_n) Q(S_n, a)\) instead of \(Q(S_n, A_n)\).]

  • Algorithm
    Omitted.
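
For reference, a minimal sketch of the one-step Expected Sarsa update, again using the assumed `pi_prob`, `Q`, and `actions` names:

    # One-step Expected Sarsa: the target averages over next actions under pi,
    # instead of using the single sampled action A' as in Sarsa.
    expected_q = sum(pi_prob(a, S_next) * Q[(S_next, a)] for a in actions)
    Q[(S, A)] += alpha * (R + gamma * expected_q - Q[(S, A)])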

Off-policy TD learning of the policy's action value \(q_{\pi}\) (without importance sampling): Tree Backup Algorithm

The idea of the Tree Backup algorithm is to use the expected action value at every step.
Taking the expectation over action values means that every possible action \(a\) is evaluated (weighted by its probability under \(\pi\)).

Multi-step TD learning

  • Flowchart

    [Flowchart: Tree Backup. At every intermediate state the backup uses the expectation \(\sum_a \pi(a|\cdot) Q(\cdot, a)\); only the branch of the action actually taken is expanded further.]

  • Algorithm

    Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne terminal\)
      Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
      \(Q_0 \gets Q(S_0, A_0)\)
      \(T \gets \infty\)
      For \(t = 0,1,2,\cdots\):
       If \(t < T\), then:
        Take an action \(A_t\)
        Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
        If \(S_{t+1}\) is terminal, then:
         \(T \gets t+1\)
         \(\delta_t \gets R_{t+1} - Q_t\)
        Else:
         \(\delta_t \gets R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
         Select arbitrarily and store an action as \(A_{t+1}\)
         \(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
         \(\pi_{t+1} \gets \pi(A_{t+1} | S_{t+1})\)
       \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
       If \(\tau \ge 0\):
        \(E \gets 1\)
        \(G \gets Q_{\tau}\)
        For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
         \(G \gets G + E \delta_k\)
         \(E \gets \gamma E \pi_{k+1}\)
        \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
        If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
      Until \(\tau = T - 1\)
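
An equivalent way to read this backup is through the recursive definition of the tree-backup return: actions that were not taken contribute their current estimates weighted by \(\pi\), and only the taken action's branch is expanded. A minimal recursive sketch with illustrative names (not the book's incremental \(\delta\)-based pseudocode above):

    def tree_backup_return(t, h, rewards, states, actions_taken, actions, Q, pi_prob, gamma, T):
        """Tree-backup return G_{t:h}; rewards[i] = R_i, states[i] = S_i, actions_taken[i] = A_i."""
        if t + 1 >= T:                    # the episode ends with this transition
            return rewards[t + 1]
        S1 = states[t + 1]
        if t + 1 == h:                    # backup horizon reached: expected-value leaf
            return rewards[t + 1] + gamma * sum(
                pi_prob(a, S1) * Q[(S1, a)] for a in actions)
        A1 = actions_taken[t + 1]
        untaken = sum(pi_prob(a, S1) * Q[(S1, a)] for a in actions if a != A1)
        deeper = tree_backup_return(t + 1, h, rewards, states, actions_taken,
                                    actions, Q, pi_prob, gamma, T)
        return rewards[t + 1] + gamma * (untaken + pi_prob(A1, S1) * deeper)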

Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\)

\(Q(\sigma)\) combines the Sarsa (with importance sampling), Expected Sarsa, and Tree Backup algorithms, and takes importance sampling into account.
When \(\sigma = 1\), it behaves like Sarsa with importance sampling.
When \(\sigma = 0\), it behaves like Tree Backup, using the expected action value.
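
A minimal sketch of the blended one-step TD error inside \(Q(\sigma)\) (illustrative names): \(\sigma = 1\) recovers the sampled, Sarsa-style backup and \(\sigma = 0\) the expected, Tree-Backup-style backup:

    # Q(sigma) blends the sampled action value (Sarsa-style) with the expectation
    # over actions under pi (Tree-Backup-style) when forming the TD error delta_t.
    sampled = Q[(S_next, A_next)]
    expected = sum(pi_prob(a, S_next) * Q[(S_next, a)] for a in actions)
    delta = R + gamma * (sigma * sampled + (1 - sigma) * expected) - Q[(S, A)]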

Multi-step TD learning

  • Flowchart

    [Flowchart: \(Q(\sigma)\) backup. At each step, \(\sigma = 1\) uses the sampled action value \(Q(\cdot, \cdot)\) (Sarsa-style), while \(\sigma = 0\) uses the expectation \(\sum_a \pi(a|\cdot) Q(\cdot, a)\) (Tree-Backup-style).]

  • Algorithm

    Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
    Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne terminal\)
      Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
      \(Q_0 \gets Q(S_0, A_0)\)
      \(T \gets \infty\)
      For \(t = 0,1,2,\cdots\):
       If \(t < T\), then:
        Take an action \(A_t\)
        Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
        If \(S_{t+1}\) is terminal, then:
         \(T \gets t+1\)
         \(\delta_t \gets R_{t+1} - Q_t\)
        Else:
         Select and store an action \(A_{t+1} \sim \mu(\cdot|S_{t+1})\)
         Select and store \(\sigma_{t+1}\)
         \(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
         \(\delta_t \gets R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1})\sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
         \(\pi_{t+1} \gets \pi(A_{t+1} | S_{t+1})\)
         \(\rho_{t+1} \gets \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}\)
       \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
       If \(\tau \ge 0\):
        \(\rho \gets 1\)
        \(E \gets 1\)
        \(G \gets Q_{\tau}\)
        For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
         \(G \gets G + E \delta_k\)
         \(E \gets \gamma E [(1 - \sigma_{k+1})\pi_{k+1} + \sigma_{k+1}]\)
         \(\rho \gets \rho(1 - \sigma_{k} + \sigma_{k}\rho_{k})\)
        \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
        If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
      Until \(\tau = T - 1\)

Summary

A limitation of temporal-difference learning methods: reward information must be available within the learning steps.
For example, can a meaningful reward be computed after every single move in chess? With a Monte Carlo method, which simulates all the way to the end of the game, a reward (the game's outcome) is certainly available.
