
Intro to Policy-Based RL

VS. Value-Based RL

Why Policy-Based?

  1. Continuous actions

  2. Stochastic Policy

    While a value-based method has to choose the one specific action that maximizes the Q function

                                    $\arg\max_{a_t} Q(s_t, a_t)$

    a policy-based method can instead sample an action at random from the PDF

                                                $P_w(a_t | s_t)$

    <aside> 💡 Explanation with a partially observable example (from a lecture by David Silver)


    Screenshot 2022-07-21 5.39.34 PM.png

    Observation $s_t$ is defined as $s_t = [Up, Left, Right, Bottom]^T$ (each entry is 1 if there is a wall in that direction, 0 if not).

    Action $a_t = [Up, Left, Right, Bottom]^T$

    Note that this is a partially observable environment: $s_1 = s_3 = [1, 0, 0, 1]^T$, so the agent cannot tell these two states apart.

    Value-based learning (Q-learning)

    Suppose that after some learning we find $Q(s_1, Left) > Q(s_1, Right)$.

    If the agent is placed in $s_1$, it moves left according to the Q function. At $s_0$ it moves right to avoid “Die”, and back at $s_1$ it moves left again, and so on in an endless loop.

    If training instead yields $Q(s_1, Left) < Q(s_1, Right)$, the same kind of loop occurs at $s_3$, because $s_3$ looks identical to $s_1$.

    Policy-based learning

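    $P(Left | s_1) = P(Right | s_3) = 1/2$

    Because $s_1 = s_3$, a stochastic policy chooses Left or Right with equal probability in both states, so the agent cannot get stuck in a deterministic loop.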

  3. Better convergence properties

  4. Effective in high-dimensional or continuous action spaces
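
The partially observable example under item 2 can be illustrated with a small simulation. This is only a minimal sketch under assumptions that are not in the notes: the environment is taken to be a corridor of five cells 0 to 4, the goal is assumed to lie below cell 2, and the names `run_episode`, `greedy_left`, `greedy_right`, and `stochastic` are hypothetical helpers. Cells 1 and 3 play the role of $s_1$ and $s_3$: they produce the same observation, so one policy has to serve both.

```python
import random

# Assumed layout (not from the notes): corridor cells 0..4 in the top row.
# "Die" lies below cells 0 and 4, the goal lies below cell 2.
# Cells 1 and 3 produce the same observation, like s_1 and s_3.
GOAL_CELL = 2


def run_episode(aliased_policy, start, max_steps=1000, rng=None):
    """Return the number of steps needed to reach the goal, or None."""
    rng = rng or random.Random()
    pos = start
    for step in range(1, max_steps + 1):
        if pos == GOAL_CELL:
            return step          # moving Down from cell 2 reaches the goal
        if pos == 0:
            pos = 1              # move Right to avoid "Die"
        elif pos == 4:
            pos = 3              # move Left to avoid "Die"
        else:
            # Aliased cell: the agent cannot tell cell 1 from cell 3,
            # so the same decision rule is applied in both.
            move = aliased_policy(rng)
            pos += -1 if move == "Left" else 1
    return None                  # goal not reached within max_steps


def greedy_left(rng):
    return "Left"                # argmax when Q(s_1, Left) > Q(s_1, Right)


def greedy_right(rng):
    return "Right"               # argmax when Q(s_1, Left) < Q(s_1, Right)


def stochastic(rng):
    return rng.choice(["Left", "Right"])   # P(Left) = P(Right) = 1/2


if __name__ == "__main__":
    for name, policy in [("greedy_left", greedy_left),
                         ("greedy_right", greedy_right),
                         ("stochastic", stochastic)]:
        for start in (1, 3):
            steps = run_episode(policy, start, rng=random.Random(0))
            print(name, "start", start, "->", steps)
```

Under these assumptions, each greedy policy reaches the goal from one of the aliased cells but loops forever from the other, while the stochastic policy reaches the goal from both, only after a random number of steps.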