Contents
VS. Value-Based RL
Continuous Action
Stochastic Policy
While value-based RL has to commit to the single action that maximizes the Q-function,
$\arg\max_{a_t} Q(s_t, a_t)$
a policy-based method can instead draw a random sample from the PDF
$P_w(a_t | s_t)$
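A minimal sketch of the two action-selection rules, using hypothetical Q-values and policy logits (the softmax over logits is just one possible way to represent $P_w(a_t | s_t)$):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["Up", "Left", "Right", "Bottom"]

# Value-based: deterministic -- commit to argmax_a Q(s_t, a).
q_values = np.array([0.1, 0.7, 0.3, -0.2])        # hypothetical Q(s_t, .) values
greedy_action = actions[int(np.argmax(q_values))]

# Policy-based: stochastic -- sample a_t ~ P_w(a_t | s_t), here a softmax over logits.
logits = np.array([0.0, 1.2, 1.2, -1.0])          # hypothetical policy outputs for s_t
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # a valid PDF over the four actions
sampled_action = rng.choice(actions, p=probs)

print(greedy_action, sampled_action, probs.round(3))
```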
<aside> 💡 Explanation with a partially observable example (from a lecture by David Silver)
</aside>
Observation $s_t$ is defined as $s_t = [Up, Left, Right, Bottom]^T$ (each entry is 1 if there is a wall in that direction, 0 otherwise).
Action $a_t$ is one of $\{Up, Left, Right, Bottom\}$.
Note that this is a partially observable environment: $s_1 = s_3 = [1, 0, 0, 1]^T$, so the agent cannot tell these two cells apart from its observation.
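To make the aliasing concrete, here is a tiny sketch of the wall-indicator observations. Only $s_1 = s_3 = [1, 0, 0, 1]^T$ is stated above; the other rows are assumed from the layout in the lecture figure (Die cells below the two end cells, treasure below the middle cell):

```python
obs = {                      # [Up, Left, Right, Bottom], 1 = wall
    0: (1, 1, 0, 0),         # s_0: left wall, Die cell below (assumed layout)
    1: (1, 0, 0, 1),         # s_1: walls above and below
    2: (1, 0, 0, 0),         # s_2: treasure cell below (assumed layout)
    3: (1, 0, 0, 1),         # s_3: walls above and below
    4: (1, 0, 1, 0),         # s_4: right wall, Die cell below (assumed layout)
}
print(obs[1] == obs[3])      # True: the agent cannot tell s_1 and s_3 apart
```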
Value-based learning (Q-learning)
Suppose that after some learning we find $Q(s_1, Left) > Q(s_1, Right)$.
If the agent is placed in $s_1$, it moves left according to the Q-function; at $s_0$ it moves right to avoid “Die”, arriving back at $s_1$, where it moves left again… an endless loop.
Conversely, if training ends up with $Q(s_1, Left) < Q(s_1, Right)$, the same loop appears around $s_3$ instead.
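To see the loop concretely, a hedged simulation: the layout (Die below $s_0$ and $s_4$, treasure below $s_2$, walls below the aliased cells $s_1$ and $s_3$) is assumed from the lecture example, and the greedy observation-based policy prefers Left in the aliased observation, as in the case above:

```python
UP, LEFT, RIGHT, BOTTOM = 0, 1, 2, 3

# [Up, Left, Right, Bottom] wall indicators for top-row cells s_0 .. s_4
OBS = {
    0: (1, 1, 0, 0),   # s_0: Die cell below (assumed)
    1: (1, 0, 0, 1),   # s_1: aliased
    2: (1, 0, 0, 0),   # s_2: treasure cell below (assumed)
    3: (1, 0, 0, 1),   # s_3: aliased (same observation as s_1)
    4: (1, 0, 1, 0),   # s_4: Die cell below (assumed)
}

def step(cell, action):
    """Move along the top row; Bottom from s_0/s_4 means Die, from s_2 the treasure."""
    if action == LEFT and OBS[cell][LEFT] == 0:
        return cell - 1, None
    if action == RIGHT and OBS[cell][RIGHT] == 0:
        return cell + 1, None
    if action == BOTTOM and OBS[cell][BOTTOM] == 0:
        return cell, "treasure" if cell == 2 else "die"
    return cell, None          # bumped into a wall: stay put

def greedy_policy(obs):
    """Deterministic policy that only sees the observation, not the true cell."""
    return {
        (1, 1, 0, 0): RIGHT,   # s_0: step away from Die
        (1, 0, 0, 1): LEFT,    # aliased cells: always Left (assumed Q preference)
        (1, 0, 0, 0): BOTTOM,  # s_2: collect the treasure
        (1, 0, 1, 0): LEFT,    # s_4: step away from Die
    }[obs]

def rollout(start, policy, max_steps=20):
    cell = start
    for t in range(max_steps):
        cell, outcome = step(cell, policy(OBS[cell]))
        if outcome is not None:
            return outcome, t + 1
    return "still looping", max_steps

print(rollout(3, greedy_policy))   # ('treasure', 2): the right side is fine
print(rollout(1, greedy_policy))   # ('still looping', 20): endless s_1 <-> s_0 loop
```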
Policy-based learning
A stochastic policy can instead learn $P(Left | s_1) = P(Right | s_1) = 1/2$ (and the same for $s_3$, since the observations are identical), so the agent escapes the aliased cells within a finite expected number of steps.
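A matching sketch under the same assumed layout, now flipping a fair coin between Left and Right whenever the aliased observation is seen; the average number of steps to the treasure stays small from either side:

```python
import random

LEFT, RIGHT, BOTTOM = 1, 2, 3
ALIASED = (1, 0, 0, 1)                 # the shared observation of s_1 and s_3
OBS = {0: (1, 1, 0, 0), 1: ALIASED, 2: (1, 0, 0, 0), 3: ALIASED, 4: (1, 0, 1, 0)}

def stochastic_policy(obs, rng):
    if obs == ALIASED:
        return rng.choice([LEFT, RIGHT])            # P(Left|s) = P(Right|s) = 1/2
    return {(1, 1, 0, 0): RIGHT,                    # s_0: step away from Die
            (1, 0, 0, 0): BOTTOM,                   # s_2: collect the treasure
            (1, 0, 1, 0): LEFT}[obs]                # s_4: step away from Die

def steps_to_treasure(start, rng):
    cell, steps = start, 0
    while True:
        action = stochastic_policy(OBS[cell], rng)
        steps += 1
        if action == BOTTOM:                        # only chosen at s_2 (treasure)
            return steps
        cell += -1 if action == LEFT else 1

rng = random.Random(0)
for start in (1, 3):
    mean = sum(steps_to_treasure(start, rng) for _ in range(10_000)) / 10_000
    print(f"start s_{start}: about {mean:.1f} steps on average")   # ~4 for both sides
```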
Better convergence properties
Effective in high-dimensional or continuous action spaces