Contents
VS. Value-Based RL
Continuous Action
Stochastic Policy
While value-based RL has to commit to the single action that maximizes the Q-function,
$\arg\max_{a_t} Q(s_t, a_t)$
a policy-based method can instead draw a random sample from the PDF
$P_w(a_t | s_t)$
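A minimal sketch of the two action-selection rules, using hypothetical Q-values and policy logits (the softmax over logits is just one possible way to represent $P_w(a_t | s_t)$):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["Up", "Left", "Right", "Bottom"]

# Value-based: deterministic -- commit to argmax_a Q(s_t, a).
q_values = np.array([0.1, 0.7, 0.3, -0.2])        # hypothetical Q(s_t, .) values
greedy_action = actions[int(np.argmax(q_values))]

# Policy-based: stochastic -- sample a_t ~ P_w(a_t | s_t), here a softmax over logits.
logits = np.array([0.0, 1.2, 1.2, -1.0])          # hypothetical policy outputs for s_t
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # a valid PDF over the four actions
sampled_action = rng.choice(actions, p=probs)

print(greedy_action, sampled_action, probs.round(3))
```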
<aside> 💡 Explanation with a partially observable example (from a lecture by David Silver)
</aside>
Observation $s_t$ is defined as $s_t = [Up, Left, Right, Bottom]^T$ (each entry is 1 if there is a wall in that direction, 0 otherwise).
Action $a_t$ is one of $\{Up, Left, Right, Bottom\}$.
Note that this is a partially observable environment: $s_1 = s_3 = [1, 0, 0, 1]^T$, so the agent cannot tell these two cells apart from its observation.
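To make the aliasing concrete, here is a tiny sketch of the wall-indicator observations. Only $s_1 = s_3 = [1, 0, 0, 1]^T$ is stated above; the other rows are assumed from the layout in the lecture figure (Die cells below the two end cells, treasure below the middle cell):

```python
obs = {                      # [Up, Left, Right, Bottom], 1 = wall
    0: (1, 1, 0, 0),         # s_0: left wall, Die cell below (assumed layout)
    1: (1, 0, 0, 1),         # s_1: walls above and below
    2: (1, 0, 0, 0),         # s_2: treasure cell below (assumed layout)
    3: (1, 0, 0, 1),         # s_3: walls above and below
    4: (1, 0, 1, 0),         # s_4: right wall, Die cell below (assumed layout)
}
print(obs[1] == obs[3])      # True: the agent cannot tell s_1 and s_3 apart
```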
Value-based learning (Q-learning)
Suppose that after some learning we find $Q(s_1, Left) > Q(s_1, Right)$.
If the agent is placed in $s_1$, it moves left according to the Q-function; at $s_0$ it moves right to avoid “Die”, arriving back at $s_1$, where it moves left again… an endless loop.
Conversely, if training ends up with $Q(s_1, Left) < Q(s_1, Right)$, the same loop appears around $s_3$ instead.
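To see the loop concretely, a hedged simulation: the layout (Die below $s_0$ and $s_4$, treasure below $s_2$, walls below the aliased cells $s_1$ and $s_3$) is assumed from the lecture example, and the greedy observation-based policy prefers Left in the aliased observation, as in the case above:

```python
UP, LEFT, RIGHT, BOTTOM = 0, 1, 2, 3

# [Up, Left, Right, Bottom] wall indicators for top-row cells s_0 .. s_4
OBS = {
    0: (1, 1, 0, 0),   # s_0: Die cell below (assumed)
    1: (1, 0, 0, 1),   # s_1: aliased
    2: (1, 0, 0, 0),   # s_2: treasure cell below (assumed)
    3: (1, 0, 0, 1),   # s_3: aliased (same observation as s_1)
    4: (1, 0, 1, 0),   # s_4: Die cell below (assumed)
}

def step(cell, action):
    """Move along the top row; Bottom from s_0/s_4 means Die, from s_2 the treasure."""
    if action == LEFT and OBS[cell][LEFT] == 0:
        return cell - 1, None
    if action == RIGHT and OBS[cell][RIGHT] == 0:
        return cell + 1, None
    if action == BOTTOM and OBS[cell][BOTTOM] == 0:
        return cell, "treasure" if cell == 2 else "die"
    return cell, None          # bumped into a wall: stay put

def greedy_policy(obs):
    """Deterministic policy that only sees the observation, not the true cell."""
    return {
        (1, 1, 0, 0): RIGHT,   # s_0: step away from Die
        (1, 0, 0, 1): LEFT,    # aliased cells: always Left (assumed Q preference)
        (1, 0, 0, 0): BOTTOM,  # s_2: collect the treasure
        (1, 0, 1, 0): LEFT,    # s_4: step away from Die
    }[obs]

def rollout(start, policy, max_steps=20):
    cell = start
    for t in range(max_steps):
        cell, outcome = step(cell, policy(OBS[cell]))
        if outcome is not None:
            return outcome, t + 1
    return "still looping", max_steps

print(rollout(3, greedy_policy))   # ('treasure', 2): the right side is fine
print(rollout(1, greedy_policy))   # ('still looping', 20): endless s_1 <-> s_0 loop
```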
Policy-based learning
A stochastic policy can instead learn $P(Left | s_1) = P(Right | s_1) = 1/2$ (and the same for $s_3$, since the observations are identical), so the agent escapes the aliased cells within a finite expected number of steps.
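A matching sketch under the same assumed layout, now flipping a fair coin between Left and Right whenever the aliased observation is seen; the average number of steps to the treasure stays small from either side:

```python
import random

LEFT, RIGHT, BOTTOM = 1, 2, 3
ALIASED = (1, 0, 0, 1)                 # the shared observation of s_1 and s_3
OBS = {0: (1, 1, 0, 0), 1: ALIASED, 2: (1, 0, 0, 0), 3: ALIASED, 4: (1, 0, 1, 0)}

def stochastic_policy(obs, rng):
    if obs == ALIASED:
        return rng.choice([LEFT, RIGHT])            # P(Left|s) = P(Right|s) = 1/2
    return {(1, 1, 0, 0): RIGHT,                    # s_0: step away from Die
            (1, 0, 0, 0): BOTTOM,                   # s_2: collect the treasure
            (1, 0, 1, 0): LEFT}[obs]                # s_4: step away from Die

def steps_to_treasure(start, rng):
    cell, steps = start, 0
    while True:
        action = stochastic_policy(OBS[cell], rng)
        steps += 1
        if action == BOTTOM:                        # only chosen at s_2 (treasure)
            return steps
        cell += -1 if action == LEFT else 1

rng = random.Random(0)
for start in (1, 3):
    mean = sum(steps_to_treasure(start, rng) for _ in range(10_000)) / 10_000
    print(f"start s_{start}: about {mean:.1f} steps on average")   # ~4 for both sides
```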
Better convergence properties
Effective in high-dimensional or continuous action spaces