What is the Bellman operator in reinforcement learning?

In mathematics, the word operator can refer to several distinct but related concepts. An operator can be defined as a function between two vector spaces, as a function whose domain and codomain are the same, or as a function from functions (which are vectors) to other functions (for example, the differential operator), that is, a higher-order function (if you are familiar with functional programming).

What is the Bellman operator in reinforcement learning (RL)? Why do we need it? How is the Bellman operator related to the Bellman equations in RL?

reinforcement-learning terminology math

— nbro
source

Some papers related to this topic are Feature-Based Methods for Large Scale Dynamic Programming (by John N. Tsitsiklis and Benjamin Van Roy, 1996), An Analysis of Temporal-Difference Learning with Function Approximation (by John N. Tsitsiklis and Benjamin Van Roy, 1997) and Least-Squares Policy Iteration (by Michail G. Lagoudakis and Ronald Parr, 2003).

— nbro

Some other related papers that I found are Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms (by Csaba Szepesvári and Michael L. Littman, 1997) and $\epsilon$-MDPs: Learning in Varying Environments (by István Szita, Bálint Takács, András Lörincz, 2002).

— nbro

The notation I'll be using is from two different lectures by David Silver and is also informed by these slides.

The expected Bellman equation is

$$v_\pi(s) = \sum_{a\in \mathcal{A}} \pi(a|s) \left(\mathcal{R}_s^a + \gamma\sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')\right) \tag{1}$$

If we let

$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s)\mathcal{P}_{ss'}^a \tag{2}$$

and

$$\mathcal{R}_{s}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s)\mathcal{R}_{s}^a \tag{3}$$

then we can rewrite $(1)$ as

$$v_\pi(s) = \mathcal{R}_s^\pi + \gamma\sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^\pi v_\pi(s') \tag{4}$$
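To make the averaging in $(2)$ and $(3)$ concrete, here is a minimal NumPy sketch. The array layout (`P[a, s, t]`, `R[a, s]`, `pi[s, a]`), the function name `policy_averaged_model`, and the 2-state, 2-action MDP are assumptions chosen for illustration; none of them come from the lectures or slides.

```python
import numpy as np

def policy_averaged_model(P, R, pi):
    """Collapse per-action dynamics and rewards into P^pi and R^pi.

    P  : shape (A, S, S), P[a, s, t] = probability of moving s -> t under action a
    R  : shape (A, S),    R[a, s]    = expected immediate reward for action a in state s
    pi : shape (S, A),    pi[s, a]   = probability of choosing action a in state s
    """
    P_pi = np.einsum("sa,ast->st", pi, P)  # eq. (2): sum over actions
    R_pi = np.einsum("sa,as->s", pi, R)    # eq. (3): sum over actions
    return P_pi, R_pi

# Hypothetical toy MDP (values made up for the example).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # action 1
R = np.array([[1.0, 0.0],                 # action 0
              [0.0, 2.0]])                # action 1
pi = np.array([[0.5, 0.5],                # state 0
               [0.1, 0.9]])               # state 1

P_pi, R_pi = policy_averaged_model(P, R, pi)
```

Each row of `P_pi` is a convex combination of the corresponding rows of the per-action matrices, so it is again a probability distribution over next states.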

Equation $(4)$ can be written in matrix form

$$\begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1^\pi \\ \vdots \\ \mathcal{R}_n^\pi \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11}^\pi & \dots & \mathcal{P}_{1n}^\pi \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1}^\pi & \dots & \mathcal{P}_{nn}^\pi \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \tag{5}$$

Or, more compactly,

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \tag{6}$$
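Since $(6)$ is linear in $v_\pi$, it can be rearranged to $(I - \gamma \mathcal{P}^\pi) v_\pi = \mathcal{R}^\pi$ and, for $\gamma < 1$, solved directly. The sketch below does this with NumPy; the numbers in `P_pi` and `R_pi` and the discount `gamma = 0.9` are made-up values for a hypothetical 2-state problem.

```python
import numpy as np

gamma = 0.9  # assumed discount factor, must be < 1 for this to be well-posed

# Hypothetical policy-averaged model for a 2-state problem.
P_pi = np.array([[0.70, 0.30],
                 [0.02, 0.98]])
R_pi = np.array([0.5, 1.8])

# v = R + gamma * P v   <=>   (I - gamma * P) v = R
n = P_pi.shape[0]
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Residual of eq. (6); it should be zero up to floating-point error.
residual = np.max(np.abs(v_pi - (R_pi + gamma * P_pi @ v_pi)))
```

A direct solve costs roughly cubic time in the number of states, which is one reason iterative schemes are of interest for larger state spaces.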

Notice that both sides of $(6)$ are $n$-dimensional vectors. Here $n=|\mathcal{S}|$ is the size of the state space. We can then define an operator $\mathcal{T}^\pi:\mathbb{R}^n\to\mathbb{R}^n$ as

$$\mathcal{T}^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v \tag{7}$$

for any $v\in \mathbb{R}^n$. This is the expected Bellman operator.
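As a rough illustration, $(7)$ is a one-line function once the values are stored as a vector. The name `bellman_expectation_operator` and the toy numbers are assumptions for the example, not part of any library.

```python
import numpy as np

def bellman_expectation_operator(v, R_pi, P_pi, gamma):
    """Apply T^pi from eq. (7) to an arbitrary value vector v in R^n."""
    return R_pi + gamma * P_pi @ v

# Hypothetical 2-state model and discount factor.
gamma = 0.9
P_pi = np.array([[0.70, 0.30],
                 [0.02, 0.98]])
R_pi = np.array([0.5, 1.8])

v0 = np.zeros(2)                                              # any starting vector is allowed
v1 = bellman_expectation_operator(v0, R_pi, P_pi, gamma)      # T^pi applied once
v2 = bellman_expectation_operator(v1, R_pi, P_pi, gamma)      # T^pi applied twice
```

Note that the input `v` does not have to be the value function of any policy; the operator is defined on all of $\mathbb{R}^n$, and $v_\pi$ is the particular vector that it maps to itself.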

Similarly, you can rewrite the Bellman optimality equation

$$v_*(s) = \max_{a\in\mathcal{A}} \left(\mathcal{R}_s^a + \gamma\sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s')\right) \tag{8}$$

as the Bellman optimality operator

$$\mathcal{T}^*(v) = \max_{a\in\mathcal{A}} \left(\mathcal{R}^a + \gamma \mathcal{P}^a v\right) \tag{9}$$

The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values, $\mathbb{R}^n$.
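A possible NumPy sketch of the optimality operator $(9)$ follows. The max over actions is taken separately for each state (elementwise over the vector), which is how I read $(9)$ given $(8)$. The array layout, the function name, and the toy MDP are assumptions for illustration.

```python
import numpy as np

def bellman_optimality_operator(v, R, P, gamma):
    """Apply T^* from eq. (9) to a value vector v.

    P : shape (A, S, S), P[a, s, t] = probability of s -> t under action a
    R : shape (A, S),    R[a, s]    = expected immediate reward
    """
    # q[a, s] = R[a, s] + gamma * sum_t P[a, s, t] * v[t]
    q = R + gamma * P @ v
    # Elementwise max over actions: one maximisation per state.
    return q.max(axis=0)

# Hypothetical 2-state, 2-action MDP.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v0 = np.zeros(2)
v1 = bellman_optimality_operator(v0, R, P, gamma)
```

Unlike $\mathcal{T}^\pi$, this map is not linear in `v` because of the max, but it is still a map from $\mathbb{R}^n$ to $\mathbb{R}^n$.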

Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g. policy iteration, value iteration) converge to a unique fixed point. Framing them this way lets us draw on the existing body of work in operator theory and make use of special properties of the Bellman operators.

Specifically, the fact that the Bellman operators are contractions (with modulus $\gamma$ in the max norm, provided $\gamma < 1$) gives the useful results that, for any policy $\pi$ and any initial vector $v$,

$$\lim_{k\to\infty}(\mathcal{T}^\pi)^k v = v_\pi \tag{10}$$

$$\lim_{k\to\infty}(\mathcal{T}^*)^k v = v_* \tag{11}$$

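where $v_\pi$ is the value of policy $\pi$ and $v_*$ is the value of an optimal policy $\pi^*$. Both limits follow from the contraction mapping theorem (also known as the Banach fixed-point theorem), which also guarantees that each fixed point is unique.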
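The limits $(10)$ and $(11)$ can be checked numerically on a small example. The sketch below is only an illustration under assumed values: the 2-state, 2-action MDP, the policy `pi`, the discount `gamma = 0.9`, and the iteration count of 300 are all made up, and a numerical check on one example is of course not a proof.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, t], R[a, s], and a stochastic policy pi[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])

P_pi = np.einsum("sa,ast->st", pi, P)  # eq. (2)
R_pi = np.einsum("sa,as->s", pi, R)    # eq. (3)

def T_pi(v):
    return R_pi + gamma * P_pi @ v          # eq. (7)

def T_star(v):
    return (R + gamma * P @ v).max(axis=0)  # eq. (9)

def iterate(T, v, k):
    """Apply the operator T to v a total of k times."""
    for _ in range(k):
        v = T(v)
    return v

# Eq. (10): repeated application of T^pi approaches the exact solution of eq. (6).
v_pi_iter = iterate(T_pi, np.zeros(2), 300)
v_pi_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Eq. (11): two very different starting vectors end up at the same point.
v_star_a = iterate(T_star, np.zeros(2), 300)
v_star_b = iterate(T_star, np.array([10.0, -5.0]), 300)

# One-step contraction check in the max norm: ||T u - T w|| <= gamma * ||u - w||.
u, w = np.array([10.0, -5.0]), np.array([-3.0, 4.0])
lhs = np.max(np.abs(T_star(u) - T_star(w)))
rhs = gamma * np.max(np.abs(u - w))
```

Iterating `T_star` in this way is value iteration, and iterating `T_pi` is iterative policy evaluation; the contraction property is what makes the starting vector irrelevant.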

— Philip Raeisghasem
source