The notation I'll be using is from two different lectures by David Silver and is also informed by these slides.
The expected Bellman equation is
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\pi(s') \right) \tag{1}$$
If we let
$$\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, \mathcal{P}^a_{ss'} \tag{2}$$
and
$$\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, \mathcal{R}^a_s \tag{3}$$
then we can rewrite (1) as
$$v_\pi(s) = \mathcal{R}^\pi_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^\pi_{ss'} \, v_\pi(s') \tag{4}$$
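The averaging in (2) and (3) is easy to see in code. Below is a minimal NumPy sketch under an array layout of my own choosing (not from the lectures): `P[a, s, s2]` holds $\mathcal{P}^a_{ss'}$, `R[a, s]` holds $\mathcal{R}^a_s$, and `pi[s, a]` holds $\pi(a \mid s)$.

```python
import numpy as np

def policy_averaged_mdp(P, R, pi):
    """Average the MDP dynamics and rewards over a policy.

    P  : (num_actions, n, n) array, P[a, s, s2] = P^a_{s s'}
    R  : (num_actions, n)    array, R[a, s]     = R^a_s
    pi : (n, num_actions)    array, pi[s, a]    = pi(a | s)
    Returns P_pi of shape (n, n) and R_pi of shape (n,), as in (2) and (3).
    """
    P_pi = np.einsum('sa,ast->st', pi, P)  # P_pi[s, s'] = sum_a pi(a|s) P^a_{s s'}
    R_pi = np.einsum('sa,as->s', pi, R)    # R_pi[s]     = sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```

With these arrays, the right-hand side of (4) for all states at once is just `R_pi + gamma * P_pi @ v`.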
Equation (4) can be written in matrix form:
$$\begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}^\pi_1 \\ \vdots \\ \mathcal{R}^\pi_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}^\pi_{11} & \dots & \mathcal{P}^\pi_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}^\pi_{n1} & \dots & \mathcal{P}^\pi_{nn} \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \tag{5}$$
Or, more compactly,
$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \tag{6}$$
Notice that both sides of (6) are $n$-dimensional vectors, where $n = |\mathcal{S}|$ is the size of the state space. We can then define an operator $T^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as
$$T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v \tag{7}$$
for any $v \in \mathbb{R}^n$. This is the expected Bellman operator.
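In code, $T^\pi$ is just an affine map on $\mathbb{R}^n$. A minimal sketch of (7), reusing the hypothetical `R_pi` and `P_pi` arrays from the previous snippet:

```python
def bellman_expectation_operator(v, R_pi, P_pi, gamma):
    """Apply T^pi from (7) to a value vector v in R^n."""
    return R_pi + gamma * P_pi @ v
```

By (6), $v_\pi$ satisfies $T^\pi(v_\pi) = v_\pi$, i.e. it is a fixed point of this map.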
Similarly, you can rewrite the Bellman optimality equation
$$v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right) \tag{8}$$
as the Bellman optimality operator
$$T^*(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a v \right) \tag{9}$$

where $\mathcal{R}^a$ is the vector with entries $\mathcal{R}^a_s$, $\mathcal{P}^a$ is the matrix with entries $\mathcal{P}^a_{ss'}$, and the $\max$ over actions is taken componentwise (state by state).
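A corresponding sketch of (9), under the same assumed `(num_actions, n, n)` and `(num_actions, n)` layout for `P` and `R`:

```python
def bellman_optimality_operator(v, R, P, gamma):
    """Apply T^* from (9): back up v through every action, then take the best per state."""
    q = R + gamma * P @ v   # q[a, s] = R^a_s + gamma * sum_{s'} P^a_{s s'} v(s')
    return q.max(axis=0)    # componentwise max over actions
```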
The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values, $\mathbb{R}^n$.
Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g. policy iteration, value iteration) converge to a unique fixed point. The payoff comes from a body of existing work in operator theory, which lets us exploit special properties of the Bellman operators.
Specifically, the fact that the Bellman operators are $\gamma$-contractions in the sup norm (for $\gamma < 1$) gives the useful results that, for any policy $\pi$ and any initial vector $v$,
$$\lim_{k \to \infty} (T^\pi)^k v = v_\pi \tag{10}$$
$$\lim_{k \to \infty} (T^*)^k v = v_* \tag{11}$$
where $v_\pi$ is the value of policy $\pi$ and $v_*$ is the value of an optimal policy $\pi_*$. The proof follows from the contraction mapping theorem (the Banach fixed-point theorem).
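Results (10) and (11) are exactly what iterative policy evaluation and value iteration rely on: apply the operator over and over to an arbitrary starting vector and the iterates converge. A sketch reusing the hypothetical functions above:

```python
import numpy as np

def iterate_to_fixed_point(T, v0, tol=1e-10, max_iters=100_000):
    """Repeatedly apply an operator T until successive iterates stop changing."""
    v = v0
    for _ in range(max_iters):
        v_next = T(v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Policy evaluation: converges to v_pi, as in (10).
# v_pi_est = iterate_to_fixed_point(
#     lambda v: bellman_expectation_operator(v, R_pi, P_pi, gamma), np.zeros(n))

# Value iteration: converges to v_*, as in (11).
# v_star_est = iterate_to_fixed_point(
#     lambda v: bellman_optimality_operator(v, R, P, gamma), np.zeros(n))
```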