Step-by-step example of reverse-mode automatic differentiation

Not sure if this question belongs here, but it's closely related to gradient methods in optimization, which seems to be on-topic here. Anyway, feel free to migrate if you think some other community has better expertise in the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There's not much literature on the topic, and existing implementations (like the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very thankful if somebody could show in detail what we pass in, how we process it and what we take out of the computational graph.

A couple of questions that I have the most difficulty with:

seeds: why do we need them at all?
reverse differentiation rules: I know how to differentiate forward, but how do we go backward? E.g. in the example from this section, how do we know that $\bar{w_2}=\bar{w_3}w_1$?
do we work with symbols only or pass through actual values? E.g. in the same example, are $w_i$ and $\bar{w_i}$ symbols or values?

— ffriend

"Hands-On Machine Learning with Scikit-Learn & TensorFlow", Appendix D, gives a very good explanation in my opinion. I recommend it.

— Agustin Barrachina

Let's say we have the expression $z = x_1x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$. Reverse-mode AD splits this task into 2 parts, namely the forward and reverse passes.

Forward pass

First, we decompose our complex expression into a set of primitive ones, i.e. expressions consisting of at most a single function call. Note that I also renamed the input and output variables for consistency, though it's not necessary:

$w_1 = x_1$

$w_2 = x_2$

$w_3 = w_1w_2$

$w_4 = \sin(w_1)$

$w_5 = w_3 + w_4$

$z = w_5$

The advantage of this representation is that the differentiation rules for each separate expression are already known. For example, we know that the derivative of $\sin$ is $\cos$, and so $\frac{dw_4}{dw_1} = \cos(w_1)$. We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say our inputs are $x_1 = 2$ and $x_2 = 3$. Then we have:

$w_1 = x_1 = 2$

$w_2 = x_2 = 3$

$w_3 = w_1w_2 = 6$

$w_4 = \sin(w_1) \approx 0.9$

$w_5 = w_3 + w_4 \approx 6.9$

$z = w_5 \approx 6.9$
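
As a rough sketch of what the forward pass does, the following Python snippet evaluates the primitives one by one and keeps every intermediate value. The function `forward` and the dictionary `w` are illustrative names chosen here, not part of any library.

```python
import math

def forward(x1, x2):
    """Evaluate z = x1*x2 + sin(x1) one primitive at a time, keeping all intermediates."""
    w = {}
    w[1] = x1                 # w1 = x1
    w[2] = x2                 # w2 = x2
    w[3] = w[1] * w[2]        # w3 = w1 * w2
    w[4] = math.sin(w[1])     # w4 = sin(w1)
    w[5] = w[3] + w[4]        # w5 = w3 + w4
    return w

w = forward(2.0, 3.0)
z = w[5]                      # z = w5
for i in sorted(w):
    print(f"w{i} = {w[i]:.4f}")
```

The saved values in `w` are exactly what the reverse pass needs later, e.g. $w_1$ to evaluate $\cos(w_1)$.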

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ which depends on $u$, which, in its turn, depends on $v$, then:

$\frac{dt}{dv} = \frac{dt}{du}\frac{du}{dv}$

or, if $t$ depends on $v$ via several paths / variables $u_i$, e.g.:

$u_1 = f(v)$

$u_2 = g(v)$

$t = h(u_1, u_2)$

then (see the proof here):

$\frac{dt}{dv} = \sum_i \frac{dt}{du_i}\frac{du_i}{dv}$
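
The multi-path rule can be checked numerically. The sketch below assumes, purely for illustration, $f(v) = v^2$, $g(v) = \sin(v)$ and $h(u_1, u_2) = u_1u_2$; none of these choices come from the text above, and the function names are hypothetical.

```python
import math

def t_of_v(v):
    u1 = v ** 2          # u1 = f(v)
    u2 = math.sin(v)     # u2 = g(v)
    return u1 * u2       # t = h(u1, u2)

def dt_dv_chain(v):
    """Sum dt/du_i * du_i/dv over both paths from v to t."""
    u1, u2 = v ** 2, math.sin(v)
    dt_du1, dt_du2 = u2, u1               # partial derivatives of h = u1 * u2
    du1_dv, du2_dv = 2 * v, math.cos(v)   # derivatives of f and g
    return dt_du1 * du1_dv + dt_du2 * du2_dv

v = 1.5
eps = 1e-6
# Central finite difference as an independent estimate of dt/dv.
numeric = (t_of_v(v + eps) - t_of_v(v - eps)) / (2 * eps)
analytic = dt_dv_chain(v)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places, which is what the summation formula predicts.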

In terms of the expression graph, if we have a final node $z$ and input nodes $w_i$, and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$), we can find the derivative $\frac{dz}{dw_i}$ as

$\frac{dz}{dw_i} = \sum_{p \in parents(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$

In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$, we only need to know the derivatives of its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$.
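
One possible way to turn this formula into code, assuming the example graph from the forward pass and the inputs $x_1 = 2$, $x_2 = 3$. The tables `parents` and `local` and the function `dz_dw` are hypothetical names; following the text, the "parents" of $w_i$ are the nodes that consume it.

```python
import math

# Values saved by the forward pass at x1 = 2, x2 = 3.
w = {1: 2.0, 2: 3.0}
w[3] = w[1] * w[2]
w[4] = math.sin(w[1])
w[5] = w[3] + w[4]

# parents[i] lists the nodes that use w_i as an input.
parents = {1: [3, 4], 2: [3], 3: [5], 4: [5], 5: []}

# local[(p, i)] is the derivative of the primitive w_p with respect to w_i.
local = {
    (3, 1): w[2],            # d(w1*w2)/dw1 = w2
    (3, 2): w[1],            # d(w1*w2)/dw2 = w1
    (4, 1): math.cos(w[1]),  # d sin(w1)/dw1 = cos(w1)
    (5, 3): 1.0,             # d(w3+w4)/dw3 = 1
    (5, 4): 1.0,             # d(w3+w4)/dw4 = 1
}

def dz_dw(i, output=5):
    """dz/dw_i as the sum over parents p of dz/dw_p * dw_p/dw_i."""
    if i == output:
        return 1.0           # derivative of the output w.r.t. itself
    return sum(dz_dw(p, output) * local[(p, i)] for p in parents[i])

grad_w1 = dz_dw(1)
grad_w2 = dz_dw(2)
print(grad_w1, grad_w2)
```

This recursion recomputes shared terms; an implementation meant for larger graphs would more likely visit each node once, in reverse order of evaluation, and store each $\frac{dz}{dw_i}$.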

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$) and propagates backward to all dependencies. Here we have (the expression for the "seed"):

$\frac{dz}{dz} = 1$

That may be read as "a change in $z$ results in exactly the same change in $z$", which is quite obvious.

Then we know that $z = w_5$ and so:

$\frac{dz}{dw_5} = 1$

$w_5$ depends linearly on $w_3$ and $w_4$, so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$. Using the chain rule we find:

$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$

$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$

From the definition $w_3 = w_1w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$. Thus:

$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$

Which, as we already know from forward pass, is:

$\frac{dz}{dw_2} = w_1 = 2$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$. Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$. Thus:

$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$

And again, given known inputs, we can calculate it:

$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$, we get our answer:

$\frac{dz}{dx_1} \approx 2.58$

$\frac{dz}{dx_2} = 2$

And that's it!
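
The whole procedure can be condensed into a small program. This is only a minimal sketch under stated assumptions: the `Var` class and the `sin` and `backward` helpers are hypothetical names invented for this example, and this is not a description of how TensorFlow or any other library implements it.

```python
import math

class Var:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs   # pairs (input Var, local derivative d self / d input)
        self.grad = 0.0        # accumulates dz/d(self) during the reverse pass

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(z):
    """Propagate the seed dz/dz = 1 from z back to everything it depends on."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk: a node is appended only after all of its inputs.
        if id(node) not in seen:
            seen.add(id(node))
            for inp, _ in node.inputs:
                visit(inp)
            order.append(node)

    visit(z)
    z.grad = 1.0
    # Reverse order guarantees node.grad is complete before it is pushed further back.
    for node in reversed(order):
        for inp, local in node.inputs:
            inp.grad += node.grad * local

x1, x2 = Var(2.0), Var(3.0)
z = x1 * x2 + sin(x1)     # forward pass: values and local derivatives are recorded
backward(z)               # reverse pass
print(z.value, x1.grad, x2.grad)
```

Note that `x1.grad` receives two contributions, one through the product and one through the sine, which is the summation over paths used in the derivation above.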

This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things that one should keep in mind when differentiating expressions with such objects:

Derivatives may have much higher dimensionality than the inputs or the output, e.g. the derivative of a vector w.r.t. a vector is a matrix, and the derivative of a matrix w.r.t. a matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
Each component of the output array is an independent function of 1 or more components of the input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$, but only on a subset of the $x_k$. In particular, this means that finding the derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$.
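
To make both points concrete, here is a small sketch that assumes an elementwise function $y_i = \sin(x_i)$ (an example chosen here, not taken from the text). The helper `jacobian` is a hypothetical finite-difference routine used only to display the matrix of derivatives $\frac{dy_i}{dx_j}$.

```python
import math

def f(x):
    """Elementwise example: y_i = sin(x_i), so y_i depends only on x_i."""
    return [math.sin(xi) for xi in x]

def jacobian(func, x, eps=1e-6):
    """Central-difference estimate of J[i][j] = dy_i/dx_j."""
    n_out = len(func(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        y_hi, y_lo = func(hi), func(lo)
        for i in range(n_out):
            J[i][j] = (y_hi[i] - y_lo[i]) / (2 * eps)
    return J

x = [0.5, 1.0, 2.0]
J = jacobian(f, x)
for row in J:
    print(["%.4f" % v for v in row])
```

The derivative of a 3-vector w.r.t. a 3-vector is a 3x3 matrix, and here only its diagonal is nonzero, because each $y_i$ depends on a single $x_j$.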

The power of automatic differentiation is that it can deal with complicated structures from programming languages like conditions and loops. However, if all you need is algebraic expressions and you have a good enough framework to work with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
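
For this particular example, the symbolic result can simply be written down as an ordinary function and reused for any inputs. The name `grad_z` is made up for this sketch.

```python
import math

def grad_z(x1, x2):
    """Closed-form gradient of z = x1*x2 + sin(x1): (dz/dx1, dz/dx2)."""
    return (x2 + math.cos(x1), x1)

g_example = grad_z(2.0, 3.0)    # the inputs used in the walkthrough
g_other = grad_z(0.0, -1.0)     # any other inputs work without a new derivation
print(g_example, g_other)
```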

— ffriend

Very useful question/answer. Thanks. Just a little criticism: you seem to move on a tree structure without explaining (that's when you start talking about parents, etc.)

— MadHatter

Also it won't hurt clarifying why we need seeds.

— MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (those that refer to parents) to emphasize the graph structure. I also added "seed" to the text, although this name itself may be misleading in my opinion: in AD the seed is always a fixed expression, $\frac{dz}{dz} = 1$, not something you can choose or generate.

— ffriend

Thanks! I noticed when you have to set more than one "seed", generally one chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified.. But what about 0? And what if one has to pick more than 2 seeds?

— MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.
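
A minimal sketch of the forward-mode seeding described here, assuming a hypothetical `Dual` class that carries a value together with its derivative w.r.t. one chosen input. It reuses the expression $z = x_1x_2 + \sin(x_1)$ from the answer.

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. the seeded input."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def z(x1, x2):
    return x1 * x2 + sin(x1)

# Seed 1 on x1 and 0 on x2: the sweep yields dz/dx1.
dz_dx1 = z(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv
# Swap the seeds: a second sweep yields dz/dx2.
dz_dx2 = z(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv
print(dz_dx1, dz_dx2)
```

Each forward sweep gives the derivative w.r.t. one input only, so two inputs need two sweeps, whereas the reverse pass in the answer produced both derivatives from a single seed on the output.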

— ffriend