Deep Neural Network - Backpropagation with ReLU

I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure whether I am on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the real value and $\hat y$ is the predicted value. Also assume that $x > 0$ always.

1 Layer ReLU, where the weight at the 1st layer is $w_1$

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 Layer ReLU, where the weight at the 1st layer is $w_2$ and at the 2nd layer is $w_1$, and I want to update the 1st layer weight $w_2$

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 Layer ReLU, where the weight at the 1st layer is $w_3$, at the 2nd layer $w_2$, and at the 3rd layer $w_1$

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

So the chain rule here only involves 2 derivatives, whereas with a sigmoid it could be as long as $n$, the number of layers.

Say I want to update all 3 layer weights, where $w_1$ is the 3rd layer, $w_2$ is the 2nd layer, and $w_3$ is the 1st layer

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? With sigmoid, the equation contains many multiplications by 0.25, whereas ReLU has no constant-value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

neural-network backpropagation

— user1157751

@NeilSlater Thanks for your reply! Can you elaborate? I am not sure what you meant.

— user1157751,

Ah, I think I know what you meant. The reason I raised this question is that I am not sure the derivation is correct. I have searched around and have not found an example of ReLU backpropagation derived completely from scratch.

— user1157751,

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

The derivative is the unit step function. This does ignore a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0, or 0.5, with no real impact on neural network performance.
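As a minimal sketch, the two definitions translate directly into Python. The function names `relu` and `relu_grad` are illustrative choices, and returning 1 at exactly 0 simply follows the convention in the formula above.

```python
def relu(x: float) -> float:
    """ReLU as defined above: 0 for x < 0, x otherwise."""
    return x if x >= 0 else 0.0


def relu_grad(x: float) -> float:
    """Unit step used as the ReLU derivative.

    The value at x == 0 is a convention (1 here, matching the formula);
    0 or 0.5 would work equally well in practice.
    """
    return 1.0 if x >= 0 else 0.0
```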

Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness - call it $z$, add some indexing by layer, and I prefer lower-case for vectors and upper-case for matrices, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ is its input and $W^{(0)}$ is the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix - why that is will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function with respect to an example estimate is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$

Using the chain rule to back propagate to the pre-transform ( $z$ ) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage and a critical part of backprop, linking the steps together. Derivations often skip this part because clever combinations of cost function and output layer mean that it simplifies. Here it does not.

To get the gradient with respect to the weight $W^{(0)}$, it is another iteration of the chain rule:

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

. . . because $z^{(1)} = W^{(0)}x$, therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
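One way to sanity-check the single-neuron result is to compare it against a finite-difference estimate. The sketch below assumes scalar `w0`, `x` and `y`; the helper names (`cost`, `grad_w0`, `numeric_grad_w0`) are made up for illustration.

```python
def relu(z: float) -> float:
    return z if z >= 0 else 0.0


def step(z: float) -> float:
    # Unit step, used as the ReLU derivative (1 at z == 0 by convention).
    return 1.0 if z >= 0 else 0.0


def cost(w0: float, x: float, y: float) -> float:
    # C = 1/2 * (y - ReLU(W0 * x))^2
    return 0.5 * (y - relu(w0 * x)) ** 2


def grad_w0(w0: float, x: float, y: float) -> float:
    # Analytic gradient: (ReLU(z1) - y) * Step(z1) * x, with z1 = W0 * x
    z1 = w0 * x
    return (relu(z1) - y) * step(z1) * x


def numeric_grad_w0(w0: float, x: float, y: float, eps: float = 1e-6) -> float:
    # Central finite difference, only for checking the analytic formula.
    return (cost(w0 + eps, x, y) - cost(w0 - eps, x, y)) / (2.0 * eps)
```

For example, with `w0 = 0.5`, `x = 2.0`, `y = 3.0` we have $z^{(1)} = 1$, so the formula gives $(1 - 3) \cdot 1 \cdot 2 = -4$, and `numeric_grad_w0` should agree to within rounding error. With a negative `w0` the step function is 0 and both gradients are 0.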

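However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

In more generic terms, we can work with two arbitrary adjacent layers. Call them layer $(k)$ indexed by $i$, and layer $(k+1)$ indexed by $j$. The weights are now a matrix, so the feed-forward equations look like this: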

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient with respect to $r^{output}_j$ is still $r^{output}_j - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$ - just note that this is ultimately where we get the output cost function gradients from. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
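The three equations can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: weights are nested lists with `W[i][j]` linking neuron `i` of layer $(k)$ to neuron `j` of layer $(k+1)$, there are no bias terms, the cost is assumed to be $\frac{1}{2}\sum_j (y_j - r^{output}_j)^2$, and the function names are hypothetical.

```python
def relu(z):
    return z if z >= 0 else 0.0


def step(z):
    # ReLU derivative; swap this out for another activation's derivative.
    return 1.0 if z >= 0 else 0.0


def forward(weights, x):
    """Feed-forward pass. Returns (rs, zs) with rs[0] = x and zs[0] unused."""
    rs, zs = [list(x)], [None]
    for W in weights:
        r_prev = rs[-1]
        # z_j = sum_i W_ij * r_i
        z = [sum(W[i][j] * r_prev[i] for i in range(len(r_prev)))
             for j in range(len(W[0]))]
        zs.append(z)
        rs.append([relu(v) for v in z])
    return rs, zs


def cost(weights, x, y):
    rs, _ = forward(weights, x)
    return 0.5 * sum((t - r) ** 2 for r, t in zip(rs[-1], y))


def backward(weights, rs, zs, y):
    """Return dC/dW for every layer, in the same nested-list shape as weights."""
    # Output layer: dC/dr_j = r_j - y_j
    dC_dr = [r - t for r, t in zip(rs[-1], y)]
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        W = weights[k]
        n_in, n_out = len(rs[k]), len(dC_dr)
        # Equation 1: dC/dz_j = dC/dr_j * Step(z_j)
        dC_dz = [g * step(z) for g, z in zip(dC_dr, zs[k + 1])]
        # Equation 3: dC/dW_ij = dC/dz_j * r_i
        grads[k] = [[dC_dz[j] * rs[k][i] for j in range(n_out)]
                    for i in range(n_in)]
        # Equation 2: dC/dr_i = sum_j dC/dz_j * W_ij
        dC_dr = [sum(dC_dz[j] * W[i][j] for j in range(n_out))
                 for i in range(n_in)]
    return grads
```

A typical use is `rs, zs = forward(weights, x)` followed by `grads = backward(weights, rs, zs, y)`. Comparing each entry of `grads` with a finite-difference estimate of `cost` is a reasonable way to check the derivation, provided no pre-activation sits exactly at 0.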

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$ , applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$ ), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$ . The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
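To make the per-layer factor concrete, here is a small sketch that multiplies only the activation derivatives along one path through the network, deliberately ignoring the weights. The helper `activation_factor` is a hypothetical name, and the pre-activation values passed to it are whatever you choose to illustrate with.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # y * (1 - y), which peaks at 0.25 when x == 0
    y = sigmoid(x)
    return y * (1.0 - y)


def relu_grad(x: float) -> float:
    return 1.0 if x >= 0 else 0.0


def activation_factor(grad_fn, pre_activations) -> float:
    """Product of activation derivatives along one path, ignoring weights."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor
```

Even in the sigmoid's best case (every pre-activation at 0), ten layers contribute a factor of $0.25^{10}$, roughly $10^{-6}$. For ReLU the same path contributes exactly 1 if every unit on it is active, and exactly 0 if any unit is inactive.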

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
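The weight effect is easy to see in a toy case. Assume a chain of single-neuron layers that all share the same positive weight `w` and all stay active, so the activation derivative is 1 everywhere and the gradient is scaled purely by the product of the weights. This is a deliberately simplified assumption, and `weight_product` is a made-up helper.

```python
def weight_product(w: float, depth: int) -> float:
    """Product of `depth` identical weights: the scale factor a gradient
    picks up from the weights alone in a chain of always-active ReLU units."""
    product = 1.0
    for _ in range(depth):
        product *= w
    return product
```

With 1000 layers, a weight of 0.9 shrinks the gradient to well below $10^{-40}$, while a weight of 1.1 inflates it to well above $10^{40}$. That is the sense in which the problem exists regardless of the transfer function.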

— Neil Slater

Was a chain rule performed on $\frac{dC}{d \hat y}$ ?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function C is simple enough that you can take its derivative immediately. The only thing I haven't shown there is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$, don't we need to perform the chain rule so that we can take the derivative with respect to $\hat y$? $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$, where $U = y - \hat y$. Apologies for asking really simple questions, my maths ability is probably causing trouble for you : (

— user1157751

If you can make things simpler by expanding, then please do expand the square.

— user1157751

@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.

— Neil Slater