Why back propagate through time in an RNN?

In a recurrent neural network, you usually forward propagate through several time steps, "unroll" the network, and then back propagate across the sequence of inputs.

Why would you not just update the weights after each individual step in the sequence? (This is the equivalent of using a truncation length of 1, so there is nothing to unroll.) This completely eliminates the vanishing gradient problem, greatly simplifies the algorithm, would probably reduce the chances of getting stuck in local minima, and most importantly seems to work fine. I trained a model this way to generate text, and the results seemed comparable to the results I have seen from BPTT-trained models. I am only confused about this because every tutorial on RNNs I have seen says to use BPTT, almost as if it were required for proper learning, which is not the case.

Update: I added an answer

— Frobot

An interesting direction to take this research would be to compare the results you have achieved on your problem with benchmarks published in the literature on standard RNN problems. That would make a really interesting article.

— Sycorax says Reinstate Monica

Your "Update: I added an answer" replaced the previous edit with the architecture description and an illustration. Is that on purpose?

— amoeba says Reinstate Monica

Yes, I took it out because it did not really seem relevant to the actual question and it took up a lot of space, but I can add it back if it helps.

— Frobot

Well, people seem to have massive problems understanding your architecture, so I guess any additional explanation is useful. You can add it to your answer instead of your question, if you prefer.

— amoeba says Reinstate Monica

Answers:

Edit: I made a big mistake when comparing the two methods and have to change my answer. It turns out that the way I was doing it, back propagating only on the current time step, actually starts out learning faster. The quick updates learn the most basic patterns very quickly. But on a larger data set and with longer training time, BPTT does in fact come out on top. I was testing a small sample for just a few epochs and assumed that whoever starts out winning the race would be the winner. But this did lead me to an interesting find. If you start your training back propagating just a single time step, then switch to BPTT and slowly increase how far back you propagate, you get faster convergence.
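The schedule described above can be pictured as a function that maps the training epoch to a BPTT truncation length. This is only a minimal sketch: the names `truncation_length` and `split_into_windows`, the linear growth rule, and the default values are assumptions made for illustration, not the settings used in the experiment.

```python
def truncation_length(epoch, start=1, max_len=32, grow_every=2):
    """Hypothetical schedule: begin with single-step back propagation
    (length 1) and add one time step every `grow_every` epochs,
    capped at `max_len`."""
    return min(max_len, start + epoch // grow_every)


def split_into_windows(sequence, length):
    """Cut a training sequence into consecutive chunks of `length` steps.
    Gradients would be propagated back only within each chunk."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


if __name__ == "__main__":
    text = list("hello world")
    for epoch in (0, 2, 4):
        k = truncation_length(epoch)
        print(epoch, k, split_into_windows(text, k)[:3])
```

How fast the length should grow is a tuning question that the sketch does not answer; the only point is that the window starts at one step and widens over training.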

— Frobot

Thanks for the update. In the source of that last image, the author says this about the one-to-one setting: "Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification)". So that is what we were saying. If it is as you described, it has no state and it is not an RNN. "forward propagating through a single input before back propagating" - I would call that an ANN. But these would not perform as well on text, so something is off, and I have no idea what because I do not have the code.

— ragulpr

I did not read that part, and you are right. The model I am using is actually the "many to many" on the far right. I assumed that in the "one to one" section there really were many of these all connected and the drawing had just left it out. But that is actually one of the options on the far right that I did not notice (it is odd to have it in a blog about RNNs, so I assumed they were all recurrent).

— I will edit

I figured as much, which is why I insisted on seeing your loss function. If it is many-to-many, your loss is akin to $error=\sum_t(y_t-\hat{y}_t)^2$ and it is identically an RNN: you are propagating/inputting the whole sequence but then just truncating BPTT, i.e. you would calculate the red part in my post but not recurse further.

— ragulpr

My loss function does not sum over time. I take one input, get one output, then calculate a loss and update the weights, then move on to t+1, so there is nothing to sum. I will add the exact loss function to the original post.

— Frobot

Just post your code; I am not guessing any more, this is silly.

— ragulpr

An RNN is a Deep Neural Network (DNN) where each layer may take new input but has the same parameters. BPTT is a fancy word for Back Propagation on such a network, which is itself a fancy word for Gradient Descent.

Say that the RNN outputs $\hat{y}_t$ in each step and

$$error_t=(y_t-\hat{y}_t)^2$$

In order to learn the weights we need gradients, so that the function can answer the question "how much does a change in a parameter affect the loss function?", and we move the parameters in the direction given by:

$$\nabla error_t=-2(y_t-\hat{y}_t)\nabla \hat{y}_t$$

I.e. we have a DNN where we get feedback on how good the prediction is at each layer. A change in a parameter will change every layer in the DNN (every timestep), and every layer contributes to the upcoming outputs, so this needs to be accounted for.

Take a simple one-neuron, one-layer network to see this semi-explicitly:

$$\begin{aligned}
\hat{y}_{t+1} &= f(a+bx_t+c\hat{y}_t)\\
\frac{\partial}{\partial a}\hat{y}_{t+1} &= f'(a+bx_t+c\hat{y}_t)\cdot \left(1+c\cdot \frac{\partial}{\partial a}\hat{y}_{t}\right)\\
\frac{\partial}{\partial b}\hat{y}_{t+1} &= f'(a+bx_t+c\hat{y}_t)\cdot \left(x_t+c\cdot\frac{\partial}{\partial b}\hat{y}_{t}\right)\\
\frac{\partial}{\partial c}\hat{y}_{t+1} &= f'(a+bx_t+c\hat{y}_t)\cdot \left(\hat{y}_t+c\cdot\frac{\partial}{\partial c}\hat{y}_{t}\right)\\
&\iff\\
\nabla \hat{y}_{t+1} &= f'(a+bx_t+c\hat{y}_t)\cdot \left(\begin{bmatrix}1\\x_t\\\hat{y}_t \end{bmatrix} + c\, {\color{red}{\nabla \hat{y}_{t}}} \right)
\end{aligned}$$

With $\delta$ as the learning rate, one training step is then:

$$\begin{bmatrix}\tilde{a}\\\tilde{b}\\\tilde{c}\end{bmatrix} \leftarrow \begin{bmatrix}a\\b\\c\end{bmatrix} + \delta (y_{t}-\hat{y}_{t})\nabla \hat{y}_t$$
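To make the recursion concrete, here is a minimal sketch of the one-neuron network with $f=\tanh$, carrying $\nabla \hat{y}_t$ forward alongside $\hat{y}_t$. The function names, the `truncate` argument and the numbers are assumptions for illustration. `truncate=k` keeps only the last `k` applications of the recursion (everything earlier is treated as a constant); which `k` corresponds to a given "one step" scheme is a matter of interpretation.

```python
import math


def forward_with_grad(params, xs, y0=0.0, truncate=None):
    """Run y_{t+1} = tanh(a + b*x_t + c*y_t) over xs and return the final
    output with its gradient w.r.t. (a, b, c). y0 is assumed constant."""
    a, b, c = params
    start = 0 if truncate is None else max(0, len(xs) - truncate)
    y = y0
    grad = [0.0, 0.0, 0.0]
    for t, x in enumerate(xs):
        y_next = math.tanh(a + b * x + c * y)
        if t >= start:
            f_prime = 1.0 - y_next ** 2      # derivative of tanh
            direct = (1.0, x, y)             # d(a + b*x + c*y)/d(a, b, c)
            # direct term plus c times the carried-over gradient (the red part)
            grad = [f_prime * (d + c * g) for d, g in zip(direct, grad)]
        y = y_next
    return y, grad


def numeric_grad(params, xs, eps=1e-6):
    """Central-difference gradient of the final output, for comparison."""
    grads = []
    for i in range(3):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        diff = forward_with_grad(up, xs)[0] - forward_with_grad(down, xs)[0]
        grads.append(diff / (2 * eps))
    return grads


if __name__ == "__main__":
    params, xs = (0.1, 0.5, 0.8), [1.0, -0.5, 0.3, 0.9]
    print("full     ", forward_with_grad(params, xs)[1])
    print("numeric  ", numeric_grad(params, xs))
    print("truncated", forward_with_grad(params, xs, truncate=1)[1])
```

In this sketch the full recursion agrees with the finite-difference gradient of the whole rollout, while the truncated version only approximates it.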

What we see is that in order to calculate $\nabla \hat{y}_{t+1}$ you need to calculate, i.e. roll out, $\nabla \hat{y}_{t}$. What you propose is to ~~simply disregard the red part~~ calculate the red part for $t$ but not recurse further. I assume that your loss is something like

$$error=\sum_t(y_t-\hat{y}_t)^2$$

Maybe each step will then contribute a crude direction which is enough in aggregation? This could explain your results but I'd be really interested in hearing more about your method/loss function! Also would be interested in a comparison with a two timestep windowed ANN.

edit4: After reading comments it seems like your architecture is not an RNN.

RNN: Stateful - carry forward hidden state $h_t$ indefinitely. This is your model, but the training is different.

~~Your model: Stateless - hidden state rebuilt in each step~~

edit2: added more refs to DNNs

edit3: fixed gradstep and some notation

edit5: fixed the interpretation of your model after your answer/clarification.

— ragulpr

Thank you for your answer. I think you may have misunderstood what I am doing, though. In the forward propagation I only do one step, so that the back propagation is also only one step. I don't forward propagate across multiple inputs in the training sequence. I see what you mean about a crude direction that is enough in aggregation to allow learning, but I have checked my gradients against numerically calculated gradients and they match to 10+ decimal places. The back prop works fine. I am using cross entropy loss.
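For readers unfamiliar with this kind of check, a minimal sketch of comparing an analytic gradient with a central-difference estimate for a softmax cross entropy loss looks roughly like this. The function names and the logits are made up for illustration and are not taken from the model discussed here. Note that such a check only confirms the derivative of the loss that was actually computed (here a single step); it says nothing about dependencies that were left out of that computation.

```python
import math


def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def analytic_grad(logits, target):
    """d(loss)/d(logits) = softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def numeric_grad(f, values, eps=1e-6):
    """Central-difference estimate of the gradient of f at `values`."""
    grad = []
    for i in range(len(values)):
        up, down = list(values), list(values)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


if __name__ == "__main__":
    logits, target = [0.3, -1.2, 2.0], 2
    print(analytic_grad(logits, target))
    print(numeric_grad(lambda v: cross_entropy(v, target), logits))
```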

— Frobot

I am working on taking my same model and retraining it with BPTT as we speak to have a clear comparison. I have also trained a model using this "one step" algorithm to predict whether a stock price will rise or fall for the next day, which is getting decent accuracy, so I will have two different models to compare BPTT vs single step back prop.

— Frobot

If you only forward propagate one step, isn't this a two-layered ANN with the feature input of the last step at the first layer and the feature input of the current step at the second layer, but with the same weights/parameters for both layers? I'd expect similar or better results with an ANN that takes input $\hat{y}_{t+1}=f(x_t,x_{t-1})$, i.e. that uses a fixed time window of size 2. If it only carries forward one step, can it learn long-term dependencies?

— ragulpr

I'm using a sliding window of size 1, but the results are vastly different from those of a sliding-window ANN of size 2 with inputs $(x_t, x_{t-1})$. I can purposely let it overfit when learning a huge body of text and it can reproduce the entire text with 0 errors, which requires knowing long-term dependencies that would be impossible to capture if you only had $(x_t, x_{t-1})$ as input. The only question I have left is whether using BPTT would allow the dependencies to become longer, but it honestly doesn't look like it would.

— Frobot

Look at my updated post. Your architecture is not an RNN; it is stateless, so long-term dependencies not explicitly baked into the features can't be learned. Previous predictions do not influence future predictions. You can see this as if $\frac{\partial}{\partial \hat{y}_{t-2}}\hat{y}_t = 0$ for your architecture. BPTT is in theory identical to BP but performed on an RNN architecture, so you can't, but I see what you mean, and the answer is no. It would be really interesting to see experiments on a stateful RNN with only one-step BPTT, though ^^
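The distinction between a stateless, windowed model and a stateful recurrence can be illustrated with a small finite-difference experiment. This is a sketch under assumptions: both toy models, their weights and the inputs are invented for illustration, and the sensitivity is measured with respect to an old input rather than an old prediction, as a simple proxy.

```python
import math


def windowed_output(xs, w=(0.5, -0.3)):
    """Stateless toy model: the output depends only on the last two inputs."""
    return math.tanh(w[0] * xs[-1] + w[1] * xs[-2])


def recurrent_output(xs, a=0.1, b=0.5, c=0.8, y0=0.0):
    """Stateful toy model: every earlier input reaches the output via y."""
    y = y0
    for x in xs:
        y = math.tanh(a + b * x + c * y)
    return y


def sensitivity(model, xs, index, eps=1e-6):
    """Central-difference estimate of d(model output) / d(xs[index])."""
    up, down = list(xs), list(xs)
    up[index] += eps
    down[index] -= eps
    return (model(up) - model(down)) / (2 * eps)


if __name__ == "__main__":
    xs = [0.2, -0.4, 0.7, 0.1]
    # Perturb the input three steps back from the end of the sequence.
    print("windowed :", sensitivity(windowed_output, xs, -3))
    print("recurrent:", sensitivity(recurrent_output, xs, -3))
```

In the windowed model the old input has exactly zero influence, whereas in the recurrence its influence is nonzero even though it shrinks with every step it passes through.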

— ragulpr

"Unfolding through time" is simply an application of the chain rule,

$$\frac{dF(g(x), h(x), m(x))}{dx} = \frac{\partial F}{\partial g}\frac{dg}{dx} + \frac{\partial F}{\partial h}\frac{dh}{dx} + \frac{\partial F}{\partial m}\frac{dm}{dx}$$

The output of an RNN at time step $t$, $H_t$, is a function of the parameters $\theta$, the input $x_t$ and the previous state $H_{t-1}$ (note that $H_t$ may instead be transformed again at time step $t$ to obtain the output; that is not important here). Remember the goal of gradient descent: given some error function $L$, let's look at our error for the current example (or examples), and then adjust $\theta$ in such a way that, given the same example again, our error would be reduced.

How exactly did $\theta$ contribute to our current error? We took a weighted sum with our current input, $x_t$, so we'll need to backpropagate through the input to find $\nabla_\theta a(x_t, \theta)$, to work out how to adjust $\theta$. But our error was also the result of some contribution from $H_{t-1}$, which was also a function of $\theta$, right? So we need to find $\nabla_\theta H_{t-1}$, which was a function of $x_{t-1}$, $\theta$ and $H_{t-2}$. But $H_{t-2}$ was also a function of $\theta$. And so on.
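Writing the "and so on" out as a recursion may help. The following is a sketch under the assumptions that the initial state $H_0$ does not depend on $\theta$, that $\frac{\partial H_t}{\partial \theta}$ denotes the direct dependence with $H_{t-1}$ held fixed, and that for vector-valued states the Jacobians in the product are ordered from $j=t$ down to $j=t-k+1$:

$$\nabla_\theta H_t = \frac{\partial H_t}{\partial \theta} + \frac{\partial H_t}{\partial H_{t-1}} \nabla_\theta H_{t-1} = \sum_{k=0}^{t-1} \left( \prod_{j=t-k+1}^{t} \frac{\partial H_j}{\partial H_{j-1}} \right) \frac{\partial H_{t-k}}{\partial \theta}$$

Keeping only the first $K$ terms of the sum corresponds to truncated BPTT with truncation length $K$; $K=1$ keeps only the direct term.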

— Matthew Hampsey

I understand why you back propagate through time in a traditional RNN. I'm trying to find out why a traditional RNN uses multiple inputs at once for training, when using just one at a time is much simpler and also works.

— Frobot

The only sense in which you can feed in multiple inputs at once into an RNN is feeding in multiple training examples, as part of a batch. The batch size is arbitrary, and convergence is guaranteed for any size, but higher batch sizes may lead to more accurate gradient estimations and faster convergence.

— Matthew Hampsey

That's not what I meant by "multiple inputs at once". I didn't word it very well. I meant that you usually forward propagate through several inputs in the training sequence, then back propagate through them all, then update the weights. So the question is: why propagate through a whole sequence when doing just one input at a time is much easier and still works?

— Frobot

I think some clarification here is required. When you say "inputs", are you referring to multiple training examples, or are you referring to multiple time steps within a single training example?

— Matthew Hampsey

I will post an answer to this question by the end of today. I finished making a BPTT version, just have to train and compare. After that if you still want to see some code let me know what you want to see and I guess I could still post it

— Frobot