How is the cost function from logistic regression differentiated?

29

I am taking the Stanford Machine Learning course on Coursera.

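In the chapter on logistic regression, the cost function is this:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) + (1-y^{(i)})\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right]$

Then, it is differentiated here:

$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]\,x_j^{(i)}$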

I tried to take the derivative of the cost function myself, but I got something completely different.

How is the derivative obtained?

What are the intermediate steps?

— Ottaviano

+1, check @AdamO's answer to my question here: stats.stackexchange.com/questions/229014/…

— Haitao Du

"Completely different" is not really enough to answer your question, beyond telling you what you already know (the correct gradient). It would be much more useful if you showed us the results of your calculations, so that we can help you pin down where you made the mistake.

— Matthew Drury

@MatthewDrury Sorry, Matt, I had put the answer together right before your comment came in. Ottaviano, did you follow all the steps? I will edit later to give it some added value...

— Antoni Parellada

2

When you say "derived", do you mean "differentiated" or "derived"?

— Glen_b

41

Adapted from the course notes, which I don't see available (including this derivation) outside the notes contributed by students within the page of Andrew Ng's Coursera Machine Learning course.

In what follows, the superscript $(i)$ denotes individual measurements or training "examples."

$\small \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \,\frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\log\left(1-h_\theta \left(x^{(i)}\right)\right)\right] \\[2ex]\small\underset{\text{linearity}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\partial}{\partial \theta_j}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta \left(x^{(i)}\right)\right) \right] \\[2ex]\Tiny\underset{\text{chain rule}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}h_\theta \left(x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-h_\theta \left(x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{h_\theta(x)=\sigma\left(\theta^\top x\right)}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}\sigma\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\Tiny\underset{\sigma'}=\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\, \frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - (1 -y^{(i)})\,\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\sigma\left(\theta^\top x\right)=h_\theta(x)}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left( x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - 
(1 -y^{(i)})\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left( \theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)=x_j^{(i)}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)}- \left(1-y^{(i)}\right)\,h_\theta\left(x^{(i)}\right)x_j^{(i)} \right] \\[2ex]\small\underset{\text{distribute}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-y^{(i)}h_\theta\left(x^{(i)}\right)- h_\theta\left(x^{(i)}\right)+y^{(i)}h_\theta\left(x^{(i)}\right) \right]\,x_j^{(i)} \\[2ex]\small\underset{\text{cancel}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-h_\theta\left(x^{(i)}\right)\right]\,x_j^{(i)} \\[2ex]\small=\frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]\,x_j^{(i)}$
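One way to check your own attempt against the final line above is to compare it with a finite-difference approximation of $J(\theta)$. The following is a minimal sketch and not part of the course notes; the function names (`cost`, `grad_component`, `numeric_component`) and the random test data are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def grad_component(theta, X, y, j):
    """dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij, written as an explicit sum."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)
        total += (h_i - y[i]) * X[i, j]
    return total / m


def numeric_component(theta, X, y, j, eps=1e-6):
    """Central finite difference of J along coordinate j."""
    step = np.zeros_like(theta)
    step[j] = eps
    return (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)


# Hypothetical data: 50 examples, 3 features, random 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
theta = rng.normal(size=3)

analytic = np.array([grad_component(theta, X, y, j) for j in range(3)])
numeric = np.array([numeric_component(theta, X, y, j) for j in range(3)])
print(np.max(np.abs(analytic - numeric)))  # should be very small
```

If a hand-derived gradient disagrees with `numeric` by much more than rounding error, the mistake is usually a dropped sign or a missing chain-rule factor.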

The derivative of the sigmoid function is

$\Tiny\begin{align}\frac{d}{dx}\sigma(x)&=\frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right)\\[2ex] &=\frac{-(1+e^{-x})'}{(1+e^{-x})^2}\\[2ex] &=\frac{e^{-x}}{(1+e^{-x})^2}\\[2ex] &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)\\[2ex] &=\left(\frac{1}{1+e^{-x}}\right)\,\left(\frac{1+e^{-x}}{1+e^{-x}}-\frac{1}{1+e^{-x}}\right)\\[2ex] &=\sigma(x)\,\left(\frac{1+e^{-x}}{1+e^{-x}}-\sigma(x)\right)\\[2ex] &=\sigma(x)\,(1-\sigma(x)) \end{align}$
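The identity $\sigma'(x)=\sigma(x)(1-\sigma(x))$ can also be checked numerically. This is a small sketch under the assumption that a central finite difference is an adequate stand-in for the true derivative; the names `sigmoid_prime`, `xs` and `max_gap` are illustrative and do not come from the course notes.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_prime(x):
    """Closed form from the derivation: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare the closed form with a central finite difference on a small grid.
xs = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_prime(xs)))
print(max_gap)  # should be close to zero
```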

— Antoni Parellada

1

+1 for all the effort! Maybe using matrix notation would be easier?

— Haitao Du

Can I say that in linear regression the objective is $\|Ax-b\|^2$ and the derivative is $2A^Te$, where $e=Ax-b$, and that in logistic regression it is similar: the derivative is $A^Te$, where $e=p-b$ and $p=\text{sigmoid}(Ax)$?

— Haitao Du

2

That is why I appreciate your effort. You spent time to use OP's language!!

— Haitao Du

1

My understanding is that there are convexity issues that make the squared error minimization undesirable for non-linear activation functions. In matrix notation, it would be $\frac{\partial J(\theta)}{\partial \theta}=\frac{1}{m}X^\top\left( \sigma(X\theta)-\mathbf y\right)$.

— Antoni Parellada

1

@MohammedNoureldin I just took the partial derivative in the numerators on the prior line, applying the chain rule.

— Antoni Parellada

8

To avoid giving the impression that the matter is excessively complex, let us just look at the structure of the solution.

With simplification and some abuse of notation, let $G(\theta)$ be a term in the sum of $J(\theta)$, and let $h = 1/(1+e^{-z})$ be a function of $z(\theta)= x \theta$:

$G = y \cdot \log(h)+(1-y)\cdot \log(1-h)$

We may use the chain rule, $\frac{d G}{d \theta}=\frac{d G}{d h}\frac{d h}{d z}\frac{d z}{d \theta}$, and solve it one factor at a time ($x$ and $y$ are constants).

$\frac{d G}{d h} = \frac{y} {h} - \frac{1-y}{1-h} = \frac{y - h}{h(1-h)}$

For the sigmoid, $\frac{d h}{d z} = h (1-h)$ holds, which is just the denominator of the previous expression.

Finally, $\frac{d z}{d \theta} = x$.

Combining the results gives the sought-for expression:

$\frac{d G}{d \theta} = (y-h)x$

Since $J(\theta)$ carries a factor of $-\frac{1}{m}$ in front of the sum, this matches the $\frac{1}{m}\sum (h-y)\,x$ form of the full derivation.

Hope that helps.
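The three chain-rule factors can be evaluated separately for one training example to see that their product collapses to $(y-h)x$. This is only an illustrative sketch: the example values of `x`, `y` and `theta` are made up, and $G$ here omits the leading $-\frac{1}{m}$ of $J$, so its sign is opposite to that of the gradient of $J$.

```python
import numpy as np

# Hypothetical single training example (values chosen arbitrarily).
x = np.array([1.0, 2.0, -0.5])
y = 1.0
theta = np.array([0.3, -0.2, 0.8])

z = x @ theta                      # z(theta) = x . theta
h = 1.0 / (1.0 + np.exp(-z))       # h = sigmoid(z)

dG_dh = y / h - (1 - y) / (1 - h)  # equals (y - h) / (h * (1 - h))
dh_dz = h * (1 - h)                # derivative of the sigmoid
dz_dtheta = x                      # derivative of the linear score

chain = dG_dh * dh_dz * dz_dtheta  # product of the three factors
closed_form = (y - h) * x          # the combined result

print(np.allclose(chain, closed_form))
```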

— garej

0

The credit for this answer goes to Antoni Parellada from the comments, which I think deserves a more prominent place on this page (as it helped me out when many other answers did not). Also, this is not a full derivation but more of a clear statement of $\frac{\partial J(\theta)}{\partial \theta}$ . (For full derivation, see the other answers).

$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \cdot X^T\big(\sigma(X\theta)-y\big)$

where

$\begin{aligned} X \in \mathbb{R}^{m\times n} &= \text{Training example matrix} \\ \sigma(z) &= \frac{1}{1+e^{-z}} = \text{sigmoid function} = \text{logistic function} \\ \theta \in \mathbb{R}^{n} &= \text{weight vector} \\ y \in \mathbb{R}^{m} &= \text{class/category/label corresponding to rows in X} \end{aligned}$

Also, a Python implementation for those wanting to calculate the gradient of $J$ with respect to $\theta$ .

import numpy as np


def sig(z):
    """Logistic sigmoid, applied elementwise."""
    return 1 / (1 + np.exp(-z))


def compute_grad(X, y, w):
    """
    Compute gradient of cross entropy function with sigmoidal probabilities

    Args:
        X (numpy.ndarray): examples. Individuals in rows, features in columns
        y (numpy.ndarray): labels. Vector corresponding to rows in X
        w (numpy.ndarray): weight vector

    Returns:
        numpy.ndarray: gradient of J with respect to w, one entry per feature
    """
    m = X.shape[0]
    Z = X.dot(w)  # linear scores, one per example
    A = sig(Z)    # predicted probabilities
    # (1/m) * X^T (sigma(Xw) - y), matching the formula above
    return X.T.dot(A - y) / m
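To see that the matrix form is the same quantity as the per-component sum in the full derivation, the two can be compared directly. The sketch below is not from the original answer; `grad_vectorized`, `grad_loop` and the random data are hypothetical names and values used only for this comparison.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_vectorized(X, y, theta):
    """(1/m) * X^T (sigmoid(X theta) - y)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m


def grad_loop(X, y, theta):
    """Same quantity, accumulated one training example at a time."""
    m, n = X.shape
    g = np.zeros(n)
    for i in range(m):
        g += (sigmoid(X[i] @ theta) - y[i]) * X[i]
    return g / m


# Hypothetical data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=4)

print(np.allclose(grad_vectorized(X, y, theta), grad_loop(X, y, theta)))
```

The vectorized version avoids the Python loop, which matters once the number of examples grows.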

— CiaranWelsh

0

For those of us who are not so strong at calculus, but would like to play around with adjusting the cost function and need a way to calculate derivatives: a shortcut to re-learning calculus is this online tool, which produces the derivative automatically, with step-by-step explanations of each rule applied.

https://www.derivative-calculator.net

— Yaoshiang