How to perform piecewise linear regression with multiple unknown knots?

14

Are there any packages for piecewise linear regression that can detect multiple knots automatically? Thanks. When I use the strucchange package, I am unable to detect the change points, and I have no idea how it detects them. From the plots I can see several points that I would like it to pick out for me. Could anyone give an example here?

regression change-point

— Honglang Wang

1

This appears to be the same question as stats.stackexchange.com/questions/5700/… . If it differs in some substantial way, please let us know by editing your question to reflect the differences; otherwise, we will close it as a duplicate.

— whuber

1

I have edited the question.

— Honglang Wang

1

I think you can do this as a nonlinear optimization problem. Just write the equation of the function to be fitted, with the coefficients and knot locations as parameters.
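An illustrative sketch of this suggestion in base R (not the commenter's own code). It assumes a continuous fit with exactly two knots and reuses the noise-free x and y values from Mats Granvik's answer further down. As a simplification, lm() solves for the linear coefficients at each trial pair of knots, so optim() only has to search over the knot locations. The name rss_for_knots and the starting values are hypothetical.

```r
# Noise-free example data: rises to 10, falls to 1, rises again to 10
x <- 1:28
y <- c(1:10, 9:1, 2:10)

# Residual sum of squares of a continuous piecewise-linear fit.
# Each knot adds a hinge term pmax(x - knot, 0), which changes the slope there.
rss_for_knots <- function(knots) {
  fit <- lm(y ~ x + pmax(x - knots[1], 0) + pmax(x - knots[2], 0))
  sum(residuals(fit)^2)
}

# Search over the two knot locations (Nelder-Mead by default);
# the starting values are rough guesses read off a plot of the data
opt <- optim(c(8, 21), rss_for_knots)
est_knots <- sort(opt$par)
est_knots
```

With noisy data the objective can have several local minima, so it is probably worth trying a few different starting values and comparing opt$value.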

— mark999

1

I think the segmented package is what you are looking for.

— AlefSin

1

I had an identical problem and solved it with the segmented package in R: stackoverflow.com/a/18715116/857416

— a different ben

8

Would MARS be applicable? R has the earth package, which implements it.

— Wayne

8

In general, it's a bit odd to want to fit something as piecewise linear. However, if you really wish to do so, the MARS algorithm is the most direct route. It builds up a function one knot at a time, and then usually prunes back the number of knots to combat over-fitting, much as decision trees do. You can access the MARS algorithm in R via earth or mda. In general, it is fit with GCV, which is not so far removed from the other information criteria (AIC, BIC, etc.).

MARS won't really give you an "optimal" fit, since the knots are grown one at a time. It would be rather difficult to fit a truly "optimal" number of knots, since the number of possible knot placements quickly explodes.

Generally, this is why people turn towards smoothing splines. Most smoothing splines are cubic, just so you can fool a human eye into missing the discontinuities. It would be quite possible to do a linear smoothing spline, however. The big advantage of smoothing splines is that they have a single parameter to optimize. That allows you to quickly reach a truly "optimal" solution without having to search through gobs of permutations. However, if you really want to seek inflection points, and you have enough data to do so, then something like MARS would probably be your best bet.

Here's some example code for penalized linear smoothing splines in R:

library(mgcv)
data(iris)
# P-spline smooth (bs = 'ps') with k = 6 basis functions; m = 0 requests a linear basis
gam.test <- gam(Sepal.Length ~ s(Petal.Width, k = 6, bs = 'ps', m = 0), data = iris)
summary(gam.test)
plot(gam.test)

However, the actual knots chosen would not necessarily correspond to any true inflection points.

— Shea Parkes

3

I programmed this from scratch a few years ago, and I have a Matlab file for doing piecewise linear regression on my computer. About 1 to 4 break points are computationally feasible for about 20 measurement points or so. 5 or 7 break points start to be really too much.

The pure mathematical approach, as I see it, is to try all possible combinations, as suggested by user mbq in the question linked to in the comment below your question.

Since the fitted lines are all consecutive and adjacent (no overlaps), the combinatorics will follow Pascal's triangle. If there were overlaps between the data points used by the line segments, I believe the combinatorics would follow Stirling numbers of the second kind instead.
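A minimal base R sketch of the brute-force idea (not the Matlab file mentioned above). It assumes two break points, a continuous fit, and break points restricted to interior observed x values; the data are the noise-free values tabulated in the example below. The name rss_for_pair is hypothetical.

```r
x <- 1:28
y <- c(1:10, 9:1, 2:10)

# Candidate break points: interior x values only, so every segment has data
candidates <- x[2:(length(x) - 1)]

# Every unordered pair of candidates, one column per pair;
# their number is the binomial coefficient choose(26, 2) = 325
pairs <- combn(candidates, 2)

# Residual sum of squares of the continuous two-break-point fit for one pair
rss_for_pair <- function(bp) {
  fit <- lm(y ~ x + pmax(x - bp[1], 0) + pmax(x - bp[2], 0))
  sum(residuals(fit)^2)
}

rss <- apply(pairs, 2, rss_for_pair)
best <- pairs[, which.min(rss)]
best  # 10 19
```

The number of fits grows quickly with the number of break points: choose(26, 5) is already 65780, which fits the remark that more than a handful of break points becomes too much.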

The best solution, in my mind, is to choose the combination of fitted lines that has the lowest standard deviation of the R^2 correlation values of the fitted lines. I will try to explain with an example. Keep in mind, though, that asking how many break points one should find in the data is similar to asking "How long is the coast of Britain?", as in one of the papers on fractals by Benoit Mandelbrot (a mathematician). And there is a trade-off between the number of break points and the regression depth.

Now to the example.

Consider the following $x$ and $y$ values:

$\begin{array}{|c|c|c|c|c|c|}
\hline
x & y & R^2 \text{ line 1} & R^2 \text{ line 2} & \text{sum of } R^2 \text{ values} & \text{standard deviation of } R^2 \\ \hline
1 & 1 & 1,000 & 0,0400 & 1,0400 & 0,6788 \\ \hline
2 & 2 & 1,000 & 0,0118 & 1,0118 & 0,6987 \\ \hline
3 & 3 & 1,000 & 0,0004 & 1,0004 & 0,7067 \\ \hline
4 & 4 & 1,000 & 0,0031 & 1,0031 & 0,7048 \\ \hline
5 & 5 & 1,000 & 0,0135 & 1,0135 & 0,6974 \\ \hline
6 & 6 & 1,000 & 0,0238 & 1,0238 & 0,6902 \\ \hline
7 & 7 & 1,000 & 0,0277 & 1,0277 & 0,6874 \\ \hline
8 & 8 & 1,000 & 0,0222 & 1,0222 & 0,6913 \\ \hline
9 & 9 & 1,000 & 0,0093 & 1,0093 & 0,7004 \\ \hline
10 & 10 & 1,000 & -1,978 & 1,000 & 0,7071 \\ \hline
11 & 9 & 0,9709 & 0,0271 & 0,9980 & 0,6673 \\ \hline
12 & 8 & 0,8951 & 0,1139 & 1,0090 & 0,5523 \\ \hline
13 & 7 & 0,7734 & 0,2558 & 1,0292 & 0,3659 \\ \hline
14 & 6 & 0,6134 & 0,4321 & 1,0455 & 0,1281 \\ \hline
15 & 5 & 0,4321 & 0,6134 & 1,0455 & 0,1282 \\ \hline
16 & 4 & 0,2558 & 0,7733 & 1,0291 & 0,3659 \\ \hline
17 & 3 & 0,1139 & 0,8951 & 1,0090 & 0,5523 \\ \hline
18 & 2 & 0,0272 & 0,9708 & 0,9980 & 0,6672 \\ \hline
19 & 1 & 0 & 1,000 & 1,000 & 0,7071 \\ \hline
20 & 2 & 0,0094 & 1,000 & 1,0094 & 0,7004 \\ \hline
21 & 3 & 0,0222 & 1,000 & 1,0222 & 0,6914 \\ \hline
22 & 4 & 0,0278 & 1,000 & 1,0278 & 0,6874 \\ \hline
23 & 5 & 0,0239 & 1,000 & 1,0239 & 0,6902 \\ \hline
24 & 6 & 0,0136 & 1,000 & 1,0136 & 0,6974 \\ \hline
25 & 7 & 0,0032 & 1,000 & 1,0032 & 0,7048 \\ \hline
26 & 8 & 0,0004 & 1,000 & 1,0004 & 0,7068 \\ \hline
27 & 9 & 0,0118 & 1,000 & 1,0118 & 0,6987 \\ \hline
28 & 10 & 0,04 & 1,000 & 1,04 & 0,6788 \\ \hline
\end{array}$

These y values have the graph:

[Figure: idealized data]

This clearly has two break points. For the sake of argument, we will calculate the R^2 correlation values, using the Excel cell formulas (European dot-comma style):

=INDEX(LINEST(B1:$B$1;A1:$A$1;TRUE;TRUE);3;1)
=INDEX(LINEST(B1:$B$28;A1:$A$28;TRUE;TRUE);3;1)

for all possible non-overlapping combinations of two fitted lines. All the possible pairs of R^2 values have the graph:

[Figure: R^2 values]

The question is which pair of R^2 values we should choose, and how we generalize to multiple break points as asked in the title. One choice is to pick the combination for which the sum of the R^2 values is the highest. Plotting this, we get the upper blue curve below:

[Figure: sum of R squared and standard deviation of R squared]

The blue curve, the sum of the R-squared values, is the highest in the middle. This is more clearly visible from the table with the value $1,0455$ as the highest value. However it is my opinion that the minimum of the red curve is more accurate. That is, the minimum of the standard deviation of the R^2 values of the fitted regression lines should be the best choice.
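The table columns can be recomputed with a short base R sketch. It is an illustration, not the author's Excel or Matlab code. It assumes, reading the cell ranges in the two formulas, that at row i line 1 is fitted to points 1..i and line 2 to points i..28, and it uses the fact that R^2 of a simple linear regression equals the squared correlation. Rows 1 and 28 are skipped because one of the lines would have a single point. The names r2 and tab are hypothetical.

```r
x <- 1:28
y <- c(1:10, 9:1, 2:10)
n <- length(x)

# R^2 of a straight-line fit to the points with indices idx
# (for simple linear regression this is the squared correlation)
r2 <- function(idx) cor(x[idx], y[idx])^2

# Split at point i: line 1 uses points 1..i, line 2 uses points i..n
splits <- 2:(n - 1)
tab <- data.frame(
  split    = splits,
  r2_line1 = sapply(splits, function(i) r2(1:i)),
  r2_line2 = sapply(splits, function(i) r2(i:n))
)
tab$r2_sum <- tab$r2_line1 + tab$r2_line2
tab$r2_sd  <- apply(tab[, c("r2_line1", "r2_line2")], 1, sd)

tab[which.max(tab$r2_sum), ]  # split with the largest sum of R^2
tab[which.min(tab$r2_sd), ]   # split with the smallest standard deviation of R^2
```

Up to rounding, the rows of tab should agree with rows 2 to 27 of the table above.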

Piece wise linear regression - Matlab - multiple break points

— Mats Granvik

1

There is a pretty nice algorithm described in Tomé and Miranda (1984).

The proposed methodology uses a least-squares approach to compute the best continuous set of straight lines that fit a given time series, subject to a number of constraints on the minimum distance between breakpoints and on the minimum trend change at each breakpoint.

The code and a GUI are available in both Fortran and IDL from their website: http://www.dfisica.ubi.pt/~artome/linearstep.html

— arkaia

0

First of all, you must do this iteratively, under some information criterion such as AIC, AICc, BIC or Cp, because you can always get an "ideal" fit when the number of knots K equals the number of data points N.

1. Set K = 0: estimate the K + 1 = 1 regression and calculate, for instance, its AICc. Then fix a minimal number of data points per segment, say L = 3 or L = 4.
2. Set K = 1: take the L-th data point as the first candidate knot and calculate SS or the MLE; then move the knot one data point at a time, recalculating SS or the MLE, up to the last candidate knot at data point N - L. Choose the arrangement with the best fit (SS or MLE) and calculate its AICc.
3. Set K = 2: reuse all the previous regressions (that is, their SS or MLE), but divide a single segment, step by step, into all possible parts. Choose the arrangement with the best fit (SS or MLE) and calculate its AICc.
4. Continue in the same way. As soon as the latest AICc is greater than the previous one, stop the iterations.

This is an optimal solution under the AICc criterion.
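A rough base R sketch of this iteration, under several assumptions that the answer leaves open: each segment gets its own separate straight line, Gaussian errors are assumed so that AICc can be computed from the residual sum of squares, and the parameter count p includes two coefficients per segment plus the error variance but not the knot positions. The data are a noisy version of the values in Mats Granvik's answer, with an arbitrary noise level. The names seg_rss, total_rss, aicc and min_pts are hypothetical.

```r
set.seed(1)
x <- 1:28
y <- c(1:10, 9:1, 2:10) + rnorm(28, sd = 0.3)  # noise level is an assumption
n <- length(x)
min_pts <- 3  # minimal number of data points per segment (L in the answer)

# Residual sum of squares of a separate straight line fitted to points from..to
seg_rss <- function(from, to) {
  idx <- from:to
  sum(residuals(lm(y[idx] ~ x[idx]))^2)
}

# Total RSS when the data are cut after each index in `breaks`
total_rss <- function(breaks) {
  from <- c(1, breaks + 1)
  to <- c(breaks, n)
  sum(mapply(seg_rss, from, to))
}

# AICc for Gaussian errors; p = two coefficients per segment plus the variance
aicc <- function(breaks) {
  p <- 2 * (length(breaks) + 1) + 1
  n * log(total_rss(breaks) / n) + 2 * p + 2 * p * (p + 1) / (n - p - 1)
}

breaks <- integer(0)       # K = 0: a single regression
best_aicc <- aicc(breaks)
repeat {
  # Candidate cuts that keep at least min_pts points in every segment
  ok <- function(b) all(diff(c(0, sort(c(breaks, b)), n)) >= min_pts)
  cand <- Filter(ok, setdiff(1:(n - 1), breaks))
  if (length(cand) == 0) break
  scores <- sapply(cand, function(b) aicc(sort(c(breaks, b))))
  if (min(scores) >= best_aicc) break  # AICc no longer improves: stop
  best_aicc <- min(scores)
  breaks <- sort(c(breaks, cand[which.min(scores)]))
}
breaks  # indices after which a new segment starts
```

Because each step only splits an existing segment, this is a greedy search: the result is the best arrangement reachable one split at a time, which need not be the best arrangement over all possible sets of knots.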

— Maciek

AIC and BIC can't be used because they penalise for extra parameters, which is clearly not the case here.

— HelloWorld

0

I once came across a program called Joinpoint. On their website they say it fits a joinpoint model where "several different lines are connected together at the 'joinpoints'". And further: "The user supplies the minimum and maximum number of joinpoints. The program starts with the minimum number of joinpoint (e.g. 0 joinpoints, which is a straight line) and tests whether more joinpoints are statistically significant and must be added to the model (up to that maximum number)."

The NCI uses it for trend modelling of cancer rates, maybe it fits your needs as well.

— psj

0

In order to fit a piecewise function to data:

where $a_1 , a_2 , p_1 , q_1, p_2 , q_2 , p_3 , q_3$ are unknown parameters to be approximately computed, there is a very simple method (not iterative, no initial guess, easy to code in any mathematical programming language). The theory is given on page 29 of the paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf and, from page 30:

For example, with the exact data provided by Mats Granvik the result is:

Without scattered data, this example is not very significant. Other examples with scattered data are shown in the referenced paper.

— JJacquelin

0

You can use the mcp package if you know the number of change points to infer. It gives you great modeling flexibility and a lot of information about the change points and regression parameters, but at the cost of speed.

The mcp website contains many applied examples, e.g.,

library(mcp)

# Define the model
model = list(
  response ~ 1,  # plateau (int_1)
  ~ 0 + time,    # joined slope (time_2) at cp_1
  ~ 1 + time     # disjoined slope (int_3, time_3) at cp_2
)

# Fit it. The `ex_demo` dataset is included in mcp
fit = mcp(model, data = ex_demo)

Then you can visualize:

plot(fit)

Or summarise:

summary(fit)

Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
  1: response ~ 1
  2: response ~ 1 ~ 0 + time
  3: response ~ 1 ~ 1 + time

Population-level parameters:
    name match  sim  mean lower  upper Rhat n.eff
    cp_1    OK 30.0 30.27 23.19 38.760    1   384
    cp_2    OK 70.0 69.78 69.27 70.238    1  5792
   int_1    OK 10.0 10.26  8.82 11.768    1  1480
   int_3    OK  0.0  0.44 -2.49  3.428    1   810
 sigma_1    OK  4.0  4.01  3.43  4.591    1  3852
  time_2    OK  0.5  0.53  0.40  0.662    1   437
  time_3    OK -0.2 -0.22 -0.38 -0.035    1   834

Disclaimer: I am the developer of mcp.

— Jonas Lindeløv

The use of "detect" in the question indicates the number--and even the existence--of changepoints are not known beforehand.

— whuber