Algorithm to dynamically monitor quantiles

24

I want to estimate the quantile of some data. The data are so large that they cannot fit in memory, and they are not static: new data keep arriving. Does anyone know an algorithm to monitor the quantiles of the data observed so far with very limited memory and computation? I found the P2 algorithm useful, but it does not work very well for my data, which are extremely heavy-tailed.

algorithms quantiles

— sinoTrinity

For some ideas (in the context of estimating medians), see the thread at stats.stackexchange.com/q/346/919.

— whuber

3

This question is cross-posted on math.SE.

— cardinal

16

The P2 algorithm is a nice find. It works by making several estimates of the quantile, updating them periodically, and using quadratic (not linear, not cubic) interpolation to estimate the quantile. The authors claim that quadratic interpolation works better in the tails than linear interpolation, and that cubic interpolation would be too fussy and difficult.
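For readers who have not seen it, the five-marker P2 update can be sketched as below. This is a minimal sketch written from the usual description of the algorithm, not the original authors' code; the class name `P2Quantile` and its attribute names are made up for illustration.

```python
class P2Quantile:
    """Sketch of a five-marker P2 estimator for a single quantile p."""

    def __init__(self, p):
        self.p = p
        self.heights = []                                    # marker heights
        self.pos = [1, 2, 3, 4, 5]                           # actual marker positions
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]            # desired-position increments

    def add(self, x):
        q, n = self.heights, self.pos
        if len(q) < 5:                  # bootstrap: keep the first five values sorted
            q.append(x)
            q.sort()
            return
        # Find the cell that contains x, stretching the extreme markers if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):       # markers above the cell shift right by one
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Move each interior marker one position toward its desired position.
        for i in (1, 2, 3):
            d = self.want[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                new = self._parabolic(i, d)
                if not q[i - 1] < new < q[i + 1]:
                    # Fall back to linear interpolation if the parabola overshoots.
                    new = q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
                q[i] = new
                n[i] += d

    def _parabolic(self, i, d):
        q, n = self.heights, self.pos
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def value(self):
        """Current estimate (middle marker), or None before five values are seen."""
        return self.heights[2] if len(self.heights) == 5 else None
```

Feeding values one at a time with `add` and reading `value()` keeps only five heights and ten positions in memory, which is why the method is attractive on small devices.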

You do not state exactly how this approach fails for your "heavy-tailed" data, but it is easy to guess: estimates of extreme quantiles for heavy-tailed distributions will be unstable until a large amount of data has been collected. But this is going to be a problem (to a lesser extent) even if you were to store all the data, so don't expect miracles!

At any rate, why not set auxiliary markers (let's call them $x_0$ and $x_6$) within which you are highly certain the quantile will lie, and store all data that lie between $x_0$ and $x_6$? When your buffer fills you will have to update these markers, always keeping $x_0 \le x_6$. A simple algorithm to do this can be devised from a combination of (a) the current P2 estimate of the quantile and (b) stored counts of the number of data less than $x_0$ and the number of data greater than $x_6$. In this fashion you can, with high certainty, estimate the quantile just as well as if you had the entire dataset always available, but you need only a relatively small buffer.

Specifically, I am proposing a data structure $(k, \mathbf{y}, n)$ to maintain partial information about a sequence of $n$ data values $x_1, x_2, \ldots, x_n$. Here, $\mathbf{y}$ is a linked list

$\mathbf{y} = (x^{(n)}_{[k+1]} \le x^{(n)}_{[k+2]} \le \cdots \le x^{(n)}_{[k+m]}).$

In this notation $x^{(n)}_{[i]}$ denotes the $i^\text{th}$ smallest of the $n$ $x$ values read so far. $m$ is a constant, the size of the buffer $\mathbf{y}$.

The algorithm begins by filling $\mathbf{y}$ with the first $m$ data values encountered and placing them in sorted order, smallest to largest. Let $q$ be the quantile to be estimated; e.g., $q$ = 0.99. Upon reading $x_{n+1}$ there are three possible actions:

If $x_{n+1} \lt x^{(n)}_{[k+1]}$, increment $k$.
If $x_{n+1} \gt x^{(n)}_{[k+m]}$, do nothing.
Otherwise, insert $x_{n+1}$ into $\mathbf{y}$.

In any event, increment $n$.

The insert procedure puts $x_{n+1}$ into $\mathbf{y}$ in sorted order and then eliminates one of the extreme values in $\mathbf{y}$:

If $k + m/2 \lt n q$, then remove $x^{(n)}_{[k+1]}$ from $\mathbf{y}$ and increment $k$;
Otherwise, remove $x^{(n)}_{[k+m]}$ from $\mathbf{y}$.

Provided $m$ is sufficiently large, this procedure will bracket the true quantile of the distribution with high probability. At any stage $n$ it can be estimated in the usual way in terms of $x^{(n)}_{[\lfloor{q n}\rfloor]}$ and $x^{(n)}_{[\lceil{q n}\rceil]}$, which will likely lie in $\mathbf{y}$. (I believe $m$ only has to scale like the square root of the maximum amount of data ($N$), but I have not carried out a rigorous analysis to prove that.) At any rate, the algorithm will detect whether it has succeeded (by comparing $k/n$ and $(k+m)/n$ to $q$).

Testing with up to 100,000 values, using $m = 2\sqrt{N}$ and $q=.5$ (the most difficult case), indicates this algorithm has a 99.5% success rate in obtaining the correct value of $x^{(n)}_{[\lfloor{q n}\rfloor]}$. For a stream of $N=10^{12}$ values, that would require a buffer of only two million (but three or four million would be a better choice). Using a sorted doubly linked list for the buffer requires $O(\log(\sqrt{N})) = O(\log(N))$ effort per insertion, while identifying and deleting the max or min are $O(1)$ operations. The relatively expensive insertion typically needs to be done only $O(\sqrt{N})$ times. Thus the computational costs of this algorithm are $O(N + \sqrt{N} \log(N)) = O(N)$ in time and $O(\sqrt{N})$ in storage.
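The $(k, \mathbf{y}, n)$ structure and its update rules could be sketched as follows. This is a rough illustration under stated assumptions rather than a tuned implementation: a plain Python list with `bisect.insort` stands in for the linked list, the class and method names are hypothetical, $n$ is incremented before the $k + m/2 \lt nq$ test, and the estimate uses the order statistic of rank $\lceil qn \rceil$ only.

```python
import bisect
import math


class BracketedQuantile:
    """Keep m consecutive order statistics that should bracket the q-quantile."""

    def __init__(self, q, m):
        self.q = q
        self.m = m
        self.k = 0    # number of values known to lie below the buffer
        self.n = 0    # number of values read so far
        self.y = []   # sorted buffer holding order statistics k+1 .. k+m

    def add(self, x):
        self.n += 1
        if self.n <= self.m:            # initial fill with the first m values
            bisect.insort(self.y, x)
        elif x < self.y[0]:             # below the buffer: only the count changes
            self.k += 1
        elif x > self.y[-1]:            # above the buffer: nothing to record
            pass
        else:
            bisect.insort(self.y, x)
            if self.k + self.m / 2 < self.n * self.q:
                self.y.pop(0)           # drop the smallest, shift the window up
                self.k += 1
            else:
                self.y.pop()            # drop the largest

    def estimate(self):
        """Order statistic of rank ceil(q*n), or None if it left the buffer."""
        rank = max(1, math.ceil(self.q * self.n))
        i = rank - self.k - 1
        return self.y[i] if 0 <= i < len(self.y) else None
```

A `None` from `estimate()` corresponds to the failure case that the algorithm can detect by itself; a larger `m` makes it less likely.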

— whuber

This is an extension of the P2 algorithm: sim.sagepub.com/content/49/4/159.abstract. The storage is still too much for my application, which runs on small sensors with a total of 10K RAM. I can spare a few hundred bytes at most for quantile estimation.

— sinoTrinity

@whuber Actually I implemented the extended P2 and tested it with samples generated from various distributions, such as uniform and exponential, where it works great. But when I apply it to data from my application, whose distribution is unknown, it sometimes fails to converge and yields a relative error (abs(estimation - actual) / actual) of up to 300%.

— sinoTrinity

2

@sino The quality of the algorithm compared to using all the data shouldn't depend on the heaviness of the tails. A fairer way to measure error is this: let $F$ be the empirical cdf. For an estimate $\hat{q}$ of the $q$ percentile, what is the difference between $F(\hat{q})$ and $F(q)$? If it is on the order of $1/n$ you're doing awfully well. In other words, just what percentile is the P2 algorithm returning for your data?
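One way to compute this rank-based error is sketched below, reading the comparison as the empirical cdf at the estimate versus the target level $q$ (that reading, and the function names, are assumptions made for illustration).

```python
import bisect


def empirical_cdf(sorted_data, x):
    """Fraction of observations that are <= x."""
    return bisect.bisect_right(sorted_data, x) / len(sorted_data)


def rank_error(data, estimate, q):
    """How far, in probability terms, the estimate sits from the level q."""
    return empirical_cdf(sorted(data), estimate) - q
```

With heavy tails, a small rank error can still correspond to a very large relative error in the value itself, because neighbouring order statistics in the tail are far apart.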

— whuber

You are right. I just measured $F(\hat{q})$ and $F(q)$ for the case I mentioned with relative error up to 300%. For $q$ of 0.7, $\hat{q}$ is almost 0.7, resulting in negligible error. However, for $q$ of 0.9, $\hat{q}$ seems to be around 0.95. I guess that's why I have a huge error of up to 300%. Any idea why it's 0.95, not 0.9? BTW, can I post a figure here, and how can I post mathematical formulas as you did?

— sinoTrinity

2

@whuber I'm pretty sure my implementation conforms to the extended P2. 0.9 still goes to 0.95 or even higher when simultaneously estimating the 0.8, 0.85, 0.9, 0.95 quantiles. However, 0.9 gets very close to 0.9 if the 0.8, 0.85, 0.9, 0.95 and 1.0 quantiles are tracked at the same time.

— sinoTrinity

5

If you cannot afford $O(\sqrt N)$ storage or it doesn't work out for some other reason, here is an idea for a different generalization of P2. It's not as detailed as what whuber suggests; it is more of a research idea than a solution.

Instead of tracking the quantiles at $0$ , $p/2$ , $p$ , $(1+p)/2$ , and $1$ , as the original P2 algorithm suggests, you could simply keep track of more quantiles (but still a constant number). It looks like the algorithm allows for that in a very straightforward manner; all you need to do is compute the correct "bucket" for incoming points, and the right way to update the quantiles (quadratically using adjacent numbers).

Say you keep track of $25$ points. You could try tracking the quantile at $0$ , $p/12$ , $\dotsc$ , $p \cdot 11/12$ , $p$ , $p + (1-p)/12$ , $\dotsc$ , $p + 11\cdot(1-p)/12$ , $1$ (picking the points equidistantly in between $0$ and $p$ , and between $p$ and $1$ ), or even using $22$ Chebyshev nodes of the form $p/2 \cdot (1 + \cos \frac{(2 i - 1)\pi}{22})$ and $p + (1 - p)/2 \cdot (1 + \cos \frac{(2i-1)\pi}{22})$ . If $p$ is close to $0$ or $1$ , you could try putting fewer points on the side where there is less probability mass and more on the other side.
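The two marker layouts just described can be generated as follows. This only computes the target quantile levels for the 25 markers; it does not implement the modified P2 update itself, and the function names are hypothetical.

```python
import math


def equidistant_markers(p, per_side=12):
    """0, p/12, ..., 11p/12, p, p + (1-p)/12, ..., p + 11(1-p)/12, 1."""
    lower = [p * i / per_side for i in range(per_side)]
    upper = [p + (1 - p) * i / per_side for i in range(per_side)]
    return lower + upper + [1.0]


def chebyshev_markers(p, per_side=11):
    """0, p and 1 plus Chebyshev nodes mapped onto (0, p) and (p, 1)."""
    nodes = [
        (1 + math.cos((2 * i - 1) * math.pi / (2 * per_side))) / 2
        for i in range(1, per_side + 1)
    ]
    lower = sorted(p * t for t in nodes)
    upper = sorted(p + (1 - p) * t for t in nodes)
    return [0.0] + lower + [p] + upper + [1.0]
```

Both functions return 25 levels in increasing order; the Chebyshev layout clusters markers near $0$, $p$ and $1$.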

If you decide to pursue this, I (and possibly others on this site) would be interested in knowing if it works...

— Erik P.

+1 I think this is a great idea given the OP's constraints. All one can hope for is an approximation, so the trick is to pick bins that have a high likelihood of being narrow and containing the desired quantile.

— whuber

3

Press et al., Numerical Recipes 8.5.2 "Single-pass estimation of arbitrary quantiles" p. 435, give a C++ class IQAgent which updates a piecewise-linear approximate cdf.

— denis

books.google.com/… for a version that doesn't require Flash.

— ZachB

2

This can be adapted from algorithms that determine the median of a dataset online. For more information, see this stackoverflow post - /programming/1387497/find-median-value-from-a-growing-set

— benhamner

The computational resources required of the algorithm you link to are unnecessarily large and do not meet the requirements of this question.

— whuber

2

I'd look at quantile regression. You can use it to determine a parametric estimate of whichever quantiles you want to look at. It makes no assumption regarding normality, so it handles heteroskedasticity pretty well and can be used on a rolling-window basis. It's basically an L1-norm penalized regression, so it's not too numerically intensive, and there are pretty full-featured R, SAS, and SPSS packages plus a few MATLAB implementations out there. Here are the main and the R package wikis for more info.

Edited:

Check out the math stack exchange crosslink: someone cited a couple of papers that essentially lay out the very simple idea of just using a rolling window of order statistics to estimate quantiles. Literally all you have to do is sort the values from smallest to largest, select which quantile you want, and select the highest value within that quantile. You can obviously give more weight to the most recent observations if you believe they are more representative of actual current conditions. This will probably give rough estimates, but it's fairly simple to do and you don't have to go through the motions of quantitative heavy lifting. Just a thought.
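A rolling-window version of this idea might look like the sketch below. The class name, the fixed window length, and the choice of the order statistic of rank $\lceil q \cdot w \rceil$ as "the highest value within that quantile" are assumptions made for illustration; no recency weighting is applied.

```python
import bisect
import math
from collections import deque


class RollingQuantile:
    """Quantile of the most recent `window` values, kept in sorted order."""

    def __init__(self, q, window):
        self.q = q
        self.window = window
        self.recent = deque()   # values in arrival order
        self.ordered = []       # the same values, sorted

    def add(self, x):
        self.recent.append(x)
        bisect.insort(self.ordered, x)
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            # Remove one copy of the expired value from the sorted list.
            del self.ordered[bisect.bisect_left(self.ordered, old)]

    def value(self):
        rank = max(1, math.ceil(self.q * len(self.ordered)))
        return self.ordered[rank - 1]
```

Memory use is proportional to the window length, so on a very small device the window has to be short and the estimate of an extreme quantile will be correspondingly rough.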

— Marc

1

It is possible to estimate (and track) quantiles on an on-line basis (the same applies to the parameters of a quantile regression). In essence, this boils down to stochastic gradient descent on the check-loss function which defines quantile-regression (quantiles being represented by a model containing only an intercept), e.g. updating the unknown parameters as and when observations arrive.
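As a rough illustration of the intercept-only case, a constant-step subgradient update on the check loss can be written in a few lines. This is a generic sketch, not the method of any particular paper; the function name, the fixed `step`, and the starting value are assumptions, and the step has to be matched to the scale of the data.

```python
def sgd_quantile(stream, q, step=0.05, start=0.0):
    """Track the q-quantile of a stream with one number of state."""
    est = start
    for x in stream:
        # The check loss rho_q(x - est) has subgradient -q with respect to est
        # when x > est and (1 - q) when x < est; step against it.
        if x > est:
            est += step * q
        elif x < est:
            est -= step * (1 - q)
    return est
```

With a constant step the estimate never settles exactly; it fluctuates around the quantile with an amplitude that shrinks as `step` is reduced, at the price of slower tracking.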

See the Bell Labs paper "Incremental Quantile Estimation for Massive Tracking" ( ftp://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/kdd/p516-chen.pdf)

— Ludo

0

Another important algorithm is M. Greenwald and S. Khanna 2004 - Space-efficient online computation of quantile summaries.

— Quartz