Crash course in robust estimation of the mean


15

I have a bunch (around 1000) of estimates, and they are all supposed to be estimates of long-run elasticity. Slightly more than half of them were estimated using method A and the rest using method B. Somewhere I read something like "I think method B estimates something very different from method A, because the estimates are much higher (by 50-60%)". My knowledge of robust statistics is practically nil, so I just computed the sample means and medians of both samples ... and I immediately noticed the difference. Method A is very concentrated, the difference between median and mean is very small, but the method B sample varies wildly.

I concluded that outliers and measurement errors had distorted the method B sample, so I threw away about 50 values (about 15%) that were very inconsistent with theory ... and suddenly the means of both samples (including their CIs) were very similar. The density plots as well.

(In my hunt for outliers, I looked at the range of sample A and removed all sample points in B that fell outside it.) I would like you to tell me where I could learn some basics of robust estimation of means that would allow me to judge this situation more rigorously. And to get some references. I don't need a very deep understanding of the various techniques, rather something like a comprehensive survey of robust estimation methodology.

I t-tested for the significance of the mean difference after removing the outliers, and the p-value is 0.0559 (t around 1.9); for the full samples the t statistic was around 4.5. But that is not really the point: the means can be a bit different, but they should not differ by 50-60% as stated above. And I don't think they do.
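For reference, the t statistic mentioned above can be computed by hand with Welch's formula (which does not assume equal variances). This is only a sketch on made-up numbers; `a` and `b` below are hypothetical samples, not the actual elasticity estimates:

```python
# Welch's t statistic for comparing two sample means (hypothetical data).
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

a = [1.0, 1.1, 0.9, 1.05, 0.95]   # made-up "method A" values
b = [1.5, 1.6, 1.4, 1.55, 1.45]   # made-up "method B" values
print(welch_t(a, b))  # negative here because mean(a) < mean(b)
```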


3
What's your intended analysis using this data? The practice of removing outliers is of dubious statistical credibility: you can "make data" give significance or lack of significance at any level by doing that. Are populations A and B, which received measurements using methods A and B, truly homogeneous populations, or is it possible that your methods have just given you different populations?
AdamO

There will be no further calculations or analysis to be done with the data. Both of the methods mentioned are consistent, according to recent research, so the populations should be homogeneous; but the data is not of great quality, and it is clear some of the values in B are there by mistake (the method is error-prone); they make absolutely no economic sense. I know the removal is dubious, that is why I am looking for something more rigorous and credible.
Ondrej

Answers:


18

Are you looking for the theory, or something practical?

If you are looking for books, here are some that I found helpful:

  • F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, 1986.

  • P.J. Huber, Robust Statistics, John Wiley & Sons, 1981.

  • P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, 1987.

  • R.G. Staudte, S.J. Sheather, Robust Estimation and Testing, John Wiley & Sons, 1990.

If you are looking for practical methods, here are a few robust methods of estimating the mean ("estimators of location" is, I guess, the more principled term):

  • The median is simple, well-known, and pretty powerful. It has excellent robustness to outliers. The "price" of robustness is about 25%.

  • The 5%-trimmed average is another possible method. Here you throw away the 5% highest and 5% lowest values, and then take the mean (average) of the result. This is less robust to outliers: as long as no more than 5% of your data points are corrupted, it is good, but if more than 5% are corrupted, it suddenly turns awful (it doesn't degrade gracefully). The "price" of robustness is less than the median, though I don't know what it is exactly.

  • The Hodges-Lehmann estimator computes the median of the set $\{(x_i+x_j)/2 : 1 \le i \le j \le n\}$ (a set containing $n(n+1)/2$ values), where $x_1, \dots, x_n$ are the observations. This has very good robustness: it can handle corruption of up to about 29% of the data points without totally falling apart. And the "price" of robustness is low: about 5%. It is a plausible alternative to the median.

  • The interquartile mean is another estimator that is sometimes used. It computes the mean of the observations lying between the first and third quartiles (i.e., a 25%-trimmed mean), and thus is simple to compute. It has very good robustness: it can tolerate corruption of up to 25% of the data points. However, the "price" of robustness is non-trivial: about 25%. As a result, this seems inferior to the median.

  • There are many other measures that have been proposed, but the ones above seem reasonable.
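For illustration, here is a minimal pure-Python sketch of some of these estimators applied to a hypothetical contaminated sample (the data and the 5% contamination level are entirely made up):

```python
# Robust location estimators on a made-up contaminated sample.
from itertools import combinations_with_replacement
from statistics import mean, median

def trimmed_mean(xs, prop=0.05):
    """Mean after discarding the lowest and highest `prop` fraction of values."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return mean(xs[k:len(xs) - k])

def hodges_lehmann(xs):
    """Median of all pairwise averages (x_i + x_j)/2 for i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))

# 95 "good" points near 1.5 plus 5 gross outliers (5% corruption).
data = [1.0 + 0.01 * i for i in range(95)] + [50.0] * 5

print(mean(data))            # dragged far upward by the outliers
print(median(data))          # barely affected
print(trimmed_mean(data))    # 5% trimming removes exactly the bad points here
print(hodges_lehmann(data))  # also barely affected
```

The plain mean lands near 3.9 while all three robust estimators stay near 1.5, the center of the uncorrupted points.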

In short, I would suggest the median or possibly the Hodges-Lehmann estimator.

P.S. Oh, I should explain what I mean by the "price" of robustness. A robust estimator is designed to still work decently well even if some of your data points have been corrupted or are otherwise outliers. But what if you use a robust estimator on a data set that has no outliers and no corruption? Ideally, we'd like the robust estimator to be as efficient at making use of the data as possible. Here we can measure the efficiency by the standard error (intuitively, the typical amount of error in the estimate produced by the estimator). It is known that if your observations come from a Gaussian distribution (iid), and if you know you won't need robustness, then the mean is optimal: it has the smallest possible estimation error. The "price" of robustness, above, is how much the standard error increases if we apply a particular robust estimator to this situation. A price of robustness of 25% for the median means that the size of the typical estimation error with the median will be about 25% larger than the size of the typical estimation error with the mean. Obviously, the lower the "price" is, the better.
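The ~25% price of the median can be seen directly in a small Monte Carlo sketch (the sample size, repetition count, and seed below are arbitrary choices):

```python
# Monte Carlo illustration of the ~25% "price" of the median under
# clean iid Gaussian data (no outliers, no corruption).
import random
from statistics import mean, median, stdev

random.seed(0)
n, reps = 100, 2000
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(mean(sample))
    medians.append(median(sample))

se_mean = stdev(means)      # theory: 1/sqrt(n) = 0.10
se_median = stdev(medians)  # theory: sqrt(pi/2)/sqrt(n), about 0.125
print(se_median / se_mean)  # roughly 1.25, i.e. a ~25% larger typical error
```

The asymptotic ratio is $\sqrt{\pi/2} \approx 1.253$, which matches the "about 25%" figure above.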


I often see the H-L estimator defined as the median of the $n(n+1)/2$ values $(x_i+x_j)/2$ for $1 \le i \le j \le n$. I.e., the diagonal is included. To my knowledge, that's also how it is defined in, e.g., R's wilcox.test(..., conf.int=TRUE) function. Do you have sources for the definition where the diagonal is left out?
caracal

+1, this is really excellent. I have one nitpick, however: I would not use the phrase "error term" in your last paragraph, as it is often used to mean something else; I would use 'standard error of the sampling distribution', or just 'standard error', instead.
gung - Reinstate Monica

A very well structured and concise answer, thank you! An overview is what I needed; I will read through the paper suggested by Henrik and that should cover it. For long summer night entertainment, I will be sure to check out the books suggested by you and jbowman.
Ondrej

@caracal, you are correct. My characterization of the H-L estimator was incorrect. Thanks for the correction. I've updated my answer accordingly.
D.W.

Thanks, @gung! I've edited the answer to use 'standard error' as you suggest.
D.W.

7

If you like something short and easy to digest, then have a look at the following paper from the psychological literature:

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63(7), 591–601. doi:10.1037/0003-066X.63.7.591

They mainly rely on the books by Rand R Wilcox (which are admittedly also not too mathematical):

Wilcox, R. R. (2001). Fundamentals of modern statistical methods : substantially improving power and accuracy. New York; Berlin: Springer.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. Amsterdam; Boston: Academic Press.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing. Academic Press.


5

One book that combines theory with practice pretty well is Robust Statistical Methods with R, by Jurečková and Picek. I also like Robust Statistics, by Maronna et al. Both of these may have more math than you'd care for, however. For a more applied tutorial focused on R, this BelVenTutorial pdf may help.


Ah, prof. Jurečková — a teacher at our university, what are the odds. I will check both of the books. Though I was looking for a more... brief document (since this problem is very marginal for me), it does not hurt to delve into it a little deeper. Thanks!
Ondrej

1
It's a small world! Well, at least I corrected the spelling by copying from your comment...
jbowman
Licensed under cc by-sa 3.0 with attribution required.