Controlling the Effect of Multiple Testing in Big Data

Denkowska, Sabina

doi:10.15611/me.2014.10.01.

Article - details

Journal

Mathematical Economics

2014 | no. 10(17) | pp. 5-16

Article title

Controlling the Effect of Multiple Testing in Big Data

Authors

Sabina Denkowska

Content

Full texts:

http://dx.doi.org/10.15611/me.2014.10.01 [remote]

Abstract

Big Data poses a new challenge to statistical data analysis. The enormous growth of available data and its multidimensionality challenge the usefulness of classical methods of analysis. One of the most important stages in Big Data analysis is the verification of hypotheses and conclusions. As the number of hypotheses grows, each of them tested at significance level α, the risk of erroneously rejecting true null hypotheses increases. Big Data analysts often deal with sets consisting of thousands, or even hundreds of thousands, of inferences. FWER-controlling procedures, recommended by Tukey [1953], are effective only for small families of inferences. For the large families of inferences typical of Big Data analyses it is better to control the FDR, that is, the expected value of the fraction of erroneous rejections among all rejections. The paper presents marginal multiple testing procedures that control the FDR, as well as an interesting alternative: the joint multiple testing procedure MTP based on resampling [Dudoit, van der Laan 2008]. Its obvious advantages are a wide range of applications, the possibility of choosing the Type I error rate, and easily accessible software (the MTP procedure is implemented in the R multtest package). Unfortunately, the analysis of the MTP procedure by Werft and Benner [2009] revealed problems with controlling the FDR for large sets of hypotheses and small samples. The paper presents a simulation experiment conducted to investigate potential restrictions of the MTP procedure for large numbers of inferences and large sample sizes, which is typical of Big Data analyses. The experiment revealed that, regardless of the sample size, problems with controlling the FDR occur when multiple testing procedures based on minima of unadjusted p-values are applied. Moreover, the experiment indicated serious instability of the MTP results (dependent on the number of bootstrap samplings) when procedures based on minima of unadjusted p-values are used. The experiment described in the paper and the results obtained by Werft, Benner [2009] and Denkowska [2013] indicate the need for further research on the MTP procedure. (original abstract)
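
Editorial illustration (not part of the original abstract): the two ideas the abstract relies on can be sketched in a few lines of Python. The first function shows how quickly the chance of at least one false rejection grows when m true null hypotheses are each tested at level α, under the simplifying assumption that the tests are independent. The second is a minimal sketch of the Benjamini-Hochberg [1995] step-up procedure, one well-known marginal procedure that controls the FDR for independent test statistics; it is not claimed to be the exact set of procedures compared in the paper. The function names and the example p-values are hypothetical.

```python
def fwer_independent(alpha, m):
    """Probability of at least one false rejection when m true null
    hypotheses are tested independently, each at level alpha.
    (Independence is an assumption made for this illustration.)"""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    Returns a list of booleans, True where the corresponding
    hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(fwer_independent(0.05, 100))
    print(benjamini_hochberg(p, q=0.05))
```

Because the procedure is step-up, a hypothesis can be rejected even when its own p-value exceeds its own threshold, as long as some larger ordered p-value meets its threshold. For thousands of inferences one would normally rely on tested library code rather than this sketch.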

Keywords

Big Data, Controlling, Bootstrap methods

Journal

Mathematical Economics

Year

2014

Issue

no. 10(17)

Pages

5-16

Creators

author

Sabina Denkowska

Uniwersytet Ekonomiczny w Krakowie

Bibliography

Benjamini Y. (2001). False Discovery Rate in Large Multiplicity Problem. www.math.tau.ac.il/~ybenja/Temple.ppt (6.12.2014).
Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Ser. B. 57 (1). Pp. 289-300.
Benjamini Y., Hochberg Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Behav. Educ. Statist. Vol. 25. Pp. 60-83.
Benjamini Y., Krieger A.M., Yekutieli D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika. Vol. 93. Pp. 491-507.
Benjamini Y., Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29. Pp. 1165-1188.
Denkowska S. (2013). Non classical procedures of multiple testing (Nieklasyczne procedury testowań wielokrotnych). Przegląd Statystyczny. Z. 4. Pp. 461-476.
Dudoit S., Gilbert H.N., van der Laan M. (2008). Resampling-based empirical Bayes multiple testing procedures for controlling generalized tail probability and expected value error rates: focus on the false discovery rate and simulation study. www.ncbi.nlm.nih.gov/pubmed/18932138.
Dudoit S., van der Laan M. (2008). Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics.
Efron B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.
Hochberg Y., Tamhane A.C. (1987). Multiple Comparison Procedures. John Wiley & Sons. New York.
Tukey J.W. (1953). The problem of multiple comparisons. In: H.I. Braun (1994). The Collected Works of John W. Tukey. Vol. VIII: Multiple Comparisons: 1948-1983. Chapman & Hall. New York. Pp. 1-300.
Westfall P. H., Young S.S. (1993). Resampling Based Multiple Testing. Wiley. New York.
Werft W., Benner A. (2009). www.iscb2009.info/RSystem/Soubory/Prez%20 Monday/S10.4%20Werft.pdf
Yekutieli D. (2008a). Comments on: Control of the false discovery rate under dependence using the bootstrap and subsampling. Test 17 (3). Pp. 458-460.
Yekutieli D. (2008b). False discovery rate control for non-positively regression dependent test statistics. Journal of Statistical Planning and Inference 138 (2). Pp. 405-415.

Document type

Bibliography

Identifiers

DOI

10.15611/me.2014.10.01.

YADDA identifier

bwmeta1.element.ekon-element-000171393303
