Title variants
Publication languages
Abstracts
We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented. (original abstract)
Journal
Year
Volume
Issue
Pages
144-158
Physical description
Authors
author
- University of Maryland, USA
Bibliography
- BERK, R. H., JONES, D. H., (1978). Relatively optimal combinations of test statistics. Scand. J. Statist., 5(3), pp. 158-162.
- BICKEL, P. J., FREEDMAN, D. A., (1981). Some asymptotic theory for the bootstrap. Ann. Statist., 9(6), pp. 1196-1217.
- BICKEL, P. J., KRIEGER, A. M., (1989). Confidence bands for a distribution function using the bootstrap. J. Amer. Statist. Assoc., 84(405), pp. 95-100.
- BRESLOW, N. E., CHATTERJEE, N., (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(4), pp. 457-468.
- BRESLOW, N. E., LUMLEY, T., BALLANTYNE, C., CHAMBLESS, L., KULICH, M., (2009). Using the whole cohort in the analysis of case-cohort data. American J. Epidemiol., 169, pp. 1398-1405.
- BRETH, M., (1978). Bayesian confidence bands for a distribution function. Ann. Statist., 6(3), pp. 649-657.
- BRICK, J. M., DIPKO, S., PRESSER, S., TUCKER, C., YUAN, Y., (2006). Nonresponse bias in a dual frame sample of cell and landline numbers. The Public Opinion Quarterly, 70(5), pp. 780-793.
- CERVANTES, I., JONES, M., ROJAS, L., BRICK, J., KURATA, J., GRANT, D., (2006). A review of the sample design for the California Health Interview Survey. In Proceedings of the Social Statistics Section, American Statistical Association, pp. 3023-3030.
- CHATTERJEE, N., CHEN, Y.-H., MAAS, P., CARROLL, R. J., (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Amer. Statist. Assoc., 111(513), pp. 107-117.
- CHENG, R. C. H., ILES, T. C., (1983). Confidence bands for cumulative distribution functions of continuous random variables. Technometrics, 25(1), pp. 77-86.
- COX, D. R., (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B, 34, pp. 187-220.
- D'ANGIO, G. J., BRESLOW, N., BECKWITH, J. B., EVANS, A., BAUM, H., DELORIMIER, A., FERNBACH, D., HRABOVSKY, E., JONES, B., KELALIS, P., (1989). Treatment of Wilms' tumor. Results of the Third National Wilms' Tumor Study. Cancer, 64(2), pp. 349-360.
- DVORETZKY, A., KIEFER, J., WOLFOWITZ, J., (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist., 27, pp. 642-669.
- FREY, J., (2008). Optimal distribution-free confidence bands for a distribution function. J. Statist. Plann. Inference, 138(10), pp. 3086-3098.
- GINÉ, E., NICKL, R., (2016). Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics, [40]. Cambridge University Press, New York.
- HARTLEY, H. O., (1962). Multiple frame surveys. In Proceedings of the Social Statistics Section, American Statistical Association, pp. 203-206.
- HARTLEY, H. O., (1974). Multiple frame methodology and selected applications. Sankhyā Ser. C, 36, pp. 99-118.
- HU, S. S., BALLUZ, L., BATTAGLIA, M. P., FRANKEL, M. R., (2011). Improving public health surveillance using a dual-frame survey of landline and cell phone numbers. American Journal of Epidemiology, 173(6), pp. 703-711.
- KANOFSKY, P., SRINIVASAN, R., (1972). An approach to the construction of parametric confidence bands on cumulative distribution functions. Biometrika, 59, pp. 623-631.
- KEIDING, N., LOUIS, T. A., (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), pp. 319-376.
- KOLMOGOROV, A. N., (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, pp. 83-91.
- MASSART, P., (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab., 18(3), pp. 1269-1283.
- METCALF, P., SCOTT, A., (2009). Using multiple frames in health surveys. Statistics in Medicine, 28(10), pp. 1512-1523.
- OWEN, A. B., (1995). Nonparametric likelihood confidence bands for a distribution function. J. Amer. Statist. Assoc., 90(430), pp. 516-521.
- SAEGUSA, T., (2019). Large sample theory for merged data from multiple sources. Ann. Statist., 47(3), pp. 1585-1615.
- SAEGUSA, T., WELLNER, J. A., (2013). Weighted likelihood estimation under two-phase sampling. Ann. Statist., 41(1), pp. 269-295.
- SCHAFER, R. E., ANGUS, J. E., (1979). Estimation of Weibull quantiles with minimum error in the distribution function. Technometrics, 21(3), pp. 367-370.
- SMIRNOV, N. V., (1944). Approximate laws of distribution of random variables from empirical data. Uspehi Matem. Nauk, 10, pp. 179-206.
- TSIRELSON, V. S., (1975). The density of the distribution of the maximum of a Gaussian process. Theory of Probability and its Applications, 20, pp. 847-865.
- WANG, J., CHENG, F., YANG, L., (2013). Smooth simultaneous confidence bands for cumulative distribution functions. J. Nonparametr. Stat., 25(2), pp. 395-407.
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.ekon-element-000171624032