Journal of Statistics Applications & Probability


In the era of the data revolution, the availability of data is a great asset that should be exploited. Instead of conducting new surveys, researchers can benefit from data that already exist. As enormous amounts of data become available, it is becoming essential to undertake research that integrates data from multiple sources in order to make the best use of them. Statistical Data Integration (SDI) is the statistical tool for addressing this issue. Depending on the input data, SDI can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units. The appropriate method of data integration is determined by the nature of the input data. SDI has two main methods: Record Linkage (RL) and Statistical Matching (SM). SM techniques typically aim to construct a complete data file from different sources that do not contain the same units. A number of traditional matching techniques are described in the literature; many of them address continuous data, but far fewer address categorical data. This paper proposes a Statistical Matching technique for categorical data based on latent class models within a Bayesian framework. Throughout the paper, Statistical Matching is carried out with the Dirichlet Process Mixture of Products of Multinomial distributions model, a fully Bayesian estimation method for latent class models. The performance of the proposed latent class model for Statistical Matching is evaluated through an empirical comparison with several existing matching procedures, based on simulation studies.
