Future Computing and Informatics Journal

Abstract

Data clustering is a fundamental technique in scientific analysis and data mining that describes a dataset according to the similarities among its objects. Partition-based clustering algorithms are the most popular and widely used clustering techniques. In the current information era, the digitization of every field has made huge volumes of data available to data analysts. The rapid growth of such datasets has made decade-old computing platforms, programming paradigms, and clustering algorithms inadequate for extracting knowledge from them. To cluster such large datasets, the Hadoop distributed platform, the MapReduce programming paradigm, and modified clustering algorithms are used to reduce computational time by distributing the clustering job across multiple computing nodes. This paper provides a comprehensive review of Hadoop and MapReduce and their components, and surveys recent research on partition-based clustering algorithms that use MapReduce as their programming paradigm. In many recent works, traditional partition-based clustering algorithms such as K-means, K-prototypes, K-medoids, K-modes, and Fuzzy C-means are adapted to the MapReduce paradigm to meet different clustering objectives on different datasets while reducing computational time. The contributions of this paper are (1) an overview of the challenges of clustering large real-world datasets and of the role that the MapReduce programming paradigm and its supporting platforms play in addressing those challenges across tasks and datasets, and (2) a review of recent work on partition-based clustering with MapReduce, covering different clustering objectives, datasets, and strategies.
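To make the idea of distributing a partition-based clustering job concrete, the following is a minimal single-machine sketch of one K-means iteration written in map/reduce style. It is an illustration under stated assumptions, not an implementation taken from any of the surveyed works: the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical, the distance is assumed to be squared Euclidean, and no Hadoop API is used. In an actual deployment the map and reduce steps would run on separate nodes.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Mapper (hypothetical): emit (cluster index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reducer (hypothetical): average the points assigned to each cluster."""
    sums = {}
    counts = defaultdict(int)
    for key, point in pairs:
        if key not in sums:
            sums[key] = list(point)
        else:
            sums[key] = [s + p for s, p in zip(sums[key], point)]
        counts[key] += 1
    new_centroids = []
    for i, old in enumerate(centroids):
        if counts[i] == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
        else:
            new_centroids.append(tuple(s / counts[i] for s in sums[i]))
    return new_centroids


def kmeans_iteration(points, centroids):
    """One K-means iteration: map each point to a cluster, then reduce to centroids."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

Calling `kmeans_iteration` repeatedly, feeding each result back in as the new centroids until they stop changing, gives the usual K-means loop; each call corresponds to one MapReduce job in this simplified picture.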
