Future Computing and Informatics Journal

Partition based clustering of large datasets using MapReduce framework: An analysis of recent themes and directions

Tanvir Habib Sardar, Computer Science and Engineering, P.A. College of Engineering, Mangalore, India
Zahid Ansari, Computer Science and Engineering, P.A. College of Engineering, Mangalore, India

Abstract

Data clustering is a fundamental technique in scientific analysis and data mining that describes a dataset according to similarities among its objects. Partition based clustering algorithms are the most popular and widely used clustering techniques. In this information era, the digitization of every field has made huge volumes of data available to data analysts. The rapid growth of such datasets has made decade-old computing platforms, programming paradigms, and clustering algorithms inadequate for extracting knowledge from them. To cluster such large datasets, the Hadoop distributed platform, the MapReduce programming paradigm, and modified clustering algorithms are used to reduce computational time by distributing the clustering job across multiple computing nodes. This paper provides a comprehensive review of Hadoop and MapReduce and their components, and surveys recent research on partition based clustering algorithms that use MapReduce as their programming paradigm. In many recent works, traditional partition based clustering algorithms such as K-means, K-prototypes, K-medoids, K-modes, and Fuzzy C-means are modified for the MapReduce paradigm to meet different clustering objectives on different datasets while reducing computational time. The contributions of this paper are (1) an overview of the challenges of clustering real-world large datasets and the role of the MapReduce programming paradigm and its supporting platforms in dealing with those challenges for several tasks on different datasets, and (2) a review of recent works on partition based clustering using the MapReduce paradigm, covering different clustering objectives, datasets, and strategies.
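
To make the idea of distributing a clustering job concrete, the following is a minimal sketch of one K-means iteration written as a map step and a reduce step. It is an illustration only, not the method of any specific work surveyed in the paper: it runs in memory in plain Python instead of on Hadoop, and the names `map_phase`, `reduce_phase`, and `kmeans_iteration` are hypothetical. Each inner list in `splits` stands in for the data block a single node would process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(
    split: Iterable[Point], centroids: List[Point]
) -> Iterator[Tuple[int, Tuple[Point, int]]]:
    """Map: emit (cluster index, (point, 1)) for every point in one split."""
    for point in split:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(
    pairs: Iterable[Tuple[int, Tuple[Point, int]]]
) -> Dict[int, Point]:
    """Reduce: sum points and counts per cluster, then average them."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for key, (point, n) in pairs:
        if key not in sums:
            sums[key] = [0.0] * len(point)
        for d, value in enumerate(point):
            sums[key][d] += value
        counts[key] += n
    return {k: tuple(s / counts[k] for s in sums[k]) for k in sums}


def kmeans_iteration(
    splits: List[List[Point]], centroids: List[Point]
) -> List[Point]:
    """One iteration: map every split, gather the pairs, reduce per cluster."""
    # On a real cluster the map calls run in parallel on separate nodes and
    # the framework groups the emitted pairs by key before the reduce step.
    pairs = [kv for split in splits for kv in map_phase(split, centroids)]
    updated = reduce_phase(pairs)
    # A cluster that received no points keeps its previous centroid.
    return [updated.get(i, c) for i, c in enumerate(centroids)]
```

A full run would repeat `kmeans_iteration` until the centroids stop moving. The other algorithms named in the abstract would change the distance measure and the way a cluster representative is recomputed in the reduce step.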

Recommended Citation

Sardar, Tanvir Habib and Ansari, Zahid (2018) "Partition based clustering of large datasets using MapReduce framework: An analysis of recent themes and directions," Future Computing and Informatics Journal: Vol. 3: Iss. 2, Article 10.
Available at: https://digitalcommons.aaru.edu.jo/fcij/vol3/iss2/10

Included in

Computer Engineering Commons
