Clustering is one of the significant data mining techniques. With the expansion and digitalization of every field, large datasets are being generated rapidly. Clustering such large datasets is a challenge for traditional sequential clustering algorithms because of their long processing time. Distributed parallel architectures and algorithms therefore help meet the performance and scalability requirements of clustering large datasets. In this study, we design and evaluate a parallel k-means algorithm using the MapReduce programming model and compare its results with those of sequential k-means for clustering document datasets of varying size. The results demonstrate that the proposed parallel k-means achieves higher performance than sequential k-means when clustering documents.
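The way k-means maps onto the MapReduce model can be illustrated with a minimal single-process sketch. It is not the implementation evaluated in the study: the function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`), the use of Euclidean distance, and the representation of documents as small numeric vectors are all assumptions made for illustration. The map step assigns each vector to its nearest centroid, and the reduce step averages the vectors assigned to each centroid.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, (point, 1)) for every point."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield idx, (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum vectors and counts per centroid index, then average."""
    sums = {}
    counts = defaultdict(int)
    for idx, (p, n) in pairs:
        if idx not in sums:
            sums[idx] = list(p)
        else:
            sums[idx] = [a + b for a, b in zip(sums[idx], p)]
        counts[idx] += n
    # Assumption: a centroid that received no points keeps its old position.
    updated = list(centroids)
    for idx, total in sums.items():
        updated[idx] = tuple(s / counts[idx] for s in total)
    return updated


def kmeans_mapreduce(points, centroids, max_iter=20):
    """Repeat map and reduce rounds until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical toy vectors standing in for document feature vectors.
    docs = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_mapreduce(docs, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real MapReduce deployment the map calls would run in parallel over partitions of the dataset and the reduce calls would run per centroid key, with one job launched per iteration; the sketch above only mirrors that data flow sequentially.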
Sardar, Tanvir Habib and Ansari, Zahid
"An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm,"
Future Computing and Informatics Journal: Vol. 3: Iss. 2, Article 7.
Available at: https://digitalcommons.aaru.edu.jo/fcij/vol3/iss2/7