•  
  •  
 
Information Sciences Letters

Information Sciences Letters

Abstract

Text classification (TC) is a crucial subject. The number of digital files available on the internet is enormous. The goal of TC is to categorize texts into a series of predetermined groups. The number of studies conducted on the English database is significantly higher than the number of studies conducted on the Arabic database. Therefore, this research analyzes the performance of automatic TC of the Arabic language using Machine Learning (ML) approaches. Further, Single-label Arabic News Articles Datasets (SANAD) are introduced, which contain three different datasets, namely Akhbarona, Khaleej, and Arabiya. Initially, the collected texts are pre-processed in which tokenization and stemming occur. In this research, three kinds of stemming are employed, namely light stemming, Khoja stemming, and no- stemming, to evaluate the effect of the pre-processing technique on Arabic TC performance. Moreover, feature extraction and feature weighting are performed; in feature weighting, the term weighting process is completed by the term frequency- inverse document frequency (tf-idf) method. In addition, this research selects C4.5, Support Vector Machine (SVM), and Naïve Bayes (NB) as a classification algorithm. The results indicated that the SVM and NB methods had attained higher accuracy than the C4.5 method. NB achieved the maximum accuracy with a performance of 99.9%.

Share

COinS