Journal of Statistics Applications & Probability


Text mining is the technique of obtaining high-quality information from text. In recent years, text mining has been widely applied in multimedia, biomedicine, patent analysis, anti-spam filtering of emails, linguistic profiling, and opinion mining. To extract useful patterns from text, tasks such as text preprocessing, feature extraction, pattern discovery, and evaluation are performed on it. The proposed work develops an efficient and effective classification algorithm for textual databases. The algorithm helps to evaluate descriptive answers collected from learners and eliminates the discrepancies of manual evaluation. The implemented framework preprocesses the documents in two steps. Initially, the documents are pruned and stemmed to reduce their size. Several feature extraction methods have also been analyzed and implemented. The existing feature extraction method, Term Frequency-Inverse Document Frequency (TF-IDF), assigns a weight to each term based on its occurrence. The modified TF-IDF (M-TF-IDF) assigns a weight to each term based on both its occurrence and its importance in the document. This weighting scheme is used to increase the accuracy of the classification algorithm, but it does not consider the semantic similarity of terms. Hence, Latent Semantic Analysis (LSA) is discussed as a way to select terms based on semantic similarity. The combination of M-TF-IDF and LSA assigns weights to terms based on their importance and on the semantic similarity between them. The Support Vector Machine (SVM) algorithm classifies text documents, and its performance depends on the kernel function and the cost parameter. The proposed work introduces the cosine similarity function as the decision-making function. The implemented framework, Cosine-SVM (CSVM), classifies new test data in three steps.
First, the cosine similarity is calculated between the new test data and each group of support vectors. Then, the similarity values are averaged for each group and the averages are compared. Finally, the new test data is assigned the label of the group of support vectors with which it has the highest similarity. The present work classifies the benchmark data set effectively and efficiently, and hence it has also been used to evaluate the descriptive answers written by learners. This method has a number of benefits, such as increased reliability of results, reduced time and effort, a reduced burden on the faculty, and efficient use of resources.
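
The three-step decision rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the support vectors have already been obtained from a trained SVM and grouped by class label, that every group is non-empty, and that documents are represented as plain numeric feature vectors. The function names, the toy vectors, and the labels "relevant" and "irrelevant" are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def classify_by_support_vectors(test_vector, support_vectors_by_label):
    """Return (label, mean similarities) for one test vector.

    support_vectors_by_label maps each class label to a non-empty list of
    support vectors (assumed to come from an already trained SVM).
    """
    mean_similarity = {}
    for label, vectors in support_vectors_by_label.items():
        # Step 1: cosine similarity between the test vector and each
        # support vector of this group.
        sims = [cosine_similarity(test_vector, sv) for sv in vectors]
        # Step 2: average the similarities within the group.
        mean_similarity[label] = sum(sims) / len(sims)
    # Step 3: assign the label of the group with the highest average.
    best_label = max(mean_similarity, key=mean_similarity.get)
    return best_label, mean_similarity


# Toy data (hypothetical): two groups of three-dimensional support vectors.
support_vectors_by_label = {
    "relevant": [[1.0, 0.0, 1.0], [2.0, 0.0, 1.0]],
    "irrelevant": [[0.0, 1.0, 0.0], [0.0, 3.0, 1.0]],
}
predicted_label, scores = classify_by_support_vectors(
    [1.0, 0.0, 2.0], support_vectors_by_label
)
print(predicted_label, scores)
```

In this sketch, ties between groups are resolved in favour of the first label encountered, which is a simplification; the abstract does not state how ties are handled.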
