Highly Accurate Spam Detection with the Help of Feature Selection and Data Transformation

TAKCI, HİDAYET; Nusrat, Fatema

doi:10.34028/iajit/20/1/4

TAKCI H., Nusrat F.

International Arab Journal of Information Technology, vol.20, no.1, pp.29-37, 2023 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 20 Issue: 1
Publication Date: 2023
DOI Number: 10.34028/iajit/20/1/4
Journal Name: International Arab Journal of Information Technology
Journal Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Applied Science & Technology Source, Arab World Research Source, Computer & Applied Sciences
Pages: pp.29-37
Keywords: Internet security, prediction methods, feature selection, data conversion, spam detection
Affiliated with Sivas Cumhuriyet University: Yes

Abstract

© 2023, Zarka Private University. All rights reserved. The amount of spam is growing rapidly as email becomes more popular, which has created a need to filter spam emails. To date, many knowledge-based, learning-based, and clustering-based methods have been developed for this purpose. This study targets machine-learning-based spam detection using the C4.5, ID3, RndTree, C-Support Vector Classification (C-SVC), and Naïve Bayes algorithms. In addition, feature selection and data transformation methods were applied to increase spam detection success. Experiments were performed on the UC Irvine Machine Learning Repository (UCI) spambase dataset, and the results were compared in terms of accuracy, Receiver Operating Characteristic (ROC) analysis, and classification speed. In the accuracy comparison, the C-SVC algorithm gave the highest accuracy at 93.13%, followed by the RndTree algorithm. In the ROC analysis, the RndTree algorithm gave the best Area Under Curve (AUC) value of 0.999, while the C4.5 algorithm gave the second-best result. The fastest methods in terms of classification speed were the Naïve Bayes and RndTree algorithms. The experiments showed that feature selection and data transformation increased spam detection success. The data transformation that increased classification success the most was binary transformation, and the most effective feature selection method was forward selection.
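
To illustrate the two techniques the abstract credits with the largest gains, the following is a minimal sketch of a binary transformation followed by greedy forward selection. It is not the authors' implementation. The toy frequency rows, the helper names (`binarize`, `train_nb`, `predict_nb`, `accuracy`, `forward_selection`), the threshold of zero, and the use of a Bernoulli Naïve Bayes scorer evaluated on the training rows are all assumptions made for illustration; a real experiment would load the UCI spambase data and score on held-out data.

```python
import math

def binarize(rows):
    """Binary transformation (assumed rule): 1 if the value is positive, else 0."""
    return [[1 if v > 0 else 0 for v in row] for row in rows]

def train_nb(X, y, features):
    """Fit a Bernoulli Naive Bayes model restricted to the given feature indices."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed estimate of P(feature = 1 | class)
        probs = {f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2) for f in features}
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior score."""
    best, best_score = None, -math.inf
    for c, (prior, probs) in sorted(model.items()):
        score = math.log(prior)
        for f, p in probs.items():
            score += math.log(p if x[f] else 1 - p)
        if score > best_score:
            best, best_score = c, score
    return best

def accuracy(X, y, features):
    """Training-set accuracy; a stand-in for a proper held-out evaluation."""
    model = train_nb(X, y, features)
    return sum(predict_nb(model, x) == label for x, label in zip(X, y)) / len(y)

def forward_selection(X, y):
    """Greedily add the feature that most improves accuracy; stop when none does."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = [(accuracy(X, y, selected + [f]), f) for f in remaining]
        acc, f = max(scores, key=lambda s: s[0])  # first candidate wins ties
        if acc <= best_acc:
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc

# Hypothetical toy data: three made-up frequency features per email, 1 = spam.
raw = [
    [0.5, 0.0, 1.2],
    [1.1, 0.0, 0.0],
    [0.3, 0.0, 0.7],
    [0.0, 0.4, 0.9],
    [0.0, 0.9, 0.0],
    [0.0, 0.2, 0.3],
]
labels = [1, 1, 1, 0, 0, 0]

X = binarize(raw)
selected, best_acc = forward_selection(X, labels)
print("selected features:", selected, "accuracy:", best_acc)
```

On this toy data the first column alone separates the two classes, so the search stops after one feature. The same loop works with any classifier in place of the Naïve Bayes scorer, which is the main reason forward selection is described as a wrapper method.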