SMS spam detection via a systematic machine learning pipeline: Feature engineering, hyperparameter optimizing, scaling, and ensemble learning

Şenol, Tuğçe; TAKCI, HİDAYET; ŞEKER, ABDULKADİR

doi:10.1016/j.jer.2026.05.020

Şenol T., TAKCI H., ŞEKER A.

Journal of Engineering Research (Kuwait), 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Publication Date: 2026
DOI Number: 10.1016/j.jer.2026.05.020
Journal Name: Journal of Engineering Research (Kuwait)
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Directory of Open Access Journals
Keywords: Ensemble learning, Feature engineering, Hyperparameter optimization, Machine learning, Short message service spam detection, Text preprocessing
Affiliated with Sivas Cumhuriyet University: Yes

Abstract

SMS spam detection remains a critical challenge as spammers continuously adapt their tactics to evade automated filters. This study proposes a systematic machine learning pipeline for SMS spam detection and conducts a comprehensive comparative analysis across multiple feature extraction methods, feature selection strategies, scaling approaches, and ensemble learning algorithms within a single reproducible framework. Experiments are performed on the UCI SMS Spam Collection v.1 dataset using eight classifiers (LR, DT, KNN, SVM, NC, GB, XGB, RF) combined with Bag-of-Words, TF-IDF, and Word2Vec feature extraction methods. Five feature selection methods (Chi-square, Mutual Information, LASSO, Random Forest Importance, and Recursive Feature Elimination), eight scaling strategies, and class imbalance handling via SMOTE and random undersampling are systematically evaluated. A standardized preprocessing pipeline incorporating lowercasing, URL normalization, digit normalization, and special character removal is applied prior to feature extraction. Results demonstrate that BoW consistently outperforms TF-IDF and Word2Vec across most classifiers. The best-performing configuration, BoW combined with QuantileTransformer scaling and Logistic Regression, achieves an accuracy of 0.989, precision of 0.995, recall of 0.920, F1-score of 0.956, and ROC-AUC of 0.990, with only 18 false negatives and 1 false positive on a test set of 1672 messages. Feature selection improves over baseline performance, with LASSO combined with SVM or XGB yielding the highest F1-scores on BoW features. Class imbalance experiments reveal that neither SMOTE nor random undersampling improves upon the unbalanced baseline, confirming that BoW features inherently tolerate the natural class distribution. XGBoost achieves the best ensemble results with the most favorable accuracy-efficiency trade-off. 
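The preprocessing steps named above (lowercasing, URL normalization, digit normalization, special character removal) followed by Bag-of-Words counting could look roughly like the minimal sketch below. It is an illustration only, not the authors' implementation: the regular expressions, the placeholder tokens `urltoken` and `numtoken`, and the function names `preprocess` and `bag_of_words` are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical patterns; the paper's exact normalization rules are not given in the abstract.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
DIGIT_RE = re.compile(r"\d+")
SPECIAL_RE = re.compile(r"[^a-z\s]")


def preprocess(message: str) -> str:
    """Apply the four preprocessing steps in the order listed in the abstract."""
    text = message.lower()                   # lowercasing
    text = URL_RE.sub(" urltoken ", text)    # URL normalization (placeholder is an assumption)
    text = DIGIT_RE.sub(" numtoken ", text)  # digit normalization (placeholder is an assumption)
    text = SPECIAL_RE.sub(" ", text)         # special character removal
    return " ".join(text.split())            # collapse repeated whitespace


def bag_of_words(message: str) -> Counter:
    """Count token occurrences in the cleaned message (a plain Bag-of-Words vector)."""
    return Counter(preprocess(message).split())


if __name__ == "__main__":
    sample = "WIN £1000 now! Visit http://spam.example.com/claim"
    print(preprocess(sample))
    print(bag_of_words("Free free FREE entry 2 win"))
```

Replacing URLs and digit runs with placeholder tokens, rather than deleting them, keeps the signal that a message contained a link or a number, which token-frequency features can then pick up.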
False negative analysis identifies personal/chat-mimicking spam and humor-format spam as the primary challenge categories for token-frequency methods. Training completes in under 2 s on a standard CPU without GPU dependency, confirming that competitive classification performance against deep learning approaches is attainable at substantially lower computational cost.
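The metrics reported for the best-performing configuration can be cross-checked against the stated error counts. The sketch below takes the test set size (1672), false negatives (18), and false positives (1) from the abstract; the true positive count is not reported, so it is inferred here from the stated recall of 0.920, which is an assumption of this check rather than a figure from the paper.

```python
# Counts stated in the abstract.
total = 1672
false_negatives = 18
false_positives = 1
reported_recall = 0.920

# Assumption: TP is inferred from recall = TP / (TP + FN); it is not reported directly.
true_positives = round(reported_recall * false_negatives / (1 - reported_recall))
true_negatives = total - true_positives - false_negatives - false_positives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(f"TP={true_positives} TN={true_negatives}")
    print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")
```

Under this inference the counts reproduce the reported accuracy, precision, recall, and F1-score to three decimal places, so the headline figures are mutually consistent.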