Journal of Engineering Research (Kuwait), 2026 (SCI-Expanded, Scopus)
SMS spam detection remains a critical challenge as spammers continuously adapt their tactics to evade automated filters. This study proposes a systematic machine learning pipeline for SMS spam detection and conducts a comprehensive comparative analysis across multiple feature extraction methods, feature selection strategies, scaling approaches, and ensemble learning algorithms within a single reproducible framework. Experiments are performed on the UCI SMS Spam Collection v.1 dataset using eight classifiers (LR, DT, KNN, SVM, NC, GB, XGB, RF) combined with Bag-of-Words (BoW), TF-IDF, and Word2Vec feature extraction methods. Five feature selection methods (Chi-square, Mutual Information, LASSO, Random Forest Importance, and Recursive Feature Elimination), eight scaling strategies, and class imbalance handling via SMOTE and random undersampling are systematically evaluated. A standardized preprocessing pipeline incorporating lowercasing, URL normalization, digit normalization, and special character removal is applied prior to feature extraction. Results demonstrate that BoW consistently outperforms TF-IDF and Word2Vec across most classifiers. The best-performing configuration, BoW combined with QuantileTransformer scaling and Logistic Regression, achieves an accuracy of 0.989, precision of 0.995, recall of 0.920, F1-score of 0.956, and ROC-AUC of 0.990, with only 18 false negatives and 1 false positive on a test set of 1672 messages. Feature selection improves on baseline performance, with LASSO combined with SVM or XGB yielding the highest F1-scores on BoW features. Class imbalance experiments reveal that neither SMOTE nor random undersampling improves upon the unbalanced baseline, confirming that BoW features inherently tolerate the natural class distribution. XGBoost achieves the best ensemble results with the most favorable accuracy-efficiency trade-off.
False negative analysis identifies personal/chat-mimicking spam and humor-format spam as the primary challenge categories for token-frequency methods. Training completes in under 2 s on a standard CPU without a GPU, confirming that classification performance competitive with deep learning approaches is attainable at substantially lower computational cost.
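The preprocessing steps named in the abstract (lowercasing, URL normalization, digit normalization, and special character removal) can be illustrated with a minimal sketch, followed by a simple Bag-of-Words count. This is not the study's implementation: the placeholder tokens `urltoken` and `numtoken`, the regular expressions, the order of the steps, and the function names `preprocess` and `bag_of_words` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Placeholder tokens are assumptions; the abstract does not say what
# URLs and digits are normalized to.
URL_TOKEN = "urltoken"
NUM_TOKEN = "numtoken"

_URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # simple URL pattern (assumed)
_DIGIT_RE = re.compile(r"\d+")                   # any run of digits
_SPECIAL_RE = re.compile(r"[^a-z\s]")            # anything except a-z and whitespace
_SPACE_RE = re.compile(r"\s+")


def preprocess(message: str) -> str:
    """Lowercase, normalize URLs and digits, then drop special characters."""
    text = message.lower()
    # URLs are replaced before digits so that digits inside a URL
    # do not produce separate number tokens.
    text = _URL_RE.sub(f" {URL_TOKEN} ", text)
    text = _DIGIT_RE.sub(f" {NUM_TOKEN} ", text)
    # Removes punctuation, currency symbols, and non-ASCII letters.
    text = _SPECIAL_RE.sub(" ", text)
    return _SPACE_RE.sub(" ", text).strip()


def bag_of_words(message: str) -> Counter:
    """Count whitespace-separated tokens in the preprocessed message."""
    return Counter(preprocess(message).split())


if __name__ == "__main__":
    print(preprocess("WIN £1000 now! Visit http://Example.com/claim"))
    print(bag_of_words("Free free FREE 50% off"))
```

Replacing URLs and digit runs with fixed tokens keeps the fact that a message contains a link or a number while collapsing the many distinct surface forms into one vocabulary entry each. In a fuller pipeline the token counts would be vectorized, scaled, and passed to a classifier; that part is omitted here.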