SCIENCE OF COMPUTER PROGRAMMING, vol. 252, art. 103466, pp. 1-23, 2026 (SCI-Expanded, Scopus)
Assessing software quality without access to source code is a challenging task, as traditional metrics and testing approaches typically rely on internal code analysis. User feedback, however, provides a valuable alternative data source by reflecting real-world quality issues and user perceptions. Leveraging such feedback for quality classification introduces several technical challenges: the data are unstructured, class distributions are imbalanced, and labeled examples are scarce. In this study, these challenges are addressed using a dataset of user reviews from mobile health (mHealth) applications, where each review is labeled with one of the eight software quality characteristics defined in ISO/IEC 25010:2011. The reviews were represented using a range of text vectorization techniques, including Bag-of-Words, TF-IDF, Word2Vec, FastText, BERT, and RoBERTa, and classified with various machine learning algorithms such as SVM, KNN, Random Forest, Naive Bayes, Decision Tree, SGD, XGBoost, and AdaBoost. To address class imbalance, the Adaptive Synthetic Sampling (ADASYN) method, a variant of the Synthetic Minority Oversampling Technique (SMOTE), was applied, and hyperparameter optimization was performed to improve model performance. The empirical results revealed substantial performance differences across model combinations. In particular, the TF-IDF (700) + RFC + SMOTE model achieved the best performance, with an average F1-score of 85% across the eight quality classes. Data balancing notably improved recall scores for minority classes. While some models achieved high training performance, their test performance decreased significantly, indicating a tendency toward overfitting. A Wilcoxon Signed-Rank Test confirmed that the improvement resulting from data balancing was statistically significant. Furthermore, experiments conducted with 10 different random data splits yielded a low standard deviation (0.78), demonstrating the stability of the proposed model.
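To make the best-performing pipeline concrete, the sketch below illustrates the general idea behind TF-IDF (with a 700-feature cap) + Random Forest + SMOTE-style oversampling. It is a minimal, self-contained illustration under stated assumptions, not a reproduction of the paper's actual pipeline: the review snippets and two-class labels are hypothetical, only two of the eight quality classes are shown, and the `smote_like` helper is a simplified stand-in for SMOTE/ADASYN that interpolates between random minority-class pairs (SMOTE's core idea) rather than calling the `imbalanced-learn` implementations used in practice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical mHealth review snippets; class 1 ("usability") is the minority.
reviews = [
    "app crashes on startup", "crashes after the update",
    "login fails constantly", "freezes when syncing data",
    "battery drains very fast", "uses too much memory",
    "interface is confusing", "hard to find the export button",
]
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

vec = TfidfVectorizer(max_features=700)  # cap mirrors "TF-IDF (700)"
X = vec.fit_transform(reviews).toarray()

def smote_like(X_min, n_new, rng):
    """Create synthetic minority samples by interpolating random pairs."""
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=True)
        lam = rng.random()  # random point on the segment between the pair
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(0)
X_min = X[labels == 1]
n_new = int((labels == 0).sum() - (labels == 1).sum())

# Oversample the minority class until both classes have equal counts.
X_bal = np.vstack([X, smote_like(X_min, n_new, rng)])
y_bal = np.concatenate([labels, np.ones(n_new, dtype=int)])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)
```

In a faithful reproduction, oversampling would be applied only to the training split (never the test split), and `imblearn.over_sampling.SMOTE` or `ADASYN` would replace the toy helper.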
Overall, rather than proposing an entirely new method, this study presents a comprehensive comparative pipeline analysis grounded in methodological enhancements, specifically data balancing, hyperparameter tuning, and statistical evaluation. However, since the analysis was conducted on a single mHealth dataset, the generalizability of the findings remains limited. Despite this limitation, the results provide valuable insights into how user feedback can be effectively leveraged for automated software quality classification.