A Multiclass Approach to Estimating Software Vulnerability Severity Rating with Statistical and Word Embedding Methods

Kekül, HAKAN; ERGEN, BURHAN; ARSLAN, HALİL

doi:10.5815/ijcnis.2022.04.03

International Journal of Computer Network and Information Security, vol. 14, no. 4, pp. 27-42, 2022 (Scopus)

Publication Type: Article / Full Article
Volume: 14 Issue: 4
Publication Date: 2022
DOI: 10.5815/ijcnis.2022.04.03
Journal Name: International Journal of Computer Network and Information Security
Journal Indexes: Scopus
Pages: pp. 27-42
Keywords: Information Security, Multiclass Classification, Software Security, Software Vulnerability, Text Analysis
Affiliated with Sivas Cumhuriyet University: Yes

Abstract

The analysis and grading of software vulnerabilities is an important process that is currently performed manually by experts. As a result, the process suffers from time delays, human error, and excessive cost. The end product of the vulnerability reports that experts create is a severity score and a severity rating, and the severity rating is the first and foremost measure of a software vulnerability. Only 20% of all vulnerabilities can be exploited, and the vast majority of exploitations take place within the first two weeks. It is therefore imperative to determine the severity rating without delay. Our proposed model combines statistical methods and deep learning-based word embedding methods from natural language processing with machine learning algorithms that perform multi-class classification. For feature extraction, the statistical methods Bag of Words, Term Frequency-Inverse Document Frequency, and N-gram were used. For deep learning-based word embedding, the Word2Vec, Doc2Vec, and FastText algorithms were included in the study. In the classification stage, Naive Bayes, Decision Tree, K-Nearest Neighbors, Multi-Layer Perceptron, and Random Forest, all of which support multi-class classification, were preferred. In this respect, our model is a hybrid method. The database used is publicly available and is the most reliable data set in the field. The results obtained in our study are quite promising. By assisting experts in this field, the model can speed up these procedures. In addition, our study is among the first to cover the latest size of the data set and the latest versions of the scoring systems.
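
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of one of its many possible combinations: Bag of Words features with word N-grams, fed to a multi-class Naive Bayes classifier. It is not the authors' implementation. The vulnerability descriptions are invented for illustration and do not come from the data set used in the study, the four rating labels are an assumption (the abstract does not list them), and the `ngrams` helper and `NaiveBayes` class are hypothetical names written in plain Python rather than taken from any library.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return word n-grams for n = 1..n_max."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-n-grams counts with Laplace smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            feats = ngrams(doc)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        feats = [f for f in ngrams(doc) if f in self.vocab]  # skip unseen n-grams
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy descriptions; the rating labels are assumed for illustration.
train_docs = [
    "remote attacker executes arbitrary code without authentication",
    "buffer overflow allows remote code execution",
    "cross-site scripting in search form",
    "local user can read log file names",
]
train_labels = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("remote attacker executes arbitrary code"))
```

In a fuller experiment, the feature step could be swapped for TF-IDF weights or averaged word embeddings, and the classifier for any of the other algorithms named in the abstract, with the rest of the pipeline left unchanged.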