Using Large Language Model to Label Data in Determining Software Quality Classes: ChatGPT

KEKÜL H., POLATGİL M.

2024 Innovations in Intelligent Systems and Applications Conference, ASYU 2024, Ankara, Türkiye, 16-18 October 2024 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/asyu62119.2024.10757027
City of Publication: Ankara
Country of Publication: Türkiye
Keywords: ChatGPT, data labeling, Fleiss' Kappa, ISO/IEC 25010
Affiliated with Sivas Cumhuriyet University: Yes

Abstract

This study examines ChatGPT's ability to label software quality classes. Approximately 50 thousand Duolingo user comments, randomly selected from the Google Play Store, were labeled by ChatGPT according to the ISO/IEC 25010 software quality classes, and these labels were compared with those assigned by two experts in software engineering. Agreement was analyzed using Fleiss' Kappa. The analysis found 69% agreement between ChatGPT's labels and those of the human experts. By the standards of Fleiss' Kappa, this is a high level of agreement among three different raters. The finding shows that ChatGPT can be used effectively alongside human experts in data labeling. In particular, the labels assigned in the "Usability" and "Functional Suitability" classes reveal that users attach importance to these characteristics. Although ChatGPT has disadvantages such as hallucination and bias, the accuracy of such models can be increased through hybrid approaches. The research shows that large language models can be an important auxiliary tool for data labeling in software engineering, and that working with human experts yields more accurate and reliable results.
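
To make the agreement measure concrete, the following is a minimal sketch of how Fleiss' Kappa can be computed for three raters (for example, ChatGPT and two human experts). It is an illustration under stated assumptions, not the authors' implementation: the function name `fleiss_kappa`, the input layout, and the four toy comments are hypothetical and do not come from the study's data. The class names are taken from ISO/IEC 25010.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' Kappa.

    ratings: list of items; each item is a list holding one category
    label per rater. Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that chose the same label.
        same_pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += same_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items            # mean observed agreement
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())

    if expected == 1.0:
        # All ratings fall in a single category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): each row is one comment, labeled by
# [ChatGPT, expert 1, expert 2].
toy_labels = [
    ["Usability", "Usability", "Usability"],
    ["Functional Suitability", "Functional Suitability", "Functional Suitability"],
    ["Usability", "Usability", "Functional Suitability"],
    ["Reliability", "Reliability", "Reliability"],
]

print(round(fleiss_kappa(toy_labels), 3))  # 0.745
```

The statistic corrects raw agreement for the agreement expected by chance, which is why it is preferred over a simple percentage match when more than two raters label the same items.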