The Use of Artificial Intelligence in Medical Education: A Comparative Analysis of Theoretical Exam Performance between ENT Residents and ChatGPT-4o

Doğan Karataş, Tuba; Aksoy, Ahmet; Bora, Adem; Doğan, Mansur

doi:10.14744/lhhs.2024.94858

LOKMAN HEKIM HEALTH SCIENCES, vol. 6, no. 2, pp. 196-202, 2026 (TRDizin)

Publication Type: Article / Full Article
Volume: 6 Issue: 2
Publication Date: 2026
DOI: 10.14744/lhhs.2024.94858
Journal Name: LOKMAN HEKIM HEALTH SCIENCES
Indexes Covering the Journal: Central & Eastern European Academic Source (CEEAS), CINAHL, Directory of Open Access Journals, TR DİZİN (ULAKBİM)
Pages: pp. 196-202
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Sivas Cumhuriyet University: Yes

Abstract

Introduction: This study assesses the theoretical examination performance of otorhinolaryngology residents and compares their results with those of ChatGPT-4o, an artificial intelligence (AI) language model.

Methods: A 100-item multiple-choice theoretical examination was administered in February 2025 to 17 residents enrolled in an otorhinolaryngology specialty training program. The Department of Otorhinolaryngology at a tertiary care university hospital administered the examination as part of its annual assessment program. The same questions were subsequently presented to ChatGPT-4o, a large language model developed by OpenAI, and its responses were systematically recorded. The numbers of correct answers provided by the residents and by ChatGPT-4o were then compared. Each question was assigned a difficulty index based on participant performance and was thematically categorized to enable detailed item-level and domain-specific analyses.

Results: Seventeen otolaryngology residents completed the theoretical examination. The mean examination score among residents was 55.8 out of 100, whereas ChatGPT-4o achieved a score of 64; the difference was not statistically significant (p=0.077). Topic-based analysis revealed that ChatGPT-4o performed better on knowledge-based neurotology questions but worse on clinically contextual items requiring surgical decision-making. A positive, statistically significant correlation was observed between the duration of residency training and examination performance (r=0.66, p=0.004).

Discussion and Conclusion: ChatGPT-4o demonstrated a performance level comparable to that of human participants in theoretical medical examinations. AI-based educational platforms may serve as supportive tools in the training of medical residents and students.

Keywords: Artificial intelligence; Clinical competence; ChatGPT-4o; Medical education; Otorhinolaryngology