Content and Readability Analysis of ChatGPT and Gemini’s Responses to FAQs on Patellofemoral Instability
1Department of Orthopedics and Traumatology, Cine State Hospital, Aydin, Türkiye
2Department of Orthopedics and Traumatology, University of Health Sciences, Antalya Training and Research Hospital, Antalya, Türkiye
3Department of Orthopedics and Traumatology, Siverek State Hospital, Urfa, Türkiye
4Department of Orthopedics and Traumatology, Medikum Private Hospital, Antalya, Türkiye
Sports Traumatol Arthrosc - DOI: 10.14744/start.2025.96548

Abstract

Objective: This study aimed to evaluate the quality and readability of responses generated by ChatGPT and Gemini to frequently asked questions about patellofemoral instability (PFI). As patients increasingly rely on AI chatbots for medical information, evaluating the accuracy, completeness, and accessibility of their responses is essential for determining their potential role in patient education.
Materials and Methods: A cross-sectional observational study was conducted using 20 frequently asked patient questions about PFI, selected based on Google search trends and patient education resources. These questions were submitted to ChatGPT (version 4o) and Gemini (version 2.1), and the responses were analyzed for content quality and readability. Content quality was assessed by three independent orthopedic specialists using a structured scoring framework covering relevance, accuracy, clarity, completeness, evidence-based support, and consistency; each domain was rated on a five-point Likert scale ranging from very poor to excellent. Readability was assessed using several linguistic indices: the Flesch-Kincaid Grade Level, the Flesch Reading Ease Score, the Gunning Fog Index, the Coleman-Liau Index, the Automated Readability Index (ARI), and the Simple Measure of Gobbledygook (SMOG) Index. Content quality and readability scores for the two models were compared statistically.
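
The abstract does not state which software was used to calculate the readability indices. As an illustrative sketch only, the six indices named above can be computed for each chatbot response with the open-source Python library textstat; the function and variable names below are assumptions for demonstration, not the authors' actual pipeline.

import textstat  # pip install textstat

def readability_profile(text: str) -> dict:
    # Compute the six readability indices named in the Methods for one response.
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
        "coleman_liau": textstat.coleman_liau_index(text),
        "ari": textstat.automated_readability_index(text),
        "smog": textstat.smog_index(text),
    }

# Hypothetical usage with one chatbot response to a frequently asked question:
response = "Patellofemoral instability occurs when the kneecap slips out of its groove in the thigh bone."
print(readability_profile(response))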
Results: ChatGPT scored higher than Gemini for accuracy (4.70±0.21 vs. 4.58±0.40, p=0.071) and evidence-based support (4.51±0.45 vs. 4.35±0.67, p=0.045), although only the latter difference was statistically significant. In contrast, Gemini produced significantly clearer responses (4.95±0.12 vs. 4.75±0.14, p=0.001) and had a significantly higher Flesch Reading Ease Score (30.21±9.43 vs. 19.39±10.09, p=0.001), indicating that its responses were easier to read. Both models generated text at a college reading level, suggesting limited accessibility for the general patient population.
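
For context on the reported scores, the Flesch Reading Ease Score is conventionally computed from average sentence length and average syllables per word,

FRE = 206.835 - 1.015 \left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6 \left(\frac{\text{total syllables}}{\text{total words}}\right),

where scores below 30 are typically interpreted as very difficult (college-graduate level) and scores of 30-50 as difficult (college level). On this standard scale, mean scores of 19.39 (ChatGPT) and 30.21 (Gemini) both correspond to college-level reading difficulty, consistent with the accessibility concern stated above.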
Conclusion: ChatGPT and Gemini provided reliable yet complex answers to patient questions about patellofemoral instability. ChatGPT performed better in accuracy and evidence-based support, while Gemini produced more readable content. However, both models require further refinement in readability and transparency to improve their suitability for patient education. Future research should explore the integration of AI chatbots into clinical workflows to ensure safe and effective information dissemination for diverse patient populations.