Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination

This study compares the performance of GPT-3.5, GPT-4, and GPT-4o on the 2020 and 2021 Chinese NMLE, focusing on the enhanced accuracy and reliability of GPT-4o in answering medical questions. The results indicate that GPT-4o achieved significantly higher accuracy in the 2020 and 2021 NMLE than GPT-3.5 and GPT-4. In subgroup analyses by question type and unit, GPT-4o consistently demonstrated superior performance. While GPT-4 showed competitive results in specific modules, it did not surpass GPT-4o, and both models significantly outperformed GPT-3.5.
China, a developing country with over 1.4 billion people, reported approximately 9.56 billion medical visits in 2023, with total healthcare expenditures reaching 8.4 trillion RMB. Although the country has around 4.78 million licensed physicians and licensed assistant physicians, disparities in physician distribution and availability persist. The NMLE, organized and standardized by the National Medical Examination Center, is the national licensing examination for medical practitioners in China and the critical gateway to medical licensure. We therefore selected the NMLE as the benchmark for evaluating the application of ChatGPT in medical education, providing insights into its future role in healthcare development.
ChatGPT models, developed through deep learning techniques and trained on extensive datasets, excel at efficient information retrieval and language organization. According to OpenAI, GPT-3.5, released in November 2022, contains approximately 175 billion parameters and is trained on data up to September 2021. GPT-4, released in 2023, is reported to have expanded to roughly one trillion parameters, while GPT-4o, launched on May 14, 2024, is reported to have 1.2 trillion parameters and is trained on data up to December 2023. To minimize the impact of information outside GPT-3.5’s training data on its performance, we selected the 2020 and 2021 NMLE tests for evaluation.
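As a minimal illustration of how such an evaluation can be run (this is a sketch, not the protocol used in this study), the code below poses one NMLE-style multiple-choice item to a model through the OpenAI Python SDK and records the chosen option; the prompt wording, placeholder question text, and answer-parsing rule are assumptions for illustration only.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A1-type question: Which of the following is the first-line treatment for ...?\n"
    "A. ...\nB. ...\nC. ...\nD. ...\nE. ..."
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in "gpt-3.5-turbo" or "gpt-4" for the other comparisons
    messages=[
        {"role": "system", "content": "Answer with a single option letter (A-E)."},
        {"role": "user", "content": question},
    ],
    temperature=0,  # reduces, but does not eliminate, response randomness
)

predicted_option = response.choices[0].message.content.strip()[:1]  # e.g. "C"
print(predicted_option)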
In the 2020 and 2021 exams, GPT-4o demonstrated the highest overall accuracy, followed by GPT-4, with GPT-3.5 performing the lowest. GPT-4 and GPT-4o achieved accuracy rates exceeding 60%, surpassing the NMLE passing threshold. According to official NMLE data, the national pass rate typically ranges between 18% and 22%. Hence, the performance of GPT-4 and GPT-4o exceeds that of most candidates, showcasing their remarkable potential in medical problem-solving and analysis.
While GPT-3.5 has passed the United States Medical Licensing Examination (USMLE)16,17, it has not succeeded in the NMLE20,21,22. Wang et al. reported GPT-3.5 accuracy rates of 47% and 45.8% for the 2020 and 2021 NMLE, attributing the discrepancy to differences in medical policies and epidemiological data between China and the United States20. Our findings corroborate these results, with GPT-3.5 achieving accuracy rates of 50.5% (2020) and 50.8% (2021); the slight improvement may be linked to differences in testing procedures or the model’s inherent response randomness, and may also indicate that continued training and model updates improve performance over time.
Zong et al. tested GPT-3.5 on the 2017–2021 NMLE, pharmacist, and nursing examinations in China and found that it failed to meet the passing standard on any of these tests. The authors attributed this to ChatGPT’s English-centric training, highlighting the challenges of understanding medical policies outside English-speaking regions. Nevertheless, GPT-3.5 consistently scored above 50%, underscoring AI’s potential in medical education22. Fang et al. attempted to improve results by translating NMLE questions into English before inputting them into ChatGPT but observed no significant improvement27, whereas another study reported a 5% gain in accuracy when the NMLE was translated into professional English28. This discrepancy may stem from translation quality, but semantic and cultural differences remain key challenges for AI in non-English medical examinations.
The advent of GPT-4 has significantly improved the model’s ability to comprehend non-English languages. Takagi et al. found that GPT-4 outperformed GPT-3.5 in the Japanese Medical Licensing Examination, successfully passing the test29. Similarly, GPT-4 met the passing threshold for the Chinese Master’s Degree Entrance Examination in Clinical Medicine and achieved accuracy rates of 73.67%27 and 81.25%28 on the NMLE, in line with our findings. A meta-analysis of LLM performance on dental licensing examinations across diverse linguistic and geographical contexts found that GPT-4 holds potential in dental education and diagnosis, albeit with accuracy still falling below the threshold required for clinical applications30. Our study demonstrates that GPT-4o significantly improved on this performance, achieving overall accuracy rates of 84.1% (2020) and 88.2% (2021). Ebel et al. found that GPT-4o passed the European Board of Interventional Radiology (EBIR) mock written exam, a qualification often associated with expert-level knowledge in interventional radiology31. Moreover, GPT-4o can generate exam questions at varying difficulty levels, offering valuable training and assessment tools for radiology residents and medical students31.
These findings suggest that LLMs could, in the future, be integrated into medical education within academic institutions and professional training for clinicians. Nevertheless, the responses generated by AI are not invariably accurate. First, GPT-4o may provide seemingly rational analyses even for incorrect answers, potentially misleading users. Second, we observed instances where GPT-4o selected the correct answer but provided flawed reasoning. For example, in an A2-type cardiology question, GPT-4o accurately identified the optimal therapeutic agent but exhibited errors in classifying the type of arrhythmia. Such discrepancies may stem from limitations in the underlying database and variations in medical theories, cultural contexts, and legal regulations across different countries. Consequently, when employing LLMs to address medical questions, it is imperative to critically evaluate the validity of their responses and avoid over-reliance on their outputs.
Comparative analysis of question types
GPT-4o consistently outperformed GPT-4 across different question types, while GPT-4 generally exceeded GPT-3.5 in most cases. However, in case analysis questions (A3/A4) for the 2020 exam and standard matching questions (B1) for the 2021 exam, the accuracy difference between GPT-4 and GPT-3.5 was not statistically significant.
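As context for how such significance statements are typically assessed (the exact test used in this study is not restated here), the following sketch applies a chi-square test of independence to hypothetical correct/incorrect counts for two models on one question type; the counts are illustrative only.

from scipy.stats import chi2_contingency

# Rows: GPT-4, GPT-3.5; columns: correct, incorrect (hypothetical counts).
table = [[47, 33],
         [32, 48]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # p >= 0.05 -> difference not significant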
A3 and A4 questions, collectively called case analysis questions, assess the ability to analyze clinical scenarios comprehensively. A3-type questions involve analyzing scenarios based on a single patient, with 2–3 related questions requiring independent judgment. A4-type questions are more complex, providing multi-level information as the patient’s condition unfolds and necessitating deeper case analysis. These questions demand contextual understanding and diagnostic reasoning, posing significant challenges to ChatGPT’s ability to interpret and process information. B1-type questions feature a distinctive format in which a shared set of five options is applied to two or more questions, testing the candidate’s ability to select the best match for each. Wang et al. highlighted GPT-3.5’s subpar performance in case analysis questions20, while Li et al. found that multiple-choice questions appeared to be a weak point for both GPT-3.5 and GPT-4, with the lowest scores compared with other question types32. Takagi et al. observed that GPT-3.5’s accuracy on difficult questions was only 33.3%, whereas GPT-4 improved on this by 40%, surpassing the examinees’ accuracy by 17%29.
Our study corroborates these findings, indicating that GPT-4 and GPT-3.5 struggle with complex question types. GPT-3.5 performed worst on B1-type questions, with accuracy rates of 33.8% and 40% in 2020 and 2021, respectively. GPT-4 achieved a 58.8% accuracy rate for B1-type questions in 2021, the only instance in which its accuracy fell below 60% across all question types and units. Additionally, GPT-4 showed poor performance on the 2020 A3/A4-type questions, with no significant difference compared with GPT-3.5.
These findings highlight the challenges that B1-type and A3/A4-type questions pose for ChatGPT’s processing and analytical capabilities. GPT-4o, by contrast, demonstrated superior performance across all question types, achieving over 80% accuracy, with case analysis and standard matching questions exceeding 85%. However, because the written component of the NMLE does not include image recognition questions, GPT-4o’s ability to interpret Chinese-language electrocardiograms and other imaging-related questions remains to be validated.
Comparative analysis by unit
GPT-4o consistently outperformed GPT-4 across units; in 2021 in particular, it achieved higher accuracy in every unit except Unit 1 and Unit 3. Both GPT-4o and GPT-4 showed significantly higher accuracy than GPT-3.5. The four units cover the following content:
- Unit 1 primarily covers foundational subjects, focusing on A1-type questions emphasizing memorization.
- Unit 2 assesses the cardiovascular, urological, and musculoskeletal systems, with diagnostic, auxiliary examination, and treatment-related content contributing over 20 points each.
- Unit 3 focuses on the digestive and respiratory systems.
- Unit 4 involves the female reproductive, pediatric, and neurological/psychiatric systems.
We found that GPT-4o achieved its highest accuracy rate of 94.7% in Unit 3 of the 2021 examination, excelling in questions on the digestive and respiratory systems, although its performance in Unit 3 of the 2020 examination was comparatively modest (82%). In Unit 2 of the 2020 examination (covering the cardiovascular, urinary, and musculoskeletal systems), GPT-4o achieved the second-highest accuracy rate of 89.3%, while it performed least effectively in Unit 1 (basic sciences), with an accuracy rate of 78%. The performance of GPT-4 in basic sciences and in the digestive and respiratory systems was comparable to that of GPT-4o (P ≥ 0.001). In Unit 4 of the 2020 examination (covering female reproduction, pediatrics, and neuropsychiatry), GPT-4 achieved its highest accuracy rate of 76.7%, which was still lower than GPT-4o’s lowest accuracy rate (78%), highlighting the performance gap between the two models. The variation in accuracy rates of LLMs across units reflects their differing capabilities across specialties, though it may also be influenced by variations in question difficulty across years. Additionally, the relatively small number of questions in each subspecialty may limit the ability to fully capture LLMs’ true proficiency in these domains.
Lin et al. compared GPT-4o, Claude-3.5 Sonnet, and Gemini Advanced on Taiwan’s internal medicine examination, finding that Claude-3.5 Sonnet excelled in psychiatry and nephrology, while GPT-4o achieved 97.1% accuracy in hematology and oncology and performed exceptionally well on image-based questions; Gemini Advanced had the lowest overall accuracy but performed reasonably well in psychiatry (86.96%) and hematology/oncology (82.91%)33. Liu et al. categorized questions from the Japanese national medical examination into 21 specialties and compared the accuracy rates of LLMs in each specialty against their overall accuracy, finding that LLMs performed significantly worse in gastroenterology, hepatology, pulmonology, and hematology than overall; they inferred that this disparity might be associated with the volume of academic publications in each specialty34. This methodological approach offers novel insights for future related studies, and the specific capabilities of LLMs across various clinical specialties warrant further investigation by relevant professionals.
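One way to implement the specialty-versus-overall comparison described for Liu et al. (their exact procedure is not reproduced here) is an exact binomial test of each specialty’s accuracy against the model’s overall accuracy, as in the sketch below; the specialty names, counts, and overall rate are hypothetical.

from scipy.stats import binomtest

overall_accuracy = 0.85  # hypothetical overall accuracy of the model

specialty_results = {          # (correct, total) per specialty - hypothetical
    "gastroenterology": (21, 30),
    "pulmonology": (22, 30),
    "hematology": (20, 28),
}

for specialty, (correct, total) in specialty_results.items():
    test = binomtest(correct, total, p=overall_accuracy)
    print(f"{specialty}: accuracy = {correct / total:.2f}, p = {test.pvalue:.3f}")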
In studies focusing on image-based questions, Liu et al. observed that GPT-4o outperformed GPT-4, Gemini 1.5 Pro, and Claude 3 Opus on both image-based and non-image-based questions; however, image-based questions posed a greater challenge to the LLMs, with accuracy rates substantially lower than those for non-image-based questions34. Another study applied GPT-4, Gemini, GPT-4 Turbo, and GPT-4o to core cardiology examinations, with GPT-4o delivering the best performance on both text-based and image-based questions35. Fabijan et al. tested ChatGPT’s ability to evaluate scoliosis X-rays and found that although it identified all scoliosis cases, its accuracy in determining curvature direction, type, and vertebral rotation was limited36. Nakao et al. found that adding image information to original Japanese medical licensing examination questions decreased the accuracy of GPT-4V, indicating that GPT-4V struggles with medical image interpretation37.
These studies reveal that LLMs generally underperform on image-based questions compared with text-based questions33,34,35,36,37. Future research should focus on enhancing AI’s ability to analyze and interpret medical images, particularly its capacity to process image-specific details.
Practical implications and limitations
In this study, we systematically evaluated the potential of ChatGPT, particularly its latest iteration, GPT-4o, in medical education and clinical practice. Our findings indicate notable feasibility in specific contexts but highlight significant limitations that necessitate careful consideration to ensure safe and effective application in the medical domain.
Advantages
First, in medical education, LLMs demonstrate considerable potential, particularly in facilitating knowledge acquisition and enhancing learning efficiency. ChatGPT can rapidly synthesize medical knowledge, saving medical students time otherwise spent consulting textbooks and literature, with its efficiency advantage especially pronounced in foundational disciplines and broad knowledge domains. Studies have shown that LLMs can achieve high accuracy rates in certain tests; for instance, in a UK-based study, GPT-4 achieved 100% accuracy on a 20-question test38. A meta-analysis encompassing 45 studies on the performance of different ChatGPT versions in medical licensing examinations reported an overall accuracy rate of 81% for GPT-439. In our study, GPT-4o exhibited high accuracy rates (nearly all above 85%) across both complex question types, such as case analysis and standard matching questions, and simpler A1-type questions. This suggests that LLMs can serve as auxiliary tools to help students quickly grasp key concepts or address knowledge gaps, particularly in foundational knowledge and clinical simulation scenarios. In clinical practice, LLMs offer efficiency advantages in screening common diseases and supporting diagnostic workflows, rapidly analyzing complex data to provide preliminary diagnostic suggestions. In our study, GPT-4o achieved an accuracy rate of 94.7% in Unit 3 of the 2021 examination, with accuracy rates in all other units exceeding 83%, reflecting robust performance across multiple specialties. This capability can help alleviate clinicians’ workload and enhance diagnostic efficiency, particularly in resource-limited settings.
Limitations
LLMs are a double-edged sword, and their potential risks and challenges warrant vigilance. First, ChatGPT is a machine learning system that autonomously learns from internet data and generates outputs after training on vast text datasets14. However, medical knowledge available online is not always reliable40, and unreliable data sources may compromise output performance and accuracy, leading to potential misinformation41. In our study, GPT-4o exhibited inaccuracies and deficiencies in certain highly specialized domains yet delivered its responses in an “authoritative” tone, which could foster overconfidence among users. Medicine, a rigorous discipline tied to human lives, is particularly vulnerable to the consequences of misleading information, which could directly impair students’ learning outcomes, interfere with clinicians’ judgment, and pose potential risks to patient health. Second, researchers have observed a degree of randomness in GPT-4’s responses39, a phenomenon also noted in our study, where GPT-4o occasionally provided inconsistent answers to identical questions, which could significantly affect users’ judgment. Third, over-reliance on LLMs may lead to an “answer dependency” phenomenon, stifling students’ independent thinking and critical reasoning skills: students may become inclined to adopt ChatGPT’s suggestions directly, neglecting the need to master foundational knowledge and cultivate a spirit of inquiry. Finally, privacy protection and data security are critical considerations in clinical applications of LLMs. ChatGPT systems must rigorously safeguard patient information to prevent breaches of sensitive data, and the transparency and ethical implications of their algorithms require scrutiny to mitigate potential biases or misleading suggestions that could adversely affect patients. These considerations should underpin any cautious implementation of ChatGPT in medicine, ensuring tangible and positive impacts on patients and healthcare professionals.
In summary, while LLMs demonstrate potential in medical education and clinical diagnostics, such as improving learning efficiency and aiding in the diagnosis of common diseases, their limitations—including knowledge inaccuracies, output randomness, risk of dependency, and ethical and privacy concerns—preclude their use as primary knowledge sources in medical education or standalone tools in clinical diagnostics. To maximize their benefits while mitigating risks, we recommend designing hybrid learning models in medical education that integrate LLMs, encouraging students to engage in critical reflection under AI assistance. In clinical practice, establishing secondary validation mechanisms for AI-driven decisions is essential, with healthcare professionals conducting necessary reviews to ensure clinical decisions genuinely benefit patient safety. Future research should further explore ways to enhance the accuracy and reliability of LLMs in highly specialized domains and localized contexts while refining associated ethical and regulatory frameworks to safeguard privacy and data security.
Study limitations
Our study has several limitations.
Scope of analysis
This study primarily focuses on descriptive statistical comparisons of GPT-3.5, GPT-4, and GPT-4o on the NMLE. Predictive modeling could provide further insight into factors influencing AI accuracy, such as question complexity and domain-specific difficulty, but such analyses were beyond the scope of this study. Future research should consider implementing logistic regression or other predictive models to explore the relationships between question characteristics and AI performance.
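A minimal sketch of the kind of predictive modeling suggested above is given below: a logistic regression of per-question correctness on model, question type, and unit, fitted here to simulated data because the study’s per-question results are not reproduced; all column names and values are illustrative assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600  # simulated question-level records

df = pd.DataFrame({
    "model": rng.choice(["gpt35", "gpt4", "gpt4o"], size=n),
    "question_type": rng.choice(["A1", "A2", "A3A4", "B1"], size=n),
    "unit": rng.choice(["1", "2", "3", "4"], size=n),
})

# Simulated probability of a correct answer, loosely echoing the observed pattern
# that stronger models do better and B1-type questions are harder.
base = {"gpt35": 0.50, "gpt4": 0.70, "gpt4o": 0.85}
prob = df["model"].map(base) - (df["question_type"] == "B1") * 0.10
df["correct"] = rng.binomial(1, prob.clip(0.05, 0.95))

# Odds of a correct answer as a function of model, question type, and exam unit.
fit = smf.logit("correct ~ C(model) + C(question_type) + C(unit)", data=df).fit(disp=0)
print(fit.summary())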
Lack of image-based questions
The study exclusively tested text-based questions, omitting image-based evaluations. Future studies should incorporate more diverse question types, including imaging case analysis, to assess ChatGPT’s multimodal performance.
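For the future image-based evaluations suggested above, a multimodal item can be posed by supplying both text and an image in a single request; the sketch below uses the OpenAI Python SDK, with the image URL, prompt wording, and model name as illustrative assumptions rather than a protocol from the present study.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Based on this chest radiograph, which diagnosis is most likely? "
                         "A. ... B. ... C. ... D. ... E. ... Answer with one letter."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.org/sample_image.png"}},  # placeholder
            ],
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)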
AI advancement
The rapid advancement of AI technology introduces additional complexity and limitations to this study. Our research relies on data and technological capabilities available as of 2024. Given the swift progress of LLMs in natural language processing and specialized applications, future iterations will likely surpass the performance of current models. For instance, DeepSeek, newly released in 2025, was not included in this study because of differences in its training environment, which is based on a Chinese-language database. Consequently, the results of this study may not fully reflect the true performance of future AI models on medical examinations.
Despite these limitations, this study provides valuable insights into the evolving role of AI in medical education and professional licensing exams. Future research should expand question sets, enhance cross-language accuracy, and continuously refine assessment methodologies to reflect AI’s growing role in medical education and clinical applications.