In the last decade, artificial intelligence (AI) technology has evolved rapidly, achieving noteworthy breakthroughs in numerous fields [1, 2]. One recent breakthrough that has garnered considerable attention is ChatGPT [3], an AI chatbot powered by the generative pre-trained transformer (GPT) architecture, specifically GPT-3.5 with 175 billion parameters. It is developed through reinforcement learning from human feedback and trained on extensive textual data. ChatGPT exhibits remarkable capabilities in various tasks, including but not limited to intelligent dialogue [4], knowledge question answering [5], and text generation [6], thus showcasing considerable potential for further development.
In the medical domain, there has been growing interest in exploring large language models for tasks such as biomedical question answering (BioGPT [7]) and automatic dialogue generation (DialoGPT [8, 9]). Regrettably, these studies have so far demonstrated limited practical utility in clinical practice. However, ChatGPT, with its powerful language understanding and generation capabilities, shows significant potential in clinical response generation [5, 6], clinical decision support [4, 10, 11], medical education [12, 13], literature information retrieval [14], scientific writing [15,16,17,18], and beyond. Recent studies have demonstrated that ChatGPT can pass the United States Medical Licensing Exam (USMLE) [19, 20], Radiology Board-style Examination [21], UK Neurology Specialty Certificate Examination [22], and Plastic Surgery In-Service Exam [23], with results comparable to those of human experts. Nevertheless, other studies have indicated that ChatGPT failed to pass the Family Medicine Board Exam [24] and the Pharmacist Qualification Examination [25]. Possible explanations for this performance difference include language and cultural differences and variations in examination content [26]. These studies highlighted ChatGPT’s ability to comprehend the complex language used in medical contexts and its potential for use in medical education. However, current research is limited in two respects. Firstly, it largely focuses on the English language, and secondly, it predominantly emphasizes physician examinations. Additional investigation is necessary to explore the potential of ChatGPT in non-English languages and in other types of medical examinations, which could deliver substantial benefits for its expanded application in the medical domain.
China, with a population of over 1.4 billion, faces a significant medical burden. The provision of healthcare services involves a collaborative effort among physicians, pharmacists, and nurses, who work diligently to offer the best possible care to patients. Physicians are responsible for diagnosing and treating illnesses, pharmacists ensure that the appropriate medication is dispensed and administered correctly, and nurses attend to patients’ daily medical management and care. Because medical resources are limited, medical professionals in China face immense pressure but remain committed to providing high-quality services. The advent of ChatGPT offers a promising way to ease this burden by delivering intelligent, efficient, and precise medical services to physicians, pharmacists, and nurses.
Medical examinations, including the Chinese National Medical Licensing Examination (NMLE), the Chinese National Pharmacist Licensing Examination (NPLE), and the Chinese National Nurse Licensing Examination (NNLE), are implemented by the government to improve professional standards, ensure medical safety, and enhance the quality of healthcare services [27]. The NMLE has 4 units of 150 questions each, for a total of 600 questions. It is designed around 4 modules: Basic Medical Sciences, Medical Humanities, Clinical Medicine, and Preventive Medicine. It is important to note that the questions within each module are randomly distributed across the units, and the number of questions devoted to each module is not fixed. The NPLE has 4 units of 120 questions each, for a total of 480 questions. The 4 units correspond to 4 specific modules, namely Pharmaceutical Knowledge I, Pharmaceutical Knowledge II, Pharmaceutical Management and Regulations, and Comprehensive Pharmacy Knowledge and Skills. The NNLE has 2 units of 120 questions each, for a total of 240 questions. Unit 1 focuses on clinical knowledge and unit 2 focuses on clinical skills. By verifying that medical staff have mastered the required medical knowledge, clinical skills, and ethical standards, these examinations can significantly improve the quality of their services. This, in turn, can reduce the incidence of medical errors and accidents and protect patients’ fundamental right to health and safety.
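The unit and question counts above can be restated as a small lookup table. The following Python sketch is purely illustrative: the dictionary layout and the helper name `total_questions` are assumptions introduced here, and only the counts themselves come from the text.

```python
# Examination structure as described in the text:
# number of units and number of questions per unit.
EXAM_STRUCTURE = {
    "NMLE": {"units": 4, "questions_per_unit": 150},
    "NPLE": {"units": 4, "questions_per_unit": 120},
    "NNLE": {"units": 2, "questions_per_unit": 120},
}


def total_questions(exam: str) -> int:
    """Return the total number of questions in one sitting of `exam`."""
    structure = EXAM_STRUCTURE[exam]
    return structure["units"] * structure["questions_per_unit"]


if __name__ == "__main__":
    for name in EXAM_STRUCTURE:
        print(name, total_questions(name))
```

The totals it computes (600, 480, and 240) match the figures stated for the NMLE, NPLE, and NNLE respectively.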
These medical licensing examinations aim to comprehensively evaluate candidates’ knowledge of medical science, clinical examination, disease diagnosis, surgical treatment, patient prognosis, policies, and regulations, among other areas. Passing these examinations is a prerequisite for obtaining professional certification as a physician, pharmacist, or nurse. The annual number of test-takers is high, while the number of successful candidates remains relatively low. For the NMLE, according to the official website and news reports, there were approximately 530,000 test-takers in 2017, around 600,000 in 2018, around 540,000 in 2019, around 490,000 in 2020, around 530,000 in 2021, and around 510,000 in 2022. For the NPLE, according to data from the official website of the Certification Center for Licensed Pharmacist of NMPA, the number of test-takers in 2017 was 523,296, with a pass rate of 29.19%. In 2018, there were 566,613 test-takers, with 79,900 successful candidates and a pass rate of 14.10%. In 2019, there were 133,000 successful candidates, resulting in a pass rate of 18.72%. In 2020, there were 610,132 test-takers, but the number of successful candidates was not released. In 2021, there were 450,973 test-takers, with 80,840 successful candidates and a pass rate of 17.93%. In 2022, there were 495,419 test-takers, with 97,400 successful candidates and a pass rate of 19.66%. For the NNLE, the total number of test-takers each year from 2012 to 2020 ranged between approximately 690,000 and 730,000, with the number of successful candidates ranging from approximately 380,000 to 420,000.
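The NPLE pass rates quoted above are consistent with the ratio of successful candidates to test-takers. As a minimal sketch, the Python snippet below recomputes that ratio for the years in which the text states all three values (2018, 2021, and 2022). The function names and the rounding tolerance are assumptions made for illustration, and the counts of successful candidates appear to be rounded figures.

```python
# Year -> (test-takers, successful candidates, reported pass rate in percent),
# copied from the NPLE figures quoted in the text. Only years for which all
# three values are stated are included.
NPLE_REPORTED = {
    2018: (566_613, 79_900, 14.10),
    2021: (450_973, 80_840, 17.93),
    2022: (495_419, 97_400, 19.66),
}


def pass_rate(test_takers: int, successful: int) -> float:
    """Return the pass rate as a percentage of test-takers."""
    if test_takers <= 0:
        raise ValueError("test_takers must be positive")
    return 100.0 * successful / test_takers


def check_reported_rates(records, tolerance=0.01):
    """Recompute each pass rate and compare it with the reported value.

    `tolerance` (in percentage points) is an assumed allowance for rounding.
    Returns a dict mapping year -> (computed rate, is_consistent).
    """
    results = {}
    for year, (test_takers, successful, reported) in records.items():
        computed = pass_rate(test_takers, successful)
        results[year] = (round(computed, 2), abs(computed - reported) <= tolerance)
    return results


if __name__ == "__main__":
    for year, (computed, ok) in sorted(check_reported_rates(NPLE_REPORTED).items()):
        print(year, computed, "consistent" if ok else "mismatch")
```

Under these assumptions, each recomputed rate agrees with the reported value to within 0.01 percentage points.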
In this study, we aimed to quantitatively evaluate the performance of ChatGPT on three types of national medical examinations in China, namely the NMLE, NPLE, and NNLE. To enhance the reliability of our findings, we meticulously collected a substantial corpus of real-world medical question-answer data from examinations conducted from 2017 to 2021. We also conducted a comparative analysis of performance across different units. For cases in which ChatGPT generated incorrect responses, we solicited feedback from domain experts and performed a thorough assessment and error analysis. Our study yields valuable insights for researchers and developers seeking to improve the performance of large language models in the medical domain.

