Medicine Jul 2, 10:00 AM EDT

From knowledge to judgment: A three-year longitudinal analysis of artificial intelligence large language model performance on the Chinese national nurse licensing examination

by Xinju Zhan, Weihua Yu, Jianshu Cai, Jionghuang Chen

Background The rapid advancement of Large Language Models (LLMs) presents unprecedented opportunities for healthcare education and professional credentialing. However, comprehensive longitudinal analyses of their evolving capabilities in nursing contexts remain limited.

Objective To conduct a three-year longitudinal performance analysis of major international and Chinese-native LLMs on the Chinese National Nurse Licensing Examination (NNLE) from July 2022 to June 2025, examining performance trajectories, comparative effectiveness, and domain-specific competencies.

Methods We curated a corpus of 9,800 multiple-choice questions from NNLE examinations (2022-2025) through validated educational resources. Fifteen leading LLMs were evaluated using standardized zero-shot prompting protocols, under a temporal-fidelity constraint: each model was tested only on examinations administered after its release date. Performance was measured as raw accuracy and benchmarked against the approximate 300-point passing threshold. Statistical analyses included trend analysis, comparative performance testing, and qualitative error categorization.
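The temporal-fidelity rule and the raw-accuracy metric described above can be illustrated with a minimal Python sketch. This is not the authors' code: the `Exam` and `Model` types, the function names, and all dates are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Exam:
    year: int
    administered: date  # hypothetical administration date


@dataclass(frozen=True)
class Model:
    name: str
    released: date  # hypothetical release date


def eligible_exams(model: Model, exams: list[Exam]) -> list[Exam]:
    """Keep only exams administered strictly after the model's release."""
    return [e for e in exams if e.administered > model.released]


def raw_accuracy(predicted: list[str], answer_key: list[str]) -> float:
    """Fraction of questions whose predicted option matches the answer key."""
    if not answer_key or len(predicted) != len(answer_key):
        raise ValueError("predictions and answer key must be non-empty and equal in length")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


# Illustrative data only; these are not values from the study.
demo_model = Model("demo-llm", date(2023, 6, 1))
demo_exams = [Exam(2023, date(2023, 4, 29)), Exam(2024, date(2024, 4, 20))]
usable = eligible_exams(demo_model, demo_exams)  # only the 2024 exam qualifies
score = raw_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])  # 3 of 4 correct
```

Filtering by release date keeps each model away from examinations that could have appeared in its training data, which is the purpose of the constraint stated in the Methods.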

Results LLM performance followed a steep upward trajectory, with the accuracy of top-tier models rising from 47.0% in 2022 to 78.8% in 2025. Chinese-native models consistently outperformed international counterparts. The mean Chinese-native advantage narrowed from 6.1 percentage points in 2023 to 3.0 percentage points in 2025, while the top-model advantage persisted but was non-monotonic, measuring 4.5, 3.0, and 3.8 percentage points in 2023, 2024, and 2025, respectively. Models performed better in the knowledge-oriented Professional Practice section (81.6% average accuracy) than in the application-oriented Practical Skills section (70.9% average accuracy). Clinical reasoning failures, particularly in nursing intervention prioritization, accounted for 43% of errors among top-performing models.
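The differences implied by the reported figures can be checked with a short Python sketch. The helper name `pp_diff` is an assumption introduced here; the input numbers are taken directly from the Results paragraph.

```python
def pp_diff(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points (one decimal)."""
    return round(a - b, 1)


# Top-tier accuracy, 2025 versus 2022.
top_gain = pp_diff(78.8, 47.0)

# Professional Practice versus Practical Skills average accuracy.
section_gap = pp_diff(81.6, 70.9)

# Change in the mean Chinese-native advantage, 2025 versus 2023.
mean_advantage_change = pp_diff(3.0, 6.1)

print(top_gain, section_gap, mean_advantage_change)
```

On these inputs, top-tier accuracy rose by 31.8 percentage points, the gap between sections is 10.7 percentage points, and the mean Chinese-native advantage shrank by 3.1 percentage points.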

Conclusion While state-of-the-art LLMs demonstrate substantial codified nursing knowledge sufficient to achieve approximate passing thresholds on professional licensing examinations, significant deficiencies in complex clinical judgment persist, defining the current boundary between artificial intelligence capabilities and human professional competence. Critically, examination performance should not be interpreted as evidence of clinical readiness or autonomous practice capability.
