Computing Jun 4, 10:00 AM EDT

Performance and safety of a fine-tuned small language model for pediatric emergency triage: A benchmark study

by Eui Jun Lee, Jae Yun Jung, Do Kyun Kim, Joong Wan Park, Young Ho Kwak

Pediatric emergency triage is a safety-critical task, and recent studies have explored whether artificial intelligence, including language models, can support triage decision-making; however, evidence on fine-tuned open-weight language models remains limited. We conducted a retrospective benchmark study using de-identified triage records from a tertiary pediatric emergency department in Korea collected from January 2020 to April 2025. After exclusions, 74,170 encounters were included. Each encounter was reconstructed into a case-level text sequence from triage-time structured variables and nurse-authored narratives. Qwen3-8B-Base was fine-tuned with Low-Rank Adaptation and Group Relative Policy Optimization using a safety-oriented reward design and was compared with a structured-data XGBoost model on a common evaluable test subset of 14,832 encounters. The fine-tuned model achieved an accuracy of 58.60%, a macro-F1 score of 0.417, and a quadratic weighted kappa of 0.535. Within-one-level agreement was 97.13%, and strict under-triage, defined as true Korean Triage and Acuity Scale levels 1 or 2 predicted as levels 4 or 5, occurred in 0.65% of cases. The structured-data comparator showed higher overall performance, with an accuracy of 69.40%, a macro-F1 score of 0.618, and a quadratic weighted kappa of 0.651. However, the fine-tuned model showed fewer extreme errors and lower strict under-triage in selected high-acuity groups, at the cost of higher over-triage. In this real-world pediatric benchmark, the fine-tuned language model did not surpass the structured-data comparator in overall performance but showed a distinct safety-oriented error profile. These findings support its potential role as a decision-support aid for human triage review rather than an autonomous triage system. External and prospective validation will be necessary before clinical implementation.
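The agreement and safety metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names are hypothetical, labels are assumed to be integer KTAS levels from 1 (most urgent) to 5, and the strict under-triage rate is assumed to use all encounters as the denominator, which the abstract does not state explicitly.

```python
from collections import Counter

LEVELS = (1, 2, 3, 4, 5)  # assumed integer KTAS levels, 1 = most urgent


def within_one_level(y_true, y_pred):
    """Share of cases whose predicted level is at most one level off."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def strict_under_triage(y_true, y_pred):
    """Share of cases with true level 1 or 2 predicted as level 4 or 5.

    Assumption: the denominator is all cases, not only high-acuity ones.
    """
    hits = sum(t in (1, 2) and p in (4, 5) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, levels=LEVELS):
    """Cohen's kappa with quadratic disagreement weights.

    Undefined (division by zero) when the expected weighted disagreement
    is zero, e.g. if every true and predicted label is the same level.
    """
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    span = (len(levels) - 1) ** 2  # largest possible squared distance
    observed_cost = 0.0
    expected_cost = 0.0
    for i in levels:
        for j in levels:
            weight = (i - j) ** 2 / span
            observed_cost += weight * observed[(i, j)] / n
            expected_cost += weight * true_counts[i] * pred_counts[j] / (n * n)
    return 1.0 - observed_cost / expected_cost


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = [1, 2, 3, 4, 5, 2]
    y_pred = [1, 4, 3, 5, 4, 2]
    print(within_one_level(y_true, y_pred))
    print(strict_under_triage(y_true, y_pred))
    print(quadratic_weighted_kappa(y_true, y_pred))
```

The quadratic weights penalize a two-level miss four times as heavily as a one-level miss. This is why a model can combine modest exact accuracy with high within-one-level agreement and still reach a moderate kappa.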