Assessing and Comparing Free Large Language Models’ Responses to a Clinical Case: Accuracy, Safety, and Reliability

Conference Paper
Publication Date:
2025
Abstract:
This research examines how five free-of-charge Large Language Models (LLMs), namely Zephyr, Mistral 7B, LLAMA 2 7B, ChatGPT 3.5 and Copilot Precise, perform when faced with a nursing clinical scenario involving a neuropsychiatric emergency. Their responses were evaluated against established guidelines through a Delphi consensus, using a 5-point Likert scale to rate safety, accuracy, reliability and potential for improvement. The findings underscore the primary importance of the safety and accuracy metrics. LLAMA 2 7B exhibits balanced but poor performance, scoring 3 out of 5 in Safety, Accuracy, and References, and 4 out of 5 in providing Improvement suggestions. ChatGPT 3.5 demonstrates adequate performance in Safety, Accuracy, and References, each with a score of 4 out of 5, indicating its proficiency in generating accurate, reliable content and ensuring patient safety, though there is room for improvement in its Improvement suggestions (3 out of 5). Copilot Precise shows a unique profile, with balanced scores of 3 out of 5 in Safety, Accuracy, and Improvements, and a perfect score of 5 out of 5 only in References, highlighting its high accuracy in generating references. Reliability was reported in terms of both reference precision criteria and consistency over time, computed through automated assessment. These preliminary results underscore the importance of developing language models that focus on ensuring safety and precision in clinical decision-making scenarios. Further studies should aim to improve the accuracy and dependability of these models by examining a range of situations and incorporating real-time feedback mechanisms from experts, which would enhance their usefulness in clinical environments.
Iris type:
4.1 Contribution in conference proceedings
Keywords:
accuracy; clinical decision-making; large language models; reliability; safety
List of contributors:
Sblendorio, Elena; Lo Cascio, Alessio; Napolitano, Daniele; Germini, Francesco; Dentamaro, Vincenzo; Piredda, Michela; Cicolini, Giancarlo
Authors of the University:
CICOLINI GIANCARLO
Handle:
https://ricerca.unich.it/handle/11564/867177
Book title:
Lecture Notes in Computer Science
Published in:
Lecture Notes in Computer Science (series)