Information

Jun 10, 2024

A Historical Review and Proposed Clarification of Dataset Terminology in Deep Learning for Medicine

We examined the historical evolution of dataset terminology and the causes of confusion around it, focusing on how the term "validation" is interpreted differently in medicine and in deep learning, and proposed terminology that both fields can share.

Paper

Data set terminology of deep learning in medicine: a historical review and recommendation
Japanese Journal of Radiology
https://doi.org/10.1007/s11604-024-01608-1

Author's Comments

This work began at the 2023 annual meeting of the Japanese Society of Medical Radiology, where we received a great deal of feedback about how differently deep learning dataset terminology, particularly the word "validation," is used in medicine and in engineering. Many attendees asked why the terms and their meanings are so inconsistent, and officials of the society strongly encouraged us to publish a paper in response, which led to this article. Moving between the worlds of medicine and AI technology, we ourselves have repeatedly seen people use the same words yet mean different things. In response to these voices from the field, the paper was published as an Invited Review in the June 2024 issue of the Japanese Journal of Radiology, organizing the subject through its history and case examples.

Paper Overview

This paper compares how datasets are handled in engineering, where deep learning originated, with the concept of validation traditionally emphasized in medicine. We focused in particular on the fact that "validation" often refers to the final confirmation of accuracy in medicine, whereas in deep learning it frequently denotes an intermediate stage used for parameter adjustment. Because misunderstandings arising from this difference can affect how research is reported and how clinical applications are evaluated, we undertook a cross-disciplinary review of the terminology, organized by historical context.
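The difference between the two roles can be illustrated with a minimal Python sketch. It is not taken from the paper: the scores, labels, candidate thresholds, and the `accuracy` helper are all hypothetical and chosen only for illustration. The validation (tuning) data are used to choose a setting, and the test data are used once, afterwards, for the final accuracy figure.

```python
def accuracy(scores, labels, threshold):
    """Fraction of cases classified correctly at the given decision threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical model output scores and ground-truth labels (illustration only).
validation_scores = [0.1, 0.4, 0.55, 0.8, 0.65, 0.9]
validation_labels = [0, 0, 1, 1, 1, 1]
test_scores = [0.2, 0.3, 0.6, 0.7, 0.5, 0.95]
test_labels = [0, 0, 1, 1, 0, 1]

# Intermediate stage: the validation (tuning) set selects the threshold.
candidate_thresholds = [0.3, 0.5, 0.7]
best_threshold = max(
    candidate_thresholds,
    key=lambda t: accuracy(validation_scores, validation_labels, t),
)

# Final stage: the test set is touched only once, with the setting already fixed.
final_accuracy = accuracy(test_scores, test_labels, best_threshold)
print(f"threshold chosen on validation set: {best_threshold}")
print(f"accuracy reported on test set: {final_accuracy:.3f}")
```

In the medical sense of the word, the figure a reader would call "validated" accuracy corresponds to `final_accuracy` here, not to the score obtained on the validation set.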

Paper Details

We first reviewed the historical meaning of the term "validation" in the medical literature, exploring the background against which medicine has emphasized "verification" as a final confirmation of diagnostic accuracy. Deep learning, in contrast, established early on a three-part division into "training," "validation," and "test" sets, and we explained that the validation set in this context serves not as a final evaluation but as an intermediate step that helps prevent model overfitting. We also introduced the distinction between internal and external data for the test sets used in final evaluation, and defined and discussed the significance of external datasets that are temporally or geographically distinct.

In conclusion, we proposed that medicine also adopt the standard three-part division into "training," "validation (or tuning)," and "test," and that research papers define each data group clearly, as an important step toward connecting research in the two fields smoothly. We expect this to reduce unintended misunderstandings between medical practitioners and AI researchers and to improve the reproducibility of results and the versatility of models. We believe that standardizing terminology will become increasingly important as deep learning spreads further in medicine, and we hope this paper will serve as a foundation for that effort.
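As a rough illustration of the data groups discussed above, the following Python sketch labels each group explicitly. It is only a sketch under stated assumptions, not the procedure from the paper: the `split_dataset` function, the 70/15/15 proportions, the hospital names, and the case records are hypothetical, and each record is assumed to belong to a different patient.

```python
import random

def split_dataset(records, train_pct=70, tuning_pct=15, seed=0):
    """Shuffle records and divide them into three explicitly named groups.

    The percentages are illustrative assumptions; the remainder after the
    training and validation (tuning) groups becomes the internal test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_tuning = len(shuffled) * tuning_pct // 100
    return {
        "training": shuffled[:n_train],
        "validation (tuning)": shuffled[n_train:n_train + n_tuning],
        "internal test": shuffled[n_train + n_tuning:],
    }

# Hypothetical cases: (case_id, institution). One patient per case is assumed.
internal_cases = [(f"case{i:03d}", "Hospital A") for i in range(100)]
external_cases = [(f"ext{i:03d}", "Hospital B") for i in range(20)]

groups = split_dataset(internal_cases)
# External test data come from a different institution and are never
# mixed into training or tuning.
groups["external test"] = external_cases

for name, cases in groups.items():
    print(f"{name}: {len(cases)} cases")
```

Naming the groups in the data structure itself mirrors the recommendation to state in a paper exactly which data played which role.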