Shouhei Hanaoka’s research while affiliated with University of Tokyo Hospital and other places


Publications (120)


Fig. 2 Bland–Altman plots comparing pancreatic volume measurements by artificial intelligence models with the ground truth, showing bias and 95% limits of agreement for each model.
Comparison of publicly available artificial intelligence models for pancreatic segmentation on T1-weighted Dixon images
  • Article
  • Full-text available

June 2025 · Japanese Journal of Radiology

Yuki Sonoda · Shota Fujisawa · [...]

Purpose: This study aimed to compare three publicly available deep learning models (TotalSegmentator, TotalVibeSegmentator, and PanSegNet) for automated pancreatic segmentation on magnetic resonance images and to evaluate their performance against human annotations in terms of segmentation accuracy, volumetric measurement, and intrapancreatic fat fraction (IPFF) assessment.

Materials and methods: Twenty upper abdominal T1-weighted magnetic resonance series acquired using the two-point Dixon method were randomly selected. Three radiologists manually segmented the pancreas, and a ground-truth mask was constructed through a majority vote per voxel. Pancreatic segmentation was also performed using the three artificial intelligence models. Performance was evaluated using the Dice similarity coefficient (DSC), 95th-percentile Hausdorff distance, average symmetric surface distance, positive predictive value, sensitivity, Bland–Altman plots, and concordance correlation coefficient (CCC) for pancreatic volume and IPFF.

Results: PanSegNet achieved the highest DSC (mean ± standard deviation, 0.883 ± 0.095) and showed no statistically significant difference from the human interobserver DSC (0.896 ± 0.068; p = 0.24). In contrast, TotalVibeSegmentator (0.731 ± 0.105) and TotalSegmentator (0.707 ± 0.142) had significantly lower DSC values compared with the human interobserver average (p < 0.001). For pancreatic volume and IPFF, PanSegNet demonstrated the best agreement with the ground truth (CCC values of 0.958 and 0.993, respectively), followed by TotalSegmentator (0.834 and 0.980) and TotalVibeSegmentator (0.720 and 0.672).

Conclusion: PanSegNet demonstrated the highest segmentation accuracy and the best agreement with human measurements for both pancreatic volume and IPFF on T1-weighted Dixon images. This model appears to be the most suitable for large-scale studies requiring automated pancreatic segmentation and intrapancreatic fat evaluation.
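The per-voxel majority vote used to build the ground-truth mask can be sketched as follows. This is a minimal illustration on hypothetical 2×2 rater masks; the study combined three radiologists' 3D annotations the same way, one vote per voxel.

```python
import numpy as np

def majority_vote_mask(masks):
    """Combine binary rater masks into a single ground-truth mask.

    A voxel is labeled foreground when more than half of the raters
    marked it (strict per-voxel majority vote).
    """
    stacked = np.stack(masks).astype(int)  # shape: (raters, ...spatial)
    votes = stacked.sum(axis=0)            # per-voxel vote count
    return votes > (len(masks) / 2)        # strict majority

# Three hypothetical 2x2 rater masks
r1 = np.array([[1, 1], [0, 0]])
r2 = np.array([[1, 0], [1, 0]])
r3 = np.array([[1, 0], [0, 0]])
gt = majority_vote_mask([r1, r2, r3])
# Only the top-left voxel receives 3 of 3 votes -> [[True, False], [False, False]]
```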


Fig. 2 Kaplan–Meier curves for overall survival according to body composition parameters using optimal cutoffs for the present cohort, including skeletal muscle index (SMI) and skeletal muscle density (SMD).
Fig. 3 Kaplan–Meier curves for cancer-specific survival according to body composition parameters using optimal cutoffs for the present cohort, including skeletal muscle index (SMI) and skeletal muscle density (SMD).
High visceral-to-subcutaneous fat area ratio is an unfavorable prognostic indicator in patients with uterine sarcoma

June 2025 · Japanese Journal of Radiology

Uterine sarcoma is a rare disease whose association with body composition parameters is poorly understood. This study explored the impact of body composition parameters on overall survival in patients with uterine sarcoma.

This multicenter study included 52 patients with uterine sarcomas treated at three Japanese hospitals between 2007 and 2023. A semi-automatic segmentation program based on deep learning analyzed transaxial CT images at the L3 vertebral level, calculating the following body composition parameters: area indices (areas divided by height squared) of skeletal muscle, visceral adipose tissue, and subcutaneous adipose tissue (SMI, VATI, and SATI, respectively); skeletal muscle density; and the visceral-to-subcutaneous fat area ratio (VSR). The optimal cutoff values for each parameter were calculated using maximally selected rank statistics with several p value approximations. The effects of body composition parameters and clinical data on overall survival (OS) and cancer-specific survival (CSS) were analyzed.

Univariate Cox proportional hazards regression analysis revealed that advanced stage (III–IV) and high VSR were unfavorable prognostic factors for both OS and CSS. Multivariate Cox proportional hazards regression analysis revealed that advanced stage (III–IV) (hazard ratios (HRs), 4.67 for OS and 4.36 for CSS; p < 0.01) and high VSR (HRs, 9.36 for OS and 8.22 for CSS; p < 0.001) were poor prognostic factors for both OS and CSS. Added value was observed when the VSR was incorporated into the OS and CSS prediction models. Increased VSR and tumor stage are significant predictors of poor overall survival in patients with uterine sarcoma.
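The area indices and VSR described above are simple ratios, which a short sketch makes concrete. The function names and the numeric values below are illustrative, not from the study's segmentation software.

```python
def area_index(area_cm2, height_m):
    """Area index: tissue area divided by height squared (cm^2 / m^2)."""
    return area_cm2 / height_m ** 2

def visceral_to_subcutaneous_ratio(vat_cm2, sat_cm2):
    """VSR: visceral adipose tissue area over subcutaneous adipose tissue area."""
    return vat_cm2 / sat_cm2

# Hypothetical measurements at the L3 level
smi = area_index(120.0, 1.60)                       # skeletal muscle index
vsr = visceral_to_subcutaneous_ratio(90.0, 180.0)   # VSR = 0.5
```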


CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

June 2025

The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXRs). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, CXR-LT 2024 expands the dataset to 377,110 CXRs and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set; (ii) long-tailed classification on a manually annotated "gold standard" subset; and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.
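Long-tailed, multi-label classifiers are typically scored one disease label at a time. As one hedged illustration (not necessarily the challenge's official metric), per-label average precision rewards ranking the few true positives of a rare finding above the negatives:

```python
def average_precision(y_true, scores):
    """AP for one disease label: mean of precision at each true-positive rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# A hypothetical rare label: 2 positives among 6 studies
ap = average_precision([1, 0, 0, 1, 0, 0],
                       [0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
# Positives land at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
```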


Sensitivity-Aware Differential Privacy for Federated Medical Imaging

April 2025

Federated learning (FL) enables collaborative model training across multiple institutions without the sharing of raw patient data, making it particularly suitable for smart healthcare applications. However, recent studies revealed that merely sharing gradients provides a false sense of security, as private information can still be inferred through gradient inversion attacks (GIAs). While differential privacy (DP) provides provable privacy guarantees, traditional DP methods apply uniform protection, leading to excessive protection for low-sensitivity data and insufficient protection for high-sensitivity data, which degrades model performance and increases privacy risks. This paper proposes a new privacy notion, sensitivity-aware differential privacy, to better balance model performance and privacy protection. Our idea is that the sensitivity of each data sample can be objectively measured using real-world attacks. To implement this new notion, we develop the corresponding defense mechanism that adjusts privacy protection levels based on the variation in the privacy leakage risks of gradient inversion attacks. Furthermore, the method extends naturally to multi-attack scenarios. Extensive experiments on real-world medical imaging datasets demonstrate that, under equivalent privacy risk, our method achieves an average performance improvement of 13.5% over state-of-the-art methods.
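The core idea above, protection proportional to measured privacy risk rather than uniform, can be sketched schematically. The scaling rule and parameters below are illustrative assumptions, not the paper's actual calibration mechanism:

```python
import numpy as np

def sensitivity_aware_noise(grad, leakage_risk, base_sigma=1.0, seed=0):
    """Schematic of sensitivity-aware DP: scale Gaussian noise by a
    per-sample leakage risk (e.g., measured via a gradient inversion
    attack) instead of one uniform noise level for all samples.
    `leakage_risk` in [0, 1] and `base_sigma` are illustrative knobs.
    """
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    sigma = base_sigma * leakage_risk
    return grad + rng.normal(0.0, sigma, size=grad.shape)

g = np.ones(4)                                       # a toy gradient
low = sensitivity_aware_noise(g, leakage_risk=0.0)   # no measured risk -> no noise
high = sensitivity_aware_noise(g, leakage_risk=1.0)  # high risk -> strong noise
```

A sample whose gradient is shown to be easily invertible receives more noise; a low-risk sample keeps its gradient nearly intact, which is how the method avoids over-protecting low-sensitivity data.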


Using Segment Anything Model 2 for Zero-Shot 3D Segmentation of Abdominal Organs in Computed Tomography Scans to Adapt Video Tracking Capabilities for 3D Medical Imaging: Algorithm Development and Validation

April 2025 · JMIR AI

Background: Medical image segmentation is crucial for diagnosis and treatment planning in radiology, but it traditionally requires extensive manual effort and specialized training data. With its novel video tracking capabilities, the Segment Anything Model 2 (SAM 2) presents a potential solution for automated 3D medical image segmentation without the need for domain-specific training. However, its effectiveness in medical applications, particularly in abdominal computed tomography (CT) imaging, remains unexplored.

Objective: The aim of this study was to evaluate the zero-shot performance of SAM 2 in 3D segmentation of abdominal organs in CT scans and to investigate the effects of prompt settings on segmentation results.

Methods: In this retrospective study, we used a subset of the TotalSegmentator CT dataset from eight institutions to assess SAM 2's ability to segment eight abdominal organs. Segmentation was initiated from three different z-coordinate levels (caudal, mid, and cranial) of each organ. Performance was measured using the Dice similarity coefficient (DSC). We also analyzed the impact of "negative prompts," which explicitly exclude certain regions from the segmentation process, on accuracy.

Results: A total of 123 patients (mean age 60.7, SD 15.5 years; 63 men, 60 women) were evaluated. In this zero-shot setting, larger organs with clear boundaries demonstrated high segmentation performance, with mean DSCs as follows: liver, 0.821 (SD 0.192); right kidney, 0.862 (SD 0.212); left kidney, 0.870 (SD 0.154); and spleen, 0.891 (SD 0.131). Smaller organs showed lower performance: gallbladder, 0.531 (SD 0.291); pancreas, 0.361 (SD 0.197); right adrenal gland, 0.203 (SD 0.222); and left adrenal gland, 0.308 (SD 0.234). The initial slice for segmentation and the use of negative prompts significantly influenced the results; removing negative prompts from the input significantly decreased the DSCs for six organs.

Conclusions: SAM 2 demonstrated promising zero-shot performance in segmenting certain abdominal organs in CT scans, particularly larger organs. Performance was significantly influenced by input negative prompts and initial slice selection, highlighting the importance of optimizing these factors.
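The Dice similarity coefficient used throughout these evaluations is straightforward to compute for binary 3D masks; a minimal sketch on toy volumes:

```python
import numpy as np

def dice(pred, ref):
    """Dice similarity coefficient between two binary masks:
    2 * |intersection| / (|pred| + |ref|)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * inter / denom if denom else 1.0

# Toy 2x2x2 volumes: each mask covers 4 voxels, overlapping in 2
pred = np.zeros((2, 2, 2), dtype=bool)
ref = np.zeros((2, 2, 2), dtype=bool)
pred[0] = True      # first z-slice
ref[:, 0] = True    # first row of every slice
d = dice(pred, ref) # 2 * 2 / (4 + 4) = 0.5
```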




A Deep Learning Based Automated Detection of Mucus Plugs in Chest CT

March 2025

This study presents a novel two-stage deep learning algorithm for the automated detection of mucus plugs in CT scans of patients with respiratory diseases. Despite the clinical significance of mucus plugs in COPD and asthma, where they indicate hypoxemia, reduced exercise tolerance, and poorer outcomes, they remain under-evaluated in clinical practice due to labor-intensive manual annotation. The developed algorithm first segments both patent and obstructed airways using a VNet-based model pre-trained on normal airway structures and fine-tuned on mucus-containing scans. Subsequently, a rule-based post-processing method identifies mucus plugs by evaluating cross-sectional areas along airway centerlines. Validation on an in-house dataset of 33 CT scans from patients with asthma/COPD demonstrated high sensitivity (93.8%) though modest positive predictive value (18.8%). Performance on an external dataset (LIDC-IDRI) achieved 82.8% sensitivity with 23.5% PPV. While challenges remain in reducing false positives, this automated detection tool shows promise for screening applications in both clinical and research settings, potentially addressing the current gap in mucus plug evaluation within standard practice.
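The rule-based post-processing step can be illustrated with a simplified patency check along a centerline: compare the patent (air-filled) lumen area with the total airway cross-section at each centerline point and flag points where patency is low. The threshold and area definitions here are hypothetical, not the paper's actual criteria:

```python
import numpy as np

def flag_mucus_plugs(lumen_areas, total_areas, ratio_threshold=0.5):
    """Flag centerline points whose patent-lumen fraction falls below
    a threshold (a simplified stand-in for the paper's rule set).

    lumen_areas:  patent cross-sectional area at each centerline point
    total_areas:  total airway cross-sectional area at the same points
    """
    lumen = np.asarray(lumen_areas, dtype=float)
    total = np.asarray(total_areas, dtype=float)
    patency = np.divide(lumen, total,
                        out=np.zeros_like(lumen), where=total > 0)
    return patency < ratio_threshold

# Hypothetical areas (mm^2) at three centerline points; the middle
# point is mostly obstructed
flags = flag_mucus_plugs([3.0, 0.5, 2.5], [3.2, 3.0, 2.6])
```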


Figure 4: Comparison of tokenization between BERT and ModernBERT models on a complete medical report. The figure presents the original report text (left column) alongside its tokenized representations by BERT (middle column) and ModernBERT (right column). Tokens are separated by "/" symbols, with "##" in the BERT column indicating subtokens.
Figure 6: A bar chart comparing the training and inference speeds (in samples per second) of BERT Base and ModernBERT. ModernBERT demonstrates a 1.65× speedup during training and a 1.66× speedup during inference compared to BERT Base.
ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports

March 2025

Objective: This study aims to evaluate and compare the performance of two Japanese language models, the conventional Bidirectional Encoder Representations from Transformers (BERT) and the newer ModernBERT, in classifying findings from chest CT reports, with a focus on tokenization efficiency, processing time, and classification performance.

Methods: We conducted a retrospective study using the CT-RATE-JPN dataset, containing 22,778 training reports and 150 test reports. Both models were fine-tuned for multi-label classification of 18 common chest CT conditions. The training data was split 18,222:4,556 into training and validation sets. Performance was evaluated using F1 scores for each condition and exact match accuracy across all 18 labels.

Results: ModernBERT demonstrated superior tokenization efficiency, requiring 24.0% fewer tokens per document (258.1 vs. 339.6) than BERT Base. This translated into significant performance improvements: ModernBERT completed training in 1877.67 seconds versus BERT's 3090.54 seconds (a 39% reduction), and it processed 38.82 samples per second during training (1.65× faster) and 139.90 samples per second during inference (1.66× faster). Despite these efficiency gains, classification performance remained comparable, with ModernBERT achieving superior F1 scores in 8 conditions and BERT performing better in 4 conditions. Overall exact match accuracy was slightly higher for ModernBERT (74.67% vs. 72.67%), though this difference was not statistically significant (p = 0.6291).

Conclusion: ModernBERT offers substantial improvements in tokenization efficiency and training speed without sacrificing classification performance. These results suggest that ModernBERT is a promising candidate for clinical applications in Japanese radiology report analysis.
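The reported efficiency gains follow directly from the abstract's figures; a quick arithmetic check of the token and training-time reductions:

```python
def percent_reduction(baseline, new):
    """Relative reduction of `new` versus `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

# Tokens per document: BERT Base 339.6 vs. ModernBERT 258.1
token_reduction = percent_reduction(339.6, 258.1)   # ~24.0%

# Training time: 3090.54 s (BERT) vs. 1877.67 s (ModernBERT)
time_reduction = percent_reduction(3090.54, 1877.67)  # ~39%
```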


Large Language Model Approach for Zero-Shot Information Extraction and Clustering of Japanese Radiology Reports: Algorithm Development and Validation

January 2025 · JMIR Cancer

Background: The application of natural language processing in medicine has increased significantly, including tasks such as information extraction and classification. Natural language processing plays a crucial role in structuring free-form radiology reports, facilitating the interpretation of textual content, and enhancing data utility through clustering techniques. Clustering allows for the identification of similar lesions and disease patterns across a broad dataset, making it useful for aggregating information and discovering new insights in medical imaging. However, most publicly available medical datasets are in English, with limited resources in other languages. This scarcity poses a challenge for the development of models geared toward non-English downstream tasks.

Objective: This study aimed to develop and evaluate an algorithm that uses large language models (LLMs) to extract information from Japanese lung cancer radiology reports and perform clustering analysis. The effectiveness of this approach was assessed and compared with previous supervised methods.

Methods: This study employed the MedTxt-RR dataset, comprising 135 Japanese radiology reports from 9 radiologists who interpreted the computed tomography images of 15 lung cancer patients obtained from Radiopaedia. Previously used in the NTCIR-16 (NII Testbeds and Community for Information Access Research) shared task for clustering performance competition, this dataset was ideal for comparing the clustering ability of our algorithm with those of previous methods. The dataset was split into 8 cases for development and 7 for testing. The study's approach involved using the LLM to extract information pertinent to lung cancer findings and transforming it into numeric features for clustering, using the K-means method. Performance was evaluated using 135 reports for information extraction accuracy and 63 test reports for clustering performance. The extraction evaluation focused on the accuracy of tumor size, location, and laterality; clustering performance was evaluated using normalized mutual information, adjusted mutual information, and the Fowlkes–Mallows index for both the development and test data.

Results: The tumor size was accurately identified in 99 out of 135 reports (73.3%), with errors in 36 reports (26.7%), primarily due to missing or incorrect size information. Tumor location and laterality were identified with greater accuracy, in 112 out of 135 reports (83%); however, 23 reports (17%) contained errors, mainly due to empty values or incorrect data. Clustering of the test data yielded a normalized mutual information of 0.6414, an adjusted mutual information of 0.5598, and a Fowlkes–Mallows index of 0.5354. The proposed method demonstrated superior performance across all evaluation metrics compared with previous methods.

Conclusions: The unsupervised LLM approach surpassed the existing supervised methods in clustering Japanese radiology reports. These findings suggest that LLMs hold promise for extracting information from radiology reports and integrating it into disease-specific knowledge structures.
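The clustering step, K-means over numeric features extracted by the LLM, can be sketched with a minimal Lloyd's-algorithm implementation. The feature encoding below (tumor size in mm plus a laterality flag) is hypothetical, chosen only to make the example self-contained:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pts[labels == j].mean(axis=0)
    return labels

# Hypothetical per-report features: [tumor size (mm), right=1 / left=0]
feats = [[32, 1], [30, 1], [31, 1], [8, 0], [9, 0], [10, 0]]
labels = kmeans(feats, k=2)
# Reports describing the same lesion cluster together
```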


Citations (54)


... Zero-Shot 3D Medical Image Segmentation. Existing zero-shot 3D medical image segmentation methods fall primarily into two categories: SAM [25]-based methods [2,37,39,47] and methods based on vision-language alignment [22,23]. SAM-based 3D medical image segmentation methods [2,37,39,47] demonstrate promising zeroshot performance in segmenting certain organs, particularly larger organs with clear boundaries. ...

Reference:

Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
Using Segment Anything Model 2 for Zero-Shot 3D Segmentation of Abdominal Organs in Computed Tomography Scans to Adapt Video Tracking Capabilities for 3D Medical Imaging: Algorithm Development and Validation
  • Citing Article
  • April 2025

JMIR AI

... The CT-RATE-JPN dataset (https://huggingface.co/datasets/YYama0/CT-RATE-JPN) [18], which was developed in our previous research, is a Japanese-translated version of radiology reports from the original CT-RATE dataset (https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) [19][20][21]. The CT-RATE-JPN dataset was specifically created to facilitate Japanese medical AI model development and evaluation. ...

Development of a Large-scale Dataset of Chest Computed Tomography Reports in Japanese and a High-performance Finding Classification Model

... We would like to begin by thanking the authors of the letter "It's not time to kick out radiologists" (Nakamura et al. 2024) for their comments on our paper. Nakamura et al. raise many valid points; however, none of the objections they raise derails what we take to be the central thesis of our paper, "When can we Kick (Some) Humans "Out of the Loop"? ...

It is Not Time to Kick Out Radiologists

Asian Bioethics Review

... In MI the most commonly used aggregation method is the standard FedAvg (Roy et al., 2019;Li et al., 2019;Roth et al., 2020;Cetinkaya et al., 2021;Parekh et al., 2021;Ziller et al., 2021b;Liu et al., 2021a;Dou et al., 2021;Feki et al., 2021;Stripelis et al., 2021a;Roth et al., 2021;Stripelis et al., 2021b;Agbley et al., 2021;Linardos et al., 2022;Adnan et al., 2022;Li et al., 2022c;He et al., 2022;Yang et al., 2022b;Subramanian et al., 2022;Luo and Wu, 2022;Ślazyk et al., 2022;Zhou et al., 2022;Elshabrawy et al., 2022;Pati et al., 2022;Misonne and Jodogne, 2022;Stripelis et al., 2022;Lu et al., 2022b;Kumar et al., 2022;Tan et al., 2023;Liu et al., 2023b;Jiménez-Sánchez et al., 2023;Mushtaq et al., 2023;Elmas et al., 2023;Levac et al., 2023;Denissen et al., 2023;Kanhere et al., 2023;Kaushal et al., 2023;Makkar and Santosh, 2023;Wu et al., 2023a;Wang et al., 2023;Kim et al., 2024;Qi et al., 2024b;Al-Salman et al., 2024;Mitrovska et al., 2024;Yamada et al., 2024;Yan et al., 2024;Xiang et al., 2024;Zheng et al., 2024;Zhou et al., 2024;Hossain et al., 2024;Sun et al., 2024;Khan et al., 2024;Deng et al., 2024a;Babar et al., 2024;Myrzashova et al., 2025;Gupta et al., 2024b;Albalawi et al., 2024;Gupta et al., 2024a;Kumar et al., 2024;Deng et al., 2024b) or its modification with equal weights (FL-EV) (Li et al., 2020b;Guo et al., 2021;Lo et al., 2021;Florescu et al., 2022;Linardos et al., 2022;Peng et al., 2023;Naumova et al., 2024;Abbas et al., 2024;Liu et al., 2024a;Vo et al., 2024). Some of the algorithms also use FedProx (Elshabrawy et al., 2022;Subramanian et al., 2022;Qi et al., 2024b), FedBN (Elshabrawy et al., 2022;Kanhere et al., 2023;Kulkarni et al., 2023) or adaptive algorithms (Stripelis et al., 2022;Levac et al., 2023;Qi et al., 2024b). ...

Investigation of distributed learning for automated lesion detection in head MR images

Radiological Physics and Technology

... The SAT and VAT areas were automatically identified in the transaxial CT image. The SAT was defined as extra-peritoneal adipose tissue with a CT-attenuation range between −190 and −30 HU, and VAT was defined as intra-abdominal adipose tissue with a CT-attenuation range between −150 and −50 HU ( Figure 1) [20,21]. Mean CT-attenuation values for SAT (SAT HU) and VAT (VAT HU) were then measured from areas on CT1 and CT2. ...

Integrated impact of multiple body composition parameters on overall survival in gastrointestinal or genitourinary cancers: A descriptive cohort study
  • Citing Article
  • July 2024

Journal of Parenteral and Enteral Nutrition

... A decrease in skeletal muscle mass has been associated with a higher incidence of falls and fractures, as well as poor outcomes in various conditions, including pneumonia, cardiovascular catheterization, and pancreatitis [1][2][3][4][5]. It is also linked to poor prognosis of various cancers in the head and neck, lung, gastrointestinal tract, liver, bile duct, pancreas, and urinary tract [1][2][3][4][5][6][7][8]. ...

Artificial intelligence-based skeletal muscle estimates and outcomes of endoscopic ultrasound-guided treatment of pancreatic fluid collections
  • Citing Article
  • July 2024

iGIE

... GPT is widely utilized and studied [51], and research consistently shows performance variations among its versions, such as GPT-3.5 and GPT-4 [48]. These performance differences are narrowing in more advanced models, illustrated by minor discrepancies between GPT-4 and its newer counterpart, GPT-4o [52]. This trend underscores the rapid development within the field, necessitating thorough and precise evaluation. ...

No improvement found with GPT-4o: results of additional experiments in the Japan Diagnostic Radiology Board Examination
  • Citing Article
  • June 2024

Japanese Journal of Radiology

... However, we have attempted to split the dataset into the training, validation, and test datasets in chronological order. According to Walston et al., although random splitting and cross-validation are categorized as internal datasets, temporal or geographical sets are categorized as external datasets [21]. Second, this model could not refer to previous reports. ...

Data set terminology of deep learning in medicine: a historical review and recommendation
  • Citing Article
  • June 2024

Japanese Journal of Radiology

... The accurate determination of the cause of death is crucial for both medical and legal purposes, requiring careful integration of clinical information and postmortem findings. Recent studies have demonstrated remarkable progress in the capabilities of large language models (LLMs) across various medical diagnostic tasks [1][2][3][4][5][6][7][8][9][10][11], including clinical reasoning, radiological interpretation, and complex case analysis. Although these advances are promising [12,13], LLM application to death investigation remains largely unexplored [14]. ...

GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination

Japanese Journal of Radiology

... In recent years, researchers have developed a variety of techniques to address these challenges, as we can mention: (i) creating simulated or synthetic data. This last involves the use of methods such as probabilistic models [1,2], classificationbased imputation models [3,4]. (ii) Few-shot learning [5,6,7], which is mainly used to teach a model to generalize for new tasks or problems with only a few labeled examples per class. ...

Practical Medical Image Generation with Provable Privacy Protection Based on Denoising Diffusion Probabilistic Models for High-Resolution Volumetric Images