Artificial intelligence applications for breast cancer diagnosis in ultrasound: a narrative review
Introduction
Background
Breast cancer remains one of the leading causes of cancer-related morbidity and mortality worldwide, with early detection being critical for effective treatment and improved outcomes. Current screening guidelines recommend initiating breast imaging in women at age 40 years, with follow-up every 1–2 years depending on risk factors and clinical context (1). Mammography has long been the cornerstone of breast cancer screening, and it has recently seen rapid advancements in artificial intelligence (AI) integration. AI-assisted mammography has demonstrated robust diagnostic performance and is experiencing growing clinical adoption.
In contrast, the application of AI to breast ultrasound, while increasingly studied, remains less established. Ultrasound is a quick, widely available, and cost-efficient imaging modality, beneficial for women with dense breast tissue where mammography sensitivity is reduced. If AI can achieve comparable performance in ultrasound, it could enhance early detection, reduce unnecessary biopsies, and potentially expand screening capabilities, particularly in underserved or resource-limited regions.
Rationale and knowledge gap
Several recent systematic reviews have evaluated the application of AI in breast imaging, encompassing mammography, multimodal imaging, and broader radiomics-based approaches (2). Other reviews have also examined AI-assisted breast ultrasound (3). Still, these typically combine screening and diagnostic settings, focus on specific subgroups, such as molecular biomarker prediction, or provide broad overviews of deep learning (DL) without concentrating on diagnostic accuracy in ultrasound. Importantly, none of these reviews synthesize only primary research studies that evaluate AI models applied specifically to breast ultrasound diagnosis over the past 5 years.
Objective
Given the rapid expansion of AI-assisted ultrasound techniques and the development of new DL architectures since 2020, an updated synthesis focused exclusively on primary diagnostic accuracy studies is needed. By restricting inclusion to original research that reports model performance metrics and by excluding systematic reviews and other secondary sources, the present review provides a focused and contemporary assessment of AI performance, methodological quality, and sources of bias in breast ultrasound. This enables us to assess the quality and reproducibility of the current evidence and to identify the research gaps that remain before AI can be effectively integrated into routine diagnostic practice. We present this article in accordance with the Narrative Review reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-177/rc).
Methods
Eligibility criteria
The eligibility criteria and search strategy were guided by the Population, Intervention, Comparison, Outcomes, Study Design (PICOS) framework to ensure methodological clarity and reproducibility: “Population”, patients with breast cancer evaluated by ultrasound; “Intervention”, AI applications for diagnostic interpretation; “Comparison”, AI-assisted vs. conventional radiologist assessment; “Outcomes”, diagnostic performance metrics such as sensitivity, specificity, and accuracy; “Study Design”, original primary studies. Searches were performed using a standardized phrase across all databases: “AI ultrasound applications in breast cancer ‘clinical outcomes’-systematic-review”. This phrasing was selected to identify studies evaluating clinical outcomes of AI-assisted breast ultrasound while filtering out review articles. The search syntax was adjusted slightly to accommodate the formatting of each database, and no additional Boolean operators or medical subject headings (MeSH) terms were applied. Systematic reviews, editorials, and other non-original research studies were excluded during the screening process. Additionally, any studies that were inaccessible due to paywalls or search engine restrictions were excluded to maintain transparency and reproducibility.
Search strategy
Beyond the electronic searches summarized in Table 1, no manual searches, such as reviewing reference lists or consulting grey literature, were performed, as the focus was on peer-reviewed studies accessible through electronic databases.
Table 1
| Items | Specification |
|---|---|
| Date of search | The first search was conducted on June 10, 2024. The last search was conducted on October 24, 2025 |
| Databases and other search engines | The databases searched were Google Scholar and PubMed; supplemental search engines were AutoLit, Litmaps, Consensus, and Research Rabbit |
| Search terms used | Exact phrase used: “AI ultrasound applications in breast cancer ‘clinical outcomes’-systematic-review” |
| | “Systematic-review” was used to screen out as many systematic reviews as possible in the initial search |
| | “Clinical outcomes” was added to further screen for papers specifically related to patient outcomes and to screen out papers focused on other objectives |
| Timeframe | From January 2020 to October 2025 |
| Inclusion and exclusion criteria | Inclusion criteria: patients >18 years old; breast cancer as the primary subject; ultrasound as the studied imaging modality; AI as the means for analyzing radiological findings; reported clinical outcomes |
| Exclusion criteria: systematic reviews; usage of other imaging modalities; focus on cancers other than breast cancer | |
| Selection process | Using the search criteria, 837 articles were identified. The entire team screened the articles against the inclusion criteria; 52 of the 837 papers met the study eligibility criteria |
AI, artificial intelligence.
Selection process: once the search was completed, the identified studies underwent a rigorous selection process. Four student authors independently screened studies identified through Google Scholar and PubMed, with AutoLit, Litmaps, Consensus, and Research Rabbit used as supplemental search engines. They conducted full-text reviews to determine eligibility based on the inclusion and exclusion criteria. Blinding was not used during this process, which may introduce selection bias; however, independent review and consensus resolution were employed to minimize this limitation. ChatGPT version 5.0 was used exclusively to improve syntax, grammar, and diction during manuscript preparation. All data extraction and interpretation of papers were done manually by four student researchers and two attending physician supervisors. Any uncertainties were discussed collaboratively and resolved by consensus to ensure accuracy, consistency, and reproducibility across all included studies. Duplicate studies were manually identified and removed to prevent redundancy.
Data extraction
Following study selection, data were manually extracted by the four medical students and two senior attendings involved in the review. Given the variability among studies, extracted data points included study design, sample size, details of the AI model used, and the number of patients involved in training the AI system. In cases where discrepancies arose, such as a study initially meeting inclusion criteria but later being found to have been retracted, the study was excluded from the final analysis. Furthermore, any studies with missing or incomplete data that could not be reliably interpreted were also excluded, and these exclusions were acknowledged as limitations in the discussion section.
Quality assessment
Study quality was evaluated through expert review by the senior attending authors: (B.B.), an expert in AI applications in healthcare and a senior attending physician in medicine, and (N.Y.), also a senior attending physician and director of medicine. When discrepancies occurred, a third author (M.B.) served as the designated adjudicator. Assessment was based on relevance, methodological rigor, and clinical applicability. This expert-led evaluation provided an informed and experienced perspective on the reliability and significance of the included studies. Notably, no studies were excluded based on this quality assessment, ensuring that all relevant literature meeting the inclusion criteria was considered.
Evidence acquisition
Adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-Diagnostic Test Accuracy reporting guidelines, we systematically searched Google Scholar and PubMed, supplemented by the AutoLit, Litmaps, Consensus, and Research Rabbit search engines, from January 2020 to October 2025. Eligibility criteria included AI application studies in breast ultrasound imaging, capturing comparative diagnostic metrics.
Study quality was assessed by an expert domain reviewer without exclusions based on quality scores. The data extracted comprised detailed AI methodologies, patient characteristics, and diagnostic performance metrics. Statistical synthesis summarized diagnostic accuracy ranges due to methodological variability among studies.
Risk of bias assessment
We assessed the methodological quality of included studies using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) framework. The assessment focused on four domains: “patient selection”, “index test”, “reference standard”, and “flow and timing”.
Patient selection
Patient selection was evaluated according to the QUADAS-2 framework, focusing on how each study recruited participants, defined inclusion/exclusion criteria, and reported sampling methods. We recorded whether studies used retrospective or prospective designs, whether case selection was consecutive or convenience-based, and whether inclusion criteria were clearly defined.
Index test (AI model)
The index test domain assessed how each study described its AI model, including architecture, preprocessing steps, training approach, and whether blinding to reference standard results occurred during model development or testing. We documented reporting practices relevant to reproducibility and risk of bias.
Reference standard
Assessment of the reference standards considered whether studies used histopathology, radiologist consensus, or other clinical outcomes as the diagnostic benchmark. We evaluated the clarity of reference standard procedures and whether blinding to AI predictions was reported.
Flow and timing
Flow and timing were evaluated by determining whether studies reported the sequence of diagnostic procedures, time intervals between the index test and reference standard, and exclusions that occurred after enrollment. Reporting completeness in this domain was assessed using QUADAS-2 criteria.
Overall quality trends
Overall quality trends were summarized by aggregating QUADAS-2 ratings across all four domains (“patient selection”, “index test”, “reference standard”, and “flow and timing”). Each study received a domain-level rating of low, high, or unclear risk of bias.
The majority of studies were retrospective, lacked blinding, and relied on single-center datasets.
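To illustrate how domain-level ratings were aggregated, the minimal Python sketch below tallies QUADAS-2 ratings across the four domains; the study names and ratings are hypothetical and are shown only to demonstrate the aggregation, not to reproduce our actual extraction records.

```python
from collections import Counter

# Hypothetical domain-level QUADAS-2 ratings for three invented studies;
# the actual ratings in this review were recorded during data extraction.
ratings = {
    "Study A": {"patient selection": "high", "index test": "low",
                "reference standard": "unclear", "flow and timing": "unclear"},
    "Study B": {"patient selection": "low", "index test": "low",
                "reference standard": "low", "flow and timing": "high"},
    "Study C": {"patient selection": "high", "index test": "unclear",
                "reference standard": "low", "flow and timing": "unclear"},
}

# Tally low/high/unclear ratings per domain to summarize overall quality trends.
for domain in ["patient selection", "index test", "reference standard", "flow and timing"]:
    counts = Counter(study[domain] for study in ratings.values())
    print(f"{domain}: {dict(counts)}")
```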
Recommendations
As part of the methodological framework, QUADAS-2 outcomes informed general considerations for evaluating the quality and reproducibility of the included studies. Recommendations for future research were synthesized separately.
Results
The characteristics, dataset composition, diagnostic performance metrics, and reported clinical or workflow outcomes of the included AI models applied to breast ultrasound are summarized in Table 2.
Table 2
| Study | Model type | Training data | Validation data | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC | Clinical outcome |
|---|---|---|---|---|---|---|---|---|
| Chang et al. (4) | DeepResUnet | WBUS from 117 women | Split into training/testing (numbers not reported) | 97.00 | – | – | – | Streamlined workflow; reduced reading time |
| Niu et al. (5) | Harbin HIT intelligent diagnosis system | 206 lesions (BI-RADS 4A) | Single dataset (no separate validation set) | – | – | – | – | Improved standardization; reduced consultation time |
| Qian et al. (6) | ResNet/VGG (4 DL models) | 360 patients with DCIS | Single dataset (no separate validation set) | 74.20 | 73.30 | 75.00 | 0.72–0.80 | Aided surgical planning; fewer additional procedures |
| Zhang et al. (7) | Deep CNN (ResNet50) | 1,311 lesions | Single dataset (no separate validation set) | 77.00 | 82.00 | 71.70 | 0.85 | Automated BI-RADS classification |
| | DL | 1,311 lesions | Test set 202 lesions | 77.00 | 82.00 | 71.70 | 0.85 | Improved 4A risk stratification; higher PPV vs. radiologists |
| Li et al. (8) | DenseNet | 599 images from 91 patients | Single dataset (no separate validation set) | – | – | – | 0.72–0.83 | Enhanced feature recognition; workflow optimization |
| Hayashida et al. (9) | CNN (Fixstars) | 4,028 images (5,014 lesions) | 3,166 images (3,656 lesions) | – | 100.00 | 90.90 | – | Improved accuracy; reduced variability; support for juniors |
| Zhou et al. (10) | DL automatic classification | 707 patients (475 malignant, 232 non‑lesion) | Single dataset (no separate validation set) | 83.00 | 87.20 | 85.33 | 0.93 | Improved consistency; sonographer training support |
| Mori et al. (11) | TensorFlow/Keras DL model (mobile) | 115 benign + 201 malignant images | Single dataset (no separate validation set) | – | 100.00 | 75.00 | – | Smartphone-based rapid diagnosis; no IT integration |
| Lee et al. (12) | Mask R‑CNN | 153 women (breast tumor US) | Single dataset (no separate validation set) | 81.05 | 81.36 | 80.85 | 0.81 | – |
| Lee et al. (13) | SERA + VGG16 (fine‑tuned) | 1,001 patients, 1,042 lesions (665 benign, 377 malignant) | Single dataset (no separate validation set) | – | 73.70 | 81.10 | 0.84 | Improved detection vs. baseline; higher AUC than radiologists |
| Qureshi et al. (14) | VGG19 (TensorFlow) + Mask R-CNN (PyTorch) | BUSI 780 images; Dataset B 163 images; KAIMRC 5,693 images | Multiple datasets; split not uniformly reported | 87.80 | 87.40 | – | 0.90 | Preprocessing (denoise/ROI/RGB fusion) improved precision & FN |
| Lyu et al. (15) | SONIC automatic detection system | 100 cases (54 benign, 46 malignant) across 12 residents (tests) | Two test sessions; no separate hold‑out set | – | – | – | – | Improved trainee accuracy & confidence; reduced teaching burden |
| Wang et al. (16) | ResNet v2 + ASN (34/50/101) | 769 tumors (train/test 600/169) | Train/test split reported (600/169) | 78.11 | 85.00 | 76.81 | – | Enhanced ABUS accuracy & efficiency |
| Zhang et al. (17) | VGG16/VGG19/ResNet50/InceptionV3 | 5,000 images (2,500 benign/2,500 malignant) | Test 1,007 images; comparison 683 images | – | – | – | 0.85–0.91 | Outperformed sonographers; automated lesion classification |
| Homayoun et al. (18) | XGBoost/RF/SVM | 1,259 lesions (3 centers, 3 countries) | Single dataset (no separate validation set) | 88.40 | 90.30 | 86.70 | 0.89 | Consistent multi‑site performance; early triage support |
| Liao et al. (19) | EDL‑BC | 7,955 lesions from 6,795 patients | Internal + 2 external cohorts | – | 80.00–100.00 | – | 0.91–0.96 | Higher AUC vs. radiologists; faster triage (Doppler + B‑mode) |
| Kaplan et al. (20) | PTDFG (transfer learning) | Case 1: 1,038; case 2: 996; case 3: 203 images | Three case schemes; split not uniformly reported | 79.29–88.67 | – | – | – | Enabled accurate BI‑RADS classification; biopsy support |
| Eun et al. (21) | 17 CNNs (AlexNet → InceptionResNetV2) | 516 lesions (364 benign, 152 malignant) | Train 410/test 106 | – | – | – | 0.96 | Reduced BI‑RADS 4A false positives; streamlined diagnosis |
| Du et al. (22) | Efficient‑Det | 1,181 images (487 patients) + 694 public images | Single dataset (no separate validation set) | 92.60 | – | – | – | 0.06 s/image; reduced workload |
| Marini et al. (23) | Samsung S‑Detect for Breast | 115 masses (curated) | Agreement vs. standard of care & VSI; no separate set | – | – | – | – | Automated diagnosis by non‑experts; accessibility gains |
| Wanderley et al. (24) | Koios DS Breast software | 555 masses (biopsy center) | Single dataset (no separate validation set) | – | 98.20 | 39.00 | – | Comparable to radiologist; supports juniors |
| Pan et al. (25) | LR/SVM/DT/NB/KNN | 600 female patients (Baheya Hospital) | Single dataset (no separate validation set) | – | – | – | 0.73–0.99 | Morphology-based ML with high recall; reproducible |
| Kim et al. (26) | VGG16/ResNet34/GoogLeNet | 1,400 images (971 patients; 2 institutions) | Single dataset (no separate validation set) | 96.00–100.00 | – | – | 0.89–0.96 | Weakly‑supervised diagnosis & localization; less prep time |
| Yang et al. (27) | Attention U‑net + D‑CNN | 2,057 images (1,131 patients); test 100 images | Training/validation reported; separate 100‑image test | 97.00 | – | – | – | Two‑stage system improved speed & consistency |
| Hossain et al. (28) | VGG16 (transfer learning) | BUSI + Mendeley datasets | Blind test set used; sizes not specified | 91.00 | – | – | – | Grad‑CAM interpretability; faster workflow |
| Fu et al. (29) | DL radiomics fusion model | 2,585 images (497 patients) | Single dataset (no separate validation set) | – | – | – | 0.90 | Better ALN response prediction; ↑ sonographer accuracy |
| Laghmati et al. (30) | Attention3/4 U‑Net (segmentation) | 891 benign images (473 patients); 421 malignant (210 patients); 266 normal (133 patients) | Train/validation split; details not fully reported | 94.80+ | – | – | – | Streamlined lesion localization with limited data |
| Chen et al. (31) | FEBrNet (video frame selection) | Training 387 cases; testing 587 cases (974 patients total) | Independent testing set (587 cases) | – | – | – | 0.91 | Video‑frame selection improved diagnostic accuracy |
| Yao et al. (32) | Xception + ABUS‑CAD | – | – | – | – | – | F1 =0.85 | CAD + ABUS improved efficiency; aided less‑experienced examiners |
| Kikuchi et al. (33) | CNN | 271 images (70 normal, 97 benign, 104 malignant) | Single dataset (no separate validation set) | – | 91.80/86.20 | 91.40/88.40 | – | Real‑time quality standardization; lesion detection |
| Fleury et al. (34) | Dedicated US software (reader + AI) | 207 masses (143 benign, 64 malignant) | Single dataset (no separate validation set) | – | – | – | 0.83 | Reader + AI combo improved accuracy on ambiguous lesions |
| Shen et al. (35) | DL model | 5,442,907 images in 288,767 exams (143,203 patients) | Internal holdout across ages/densities/devices | – | – | – | 0.98 | ↓ False positives 37.4%; AUC 0.962–0.975 |
| Wang et al. (36) | AI stratification system | 173 suspicious lesions (prospective) | Prospective before biopsy/excision | – | – | – | – | ↓ Biopsy rate in BI‑RADS 4A from 100% to 67.4% |
| Mansour et al. (37) | Lunit INSIGHT MMG (CNNs) | 1,180 lesions (538 benign, 642 malignant) | Single dataset (no separate validation set) | – | 96.80 | 90.10 | – | Reduced unnecessary biopsies; confident triage |
| Dai et al. (38) | YOLOv4 + CenterNet | 6,860 images (2,065 benign, 3,495 malignant, 1,300 none) | Single dataset (no separate validation set) | – | 57.73–62.64 | 90.08–92.54 | – | Reduced radiologist task load by 76.45% via triage |
| Ge et al. (39) | CNN + Transformer (reporting) | 4,809 tumor instances | Single dataset (no separate validation set) | – | – | – | – | Structured reports; up to 90% workload reduction |
| Qian et al. (40) | DL system (bimodal/multimodal) | 10,815 images (721 lesions) | Prospective test: 912 images (152 lesions) | – | – | – | 0.92–0.96 | AUC 0.955; heatmaps for explainability; BI‑RADS integration |
| Zhang et al. (41) | Optimized DL model | 2,822 images (train), 707 (test), 210 (external) | Internal test + external test | 89.70 | 91.30 | 86.90 | 0.90–0.96 | ↓ Unnecessary biopsy 67.86% in BI‑RADS 4A |
| Zhang et al. (42) | Back‑propagation NN | 90 patients with ALN metastasis | Single dataset (no separate validation set) | 97.65 (best) | – | – | > Manual segmentation | Faster, precise ALN diagnosis vs. manual |
| Xia et al. (43) | S‑Detect AI system | 40 patients (US images) | Single dataset (no separate validation set) | 89.60 | 95.80 | 93.80 | 0.95 | ↑ BI‑RADS precision, esp. for juniors |
| Madan et al. (3) | Deep CNN + RF (radiomics) | 780 images (437 benign, 210 malignant, 133 normal) | LOOCV; no separate external validation | 78.50 | – | – | – | Improved screening workflow; ↓ workload |
| Paley et al. (44) | S‑DetectTM (Samsung RS80A) | 157 patients (7 excluded) | Single dataset (no separate validation set) | – | – | – | 0.86–0.98 | Automated lesion categorization; streamlined diagnosis |
| Li et al. (45) | BUSnet (DL) | 780 samples (133 normal, 487 benign, 210 malignant) | Single dataset (no separate validation set) | – | – | – | – | Accelerated lesion classification |
| Ragab et al. (46) | EDLCDS‑BCDC (ensemble DL CDSS) | 780 images (BUSI composition) | Single dataset (no separate validation set) | 97.09 | – | – | – | 3D ABUS time reduction for interpretation |
| Bunnell et al. (47) | Detectron2 (object detection) | 4,623 US images (444 women) | Testing set defined; details as reported | – | 90.00 | – | 0.87–0.90 | Enabled non‑radiologist initial assessment |
| Kummanee et al. (48) | 3D ABUS (detection/seg/cls) | 110 patients (development) | Forward testing: 50 patients | – | 83.34 | 60.00 | – | Optimized clinical workflow; identified all biopsy‑proven malignancies |
| Luijten et al. (49) | AIBUS diagnostic system | 344 participants (HHUS + AIBUS) | Comparative evaluation vs. HHUS; no separate set | – | – | – | – | ↑ Nodule detection vs. HHUS; ↑ BI‑RADS agreement (κ=0.497) |
| Vong et al. (50) | Deep CNNs (Lunit INSIGHT MMG v1.1.4.0; ResNet‑34) | 238 patients (253 US‑detected cancers) | Standalone AI assessment; no separate validation | – | 26.10 (detected)/73.90 (missed) | – | – | Detected 42 radiologist‑challenging cancers (16.6%) |
| Ye et al. (51) | DL system | 3,166 images (3,656 lesions) | Single dataset (no separate validation set) | – | 90.00 | 88.50 | 0.95 | Improved clinician accuracy (69.3–73.1%); superior to clinicians (P<0.001) |
| Zhang et al. (SLN) (52) | XGBoost (best among 10 ML) | 952 patients (SLN metastasis: 394 yes/558 no) | Validation cohort reported | 84.60 | 87.00 | 86.20 | 0.92 | Identification of SLN metastasis; outperformed others |
| Wang et al. (ABUS) (53) | 3D Inception U‑net | 196 patients (661 cancer regions) | Single dataset (no separate validation set) | – | 95.10 | – | – | 0.1 s per ABUS volume; improved screening speed |
3D, three-dimensional; ABUS, automated breast ultrasound; AI, artificial intelligence; AIBUS, artificial intelligence breast ultrasound; ALN, axillary lymph node; ASN, automatic segmentation network; AUC, area under the curve; BI-RADS, Breast Imaging Reporting and Data System; BUSI, dataset of breast ultrasound images; CAD, computer-assisted diagnosis; CDSS, clinical decision support system; CNN, convolutional neural network; D-CNN, deep convolutional neural network; DCIS, ductal carcinoma in situ; DL, deep learning; DSC, Dice similarity coefficient; DT, decision tree; EDL-BC, ensemble deep learning for breast cancer; EDLCDS‑BCDC, ensemble deep-learning-enabled clinical decision support system for breast cancer diagnosis and classification; FN, false negative; Grad‑CAM, gradient-weighted class activation mapping; HHUS, hand-held ultrasound; HIT, health information technology; KAIMRC, King Abdullah International Medical Research Center; KNN, K-nearest neighbour; LOOCV, leave-one-out cross-validation; LR, logistic regression; ML, machine learning; NB, Naïve Bayes; NN, neural network; PICOS, Population, Intervention, Comparison, Outcomes, Study Design; PPV, positive predictive value; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses; PTDFG, pyramid triple deep feature generator; QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies 2; R-CNN, region-based convolutional neural network; RF, random forest; RGB, red, green, and blue; ROI, region of interest; SERA, sample efficient reward augmentation; SLN, sentinel lymph node; SVM, support vector machine; US, ultrasound; VSI, volume sweep imaging; WBUS, whole breast ultrasound; XGBoost, extreme gradient boosting.
Due to substantial heterogeneity in study design, datasets, imaging protocols, AI architectures, validation methods, and reported diagnostic metrics, a formal meta-analysis could not be conducted; all findings were therefore synthesized qualitatively. Across 52 studies, diagnostic performance varied widely depending on dataset size and model design, with top-performing systems achieving sensitivities up to 100% (9,11), specificities above 90% (37,38,43), and accuracies exceeding 97% (42,46). However, not all studies reported all three metrics: several omitted sensitivity or specificity values. This inconsistency reflects a limitation in reporting standards rather than in study quality. During data extraction, one study was excluded after being identified as retracted, and additional studies with missing or uninterpretable data were removed from the final analysis.
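As an illustration of this qualitative synthesis, descriptive ranges can be tabulated while tolerating unreported metrics. The sketch below uses hypothetical study rows (not our actual extraction table) to show how min-max ranges and reporting counts are derived when values are missing.

```python
import pandas as pd

# Hypothetical extraction rows with deliberately missing values, mirroring the
# inconsistent metric reporting observed across the included studies.
df = pd.DataFrame({
    "study": ["Study A", "Study B", "Study C", "Study D"],
    "sensitivity_pct": [100.0, 82.0, None, 90.3],
    "specificity_pct": [90.9, 71.7, None, 86.7],
    "accuracy_pct": [None, 77.0, 92.6, 88.4],
})

# Per-metric reporting counts and min-max ranges, ignoring unreported values --
# the descriptive form of synthesis used in place of a pooled meta-analysis.
print(df[["sensitivity_pct", "specificity_pct", "accuracy_pct"]].agg(["count", "min", "max"]))
```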
Across the 52 included studies, the collective dataset encompassed 178,494 patients and 5,474,296 ultrasound images. The overwhelming majority of studies were retrospective in design; only a small minority incorporated prospective or randomized components (15,36,40).
Overall, the data demonstrated a consistent trend of high diagnostic performance across most studies, with several reporting accuracy levels comparable to radiologist interpretation. Studies utilizing larger and more heterogeneous populations, such as those by Shen et al. (35), Liao et al. (19), and Qian et al. (40), reported particularly high and generalizable results. Collectively, the included studies demonstrated that AI systems in breast ultrasound exhibit robust diagnostic capabilities and a consistent trend toward high accuracy and reliability.
Across the studies, we found a clear trend of AI-based systems improving diagnostic performance and workflow efficiency in breast ultrasound. Convolutional neural network (CNN)-based methodologies consistently achieved high accuracy, sensitivity, and specificity across various imaging settings. For example, Shen et al. (35) reported an area under the receiver operating characteristic curve (AUROC) of 0.976 for detecting malignant lesions using a DL system trained on over five million ultrasound images, maintaining strong diagnostic accuracy across patient age groups, breast densities, and ultrasound device types. Similarly, Chang et al. (4) achieved a Dice similarity coefficient (DSC) of 0.94 for automated whole-breast segmentation, reflecting AI’s reliability in density estimation and image analysis.
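For context, the DSC reported by Chang et al. (4) quantifies segmentation overlap as twice the intersection of the predicted and ground-truth masks divided by their combined size. The sketch below is a generic NumPy illustration with toy masks, not the authors' implementation.

```python
import numpy as np

def dice_similarity(pred: np.ndarray, truth: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    overlap = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * overlap / total if total else 1.0  # both empty -> perfect match

# Toy 4x4 masks: the predicted lesion misses one ground-truth pixel.
pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
truth = np.array([[0, 1, 1, 0], [0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]])
print(round(dice_similarity(pred, truth), 3))  # 0.889
```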
AI frequently matched or exceeded traditional radiologist evaluations in distinguishing benign from malignant lesions and reducing unnecessary biopsies. Niu et al. (5) emphasized margin and calcification features that improved diagnostic discrimination (P<0.05). In contrast, Qian et al. (6) developed a model with an AUROC of 0.85, which enhanced the prediction of ductal carcinoma in situ underestimation compared to conventional methods.
Improvements in Breast Imaging Reporting and Data System (BI-RADS) categorization were also consistently observed. Hayashida et al. (9) reported sensitivity and specificity values of 91.2% and 90.7%, respectively, which significantly outperformed clinician assessments (P<0.001). Likewise, Zhang et al. (7), Li et al. (8), and Zhou et al. (10) demonstrated reductions in interobserver variability and improved diagnostic precision, particularly among less-experienced practitioners.
A summarized comparison of performance metrics (Table 2) highlights that AI models frequently achieve accuracy and reliability on par with that of expert radiologists. Overall, across diverse datasets and study designs, the findings reveal a recurrent trend of strong, generalizable diagnostic performance for AI-assisted breast ultrasound. Across the 52 studies, top-performing systems commonly achieved sensitivities approaching or reaching 100%, specificities frequently above 90%, and area under the curve (AUC) values often exceeding 0.90, indicating that, at a descriptive level, most AI tools demonstrated performance at or near clinically meaningful thresholds.
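To make these summary metrics concrete, the sketch below shows how sensitivity, specificity, and AUC are conventionally derived from a model's predicted malignancy probabilities; the labels, probabilities, and 0.5 decision threshold are hypothetical, and operating thresholds varied across the included studies.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical ground truth (1 = malignant) and model probabilities for ten lesions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_prob = np.array([0.92, 0.31, 0.85, 0.40, 0.12, 0.45, 0.78, 0.05, 0.56, 0.88])
y_pred = (y_prob >= 0.5).astype(int)  # operating threshold is study-specific

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true-positive rate
specificity = tn / (tn + fp)          # true-negative rate
auc = roc_auc_score(y_true, y_prob)   # threshold-independent discrimination
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, AUC={auc:.2f}")
# -> sensitivity=0.80, specificity=0.80, AUC=0.92
```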
Additionally, we conducted a structured risk of bias assessment, as outlined in the Methods. Most studies were judged to have a low to moderate risk of bias, with primary concerns related to retrospective, single-center designs, heterogeneous reporting of performance metrics, and limited external validation, rather than flaws in outcome measurement. These observations support cautious interpretation of the consistently high diagnostic performance and underscore the need for future prospective, multicenter studies with standardized reporting to confirm the robustness and generalizability of current findings.
Discussion
Overall performance and comparison to prior literature
The application of AI to breast ultrasound has been increasingly explored in recent years, with many studies reporting improved diagnostic accuracy and consistency compared to conventional interpretation. In our review, we observed a general trend toward enhanced performance among AI-assisted systems; however, the magnitude of improvement varied substantially across study design, dataset size, and validation method. These findings align with prior systematic reviews in related imaging domains, such as mammography and magnetic resonance imaging (MRI), which similarly report that AI can improve diagnostic reproducibility while remaining dependent on data quality and model transparency (2,54).
Rather than demonstrating universal superiority, the studies included in this review collectively highlight AI’s growing potential as a supportive diagnostic tool. Several investigations have reported performance metrics that approach or match those of experienced radiologists, especially when trained on large, multicenter datasets [e.g., Shen et al. (35) and Liao et al. (19)]. However, smaller, retrospective studies often yielded more variable results, consistent with findings from other reviews, which emphasize that limited sample diversity can overestimate algorithm accuracy (54).
The integration of AI into breast ultrasound has demonstrated significant improvements in diagnostic accuracy while reducing subjective variability and offers a promising advancement in breast cancer screening. By enhancing consistency and precision, AI has the potential to boost patient confidence and support more informed clinical decision-making. Seamlessly incorporating AI into clinical workflows can also augment radiologist performance, improve efficiency, and reduce patient anxiety associated with unnecessary biopsies, further advancing patient-centered care. In terms of healthcare systems, AI-driven ultrasound tools offer notable cost-saving advantages by minimizing unneeded procedures and optimizing resource allocation. Mobile-based AI platforms extend these benefits even further, increasing access to diagnostic services, particularly in underserved or resource-limited regions. Although several studies suggested potential economic benefits, the complete absence of formal cost-effectiveness analyses across the included studies represents a clear research gap that warrants dedicated investigation.
Clinical implications
Thirteen studies in our review explicitly reported improved clinical outcomes, such as enhanced detection rates and superior diagnostic performance compared to baseline or manual interpretations (4-7,9,11,13,15,17-21). Improved detection implies that AI can more accurately distinguish between benign and malignant lesions, which results in faster diagnoses, fewer missed cancers, and potentially a reduced need for second opinions or redundant imaging. However, improved detection alone does not always equate to improved clinical outcomes unless it leads to earlier interventions, reduces unnecessary procedures, or changes treatment decisions in a way that meaningfully improves patient health. Therefore, more randomized, prospective studies specifically designed to address these endpoints are needed to establish the actual clinical value of AI.
In contrast, 26 studies did not explicitly report improved clinical outcomes but did highlight meaningful contributions in areas such as diagnostic support. Many studies commonly described AI-enhanced detection tools that aid radiologists in lesion classification, image interpretation, and workflow optimization (8,10,12,16,29-35,37,38,41-51,53). While these tools may not yet show measurable impacts on patient outcomes, their role in enhancing clinical confidence, standardizing diagnostic criteria, and potentially reducing interpretation time marks a significant step toward broader AI integration in routine care. Further research is needed to assess how these diagnostic support tools translate into long-term clinical benefit.
Several studies in this review also demonstrate the capacity of AI to enhance diagnostic accuracy and consistency among less-experienced practitioners. Hayashida et al. (9) and Zhou et al. (10) reported that AI systems reduced inter-observer variability and improved interpretive precision, particularly benefiting clinicians with limited diagnostic experience. Lyu et al. (15) conducted a randomized controlled study showing that AI-assisted training significantly improved both diagnostic accuracy and practitioner confidence among medical residents. Mori et al. (11) and Ge et al. (39) described mobile-based and automated AI platforms that facilitate rapid diagnosis and preliminary reporting, thereby lessening dependence on specialist oversight and improving accessibility in diverse clinical settings. Likewise, Wanderley et al. (24) found that AI-assisted evaluations achieved diagnostic performance comparable to that of experienced radiologists, emphasizing their value as an adjunct for junior clinicians in ultrasound interpretation (9-11,15,24,39).
Only one study, conducted by Ge et al. (39), specifically reported a direct impact on workflow efficiency. The study demonstrated that AI-assisted preliminary report generation in breast ultrasound significantly enhanced the clinical workflow. By automating the initial documentation process, particularly for benign and normal cases, the AI system reduced the need for repetitive manual reporting, resulting in up to a 90% reduction in radiologist workload. This allowed clinicians to reallocate their time toward patient-centered tasks such as counseling, diagnostic clarification, and shared decision-making, thereby streamlining operations and potentially improving overall care delivery (39).
Research recommendations
Given these clinical implications, several research needs emerge that are essential for advancing AI-supported breast ultrasound. Notably, none of the 52 studies explicitly reported actual cost savings as a measured outcome. While many suggested potential economic benefits, such as reductions in unnecessary procedures or improved resource utilization, none provided direct evidence demonstrating reduced healthcare costs or quantifiable financial impact. This lack of cost-effectiveness evaluation highlights a critical research need, underscoring the importance of future studies that formally assess economic outcomes alongside diagnostic performance. Future research should incorporate rigorous economic analyses to determine whether the implementation of AI in breast ultrasound translates to meaningful financial savings for healthcare systems.
Future AI research should prioritize prospective, multicenter study designs, follow transparent reporting standards such as Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis-Artificial Intelligence Extension (TRIPOD-AI), and incorporate validated risk-of-bias tools [e.g., QUADAS-2 or Prediction Model Risk of Bias Assessment Tool (PROBAST)] during study design. Strengthening external validation practices and establishing standardized imaging and annotation protocols will further enhance the reproducibility and clinical applicability of the results.
To validate the clinical and economic value of AI-assisted ultrasound in breast cancer detection, future studies must go beyond diagnostic accuracy and incorporate formal cost-effectiveness analyses. This includes assessing whether AI reduces unnecessary follow-up imaging and biopsies of benign lesions, as well as quantifying potential time savings for clinicians. Crucially, these economic metrics should be evaluated in tandem with patient-centered outcomes to determine whether cost reductions align with improved quality of care. Long-term, real-world studies are essential to assess whether the integration of AI into routine practice is not only financially sustainable but also clinically advantageous.
Despite these promising developments, several challenges remain. Although current AI methodologies have shown robust performance across diverse populations, indicating their readiness for broader clinical adoption, prospective, multicenter validation studies are still necessary to ensure generalizability. Looking ahead, the transformative potential of AI lies in its ability to integrate with other imaging modalities and clinical data, paving the way for personalized diagnostics and precision medicine. However, persistent issues such as dataset heterogeneity, retrospective study limitations, and limited algorithm transparency must be addressed. Future research should focus on prospective validation, development of standardized training datasets, and clearer reporting guidelines to better support the safe and effective implementation of AI in clinical practice.
Critical appraisal of included studies
A key limitation of the existing body of evidence is the predominance of retrospective, single-center studies. More than half of the included studies (approximately 60%) analyzed relatively small datasets (fewer than 500 patients), which may limit the generalizability of findings. For example, models reported by Mori et al. (11) and Pan et al. (25) demonstrated high accuracy but were trained and tested on limited sample sizes, raising concerns about overfitting and reproducibility. In contrast, Shen et al. (35) and Liao et al. (19), who utilized datasets comprising more than 5,000 patients, achieved more robust and generalizable performance (AUROC of 0.976 and 0.950, respectively), underscoring the importance of data volume and diversity in AI development.
Most studies demonstrated a moderate risk of bias in the “patient selection” domain, primarily due to retrospective convenience sampling rather than prospective or randomized recruitment. Many studies also relied on narrowly defined inclusion criteria, such as restricting inclusion to BI-RADS 4A lesions or histologically confirmed cancers, which may introduce spectrum bias and lead to overestimation of diagnostic accuracy. A minority of studies employed non-clinical acquisition protocols (e.g., smartphone-based ultrasound or automated volume sweep imaging without radiologist oversight), which contributed to higher bias in this domain.
Risk of bias for the “index test” was generally low to moderate. While most studies adequately described AI architecture (e.g., CNNs, VGG16, and ResNet) and preprocessing steps, blinding to reference standard outcomes during model development or testing was rarely documented. Additionally, thresholds for classification and reproducibility criteria were inconsistently reported, introducing further methodological uncertainty.
Approximately half of the studies showed a moderate risk of bias in the “reference standard” domain, primarily due to incomplete reporting of whether radiologists were blinded to AI outputs and unclear definition or verification of diagnostic benchmarks. Although many appropriately used histopathology as the gold standard, reliance on radiologist consensus without specified blinding procedures may introduce review bias.
External validation was underreported, with fewer than 20% of studies testing their models across multiple institutions or imaging devices. This lack of cross-center validation raises questions about the real-world performance of AI systems, particularly in resource-limited settings with varying ultrasound hardware and operator expertise. Moreover, only a subset of studies conducted direct head-to-head comparisons with radiologists, and the results were mixed: while some AI systems outperformed or matched experienced radiologists [e.g., Qian et al. (40) and Hayashida et al. (9)], others demonstrated marginal improvements or lacked statistically significant superiority.
“Flow and timing” information was frequently incomplete or inconsistently reported. Several studies did not specify time intervals between the index test and reference standard confirmation, and others did not clarify attrition or exclusions due to missing pathology. These reporting gaps raise concerns about verification and attrition bias.
Robust, well-designed studies, such as those by Liao et al. (19), Shen et al. (35), and Qian et al. (40), tended to show low risk of bias across multiple domains due to prospective or multicenter designs and adherence to standardized reference criteria. Conversely, higher-risk studies, including those by Mori et al. (11) and Bunnell et al. (47), exhibited limitations such as a lack of clinical oversight, unconventional acquisition tools, or incomplete reporting of test performance metrics. The majority of included studies ultimately fell into the moderate risk category, reflecting common methodological issues such as retrospective design and lack of blinding.
Algorithm transparency and interpretability also remain largely unexplored. Few studies have employed explainable AI methods, such as saliency maps or gradient-weighted class activation mapping (Grad-CAM), to highlight which imaging features contributed to the decision-making process. The absence of explainability poses challenges for clinician trust, particularly when AI outputs contradict expert interpretation. Similarly, while many studies focused on conventional metrics (accuracy, AUC, sensitivity, and specificity), fewer addressed clinically meaningful endpoints, such as reductions in false-positive biopsies, improvements in time-to-diagnosis, or changes in patient outcomes.
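As a rough illustration of how Grad-CAM highlights the image regions driving a CNN prediction, the following minimal PyTorch sketch computes a class activation heatmap on a generic ResNet-18, with a random tensor standing in for a preprocessed ultrasound frame; the architecture, target layer, and class index are assumptions, not a reproduction of any included study.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None)  # stand-in; a study would load a fine-tuned model
model.eval()

# Capture the activations and gradients of the last convolutional block.
activations, gradients = {}, {}
model.layer4[-1].register_forward_hook(
    lambda _m, _i, out: activations.update(value=out))
model.layer4[-1].register_full_backward_hook(
    lambda _m, _gi, go: gradients.update(value=go[0]))

x = torch.randn(1, 3, 224, 224)     # stand-in for a preprocessed ultrasound frame
score = model(x)[0, 1]              # logit of a hypothetical "malignant" class
model.zero_grad()
score.backward()

# Weight each feature map by its spatially averaged gradient, ReLU, upsample.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```

In practice, the normalized heatmap would be overlaid on the original frame so that a reader can verify whether the model attended to the lesion rather than to acquisition artifacts.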
Synthesis of diagnostic performance
Despite methodological limitations, a clear trend emerged: AI-enhanced ultrasound improved the accuracy of BI-RADS classification and reduced interobserver variability. For example, Hayashida et al. (9) reported that AI-assisted BI-RADS classification achieved superior sensitivity and specificity compared to human readers (P<0.001). At the same time, Wanderley et al. (24) found that AI-based risk prediction tools matched radiologist performance for lesion malignancy stratification. Additionally, models incorporating advanced preprocessing steps (e.g., region of interest highlighting, noise reduction, or image fusion) demonstrated improved accuracy compared to baseline CNN architectures, as shown by Qureshi et al. (14) and Fu et al. (29).
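As an illustration of the kind of preprocessing pipeline referenced above, the hypothetical sketch below chains ROI cropping, denoising, and contrast equalization, then fuses the three views into a pseudo-RGB input for a pretrained CNN; the function name, parameters, and ordering are assumptions and may differ from the pipelines of Qureshi et al. (14) and Fu et al. (29).

```python
import cv2
import numpy as np

def preprocess_us_frame(gray: np.ndarray, roi: tuple) -> np.ndarray:
    """Hypothetical preprocessing: ROI crop, denoising, contrast equalization,
    and fusion of the three views into a 3-channel input for an RGB CNN."""
    x0, y0, width, height = roi
    crop = gray[y0:y0 + height, x0:x0 + width]           # region-of-interest crop
    denoised = cv2.fastNlMeansDenoising(crop, None, 10)  # speckle suppression
    equalized = cv2.equalizeHist(denoised)               # contrast enhancement
    return cv2.merge([crop, denoised, equalized])        # pseudo-RGB fusion

# Synthetic stand-in for a grayscale ultrasound frame.
frame = (np.random.rand(480, 640) * 255).astype(np.uint8)
rgb_input = preprocess_us_frame(frame, roi=(200, 150, 224, 224))
print(rgb_input.shape)  # (224, 224, 3)
```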
The performance of AI also varied depending on the study design and data source. Models trained on automated breast ultrasound (ABUS) or multimodal imaging datasets generally performed better due to richer spatial information and reduced operator variability. For instance, Qian et al. (40) demonstrated that a multimodal DL system achieved an AUC of 0.955, outperforming traditional single-modality approaches. This highlights the potential benefit of integrating AI with complementary imaging modalities and structured clinical data.
Clinical utility and workflow impact
Only 13 studies explicitly evaluated clinical outcomes, including reductions in unnecessary biopsies and improvements in diagnostic efficiency. Wang et al. (36) reported a 33% reduction in biopsy rates for BI-RADS 4A lesions when using AI-based risk stratification. In comparison, Ge et al. (39) demonstrated a reduction of up to 90% in radiologist workload through the use of automated report generation. However, these studies remain preliminary, and none included long-term follow-up or cost-effectiveness analyses. The lack of robust, prospective trials limits our ability to determine whether AI integration leads to tangible improvements in patient care, beyond diagnostic accuracy.
Mobile and smartphone-based AI platforms [e.g., Mori et al. (11)] represent an emerging trend, providing accessible diagnostic tools in low-resource settings. While these approaches have the potential to democratize breast cancer screening, their clinical reliability remains largely untested in real-world conditions. Furthermore, issues such as data privacy, regulatory approval, and ethical considerations surrounding the use of AI in remote or unsupervised environments require careful evaluation.
Economic and implementation considerations
None of the included studies directly assessed cost savings, despite frequent claims of improved efficiency. While reductions in unnecessary biopsies and radiologist workload suggest potential economic benefits, formal cost-effectiveness analyses are essential to justify investment in AI systems. Future studies should investigate whether AI-assisted ultrasound reduces overall healthcare expenditures while maintaining or improving the quality of care.
Limitations of this review
This review is limited by the heterogeneity of the included studies, both in terms of AI architecture and outcome reporting. Additionally, because blinding was not implemented during the study selection process, a degree of selection bias may exist; however, this was mitigated through independent review by multiple student authors and consensus verification by the three senior attending authors. Variability in imaging protocols, dataset size, and validation techniques makes direct performance comparisons challenging. Furthermore, the exclusion of non-English and non-indexed studies may introduce publication bias, and our qualitative synthesis approach is inherently subjective, despite adhering to PRISMA guidelines.
Conclusions
This narrative review demonstrates that AI has substantial potential to enhance the accuracy, consistency, and efficiency of breast ultrasound diagnosis. Across the 52 included studies, most AI models achieved high sensitivity, specificity, and AUC values, often approaching or matching the performance of expert radiologists. However, improved diagnostic accuracy alone does not necessarily translate into better clinical outcomes.
Only 13 of the 52 studies directly evaluated patient-centered or workflow-related outcomes, such as reductions in unnecessary biopsies, improved triage, or decreased radiologist workload. While these early findings are promising, they represent preliminary evidence rather than definitive proof of clinical benefit. The remaining studies focused solely on diagnostic metrics without demonstrating how AI affected downstream patient care, decision making, or health outcomes.
Accordingly, the current evidence supports AI as a valuable diagnostic support tool; however, it does not yet establish that the integration of AI meaningfully improves patient outcomes at the population level. To bridge this gap, future research must prioritize prospective multicenter clinical studies that measure patient-centered endpoints alongside diagnostic performance. Formal cost-effectiveness analysis and explainable AI methods are also necessary to ensure safe, transparent, and equitable adoption.
By addressing these limitations, AI-enhanced breast ultrasound may ultimately evolve from a high-performing diagnostic technology into a clinically transformative tool that improves real-world care, reduces unnecessary procedures, and supports early cancer detection.
Acknowledgments
ChatGPT (OpenAI; version 5.0) was used exclusively for language editing, including improvement of syntax, grammar, and diction. The authors take full responsibility for all scientific content and conclusions.
Footnote
Reporting Checklist: The authors have completed the Narrative Review reporting checklist. Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-177/rc
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-177/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-177/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- US Preventive Services Task Force. Screening for Breast Cancer: US Preventive Services Task Force Recommendation Statement. JAMA 2024;331:1918-30. [Crossref] [PubMed]
- Branco PESC, Franco AHS, de Oliveira AP, et al. Artificial intelligence in mammography: a systematic review of the external validation. Rev Bras Ginecol Obstet 2024;46:e-rbgo71.
- Madan K, Madan NK, Mittal S. Transcutaneous Mediastinal Ultrasonography (TMUS) Vs Endosonography (EBUS-TBNA and EUS-B-FNA) for Evaluating Intrathoracic Lymphadenopathy in Children. J Ultrasound Med 2022;41:1301-2. [Crossref] [PubMed]
- Chang JF, Huang CS, Chang RF. Automated whole breast segmentation for hand-held ultrasound with position information: Application to breast density estimation. Comput Methods Programs Biomed 2020;197:105727. [Crossref] [PubMed]
- Niu S, Huang J, Li J, et al. Application of ultrasound artificial intelligence in the differential diagnosis between benign and malignant breast lesions of BI-RADS 4A. BMC Cancer 2020;20:959. [Crossref] [PubMed]
- Qian L, Lv Z, Zhang K, et al. Application of deep learning to predict underestimation in ductal carcinoma in situ of the breast with ultrasound. Ann Transl Med 2021;9:295. [Crossref] [PubMed]
- Zhang N, Li XT, Ma L, et al. Application of deep learning to establish a diagnostic model of breast lesions using two-dimensional grayscale ultrasound imaging. Clin Imaging 2021;79:56-63. [Crossref] [PubMed]
- Li C, Li J, Tan T, et al. Application of ultrasonic dual-mode artificially intelligent architecture in assisting radiologists with different diagnostic levels on breast masses classification. Diagn Interv Radiol 2021;27:315-22. [Crossref] [PubMed]
- Hayashida T, Odani E, Kikuchi M, et al. Establishment of a deep-learning system to diagnose BI-RADS4a or higher using breast ultrasound for clinical application. Cancer Sci 2022;113:3528-34. [Crossref] [PubMed]
- Zhou Y, Feng BJ, Yue WW, et al. Differentiating non-lactating mastitis and malignant breast tumors by deep-learning based AI automatic classification system: A preliminary study. Front Oncol 2022;12:997306. [Crossref] [PubMed]
- Mori R, Okawa M, Tokumaru Y, et al. Application of an artificial intelligence-based system in the diagnosis of breast ultrasound images obtained using a smartphone. World J Surg Oncol 2024;22:2. [Crossref] [PubMed]
- Lee YW, Huang CS, Shih CC, et al. Axillary lymph node metastasis status prediction of early-stage breast cancer using convolutional neural networks. Comput Biol Med 2021;130:104206. [Crossref] [PubMed]
- Lee SE, Lee E, Kim EK, et al. Application of Artificial Intelligence Computer-Assisted Diagnosis Originally Developed for Thyroid Nodules to Breast Lesions on Ultrasound. J Digit Imaging 2022;35:1699-707. [Crossref] [PubMed]
- Qureshi M, Ahmad N, Ullah S, et al. Forecasting real exchange rate (REER) using artificial intelligence and time series models. Heliyon 2023;9:e16335. [Crossref] [PubMed]
- Lyu S, Zhang M, Zhang B, et al. The application of computer-aided diagnosis in Breast Imaging Reporting and Data System ultrasound training for residents-a randomized controlled study. Transl Cancer Res 2024;13:1969-79. [Crossref] [PubMed]
- Wang Q, Chen H, Luo G, et al. Performance of novel deep learning network with the incorporation of the automatic segmentation network for diagnosis of breast cancer in automated breast ultrasound. Eur Radiol 2022;32:7163-72. [Crossref] [PubMed]
- Zhang H, Han L, Chen K, et al. Diagnostic Efficiency of the Breast Ultrasound Computer-Aided Prediction Model Based on Convolutional Neural Network in Breast Cancer. J Digit Imaging 2020;33:1218-23. [Crossref] [PubMed]
- Homayoun H, Chan WY, Kuzan TY, et al. Applications of machine-learning algorithms for prediction of benign and malignant breast lesions using ultrasound radiomics signatures: A multi-center study. Biocybernetics and Biomedical Engineering 2022;42:921-33.
- Liao J, Gui Y, Li Z, et al. Artificial intelligence-assisted ultrasound image analysis to discriminate early breast cancer in Chinese population: a retrospective, multicentre, cohort study. EClinicalMedicine 2023;60:102001. [Crossref] [PubMed]
- Kaplan E, Chan WY, Dogan S, et al. Automated BI-RADS classification of lesions using pyramid triple deep feature generator technique on breast ultrasound images. Med Eng Phys 2022;108:103895. [Crossref] [PubMed]
- Eun NL, Lee E, Park AY, et al. Artificial intelligence for ultrasound microflow imaging in breast cancer diagnosis. Ultraschall Med 2024;45:412-7. [Crossref] [PubMed]
- Du R, Chen Y, Li T, et al. Discrimination of Breast Cancer Based on Ultrasound Images and Convolutional Neural Network. J Oncol 2022;2022:7733583. [Crossref] [PubMed]
- Marini TJ, Castaneda B, Parker K, et al. No sonographer, no radiologist: Assessing accuracy of artificial intelligence on breast ultrasound volume sweep imaging scans. PLOS Digit Health 2022;1:e0000148. [Crossref] [PubMed]
- Wanderley MC, Soares CMA, Morais MMM, et al. Application of artificial intelligence in predicting malignancy risk in breast masses on ultrasound. Radiol Bras 2023;56:229-34. [Crossref] [PubMed]
- Pan H, Shi C, Zhang Y, et al. Artificial intelligence-based classification of breast nodules: a quantitative morphological analysis of ultrasound images. Quant Imaging Med Surg 2024;14:3381-92. [Crossref] [PubMed]
- Kim J, Kim HJ, Kim C, et al. Weakly-supervised deep learning for ultrasound diagnosis of breast cancer. Sci Rep 2021;11:24382. [Crossref] [PubMed]
- Yang L, Zhang B, Ren F, et al. Rapid Segmentation and Diagnosis of Breast Tumor Ultrasound Images at the Sonographer Level Using Deep Learning. Bioengineering (Basel) 2023;10:1220. [Crossref] [PubMed]
- Hossain ABMA, Nisha JK, Johora F. Breast cancer classification from ultrasound images using VGG16 model based transfer learning. Int J Image Graph Signal Process 2023;15:12.
- Fu Y, Lei YT, Huang YH, et al. Longitudinal ultrasound-based AI model predicts axillary lymph node response to neoadjuvant chemotherapy in breast cancer: a multicenter study. Eur Radiol 2024;34:7080-9. [Crossref] [PubMed]
- Laghmati S, Hicham K, Cherradi B, et al. Segmentation of breast cancer on ultrasound images using Attention U-Net Model. Int J Adv Comput Sci Appl 2023;14:770-8.
- Chen J, Jiang Y, Yang K, et al. Feasibility of using AI to auto-catch responsible frames in ultrasound screening for breast cancer diagnosis. iScience 2023;26:105692. [Crossref] [PubMed]
- Yao J, Zou Y, Du S, et al. Progress in the application of artificial intelligence in ultrasound diagnosis of breast cancer. Frontiers in Computing and Intelligent Systems 2023;6:56-9.
- Kikuchi M, Hayashida T, Watanuki R, et al. Abstract P1-02-09: diagnostic system of breast ultrasound images using Convolutional Neural Network. Cancer Res 2020;80:P1-02-09.
- Fleury EFC, Marcomini K. Impact of radiomics on the breast ultrasound radiologist's clinical practice: From lumpologist to data wrangler. Eur J Radiol 2020;131:109197. [Crossref] [PubMed]
- Shen Y, Shamout FE, Oliver JR, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nat Commun 2021;12:5645. [Crossref] [PubMed]
- Wang XY, Cui LG, Feng J, et al. Artificial intelligence for breast ultrasound: An adjunct tool to reduce excessive lesion biopsy. Eur J Radiol 2021;138:109624. [Crossref] [PubMed]
- Mansour S, Kamal R, Hashem L, et al. Can artificial intelligence replace ultrasound as a complementary tool to mammogram for the diagnosis of the breast cancer? Br J Radiol 2021;94:20210820. [Crossref] [PubMed]
- Dai J, Lei S, Dong L, et al. More reliable AI solution: Breast ultrasound diagnosis using multi-AI combination. arXiv:2101.02639 [Preprint]. 2021. Available online: https://arxiv.org/abs/2101.02639
- Ge S, Ye Q, Xie W, et al. AI-assisted Method for Efficiently Generating Breast Ultrasound Screening Reports. Curr Med Imaging 2023;19:149-57. [Crossref] [PubMed]
- Qian X, Pei J, Zheng H, et al. Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning. Nat Biomed Eng 2021;5:522-32. [Crossref] [PubMed]
- Zhang X, Li H, Wang C, et al. Evaluating the Accuracy of Breast Cancer and Molecular Subtype Diagnosis by Ultrasound Image Deep Learning Model. Front Oncol 2021;11:623506. [Crossref] [PubMed]
- Zhang L, Jia Z, Leng X, et al. Artificial Intelligence Algorithm-Based Ultrasound Image Segmentation Technology in the Diagnosis of Breast Cancer Axillary Lymph Node Metastasis. J Healthc Eng 2021;2021:8830260. [Crossref] [PubMed]
- Xia Q, Cheng Y, Hu J, et al. Differential diagnosis of breast cancer assisted by S-Detect artificial intelligence system. Math Biosci Eng 2021;18:3680-9. [Crossref] [PubMed]
- Paley CT, Knight AE, Jin FQ, et al. Repeatability of Rotational 3-D Shear Wave Elasticity Imaging Measurements in Skeletal Muscle. Ultrasound Med Biol 2023;49:750-60. [Crossref] [PubMed]
- Li Y, Gu H, Wang H, et al. BUSnet: A Deep Learning Model of Breast Tumor Lesion Detection for Ultrasound Images. Front Oncol 2022;12:848271. [Crossref] [PubMed]
- Ragab M, Albukhari A, Alyami J, et al. Ensemble Deep-Learning-Enabled Clinical Decision Support System for Breast Cancer Diagnosis and Classification on Ultrasound Images. Biology (Basel) 2022;11:439. [Crossref] [PubMed]
- Bunnell A, Valdez D, Wolfgruber T, et al. Abstract P3-04-05: Artificial Intelligence Detects, Classifies, and Describes Lesions in Clinical Breast Ultrasound Images. Cancer Res 2023;83:P3-04-05.
- Kummanee P, Lertsatittanakron S, Thongchai P, et al. Comparative analysis of stand-alone artificial intelligence for 3D automated breast ultrasound system (ABUS) and standard clinical practice with radiologists in breast cancer screening. In: 2023 15th Biomedical Engineering International Conference (BMEiCON). IEEE; 2023.
- Luijten B, Chennakeshava N, Eldar YC, et al. Ultrasound Signal Processing: From Models to Deep Learning. Ultrasound Med Biol 2023;49:677-98. [Crossref] [PubMed]
- Vong S, Ronco AJ, Najafpour E, et al. Screening Breast MRI and the Science of Premenopausal Background Parenchymal Enhancement. J Breast Imaging 2021;3:407-15. [Crossref] [PubMed]
- Ye F, Yang Y, Liu J. Comparison of High-Frequency Contrast-Enhanced Ultrasound With Conventional High-Frequency Ultrasound in Guiding Pleural Lesion Biopsy. Ultrasound Med Biol 2022;48:1420-8. [Crossref] [PubMed]
- Zhang G, Shi Y, Yin P, et al. A machine learning model based on ultrasound image features to assess the risk of sentinel lymph node metastasis in breast cancer patients: Applications of scikit-learn and SHAP. Front Oncol 2022;12:944569. [Crossref] [PubMed]
- Wang Y, Qin C, Lin C, et al. 3D Inception U-net with Asymmetric Loss for Cancer Detection in Automated Breast Ultrasound. Med Phys 2020;47:5582-91. [Crossref] [PubMed]
- Syed AH, Khan T. Corrigendum: Evolution of research trends in artificial intelligence for breast cancer diagnosis and prognosis over the past two decades: A bibliometric analysis. Front Oncol 2022;12:1061324. [Crossref] [PubMed]
Cite this article as: Chepkunov D, Erivwo M, Sastri V, Tisor O, Yakoubov N, Babu M, Babu B. Artificial intelligence applications for breast cancer diagnosis in ultrasound: a narrative review. J Med Artif Intell 2026;9:35.

