Original Article

TL-BSLD: a transfer learning-based multi-convolutional neural network framework for burn severity level detection

Mohammed Faisal1, Murad Khan1, Muhammad Diyan2, Abdullah Alharbi3

1Department of Computer Science & Engineering, Faculty of Engineering, Kuwait College of Science and Technology, Doha, Kuwait; 2Department of Computing & Games, Teesside University, Middlesbrough, UK; 3Department of Computer Sciences and Engineering, College of Applied Studies, King Saud University, Riyadh, Saudi Arabia

Contributions: (I) Conception and design: All authors; (II) Administrative support: M Faisal, M Khan; (III) Provision of study materials or patients: M Faisal, M Khan, A Alharbi; (IV) Collection and assembly of data: M Faisal, M Khan, M Diyan; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Assoc. Prof. Dr. Abdullah Alharbi, PhD. Department of Computer Sciences and Engineering, College of Applied Studies, King Saud University, P.O. Box 28095, Riyadh 11437, Saudi Arabia. Email: arharbi@ksu.edu.sa.

Background: Burn injuries continue to pose a major global public health challenge, contributing to severe physical trauma and a high number of fatalities each year. Prompt and accurate assessment of burn severity is crucial for determining the appropriate course of treatment and improving patient outcomes. However, traditional diagnostic methods primarily depend on visual inspections by clinicians, which are inherently subjective and susceptible to errors. These limitations often lead to delayed or incorrect diagnoses. To overcome these challenges, the need for an automated, objective, and accurate diagnostic system has become increasingly evident in the field of burn care. Accordingly, the objective of this study is to develop and evaluate an automated framework for accurate burn severity classification using transfer learning (TL). We further aim to integrate interpretability tools to enhance clinical trust and applicability.

Methods: This study introduces the Transfer Learning-Based Burn Severity Level Detection System (TL-BSLD), an innovative framework that applies advanced computer vision and TL techniques to accurately classify the severity of burn injuries. The system utilizes three cutting-edge pre-trained models—VGG-19, Inception-v3, and NasNet—to analyze burn images and categorize them into four classes: normal, first-degree, second-degree, and third-degree burns. TL enhances the model’s ability to extract relevant features from limited medical data, improving classification performance. Moreover, to ensure transparency and clinical trust, the system incorporates two interpretability tools: Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights important regions in the input image influencing predictions, and SHapley Additive exPlanations (SHAP), which quantifies the contribution of each feature to the model’s decision.

Results: The proposed TL-BSLD system underwent extensive evaluation using standard performance metrics. Among the tested models, VGG-19 demonstrated superior performance, achieving an accuracy of 98.3% and an F1-score of 98.5%, indicating the system’s reliability and robustness in classifying burn severity levels. The integration of Grad-CAM and SHAP also provided valuable visual and feature-based insights into the model’s predictions, enhancing its interpretability and clinical applicability.

Conclusions: The TL-BSLD presents a significant advancement in automated burn diagnosis by combining high accuracy, rapid image classification, and interpretability. This approach not only improves diagnostic efficiency but also empowers clinicians with deeper insights into artificial intelligence (AI) decision-making. By reducing reliance on subjective assessments and minimizing diagnostic delays, the system demonstrates promising potential to enhance burn care and decision support; however, further validation in larger, prospective clinical studies is necessary before clinical deployment. Future work may explore deployment in real-time clinical settings and adaptation to mobile health platforms for broader accessibility.

Keywords: Burned skin diagnoses; transfer-learning; feature extraction; image classification; convolutional neural network (CNN)


Received: 21 April 2025; Accepted: 28 September 2025; Published online: 03 February 2026.

doi: 10.21037/jmai-2025-117


Highlight box

Key findings

• The Transfer Learning-Based Burn Severity Level Detection System (TL-BSLD) classifies burns into four severity levels: normal, first, second, and third-degree.

• VGG-19 outperformed Inception-v3 and NasNet, achieving the highest accuracy and F1-score in the four-class classification.

• Explainable AI (XAI) methods, Grad-CAM and SHapley Additive exPlanations (SHAP), were integrated to make the model transparent and build clinical trust.

• XAI analysis confirmed VGG-19 focused on clinically relevant features, like texture and edges, for classification.

What is known and what is new?

• Traditional visual burn diagnosis is subjective, inconsistent, and prone to error.

• Current machine learning/deep learning systems for burn diagnosis often have modest accuracy, handle only simple problems, and lack interpretability, which limits clinical trust.

• Transfer learning models (e.g., VGG-19, Inception-v3) can effectively extract features from limited medical data to achieve high accuracy.

What is the implication, and what should change now?

• The study introduces the TL-BSLD system, a robust, multi-class framework that successfully tackles the more complex four-class burn severity classification (normal, first, second, and third-degree) with high accuracy.

• It presents a highly accurate solution, with the VGG-19 model achieving an average accuracy of 98.3% and F1-score of 98.5%, which surpasses the performance of prior approaches in the literature.

• The research successfully integrates XAI techniques (Grad-CAM for visual localization and SHAP for feature contribution) into the TL-BSLD system, addressing the critical issue of interpretability that was lacking in many previous models and enhancing clinical trust.


Introduction

Background

Burn injuries represent a serious worldwide health issue, causing considerable illness and death each year. Their severity varies from mild surface burns to extensive tissue destruction, making early and precise diagnosis essential for guiding appropriate treatment. Early identification of burn severity is critical for initiating effective therapeutic strategies, improving patient outcomes, and minimizing complications. Conventional diagnosis depends on clinicians’ visual evaluations, which are often subjective and inconsistent. Recent progress in medical imaging and artificial intelligence (AI) has created opportunities to automate burn severity assessment, offering more reliable and precise results (1).

Knowledge gaps

Despite progress, several challenges persist in the literature regarding automated burn diagnosis. Traditional visual assessments remain the gold standard, but they are time-consuming and subject to interobserver variability, leading to diagnostic inconsistencies (2). Although machine learning (ML) and deep learning (DL) methods show potential, they typically depend on large labeled datasets, which are limited in medical imaging. A common shortcoming of current AI systems is poor interpretability, which reduces clinicians’ confidence in their predictions (3). Another key challenge is generalization, as models frequently perform poorly when tested on previously unseen datasets. These challenges highlight the need for robust, scalable, and explainable AI solutions to support clinicians in real-world applications. Recent advancements in computer vision and transfer learning (TL) have demonstrated significant potential in automating burn severity classification. Modern TL models like VGG-19, Inception-v3, and NasNet use pre-training on large datasets to extract important features from burn images, delivering strong accuracy and efficiency (4). Techniques like U-Net and Mask R-CNN are commonly employed for segmentation, enabling precise localization of burn regions. Some systems have incorporated synthetic datasets to address the scarcity of labeled medical images (5). However, while these approaches improve accuracy, most models fail to address the critical issue of interpretability, which is essential for clinical adoption and decision-making.

Numerous studies have applied ML and DL methods for the detection of burned skin. In (6), Otsu thresholding was used to isolate the burned region, after which statistical techniques were applied to obtain feature vectors. The extracted features were then used to estimate burn depth and classify burns into three severity levels. By applying several classifiers to different image regions, the study achieved an accuracy of about 74.5%. Another method, described in (7), introduced an end-to-end DL framework for both burn area segmentation and diagnosis. This framework included dataset collection, data augmentation, burn region segmentation, and subsequent classification steps. The system performed well in separating burn from non-burn regions, reporting Intersection over Union (IOU) =0.8467, pixel accuracy (PA) =0.9459, and Dice coefficient (DC) =0.9170. The present study used publicly accessible images from the Burn Data Center (8) and the Kaggle Skin Burn dataset (9). As these datasets are open-access and anonymized, no direct patient consent was required, and ethical considerations regarding data use were strictly followed.

Extensive research has focused on burn classification and segmentation with different ML and image-processing methods to enhance diagnostic accuracy and treatment decisions. For example, the authors of (10) applied a fuzzy-ARTMAP neural network to categorize burns into three groups: superficial dermal, deep dermal, and full-thickness. The classification achieved a success rate of 88%. The study also applied distance metric methods to perform image segmentation and classification. In (11), a comparable method was used, involving transformation and texture-based analysis to obtain feature vectors from burn images. Burn depth was determined using a support vector machine (SVM) classifier, which achieved 89.29% accuracy on validation data. Likewise, the authors of (12) presented a convolutional neural network (CNN)-based method for detecting burn regions using color image patches. In this method, burn images were partitioned into smaller patches, which the CNN classified as either normal skin or burn wounds. This approach obtained an overall precision of 75.91%. However, most prior approaches achieve modest accuracy, often address only binary or three-class problems, and lack transparency in decision-making. These limitations reduce clinical applicability. Thus, there is a clear gap for robust multi-class frameworks that combine high accuracy with interpretability, which our TL-BSLD system aims to address.

As a result, DL has become central to medical image analysis, delivering accurate and efficient solutions for the segmentation and classification of wounds. For instance, the authors of (13) presented a DL approach specifically designed to segment burn wounds in images with high accuracy. This method employed the Mask R-CNN framework, considered a leading DL model for segmentation tasks. Training was performed using 1,150 labeled images in COCO format, with 1,000 images used for training and the remainder for evaluation. The evaluation compared different backbone networks within the framework, including R101FA, R101A, and IV2RA, with model accuracy assessed using the DC metric. The results showed that the R101FA backbone achieved the highest accuracy of 84.51% when evaluated on 150 images. Furthermore, the segmentation performance of the three backbones was evaluated on images with varying burn depths: R101FA produced superior segmentation results for superficial, superficial-thickness, and deep partial-thickness burns, whereas R101A performed best in segmenting full-thickness burns. In addition, in (14), the authors proposed a DL model based on CNN segmentation to label each pixel as healthy or burned skin. They utilized atrous convolution to encode rich contextual information and employed a pre-trained model, ResNet-101, for better feature extraction. Subsequently, the authors of (15) presented a framework consisting of three steps: segmentation of burn images, feature extraction, and classification of segmented regions into healthy skin, burned skin, and background. They implemented SegNet-based semantic segmentation as a DL approach, combined with the fuzzy c-means algorithm for segmentation and a multilayer feed-forward artificial neural network (ANN) trained using back-propagation for classification. Accurate estimation of burn severity is crucial for effective treatment planning, and recent advancements in ML have enabled the development of innovative methods for this purpose. For instance, in (16), the BPBSAM method was proposed for estimating burn severity using body part-specific SVMs trained with CNN features extracted from images of burnt body parts. To address the limited availability of such images, larger datasets of non-burn images of different body parts were utilized. The method identifies the body part in burn images using a CNN and trains DL models for body part classification and feature extraction for severity estimation. The results demonstrate that BPBSAM achieves an overall average F1-score of 77.8% and accuracy of 84.85% on the test BI dataset, and 87.2% and 91.53%, respectively, on the UBI dataset.

For body part classification in burn images, an average accuracy of around 93% is achieved. In terms of burn severity assessment, the BPBSAM method outperforms generic methods with an overall average accuracy improvement of 10.61%, 4.55%, and 3.03% using the ResNet50, VGG16, and VGG-19 pipelines, respectively (16). Consequently, in (17), the authors proposed a hybrid approach, the Dense Mask Regional Convolutional Neural Network (Dense Mask RCNN), for segmenting the skin burn region according to the degree of burn severity. In this approach, Mask R-CNN (18) and dense pose estimation are integrated into Dense Mask RCNN, which estimates the full-body human pose and performs semantic segmentation. First, a residual network with dilated convolution and a weighted mapping model generates the dense feature map. The feature map is then fed into a region proposal network (RPN), which utilizes a feature pyramid network (FPN) to detect objects at different locations and pyramid levels in the input images. For accurate pixel-to-pixel label alignment, a region of interest (RoI)-pose align module aligns objects based on the human pose, normalizing scale, translation, and left-right flips to a standard reference. In a research study presented in (19), the authors tested three state-of-the-art pre-trained DL models—ResNet50, ResNet101, and ResNet152—for image pattern extraction via two TL strategies: a fine-tuning approach, where the dense and classification layers were modified and trained on features extracted by the base layers, and a second approach in which an SVM replaced the top layers of the pre-trained models and was trained using off-the-shelf features from the base layers. The authors of (20) proposed a deep convolutional neural network (DCNN)-based approach for detecting burn injury severity from real-time images of skin burns. The DCNN architecture leverages TL with fine-tuning, employing three types of pre-trained models on top of multiple convolutional layers with hyperparameter tuning for feature extraction, followed by a fully connected feed-forward neural network that classifies the images into three categories according to burn severity: first-, second-, and third-degree burns.

Integrating advanced ML techniques has shown significant potential in enhancing the accuracy of medical image classification, particularly in challenging domains like burn analysis. For example, in (21), the authors suggested integrating fuzzy c-means (FCM) and multilayer feed-forward ANN-based techniques to improve burn image classification. Texture features specifying the spatial arrangement of intensity variations are extracted for the classification task. The method utilizes four segmentation techniques—FCM, K-means, expectation-maximization, and simple linear iterative clustering—and the input image is segmented according to the chosen technique. Next, a combination of statistical histogram features and Haralick texture features forms the first feature matrix (FM1), which is used to train the classifier. A forward selection algorithm then selects the optimal number of features, which are forwarded to FM2. Finally, the FM3 matrix is used to train the classifier that segments the burn portions of the image.

Objectives

This study introduces the TL-BSLD system, an AI-powered approach for precise and efficient burn severity classification. The system leverages three TL architectures—VGG-19, Inception-v3, and NasNet—to classify burn images into four categories: normal, first-degree, second-degree, and third-degree burns (22). To enhance its clinical applicability, we integrate TL-BSLD with explainable AI (XAI) techniques, including Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP), which provide visual and textual justifications for the model’s decisions (23). These explainability features address AI’s black-box nature, building clinicians’ trust and fostering adoption in healthcare. The TL-BSLD sets a new standard for automated burn severity assessment by combining high accuracy with interpretability. Its significance lies in its potential to revolutionize burn diagnosis by offering an objective and automated approach: by reducing reliance on subjective assessments and providing prompt and accurate classification, TL-BSLD can assist healthcare professionals in making well-informed treatment decisions, optimizing resource allocation, and delivering timely interventions, thereby advancing patient care. Additionally, the system’s efficiency can lead to significant time savings, enabling healthcare providers to allocate their resources effectively. The performance of the proposed TL-BSLD system is evaluated on its ability to classify burn injuries into their respective degrees, using metrics such as accuracy, F1-score, sensitivity, and precision to assess its effectiveness in distinguishing between different degrees of burn severity. Moreover, we compare the performance of the three TL architectures to determine which model yields the best results in burn degree detection. Table 1 summarizes the schemes discussed above.

Table 1

Summary of the related papers

Reference Segmentation algorithm Classification model Classes Measurements
(2) Semantic Segmentation (U-Net with ResNet50, HRNetV2) N/A IOU =0.8467, PA =0.9459, DC =0.9170
(3) Distance Metrics Fuzzy-ARTMAP Neural Network 3 Accuracy =88.6%
(4) N/A SVM 2 Accuracy =89.29%
(5) CNN-based Semantic Segmentation (Hue & saturation patches) CNN 2 Accuracy =75.91%
(6) R101FA, R101A, IV2RA CNN with VGG-19 IOU =0.67, PA =0.85
(11) Semantic Segmentation ResNet-101 with Atrous Convolution (R101A) 2 Accuracy =84.51%
(12) Digital Image Processing (Clustering) 2 Accuracy =91%
(14) Fuzzy C-Means Segmentation Multilayer Feed-Forward ANN 2 F1-score =74.28%
(15) CNN SVM 2 F1-score =77.8%
(16) Semantic Segmentation (DenseMask RCNN) CNN 3
(24) ResNet101 + SVM 3
(19) Semantic Segmentation CNN (Transfer Learning, VGG16) 3
(20) Fuzzy C-Means, K-means, EM, SLIC SVM 3
(21) SVM SVM 3 -

ANN, artificial neural network; CNN, convolutional neural network; DC, Dice coefficient; EM, expectation-maximization; IOU, Intersection over Union; N/A, not applicable; PA, pixel accuracy; RCNN, regional convolutional neural network; SLIC, simple linear iterative clustering; SVM, support vector machine.


Methods

Study design

This work departs from conventional image processing approaches, opting instead to employ CNNs for burn severity classification. Rather than building models from scratch, we used pre-trained CNNs to enhance efficiency, accuracy, and the extraction of higher-level features such as edges and textures. As explained later in the paper, additional layers were added to these pre-trained models. The proposed TL-BSLD is designed to determine the degree of burnt skin (normal, first-degree, second-degree, and third-degree) based on TL techniques. In TL-BSLD, we developed three different TL-based systems (Inception-v3, NasNet, and VGG-19) to determine the degree of burnt skin from raw images without requiring hand-crafted feature extraction. Approximately 1,500 images were gathered from publicly available datasets (8,9), distributed across four categories: 600 images belong to the normal class, and 300 images each belong to the first-degree, second-degree, and third-degree classes. We employed the Computer Vision Annotation Tool (CVAT) to manually delineate the burn wound areas in the images (25). The images were annotated meticulously to guarantee accurate representation of the burn wound regions, using CVAT’s segmentation annotation mode to draw polygons around them. Representative examples from the dataset are shown in Figure 1. The dataset provided four broad categories (normal, first-degree, second-degree, and third-degree) without further subdivision of partial-thickness burns; therefore, we limited our classification to these clinically recognized levels, consistent with the labeling available in the public datasets. The CVAT segmentations served to annotate and delineate burn wound regions and ensure accurate labeling of the dataset; they were not used to crop regions of interest for classifier input. Instead, the CNN models were trained on the resized and augmented raw images, with segmentation serving as a supporting step for annotation and dataset preparation. Segmentation was performed by trained annotators and validated by faculty experts in AI and medical imaging. No formal inter-rater reliability was assessed, which is noted as a limitation for future studies. To enhance interpretability, SHAP values were computed at the pixel level from the final convolutional layers of each model. This allowed us to quantify individual pixel contributions, complementing Grad-CAM’s region-level visualization.

Figure 1 Samples of the dataset (8,9). As these images are open-access and anonymized, no direct patient consent was required; ethical considerations regarding data use were strictly followed.

Model architecture

In this study, we evaluated three TL architectures independently—VGG-19, Inception-v3, and NasNet—rather than combining them into an ensemble. Each model was trained and tested separately, and their performance metrics were compared to identify the most effective architecture for burn severity classification. VGG-19 (Figure 2) is a 19-layer CNN proposed by the Visual Geometry Group (VGG) at Oxford. This model was created for the ILSVRC-2014 image classification challenge and trained on the large-scale ImageNet dataset. The VGG-19 architecture is known for its straightforward design, featuring stacked 3×3 convolution layers with progressively greater depth. Dimensionality reduction within the network is accomplished using max pooling operations. A Softmax classifier follows two fully connected layers with 4,096 nodes each. In our proposed system, we froze layers 1–15 of the pre-trained VGG-19 network to retain low-level feature extraction (edges, textures, and shapes). For four-class burn classification, we modified VGG-19 by adding five layers: global average pooling for dimensionality reduction, dropout (0.3) to limit overfitting, two dense layers (128 and 64 units) for task-specific learning, and a final Softmax for classification into four categories. This modified VGG-19 architecture has 20,098,499 parameters, of which 2,433,923 are trainable and 17,664,576 are non-trainable.

Figure 2 The VGG-19 architecture.
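For concreteness, the following is a minimal sketch of this modified VGG-19 head in Keras/TensorFlow, assuming the layer sizes described above; the ReLU activations and other details are illustrative assumptions rather than the authors’ exact configuration.

```python
# Minimal sketch of the modified VGG-19 described above (Keras/TensorFlow).
# Layer sizes follow the text; ReLU activations are an assumption.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze layers 1-15 to retain low-level features (edges, textures, shapes).
for layer in base.layers[:15]:
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # dimensionality reduction
    layers.Dropout(0.3),                    # limit overfitting
    layers.Dense(128, activation="relu"),   # task-specific learning
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),  # normal / first / second / third degree
])
```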

The Inception CNN architecture was first presented as GoogLeNet, also known as Inception-v1. Later, Ioffe and Szegedy enhanced the Inception design with batch normalization, which resulted in the Inception-v2 model. This was followed by the work of Szegedy et al. in 2015, where they introduced factorization to enhance Inception-v2 and named it Inception-v3 (26). The core idea of the Inception architecture is to discover an efficient local structure for convolutional networks and replicate it across spatial dimensions. The motivation behind Inception stems from the notion that multiple connections between layers can be redundant and carry correlated information. Therefore, the Inception architecture employs a parallel approach with 22 layers (Figure 3). Additionally, it incorporates several auxiliary classifiers within the intermediate layers, enhancing the discriminative capacity of the lower layers (27). For Inception-v3 in the proposed TL-BSLD system, the convolutional base was frozen and five new trainable layers were added: global average pooling, a dense layer (1,024 units), batch normalization, a second dense layer (1,024 units), and a final Softmax layer. This modified Inception-v3 architecture encompasses a total of 22,073,507 parameters, of which 270,723 are trainable and 21,802,784 are non-trainable. The NasNet architecture was modified in the same manner as VGG-19 and Inception-v3.

Figure 3 The used Inception-v3 architecture.
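A corresponding sketch of the Inception-v3 variant follows, again assuming Keras/TensorFlow and treating the ReLU activations as assumptions.

```python
# Minimal sketch of the Inception-v3 variant described above (Keras/TensorFlow).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # the convolutional base is frozen

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1024, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(1024, activation="relu"),
    layers.Dense(4, activation="softmax"),  # four burn severity classes
])
```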

Model training

As illustrated in Figure 4, we started by collecting dataset images with different burnt skin levels. The dataset was split into 80% training and 20% validation subsets per class (e.g., 480/120 for normal, 240/60 for each burn type), ensuring balanced representation. Augmentation was applied equally across all classes. To further expand the dataset, we applied augmentation using the Keras ImageDataGenerator with the following parameters: 45° rotations, 0.2 shifts (height and width), 0.2 zoom, and 0.2 shear transformations. These augmentations helped increase dataset diversity and reduce overfitting during training. Additionally, we resized all images to 224×224, as required by the models. We then trained the modified pre-trained CNN models (VGG-19, Inception-v3, and NasNet) to determine the degree of burnt skin. Our training process used the following parameters: a batch size of 16, 30 epochs, and the ADAM optimizer with a learning rate of 0.0001, β1=0.9, β2=0.999, and ε=1e−07. Robustness was assessed through 5-fold cross-validation applied during both training and validation phases, which helped validate the generalization capabilities of the proposed architectures. To mitigate class imbalance, heavy augmentation was applied to the burn classes, and stratified 5-fold cross-validation was adopted to maintain proportional class representation across folds.

Figure 4 Proposed burned skin level detection system. Aug., augmentation.
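The augmentation and optimizer settings above can be sketched as follows, assuming the model object from the earlier sketches; the directory layout and rescaling step are hypothetical placeholders, not the authors’ actual setup.

```python
# Hedged sketch of the reported augmentation and training configuration.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,      # assumption: pixel normalization to [0, 1]
    rotation_range=45,      # 45-degree rotations
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    shear_range=0.2,
)

train_flow = train_gen.flow_from_directory(
    "data/train",           # hypothetical path with one subfolder per class
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
)

model.compile(
    optimizer=Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(train_flow, epochs=30)
```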

Statistical analysis

To assess the effectiveness of the proposed Burned Skin Level Detection System (TL-BSLD), three widely recognized pre-trained DL models—Inception-v3, NasNet, and VGG-19—were employed. The evaluation of each model was conducted using standard performance metrics, including accuracy, F1-score, precision, recall, and the confusion matrix. Each architecture underwent a five-fold cross-validation process, with 30 training epochs per fold. The reported performance values represent the averaged results across all folds. Among the evaluated models, VGG-19 demonstrated the highest classification performance, achieving an average accuracy of 98.3%, an F1-score of 98.5%, a precision of 98.1%, and a recall of 97.7%. These results indicate that the VGG-19-based implementation of the TL-BSLD system is highly reliable and consistent in identifying various degrees of burn injuries. The model’s strong performance across all evaluation metrics underscores its potential suitability for clinical decision support in burn severity assessment. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. IRB approval and patient consent are not applicable to this study.
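The cross-validated evaluation could be sketched as below, assuming scikit-learn; here X and y are the preprocessed images and labels, build_model() is a hypothetical factory for the architectures above, and macro averaging of the F1-score is an assumption.

```python
# Simplified sketch of the stratified five-fold evaluation protocol.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_acc, fold_f1 = [], []
for train_idx, val_idx in skf.split(X, y):
    model = build_model()   # hypothetical: returns a compiled VGG-19/Inception-v3/NasNet variant
    model.fit(X[train_idx], tf.keras.utils.to_categorical(y[train_idx], 4),
              epochs=30, batch_size=16, verbose=0)
    preds = np.argmax(model.predict(X[val_idx]), axis=1)
    fold_acc.append(accuracy_score(y[val_idx], preds))
    fold_f1.append(f1_score(y[val_idx], preds, average="macro"))

print(f"Average accuracy: {np.mean(fold_acc):.3f}, average F1-score: {np.mean(fold_f1):.3f}")
```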


Results

In the proposed TL-BSLD system, three well-known pre-trained DL architectures are used: Inception-v3, NasNet, and VGG-19. The training of all models was performed on a laptop workstation with an Intel i9-9880H processor @2.3 GHz, 32 GB of RAM, and an 8 GB GPU, running 64-bit Windows 10. The TL-BSLD system was assessed using standard performance indicators, including accuracy, precision, recall, F1-score, and confusion matrices. We performed five-fold cross-validation with 30 epochs per fold for the VGG-19, Inception-v3, and NasNet models and took the overall average of all results. Table 2 summarizes the performance evaluation metrics (accuracy, F1-score, recall, and precision) of the VGG-19 model. Here, accuracy measures overall correctness, precision indicates the ratio of true positives among predicted positives, recall measures the proportion of correctly identified positives, and the F1-score represents the harmonic mean of precision and recall:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Table 2

Performance metrics for VGG-19 architectures of all folds

Measurements Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Accuracy 0.981 0.981 0.981 0.983 0.978 0.982
F1-score 0.983 0.983 0.983 0.979 0.98 0.983
Average precision
   Normal 0.98 0.99 0.98 0.991 0.97 0.982
   Level 1 0.99 0.98 0.96 0.980 1 0.982
   Level 2 0.98 0.98 0.99 0.970 0.981 0.985
   Level 3 0.98 0.97 0.98 0.980 0.992 0.980
Average recall
   Normal 0.98 0.97 0.98 0.990 0.964 0.977
   Level 1 0.97 0.99 0.99 0.980 0.977 0.982
   Level 2 0.98 0.98 0.95 0.989 0.987 0.977
   Level 3 0.98 0.99 0.98 0.992 0.99 0.986
Avg. precision 0.9825 0.98 0.9775 0.980 0.98575 0.982
Avg. recall 0.9775 0.9825 0.975 0.988 0.9795 0.981

Avg., average.

According to the results presented in Table 2, the VGG-19 model achieved an average accuracy of 98.2% across the five folds, with an F1-score of 98.3%, an average precision of 98.2%, and an average recall of 98.1%.

As illustrated in Table 3, NasNet achieved an average accuracy of 97.1% across the five folds, with an F1-score of 97.9%, an average precision of 97.6%, and an average recall of 97.6%.

Table 3

Performance metrics for NasNet architectures of all folds

Measurements Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Accuracy 0.965 0.988 0.964 0.972 0.964 0.971
F1-score 0.968 0.989 0.97 0.976 0.99 0.979
Average precision
   Normal 0.98 0.99 0.99 0.987 0.98 0.985
   Level 1 0.97 0.98 0.99 0.980 1 0.984
   Level 2 0.9 0.98 0.92 0.977 0.96 0.955
   Level 3 0.99 0.97 0.98 0.990 0.97 0.980
Average recall
   Normal 0.98 0.97 0.97 0.970 0.97 0.972
   Level 1 0.9 0.99 0.92 0.990 0.98 0.955
   Level 2 0.99 0.98 0.99 0.984 0.99 0.987
   Level 3 0.98 0.99 0.99 0.998 0.99 0.990
Avg. precision 0.96 0.98 0.97 0.984 0.9775 0.976
Avg. recall 0.9625 0.9825 0.9675 0.986 0.9825 0.976

Avg., average.

As presented in Table 4, the Inception-v3 model achieved an average accuracy of 95.6% across the five folds, with an F1-score of 95.4%, an average precision of 95.1%, and an average recall of 95.8%.

Table 4

Performance metrics for Inception architectures of all folds

Measurements Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Accuracy 0.938 0.94 0.981 0.940 0.981 0.956
F1-score 0.922 0.94 0.983 0.940 0.983 0.954
Average precision
   Normal 0.97 0.92 0.98 0.920 0.98 0.954
   Level 1 0.93 0.91 0.96 0.910 0.96 0.934
   Level 2 0.91 0.97 0.99 0.970 0.99 0.952
   Level 3 0.98 0.94 0.98 0.940 0.98 0.964
Average recall
   Normal 0.94 0.99 0.98 0.990 0.98 0.976
   Level 1 0.92 0.92 0.99 0.920 0.99 0.951
   Level 2 0.91 0.94 0.95 0.940 0.95 0.938
   Level 3 0.91 0.96 1 0.960 1 0.966
Avg. precision 0.9475 0.935 0.9775 0.935 0.9775 0.951
Avg. recall 0.92 0.9525 0.98 0.953 0.98 0.958

Avg., average.

Table 5 provides an overview of the average performance evaluation metrics, including accuracy, F1-score, recall, and precision, for each model in the proposed TL-BSLD system. The results in Table 5 demonstrate that the VGG-19 model achieved the highest average accuracy of 98.2%, whereas the Inception-v3 model exhibited the lowest average accuracy of 95.6%. Similarly, the VGG-19 model obtained the maximum average F1-score of 98.3%, while the Inception-v3 model obtained the minimum average F1-score of 95.4%. Regarding average precision, the NasNet model achieved the highest score of 98.5%, while the Inception-v3 model achieved the lowest of 95.4%. Additionally, the NasNet model demonstrated the maximum average recall of 98.4%, whereas the Inception-v3 model showed the minimum average recall of 93.4%.

Table 5

Average performance metrics for the proposed TL-BSLD system

Measurements VGG-19 Inception-v3 NasNet
Accuracy 0.982 0.956 0.971
F1-score 0.983 0.954 0.979
Avg. precision 0.982 0.954 0.985
Avg. recall 0.982 0.934 0.984
Average precision
   Normal 0.985 0.952 0.955
   Level 1 0.980 0.964 0.980
   Level 2 0.977 0.976 0.972
   Level 3 0.982 0.951 0.955
Average recall
   Normal 0.977 0.938 0.987
   Level 1 0.986 0.966 0.990
   Level 2 0.982 0.951 0.976
   Level 3 0.981 0.958 0.976

TL-BSLD, Transfer Learning-Based Burn Severity Level Detection System.

Figure 5 illustrates the training and validation accuracy of VGG-19 for one fold of the cross-validation, over 30 epochs; Figure 6 illustrates the training and validation loss for the same fold. The loss function used in this study was categorical cross-entropy, defined as:

\[ L = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) \]

Figure 5 Learning accuracy curves of one fold on VGG-19 with 30 epochs.
Figure 6 Learning loss curves of one fold on VGG-19 with 30 epochs.

where \(y_i\) is the true (one-hot encoded) label, \(\hat{y}_i\) is the predicted probability for class \(i\), and \(N\) is the total number of classes.

As shown in Figures 5,6, the VGG-19 model exhibits an excellent fit and consistent performance. The training and validation loss decreased to a point of stability with a minimal gap between the two final loss values.

Figure 7 depicts class-wise confusion results, showing that most errors occur between first- and second-degree burns due to visual similarity. This visualization enhances interpretability by clearly illustrating the distribution of correct and incorrect predictions across all burn categories.

Figure 7 The confusion matrix for VGG-19 for one fold of the proposed TL-BSLD system. TL-BSLD, Transfer Learning-Based Burn Severity Level Detection System.

In this section, we compare the proposed TL-BSLD’s performance metrics with those of the reference studies on burned skin classification. Table 6 summarizes the performance of the VGG-19, Inception-v3, and NasNet models of the proposed TL-BSLD, as well as that of the reference studies. The results in Table 6 demonstrate that the proposed VGG-19 model achieved the highest average accuracy of 98.3% and the maximum average F1-score of 98.5% with four classes. In contrast, the reference studies reported lower accuracies and F1-scores.

Table 6

Comparison between proposed TL-BSLD and the studies in the literature

Ref. Classification model No. of classes Measurements
(2) Fuzzy-ARTMAP neural network 3 Accuracy (88.6%)
(3) SVM 2 Accuracy (89.29%)
(12) Residual Network-101 2 Accuracy (84.51%)
(5) Clustering segmentation 2 Accuracy (91%)
(14) U-Net with ResNet-101 2 Accuracy (93.4%)
(15) MFFNN 2 F1-score (74.28%)
(16) SVM 2 F1-score (77.8%)
Proposed TL-BSLD VGG-19 4 Accuracy (98.3%)
F1-score (98.5%)
Inception-v3 4 Accuracy (95.3%)
F1-score (94.8%)
NasNet 4 Accuracy (97.2%)
F1-score (97.6%)

MFFNN, Multilayer Feed-Forward Neural Network; SVM, support vector machine; TL-BSLD, Transfer Learning-Based Burn Severity Level Detection System.

Finally, it is noted that many reference works addressed 2- or 3-class tasks, whereas TL-BSLD tackled a 4-class classification. This increases task complexity, and therefore, numerical comparisons should be viewed as indicative rather than strictly equivalent.

The integration of XAI techniques into the proposed TL-BSLD provided significant insights into the model’s decision-making process. Grad-CAM heatmaps demonstrated that the VGG-19 model consistently focused on critical regions in burn images, such as edges and texture differences, which are clinically relevant to burn severity. For instance, in first-degree burns, the model highlighted smooth, less severe burn areas, whereas for third-degree burns, it emphasized deeper and darker regions, aligning with medical expectations. In addition, SHAP analysis revealed the contribution of individual features to the classification decision. For example, texture-based features, such as pixel intensity variations, were identified as critical factors in distinguishing first-degree burns from normal skin. This insight validated the model’s accuracy and ensured its predictions were interpretable and clinically actionable. Figure 8 shows the original burn image (left) and the Grad-CAM heatmap overlay (right). Red regions indicate the areas most critical for the model’s decision, yellow regions show moderately influential areas, and blue regions represent the least influential areas.

Figure 8 Visualization of the original burn image (left) and the Grad-CAM heatmap overlay (right).
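A heatmap like the one in Figure 8 can be produced with a standard Grad-CAM routine; the sketch below assumes TensorFlow/Keras and a functional model whose last convolutional layer is addressable by name (e.g., "block5_conv4" in VGG-19), and should be read as illustrative rather than the authors’ exact implementation.

```python
# Hedged Grad-CAM sketch: gradient-weighted activations of the last conv layer.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name="block5_conv4"):
    """Return a [0, 1] heatmap of the regions driving the predicted class."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_idx = int(tf.argmax(preds[0]))          # predicted severity class
        class_channel = preds[:, class_idx]
    grads = tape.gradient(class_channel, conv_out)    # d(class score)/d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))   # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()                                # upsample and overlay for display
```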

The quantitative evaluation of the explainability module showed that 90% of the Grad-CAM-highlighted regions overlapped with areas manually segmented by medical experts, indicating high agreement with human evaluations. In addition, SHAP analysis showed that the top five features contributing to predictions covered 85% of the cumulative importance, demonstrating the model’s efficiency in feature utilization. The left sub-figure of Figure 8 shows the original burn image, while the right sub-figure overlays a Grad-CAM heatmap on the same image. The heatmap highlights the areas the DL model focuses on when predicting burn severity; reddish regions in the overlay indicate the most critical areas contributing to the model’s decision, such as regions with distinct textures, color changes, or patterns indicative of burn severity. SHAP overlays were also generated to provide feature attribution at the pixel level, and the 90% overlap between SHAP-highlighted and expert-annotated regions further supports interpretability.
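Pixel-level SHAP attributions of the kind described above can be obtained with the shap library’s GradientExplainer; in this sketch, model is one of the trained networks, X_train/X_val are hypothetical preprocessed image arrays, and the background-sample size is an illustrative choice.

```python
# Minimal sketch of pixel-level SHAP attribution for the trained classifier.
import numpy as np
import shap

# A small background set approximates the expectation over the data distribution.
background = X_train[np.random.choice(len(X_train), 50, replace=False)]
explainer = shap.GradientExplainer(model, background)

shap_values = explainer.shap_values(X_val[:4])  # per-pixel contributions, one array per class
shap.image_plot(shap_values, X_val[:4])         # overlay attributions on the input images
```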


Discussion

This study introduced the TL-BSLD framework, a TL-based multi-CNN approach for burn severity detection, and demonstrated its ability to achieve high diagnostic accuracy. Among the three evaluated models, the modified VGG-19 achieved the best performance, with an average accuracy of 98.3% and F1-score of 98.5%. These results highlight the feasibility of leveraging pre-trained CNNs for accurate and rapid classification of burn injuries into four clinically relevant categories. Importantly, the integration of Grad-CAM and SHAP provided visual and feature-level explanations, enhancing model transparency and supporting clinical trust.

Compared with prior research, TL-BSLD shows notable improvements. Earlier methods relying on handcrafted features or statistical analysis achieved accuracies in the range of 74–89%. More recent CNN- and segmentation-based methods have reported accuracies between 84% and 93%. The performance of the VGG-19 variant in TL-BSLD surpasses these, indicating that TL with fine-tuned deep architectures can better capture discriminative patterns in burn images. Furthermore, the use of explainable AI distinguishes this study from many previous works, where model interpretability remained limited and hindered clinical adoption.

Several strengths should be emphasized. The study employed stratified five-fold cross-validation, extensive augmentation, and multiple architectures to ensure robust evaluation, and the integration of interpretability methods is a significant advance toward clinician acceptance. Nonetheless, limitations remain. One concerns the lack of finer-grained burn categories: superficial and deep partial-thickness burns were not separately available in the dataset, and including such detailed clinical stratification in future datasets will strengthen the system’s clinical relevance. Further limitations include: (I) the absence of an independent external test set, which may risk overfitting; (II) the use of limited folds instead of higher k-fold validation, affecting robustness; and (III) the lack of baseline classifiers for comparison. Addressing these will enhance the reliability of results in future work. In addition, the dataset, though carefully annotated, was limited to publicly available sources and may not fully represent clinical diversity. No inter-rater reliability analysis was performed for annotations, and the framework was evaluated only on static images rather than longitudinal wound progression. These constraints may affect the system’s generalizability.

Future work should focus on prospective clinical validation using larger, multi-institutional datasets, integration into real-time clinical workflows, and extension to mobile platforms for broader accessibility. Additionally, combining image-based predictions with clinical metadata could further improve diagnostic accuracy and support personalized treatment planning.

In summary, TL-BSLD demonstrates that TL combined with explainability can yield a reliable and interpretable system for burn severity classification. The proposed framework holds promise to reduce diagnostic subjectivity, improve decision-making, and ultimately enhance patient outcomes in burn care.


Conclusions

In this study, we presented the TL-BSLD as an intelligent approach for precise and efficient diagnosis of burn injury degrees. By leveraging computer vision techniques and three TL architectures (VGG-19, Inception-v3, and NasNet), the TL-BSLD system demonstrated exceptional performance in accurately classifying burn images into four distinct categories: normal, first-degree, second-degree, and third-degree. The results of our evaluation showcased the effectiveness of the TL-BSLD system in burn degree detection. Notably, the VGG-19 architecture exhibited outstanding performance metrics, achieving an accuracy of 98.3% and an F1-score of 98.5%. Moreover, incorporating XAI techniques, such as Grad-CAM and SHAP, into the proposed system represents a significant advancement in AI-based burn severity classification. By enhancing interpretability, the system addresses one of the key barriers to AI adoption in healthcare, paving the way for more reliable and trustworthy applications in clinical environments. These findings underscore the system’s capability to provide highly accurate and reliable burn severity assessments, enabling timely and appropriate treatment interventions. The successful implementation of the TL-BSLD system has the potential to revolutionize burn care, benefiting patients and healthcare providers alike. By automating the diagnosis process and providing accurate burn severity assessments, the system can assist healthcare professionals in making informed decisions, optimizing resource allocation, and ultimately improving patient outcomes. In the future, we plan to integrate the proposed system into healthcare providers’ workflows; this workflow integration will be addressed in future extensions of this research.


Acknowledgments

None.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-117/prf

Funding: This research project was funded by Ongoing Research Funding Program (ORF-2025-444), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-117/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. IRB approval and patient consent are not applicable to this study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Rahman S, Rahman M, Faezipour M, et al. Enhancing Burn Severity Assessment With Deep Learning: A Comparative Analysis and Computational Efficiency Evaluation. IEEE Access 2024;12:147249-68.
  2. Zhang R, Tian D, Xu D, et al. A Survey of Wound Image Analysis Using Deep Learning: Classification, Detection, and Segmentation. IEEE Access 2022;10:79502-15.
  3. Osman OB, Harris ZB, Khani ME, et al. Deep neural network classification of in vivo burn injuries with different etiologies using terahertz time-domain spectral imaging. Biomed Opt Express 2022;13:1855-68. [Crossref] [PubMed]
  4. Rahman S, Faezipour M, Ribeiro GA, et al. Attention-Based CNN Model for Burn Severity Assessment. Pittsburgh, PA, USA: 2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), 15-18 Oct. 2023.
  5. Schenkenfelder B, Kaltenleithner S, Sabrowsky-Hirsch B, et al. Synthesizing Diagnostic Burn Images For Deep Learning Applications. San Diego, CA, USA: 2022 Annual Modeling and Simulation Conference (ANNSIM) 2022;18-20.
  6. Rehman Butt AU, Ahmad W, Ashraf R, et al. Computer Aided Diagnosis (CAD) for Segmentation and Classification of Burnt Human skin. Swat, Pakistan: 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Swat, Pakistan, 2019.
  7. Liu H, Yue K, Cheng S, et al. A framework for automatic burn image segmentation and burn depth diagnosis using deep learning. Computational and Mathematical Methods in Medicine 2021; [Crossref]
  8. Burn Data Center. Dataset of the burnt images. University of Washington. [Online]. Available online: https://burndata.washington.edu/about-bms/ (accessed: Sep. 23, 2024; Oct. 17, 2024).
  9. Baid S. Burn skin dataset. Kaggle. [Online]. Available online: https://www.kaggle.com/datasets/shubhambaid/skin-burn-dataset (accessed: Oct. 17, 2024).
  10. Serrano C, Acha B, Gómez-Cía T, et al. A computer assisted diagnosis tool for the classification of burns by depth of injury. Burns 2005;31:275-81. [Crossref] [PubMed]
  11. Wantanajittikul K, Auephanwiriyakul S, Theera-Umpon N, et al. Automatic segmentation and degree identification in burn color images. Chiang Mai, Thailand: The 4th 2011 Biomedical Engineering International Conference; 2012:169-173. doi: 10.1109/BMEiCon.2012.6172044.
  12. Badea MS, Vertan C, Florea C, et al. Automatic burn area identification in color images. Bucharest, Romania: 2016 International Conference on Communications (COMM); 2016:65-68. doi: 10.1109/ICComm.2016.7528325.
  13. Jiao C, Su K, Xie W, et al. Burn image segmentation based on Mask Regions with Convolutional Neural Network deep learning framework: more accurate and more convenient. Burns Trauma 2019;7:6. [Crossref] [PubMed]
  14. Chauhan J, Goyal P. Convolution neural network for effective burn region segmentation of color images. Burns 2021;47:854-62. [Crossref] [PubMed]
  15. Sevik U, Karakullukçu E, Berber T, et al. Automatic Classification of Skin Burn Color Images Using Texture Based Feature Extraction. IET Image Processing 2019;13:2018-2028.
  16. Chauhan J, Goyal P. BPBSAM: Body part-specific burn severity assessment model. Burns 2020;46:1407-23. [Crossref] [PubMed]
  17. Pabitha C, Vanathi B. Densemask RCNN: A Hybrid Model for Skin Burn Image Classification and Severity Grading. Neural Process Lett 2021;53:319-37.
  18. He K, Gkioxari G, Dollár P, et al. Mask R-CNN. Venice, Italy: 2017 IEEE International Conference on Computer Vision (ICCV); 2017:2961-9.
  19. Abubakar A, Ajuji M, Usman Yahya I. Comparison of deep transfer learning techniques in human skin burns discrimination. Applied System Innovation 2020;3:20.
  20. Suha SA, Sanam TF. A deep convolutional neural network-based approach for detecting burn severity from skin burn images. Machine Learning with Applications 2022;9:100371.
  21. Yadav DP, Sharma A, Singh M, et al. Feature Extraction Based Machine Learning for Human Burn Diagnosis From Burn Images. IEEE J Transl Eng Health Med 2019;7:1800507. [Crossref] [PubMed]
  22. Rahman S, Faezipour M, Ribeirod GA, et al. Inflammation Assessment of Burn Wound with Deep Learning. Las Vegas, NV, USA: 2022 International Conference on Computational Science and Computational Intelligence (CSCI); 2022:1792-5. doi: 10.1109/CSCI58124.2022.00318.
  23. Rozo A, Miskovic V, Rose T, et al. A Deep Learning Image-to-Image Translation Approach for a More Accessible Estimator of the Healing Time of Burns. IEEE Trans Biomed Eng 2023;70:2886-94. [Crossref] [PubMed]
  24. He K, Zhang X, Ren S, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37:1904-16. [Crossref] [PubMed]
  25. CVAT Team. CVAT (Computer Vision Annotation Tool): Annotate better with CVAT, the industry-leading data engine for machine learning. GitHub. Available online: https://github.com/cvat-ai/cvat (accessed: June 21, 2025).
  26. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015:1-9.
  27. Mahdianpari M, Salehi B, Rezaee M, et al. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Remote Sensing 2018;10:1119.
doi: 10.21037/jmai-2025-117
Cite this article as: Faisal M, Khan M, Diyan M, Alharbi A. TL-BSLD: a transfer learning-based multi-convolutional neural network framework for burn severity level detection. J Med Artif Intell 2026;9:21.
