Skin cancer classification using vision transformers and explainable artificial intelligence
Original Article


Getamesay Haile Dagnaw, Meryam El Mouhtadi, Musa Mustapha

School of Digital Engineering and Artificial Intelligence, Euro-Mediterranean University of Fes, Fes, Morocco

Contributions: (I) Conception and design: GH Dagnaw, M Mustapha; (II) Administrative support: M El Mouhtadi; (III) Provision of study materials or patients: GH Dagnaw; (IV) Collection and assembly of data: GH Dagnaw; (V) Data analysis and interpretation: GH Dagnaw; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Getamesay Haile Dagnaw, PhD Student. School of Digital Engineering and Artificial Intelligence, Euro-Mediterranean University of Fes, Route Principale Fès Meknès, Fes, Morocco. Email:

Background: Skin cancer diagnosis is a critical aspect of dermatological healthcare, and requires accurate and efficient classification methods. Recently, vision transformers (ViTs) and convolutional neural networks (CNNs) have emerged as promising architectures. However, the interpretability of these models remains a concern, hindering their widespread adoption in the clinical setting. Therefore, the aim of this research is to propose an explainable skin cancer classification using deep learning and explainable artificial intelligence methods.

Methods: This study presents skin cancer classification utilizing two pretrained ViTs, three Swin transformers, five pretrained CNNs, and three visual-based explainable artificial intelligence (XAI) models. The ViT-base, ViT-large, Swin-tiny, Swin-base, and Swin-small transformer models and the VGG19, ResNet18, ResNet50, MobileNetV2, and DenseNet201 pretrained CNN models were used for classification. For explanation, gradient-weighted class activation mapping (Grad-CAM), Grad-CAM++, and score-weighted class activation mapping (Score-CAM) XAI models were adopted. The study used freely available datasets to train and test the proposed models and adopted the synthetic minority oversampling technique (SMOTE) to address class imbalance issues.

Results: The performances of the ViT and CNN models were evaluated using five performance metrics: accuracy, precision, F1 score, sensitivity, and specificity. The ViT models demonstrated competitive performance, with ViT-large and ViT-base each achieving an accuracy of 88.6%. Swin-base exhibited balanced sensitivity and specificity. Overall, ResNet50 outperformed the other tested models with an accuracy of 88.8%, precision of 86.9%, sensitivity of 88.6%, F1 score of 87.8%, and specificity of 88.9%. The integration of XAI techniques into the ResNet50 model showed that the model learns from relevant regions of the image to classify a given image as benign or malignant.

Conclusions: This study presents ViT and CNN models for skin cancer classification, and the XAI techniques applied to the model contribute to enhancing the transparency of the decision-making process of deep learning models. These findings will aid in accurate and trustworthy skin cancer classification and will be vital for clinical adoption in enhancing clinical decision-making in dermatological healthcare.

Keywords: Deep learning; skin cancer classification; vision transformers (ViTs); Swin transformers

Received: 08 January 2024; Accepted: 18 March 2024; Published online: 28 May 2024.

doi: 10.21037/jmai-24-6

Highlight box

Key findings

• The vision transformer (ViT) models, particularly ViT-large and ViT-base, demonstrated competitive performance, achieving an accuracy of 88.6%.

• Integration of explainable artificial intelligence (XAI) techniques into ResNet50 showcased its ability to learn from relevant image regions, thereby enhancing interpretability.

What is known and what is new?

• Skin cancer diagnosis requires accurate and efficient classification methods, and recent advancements in ViT and convolutional neural network (CNN) have shown promising results.

• Integrating XAI with a CNN can enhance trust in automatic skin cancer classification.

• This paper presents an approach to skin cancer classification using ViT, Swin transformer models, CNN, and XAI techniques.

What is the implication, and what should change now?

• Our findings underscore the potential of the ViT, CNN, and XAI models for accurate skin cancer classification, with implications for enhancing clinical decision-making in dermatological healthcare.

• The integration of XAI techniques into deep learning models, as demonstrated by ResNet50, should be prioritized to improve transparency and trustworthiness in clinical adoption.

• Skin cancer classification is a complex task. Therefore, multimodal skin cancer classification by integrating different explainable AI models can enhance the performance of automatic diagnosis.


Introduction

Computerized medical image processing has revolutionized healthcare by providing a comprehensive solution for automatic disease diagnosis. Integrating these methods in medical imaging not only reduces costs and saves time but also bolsters the confidence of medical professionals in patient diagnosis, particularly in the domain of skin cancer (1). Skin cancer, a pervasive and potentially lethal disease affecting millions of people worldwide, arises from abnormal cell growth triggered by excessive ultraviolet (UV) radiation exposure (2). Malignant melanoma (MM) is the deadliest of the various types of skin cancer worldwide. Its incidence varies geographically, impacting white populations most profoundly and affecting men more than women (3). Melanoma diagnosis is uniquely challenging due to the overlap and common characteristics shared with other diseases. Distinguishing between melanoma and nonmelanoma diseases often requires expertise and specialized tools beyond naked-eye examination (4).

Melanoma is commonly classified into progressive stages, from stage 0 to stage IV. At stage 0, melanoma cells are localized in the outermost layer of the skin (epidermis), resulting in a 98% survival rate (5). Progressing to stage I, where the cancer infiltrates the upper (dermal) region of the inner skin layer, the survival rate decreases to 81% (6). At stage II, melanoma advances into the deeper layers of the dermis, correlating with a decrease in the survival rate to 78% (6). Advancing to stage III, cancer cells disseminate beyond the skin to involve the lymphatic nodes, resulting in a survival rate of 63% (5). In stage IV, the cancer extends to distant organs and tissues, such as the heart and brain, yielding a survival rate of only 22% (5). Therefore, early detection and treatment offer high survival rates, whereas advanced stages of melanoma reduce recovery and survival rates (7,8). Moreover, studies indicate the potential for melanoma recurrence within five years of treatment, necessitating close follow-up to ensure the success of therapeutic interventions (9,10).

Historically, melanoma diagnosis has relied heavily on subjective physician assessments, a process inherently susceptible to variations in training, experience, and individual interpretation (11). Dermoscopy, a non-invasive diagnostic technique for skin diseases, has since emerged as the most widely used tool for skin cancer screening and lesion assessment by dermatologists (12). Compared to a naked-eye examination, dermoscopy demonstrably enhances diagnostic accuracy by 10–30% (13). However, its clinical application has revealed a significant reliance on the expertise of doctors, resulting in subjective interpretations and limited reproducibility. Research has revealed that expert dermatologists employing dermoscopy achieve sensitivities of up to 90% and specificities of 59% (14,15), whereas less experienced professionals achieve sensitivities of 62–63% (15,16).

In the early days, melanoma classification relied on low-level image processing techniques that extracted lesion attributes such as morphology, color, and texture. Technological advancements have led to the development of high-resolution cameras to capture clinical training data, faster computers, and graphics processing units (GPUs) for training models. This has also empowered researchers to develop sophisticated algorithms for melanoma classification by improving algorithm performance and refining feature extraction. Several clinical prediction rules and algorithms have been developed for the classification of MM and benign skin lesions. These include the asymmetry, border, color, diameter and evolving (ABCDE) rule (17), the 7-point checklist (18), the three-point checklist (19), and the CASH (color, architecture, symmetry, and homogeneity) algorithm (20).

In recent years, numerous models have emerged for the classification of melanoma-related skin cancer. Despite their variety, these models share a common framework involving input, algorithm, and output stages. The input to a melanoma model is typically a dermoscopic image of the skin lesion. However, capturing images in uncontrolled environments can introduce noise, which may impede training and lead to inaccurate predictions (21). Consequently, image preprocessing techniques, such as hair removal, significantly improve melanoma classification performance (22). Studies have further highlighted the positive impact of image segmentation on melanoma classification accuracy (23,24). This process helps isolate the lesion for training and plays a crucial role in improving classifier accuracy. However, segmentation of clinical images is complex and challenging because of the susceptibility to factors such as noise, low contrast, varying illumination, and irregular lesion boundaries (25). The emergence of potent image-based classification algorithms, such as convolutional neural networks (CNNs), deep learning methods, vision transformers (ViTs), and autoencoders, has significantly improved skin cancer lesion classification while reducing the associated time and costs (26-28). CNNs have been extensively employed in medical image recognition and have demonstrated remarkable accuracy. They have also been effectively utilized in the categorization of skin cancer (29-34).

Ahmadi Mehr and Ameri (35) introduced deep learning methodologies for the categorization of skin cancer based on lesion images, incorporating patient metadata such as gender, age, and lesion location on the body. Their model leveraged a pretrained ResNet-v2 for classification. The training was conducted using a dataset comprising 57,526 dermoscopic images classified into benign and malignant classes, resulting in a model with a reported accuracy of 95%. The findings indicate that the inclusion of metadata positively influences the model’s performance. However, the dataset exhibited a gender imbalance, with a greater representation of males than females, potentially introducing bias into the model. Furthermore, skin cancer classification requires a model that explains why it makes a given decision, and additional research is still needed before such models can be fully applied in clinical practice.

Researchers (36) proposed a deep learning approach that integrates an XAI method for skin lesion classification. They used the pretrained ResNet-18 model and local interpretable model-agnostic explanations (LIME) (37) as the XAI method to enhance trust in the model’s classification results. Another paper (38) proposed a CNN model for skin disease classification and used class activation mapping (CAM) (39) as the XAI method, achieving an accuracy of 82.7%. However, CAM is architecture-dependent and is applicable only when the model has a global average pooling (GAP) layer. Zunair and Ben Hamza (40) also proposed skin cancer classification using a pretrained VGG16 model with CAM as the XAI method, achieving a sensitivity of 91.76% and an AUC of 81.18%. Mayanja et al. (41) proposed skin disease detection and classification using transfer learning and XAI, and concluded that integrating XAI into deep learning improves diagnostic outcomes in dermatology and allows for the mitigation of errors. In general, integrating XAI into skin cancer classification improves the transparency of the decision-making process of black-box deep learning models. Recently, ViTs have been adopted in heart disease detection (42), skin cancer classification (43-45), and breast cancer detection (46), and these studies suggest that ViTs provide the best classification accuracy. However, skin cancer classification remains a complex task due to the diversity of lesions, subtle variations in appearance, and the need for accurate discrimination among different subtypes. Moreover, the application of machine learning and computer vision in dermatology still faces limitations due to factors such as dataset quality, class imbalance, concerns about algorithm bias, and the lack of explainability inherent in certain ML models (47-49).

Therefore, we propose a skin cancer classification model using ViT and XAI-based CNN models to enhance the classification accuracy and explainability. Our findings underscore the potential of the ViT, CNN, and XAI models for accurate skin cancer classification, with implications for enhancing clinical decision-making in dermatological healthcare. The integration of XAI techniques into deep learning models, as demonstrated with ResNet50, shows how our model learns to determine whether an input image is benign or malignant. In general, the contributions of this work are as follows:

  • We present two ViT, three Swin transformer, and five CNN models for skin cancer classification and interpret the best-performing model using three state-of-the-art XAI methods: Grad-CAM, Grad-CAM++, and Score-CAM.
  • We apply SMOTE to address class imbalance issues.
  • We perform a comparative analysis of existing skin cancer classification models with the proposed model.

The rest of the paper is organized as follows. The detailed methodology of the proposed work is presented in Methods. The experimental results and discussion are presented in Results and discussion, and we conclude the paper in Conclusions.


Methods

Data sources

For this study, we use a subset of the International Skin Imaging Collaboration (ISIC) dataset (50), freely available from Kaggle (51), to train and test the ViT and CNN models. The dataset contains a total of 3,297 images, of which 2,637 are in the training set and 660 in the testing set. The data distribution is presented in Table 1 and sample images are presented in Figure 1.

Table 1

Data distribution

| Class | Training | Testing |
|---|---|---|
| Benign | 1,440 | 360 |
| Malignant | 1,197 | 300 |
| Total | 2,637 | 660 |

Figure 1 Sample images in the dataset with benign and malignant classes.

Class imbalance

As Table 1 illustrates, there is a class imbalance in the dataset. To address it, we apply the synthetic minority oversampling technique (SMOTE) (52), an oversampling technique in which synthetic samples are generated for the minority class. SMOTE helps to overcome the overfitting caused by random oversampling. In this study, we apply SMOTE only to the training data, and the class distributions before and after are presented in Figure 2. As shown in the figure, there were 1,197 malignant training images before SMOTE, whereas after SMOTE the number of malignant images matches the 1,440 benign images. Adopting SMOTE for imbalanced data helps to improve classification accuracy and generalizability by avoiding bias in the model (53).

Figure 2 Class distribution before and after SMOTE. SMOTE, synthetic minority oversampling technique.
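The interpolation step at the core of SMOTE can be illustrated with a minimal NumPy sketch. This is not the implementation used in the study: the function name `smote_oversample`, the choice of k = 5 neighbours, and the random 16-dimensional feature vectors are assumptions made for illustration, and the text does not state whether SMOTE was applied to raw pixels or to extracted features. Only the class counts (1,440 benign and 1,197 malignant training images) come from Table 1.

```python
import numpy as np

def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq = (minority ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * minority @ minority.T
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # samples to start from
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

# Hypothetical feature vectors standing in for the malignant training images.
rng = np.random.default_rng(42)
n_benign, n_malignant = 1440, 1197  # training counts from Table 1
malignant = rng.normal(size=(n_malignant, 16))
synthetic = smote_oversample(malignant, n_new=n_benign - n_malignant)
balanced_malignant = np.vstack([malignant, synthetic])
print(synthetic.shape, balanced_malignant.shape)
```

Because every synthetic sample lies on the segment between two existing minority samples, the new points stay inside the region already occupied by the minority class, unlike random oversampling, which only duplicates rows.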


We resized all the images in the dataset to 224×224 pixels, a size that divides evenly into the fixed-size patches (typically 16×16 pixels) used by the transformer models.

Proposed method

In this study, we perform ViT- and CNN-based classification of skin lesions as benign or malignant. We apply transfer learning by fine-tuning pretrained ViT and CNN models: the models were first pretrained on the ImageNet dataset, and we then fine-tuned them for the skin cancer classification task. Finally, we use three CAM-based XAI methods to examine whether the best-performing model learns from the important features.


Recently, ViT has been applied in many computer vision applications and has achieved impressive results in medical image classification tasks (46,54,55). ViT was developed from the basic transformer model used in natural language processing (NLP), where the input is a one-dimensional sequence of word tokens. Because images are two-dimensional, ViT splits each input image into patches and uses those patches as tokens. A patch is a small square region of the image, typically 16×16 pixels (56). Each patch is flattened and mapped to a vector (embedding) that represents its features; in the standard ViT this mapping is a learned linear projection, although the features can also be extracted using a CNN (56). The patch vectors are then passed through transformer encoder layers built around self-attention, which allows the model to learn long-range dependencies between patches. The transformer encoder produces a sequence of vectors that represent the image’s features, and those features are then used to classify the input image. The general architecture of the ViT model is presented in Figure 3.

Figure 3 ViT model architecture for skin cancer classification. MLP, multilayer perceptron; ViT, vision transformer.
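The patch-splitting step can be made concrete with a short NumPy sketch. It assumes a 224×224 RGB input and 16×16 patches, as described in the text; the function name `patchify` is hypothetical, and the learned projection to the hidden size D and the position embeddings are deliberately left out.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles
    and flatten each tile into one token vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

A 224×224 image therefore yields a sequence of 196 tokens, which is the sequence length the self-attention layers operate on (plus a class token in the standard ViT).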

The detailed parameters of each type of ViT model are presented in Table 2.

Table 2

Details of vision transformer model variants

| Models | Hidden size D | MLP size | Heads | Layers |
|---|---|---|---|---|
| ViT-base | 768 | 3,072 | 12 | 12 |
| ViT-large | 1,024 | 4,096 | 16 | 24 |
| ViT-huge | 1,280 | 5,120 | 16 | 32 |

ViT, vision transformer; MLP, multilayer perceptron.

Swin transformer (57) uses shifted windows to improve computational efficiency and adopts a hierarchical design to obtain multi-resolution feature maps. In the Swin transformer, the input images are split into multiple non-overlapping patches, which are converted into embeddings. The Swin transformer blocks are then applied to the patches in four stages, with each successive stage reducing the number of patches to maintain a hierarchical representation. The Swin transformer block contains two sub-modules, window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA), which replace the multi-head attention used in the ViT model. Figure 4 presents the overall architecture of the Swin transformer, and the detailed parameters of each Swin transformer variant are presented in Table 3.

Figure 4 Architecture of Swin transformers. H, height; W, width; C, channel; MLP, multilayer perceptron; SW-MSA, shifted window-based multi-head self-attention; LN, layer normalization; Z, output features; W-MSA, window-based multi-head self-attention.

Table 3

Swin transformer architecture hyper-parameters

| Models | Channel number | Layer numbers |
|---|---|---|
| Swin-T | 96 | {2, 2, 6, 2} |
| Swin-S | 96 | {2, 2, 18, 2} |
| Swin-B | 128 | {2, 2, 18, 2} |
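To show how the four stages shrink the token grid, the following sketch computes the grid size and channel width per stage from the values in Table 3. The 4×4 initial patch size and the rule that each later stage halves the grid side and doubles the channels are taken from the original Swin transformer design (57) and are assumptions here, since this article does not state them; the function name `swin_stage_shapes` is hypothetical.

```python
def swin_stage_shapes(image_size=224, patch_size=4, channels=96,
                      depths=(2, 2, 6, 2)):
    """Return the token grid, token count, channel width, and block count
    for each of the four Swin stages."""
    side = image_size // patch_size  # tokens per side after patch embedding
    dim = channels
    stages = []
    for i, depth in enumerate(depths):
        if i > 0:
            side //= 2  # patch merging joins each 2 x 2 group of tokens
            dim *= 2    # and doubles the channel width
        stages.append({"stage": i + 1, "grid": (side, side),
                       "tokens": side * side, "dim": dim, "blocks": depth})
    return stages

# Swin-T row of Table 3: channel number 96, layer numbers {2, 2, 6, 2}.
swin_t = swin_stage_shapes(channels=96, depths=(2, 2, 6, 2))
for stage in swin_t:
    print(stage)
```

Under these assumptions a 224×224 input goes from a 56×56 token grid in stage 1 to a 7×7 grid in stage 4, which is the multi-resolution hierarchy the text refers to.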

In this study, we utilized two variants of ViT (ViT-base and ViT-large), and three variants of Swin transformer (Swin-small, Swin-base, and Swin-tiny) models.

CNN models

This paper proposes transfer learning-based skin cancer classification using deep learning (DL) models and three visual XAI methods. We use pretrained DL models, namely ResNet18 and ResNet50 (58), VGG19 (59), DenseNet201 (60), and MobileNetV2 (61). Initially, all those models were trained with ImageNet. VGG19 performs well on the ImageNet dataset and has been used in skin cancer classification studies (62). DenseNet201 can handle image variations and is one of the best models for medical image classification (63). MobileNetV2 has a small number of parameters, making it suitable for low-compute settings and embedded devices. For this reason, MobileNetV2 has been used to build lightweight models for many medical image classification tasks with good results (64,65). However, it may not perform well on very complex datasets. ResNet50 avoids vanishing gradients and is a widely used model for skin cancer classification tasks (62,66,67). For these reasons, we used those models as base models for our skin cancer classification problem. The proposed pipeline includes preprocessing, feature extraction and classification, XAI, and evaluation with five metrics.

XAI models

XAI methods play a vital role in developing trust in automatic skin disease classification tasks by showing which part of the image contributes most to classifying an image as benign or malignant. Researchers have developed many XAI models to explain black-box AI. These models include LIME (37), CAM (68), Grad-CAM (69), Score-CAM (70), F-CAM (71), and Grad-CAM++ (72). Those XAI models can provide visual explanations for why deep learning models make their predictions. In addition, XAI methods are helpful for better visual representations and for interpreting the decision-making process of DL models. Applying XAI in medical imaging can also increase transparency and trust by visualizing the logic behind the inferences in a way that is easily understandable to humans. It also enhances confidence in the results of the neural networks (73). This study uses three visual-based XAI models: Score-CAM, Grad-CAM, and Grad-CAM++. As Grad-CAM++ is a variant of Grad-CAM, we describe only Score-CAM and Grad-CAM.

Grad-CAM is a visualization technique for image classification that computes the gradient of a differentiable output, such as a class score, with respect to the convolutional features of a selected layer (74). It weights the 2D activations by the average gradient. Considering a convolutional layer l in a model and a class of interest c, Grad-CAM can be expressed as follows:

$$L^{c}_{\text{Grad-CAM}}(x,y) = \mathrm{ReLU}\left(\sum_{k} w_k^{c}\, A_k(x,y)\right)$$

where $w_k^{c}$ is defined as:

$$w_k^{c} = \frac{1}{N}\sum_{x}\sum_{y}\frac{\partial Y^{c}}{\partial A_k(x,y)}$$

where $A_k(x,y)$ is the activation of node k in the target layer of the model at position (x,y), $Y^{c}$ is the model output score for class c before softmax, and N is the number of spatial positions in the activation map.
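The Grad-CAM computation described above (channel weights given by the spatially averaged gradient, followed by a ReLU over the weighted sum of activations) can be sketched in a few lines of NumPy. The sketch assumes the activations and gradients of the target layer have already been extracted from the network as arrays of shape (K, H, W); the function name `grad_cam`, the toy arrays, and the final scaling to [0, 1] for display are illustrative additions, not part of the study.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (K, H, W) for one image,
    where gradients holds dY_c / dA_k at every spatial position."""
    weights = gradients.mean(axis=(1, 2))             # w_k: average gradient
    cam = np.tensordot(weights, activations, axes=1)  # sum_k w_k * A_k
    cam = np.maximum(cam, 0.0)                        # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale for display only
    return cam

# Toy example with K = 2 feature maps of size 2 x 2.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.array([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In the toy example the second feature map has a negative average gradient, so its active region is suppressed by the ReLU and only the top-left position remains in the heatmap.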

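Score-CAM is a post-hoc visual explanation method and the first gradient-free CAM-based visualization technique; it is reported to achieve better visual performance (70). The Score-CAM map can be defined as follows:

$$L^{c}_{\text{Score-CAM}}(x,y) = \mathrm{ReLU}\left(\sum_{k} w_k^{c}\, A_k(x,y)\right)$$

where $w_k^{c}$ is defined as:

$$w_k^{c} = \operatorname{softmax}_k\left(Y^{c}(M_k) - Y^{c}(X_b)\right)$$

where $A_k(x,y)$ is the activation of node k in the target layer of the model at position (x,y), $Y^{c}(X)$ is the model output score for class c before the softmax function for input X, $X_b$ is a baseline image, and $M_k$ is defined as follows:

$$M_k = s\left(U(A_k)\right) \odot X$$

where U is the upsampling operation, s normalizes the upsampled map to the range [0, 1] as in the original Score-CAM formulation (70), and ⊙ refers to element-wise multiplication.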
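A minimal NumPy sketch of the Score-CAM procedure follows: each activation map is upsampled, normalized, used as a mask on the input, and weighted by the resulting change in class score relative to a baseline. It is not the GitHub implementation used in the study. The stand-in `score_fn`, the single-channel 4×4 image, the all-zero baseline, and the nearest-neighbour upsampling are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def upsample_nearest(a, size):
    """Nearest-neighbour upsampling of a 2D map; assumes size is an
    integer multiple of a.shape."""
    fy, fx = size[0] // a.shape[0], size[1] // a.shape[1]
    return np.kron(a, np.ones((fy, fx)))

def score_cam(image, activations, score_fn, baseline=None):
    """image: (H, W); activations: (K, h, w); score_fn maps an image to
    the pre-softmax score of the class of interest."""
    baseline = np.zeros_like(image) if baseline is None else baseline
    base_score = score_fn(baseline)
    deltas = []
    for a in activations:
        m = upsample_nearest(a, image.shape)
        span = m.max() - m.min()
        m = (m - m.min()) / span if span > 0 else np.zeros_like(m)
        deltas.append(score_fn(m * image) - base_score)  # score increase
    deltas = np.array(deltas)
    w = np.exp(deltas - deltas.max())
    w = w / w.sum()                                      # softmax over k
    cam = np.maximum(np.tensordot(w, activations, axes=1), 0.0)
    return cam, w

# Toy "model": the class score is the sum of the top-left image quadrant.
score_fn = lambda img: float(img[:2, :2].sum())
image = np.ones((4, 4))
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],   # fires on the top-left region
                 [[0.0, 0.0], [0.0, 1.0]]])  # fires on the bottom-right region
cam, weights = score_cam(image, acts, score_fn)
print(weights, cam)
```

The first activation map keeps exactly the region the toy score depends on, so it receives almost all of the weight. No gradients are required, which is what distinguishes Score-CAM from Grad-CAM and Grad-CAM++.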

Experimental setup

In this study, we implement two ViT models and three Swin transformer variants for benign and malignant skin cancer classification by fine-tuning the pretrained models. We then implement five pretrained CNN models and compare their performance with that of the ViT models. Next, we apply three visual XAI methods to ResNet50 and visualize how the model learns to classify an image into benign or malignant classes. Finally, we compare the performance of our model with previous works.

Implementation details

The models are implemented on Google Colab Pro+ with 83.5 GB of system RAM and a 16 GB GPU. The models are trained for 30 epochs with the Adam optimizer and a learning rate of 0.001; these values follow previous work (75). All the models use a batch size of 32.
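For reference, the reported hyperparameters can be collected in a small configuration object, from which the number of optimizer updates follows directly. The `TrainConfig` class and `steps_per_epoch` helper are hypothetical names, and the training-set size of 2,880 assumes that SMOTE raised the malignant class to exactly 1,440 images, as the class-imbalance section implies.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 30            # reported in the text
    batch_size: int = 32        # reported in the text
    learning_rate: float = 1e-3 # reported in the text
    optimizer: str = "adam"     # reported in the text
    image_size: int = 224       # resize target from the preprocessing step

def steps_per_epoch(n_train, cfg):
    """Number of mini-batches needed to see every training image once."""
    return math.ceil(n_train / cfg.batch_size)

cfg = TrainConfig()
n_train = 2 * 1440  # assumed: both classes at 1,440 images after SMOTE
per_epoch = steps_per_epoch(n_train, cfg)
total_updates = per_epoch * cfg.epochs
print(per_epoch, total_updates)  # 90 2700
```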

Performance metrics

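In this study, we use five metrics to validate the robustness of both ViT and CNN models: accuracy, sensitivity, specificity, precision, and F1 score.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.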
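The five metrics can be computed directly from the four confusion-matrix counts using their standard definitions, as in the sketch below. The counts in the example are illustrative only: they were chosen to be consistent with the 300 malignant and 360 benign test images and are not reported in the paper, and malignant is assumed to be the positive class. With these counts the results agree with the ResNet50 row of Table 5 to within 0.1 percentage point.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Illustrative counts (assumed, not reported): 300 malignant, 360 benign.
metrics = classification_metrics(tp=266, tn=320, fp=40, fn=34)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Reporting sensitivity and specificity alongside accuracy matters here because missing a malignant lesion (a false negative) and flagging a benign one (a false positive) have very different clinical costs.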

Results and discussion

Table 4 shows the comparative performance of five models, ViT-b, ViT-L, Swin-s, Swin-b, and Swin-T, on five evaluation metrics for skin cancer classification. As shown in Table 4, ViT-b and ViT-L both achieved a classification accuracy of 88.6%. Swin-b follows closely at 87.8%, with Swin-s and Swin-T slightly behind at 87.7%. Swin-T achieved the highest precision at 90.1%, followed by Swin-s at 89.5% and ViT-L at 89.2%, while ViT-b and Swin-b had precision scores of 88.6% and 86.6%, respectively. In terms of recall, Swin-b took the lead at 86.6%, followed closely by ViT-b at 86.0%; ViT-L, Swin-s, and Swin-T had recall scores of 85.3%, 82.6%, and 82.0%, respectively. ViT-b achieved the highest F1 score at 87.3%, with ViT-L close behind at 87.2%; Swin-b, Swin-s, and Swin-T achieved F1 scores of 86.6%, 85.9%, and 85.8%, respectively. For specificity, Swin-T scored highest at 92.5%, closely followed by Swin-s at 91.9%; ViT-L, ViT-b, and Swin-b had specificity scores of 91.3%, 90.8%, and 88.8%, respectively. In summary, all ViT and Swin models show promising performance across the metrics, and Swin-T emerges as a promising choice when high precision and specificity are the priority. Figure 5 shows a comparative analysis of the models on the test set.

Table 4

Performance of the ViT models in the testing set with five evaluation metrics

| Models | Accuracy (%) | Precision (%) | Recall (sensitivity) (%) | F1 score (%) | Specificity (%) |
|---|---|---|---|---|---|
| ViT-b | 88.6 | 88.6 | 86.0 | 87.3 | 90.8 |
| ViT-L | 88.6 | 89.2 | 85.3 | 87.2 | 91.3 |
| Swin-s | 87.7 | 89.5 | 82.6 | 85.9 | 91.9 |
| Swin-b | 87.8 | 86.6 | 86.6 | 86.6 | 88.8 |
| Swin-T | 87.7 | 90.1 | 82.0 | 85.8 | 92.5 |

ViT, vision transformer.

Figure 5 Performance of ViT models. ViT, vision transformer.

Table 5 provides an overview of the performance of the CNN models (VGG19, ResNet18, ResNet50, DenseNet201, and MobileNetV2) on the five evaluation metrics. Notably, ResNet50 achieves the highest accuracy of 88.8%, with a precision of 86.9%, a recall of 88.6%, an F1 score of 87.8%, and a specificity of 88.9%. Figure 6 presents the comparative performance of the CNN models for benign and malignant classification, and Figure 7 compares the ViT and CNN models in terms of classification accuracy.

Table 5

Performance of the CNN models in the testing set with five evaluation metrics

| Models | Accuracy (%) | Precision (%) | Recall (sensitivity) (%) | F1 score (%) | Specificity (%) |
|---|---|---|---|---|---|
| VGG19 | 85.6 | 81.9 | 87.7 | 84.7 | 83.9 |
| ResNet18 | 82.4 | 80.5 | 81.0 | 80.7 | 83.6 |
| ResNet50 | 88.8 | 86.9 | 88.6 | 87.8 | 88.9 |
| DenseNet201 | 83.2 | 85.1 | 76.3 | 80.5 | 88.9 |
| MobileNetV2 | 85.3 | 83.5 | 84.3 | 83.9 | 86.1 |

CNN, convolutional neural network.

Figure 6 Performance comparison of CNN models. CNN, convolutional neural network.
Figure 7 Performance comparison of both ViT and CNN models with accuracy. ViT, vision transformer; CNN, convolutional neural network.

Figure 8 shows the total number of parameters (in millions) for the CNN models (VGG19, ResNet18, ResNet50, DenseNet201, and MobileNetV2) and the ViT and Swin transformer models. Among the architectures presented, ViT-L has the most parameters (304.2 million), reflecting its complexity, whereas MobileNetV2 has the fewest (3.5 million). With 25.5 million parameters, ResNet50 has a moderate level of complexity compared with models such as ViT-L and Swin-b. ResNet50 thus achieves commendable accuracy while maintaining a manageable level of complexity, making it a practical choice for scenarios in which computational efficiency is as important as accuracy. Furthermore, ResNet50 outperformed the other CNN models (VGG19, ResNet18, DenseNet201, and MobileNetV2) in terms of accuracy.

Figure 8 Total number of parameters in millions. ViT, vision transformer.

Integrating XAI methods into the ResNet50 model contributes to the interpretability and transparency of the model for the classification of benign and malignant skin cancers. We chose ResNet50 because of its high classification accuracy and manageable level of complexity, and we adopted the Score-CAM implementation available on GitHub (76). Grad-CAM, a gradient-based localization technique, generates class activation maps that highlight regions of interest in the input image; applying Grad-CAM to ResNet50 provides a visual representation of the areas crucial to its classification decisions. Grad-CAM++ extends the Grad-CAM methodology by incorporating higher-order derivatives, which refines the localization accuracy. By employing Grad-CAM++, we aim to achieve finer granularity in identifying the critical regions of skin images that influence ResNet50’s classification decisions. Score-CAM further enriches the interpretability spectrum by weighting each activation map according to its contribution to the final prediction score rather than by gradients. The resulting attribution map offers insight into which image regions matter most, facilitating a fine-grained analysis of ResNet50’s decision-making process. Figures 9,10 present benign and malignant inputs with their corresponding Score-CAM visualizations, showing how the ResNet50 model classifies those images as benign or malignant. As indicated in the figures, the model learned from important features, and this can enhance trust in the automatic skin cancer classification task.

Figure 9 Benign class visualization: five benign inputs and Score-CAM visualization for ResNet50 model. Score-CAM, score-weighted class activation mapping.
Figure 10 Malignant class visualization: five malignant inputs and Score-CAM visualization for ResNet50 model. Score-CAM, score-weighted class activation mapping.

Figure 11 shows the visualization of four benign input images using three XAI models and illustrates how the ResNet50 model learns to classify the input images as benign. As shown in the figure, although all three XAI models produce clear visualizations, the model still has shortcomings. For example, in the second input image, the heatmap highlights non-important regions.

Figure 11 Four benign inputs and Grad-CAM, Grad-CAM++ and Score-CAM visualization. Grad-CAM, gradient-weighted class activation mapping; Score-CAM, score-weighted class activation mapping.

Figure 12 shows the Score-CAM, Grad-CAM, and Grad-CAM++ visualizations of the malignant class. In the figure, Score-CAM clearly shows that the ResNet50 model learns from important features to classify the input images as malignant. However, in the third and fourth input images, the visualizations indicate that the model relied on irrelevant regions of the image to predict the input as malignant.

Figure 12 Four malignant inputs and Grad-CAM, Grad-CAM++ and Score-CAM visualization. Grad-CAM, gradient-weighted class activation mapping; Score-CAM, score-weighted class activation mapping.

In general, although all three XAI models give promising results on how the ResNet50 model classifies an image as benign or malignant, there is still a need to improve the performance of the model. Since a small dataset is one of the challenges in medical imaging and results in poor generalizability (77), the performance of our model can be improved by training with additional datasets.

The implementation of ViT models and five CNN models, along with the incorporation of Grad-CAM, Grad-CAM++, and Score-CAM techniques, offers a comprehensive approach to skin cancer classification. These classification and visualization methods contribute to the robustness and interpretability of automatic skin cancer diagnostic systems. ViT models provide a unique perspective for skin classification by capturing global contextual information from skin lesion images, which enables the models to discern subtle patterns crucial for accurate diagnosis. On the other hand, CNN models are good at capturing local features and have been widely applied in medical image classification tasks. The application of these models in skin cancer classification plays a vital role in facilitating the automatic diagnosis of the disease. Furthermore, implementing Grad-CAM, Grad-CAM++, and Score-CAM enhances the interpretability of the decision-making process of the CNN model. These methods generate heatmaps that highlight the regions of input images that contribute the most to the model’s decision to classify images as benign or malignant. This provides clinicians with valuable insights into the features that influence classification decisions made by CNN models. In addition, visualization is crucial in building trust in AI models, as professionals can visually validate the model’s focus areas and ensure alignment with their diagnostic reasoning. However, this study did not evaluate the XAI outputs to verify whether the generated heatmaps are correct. We will collaborate with dermatologists in the future to enhance the applicability of the proposed model in real clinical environments. Moreover, we will enhance explainability by developing a multimodal XAI model.

Performance comparison with previous works

In recent studies on transfer learning for skin cancer classification, various approaches and architectures have been explored to enhance model performance. One study (78) applied transfer learning with AlexNet and achieved a classification accuracy of 87.1%. Another (79) proposed a transfer learning approach and reported accuracies of 86.5%, 81.6%, and 90.9% for VGG16, ResNet50, and Xception, respectively. However, these studies used black-box models and did not incorporate any explainability methods. In contrast, a study (38) used a custom CNN that achieved an accuracy of 82.7% and applied the CAM technique for explainability. However, CAM is architecture-dependent: it only works when the model has a global average pooling (GAP) layer. A further study (80) proposed transfer learning with EfficientNets B0-B7 and achieved an accuracy of 87.91% with EfficientNetB4, without using explainability approaches. Another study (67) focused on transfer learning with ViT and 11 pre-trained CNNs, achieving an accuracy of 92.14% with ViT and 82% with ResNet50, but did not incorporate XAI methods. Our research explored transfer learning with ViT, Swin transformers, and CNNs and achieved accuracies of 88.6% with ViT and 88.8% with ResNet50. Notably, we applied three XAI methods in this study: Score-CAM, Grad-CAM, and Grad-CAM++. These techniques provide insights into the model’s decision-making process, contributing to a more interpretable and transparent understanding of its predictions. A summary of the comparison between our paper and previously published papers is presented in Table 6.

Table 6

Comparative analysis with previous works

Paper | Year | Methods | Accuracy | XAI approach
(78) | 2022 | Transfer learning with AlexNet | 87.1% | Not used
(79) | 2022 | VGG16, ResNet50, and Xception | 86.5% with VGG16, 81.6% with ResNet50, and 90.9% with Xception | Not used
(38) | 2021 | Custom CNN | 82.7% | CAM
(80) | 2022 | Transfer learning with EfficientNets B0-B7 | 87.91% with EfficientNetB4 | Not used
(67) | 2023 | Transfer learning with ViT and CNN | 92.14% with ViT, 82% with ResNet50 | Not used
This paper | 2024 | Transfer learning with ViT, Swin transformers, and CNN | 88.6% with ViT and 88.8% with ResNet50 | Grad-CAM, Grad-CAM++, and Score-CAM

XAI, explainable artificial intelligence; CNN, convolutional neural network; CAM, class activation mapping; ViT, vision transformer; Grad-CAM, gradient-weighted class activation mapping; Score-CAM, score-weighted class activation mapping.


The findings of this study provide valuable insights into the application of deep learning for skin cancer classification and pave the way for advancements in automated dermatological diagnostics. We implemented five ViT models (ViT-b, ViT-L, Swin-s, Swin-b, and Swin-T) and five pre-trained CNN architectures with a transfer learning approach, which provided promising results for benign and malignant skin lesion classification. Among these models, ViT-L and ViT-b demonstrated competitive performance, achieving an accuracy of 88.6%. ResNet50 outperformed the other tested models with an accuracy of 88.8%, precision of 86.9%, sensitivity of 88.6%, F1 score of 87.8%, and specificity of 88.9%. Furthermore, we integrated three CAM-based XAI methods and demonstrated that the model learns from important regions of the input image when classifying it as benign or malignant. Although the overall performance of the models was robust, further fine-tuning or architectural adjustments are required to enhance it. Therefore, in future work, we will test and validate the models with additional data, test different XAI models, optimize hyperparameters to improve the performance of the existing models, and collaborate with dermatologists to evaluate the XAI models.
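For readers who want to relate the reported figures to one another, the sketch below shows how accuracy, precision, sensitivity, specificity, and F1 score are conventionally derived from a binary confusion matrix. It assumes that malignant is treated as the positive class, which this section does not state explicitly; the function names and the confusion-matrix counts are hypothetical and are not the counts from this study.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    tp/fn: positive (assumed malignant) cases predicted correctly/incorrectly.
    tn/fp: negative (assumed benign) cases predicted correctly/incorrectly.
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": f1_score(precision, sensitivity),
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=80, fp=10, tn=90, fn=20)

# F1 recomputed from the precision and sensitivity reported for ResNet50.
resnet50_f1 = f1_score(0.869, 0.886)
```

Applying `f1_score` to the reported precision (86.9%) and sensitivity (88.6%) gives approximately 87.7%, which is close to the reported F1 score of 87.8%; the small gap is plausibly due to rounding of the inputs.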


Funding: None.


Data Sharing Statement: Available at

Peer Review File: Available at

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form. The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This article does not contain any studies with human participants performed by any of the authors. IRB approval and informed consent were waived as no patients were involved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See:


  1. Khanna NN, Maindarkar MA, Viswanathan V, et al. Economics of Artificial Intelligence in Healthcare: Diagnosis vs. Treatment. Healthcare (Basel) 2022;10:2493. [Crossref] [PubMed]
  2. Ahmed B, Qadir MI, Ghafoor S. Malignant Melanoma: Skin Cancer-Diagnosis, Prevention, and Treatment. Crit Rev Eukaryot Gene Expr 2020;30:291-7. [Crossref] [PubMed]
  3. Apalla Z, Lallas A, Sotiriou E, et al. Epidemiological trends in skin cancer. Dermatol Pract Concept 2017;7:1-6. [Crossref] [PubMed]
  4. Cabrera R, Recule F. Unusual Clinical Presentations of Malignant Melanoma: A Review of Clinical and Histologic Features with Special Emphasis on Dermatoscopic Findings. Am J Clin Dermatol 2018;19:15-23. [Crossref] [PubMed]
  5. Melanoma Research Alliance [Internet]. [cited 2023 Nov 15]. Melanoma Survival Rates. Available online:
  6. Melanoma: Your Chances for Recovery (Prognosis) | Saint Luke’s Health System [Internet]. [cited 2024 Feb 15]. Available online:
  7. Nova JA, Sánchez-Vanegas G, Gamboa M, et al. Melanoma risk factors in a Latin American population. An Bras Dermatol 2020;95:531-3. [Crossref] [PubMed]
  8. Tichanek F, Försti A, Hemminki A, et al. Survival in melanoma in the nordic countries into the era of targeted and immunological therapies. Eur J Cancer 2023;186:133-41. [Crossref] [PubMed]
  9. Wan G, Nguyen N, Liu F, et al. Prediction of early-stage melanoma recurrence using clinical and histopathologic features. NPJ Precis Oncol 2022;6:79. [Crossref] [PubMed]
  10. Han D, van Akkooi ACJ, Straker RJ 3rd, et al. Current management of melanoma patients with nodal metastases. Clin Exp Metastasis 2022;39:181-99. [Crossref] [PubMed]
  11. Harkemanne E, Baeck M, Tromme I. Training general practitioners in melanoma diagnosis: a scoping review of the literature. BMJ Open 2021;11:e043926. [Crossref] [PubMed]
  12. Ring C, Cox N, Lee JB. Dermatoscopy. Clin Dermatol 2021;39:635-42. [Crossref] [PubMed]
  13. Vestergaard ME, Macaskill P, Holt PE, et al. Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting. In: Database of Abstracts of Reviews of Effects (DARE): Quality-assessed Reviews [Internet]. Centre for Reviews and Dissemination (UK); 2008.
  14. Masood A, Al-Jumaily AA. Computer aided diagnostic support system for skin cancer: a review of techniques and algorithms. Int J Biomed Imaging 2013;2013:323268. [Crossref] [PubMed]
  15. Menzies SW, Bischof L, Talbot H, et al. The performance of SolarScan: an automated dermoscopy image analysis instrument for the diagnosis of primary melanoma. Arch Dermatol 2005;141:1388-96. [Crossref] [PubMed]
  16. Jartarkar SR, Cockerell CJ, Patil A, et al. Artificial intelligence in Dermatopathology. J Cosmet Dermatol 2023;22:1163-7. [Crossref] [PubMed]
  17. Nachbar F, Stolz W, Merkle T, et al. The ABCD rule of dermatoscopy: High prospective value in the diagnosis of doubtful melanocytic skin lesions. J Am Acad Dermatol 1994;30:551-9. [Crossref] [PubMed]
  18. Argenziano G, Fabbrocini G, Carli P, et al. Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions. Comparison of the ABCD rule of dermatoscopy and a new 7-point checklist based on pattern analysis. Arch Dermatol 1998;134:1563-70. [Crossref] [PubMed]
  19. Soyer HP, Argenziano G, Zalaudek I, et al. Three-point checklist of dermoscopy. A new screening method for early detection of melanoma. Dermatology 2004;208:27-31. [Crossref] [PubMed]
  20. Henning JS, Dusza SW, Wang SQ, et al. The CASH (color, architecture, symmetry, and homogeneity) algorithm for dermoscopy. J Am Acad Dermatol 2007;56:45-52. [Crossref] [PubMed]
  21. Akkoca Gazioğlu BS, Kamaşak ME. Effects of objects and image quality on melanoma classification using deep neural networks. Biomed Signal Process Control. 2021;67:102530. [Crossref]
  22. Delibasis K, Moutselos K, Vorgiazidou E, et al. Automated hair removal in dermoscopy images using shallow and deep learning neural architectures. Comput Methods Programs Biomed Update 2023;4:100109. [Crossref]
  23. Mohakud R, Dash R. Skin cancer image segmentation utilizing a novel EN-GWO based hyper-parameter optimized FCEDN. J King Saud Univ - Comput Inf Sci 2022;34:9889-904. [Crossref]
  24. Salem Ghahfarrokhi S, Khodadadi H, Ghadiri H, et al. Malignant melanoma diagnosis applying a machine learning method based on the combination of nonlinear and texture features. Biomed Signal Process Control 2023;80:104300. [Crossref]
  25. Rizzo S, Botta F, Raimondi S, et al. Radiomics: the facts and the challenges of image analysis. Eur Radiol Exp 2018;2:36. [Crossref] [PubMed]
  26. Wu Y, Chen B, Zeng A, et al. Skin Cancer Classification With Deep Learning: A Systematic Review. Front Oncol 2022;12: [Internet]. [Crossref] [PubMed]
  27. Debelee TG. Skin Lesion Classification and Detection Using Machine Learning Techniques: A Systematic Review. Diagnostics (Basel) 2023;13:3147. [Crossref] [PubMed]
  28. Bhatt H, Shah V, Shah K, et al. State-of-the-art machine learning techniques for melanoma skin cancer detection and classification: a comprehensive review. Intelligent Medicine 2023;3:180-90. [Crossref]
  29. Nie Y, Sommella P, Carratù M, et al. Recent Advances in Diagnosis of Skin Lesions Using Dermoscopic Images Based on Deep Learning. IEEE Access 2022;10:95716-47.
  30. He X, Wang Y, Zhao S, et al. Deep metric attention learning for skin lesion classification in dermoscopy images. Complex Intell Syst 2022;8:1487-504. [Crossref]
  31. S M J. Classification of skin cancer from dermoscopic images using deep neural network architectures. Multimed Tools Appl 2023;82:15763-78. [Crossref] [PubMed]
  32. Shetty B, Fernandes R, Rodrigues AP, et al. Skin lesion classification of dermoscopic images using machine learning and convolutional neural network. Sci Rep 2022;12:18134. [Crossref] [PubMed]
  33. Salma W, Eltrass AS. Automated deep learning approach for classification of malignant melanoma and benign skin lesions. Multimed Tools Appl 2022;81:32643-60. [Crossref]
  34. Hasan MR, Fatemi MI, Monirujjaman Khan M, et al. Comparative Analysis of Skin Cancer (Benign vs. Malignant) Detection Using Convolutional Neural Networks. J Healthc Eng 2021;2021:5895156. [Crossref] [PubMed]
  35. Ahmadi Mehr R, Ameri A. Skin Cancer Detection Based on Deep Learning. J Biomed Phys Eng 2022;12:559-68. [PubMed]
  36. Nigar N, Umar M, Shahzad MK, et al. A Deep Learning Approach Based on Explainable Artificial Intelligence for Skin Lesion Classification. IEEE Access 2022;10:113715-25.
  37. Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. 2016 [cited 2023 Mar 9]; Available online: 10.1145/2939672.2939778
  38. Chowdhury T, Bajwa ARS, Chakraborti T, et al. Exploring the Correlation Between Deep Learned and Clinical Features in Melanoma Detection. In: Papież BW, Yaqub M, Jiao J, Namburete AIL, Noble JA, editors. Medical Image Understanding and Analysis. Cham: Springer International Publishing; 2021. p. 3-17. (Lecture Notes in Computer Science).
  39. Zhou B, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Internet]. Las Vegas, NV, USA: IEEE; 2016 [cited 2023 Apr 8]. p. 2921–9. Available online:
  40. Zunair H, Ben Hamza A. Melanoma detection using adversarial training and deep transfer learning. Phys Med Biol 2020;65:135005. [Crossref] [PubMed]
  41. Mayanja J, Asanda EH, Mwesigwa J, et al. Explainable Artificial Intelligence and Deep Transfer Learning for Skin Disease Diagnosis. In: Shakya S, Tavares JMRS, Fernández-Caballero A, Papakostas G, editors. Fourth International Conference on Image Processing and Capsule Networks. Singapore: Springer Nature; 2023. p. 711-24. (Lecture Notes in Networks and Systems).
  42. Jamil S, Roy AM. An efficient and robust Phonocardiography (PCG)-based Valvular Heart Diseases (VHD) detection framework using Vision Transformer (ViT). Comput Biol Med 2023;158:106734. [Crossref] [PubMed]
  43. Xin C, Liu Z, Zhao K, et al. An improved transformer network for skin cancer classification. Comput Biol Med 2022;149:105939. [Crossref] [PubMed]
  44. He X, Tan EL, Bi H, et al. Fully transformer network for skin lesion analysis. Med Image Anal 2022;77:102357. [Crossref] [PubMed]
  45. Cirrincione G, Cannata S, Cicceri G, et al. Transformer-Based Approach to Melanoma Detection. Sensors (Basel) 2023;23:5677. [Crossref] [PubMed]
  46. Ayana G, Dese K, Dereje Y, et al. Vision-Transformer-Based Transfer Learning for Mammogram Classification. Diagnostics (Basel) 2023;13:178. [Crossref] [PubMed]
  47. Band SS, Yarahmadi A, Hsu CC, et al. Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Informatics in Medicine Unlocked 2023;40:101286. [Crossref]
  48. Herm LV, Heinrich K, Wanner J, et al. Stop ordering machine learning algorithms by their explainability! A user-centered investigation of performance and explainability. International Journal of Information Management 2023;69:102538. [Crossref]
  49. Mittermaier M, Raza MM, Kvedar JC. Bias in AI-based models for medical applications: challenges and mitigation strategies. NPJ Digit Med 2023;6:113. [Crossref] [PubMed]
  50. ISIC [Internet]. [cited 2024 Feb 12]. ISIC | International Skin Imaging Collaboration. Available online:
  51. CLAUDIO FANCONI. Skin Cancer: Malignant vs. Benign [Internet]. [cited 2023 Sep 20]. Available online:
  52. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 2002;16:321-57. [Crossref]
  53. Joloudari JH, Marefat A, Nematollahi MA, et al. Effective Class-Imbalance Learning Based on SMOTE and Convolutional Neural Networks. Appl Sci 2023;13:4006. [Crossref]
  54. Thirunavukarasu R. Towards computational solutions for precision medicine based big data healthcare system using deep learning models: A review. Comput Biol Med 2022;149:106020. [Crossref] [PubMed]
  55. Xiao H, Li L, Liu Q, et al. Transformers in medical image segmentation: A review. Biomed Signal Process Control. 2023;84:104791. [Crossref]
  56. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Internet]. arXiv; 2021 [cited 2023 Dec 25]. Available online:
  57. Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [Internet]. arXiv; 2021 [cited 2023 Dec 25]. Available online: 10.1109/ICCV48922.2021.00986
  58. He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition. 2015 [cited 2023 Apr 8]; Available online:
  59. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014 [cited 2023 Apr 8]; Available online:
  60. Huang G, Liu Z, van der Maaten L, et al. Densely Connected Convolutional Networks. 2016 [cited 2023 Apr 8]; Available online:
  61. Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks [Internet]. arXiv; 2019 [cited 2023 Apr 8]. Available online:
  62. Gairola AK, Kumar V, Sahoo AK. Exploring Multiple Deep learning Models for Skin Cancer Classification. In: 2022 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) [Internet]. 2022 [cited 2024 Jan 2]. p. 805–10. Available online:
  63. Rezvantalab A, Safigholi H, Karimijeshni S. Dermatologist Level Dermoscopy Skin Cancer Classification Using Different Deep Learning Convolutional Neural Networks Algorithms [Internet]. arXiv; 2018 [cited 2024 Jan 2]. Available online:
  64. Indraswari R, Rokhana R, Herulambang W. Melanoma image classification based on MobileNetV2 network. Procedia Comput Sci 2022;197:198-207. [Crossref]
  65. Toğaçar M, Cömert Z, Ergen B. Intelligent skin cancer detection applying autoencoder, MobileNetV2 and spiking neural networks. Chaos Solitons Fractals. 2021;144:110714. [Crossref]
  66. Gouda W, Sama NU, Al-Waakid G, et al. Detection of Skin Cancer Based on Skin Lesion Images Using Deep Learning. Healthcare (Basel) 2022;10:1183. [Crossref] [PubMed]
  67. Arshed MA, Mumtaz S, Ibrahim M, et al. Multi-Class Skin Cancer Classification Using Vision Transformer Networks and Convolutional Neural Network-Based Pre-Trained Models. Information 2023;14:415. [Crossref]
  68. Zhou B, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Internet]. Las Vegas, NV, USA: IEEE; 2016 [cited 2023 Mar 9]. p. 2921-9. Available online:
  69. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In: 2017 IEEE International Conference on Computer Vision (ICCV) [Internet]. Venice: IEEE; 2017 [cited 2023 Mar 9]. p. 618-26. Available online:
  70. Wang H, Wang Z, Du M, et al. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Internet]. Seattle, WA, USA: IEEE; 2020 [cited 2023 Mar 9]. p. 111-9. Available online:
  71. Belharbi S, Sarraf A, Pedersoli M, et al. F-CAM: Full Resolution Class Activation Maps via Guided Parametric Upscaling. 2021 [cited 2023 Mar 9]; Available online:
  72. Chattopadhay A, Sarkar A, Howlader P, et al. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, NV: IEEE; 2018, p. 839-47.
  73. Dhar T, Dey N, Borra S, et al. Challenges of Deep Learning in Medical Image Analysis—Improving Explainability and Trust. IEEE Trans Technol Soc 2023;4:68-75. [Crossref]
  74. He M, Li B, Sun S. A Survey of Class Activation Mapping for the Interpretability of Convolution Neural Networks. In: Wang Y, Liu Y, Zou J, Huo M, editors. Signal and Information Processing, Networking and Computers. Singapore: Springer Nature; 2023. p. 399-407. (Lecture Notes in Electrical Engineering).
  75. Dagnaw GH, Mouthadi ME. Towards Explainable Artificial Intelligence for Pneumonia and Tuberculosis Classification from Chest X-Ray. In: 2023 International Conference on Information and Communication Technology for Development for Africa (ICT4DA) 2023, pp. 55-60.
  76. GitHub [Internet]. [cited 2023 Oct 29]. Score-CAM/ at master tabayashi0117/Score-CAM. Available online:
  77. Patrício C, Neves JC, Teixeira LF. Explainable Deep Learning Methods in Medical Image Classification: A Survey. ACM Computing Surveys 2023;56:1-41. [Crossref]
  78. Ghazal TM, Hussain S, Khan MF, et al. Detection of Benign and Malignant Tumors in Skin Empowered with Transfer Learning. Comput Intell Neurosci 2022;2022:4826892. [Crossref] [PubMed]
  79. Bassel A, Abdulkareem AB, Alyasseri ZAA, et al. Automatic Malignant and Benign Skin Cancer Classification Using a Hybrid Deep Learning Approach. Diagnostics (Basel) 2022;12:2472. [Crossref] [PubMed]
  80. Ali K, Shaikh ZA, Khan AA, et al. Multiclass skin cancer classification using EfficientNets – a first step towards preventing skin cancer. Neurosci Inform. 2022;2:100034. [Crossref]
doi: 10.21037/jmai-24-6
Cite this article as: Dagnaw GH, El Mouhtadi M, Mustapha M. Skin cancer classification using vision transformers and explainable artificial intelligence. J Med Artif Intell 2024;7:14.
