A multimodal approach to the detection and classification of skin diseases
Highlight box
Key findings
• Using only skin disease images, the highest accuracy achieved on the new dataset is 80.1%; combining the images with text data raises the accuracy to 91% across 26 classes.
What is known and what is new?
• Doctors use both image and textual/voice information to diagnose skin diseases. However, the computational diagnosis of skin disease has been explored via image and text separately but never together.
• Proposed a new method that uses common and standard patient information, such as skin images and patient narratives, to predict the patient’s skin disease, leveraging a new large language model finetuning technique for sequence classification tasks, chain of options, that uses chain of thought and precursory information to improve training accuracy.
What is the implication, and what should change now?
• Demonstrated the potential of the proposed model for detecting multiple skin diseases simultaneously, which significantly improves the efficiency and accuracy of skin disease diagnosis.
• Proposed a new multimodal skin disease dataset containing 26 disease classes, offering a more comprehensive and realistic classification challenge.
Introduction
Background
According to PBS, nearly one-third of Americans lack access to primary care services, particularly in underserved or remote areas where access to doctors may be limited, and another forty percent delay seeking care to avoid medical costs (1). As a result, many diseases are left undiagnosed and untreated, even when they show clear physical symptoms. With the rise of artificial intelligence (AI), self-diagnosis and improved disease recognition have become more promising than ever. AI-driven diagnostic systems can potentially improve the accuracy and speed of disease diagnosis, especially for skin diseases. These tools have shown promising results in the diagnosis of skin diseases, with some studies demonstrating superior performance compared to human dermatologists (2). In the near future, by integrating these tools into mobile applications, individuals could capture images of their skin lesions and receive instant feedback or recommendations for further evaluation. This approach has the advantage of reducing wait times for appointments and providing timely guidance to patients who may not have easy access to a doctor.
Current state-of-the-art deep learning algorithms have demonstrated remarkable capabilities in analyzing skin images and identifying patterns indicative of various skin conditions. These models leverage convolutional neural networks (CNNs) to extract features from images and make predictions based on learned patterns. CNNs stack convolutional, pooling, and fully connected layers that extract features from input images and classify them into different disease categories. In particular, several vision models including Very Deep Convolutional Networks (VGGNet) (3), ResNet (4), and EfficientNet (5) have been developed, adapted, and finetuned on various image classification tasks.
Thomsen (6) trained a VGG-16 model on 16,543 non-standardized images categorized with ICD-10 codes related to acne, rosacea, psoriasis, eczema, and cutaneous T-cell lymphoma. This model distinguished between the diseases with accuracy above 77%, which is clinically relevant when compared with the reported dermatological diagnostic accuracy of primary care physicians (48–77%). Abunadi (7) reported that ResNet-50, when applied to skin disease diagnosis using transfer learning, achieved an accuracy of 90% on the ISIC 2018 dataset (8) and 95.8% on the PH2 dataset (9). In fact, many studies have shown that ResNet-based models achieve high accuracy in differentiating between benign and malignant skin lesions, outperforming human dermatologists in certain cases (10,11). Wu (12), using the EfficientNet-B4 model, developed an AI dermatology diagnosis assistant for psoriasis, eczema and atopic dermatitis, and healthy skin. The model was trained on 4,740 clinical images, and its performance was evaluated using expert-confirmed clinical images grouped into three dermatologist-labeled diagnosis classifications, with accuracies of 96%, 89%, and 93%, respectively. Another approach (13) combines the features extracted from the model itself with hand-crafted image features from experts for classification.
In addition to images of the skin disease, textual information related to the skin disease is also crucial for accurate diagnosis, especially when the visual signals of the diseases are very similar. This information can come in many forms such as patient history, patient stories, or clinical metadata such as their age, sex, or physical fitness. The most common form is patient stories. Patient stories or narratives are personal accounts of a person’s experiences, in this case with skin diseases, and are one of the first sources of information that a doctor uses for diagnosing skin conditions. These readily available stories contain information about the symptoms that a patient is experiencing which is especially important as some diseases may look the same from a physical inspection but cause the patient to exhibit different symptoms that are not apparent from an image. However, patient narratives tend to be incomplete and sometimes inaccurate descriptors. For example, a patient presenting with eczema may only describe that they have constant itching and dry, cracked skin but not notice (or mention) the rash that they have which is a clear indicator of eczema. With this information, it is up to the dermatologist to narrow down the diseases and present the most likely one.
With the popularization of large language models (LLMs) such as ChatGPT for answering daily questions, patients have begun to use LLMs for diagnostic advice, much as they would ask a dermatologist (14). Services such as ChatGPT (15) and Claude (16) enable patients to conveniently get answers to their skin-related questions without needing to wait for a doctor’s response. However, diagnoses from these LLMs can be inaccurate, especially since the models were not designed for skin disease classification. To address this issue of specialization, researchers in other fields have proposed several methods such as finetuning or prompting the LLM. The most popular of these techniques is chain of thought (CoT) prompting (17). CoT improves the ability of LLMs to solve complex reasoning tasks by having the LLM produce a series of intermediate reasoning steps before presenting the final result. In this way, the LLM breaks down the complex task into smaller, simpler subtasks that are less error prone, thus increasing accuracy. In the context of disease diagnosis, LLMs prompted with CoT would be able to reason about each individual symptom and then make a final judgment based on that reasoning.
Rationale and knowledge gap
Overall, the success in skin disease diagnosis is largely enabled by the development of large datasets consisting of thousands of annotated images and text. Despite these promising advancements, AI-driven skin disease diagnosis still faces several challenges and limitations that must be addressed before it can be put into clinical practice. One major challenge is the need for large, high-quality datasets with diverse patient populations and skin disease types. Another challenge is the reliance on outdated study designs, which limits studies to only a few diseases or modalities.
Using only images to diagnose skin diseases has been explored in the past, especially with the creation of many skin disease image datasets such as ISIC (18), Dermnet (19), and the ones on Kaggle (20). These open or closed source datasets have enabled researchers to have consistent baselines with which to monitor and track model performance improvements. Each dataset generally comes from one or two sources, such as medical archives, clinical studies, or research repositories, and so contains only a few skin diseases. As such, there is a need for a new dataset that aggregates these separate datasets into one definitive dataset that all researchers can use for model evaluation. Correspondingly, baselines for the most popular networks, including ResNet (4), VGG (3), EfficientNet (5), and ViT (21), need to be run on this dataset. This study accomplishes both of these objectives.
Although LLMs have been used for diagnosing skin diseases in the past, the use case has been limited to prompting, where the model weights are unchanged and only the input to the LLM (the prompt) is changed. This method works to a certain extent but relies on the model already having knowledge about the skin disease and its symptoms. Instead, this work finetunes the LLM by treating skin disease diagnosis as a classification task where the input is a patient narrative and the output is the predicted skin disease class. In this way, the model can acquire domain-specific knowledge about skin diseases, increasing diagnostic accuracy. Additionally, this work extends the chain of thought paradigm to the finetuning step to better exploit this prompt structure’s ability to incrementally narrow down the skin disease from an initial list of candidates while remaining domain-specific.
Overall, combining vision and text data for the task of skin disease classification is a novel approach that has not been extensively explored in the past. This setup more closely aligns with how dermatologists actually classify skin diseases in patients, since it allows for a more holistic understanding of the patient’s condition that cannot be expressed in one modality alone. For example, when using images alone, some skin disease symptoms are not as easily recognizable because of other features of the patient’s skin, such as tattoos or extensive hair growth. In fact, even skin color could affect the diagnostic accuracy. Patient narratives do not suffer from this issue because, regardless of the patient’s skin color, the symptoms described by the patient would be more or less the same. However, these symptom descriptions may not be perfect, which is why dermatologists still need to use images to support their diagnosis.
Along these lines, Cai (22) proposed a dataset that includes skin disease images and clinical metadata, which are fed into a multimodal Transformer. Although this method also combines images and text, the text data it uses is neither as informative about the disease nor as easy to obtain as patient narratives. Patient narratives provide key information for diagnosis, such as non-visual signals (trouble breathing, headaches, etc.), which are not easily apparent from patient metadata. This is especially true if the patient is not forthcoming about their age, sex, or other metadata but is willing to discuss how they are feeling. Overall, current methods that combine image and text information are only evaluated on 10 or fewer diseases (22-25). The prediction accuracy that these experiments achieve is around 84% (26,27). In this work, higher multimodal accuracy is achieved while also predicting more skin disease classes.
Objective
In summary, this method makes the following contributions:
- Proposed an end-to-end system that utilizes common and standard patient information such as affected skin area images as well as patient narratives to predict the skin disease afflicting the patient.
- Proposed a new multimodal skin disease dataset containing 26 disease classes, offering a more comprehensive and realistic classification challenge. Evaluated several models and optimization techniques to determine the optimal configuration for the dataset which surpasses state of the art.
- Introduced a new LLM finetuning technique for sequence classification tasks, chain of options, that utilizes chain of thought and precursory information to improve training accuracy and efficiency.
- Demonstrated the potential of the proposed model for detecting multiple skin diseases simultaneously, which can significantly improve the efficiency and accuracy of skin disease diagnosis.
We present this article in accordance with the CLEAR reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/rc).
Methods
Dataset development
Image data
In this work, the top three skin disease datasets available on Kaggle, a public repository for storing datasets, were used (18-20). These datasets comprise thousands of annotated images collected from diverse sources, including medical archives, research repositories, and clinical studies. The first dataset (20) consists of 27,153 images classified into ten classes of skin diseases. This dataset has a good distribution across the 10 classes and was used by Hammad (23), who focused on only two classes: eczema (1,677 images) and psoriasis (2,055 images). The second dataset (19) includes 19,600 images classified into 23 different types of skin diseases/classes. The third dataset (18), derived from the ISIC competition website, contains 23,618 images classified into 8 types of skin diseases. It is important to note that several classes appear in more than one dataset, such as basal cell carcinoma, which appears in all three. Consequently, after combining the three datasets, the total number of classes in the final dataset is less than the sum of the class counts of the individual datasets.
The datasets were combined because each contains a different style of skin disease image: (18) contains more clinical (zoomed-in and focused) images of the skin diseases, while (19) contains more natural images with suboptimal lighting and visible surrounding skin. This diversity in style (batch effects) improves the dataset’s representativeness of real-life skin disease images and helps models trained on it generalize better. However, the datasets consist primarily of images of Caucasian skin, which limits the dataset’s representativeness in that respect; this is covered in more detail in the Discussion section. Since previous skin disease methods (23,28) also use these datasets, albeit separately, this method aims to improve upon them.
In total, there are approximately 70,000 images across the three datasets. However, some images in the combined dataset are duplicates or not related to skin diseases. Therefore, the images were manually checked to determine their relevancy based on the following criteria:
- The images labeled as a certain disease class are actually of that disease;
- The images must be in RGB not gray-scale;
- The images must be of skin and should not be of internal organs or contain non-skin artifacts such as clothing;
- The images should all be unique and not be perturbations or augmentations of the same image.
After manual checking, an automated check was also performed to remove any duplicate images stored under different file names and thereby prevent data leakage. This was done by computing the MD5 hash of each image and verifying that no hash appears more than once. Whenever the same hash appeared more than once, manual inspection was done to verify that the two images were indeed the same, and one of them was deleted.
After filtering, the final dataset contains 36,995 images across 26 types of skin disease. Unfortunately, the class distribution is imbalanced because the number of images varies widely between diseases. Some diseases have only around 200 images (dermatofibroma has 239), while others have over 7,900 (melanocytic nevi). This highlights one of the challenges of obtaining skin disease images for AI applications, which this study addresses. Table 1 shows the classes and number of images in each dataset used in this study. The images were divided into 80% for training and 20% for validation/testing. This split was varied for each training run based on preset seeds.
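The hash-based duplicate check described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual script; the function names `md5_of_file` and `find_duplicate_groups` and the directory layout are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(image_dir: Path) -> list[list[Path]]:
    """Group files under image_dir whose contents hash to the same MD5 value."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(image_dir.rglob("*")):
        if path.is_file():
            by_hash[md5_of_file(path)].append(path)
    # Only hashes seen more than once are candidates for manual inspection.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group lists files with identical bytes regardless of file name, which can then be inspected manually before deletion. Note that an MD5 comparison only catches byte-identical files; resized or re-encoded copies of the same image would produce different hashes.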
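A seed-dependent 80/20 split of this kind could look like the sketch below. The function name `split_indices` and the seed value are illustrative assumptions; the text does not state how the split was implemented or whether it was stratified by class, so a plain shuffled split is shown.

```python
import random


def split_indices(n_items: int, train_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle indices 0..n_items-1 with a fixed seed and cut them into two parts."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible per seed.
    random.Random(seed).shuffle(indices)
    cut = round(n_items * train_fraction)
    return indices[:cut], indices[cut:]


# 36,995 images, 80% for training; seed 0 is an arbitrary example value.
train_idx, val_idx = split_indices(n_items=36_995, train_fraction=0.8, seed=0)
```

Reusing the same seed reproduces the same partition, while changing the seed yields a different partition of the same sizes (29,596 training and 7,399 validation/testing images for this dataset).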
Table 1
| Class name (skin diseases) | 1st dataset (images) | 2nd dataset (images) | 3rd dataset (images) | Final dataset (images) |
|---|---|---|---|---|
| Acne and rosacea | – | 1,152 | – | 858 |
| Atopic dermatitis | 1,257 | 612 | – | 1,210 |
| BCC | 3,323 | 1,437 | 3,323 | 4,709 |
| BKL | 2,079 | – | 2,624 | 2,065 |
| Bullous disease | – | 561 | – | 528 |
| Cellulitis impetigo and other bacterial infections | – | 361 | – | 352 |
| Eczema | 1,677 | 1,544 | – | 1,553 |
| Exanthems and drug eruptions | – | 505 | – | 467 |
| Hair loss alopecia and other hair diseases | – | 299 | – | 282 |
| Light diseases and disorders of pigmentation | – | 711 | – | 676 |
| Lupus and other connective tissue diseases | – | 525 | – | 511 |
| Melanocytic nevi (NV) | 7,970 | – | 12,896 | 7,967 |
| Melanoma skin cancer nevi and moles | 3,140 | 579 | 4,522 | 3,698 |
| Nail fungus and other nail disease | – | 1,301 | – | 1,163 |
| Poison ivy and other contact dermatitis | – | 325 | – | 308 |
| Psoriasis pictures lichen planus and related diseases | 2,055 | 1,757 | – | 1,791 |
| Scabies Lyme disease and other infestations and bites | – | 539 | – | 479 |
| Seborrheic keratoses and other benign tumors | 1,847 | 1,714 | – | 1,802 |
| Systemic disease | – | 758 | – | 698 |
| Tinea ringworm candidiasis and other fungal infections | 1,702 | 1,625 | – | 1,546 |
| Urticaria hives | – | 265 | – | 261 |
| Vascular tumors | – | 603 | 253 | 845 |
| Vasculitis | – | 521 | – | 510 |
| Warts molluscum and other viral infections | 2,103 | 1,358 | – | 1,849 |
| Squamous cell carcinoma | – | – | 628 | 628 |
| Dermatofibroma | – | – | 239 | 239 |
| Total diseases (classes) | 10 | 22 | 7 | 26 |
| Total images | 27,153 | 19,052 | 23,618 | 36,995 |
BCC, basal cell carcinoma; BKL, benign keratosis-like lesions.
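The class overlap between the source datasets can be checked with a simple set union. The sketch below uses shortened versions of the class names in Table 1 and assigns each class to the datasets in which Table 1 reports images for it; the abbreviated labels are chosen here for readability and are not the datasets' original folder names.

```python
# Class membership per source dataset, following Table 1 (labels abbreviated).
first = {
    "Atopic dermatitis", "BCC", "BKL", "Eczema", "NV", "Melanoma",
    "Psoriasis", "Seborrheic keratoses", "Tinea", "Warts",
}
second = {
    "Acne and rosacea", "Atopic dermatitis", "BCC", "Bullous disease",
    "Cellulitis", "Eczema", "Exanthems", "Hair loss", "Light diseases",
    "Lupus", "Melanoma", "Nail fungus", "Poison ivy", "Psoriasis",
    "Scabies", "Seborrheic keratoses", "Systemic disease", "Tinea",
    "Urticaria", "Vascular tumors", "Vasculitis", "Warts",
}
third = {
    "BCC", "BKL", "NV", "Melanoma", "Vascular tumors",
    "Squamous cell carcinoma", "Dermatofibroma",
}

# Adding the per-dataset counts double-counts every shared class.
naive_total = len(first) + len(second) + len(third)

# The union counts each class once, which is what the merged dataset contains.
merged = first | second | third

# Classes present in all three sources.
shared_by_all = first & second & third
```

With these memberships the naive sum is 39 classes, whereas the union contains 26, matching the final dataset; BCC and melanoma are the classes that occur in all three sources.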
Text data from disease symptoms
In addition to images of skin diseases, symptoms of these diseases are another source of information for determining what ailment a patient has. Information about the common indicators and symptoms of various skin diseases is readily available on the Internet. In this study, symptom data was manually aggregated from several reputable sources such as Mayo Clinic (29), American Academy of Dermatology Association (30), National Institute of Arthritis and Musculoskeletal and Skin Diseases (31), and Cleveland Clinic (32). Table 2 shows examples of three diseases and their symptoms aggregated from the aforementioned sources. This symptom data is used to generate mock “patient stories”, which serve as input data for finetuning LLMs for skin disease classification. Additionally, ChatGPT was prompted to describe the symptoms of each skin disease to determine whether it already has a preliminary understanding of them. Although ChatGPT mentioned a few symptoms of each skin disease, the list of symptoms it described was less thorough and less extensive than the manually aggregated one.
Table 2
| Disease name | Aggregated symptoms | Symptoms from ChatGPT |
|---|---|---|
| BCC | • A shiny, skin-colored bump that’s translucent, meaning you can see a bit through the surface. The bump can look pearly white or pink on white skin. On brown and black skin, the bump often looks brown or glossy black. Tiny blood vessels might be visible, though they may be difficult to see on brown and black skin. The bump may bleed and scab over | BCC is the most common type of skin cancer, usually caused by exposure to UV radiation from sunlight. It typically appears as a pearly or waxy bump on the skin, often with visible blood vessels, or as a flat, scaly, reddish patch. BCC tends to grow slowly and rarely metastasizes, but it can cause disfigurement if left untreated. AK is a precancerous skin lesion caused by long-term exposure to UV radiation. It appears as rough, scaly patches on the skin, commonly found on sun-exposed areas such as the face, scalp, and hands. While AK itself is not cancerous, it has the potential to develop into squamous cell carcinoma if left untreated |
| | • A brown, black or blue lesion (or a lesion with dark spots) with a slightly raised, translucent border | |
| | • A flat, scaly patch with a raised edge. Over time, these patches can grow quite large | |
| | • A white, waxy, scar-like lesion without a clearly defined border | |
| Eczema | • Dry, cracked skin | Eczema, also known as atopic dermatitis, is a chronic inflammatory skin condition characterized by dry, itchy, and red patches of skin. It can occur at any age but is most common in children. The exact cause of eczema is not fully understood, but it is believed to involve a combination of genetic and environmental factors. Triggers for eczema flare-ups can include irritants like soaps and detergents, allergens like dust mites or pet dander, changes in weather or humidity, and stress |
| | • Itchiness (pruritus) | |
| | • Rash on swollen skin that varies in color depending on your skin color | |
| | • Small, raised bumps, on brown or black skin | |
| | • Oozing and crusting | |
| | • Thickened skin | |
| | • Darkening of the skin around the eyes | |
| | • Raw, sensitive skin from scratching | |
| BKL | • A round or oval-shaped waxy or rough bump, typically on the face, chest, a shoulder or the back | BKL encompass a range of skin growths, including seborrheic keratoses and other benign tumors. Seborrheic keratoses are common non-cancerous growths that appear as wart-like or stuck-on lesions on the skin’s surface. They can vary in color, size, and texture, often presenting as brown, black, or tan growths with a waxy, scaly, or rough surface. While seborrheic keratoses are typically harmless, they may be cosmetically bothersome or cause itching. Other benign tumors, such as dermatofibromas or skin tags, may also be categorized under BKL. These growths are generally benign and do not pose a significant health risk, but they may be removed for cosmetic reasons or if they cause discomfort |
| | • A flat growth or a slightly raised bump with a scaly surface, with a characteristic “pasted on” look | |
| | • Varied size, from very small to more than 1 inch (2.5 centimeters) across | |
| | • Varied number, ranging from a single growth to multiple growths | |
| | • Very small growths clustered around the eyes or elsewhere on the face, sometimes called flesh moles or dermatosis papulosa nigra, common on black or brown skin | |
| | • Varied in color, ranging from light tan to brown or black | |
AK, actinic keratosis; BCC, basal cell carcinoma; BKL, benign keratosis-like lesions; UV, ultraviolet.
As most real patient stories contain an incomplete list of symptoms for the patient’s disease, not all of the disease symptoms are used in each data sample when generating mock patient stories. The data samples are generated by prompting ChatGPT to use a random subset of a disease’s symptoms to write a patient narrative as if it were the patient (Table 3). The prompt used for each narrative is also varied to improve the diversity of narrative styles and dialects, since different patients express their symptoms in different ways. These two steps help ensure that the LLM input (patient narratives) is similar to what an actual patient would say, improving the dataset’s representativeness and the subsequent model’s generalizability. Compared with the original lists of disease symptoms, patient narratives are much closer to what a dermatologist would actually have access to when making a diagnosis. Nevertheless, using artificially generated patient narratives has limitations, as these descriptions have not been validated by practicing dermatologists and might not capture the natural variation and complexity of real patient descriptions. This is covered in more detail in the Discussion section. For now, these mock patient narratives serve as a proof-of-concept tool for combining text and vision inputs for skin disease detection, as well as for LLM finetuning techniques.
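The prompt construction step can be sketched as below. The symptom list is the aggregated eczema list from Table 2 and the two templates are taken from the prompts in Table 3; the function name `build_prompt`, the minimum of three symptoms per narrative, and the lowercasing of symptoms are assumptions made for illustration. The sketch only builds the prompt text and does not call ChatGPT.

```python
import random

# Aggregated eczema symptoms, as listed in Table 2.
ECZEMA_SYMPTOMS = [
    "Dry, cracked skin",
    "Itchiness (pruritus)",
    "Rash on swollen skin that varies in color depending on your skin color",
    "Small, raised bumps, on brown or black skin",
    "Oozing and crusting",
    "Thickened skin",
    "Darkening of the skin around the eyes",
    "Raw, sensitive skin from scratching",
]

# Two of the prompt wordings shown in Table 3.
PROMPT_TEMPLATES = [
    "Pretending you are a patient, please construct a one paragraph patient "
    "narrative using these symptoms: {symptoms}",
    "I would like to talk to doctor and would like describe my symptoms: "
    "{symptoms}. please construct a one paragraph patient narrative using "
    "those symptoms",
]


def build_prompt(symptoms: list[str], rng: random.Random,
                 min_symptoms: int = 3) -> tuple[str, list[str]]:
    """Pick a random symptom subset and a random template, and return the prompt.

    min_symptoms is an assumed lower bound; the paper does not state one.
    """
    count = rng.randint(min_symptoms, len(symptoms))
    chosen = rng.sample(symptoms, count)
    template = rng.choice(PROMPT_TEMPLATES)
    prompt = template.format(symptoms=", ".join(s.lower() for s in chosen))
    return prompt, chosen


prompt, chosen = build_prompt(ECZEMA_SYMPTOMS, random.Random(0))
```

Varying both the symptom subset and the template per sample mirrors the two sources of diversity described above: incomplete symptom lists and different narrative styles.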
Table 3
| Eczema disease | Keywords | Prompt (GPT question) | Story (GPT response) |
|---|---|---|---|
| Example 1 | • Dry, cracked skin | Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: dry, cracked skin, itchiness, small, raised bumps, on brown or black skin, oozing and crusting, thickened skin, sensitive skin from scratching | Ugh, my skin has been a total mess lately, and it’s driving me crazy. It’s so dry and cracked that it actually hurts, and the itching is nonstop—I keep scratching without even thinking, which just makes it worse. There’s this rash that looks different depending on the lighting, but on my brown skin, it’s super noticeable. Plus, I’ve got these tiny bumps that sometimes ooze (gross, I know) and then crust over, while other spots feel weirdly thick. And to top it all off, the skin around my eyes has gotten way darker. I have no clue what’s going on, but I need to figure it out ASAP because this is seriously stressing me out |
| | • Itchiness (pruritus) | | |
| | • Rash on swollen skin that varies in color depending on your skin color | | |
| | • Small, raised bumps, on brown or black skin | | |
| | • Oozing and crusting | | |
| | • Thickened skin | | |
| | • Darkening of the skin around the eyes | | |
| | • Raw, sensitive skin from scratching | | |
| 2 | • Dry, cracked skin | Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: dry, cracked skin, itchiness (pruritus), rash on swollen skin that varies in color depending on your skin color | I’ve been grappling with a series of unsettling skin issues that are really affecting my daily life. My skin feels incredibly dry and cracked, and the constant itchiness has become maddening. The appearance of a rash on my swollen skin is particularly troubling, and what’s strange is that it varies in color depending on my skin tone. It’s not just uncomfortable; it’s also visually distressing. I can’t seem to find relief, and I’m anxious to understand what’s causing these symptoms and how I can get some relief for my skin |
| | • Itchiness (pruritus) | | |
| | • Rash on swollen skin that varies in color depending on your skin color | | |
| 3 | • Rash on swollen skin that varies in color depending on your skin color | I would like to talk to doctor and would like describe my symptoms: rash on swollen skin, oozing and crusting and thickened skin. please construct a one paragraph patient narrative using those symptoms | I’ve been experiencing a rash that appears on swollen skin, accompanied by oozing and crusting. The affected area also seems to have thickened skin. The rash is red and raised, and the surrounding skin appears puffy. There’s discharge coming from the rash, with a yellowish color and crust forming over time. The skin feels tougher than usual, with a rough texture. These symptoms started recently and haven’t improved with home remedies. I’m concerned about what might be causing this and would appreciate your insight and guidance |
| | • Oozing and crusting | | |
| | • Thickened skin | | |
| 4 | • Dry, cracked skin | Pretending you go to hospital with your daughter to see the skin doctor, please help write a paragragh to describe her symptoms: cracked skin, Itchiness (pruritus) and raw, sensitive skin from scratching | My daughter has been experiencing cracked skin, relentless itchiness (pruritus), and her skin has become raw and sensitive from scratching when she was infancy. The cracks in her skin are visible, especially on her arms and legs, and they seem to worsen with movement. She constantly complains of itching, which she tries to relieve by scratching, leading to further irritation and discomfort. Her skin has become red and inflamed in areas where she scratches the most, and it’s evident that she’s in significant discomfort. We’re hoping to find relief and a solution to alleviate her symptoms |
| | • Itchiness (pruritus) | | |
| | • Raw, sensitive skin from scratching | | |
| 5 | • Small, raised bumps, on brown or Black skin | Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: small, raised bumps, on brown or black skin, oozing and crusting, thickened skin, darkening of the skin around the eyes | Lord have mercy, I just couldn’t ignore what was happening to his poor skin. Little raised bumps started popping up all over, and some of ‘em were even oozing and crusting. His skin in those spots felt thick and rough, nothing like the smooth complexion he’s always had. But what really got me worried was the darkening around his eyes—it just didn’t look right, like something more serious might be going on. I knew we needed to get some answers, and fast. I’m just praying the dermatologist can tell us what’s wrong and how to fix it |
| | • Oozing and crusting | | |
| | • Thickened skin | | |
| | • Darkening of the skin around the eyes | | |
The final dataset consists of 20 stories per skin disease, for a total of 520 datapoints. The patient stories were divided into 60% for training, 20% for validation, and 20% for testing. Experimentally, training with more than 10 samples per class resulted in little to no model improvement, so 20 stories per skin disease were sufficient for this task. This makes sense, as each disease has only a small number of symptoms, and thus 10 or more training samples sufficiently cover the range of symptoms possible for that disease. Table 3 shows five example stories for eczema.
Image classification
Initial development using baseline dataset
Using the first dataset (20), baseline experiments were run with four different vision models (VGG, ResNet, EfficientNet, and Vision Transformer) to classify 10 different skin diseases (classes).
Image augmentation techniques make it possible to generate synthetic dermatological images that augment existing datasets and enhance the diversity and representativeness of training data. This approach helps mitigate the challenges of dataset imbalance and improves the robustness of machine learning models to variations in skin types, lighting conditions, and camera quality. Several standard techniques are used for image classification tasks, such as color jitter, Gaussian blur, horizontal/vertical flip, and image resizing. These techniques perform well in standard image classification tasks where the target entity can be anywhere in the image. However, in the proposed skin disease dataset, the afflicted area is generally in the center of the image and is orientation agnostic, allowing more fine-grained augmentations to be applied. In this case, the additional augmentations are random cropping and random rotation. In particular, instead of random cropping directly to the desired resolution, the image was first resized to a size slightly larger than the crop size so that most of the relevant skin region is retained. The combination of general and dataset-specific augmentations is referred to as the “final” augmentations. To determine the optimal hyperparameters for the data augmentation techniques, models were trained from scratch on the first dataset (20) using an 80:20 train/validation split. For this and the following experiments, the Adam optimizer was used to optimize the trainable network parameters.
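The resize-then-crop idea can be illustrated by sampling the geometric parameters only. The sketch below does not touch pixel data; the crop size of 224, the 1.15 oversize ratio, the rotation range of ±180 degrees, and the square-resize simplification are all assumed values for the example, since the text does not report the hyperparameters used.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CropRotateParams:
    resized_side: int      # side length after the initial (square) resize
    left: int              # x offset of the crop window in the resized image
    top: int               # y offset of the crop window in the resized image
    crop_side: int         # side length of the final crop
    angle_degrees: float   # rotation applied to the image


def sample_crop_rotate(crop_side: int, rng: random.Random,
                       oversize_ratio: float = 1.15,
                       max_angle: float = 180.0) -> CropRotateParams:
    """Sample a crop window inside a slightly oversized image, plus a rotation."""
    # Resize to slightly more than the crop size so the crop keeps most of the lesion.
    resized_side = max(crop_side, round(crop_side * oversize_ratio))
    slack = resized_side - crop_side
    left = rng.randint(0, slack)
    top = rng.randint(0, slack)
    # Lesions are orientation agnostic, so any rotation angle is acceptable.
    angle = rng.uniform(-max_angle, max_angle)
    return CropRotateParams(resized_side, left, top, crop_side, angle)


params = sample_crop_rotate(crop_side=224, rng=random.Random(0))
```

Because the resized image is only slightly larger than the crop, the crop window can shift by at most the small slack in each direction, so a centered lesion stays largely inside the crop. A smaller oversize ratio keeps more of the original image in every crop at the cost of less positional variety.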
Applying base models to aggregated new image dataset
With baselines for the first dataset (20) established, more classes were incrementally added to form the aggregated new image dataset in order to determine the effect that more classes would have on accuracy. In total, the aggregated new image dataset combines three datasets (18-20) for a total of 26 classes, 16 more than the baseline dataset (10 classes). It is worth noting that datasets (19) and (18) contain fewer images for each disease class, leading to the unbalanced class distribution in the final dataset. This imbalance is unavoidable given that some diseases are rarer than others. Due to the new dataset’s class imbalance, models take longer to converge and get stuck at suboptimal minima more often than on the original dataset, across all models from the baseline study.
Transfer learning comparison
To further combat class imbalance, transfer learning was used. Here, transfer learning refers to freezing 100% of the feature extractor and training only the classifier on the new dataset. In this case, the feature extractor is initialized to ImageNet weights (33). By using transfer learning, the model does not need to learn a feature extractor and a classifier simultaneously. This is especially useful on an imbalanced dataset, as a trainable feature extractor would mainly learn the features relevant to the most common classes. Instead, a pretrained feature extractor already captures which features are important in an image, leaving the classifier to decide which disease these extracted features relate to.
Finetuning of pretrained Resnet model
However, directly using ImageNet-trained feature extractor weights has some issues, because the data distribution of the pretraining dataset, ImageNet, is significantly different from the current use case. While ImageNet contains pictures of dogs, cats, and other easily identifiable objects, the current dataset contains only skin disease images, in which minute differences correspond to vastly different diseases. As a result, what may be a distinguishing feature between a dog and a cat cannot be used to distinguish between melanoma and basal cell carcinoma. To address this, various percentages of the model layers were unfrozen (as compared to completely frozen in the previous experiment) and allowed to train, in order to determine the percentage at which the model achieves the highest accuracy. By doing so, the model can reuse the low-level, class-agnostic features learned through pretraining, such as edges and the outlines of shapes, which are commonly found in the initial layers of the feature extractor. The model then only needs to adjust the later layers of the feature extractor, which are responsible for higher-level features that are specific to the task at hand.
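As a minimal sketch of partial freezing, the function below splits an ordered list of layers into a frozen prefix and a trainable suffix for a given frozen fraction. The layer names are hypothetical, and the sketch assumes the fraction is counted over layers rather than parameters; in a deep learning framework the frozen layers would additionally have gradient updates disabled (for example, by setting `requires_grad = False` on their parameters in PyTorch).

```python
def split_frozen_layers(layer_names, frozen_fraction):
    """Split layers (ordered input to output) into frozen and trainable parts.

    The earliest layers are frozen first, so a fraction of 1.0 corresponds to
    the fully frozen transfer learning setting and 0.0 to full finetuning.
    """
    if not 0.0 <= frozen_fraction <= 1.0:
        raise ValueError("frozen_fraction must be between 0 and 1")
    n_frozen = int(len(layer_names) * frozen_fraction)
    return layer_names[:n_frozen], layer_names[n_frozen:]


# Hypothetical feature-extractor blocks, ordered from input to output.
layers = ["stem", "stage1", "stage2", "stage3",
          "stage4", "stage5", "stage6", "stage7"]

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    frozen, trainable = split_frozen_layers(layers, fraction)
    print(f"{fraction:.0%} frozen -> trainable: {trainable}")
```

Sweeping the fraction in this way mirrors the experiment design: each setting keeps the pretrained early layers fixed and exposes progressively fewer late layers to training.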
Adjusting image resolution
To further improve performance, image resolution was also adjusted. Since higher resolution images offer more visual cues about the skin disease, various models were evaluated at differing resolutions. Based on the findings from the previous section, the first 75% of layers in the feature extractor of each model were frozen. Beginning with the initial resolution of 224×224 pixels, the images were enlarged using bicubic resampling up to 528×528, which is the approximate size of the original, unmodified image. At the initial 224×224 resolution used in previous works, the original image would need to be shrunk by approximately 80%, significantly reducing the number of features available. Increasing the resolution gives the model more features to discern, but this might actually detract from performance since many of these features could be extraneous and unrelated to a disease. In addition, higher resolution images result in longer training times, which may not be feasible under a limited compute budget. Higher resolution images may also not be available to begin with, especially given the scarcity of skin disease images. This study therefore aims to determine the optimal image resolution on this new dataset.
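The "approximately 80%" reduction can be checked with a short calculation, under the assumption that the figure refers to pixel count rather than side length. The helper name below is illustrative.

```python
def pixel_fraction_retained(target_side, source_side):
    """Fraction of pixels kept when a square image is downscaled."""
    return (target_side / source_side) ** 2


# Resolutions evaluated in the study, relative to the ~528x528 originals.
for side in (224, 300, 528):
    kept = pixel_fraction_retained(side, 528)
    print(f"{side}x{side}: keeps {kept:.1%} of the original pixels")
```

Under this reading, a 224×224 input keeps about 18% of the original pixels, which is consistent with the roughly 80% reduction stated above, while 300×300 keeps about a third.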
Text classification
Baseline development
Using the text (patient narratives) part of the newly created dataset, multi-class sequence classification was performed with LLMs, where the LLM must classify the disease given the input patient narrative. Three state-of-the-art LLMs [Llama-7B (34), Falcon-7B (35), Mistral-7B (36)] were finetuned on this dataset. This finetuning step is necessary for the model to learn which specific keywords are associated with a particular disease. Due to the computational constraints associated with such large models, LLMs larger than 7 billion parameters were not used in this study, and the three selected LLMs were finetuned using Low Rank Adaptation (LoRA) (37). By using LoRA, only a small, low rank subset of the model’s parameters (7–10%) needs to be finetuned, which enables the models to fit on consumer-grade GPUs. The dataset itself is split 70:30 train/validation due to its small size.
Initial prompt tuning
Along with the patient narrative text block input for sequence classification, additional supporting information was also added. As LLMs are trained on very large corpora that describe hundreds of diseases, it may be difficult for them to associate the presented symptoms with only the subset used in this study. As such, it makes sense to restrict the set of diseases the LLM must choose from to the subset specified in the study by presenting the available disease options to the LLM in its prompt, as shown in Figure 1.
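A minimal sketch of appending an options list to the patient narrative is given below. The template wording and the function name are hypothetical; the prompt format actually used in the study is the one shown in Figure 1, which is not reproduced here.

```python
def build_prompt(narrative, options=None):
    """Assemble a classification prompt from a narrative and optional options.

    The field labels ("Patient narrative", "Possible diseases", "Diagnosis")
    are illustrative placeholders, not the study's template.
    """
    parts = [f"Patient narrative: {narrative.strip()}"]
    if options:
        # Restrict the problem space by listing the allowed disease classes.
        parts.append("Possible diseases: " + ", ".join(options))
    parts.append("Diagnosis:")
    return "\n".join(parts)


prompt = build_prompt(
    "I have itchy, dry patches on my elbows.",
    options=["eczema", "psoriasis", "warts"],
)
print(prompt)
```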
Chain of options
Inspired by CoT, a popular prompting technique that breaks down a complex reasoning task into intermediate steps for the LLM to take at inference time, Chain of Options breaks down the task during finetuning instead of inference. This allows for more control in the reasoning process as the LLM’s performance could be evaluated during intermediate steps instead of only at the end. Furthermore, since Chain of Options operates during the finetuning step as opposed to the inference step, the method enables the LLM to learn more domain-specific information while CoT must rely on the LLM’s general knowledge.
At a high level, Chain of Options incrementally removes k options from the options list (problem space) based on which diseases are the most unlikely. The options list gradually shrinks until there are fewer than k diseases left, at which point the top prediction is taken as the final prediction, as seen in Figure 2. This “narrowing down” process reframes the task from picking the most likely disease out of 26 options, which can be difficult if many diseases have similar symptoms, into simply determining which k diseases are the least likely. In contrast, the intermediate reasoning steps generated by CoT do not reduce the candidate pool, which is less efficient for classification tasks with a defined set of classes. Conceptually, picking the least likely diseases that a patient narrative could describe, based on a single symptom mismatch, is a simpler task than picking the most likely disease, which would require all symptoms to match. This process also allows for more leniency in case the patient’s symptoms are inconsistent with the actual disease.
During finetuning, random combinations of diseases of varying lengths are concatenated in the same way as the options list was to the input patient narrative. This randomness simulates the various instances of the chain that could potentially propagate through while also adding some noise to the input to reduce overfitting.
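The narrowing-down loop can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the keyword-based scorer stands in for the finetuned LLM, the keyword table is invented for the example, and the loop stops once k or fewer options remain (an assumption made so that the list never becomes empty).

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, List[str]], Dict[str, float]]


def chain_of_options(narrative: str, options: List[str], k: int,
                     score: Scorer) -> str:
    """Repeatedly drop the k least likely options, then return the top one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(options)
    while len(remaining) > k:
        scores = score(narrative, remaining)
        # Rank from most to least likely and remove the k least likely.
        remaining.sort(key=lambda disease: scores[disease], reverse=True)
        remaining = remaining[:-k]
    scores = score(narrative, remaining)
    return max(remaining, key=lambda disease: scores[disease])


# Hypothetical keyword table; a finetuned LLM would supply these scores.
KEYWORDS = {
    "acne": {"pimples", "oily"},
    "eczema": {"itchy", "dry"},
    "psoriasis": {"scaly", "plaques"},
    "warts": {"bumps", "rough"},
    "melanoma": {"mole", "dark"},
}


def keyword_score(narrative: str, options: List[str]) -> Dict[str, float]:
    """Toy scorer: count how many keywords of each disease appear."""
    words = set(narrative.lower().split())
    return {disease: len(KEYWORDS[disease] & words) for disease in options}


prediction = chain_of_options("my skin is itchy and dry",
                              list(KEYWORDS), k=2, score=keyword_score)
print(prediction)
```

Each pass only has to identify clear mismatches, which is the simplification the method relies on; the final choice is made among a much smaller candidate set.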
Combining image and text classification
Along with providing an options list, the output from the image model is also added to the narrative, imitating an “expert” (the LLM) having the final say on what disease a patient has, given initial recommendations (from the image model). To imitate this process, the top-N most likely predictions from the image model are used as an initial prediction and included alongside the patient narrative as input to the LLM. Figure 3 describes the architecture of the model for skin disease classification in this study. Conceptually, this would further decrease the number of potential diseases the LLM must choose from to the top-N. In practice, this is not the case, as the image model is not entirely accurate even when the top-5 predicted diseases are considered. As a result, the LLM must not always assume that the correct disease is within the initial recommended diseases, although it is very likely to be. To imitate this in training, instead of always including the correct disease in the recommendation, there is a chance that the provided list of recommendations does not include the correct disease, so that the correct disease appears in the list at a rate matching the prediction accuracy of the image model.
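One way to simulate imperfect image-model recommendations during training is sketched below. The function name, the uniform sampling of distractors, and the shuffling of the list are assumptions for illustration; `hit_rate` is interpreted here as the image model's top-N accuracy.

```python
import random


def sample_recommendations(true_label, all_labels, n, hit_rate, rng):
    """Draw n recommended diseases that contain the true label with
    probability hit_rate, mimicking an imperfect image model."""
    distractors = [label for label in all_labels if label != true_label]
    if rng.random() < hit_rate:
        picks = [true_label] + rng.sample(distractors, n - 1)
    else:
        # Simulate a miss: the correct disease is absent from the list.
        picks = rng.sample(distractors, n)
    rng.shuffle(picks)  # avoid leaking the answer through its position
    return picks


rng = random.Random(0)
labels = [f"disease_{i}" for i in range(26)]  # placeholder class names
hits = sum(
    "disease_3" in sample_recommendations("disease_3", labels, 3, 0.9, rng)
    for _ in range(1000)
)
print(f"true label included in {hits / 1000:.1%} of sampled lists")
```

Training on lists built this way exposes the LLM to cases where the recommendations are wrong, so it learns to weigh them against the narrative rather than copy them.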
Statistical analysis
In this study, several statistical methods are used to compare classification performance across models, datasets, and configurations. The primary metric was accuracy, defined as the proportion of predictions that correctly matched the actual disease class. Accuracy is further divided into top-1, top-3, and top-5 accuracy, which indicate whether the 1, 3, or 5 highest-ranked predictions of the model contain the actual disease class. Given the class imbalance within the datasets, receiver operating characteristic area under the curve (ROC AUC), precision, recall, and F1-score were also measured. These metrics are less dominated by the majority classes than accuracy and help determine whether a model is overfitting on any particular class. Each experiment was run three times to gauge the statistical significance and robustness of the performance metrics via standard deviation measurements.
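The top-k accuracy and the mean ± standard deviation over repeated runs can be computed as in the sketch below. The example predictions and run accuracies are made-up values for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the k top-ranked guesses."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Each inner list is one sample's predictions, most likely first.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, truth, k):.3f}")

# Hypothetical accuracies from three repeated runs of one configuration.
runs = [0.80, 0.81, 0.79]
print(f"{mean(runs):.3f} ± {stdev(runs):.3f}")
```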
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. IRB approval is not applicable, as the study does not contain human or animal experiments. The study does not require informed consent as all datasets used are open sourced on Kaggle.
Results
The experiments were done on a home computer with an RTX 4090 GPU, Intel Core i9-13900K CPU, and 32 GB DDR5 memory.
Image models for image classification
Dataset development
Baselines for multiple models were established on the initial dataset, the first dataset mentioned in Table 1.
Table 4 shows the effect of adding image augmentation techniques during training for several image models, including Resnet, Efficientnet, and ViT. Several image augmentation techniques were used, beginning with ones commonly used for image classification tasks such as color jitter, Gaussian blur, horizontal/vertical flip, and image resizing. For Resnet-50, these augmentations improved validation accuracy from 85.297%±0.827% to 88.377%±0.666%, as shown in Table 4. Correspondingly, the ROC AUC and F1-score also increased to 0.957±0.011 and 0.552±0.012, respectively. Observing that most of “what is important” is in the center of the image, and that skin diseases are orientation-agnostic, random cropping and random rotation augmentations were added, improving performance to 89.276%±0.242% on the first dataset. Using this augmentation configuration, baselines for Efficientnet-B3, ViT, and VGG-11 were also run (Table 4). The validation accuracies of Efficientnet-B3, ViT, Resnet-50, and VGG-11 were 88.662%±0.303%, 91.201%±0.504%, 89.276%±0.242%, and 45.465%±0.000%, respectively, which were in line with the top performing models within Kaggle (18-20).
Table 4
| Model | Augmentation | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| Resnet-50 | None | 85.297±0.827 | 0.945±0.012 | 0.553±0.018 | 0.534±0.011 | 0.534±0.010 |
| VGG11 | Standard | 46.392±1.311 | 0.505±0.008 | 0.111±0.086 | 0.119±0.011 | 0.084±0.021 |
| Resnet-50 | Standard | 88.377±0.666 | 0.957±0.011 | 0.551±0.011 | 0.556±0.013 | 0.552±0.012 |
| Effnet3 | Standard | 86.110±0.081 | 0.948±0.007 | 0.569±0.033 | 0.552±0.021 | 0.549±0.019 |
| ViT | Standard | 92.456±0.020 | 0.946±0.007 | 0.732±0.002 | 0.700±0.002 | 0.709±0.002 |
| VGG11 | Final | 45.465±0.000 | 0.500±0.000 | 0.051±0.000 | 0.111±0.000 | 0.069±0.000 |
| Resnet-50 | Final | 89.276±0.242 | 0.955±0.002 | 0.644±0.020 | 0.615±0.015 | 0.622±0.018 |
| Effnet3 | Final | 88.662±0.303 | 0.965±0.007 | 0.609±0.028 | 0.588±0.019 | 0.591±0.022 |
| ViT | Final | 91.201±0.504 | 0.964±0.009 | 0.725±0.012 | 0.690±0.009 | 0.702±0.003 |
Data are presented as mean ± standard deviation. AUC, area under the curve; ROC, receiver operating characteristic.
To test the effect of adding more classes to the dataset, an experiment was run in which two intermediary datasets with 15 and 20 classes, respectively, were evaluated with the four models and the final augmentations mentioned above. As the number of classes increased, accuracy decreased for all four models, while ROC AUC, precision, recall, and F1-score changed comparatively little for Resnet-50 and Efficientnet-B3 until the final 26-class dataset. Lastly, these four base models were evaluated on the final aggregated new image dataset, which contains 36,995 images spanning 26 classes. Their results are compared to the datasets with 9, 15, and 20 classes in Table 5. Across the board, model performance decreased, with VGG-11 and ViT failing to converge (as reflected in their accuracy, ROC AUC, and F1-scores) and Resnet-50 and Efficientnet-B3 accuracy dropping by more than 20 percentage points to 66.608%±2.857% and 60.675%±1.718%, respectively. Efficientnet-B3 ROC AUC dropped by 0.04 and F1-score dropped by 0.12. Resnet-50 ROC AUC dropped by 0.01 and F1-score dropped by 0.08.
Table 5
| Model | Dataset (classes) | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| VGG11 | 9 | 45.465±0.000 | 0.500±0.000 | 0.051±0.000 | 0.111±0.000 | 0.069±0.000 |
| Resnet-50 | 9 | 89.276±0.242 | 0.955±0.002 | 0.644±0.020 | 0.615±0.015 | 0.622±0.018 |
| Effnet3 | 9 | 88.662±0.303 | 0.965±0.007 | 0.609±0.028 | 0.588±0.019 | 0.591±0.022 |
| ViT | 9 | 91.201±0.504 | 0.964±0.009 | 0.725±0.012 | 0.690±0.009 | 0.702±0.003 |
| VGG11 | 15 | 26.234±0.000 | 0.500±0.000 | 0.017±0.000 | 0.067±0.000 | 0.028±0.000 |
| Resnet-50 | 15 | 74.589±0.047 | 0.960±0.002 | 0.653±0.003 | 0.655±0.001 | 0.651±0.002 |
| Effnet3 | 15 | 72.326±1.222 | 0.949±0.002 | 0.629±0.017 | 0.601±0.008 | 0.610±0.010 |
| ViT | 15 | 73.947±0.023 | 0.958±0.002 | 0.638±0.003 | 0.669±0.007 | 0.649±0.002 |
| VGG11 | 20 | 23.632±0.000 | 0.500±0.000 | 0.012±0.000 | 0.050±0.000 | 0.019±0.000 |
| Resnet-50 | 20 | 71.542±0.178 | 0.953±0.001 | 0.626±0.013 | 0.611±0.000 | 0.615±0.007 |
| Effnet3 | 20 | 70.445±1.373 | 0.954±0.000 | 0.623±0.004 | 0.607±0.012 | 0.607±0.004 |
| ViT | 20 | 68.428±0.849 | 0.949±0.003 | 0.570±0.018 | 0.609±0.007 | 0.582±0.009 |
| VGG11 | 26 | 21.508±0.007 | 0.500±0.000 | 0.008±0.000 | 0.038±0.000 | 0.014±0.000 |
| Resnet-50 | 26 | 66.608±2.857 | 0.942±0.008 | 0.556±0.032 | 0.541±0.043 | 0.542±0.040 |
| Effnet3 | 26 | 60.675±1.718 | 0.921±0.006 | 0.490±0.028 | 0.461±0.020 | 0.468±0.021 |
| ViT | 26 | 21.500±0.000 | 0.500±0.000 | 0.008±0.000 | 0.038±0.000 | 0.014±0.000 |
Data are presented as mean ± standard deviation. AUC, area under the curve; ROC, receiver operating characteristic.
Optimizing model for new dataset
Transfer learning
Since VGG-11 and ViT failed to converge when trained from scratch on this dataset, the transfer learning paradigm was used. Table 6 compares the highest accuracy, and the epoch at which it was achieved, for models that were initialized with ImageNet weights (33) and then trained fully versus models also initialized with ImageNet weights but with 100% of the feature extractor frozen (transfer learning). Freezing the pretrained feature extractor led to lower performance (accuracy, ROC AUC, F1-score) for models that were able to converge when fully trained, as was the case for the Resnet and Efficientnet models. However, using the fixed pretrained weights improved performance for the ViT model.
Table 6
| Model | Training | Epochs | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| VGG11 | From scratch | 12±14 | 21.508±0.007 | 0.500±0.000 | 0.008±0.000 | 0.038±0.000 | 0.014±0.000 |
| VGG16 | From scratch | 129±193 | 38.970±19.483 | 0.742±0.221 | 0.167±0.205 | 0.170±0.168 | 0.154±0.183 |
| Resnet18 | From scratch | 189±44 | 65.250±1.183 | 0.941±0.002 | 0.551±0.008 | 0.537±0.021 | 0.539±0.017 |
| Resnet-50 | From scratch | 325±69 | 66.608±2.857 | 0.942±0.008 | 0.556±0.032 | 0.541±0.043 | 0.542±0.040 |
| Resnet152 | From scratch | 195±15 | 66.334±2.731 | 0.945±0.007 | 0.557±0.028 | 0.548±0.038 | 0.546±0.036 |
| Effnet0 | From scratch | 204±23 | 64.251±2.027 | 0.934±0.005 | 0.522±0.029 | 0.517±0.035 | 0.513±0.033 |
| Effnet3 | From scratch | 172±17 | 60.675±1.718 | 0.921±0.006 | 0.490±0.028 | 0.461±0.020 | 0.468±0.021 |
| Effnet6 | From scratch | 25±6 | 21.508±0.007 | 0.500±0.000 | 0.008±0.000 | 0.038±0.000 | 0.014±0.000 |
| ViT | From scratch | 8±7 | 21.500±0.000 | 0.500±0.000 | 0.008±0.000 | 0.038±0.000 | 0.014±0.000 |
| VGG11 | Transfer learning | 6±8 | 29.447±3.149 | 0.782±0.018 | 0.079±0.021 | 0.120±0.016 | 0.077±0.017 |
| VGG16 | Transfer learning | 4±4 | 29.834±0.612 | 0.775±0.026 | 0.100±0.024 | 0.117±0.016 | 0.080±0.010 |
| Resnet18 | Transfer learning | 81±68 | 45.601±1.046 | 0.902±0.002 | 0.391±0.067 | 0.313±0.022 | 0.283±0.020 |
| Resnet-50 | Transfer learning | 66±27 | 54.822±4.473 | 0.925±0.013 | 0.406±0.050 | 0.428±0.059 | 0.401±0.063 |
| Resnet152 | Transfer learning | 87±54 | 53.923±5.210 | 0.924±0.012 | 0.414±0.041 | 0.421±0.071 | 0.391±0.071 |
| Effnet0 | Transfer learning | 125±82 | 53.167±1.099 | 0.903±0.006 | 0.406±0.016 | 0.460±0.014 | 0.418±0.017 |
| Effnet3 | Transfer learning | 197±70 | 50.603±0.490 | 0.904±0.003 | 0.380±0.008 | 0.437±0.004 | 0.392±0.008 |
| Effnet6 | Transfer learning | 158±58 | 44.193±0.405 | 0.877±0.005 | 0.319±0.012 | 0.372±0.006 | 0.326±0.010 |
| ViT | Transfer learning | 187±106 | 66.325±1.700 | 0.942±0.001 | 0.533±0.027 | 0.570±0.012 | 0.545±0.022 |
Data are presented as mean ± standard deviation. AUC, area under the curve; ROC, receiver operating characteristic.
Finetuning of pretrained model
To further increase model performance, the proportion of the pretrained model that is frozen was also adjusted ranging from 0% to 100% frozen. For this task, the Resnet-50 model was chosen as it was the best performing model from the previous experiment between fully trained and 100% frozen feature extractor layers. From this experiment, it was determined that 75% frozen parameters is optimal as shown in Figure 4 with an accuracy of 75.843%±2.387%.
Using the 75% frozen parameters configuration found in the previous experiment, image resolution experiments were also run for the evaluated model types, as seen in Table 7. Three image resolutions were tried: 224×224, 300×300, and 528×528. The top performing model was found to be Resnet-50 with images at 300×300 resolution. This model achieved 80.28%±0.23% top-1 accuracy, which is roughly 7 percentage points higher than the baseline model from Kaggle (72.99%±0.86%). Correspondingly, the ROC AUC and F1-score were also higher than the baseline. Further analysis shows that this model achieves 91.97%±0.11% and 95.20%±0.04% top-3 and top-5 accuracy, respectively. Beyond 300×300, the accuracy improvement saturates, as model accuracy is about the same at 300×300 and 528×528 resolution.
Table 7
| Model | Image size | Top-1 (%) | Top-3 (%) | Top-5 (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|---|
| Effnet0 | 224×224 | 70.23±2.92 | 88.06±0.77 | 92.78±0.59 | 0.96±0.00 | 0.58±0.03 | 0.61±0.02 | 0.59±0.03 |
| ViT | 224×224 | 76.14±0.55 | 88.48±0.32 | 92.23±0.71 | 0.95±0.00 | 0.67±0.00 | 0.64±0.00 | 0.65±0.00 |
| VGG11 | 224×224 | 44.47±0.89 | 63.68±0.93 | 71.61±0.59 | 0.87±0.00 | 0.28±0.02 | 0.29±0.02 | 0.23±0.01 |
| Resnet-50 | 224×224 | 75.84±2.38 | 89.61±1.10 | 93.50±0.72 | 0.95±0.00 | 0.67±0.02 | 0.65±0.03 | 0.66±0.02 |
| Baseline | 300×300 | 72.99±0.86 | 87.71±1.04 | 91.98±0.77 | 0.96±0.00 | 0.62±0.00 | 0.61±0.01 | 0.61±0.00 |
| Resnet-50 | 300×300 | 80.28±0.23 | 91.97±0.11 | 95.20±0.04 | 0.97±0.00 | 0.72±0.00 | 0.70±0.00 | 0.71±0.00 |
| Efficientnet3 | 300×300 | 76.03±0.23 | 89.97±0.17 | 93.80±0.12 | 0.96±0.00 | 0.65±0.00 | 0.66±0.00 | 0.65±0.00 |
| Effnet0 | 300×300 | 76.07±0.70 | 90.11±0.12 | 94.00±0.13 | 0.96±0.00 | 0.65±0.01 | 0.66±0.00 | 0.65±0.00 |
| Resnet-50 | 528×528 | 79.62±0.64 | 91.82±0.19 | 95.07±0.17 | 0.97±0.00 | 0.70±0.00 | 0.69±0.00 | 0.70±0.00 |
| Effnet6 | 528×528 | 77.07±0.68 | 90.46±0.38 | 94.05±0.21 | 0.97±0.00 | 0.67±0.01 | 0.68±0.00 | 0.67±0.01 |
Data are presented as mean ± standard deviation. AUC, area under the curve; ROC, receiver operating characteristic.
In conclusion, after optimizing the Resnet-50 model through image augmentation, finetuning, and image resolution adjustment, the model with an image resolution of 300×300 and 75% of its pretrained feature extractor frozen was determined to have the highest top-1, top-3, and top-5 accuracy on the new image dataset (36,995 images and 26 classes). The top-1 predictions of the model are shown in Figure 5.
LLM for text classification
Table 8 shows the baseline accuracies of three LLMs (Mistral, Llama, and Falcon) with and without the Option List prompt addition, which explicitly lists in the prompt all 26 disease classes contained within this dataset. In addition, the prompt contains a varying number of preliminary disease classification predictions, ranging from 0 to 5 class predictions, which emulates the preliminary predictions coming from the image model. Adding the Option List prompt improved classification performance for all three models. For Mistral, accuracy improved from 45.005%±21.959% to 81.306%±3.788%. For Llama, accuracy improved from 68.447%±4.968% to 83.383%±1.360%. For Falcon, accuracy improved from 58.160%±10.411% to 74.876%±7.222%. However, when the Option List and Top-N prediction prompt augmentations are combined, performance tends to decrease; for example, Falcon Top-5 accuracy was 56.973%±9.025% with the Option List and 80.119%±1.652% without it. Using Top-N alone, performance also increases to a level comparable to Option List alone. Adding only the Top-1 prediction yielded the largest improvement, while Top-3 and Top-5 yielded slightly less.
Table 8
| Model | Options list | Augmentation | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Mistral | Y | N/A | 81.306±3.788 | 0.991±0.004 | 0.842±0.028 | 0.814±0.038 | 0.815±0.038 |
| Mistral | Y | Pred1 | 92.878±1.853 | 0.998±0.001 | 0.938±0.014 | 0.929±0.018 | 0.929±0.019 |
| Mistral | Y | Pred3 | 85.757±3.992 | 0.993±0.002 | 0.872±0.035 | 0.858±0.040 | 0.856±0.042 |
| Mistral | Y | Pred5 | 84.372±2.928 | 0.994±0.002 | 0.872±0.011 | 0.844±0.029 | 0.844±0.033 |
| Llama | Y | N/A | 83.383±1.360 | 0.992±0.001 | 0.847±0.012 | 0.834±0.014 | 0.832±0.014 |
| Llama | Y | Pred1 | 94.560±1.739 | 1.000±0.000 | 0.953±0.016 | 0.946±0.017 | 0.946±0.017 |
| Llama | Y | Pred3 | 87.735±4.242 | 0.997±0.003 | 0.887±0.044 | 0.878±0.042 | 0.877±0.044 |
| Llama | Y | Pred5 | 84.174±2.147 | 0.993±0.004 | 0.854±0.021 | 0.842±0.021 | 0.841±0.022 |
| Falcon | Y | N/A | 74.876±7.222 | 0.975±0.013 | 0.763±0.072 | 0.749±0.072 | 0.744±0.073 |
| Falcon | Y | Pred1 | 87.141±5.047 | 0.994±0.005 | 0.886±0.041 | 0.872±0.051 | 0.871±0.050 |
| Falcon | Y | Pred3 | 65.084±10.638 | 0.957±0.031 | 0.673±0.092 | 0.651±0.107 | 0.646±0.106 |
| Falcon | Y | Pred5 | 56.973±9.025 | 0.943±0.025 | 0.602±0.090 | 0.570±0.090 | 0.570±0.088 |
| Mistral | N | N/A | 45.005±21.959 | 0.882±0.092 | 0.487±0.230 | 0.450±0.220 | 0.434±0.233 |
| Mistral | N | Pred1 | 92.681±0.747 | 0.998±0.001 | 0.939±0.004 | 0.927±0.007 | 0.926±0.007 |
| Mistral | N | Pred3 | 90.504±2.077 | 0.998±0.001 | 0.921±0.021 | 0.905±0.021 | 0.905±0.020 |
| Mistral | N | Pred5 | 87.043±4.405 | 0.995±0.005 | 0.877±0.045 | 0.871±0.044 | 0.867±0.047 |
| Llama | N | N/A | 68.447±4.968 | 0.968±0.004 | 0.705±0.044 | 0.685±0.049 | 0.682±0.045 |
| Llama | N | Pred1 | 94.955±1.360 | 0.999±0.001 | 0.955±0.013 | 0.950±0.014 | 0.949±0.014 |
| Llama | N | Pred3 | 94.560±2.187 | 0.999±0.000 | 0.955±0.017 | 0.946±0.022 | 0.946±0.022 |
| Llama | N | Pred5 | 93.175±3.126 | 0.999±0.001 | 0.939±0.028 | 0.932±0.031 | 0.931±0.032 |
| Falcon | N | N/A | 58.160±10.411 | 0.925±0.036 | 0.595±0.102 | 0.582±0.105 | 0.576±0.103 |
| Falcon | N | Pred1 | 94.362±2.355 | 0.999±0.001 | 0.949±0.022 | 0.944±0.023 | 0.942±0.025 |
| Falcon | N | Pred3 | 84.471±3.255 | 0.994±0.002 | 0.861±0.037 | 0.845±0.033 | 0.843±0.035 |
| Falcon | N | Pred5 | 80.119±1.652 | 0.988±0.003 | 0.814±0.016 | 0.802±0.016 | 0.799±0.017 |
Data are presented as mean ± standard deviation. AUC, area under the curve; N, no; N/A, not applicable; ROC, receiver operating characteristic; Y, yes.
Replacing the static list of disease classes with Chain of Options can further improve LLM performance, as seen in Table 9. By combining Chain of Options and Top-N predictions, accuracy for the Mistral (94.560%±1.042%) and Falcon (93.472%±1.805%) models outperformed Option List and Top-N predictions, whether combined or used separately. The performance for Llama (94.461%±2.972%) was about the same as with Top-N predictions alone (94.955%±1.360%).
Table 9
| Model | Augmentation | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| Mistral | N/A | 80.613±3.439 | 0.989±0.002 | 0.830±0.034 | 0.807±0.034 | 0.802±0.038 |
| Mistral | Pred1 | 94.560±1.042 | 0.999±0.001 | 0.952±0.009 | 0.946±0.010 | 0.944±0.011 |
| Mistral | Pred3 | 91.494±0.685 | 0.998±0.000 | 0.924±0.004 | 0.915±0.007 | 0.913±0.009 |
| Mistral | Pred5 | 87.735±3.577 | 0.994±0.004 | 0.892±0.027 | 0.878±0.036 | 0.875±0.036 |
| Llama | N/A | 90.010±3.045 | 0.998±0.001 | 0.913±0.027 | 0.900±0.030 | 0.900±0.031 |
| Llama | Pred1 | 94.461±2.972 | 0.999±0.001 | 0.951±0.025 | 0.945±0.030 | 0.944±0.030 |
| Llama | Pred3 | 92.087±2.247 | 0.999±0.001 | 0.928±0.020 | 0.921±0.022 | 0.921±0.022 |
| Llama | Pred5 | 90.702±1.813 | 0.998±0.001 | 0.918±0.017 | 0.907±0.018 | 0.907±0.018 |
| Falcon | N/A | 84.273±3.012 | 0.992±0.005 | 0.855±0.022 | 0.843±0.030 | 0.841±0.031 |
| Falcon | Pred1 | 93.472±1.805 | 0.999±0.001 | 0.946±0.008 | 0.935±0.018 | 0.936±0.015 |
| Falcon | Pred3 | 88.131±3.210 | 0.995±0.002 | 0.892±0.025 | 0.882±0.032 | 0.879±0.033 |
| Falcon | Pred5 | 87.735±5.787 | 0.993±0.007 | 0.890±0.050 | 0.878±0.058 | 0.877±0.057 |
Data are presented as mean ± standard deviation. AUC, area under the curve; N/A, not applicable; ROC, receiver operating characteristic.
Combining vision model and LLM for skin disease diagnosis
Table 10 shows the final results after combining the optimal configurations for the vision model and the LLM. Using Resnet-50, images alone achieve only 80.28%±0.23% accuracy on the new dataset, as discussed in the previous sections. However, by combining the top performing image model (Resnet-50) and the best LLM (Llama-7B), a final skin disease classification accuracy of 93.927% was achieved, with the image model recommending the top-1 candidate disease and the LLM receiving this candidate in its input along with Chain of Options finetuning. This exceeds the performance of ChatGPT with CoT, which only achieves 76.261% accuracy. Notably, using just Chain of Options, Llama was able to achieve 90.010%±3.045% accuracy, which also exceeds that of ChatGPT.
Table 10
| Model | Prediction | Training method | Accuracy (%) | ROC AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Resnet + Llama | 1 | Chain of Options | 93.927 | 1.000 | 0.903 | 0.920 | 0.901 |
| Resnet + Llama | 1 | Normal | 90.000 | 0.997 | 0.877 | 0.899 | 0.879 |
| Resnet + Llama | 3 | Chain of Options | 93.900 | 1.000 | 0.935 | 0.922 | 0.921 |
| Resnet + Llama | 3 | Normal | 90.013 | 0.997 | 0.862 | 0.872 | 0.842 |
| Resnet + Llama | 5 | Chain of Options | 87.409 | 0.999 | 0.874 | 0.885 | 0.865 |
| Resnet + Llama | 5 | Normal | 85.776 | 0.997 | 0.855 | 0.875 | 0.835 |
| ChatGPT + chain of thought prompting | N/A | N/A | 76.261 | 0.876 | 0.785 | 0.763 | 0.747 |
AUC, area under the curve; N/A, not applicable; ROC, receiver operating characteristic.
In summary, the Resnet-50 model was applied to the new image dataset and the top-1 validation accuracy was 66.608%±2.857% for 26 types of skin disease diagnosis. After further optimizing the model through various means such as finetuning, top-1 accuracy was improved to 80.28%±0.23%. The best LLM, by itself, was able to achieve 68.447%±4.968% accuracy; adding Chain of Options training improved performance to 90.010%±3.045%. By combining the top performing image model with the top LLM, a final accuracy of 93.927% was achieved (Table 11).
Table 11
| Model | Accuracy (%) | ROC AUC |
|---|---|---|
| Vanilla Resnet-50 | 66.61±2.86 | 0.94±0.008 |
| Optimized Resnet-50 | 80.28±0.23 | 0.97±0.00 |
| Vanilla Llama | 68.45±4.97 | 0.97±0.004 |
| CoO Llama | 90.01±3.05 | 0.998±0.001 |
| Resnet-50 + CoO Llama | 93.927 | 1.0 |
Data are presented as mean ± standard deviation. AUC, area under the curve; ROC, receiver operating characteristic.
A confusion matrix of the final accuracy can be seen in Figure 6.
Discussion
Key findings
Image classification
From experiments done on the baseline dataset, it was determined that standard image models such as Efficientnet, Resnet, ViT, and VGG are able to classify skin diseases on a dataset containing nine skin diseases. By adding various augmentations, including ones that are skin disease specific, accuracy was further improved to 45.465%±0.000%, 89.276%±0.242%, 88.662%±0.303%, and 91.201%±0.504% for VGG-11, Resnet-50, Efficientnet-B3, and ViT, respectively, as seen in Table 4. The best accuracy achieved was 91.201%±0.504% using ViT. To test the effect that increasing the number of skin diseases to classify would have on accuracy, an experiment (Table 5) was run in which the number of skin disease classes in the dataset was increased from 9 to 15 to 20 and finally to 26, the number of classes in the final dataset. With more classes and the same augmentations, accuracy decreased across all four models to 21.508%±0.007%, 66.608%±2.857%, 60.675%±1.718%, and 21.500%±0.000% for VGG-11, Resnet-50, Efficientnet-B3, and ViT, respectively. The ROC AUC, precision, recall, and F1-scores of the models also decreased with the addition of new classes. This shows that, by consolidating multiple sources into a unified, large-scale dataset, this dataset provides a more realistic benchmark that better reflects the complexities of real-world skin disease diagnosis.
To improve model performance on the new dataset, the model was initialized using ImageNet weights and then finetuned with varying amounts of the feature extractor frozen, ranging from the whole feature extractor to none of it. From this experiment, it was determined that freezing the first 75% of layers is optimal, with Resnet-50 achieving an accuracy of 75.84%±2.38%, as seen in Figure 4. This was a 9 percentage point accuracy improvement over training from scratch. Using this finetuning configuration, a final image-only experiment was performed in which the image resolution was adjusted. From this experiment (Table 7), it was found that models trained on images with a resolution of 300×300 performed best, with Resnet-50 achieving 80.28%±0.23% top-1 accuracy.
Text classification
For text classification, three models (Mistral, Llama, and Falcon) were used as seen in Table 8.
Without any prompt tuning, the models only achieved 45.005%±21.959%, 68.447%±4.968%, and 58.160%±10.411% accuracy, respectively. Adding preliminary predictions (Top-N), ranging from 1 to 5 predictions, to the prompt improved accuracy, which reached 94.955%±1.360% using Llama and 1 preliminary prediction. Adding an Option List to the prompt also improved performance, but not as significantly as Top-N. When Options List and Top-N were combined, accuracy was generally above that of Options List alone but below that of Top-N alone. In addition, it was found that the Top-1 prediction gave the highest accuracy improvement, while Top-3 and Top-5 gave slightly less. This trend continued when combining Chain of Options with Top-N predictions in Table 9. Using Chain of Options alone, the models generally outperformed Options List but did not perform as well as Top-N predictions alone. When combining Chain of Options and Top-N predictions, the best performance for Mistral (94.560%±1.042%) and Falcon (93.472%±1.805%) was achieved, with performance for Llama (94.461%±2.972%) on par with Top-N predictions alone.
Finally, by integrating the predictions generated by the image model with the Chain of Options LLM training strategy, an overall accuracy of 93.927% was achieved which surpasses the performance of ChatGPT with CoT (76.261%) as seen in Table 10. This performance was better than that of the image model alone which achieved 80.28%±0.23% and also that of directly using the LLM which had 90.01%±3.05% accuracy (Table 11).
Explanations of findings
Image classification
The newly aggregated dataset, due to its class imbalance as well as its larger number of classes, is a more comprehensive skin disease classification task than the initial datasets. While all four models had strong performance on the initial dataset (9 classes), the addition of more classes, especially some with fewer samples, resulted in model performance decreasing across the board. For example, dermatofibroma, which was not part of the initial dataset and was added later, contains only 239 images. In contrast, melanocytic nevi contained 7,967 images (Table 1). Despite this, the incremental addition of classes from 9 to 15 to 20 to 26 did not yield significant declines in ROC AUC, precision, recall, or F1-scores for Resnet or Efficientnet as seen in Table 5. This suggests that model performance was not heavily skewed towards the majority classes and was mostly affected by the number of classes in the dataset. This is further corroborated by the final confusion matrix (Figure 5) of the Resnet-50 model after optimization, which does not show significant differences in misclassification rates for any of the classes based on image count. For example, dermatitis and warts, which each had at least 1.5 K images, performed on par with dermatofibroma despite dermatofibroma having significantly fewer images. It is important to note that VGG and ViT failed to converge on the final dataset and overfit on a few specific classes, resulting in exactly 0.5 ROC AUC and 21.5% accuracy. F1-score, precision, and recall were all very low, which supports the claim that the models failed to converge.
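One way to check whether misclassification depends on image count is to compute recall per class from the confusion matrix and compare it against class size. The sketch below uses toy counts for three placeholder classes; these numbers are illustrative only and are not results from this study.

```python
def per_class_recall(confusion):
    """Return recall for each class of a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j, so each row sums to the size of class i.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # A class with no samples gets recall 0.0 to avoid division by zero.
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Toy confusion matrix (NOT study data): class A is large, class C is small.
names = ["A", "B", "C"]
toy_confusion = [
    [90, 5, 5],
    [4, 44, 2],
    [1, 1, 8],
]

for name, row, recall in zip(names, toy_confusion, per_class_recall(toy_confusion)):
    print(f"class {name}: {sum(row)} samples, recall {recall:.2f}")
```

If recall were strongly tied to class size, the smallest classes would show markedly lower values than the largest ones.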
Among the evaluated models, Resnet, on average, outperformed VGG, Efficientnet, and ViT. This can be attributed to Resnet’s ability to mitigate the vanishing gradient problem common in deeper networks through its residual connections. In turn, the deeper networks allow for better feature extraction which is necessary for skin disease classification where many diseases can present very similarly. On the other hand, VGG struggled with the vanishing gradient problem due to its depth. ViT’s lack of inductive biases for images resulted in poor performance when trained from scratch. However, by using transfer learning, where ViT has already learned the inductive biases towards images, performance does improve. Efficientnet outperformed ViT and VGG but did not perform as well as Resnet due to its focus on parameter and computational efficiency instead of feature extraction effectiveness.
To further improve performance, it was determined that freezing 75% of the model’s parameters during finetuning was optimal (Figure 4). This percentage balances the general, class-agnostic features learned from pre-training on general image classification tasks with the task-specific features learned during finetuning on skin diseases. Freezing fewer than 75% of the layers would lead to overfitting, as the model would memorize too many disease- and image-specific features, while freezing more layers prevents the model from learning the nuances of skin disease classification. Overall, 75% enabled the model to retain general image understanding while also learning disease-specific features.
In terms of increasing the image resolution, performance improved up to a certain extent. While there was performance improvement from 224×224 to 300×300, there was no significant performance difference between 300×300 and 528×528 as seen in Table 7. Although higher resolution images do provide more visual details, they also have extraneous features that can lead to overfitting and longer training times. Therefore, the 300×300 resolution performed optimally as it had sufficient details without too many unnecessary features.
Text classification
Adding the Options List to the prompt improved LLM performance by restricting the model’s choices to the disease classes in the dataset (Table 8). However, combining the Options List with Top-N predictions did not further improve performance, as the Top-N predictions already contained a narrowed-down list of likely diseases, making the Options List redundant information which would confuse the model. Interestingly, accuracy decreased as more predictions were included: the model needed to understand more combinations of mock skin disease predictions, and this cost outweighed the higher chance that these combinations contained the correct disease.
Apart from Top-N predictions prompting, Chain of Options finetuning was also found to be an effective method for improving text classification performance (Table 9). Unlike Chain of Thought, which generates intermediate reasoning steps at inference, Chain of Options incrementally filters unlikely options during inference. This approach simplifies the task by allowing the model to eliminate improbable diseases early, incrementally reducing the complexity of the classification. On the other hand, because Chain of Thought considers previous reasoning steps when generating the current reasoning step, the task complexity does not decrease over time. As a result, in the case of classification, Chain of Options performs better since the model does not need to consider diseases that have already been filtered out. In addition, because Chain of Options is performed during the finetuning step, the method is able to learn more domain-specific information, which Chain of Thought is unable to do as it can only rely on its general knowledge.
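The idea of incrementally filtering options can be sketched as building a target sequence in which the candidate list shrinks at every step until one class remains. This is only one possible reading of Chain of Options: the `keep_ratio` parameter, the assumption that candidates arrive ranked by plausibility, and the "Step"/"Answer" text format are all hypothetical and are not taken from the study.

```python
def narrowing_steps(ranked_options, keep_ratio=0.5):
    """Return successively smaller candidate lists, ending with a single option.

    `ranked_options` is assumed to be ordered from most to least plausible,
    so each step keeps the leading share of the list and drops the rest.
    """
    if not ranked_options:
        raise ValueError("ranked_options must not be empty")
    if not 0.0 < keep_ratio < 1.0:
        raise ValueError("keep_ratio must be strictly between 0 and 1")
    steps = [list(ranked_options)]
    while len(steps[-1]) > 1:
        current = steps[-1]
        # Always keep at least one option; the list shrinks on every pass.
        keep = max(1, int(len(current) * keep_ratio))
        steps.append(current[:keep])
    return steps


def format_target(steps):
    """Render the narrowing steps as text, using a hypothetical layout."""
    lines = [
        f"Step {number}: " + ", ".join(options)
        for number, options in enumerate(steps[:-1], start=1)
    ]
    lines.append("Answer: " + steps[-1][0])
    return "\n".join(lines)


# Placeholder class names, ranked with the correct class first.
ranked = [f"disease_{i}" for i in range(1, 9)]
print(format_target(narrowing_steps(ranked)))
```

Because each step only lists the options that survive, later steps are shorter than earlier ones, which mirrors the decreasing task complexity described above.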
While some non-Chain of Options configurations, using only Top-N predictions, achieved comparable performance, Chain of Options consistently outperformed those configurations when combined with the vision model predictions. As a result, Chain of Options was used for the final evaluation in multimodal classification.
Combining image and text
The combined accuracy of the vision and language model (93.927%) was slightly lower than that of just the LLM (94.461%±2.972%) (Table 11). This discrepancy can be attributed to the vision model’s predictions not being uniformly distributed; certain diseases are more likely to be confused. Therefore, some top-N prediction combinations appeared more often than others, leading to slightly lower accuracy especially if those combinations of diseases were also difficult for just the LLM. On the other hand, for just the LLM, the top-N predictions were randomly selected from the overall set of predictions so the frequency of the problematic disease prediction combinations was lessened and thus performance was slightly better.
Strengths and limitations
One of the major strengths of this study is its thorough evaluation of several optimization techniques for both vision and language models. The study uses proven as well as state-of-the-art vision models to explore various optimization techniques ranging from image augmentation to model finetuning to image resolution adjustment. As a result, future works have a strong foundational understanding of which techniques work and do not work on this newly proposed dataset, as well as strong baselines to compare to. In addition, by using a variety of LLMs, the proposed prompt tuning and finetuning techniques are shown to be robust and could potentially be used for other models or applications. The overall dataset proposed in this study is the first of its kind in terms of image count, class count, and its combination of skin disease images and associated, generated patient narratives.
Finally, combining vision and text data for the task of skin disease classification is a novel approach that has not been extensively explored in the past. This setup mitigates potential issues that each modality can have individually and more closely reflects how a real-world dermatologist would classify a patient’s skin disease. Current skin disease datasets are comprised primarily of Caucasian skin, which limits how representative methods trained only on skin disease images are for other skin types. This can result in bias and inaccuracy for these other skin types. On the other hand, patient narratives do not contain this bias, as symptoms for a particular disease are largely the same for all patients, but they suffer from the fact that the patient may not present a complete list of symptoms. By combining image and text, the impact of each modality’s issues is mitigated, allowing for unbiased and more accurate diagnosis.
Despite this, there are some limitations in this study that can be taken up in future works. On the image side, despite aggregating three of the most popular skin disease datasets used in previous studies, there is still a lack of dataset diversity in terms of skin types and ethnicity. This can limit the method’s generalizability in real world applications where the distribution of skin types is different than that of the dataset. Similarly, the image model needs to be validated on other datasets to further prove its generalizability. To address both of these limitations, new open source datasets will need to be created that contain a variety of skin types and ethnicities as many current skin disease datasets that do contain this diversity are closed source.
On the text side, there are limitations with using ChatGPT generated patient narratives in lieu of actual patient narratives. ChatGPT generated data may not reflect how an actual patient would present their symptoms to a doctor as there could be inconsistencies or oversights that were not accounted for during initial dataset generation. This could cause the model to not generalize to real world patient narratives. The current dataset tries to mitigate this by presenting only a small subset of all the symptoms of the disease as well as prompting the model to use different dialects. To extend this work to real world applications, a new dataset will need to be created where actual patient narratives, spanning various regions and dialects, are used.
Overall, although the results of this study show that using image and language models to classify skin diseases can work with a high degree of accuracy, the study does not compare performance with human dermatologists on the same dataset. This is a potential limitation on this study’s applicability to the real world but does not detract from the study’s main objective, which is to create a state-of-the-art multimodal skin disease classifier. In addition, the methods proposed in this study can serve as another viewpoint and data point that a dermatologist can use to support and reinforce their diagnosis, rather than as a replacement.
Comparison with similar research
Several studies have explored the use of vision models for skin disease classification in the past. These studies have established strong baselines on the initial datasets used in this work (23,28). However, this study extends these baselines by aggregating multiple datasets together to create a larger, more diverse dataset that encompasses 26 skin disease classes. On this new dataset, a more extensive list of vision models, including the ones from previous studies, was evaluated. This study goes a step further and introduces several optimization techniques that boost performance even more. Aside from the standard vision models used in this study, there are also proprietary models that have been created for skin disease classification (13,23,38). Despite their performance, these models are not open sourced and so are not accessible for broader validation or adaptation. On the other hand, this study uses only widely available, open source models such as VGG, Resnet, Efficientnet, and ViT, allowing this study to be easily reproducible for future research.
This study is the first to combine LLMs with vision models for multimodal skin disease classification using patient narratives and skin disease images. While previous works (22) explored the use of images and clinical metadata, they did not leverage patient narratives or LLM finetuning techniques. By introducing Chain of Options and integrating it with vision model predictions, this work achieves state of the art accuracy of 93.927% which outperforms methods that use a single modality. This multimodal approach creates a new standard for skin disease classification and proposes a framework for future research that combines image and text data for medical diagnosis.
Implications and actions needed
This work presents a novel method that uses vision and language models to diagnose skin diseases more accurately and efficiently than current state-of-the-art methods. In conjunction, the newly proposed dataset can serve as an effective baseline for future work given its size and multimodality. The newly proposed LLM finetuning strategy for sequence classification, Chain of Options, which breaks down a complex reasoning task into intermediate steps at training time instead of inference, can also be extended to other sequence classification tasks. Future work should mainly improve the dataset itself by gathering more skin disease images from a wider variety of skin types and ethnicities, as well as by including real patient narratives in the dataset. In addition, the results of the method can be compared with human dermatologists on the same dataset to determine whether it can be used to improve final dermatologist diagnosis accuracy.
Conclusions
This study presents a novel multimodal approach for the classification of skin diseases, combining vision and language models to achieve state of the art diagnostic accuracy. The key contributions of this work include the development of a comprehensive multimodal dataset comprised of 26 skin disease classes, establishment of robust baselines for popular image models, and the introduction of a novel LLM finetuning method, Chain of Options, which enhances classification-specific LLM performance. Overall, the study achieves a final accuracy of 93.927%, surpassing existing methods on the newly proposed dataset.
This approach highlights the potential of multimodal methods in improving the efficiency and accuracy of skin disease diagnosis. However, limitations such as dataset diversity and the use of synthetic patient narratives require further study before using this method in the real world. In conclusion, this study advances the use of AI in dermatology by demonstrating the effectiveness of combining image and text data. The proposed methods and dataset provide a strong foundation for future research, enhancing diagnostic tools for clinicians and improving patient outcomes.
Acknowledgments
Portions of this work were previously presented as a poster at the Alameda County Science and Engineering Fair (ACSEF) at Chabot College, Hayward, CA, USA, March 13-16, 2024 (https://www.acsef.org/). Parts of the work were also preprinted at: https://arxiv.org/pdf/2411.13855.
Footnote
Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/rc
Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/dss
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/prf
Funding: None.
Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. IRB approval is not applicable, as the study does not contain human or animal experiments. The study does not require informed consent as all datasets used are open sourced on Kaggle.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Why more Americans are putting off going to the doctor, 2023. Available online: https://www.pbs.org/newshour/show/why-more-americans-are-putting-off-going-to-the-doctor
- Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115-8. [Crossref] [PubMed]
- Boesch G. Very deep convolutional networks (vgg) essential guide. Available online: https://viso.ai/deeplearning/vgg-very-deep-convolutional-networks/, 2025. Accessed: 2023:12:22.
- He K, Zhang X, Ren S, et al. Deep residual learning for image recognition, 2015. Available online: https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf
- Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020. Available online: https://arxiv.org/abs/1905.11946
- Thomsen K, Christensen AL, Iversen L, et al. Deep Learning for Diagnostic Binary Classification of Multiple-Lesion Skin Diseases. Front Med (Lausanne) 2020;7:574329. [Crossref] [PubMed]
- Abunadi I, Senan EM. Deep learning and machine learning techniques of diagnosis dermoscopy images for early detection of skin diseases. Electronics 2021;10:3158.
- Isic 2018 challenge. Available online: https://challenge.isic-archive.com/landing/2018/, 2018. Accessed: 2024:01:21.
- PH2 Database. Accessed: 2024:01:22. Available online: https://www.fc.up.pt/addi/ph2%20database.html
- Han SS, Kim MS, Lim W, et al. Classification of the Clinical Images for Benign and Malignant Cutaneous Tumors Using a Deep Learning Algorithm. J Invest Dermatol 2018;138:1529-38. [Crossref] [PubMed]
- Hu Z, Wang B, Pan X, et al. Using deep learning to distinguish malignant from benign parotid tumors on plain computed tomography images. Front Oncol 2022;12:919088. [Crossref] [PubMed]
- Wu H, Yin H, Chen H, et al. A deep learning, image based approach for automated diagnosis for inflammatory skin diseases. Ann Transl Med 2020;8:581. [Crossref] [PubMed]
- Naeem A, Anees T, Khalil M, et al. Snc_net: skin cancer detection by integrating handcrafted and deep learning-based features using dermoscopy images. Mathematics 2024;12:1030.
- Shieh A, Tran B, He G, et al. Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep 2024;14:9330. [Crossref] [PubMed]
- Chatgpt. Available online: https://chat.openai.com/chat. Accessed: 2024:02:03.
- Claude. Available online: https://claude.ai/new. Accessed: 2024:02:03.
- Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models, 2023. Available online: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
- ISIC_2019_challenge. Accessed: 2024:01:21. Available online: https://www.kaggle.com/datasets/agsam23/isic-2019-challenge
- Dermnet. Available online: https://www.kaggle.com/datasets/shubhamgoel27/dermnet/data. Accessed: 2024:01:19.
- Skin diseases image dataset. Accessed: 2024:01:20. Available online: https://www.kaggle.com/datasets/ismailpromus/skin-diseases-image-dataset
- Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 2020; arXiv preprint arXiv:2010.11929.
- Cai G, Zhu Y, Wu Y, et al. A multimodal transformer to fuse images and metadata for skin disease classification. Vis Comput 2022; Epub ahead of print. [Crossref]
- Hammad M, Pławiak P, ElAffendi M, et al. Enhanced Deep Learning Approach for Accurate Eczema and Psoriasis Skin Detection. Sensors (Basel) 2023;23:7295. [Crossref] [PubMed]
- Jain A, Way D, Gupta V, et al. Development and Assessment of an Artificial Intelligence-Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners in Teledermatology Practices. JAMA Netw Open 2021;4:e217249. [Crossref] [PubMed]
- Escalé-Besa A, Yélamos O, Vidal-Alaball J, et al. Exploring the potential of artificial intelligence in improving skin lesion diagnosis in primary care. Sci Rep 2023;13:4293. [Crossref] [PubMed]
- Liopyris K, Gregoriou S, Dias J, et al. Artificial Intelligence in Dermatology: Challenges and Perspectives. Dermatol Ther (Heidelb) 2022;12:2637-51. [Crossref] [PubMed]
- Li Z, Koban KC, Schenck TL, et al. Artificial Intelligence in Dermatology Image Analysis: Current Developments and Future Trends. J Clin Med 2022;11:6826. [Crossref] [PubMed]
- Kassem MA, Hosny KM, Fouad MM. Skin lesions classification into eight classes for isic 2019 using deep convolutional neural network and transfer learning. IEEE Access 2020;8:114822-32.
- Mayo clinic. Available online: https://www.mayoclinic.org/. Accessed: 2024:02:03.
- American academy of dermatology association. Available online: https://www.aad.org/. Accessed: 2024:02:03.
- National institute of arthritis and musculoskeletal and skin diseases. Available online: https://www.niams.nih.gov/. Accessed: 2024:02:03.
- Cleveland clinic. Available online: https://my.clevelandclinic.org/. Accessed: 2024:02:03.
- Deng J, Dong W, Socher R, et al. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA, 2009, pp. 248-255.
- Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models. 2023; arXiv preprint arXiv:2302.13971.
- Almazrouei E, Alobeidli H, Alshamsi A, et al. The falcon series of open language models, 2023. arXiv preprint arXiv:2311.16867.
- Jiang A, Sablayrolles A, Mensch A, et al. Mistral 7b, 2023. arXiv preprint arXiv:2310.06825.
- Hu EJ, Shen Y, Wallis P, et al. Lora: Low-rank adaptation of large language models. ICLR 2022;1:3. Available online: https://openreview.net/forum?id=nZeVKeeFYf9
- Ali MU, Khalid M, Alshanbari H, et al. Enhancing Skin Lesion Detection: A Multistage Multiclass Convolutional Neural Network-Based Framework. Bioengineering (Basel) 2023;10:1430. [Crossref] [PubMed]
Cite this article as: Yang A, Yang E. A multimodal approach to the detection and classification of skin diseases. J Med Artif Intell 2026;9:3.


