A multimodal approach to the detection and classification of skin diseases

Allen Yang; Edward Yang

doi:10.21037/jmai-24-429

Original Article

Allen Yang¹, Edward Yang²

¹Mission San Jose High School, Fremont, CA, USA; ²Yale University, New Haven, CT, USA

Contributions: (I) Conception and design: Both authors; (II) Administrative support: E Yang; (III) Provision of study materials or patients: E Yang; (IV) Collection and assembly of data: A Yang; (V) Data analysis and interpretation: A Yang; (VI) Manuscript writing: Both authors; (VII) Final approval of manuscript: Both authors.

Correspondence to: Edward Yang, MS. Yale University, New Haven, CT, USA; 3075 Olcott St, Santa Clara, CA 95054, USA. Email: edwardyang016@gmail.com.

Background: According to the Public Broadcasting Service (PBS), nearly one-third of Americans lack access to primary care services, and another forty percent delay seeking care to avoid medical costs. As a result, many diseases are left undiagnosed and untreated, even when the disease shows many physical symptoms on the skin. With recent advancements in machine learning, self-diagnosis and improved disease recognition have become more promising than ever. Nevertheless, existing methods suffer from a lack of large-scale patient databases and from outdated methods of study, which limits studies to only a few diseases or modalities. This work aims to combine text and image skin disease information in a new aggregated, multimodal dataset to better classify skin diseases while also establishing baselines for popular image and text models.

Methods: This study incorporates common and standard patient information via image and text for skin disease classification on a new dataset of 26 skin disease types that includes both skin disease images and associated patient narratives. Using the image part of the dataset (aggregated from the top three public skin disease image sources), baselines for various image models were established and then improved upon using various optimization techniques such as transfer learning, data augmentation, and finetuning. Using the textual patient narratives, this study explores methods for improving large language model (LLM) performance on classification tasks via finetuning. In addition, this study proposes a novel finetuning strategy, Chain of Options, which breaks down a complex reasoning task into intermediate steps at training time instead of inference. Finally, this study combines the best performing text and image model to perform skin disease classification using both the skin disease images and the associated patient narratives.

Results: Initially, the best-performing image model achieved an accuracy of only 66.61%±2.86% and a receiver operating characteristic area under the curve (ROC AUC) of 0.94±0.008; after finetuning and dataset-specific image augmentations, the accuracy improved to 80.28%±0.23%. With Chain of Options and preliminary disease recommendations from the image model, the overall accuracy improved further to a state-of-the-art 93.927%, with an ROC AUC of 1.0, in diagnosing a patient’s skin disease given just an image of the afflicted area and a patient description of the symptoms (such as itchiness or dizziness).

Conclusions: Through this research, an earlier diagnosis of skin diseases can occur, and clinicians can work with deep learning models to give a more accurate diagnosis, improving quality of life and saving lives.

Keywords: Large language models (LLMs); convolutional neural networks (CNNs); skin disease classification

Received: 18 November 2024; Accepted: 22 August 2025; Published online: 17 December 2025.

Highlight box

Key findings

• Using only skin disease images, the highest accuracy achieved on the new dataset is 80.1%, but when images are combined with text data, the accuracy can reach 91% across 26 classes.

What is known and what is new?

• Doctors use both image and textual/voice information to diagnose skin diseases. However, the computational diagnosis of skin disease has been explored via image and text separately but never together.

• Proposed a new method that uses common and standard patient information, such as skin images and patient narratives, to predict the patient’s skin disease, leveraging a new large language model finetuning technique for sequence classification tasks, chain of options, that uses chain of thought and precursory information to improve training accuracy.

What is the implication, and what should change now?

• Demonstrated the potential of the proposed model for detecting multiple skin diseases simultaneously, which significantly improves the efficiency and accuracy of skin disease diagnosis.

• Proposed a new multimodal skin disease dataset containing 26 disease classes, offering a more comprehensive and realistic classification challenge.

Introduction

Background

According to PBS, nearly one-third of Americans lack access to primary care services, particularly in underserved or remote areas where access to doctors may be limited, and another forty percent delay seeking care to avoid medical costs (1). As a result, many diseases are left undiagnosed and untreated, even when the disease shows many physical symptoms. With the rise of artificial intelligence (AI), self-diagnosis and improved disease recognition have become more promising than ever. AI-driven diagnostic systems can potentially improve the accuracy and speed of disease diagnosis, especially for skin diseases. These tools have shown promising results in the diagnosis of skin diseases, with some studies demonstrating superior performance compared to human dermatologists (2). In the near future, once these tools are integrated into mobile applications, individuals could capture images of their skin lesions and receive instant feedback or recommendations for further evaluation. This approach has the advantage of reducing wait times for appointments and providing timely guidance to patients who may not have easy access to medical services.

Current state-of-the-art deep learning algorithms have demonstrated remarkable capabilities in analyzing skin images and identifying patterns indicative of various skin conditions. These models leverage convolutional neural networks (CNNs), which stack convolutional, pooling, and fully connected layers to extract features from input images and classify them into different disease categories. In particular, several vision models including Very Deep Convolutional Networks (VGGNet) (3), ResNet (4), and EfficientNet (5) have been developed, adapted, and finetuned on various image classification tasks.

Thomsen (6) trained a VGG-16 model on 16,543 non-standardized images categorized with ICD-10 codes related to acne, rosacea, psoriasis, eczema, and cutaneous T-cell lymphoma. This model distinguished between the diseases with accuracy above 77%, a clinically relevant level compared to the reported diagnostic accuracy of primary care physicians in dermatology (48–77%). Abunadi (7) reported that ResNet-50, when applied to skin disease diagnosis using transfer learning, achieved an accuracy of 90% on the ISIC 2018 dataset (8) and 95.8% on the PH2 dataset (9). In fact, many studies have shown that ResNet-based models achieve high accuracy in differentiating between benign and malignant skin lesions, outperforming human dermatologists in certain cases (10,11). Wu (12), using the EfficientNet-B4 model, developed an AI dermatology diagnosis assistant for psoriasis, eczema and atopic dermatitis, and healthy skin. The model was trained on 4,740 clinical images, and its performance was evaluated using expert-confirmed clinical images grouped into three dermatologist-labeled diagnosis classifications, with accuracies of 96%, 89%, and 93%, respectively. Another approach (13) supplements the features extracted by the model itself with hand-crafted image features from experts for classification.

In addition to images of the skin disease, textual information related to the skin disease is also crucial for accurate diagnosis, especially when the visual signals of the diseases are very similar. This information can come in many forms, such as patient history, patient stories, or clinical metadata such as age, sex, or physical fitness. The most common form is patient stories. Patient stories or narratives are personal accounts of a person’s experiences, in this case with skin diseases, and are one of the first sources of information that a doctor uses for diagnosing skin conditions. These readily available stories contain information about the symptoms that a patient is experiencing, which is especially important because some diseases may look the same on physical inspection but cause symptoms that are not apparent from an image. However, patient narratives tend to be incomplete and sometimes inaccurate. For example, a patient presenting with eczema may describe constant itching and dry, cracked skin but not notice (or mention) the rash, which is a clear indicator of eczema. With this information, it is up to the dermatologist to narrow down the diseases and present the most likely one.

With the popularization of large language models (LLMs) such as ChatGPT for answering daily questions, patients have begun to use LLMs for diagnostic advice, much as they would ask a dermatologist (14). Services such as ChatGPT (15) and Claude (16) enable patients to conveniently get answers to their skin ailment questions without needing to wait for a doctor’s response. However, the diagnoses from these LLMs can be inaccurate, especially since they were not designed for skin disease classification. To address this issue of specialization, researchers in other fields have proposed several methods such as finetuning or prompting the LLM. The most popular of these techniques is chain of thought (CoT) prompting (17). CoT improves the ability of LLMs to solve complex reasoning tasks by forcing the LLM to present a series of intermediate reasoning steps before presenting the final result. In this way, the LLM breaks down the complex task into smaller, simpler subtasks that are less error prone, thus increasing accuracy. In the context of disease diagnosis, LLMs prompted with CoT would be able to reason about each individual symptom and then make a final judgment based on that reasoning.
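As a rough illustration of the difference between direct prompting and CoT prompting, the sketch below builds both kinds of prompt for a patient narrative. The function names and the prompt wording are hypothetical and are not taken from this study or from reference (17); they only show how a CoT prompt asks for intermediate steps before the final answer.

```python
def direct_prompt(narrative):
    """Ask for the diagnosis directly, with no intermediate reasoning."""
    return (
        f"Patient narrative: {narrative}\n"
        "Answer with the most likely skin disease."
    )


def chain_of_thought_prompt(narrative):
    """Ask for per-symptom reasoning before the final answer (illustrative wording)."""
    return (
        f"Patient narrative: {narrative}\n"
        "First, list each symptom mentioned in the narrative. "
        "Next, for each symptom, explain which skin diseases it suggests. "
        "Finally, state the single most likely skin disease."
    )
```

Either string would be sent to the LLM as its input; only the CoT version requests the intermediate steps.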

Rationale and knowledge gap

Overall, the success in skin disease diagnosis is largely enabled by the development of large datasets consisting of thousands of annotated images and text. Despite these promising advancements, AI-driven skin disease diagnosis still faces several challenges and limitations that must be addressed before it can be put into clinical practice. One major challenge is the need for large, high-quality datasets with diverse patient populations and skin disease types. Another challenge is outdated methods of study, which result in studies limited to only a few diseases or modalities.

Using only images to diagnose skin diseases has been explored in the past, especially with the creation of many skin disease image datasets such as ISIC (18), Dermnet (19), and the ones on Kaggle (20). These open- or closed-source datasets have given researchers consistent baselines against which to monitor and track model performance improvements. Each dataset generally comes from one or two sources such as medical archives, clinical studies, or research repositories, and so contains only a few skin diseases. As such, there is a need for a new dataset that aggregates these separate datasets into one definitive dataset that all researchers can use for model evaluation. Correspondingly, baselines for the most popular networks, including ResNet (4), VGG (3), EfficientNet (5), and ViT (21), need to be run on this dataset. This study accomplishes both of these objectives.

Although LLMs have been used for diagnosing skin diseases in the past, the use case has been limited to prompting, where the model weights are unchanged and only the input to the LLM (the prompt) is changed. This method works to a certain extent but relies on the model already having knowledge about the skin disease and its symptoms. Instead, this work finetunes the LLM by treating skin disease diagnosis as a classification task where the input is a patient narrative and the output is the predicted skin disease class. In this way, the model can acquire domain-specific knowledge about skin diseases, increasing diagnostic accuracy. Additionally, this work extends the chain of thought paradigm to the finetuning step to better utilize this prompt structure’s ability to incrementally narrow down the skin disease from an initial list of candidates while also being domain-specific.

Overall, combining vision and text data for the task of skin disease classification is a novel approach that has not been extensively explored in the past. This setup more closely aligns with how dermatologists actually classify skin diseases in patients, since it allows for a more holistic understanding of the patient’s condition that cannot be expressed in one modality alone. For example, using images alone, some skin disease symptoms are not as easily recognizable because of other features of the patient’s skin such as tattoos or extensive hair growth. In fact, even skin color could affect the diagnostic accuracy. Patient narratives do not suffer from this issue because, regardless of the patient’s skin color, the symptoms described by the patient would be more or less the same. However, these symptom descriptions may not be perfect, which is why dermatologists still need to use images to support their diagnosis.

Along these lines, Cai (22) proposed a dataset that includes skin disease images and clinical metadata, which are fed into a multimodal Transformer. Although this method also combines images and text, the text data it uses is neither as informative about the disease nor as easy to obtain as patient narratives. Patient narratives provide key information for diagnosis such as non-visual signals (trouble breathing, headaches, etc.) which are not easily apparent from patient metadata. This is especially true if the patient is not forthcoming about their age, sex, or other metadata but is willing to discuss how they are feeling. Overall, current methods that combine image and text information are evaluated on only 10 or fewer diseases (22-25). The prediction accuracy that these experiments achieve is around 84% (26,27). In this work, higher multimodal accuracy is achieved while also predicting more skin disease classes.

Objective

In summary, this work makes the following contributions:

• Proposed an end-to-end system that utilizes common and standard patient information such as affected skin area images as well as patient narratives to predict the skin disease afflicting the patient.
• Proposed a new multimodal skin disease dataset containing 26 disease classes, offering a more comprehensive and realistic classification challenge. Evaluated several models and optimization techniques to determine the optimal configuration for the dataset which surpasses state of the art.
• Introduced a new LLM finetuning technique for sequence classification tasks, chain of options, that utilizes chain of thought and precursory information to improve training accuracy and efficiency.
• Demonstrated the potential of the proposed model for detecting multiple skin diseases simultaneously, which can significantly improve the efficiency and accuracy of skin disease diagnosis.

We present this article in accordance with the CLEAR reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-429/rc).

Methods

Dataset development

Image data

In this work, the top three skin disease datasets available on Kaggle, a public repository for storing datasets, were used (18-20). These datasets comprise thousands of annotated images collected from diverse sources, including medical archives, research repositories, and clinical studies. The first dataset (20) consists of 27,153 images classified into ten classes of skin diseases. This dataset has a good distribution across the ten classes and was used by Hammad (23), who focused on only two classes: eczema (1,677 images) and psoriasis (2,055 images). The second dataset (19) includes 19,600 images classified into 23 different types of skin diseases/classes. The third dataset (18), derived from the ISIC competition source website, contains 23,618 images classified into 8 types of skin diseases. It is important to note that several classes appear in more than one dataset, such as basal cell carcinoma, which appears in all three. As a result, the number of classes in the combined dataset is less than the sum of the number of classes in the individual datasets.

These datasets were combined because each contains a different style of skin disease image: (18) contains more clinical (zoomed in and focused) images of the skin diseases, while (19) contains more natural images with suboptimal lighting and visible surrounding skin. This diversity in style (batch effects) improves the dataset’s representativeness of real-life skin disease images and helps to increase the generalizability of models trained on it. However, the datasets consist primarily of images of Caucasian skin, which limits the dataset’s representativeness in that respect; this is covered in more detail in the Discussion section. Since previous skin disease methods (23,28) also use these datasets, albeit separately, this method aims to be an improvement over those.

In total, there are approximately 70,000 images across the three datasets. However, some images in the combined dataset are duplicates or not related to skin diseases. Therefore, the images were manually checked to determine their relevancy based on the following criteria:

• The images labeled as a certain disease class must actually show that disease;
• The images must be in RGB, not grayscale;
• The images must be of skin and should not be of internal organs or contain non-skin artifacts such as clothing;
• The images must all be unique and not be perturbations or augmentations of the same image.

After manual checking, an automated check was also run to remove any duplicate images stored under different file names, which prevents data leakage. The MD5 hash of each image was computed, and any hash that appeared more than once was flagged. For each flagged hash, manual inspection verified that the two images were indeed the same, and one of them was deleted.
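The hash-based check described above can be sketched as follows. This is a minimal illustration rather than the study's actual script: the function names, the directory layout, and the list of file extensions are assumptions. The sketch only reports groups of files with identical hashes, leaving the deletion decision to manual inspection as in the text.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Group image files under image_dir that share the same MD5 hash."""
    by_hash = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            by_hash[md5_of_file(path)].append(path)
    # Only hashes seen more than once are candidates for manual inspection.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Note that an MD5 match only catches byte-identical files; a resized or re-encoded copy of the same image produces a different hash, which is why the manual criteria above are still needed.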

After filtering, the final dataset contains 36,995 images across 26 types of skin disease. Unfortunately, the class distribution is imbalanced because the number of images varies widely between diseases: some have only around 200 images (dermatofibroma has 239), while others have over 7,900 (melanocytic nevi). This highlights one of the challenges of obtaining skin disease images for AI applications, which this study overcomes. Table 1 shows the classes and number of images in each dataset used in this study. The images were divided into 80% for training and 20% for validation/testing. This split was varied for each training run based on preset seeds.
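A seed-controlled 80/20 split of this kind can be sketched as below. The function name and the use of a plain random shuffle are assumptions; the text does not state whether the split was stratified by class, so this sketch does not stratify.

```python
import random


def split_by_seed(items, seed, train_fraction=0.8):
    """Shuffle a copy of items with a fixed seed and split it in two."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible per seed.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Applied to 36,995 images, an 80% cut would give 29,596 training images and 7,399 held-out images; the same seed always reproduces the same split, while a different seed yields a different one.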

Table 1

Class statistics for skin disease image datasets

Class name (skin diseases)	1st dataset (images)	2nd dataset (images)	3rd dataset (images)	Final dataset (images)
Acne and rosacea		1,152		858
Atopic dermatitis	1,257	612		1,210
BCC	3,323	1,437	3,323	4,709
BKL	2,079		2,624	2,065
Bullous disease		561		528
Cellulitis impetigo and other bacterial infections		361		352
Eczema	1,677	1,544		1,553
Exanthems and drug eruptions		505		467
Hair loss alopecia and other hair diseases		299		282
Light diseases and disorders of pigmentation		711		676
Lupus and other connective tissue diseases		525		511
Melanocytic nevi (NV)	7,970		12,896	7,967
Melanoma skin cancer nevi and moles	3,140	579	4,522	3,698
Nail fungus and other nail disease		1,301		1,163
Poison ivy and other contact dermatitis		325		308
Psoriasis pictures lichen planus and related diseases	2,055	1,757		1,791
Scabies Lyme disease and other infestations and bites		539		479
Seborrheic keratoses and other benign tumors	1,847	1,714		1,802
Systemic disease		758		698
Tinea ringworm candidiasis and other fungal infections	1,702	1,625		1,546
Urticaria hives		265		261
Vascular tumors		603	253	845
Vasculitis		521		510
Warts molluscum and other viral infections	2,103	1,358		1,849
Squamous cell carcinoma			628	628
Dermatofibroma			239	239
Total diseases (classes)	10	22	7	26
Total images	27,153	19,052	23,618	36,995

BCC, basal cell carcinoma; BKL, benign keratosis-like lesions.

Text data from disease symptoms

In addition to images of skin diseases, symptoms of these diseases are another source of information for determining what ailment a patient has. Information about the common indicators and symptoms of various skin diseases is readily available on the Internet. In this study, symptom data was manually aggregated from several reputable sources such as Mayo Clinic (29), the American Academy of Dermatology Association (30), the National Institute of Arthritis and Musculoskeletal and Skin Diseases (31), and Cleveland Clinic (32). Table 2 shows examples of three diseases and their aggregated symptoms using the aforementioned sources. This symptom data is used to generate mock “patient stories”, which serve as input data for finetuning LLMs for skin disease classification. Additionally, ChatGPT was prompted to describe the symptoms of each skin disease to determine whether it already has a preliminary understanding of them. Although ChatGPT mentioned a few symptoms of each skin disease, the list of symptoms it described was less thorough and less extensive than the manually aggregated one.

Table 2

Examples of three diseases and their symptoms

Disease name	Aggregated symptoms	Symptoms from ChatGPT
BCC	• A shiny, skin-colored bump that’s translucent, meaning you can see a bit through the surface. The bump can look pearly white or pink on white skin. On brown and black skin, the bump often looks brown or glossy black. Tiny blood vessels might be visible, though they may be difficult to see on brown and black skin. The bump may bleed and scab over	BCC is the most common type of skin cancer, usually caused by exposure to UV radiation from sunlight. It typically appears as a pearly or waxy bump on the skin, often with visible blood vessels, or as a flat, scaly, reddish patch. BCC tends to grow slowly and rarely metastasizes, but it can cause disfigurement if left untreated. AK is a precancerous skin lesion caused by long-term exposure to UV radiation. It appears as rough, scaly patches on the skin, commonly found on sun-exposed areas such as the face, scalp, and hands. While AK itself is not cancerous, it has the potential to develop into squamous cell carcinoma if left untreated
	• A brown, black or blue lesion—or a lesion with dark spots—with a slightly raised, translucent border
	• A flat, scaly patch with a raised edge. Over time, these patches can grow quite large
	• A white, waxy, scar-like lesion without a clearly defined border
Eczema	• Dry, cracked skin	Eczema, also known as atopic dermatitis, is a chronic inflammatory skin condition characterized by dry, itchy, and red patches of skin. It can occur at any age but is most common in children. The exact cause of eczema is not fully understood, but it is believed to involve a combination of genetic and environmental factors. Triggers for eczema flare-ups can include irritants like soaps and detergents, allergens like dust mites or pet dander, changes in weather or humidity, and stress
	• Itchiness (pruritus)
	• Rash on swollen skin that varies in color depending on your skin color
	• Small, raised bumps, on brown or black skin
	• Oozing and crusting
	• Thickened skin
	• Darkening of the skin around the eyes
	• Raw, sensitive skin from scratching
BKL	• A round or oval-shaped waxy or rough bump, typically on the face, chest, a shoulder or the back	BKL encompass a range of skin growths, including seborrheic keratoses and other benign tumors. Seborrheic keratoses are common non-cancerous growths that appear as wart-like or stuck-on lesions on the skin’s surface. They can vary in color, size, and texture, often presenting as brown, black, or tan growths with a waxy, scaly, or rough surface. While seborrheic keratoses are typically harmless, they may be cosmetically bothersome or cause itching. Other benign tumors, such as dermatofibromas or skin tags, may also be categorized under BKL. These growths are generally benign and do not pose a significant health risk, but they may be removed for cosmetic reasons or if they cause discomfort
	• A flat growth or a slightly raised bump with a scaly surface, with a characteristic “pasted on” look
	• Varied size, from very small to more than 1 inch (2.5 centimeters) across
	• Varied number, ranging from a single growth to multiple growths
	• Very small growths clustered around the eyes or elsewhere on the face, sometimes called flesh moles or dermatosis papulosa nigra, common on black or brown skin
	• Varied in color, ranging from light tan to brown or black

AK, actinic keratosis; BCC, basal cell carcinoma; BKL, benign keratosis-like lesions; UV, ultraviolet.

As most real patient stories contain an incomplete list of symptoms for the patient’s disease, not all of a disease’s symptoms are used in each mock patient story. The data samples are generated by prompting ChatGPT to use a random subset of a disease’s symptoms to write a patient narrative as if it were the patient themselves (Table 3). The prompt used for each patient narrative generation is also varied to improve the diversity of narrative style and dialect, as different patients would express their symptoms in different ways. These two steps are important in ensuring that the LLM input (patient narratives) is similar to what an actual patient would say, which improves the dataset’s representativeness and the subsequent model’s generalizability. Compared with the original disease symptoms, patient narratives are much closer to what an actual dermatologist would have access to when making a diagnosis. Despite this, using artificially generated patient narratives does have limitations, as these descriptions have not been validated by practicing dermatologists and might not capture the natural variations and complexities of real patient descriptions. This is covered in more detail in the Discussion section. For now, these mock patient narratives serve as a proof-of-concept tool for combining text and vision modality inputs for skin disease detection as well as for LLM-related finetuning techniques.
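The two steps described above (sampling a random subset of symptoms and varying the prompt) could be sketched as follows. The symptom list is the eczema entry from Table 2 and the templates are modeled on the prompts in Table 3, but the function name, the minimum subset size, and the sampling scheme are assumptions; the call to ChatGPT itself is omitted.

```python
import random

# Prompt templates modeled on the examples in Table 3 (illustrative only).
PROMPT_TEMPLATES = [
    "Pretending you are a patient, please construct a one paragraph "
    "patient narrative using these symptoms: {symptoms}",
    "I would like to talk to a doctor and describe my symptoms: {symptoms}. "
    "Please construct a one paragraph patient narrative using those symptoms",
]

# Aggregated eczema symptoms from Table 2.
ECZEMA_SYMPTOMS = [
    "dry, cracked skin",
    "itchiness (pruritus)",
    "rash on swollen skin that varies in color depending on your skin color",
    "small, raised bumps, on brown or black skin",
    "oozing and crusting",
    "thickened skin",
    "darkening of the skin around the eyes",
    "raw, sensitive skin from scratching",
]


def build_narrative_prompt(symptoms, rng, min_symptoms=2):
    """Pick a random symptom subset and a random template, then fill it in."""
    count = rng.randint(min_symptoms, len(symptoms))  # assumed lower bound
    chosen = rng.sample(symptoms, count)
    template = rng.choice(PROMPT_TEMPLATES)
    return chosen, template.format(symptoms=", ".join(chosen))
```

For example, `build_narrative_prompt(ECZEMA_SYMPTOMS, random.Random(0))` returns the sampled symptoms together with a prompt that would then be sent to the LLM to obtain one mock patient story.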

Table 3

Examples of eczema text data from symptoms generated by ChatGPT

Eczema disease	Keywords	Prompt (GPT question)	Story (GPT response)
Example 1	• Dry, cracked skin	Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: dry, cracked skin, itchiness, small, raised bumps, on brown or black skin, oozing and crusting, thickened skin, sensitive skin from scratching	Ugh, my skin has been a total mess lately, and it’s driving me crazy. It’s so dry and cracked that it actually hurts, and the itching is nonstop—I keep scratching without even thinking, which just makes it worse. There’s this rash that looks different depending on the lighting, but on my brown skin, it’s super noticeable. Plus, I’ve got these tiny bumps that sometimes ooze (gross, I know) and then crust over, while other spots feel weirdly thick. And to top it all off, the skin around my eyes has gotten way darker. I have no clue what’s going on, but I need to figure it out ASAP because this is seriously stressing me out
	• Itchiness (pruritus)
	• Rash on swollen skin that varies in color depending on your skin color
	• Small, raised bumps, on brown or black skin
	• Oozing and crusting
	• Thickened skin
	• Darkening of the skin around the eyes
	• Raw, sensitive skin from scratching
2	• Dry, cracked skin	Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: dry, cracked skin, itchiness (pruritus), rash on swollen skin that varies in color depending on your skin color	I’ve been grappling with a series of unsettling skin issues that are really affecting my daily life. My skin feels incredibly dry and cracked, and the constant itchiness has become maddening. The appearance of a rash on my swollen skin is particularly troubling, and what’s strange is that it varies in color depending on my skin tone. It’s not just uncomfortable; it’s also visually distressing. I can’t seem to find relief, and I’m anxious to understand what’s causing these symptoms and how I can get some relief for my skin
	• Itchiness (pruritus)
	• Rash on swollen skin that varies in color depending on your skin color
3	• Rash on swollen skin that varies in color depending on your skin color	I would like to talk to doctor and would like describe my symptoms: rash on swollen skin, oozing and crusting and thickened skin. please construct a one paragraph patient narrative using those symptoms	I’ve been experiencing a rash that appears on swollen skin, accompanied by oozing and crusting. The affected area also seems to have thickened skin. The rash is red and raised, and the surrounding skin appears puffy. There’s discharge coming from the rash, with a yellowish color and crust forming over time. The skin feels tougher than usual, with a rough texture. These symptoms started recently and haven’t improved with home remedies. I’m concerned about what might be causing this and would appreciate your insight and guidance
	• Oozing and crusting
	• Thickened skin
4	• Dry, cracked skin	Pretending you go to hospital with your daughter to see the skin doctor, please help write a paragragh to describe her symptoms: cracked skin, Itchiness (pruritus) and raw, sensitive skin from scratching	My daughter has been experiencing cracked skin, relentless itchiness (pruritus), and her skin has become raw and sensitive from scratching when she was infancy. The cracks in her skin are visible, especially on her arms and legs, and they seem to worsen with movement. She constantly complains of itching, which she tries to relieve by scratching, leading to further irritation and discomfort. Her skin has become red and inflamed in areas where she scratches the most, and it’s evident that she’s in significant discomfort. We’re hoping to find relief and a solution to alleviate her symptoms
	• Itchiness (pruritus)
	• Raw, sensitive skin from scratching
5	• Small, raised bumps, on brown or Black skin	Pretending you are a patient, please construct a one paragraph patient narrative using these symptoms: small, raised bumps, on brown or black skin, oozing and crusting, thickened skin, darkening of the skin around the eyes	Lord have mercy, I just couldn’t ignore what was happening to his poor skin. Little raised bumps started popping up all over, and some of ‘em were even oozing and crusting. His skin in those spots felt thick and rough, nothing like the smooth complexion he’s always had. But what really got me worried was the darkening around his eyes—it just didn’t look right, like something more serious might be going on. I knew we needed to get some answers, and fast. I’m just praying the dermatologist can tell us what’s wrong and how to fix it
	• Oozing and crusting
	• Thickened skin
	• Darkening of the skin around the eyes

The final dataset consists of 20 stories per skin disease, for a total of 520 datapoints. The patient stories were divided into 60% for training, 20% for validation, and 20% for testing. Experimentally, training with more than 10 samples per class resulted in little to no model improvement, so 20 stories per skin disease was sufficient for this task. This makes sense, as each disease has only a small number of symptoms, and thus 10 or more training samples sufficiently cover the range of symptoms possible for that disease. Table 3 shows an example of 5 stories for the eczema skin disease.
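As an illustration of the 60/20/20 division, the sketch below splits the 520 stories per class so that every disease contributes the same number of stories to each subset. Whether the original split was stratified in this way is an assumption; the function name `stratified_split` and the integer class ids are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/val/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# 26 placeholder class ids with 20 stories each (520 datapoints).
labels = [c for c in range(26) for _ in range(20)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 312 104 104
```

Under this assumption each disease contributes 12 training, 4 validation, and 4 test stories.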

Image classification

Initial development using baseline dataset

Using the first dataset (20), baseline experiments were run with four different vision models (VGG, ResNet, EfficientNet, and Vision Transformer) to classify 10 different skin diseases (classes).

Image augmentation techniques facilitate the generation of synthetic dermatological images, augmenting existing datasets and enhancing the diversity and representativeness of training data. This approach helps mitigate the challenges of dataset imbalance and improves the robustness of machine learning models to variations in skin types, lighting conditions, and camera quality. Several standard techniques are used for image classification tasks, such as color jitter, Gaussian blur, horizontal/vertical flip, and image resizing. These techniques perform well in standard image classification tasks where the target entity can be anywhere in the image. However, for the proposed skin disease dataset, the afflicted area is generally in the center of the image and is orientation agnostic, allowing more fine-grained augmentations to be applied. In this case, the additional augmentations are random cropping and random rotation. In particular, instead of random cropping directly to the desired resolution, the image was first resized to slightly larger than the crop size so that the crop retains most of the relevant skin area. These are considered the “final” augmentations, combining both general and dataset-specific augmentations. To determine the optimal hyperparameters for the data augmentation techniques, models were trained from scratch on the first dataset (20) using an 80:20 train/validation split. For this and the following experiments, the Adam optimizer was used to optimize the available network parameters.
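The resize-then-crop step can be sketched as follows. This is a minimal illustration of the geometry only, assuming square images; the function name `sample_augmentation`, the 32-pixel margin, and the rotation range are hypothetical values, not the hyperparameters selected in this study.

```python
import random

def sample_augmentation(crop_size=224, resize_margin=32, max_rotation=180.0, rng=None):
    """Sample the geometry for one resize -> random crop -> random rotation pass.

    The image is first resized to (crop_size + resize_margin) per side, so the
    crop keeps most of the lesion instead of a small patch of the full photo.
    """
    rng = rng or random.Random()
    resized = crop_size + resize_margin
    left = rng.randint(0, resized - crop_size)  # randint bounds are inclusive
    top = rng.randint(0, resized - crop_size)
    angle = rng.uniform(-max_rotation, max_rotation)
    kept_fraction = (crop_size / resized) ** 2  # share of the resized image inside the crop
    return {
        "resize": resized,
        "box": (left, top, left + crop_size, top + crop_size),
        "angle": angle,
        "kept_fraction": kept_fraction,
    }

params = sample_augmentation(rng=random.Random(0))
print(params["resize"], params["box"], round(params["kept_fraction"], 3))
```

With these placeholder values the crop covers about 77% of the resized image, whereas a 224×224 crop taken directly from a 528×528 image would cover only about 18% of it.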

Applying base models to aggregated new image dataset

With baselines for the first dataset (20) established, more classes were incrementally added into the aggregated new image dataset to determine the effect that more classes would have on accuracy. In total, the aggregated new image dataset combines three datasets (18-20) for a total of 26 classes, 16 more than the baseline dataset (10 classes). It is worth noting that datasets (18) and (19) contain fewer images for each disease class, leading to an unbalanced class distribution in the final dataset. This imbalance is unavoidable given that some diseases are rarer than others. Due to the new dataset’s class imbalance, all models from the baseline study take longer to converge and get stuck at suboptimal minima more often than on the original dataset.

Transfer learning comparison

To further combat class imbalance, transfer learning was used. Here, transfer learning refers to freezing 100% of the feature extractor and training only the classifier on the new dataset. The feature extractor is initialized with ImageNet weights (33). With transfer learning, the model does not need to learn a feature extractor and a classifier simultaneously. This is especially useful for an imbalanced dataset, as a trainable feature extractor would tend to learn mainly the features relevant to the most common classes. Instead, a pretrained feature extractor already captures which features are important in an image, leaving the classifier to decide which disease the extracted features correspond to.

Finetuning of pretrained ResNet model

However, directly using ImageNet-trained feature extractor weights has some issues, because the data distribution of the pretraining dataset, ImageNet, differs significantly from the current use case. While ImageNet contains pictures of dogs, cats, and other easily identifiable objects, the current dataset contains only skin disease images, in which minute differences correspond to vastly different diseases. As a result, what may be a distinguishing feature between a dog and a cat cannot be used to distinguish between melanoma and basal cell carcinoma. To address this, various percentages of the model layers were unfrozen (as opposed to completely frozen in the previous experiment) and trained in order to determine the percentage at which the model achieves the highest accuracy. By doing so, the model can reuse the general, class-agnostic features learned through pretraining, such as the outlines of shapes, which are commonly captured in the initial layers of the feature extractor. The model then only needs to adjust the later layers of the feature extractor, which are responsible for the features that are specific to the task at hand.
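The partial-freezing scheme can be sketched as a simple partition of the feature extractor's layers. The helper `split_frozen_layers` and the block names below are hypothetical, and the fraction 0.75 is only an example value; a fraction of 1.0 reproduces the fully frozen transfer learning setting.

```python
def split_frozen_layers(layer_names, frozen_fraction):
    """Return (frozen, trainable) layer names, freezing the earliest layers first."""
    if not 0.0 <= frozen_fraction <= 1.0:
        raise ValueError("frozen_fraction must be between 0 and 1")
    n_frozen = int(len(layer_names) * frozen_fraction)
    return layer_names[:n_frozen], layer_names[n_frozen:]

# Hypothetical feature-extractor blocks, ordered from input to output.
feature_extractor = ["stem", "block1", "block2", "block3",
                     "block4", "block5", "block6", "block7"]

frozen, trainable = split_frozen_layers(feature_extractor, 0.75)
print(frozen)     # the first six blocks keep their pretrained weights
print(trainable)  # the last two blocks are updated together with the classifier
```

In a framework such as PyTorch, the frozen group would typically have gradient computation disabled for its parameters, while the trainable group and the classifier are passed to the optimizer.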

Adjusting image resolution

To further improve performance, image resolution was also adjusted. Since higher resolution images offer more visual cues about the skin disease, various models were evaluated at differing resolutions. Based on findings from the previous section, the first 75% of layers in the feature extractor of each model are frozen. Beginning with the initial resolution of 224×224 pixels, the images were enlarged using bicubic resampling up to 528×528, which is the approximate size of the original, unmodified image. At the initial 224×224 resolution used in previous works, the original image would need to be shrunk by approximately 80%, significantly reducing the number of features available. By increasing the resolution, the resulting image would have more features available for the model to discern. However, this might actually detract from performance, since many of these features could be extraneous and unrelated to a disease. In addition, higher resolution images result in longer training times, which may not be feasible under a limited compute budget. In fact, higher resolution images may not be available to begin with, especially given the scarcity of skin disease images. This study aims to determine the optimal resolution for images on this new dataset.
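The trade-off can be made concrete with a short calculation of how many pixels survive at each resolution. This is a sketch only: the intermediate sizes 300 and 384 are hypothetical examples, and the cost column assumes that compute grows roughly in proportion to pixel count, which is an approximation.

```python
def pixel_fraction(target_side, original_side):
    """Fraction of the original pixel count retained by a square resize."""
    return (target_side / original_side) ** 2

ORIGINAL_SIDE = 528  # approximate side length of the unmodified images
for side in (224, 300, 384, 528):
    kept = pixel_fraction(side, ORIGINAL_SIDE)
    relative_cost = (side / 224) ** 2  # rough pixel-count ratio versus 224x224
    print(f"{side}x{side}: keeps {kept:.0%} of pixels, ~{relative_cost:.1f}x the pixels of 224x224")
```

At 224×224, about 18% of the original pixels remain, which is consistent with the reduction of approximately 80% described above.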

Text classification

Baseline development

Using the text (patient narratives) part of the newly created dataset, multi-class sequence classification was performed with LLMs, where the LLM must classify the disease given the input patient narrative. To this end, three state-of-the-art LLMs [Llama-7B (34), Falcon-7B (35), Mistral-7B (36)] were finetuned on this dataset. This finetuning step is necessary for the model to learn which specific keywords are associated with a particular disease. Due to the computational constraints associated with such large models, LLMs larger than 7 billion parameters were not used in this study, and the three selected LLMs were finetuned using Low Rank Adaptation (LoRA) (37). By using LoRA, only a small, low-rank subset of the model’s parameters (7–10%) needs to be finetuned, which enables the models to fit on consumer-grade GPUs. The dataset itself is split 70:30 train/validation due to its small size.

Initial prompt tuning

Along with the patient narrative text block input for sequence classification, additional supporting information was also added. As LLMs are trained on large text corpora that describe hundreds of diseases, it may be difficult for them to associate the presented symptoms with only the subset of diseases used in this study. As such, it makes sense to restrict the set of diseases the LLM must choose from to the subset specified in the study by presenting the available disease options in the LLM’s prompt, as shown in Figure 1.

Figure 1 Sample prompt presented to LLM. BCC, basal cell carcinoma; BKL, benign keratosis-like lesions; LLM, large language model.
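A prompt of this kind could be assembled as in the sketch below. The wording of the template and the function name `build_prompt` are assumptions made for illustration; the exact prompt used in the study is the one shown in Figure 1.

```python
def build_prompt(narrative, options):
    """Append the list of allowed disease options to a patient narrative."""
    option_lines = "\n".join(f"- {name}" for name in options)
    return (
        "Patient narrative:\n"
        f"{narrative.strip()}\n\n"
        "Possible diseases:\n"
        f"{option_lines}\n\n"
        "Which disease best matches the narrative?"
    )

prompt = build_prompt(
    "My skin is cracked and itchy, and it has become raw from scratching.",
    ["eczema", "BCC", "BKL", "melanoma"],
)
print(prompt)
```

Keeping the options list as a separate argument makes it straightforward to present a different subset of diseases for the same narrative.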

Chain of options

Inspired by CoT, a popular prompting technique that breaks down a complex reasoning task into intermediate steps for the LLM to take at inference time, Chain of Options breaks down the task during finetuning instead of inference. This allows for more control in the reasoning process as the LLM’s performance could be evaluated during intermediate steps instead of only at the end. Furthermore, since Chain of Options operates during the finetuning step as opposed to the inference step, the method enables the LLM to learn more domain-specific information while CoT must rely on the LLM’s general knowledge.

At a high level, Chain of Options incrementally removes k options from the options list (problem space) based on which diseases are the most unlikely. The options list gradually shrinks until fewer than k diseases are left, at which point the top prediction is taken as the final prediction, as seen in Figure 2. This “narrowing down” process reframes the task from picking the most likely disease out of 26 options, which can be difficult if many diseases have similar symptoms, into simply determining which k diseases are the least likely. In contrast, the intermediate reasoning steps generated by CoT do not reduce the candidate pool, which is less efficient for classification tasks with a defined set of classes. Conceptually, picking the least likely diseases for a patient narrative based on a single symptom mismatch is a simpler task than picking the most likely disease, which would require all symptoms to match. This process also allows for more leniency in case the patient’s symptoms are inconsistent with the actual disease.

Figure 2 Chain of options example. BCC, basal cell carcinoma; BKL, benign keratosis-like lesions; LLM, large language model.
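The narrowing-down loop can be sketched as follows. The `rank_least_to_most_likely` argument stands in for the finetuned LLM and is hypothetical, as are the toy scores and placeholder option names. The sketch also assumes the loop stops once at most k options remain, so that the list is never emptied; this stopping rule is an assumption and may differ in detail from the one used in the study.

```python
def chain_of_options(narrative, options, k, rank_least_to_most_likely):
    """Repeatedly drop the k least likely options until at most k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(options)
    history = [list(remaining)]
    while len(remaining) > k:
        ranked = rank_least_to_most_likely(narrative, remaining)
        dropped = set(ranked[:k])  # the k least likely options in this round
        remaining = [o for o in remaining if o not in dropped]
        history.append(list(remaining))
    # The top prediction among the surviving options is the final answer.
    final = rank_least_to_most_likely(narrative, remaining)[-1]
    return final, history

# Made-up likelihood scores used only to exercise the loop.
toy_scores = {"eczema": 0.9, "melanoma": 0.4, "BCC": 0.3, "BKL": 0.2,
              "option_e": 0.15, "option_f": 0.1, "option_g": 0.05}

def toy_ranker(narrative, options):
    return sorted(options, key=toy_scores.get)

final, history = chain_of_options("itchy, cracked skin", list(toy_scores), 2, toy_ranker)
print(final)
print([len(step) for step in history])
```

Recording the options list after each round makes it possible to check at which step the correct disease was eliminated, if it was.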

During finetuning, random combinations of diseases of varying lengths are concatenated to the input patient narrative in the same way as the options list. This randomness simulates the various states of the options list that could arise as the chain progresses, while also adding some noise to the input to reduce overfitting.

Combining image and text classification

Along with the options list, the output of the image model is also added to the narrative, imitating an “expert” (the LLM) having the final say on what disease a patient has, given initial recommendations (from the image model). To imitate this process, the top-N most likely predictions from the image model are used as an initial prediction and included alongside the patient narrative as input to the LLM. Figure 3 describes the architecture of the model for skin disease classification in this study. Conceptually, this would further reduce the number of potential diseases the LLM must choose from to the top-N. In practice, this is not the case, as the image model is not entirely accurate even when the top-5 predicted diseases are considered. As a result, the LLM must not always assume that the correct disease is among the initial recommended diseases, although it is very likely to be. To imitate this in training, the correct disease is not always included in the recommendation; instead, it is included at a rate that matches the prediction accuracy of the image model, so that some recommendation lists do not contain the correct disease.

Figure 3 Model architecture for skin disease classification. BCC, basal cell carcinoma; BKL, benign keratosis-like lesions; CNN, convolutional neural network; LLM, large language model.
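One way to generate such simulated recommendations during training is sketched below. The function name, the placeholder disease labels, and the accuracy value of 0.8 are hypothetical. The sketch includes the correct disease with a probability equal to the image model's top-N accuracy, and it shuffles the list, which ignores the ranking a real image model would provide.

```python
import random

def training_recommendations(true_disease, all_diseases, n, top_n_accuracy, rng):
    """Build a simulated top-N recommendation list for one training narrative."""
    distractors = [d for d in all_diseases if d != true_disease]
    if rng.random() < top_n_accuracy:
        # Correct disease is present, as when the image model is right.
        picks = rng.sample(distractors, n - 1) + [true_disease]
    else:
        # Correct disease is missing, as when the image model is wrong.
        picks = rng.sample(distractors, n)
    rng.shuffle(picks)
    return picks

diseases = [f"disease_{i}" for i in range(26)]  # placeholder labels
rng = random.Random(0)
lists = [training_recommendations("disease_3", diseases, 5, 0.8, rng) for _ in range(5000)]
inclusion_rate = sum("disease_3" in recs for recs in lists) / len(lists)
print(round(inclusion_rate, 2))  # should be close to the assumed accuracy of 0.8
```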

Statistical analysis

In this study, several statistical methods are used to compare classification performance across various models, datasets, and configurations. The primary metric is accuracy, which is evaluated as the proportion of predictions that correctly match the actual disease class. The accuracy metric is further divided into top-1, top-3, and top-5 accuracy, which indicate whether the highest 1, 3, or 5 predictions output by the model contain the actual disease class. Given that there is class imbalance within the datasets, receiver operating characteristic area under the curve (ROC AUC), Precision, Recall, and F1-score were also measured. These metrics help reveal the effect of class imbalance, in particular whether a model is overfitting on any particular class. Each experiment was run three times to gauge the statistical significance and robustness of the performance metrics via standard deviation measurements.
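Top-k accuracy as described above can be computed as in the following sketch. The scores and targets are made-up toy values for four classes, not results from this study.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)

# Toy per-class scores for three samples and four classes.
scores = [
    [0.70, 0.20, 0.08, 0.02],  # true class 0 is ranked first
    [0.10, 0.30, 0.40, 0.20],  # true class 1 is ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked last
]
targets = [0, 1, 3]

print(top_k_accuracy(scores, targets, 1))  # one of three samples is correct
print(top_k_accuracy(scores, targets, 3))  # two of three samples are correct
```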

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. IRB approval is not applicable, as the study does not contain human or animal experiments. The study does not require informed consent as all datasets used are open sourced on Kaggle.

Results

The experiments were done on a home computer with an RTX 4090 GPU, Intel Core i9-13900K CPU, and 32 GB DDR5 memory.