Deep learning architectures for multi-class World Health Organization grading of meningiomas
Highlight box
Key findings
• Meningiomas are the most frequent type of primary intracranial tumor and are classified into three grades according to the World Health Organization (WHO).
• Differentiation across all three grades before surgery is essential for clinical assessment and for personalizing surgical plans to each patient.
What is known and what is new?
• The present study used different deep learning (DL) architectures [convolutional neural network, visual geometry group (VGG), vision transformer, and residual neural network (ResNet)] for multi-class WHO grading (grades 1, 2, and 3).
• ResNet and VGG performed well in detecting WHO grades 1 and 2; however, all models struggled with grade 3, owing to class imbalance.
What is the implication, and what should change now?
• DL models, particularly ResNet, may be a beneficial decision-support tool for preoperative grading, neurosurgical planning, and patient counseling.
Introduction
Meningiomas are the predominant primary intracranial tumors, categorized into three World Health Organization (WHO) grades: grade 1, grade 2, and grade 3 (1). Although grade 1 accounts for approximately 80% of meningiomas, higher-grade tumors exhibit more aggressive behavior and a poorer prognosis (2). Grading has traditionally been based on histological examination, which is vulnerable to interobserver variability and diagnostic challenges (1,2). Ammendola et al. observed a concordance rate of 63–74% between two neuropathologists in the diagnosis of atypical meningioma using whole slide imaging, whereas a previous study demonstrated a concordance rate of 87.8–93.6% for meningioma grading (3,4).
Recent advancements in deep learning (DL) have shown promise in image-based classification tasks across various neurosurgical fields (5). Bo et al. used DL to distinguish brain abscesses from cystic gliomas and reported an area under the curve (AUC) of 0.85–0.86 (6), while Jaruenpunyasak et al. used a convolutional neural network (CNN) to distinguish glioblastoma from primary central nervous system lymphoma, with an AUC of 0.83–0.84 (7). Additionally, several DL architectures have been compared for classifying germinoma versus non-germinoma tumors of the pineal region on magnetic resonance imaging (MRI). In that study, the CNN and visual geometry group (VGG) architectures achieved the highest AUCs of 0.96 each, while the vision transformer (ViT) and residual neural network (ResNet) models had AUCs of 0.80 and 0.54, respectively (8).
According to the literature, DL architectures have been investigated for classifying meningioma grades. A previous systematic review and meta-analysis reported pooled sensitivity, specificity, and AUC of 0.923, 0.953, and 0.97, respectively (9). Among individual DL architectures, the accuracy of CNN models ranged from 0.71 to 0.99 (10-12), while VGG and ResNet architectures achieved accuracies of 0.989 and 0.994, respectively (13,14). Although DL has been investigated for meningioma grading, prior approaches have mostly focused on binary classification, particularly distinguishing low-grade (WHO 1) from high-grade (WHO 2/3) tumors (9).
Despite both being classified as high-grade, WHO grade 2 and WHO grade 3 meningiomas differ greatly in their biological behavior, risk of recurrence, and mortality. Holleczek et al. studied the prognosis of meningioma in Germany and found that WHO grade 1, 2, and 3 meningiomas had 5-year overall survival probabilities of 88%, 86%, and 50%, respectively, and 10-year survival probabilities of 77%, 71%, and 23%, respectively (15). The 3-year progression-free survival (PFS) for WHO grade 2 meningioma ranged from 41.9% to 63.1% (16), whereas the 5-year PFS for WHO grade 3 meningioma was 37% (17).
The recurrence rate of meningiomas is strongly related to the WHO grade and the extent of surgical resection (2,15); however, factors such as tumor location, skull-base extension, or vascular encasement may preclude complete excision (18,19). Differentiating across all three WHO grades therefore provides clinical insight and supports preoperative planning of individualized surgical procedures. However, evidence for DL-based classification using a multi-class approach is scarce. Zhu et al. studied three-class WHO grade classification with the LeNet model and reported a training accuracy of 0.898 and a test accuracy of 0.833 (20). Therefore, the present study aimed to compare preoperative WHO grading performance in meningiomas across different DL architectures. We present this article in accordance with the TRIPOD reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-153/rc).
Methods
Study design and study population
This is a retrospective cohort study of patients with meningioma whose diagnosis was verified by board-certified pathologists using tissue samples between January 2014 and December 2022. A range of MRI images, including T1-weighted (T1W), T1-weighted gadolinium-enhanced (T1-Gd), T2-weighted (T2W), and fluid-attenuated inversion recovery (FLAIR) scans, were obtained from recruited patients. The following patients were excluded: (I) individuals whose preoperative MRI images were unavailable; (II) meningiomas without intracranial involvement, such as pure intraosseous meningioma, because purely intraosseous tumors can obstruct feature extraction. Consecutive patients were identified from a single academic center. The initial cohort comprised 267 individuals; based on these criteria, 17 patients lacking preoperative MRI scans and 11 patients with pure intraosseous meningioma were excluded. Consequently, baseline characteristics and image acquisition were obtained from 239 patients. Clinical details and tumor characteristics were analyzed using descriptive statistics: categorical data were presented as percentages, and continuous variables as mean and standard deviation (SD). Statistical analyses were conducted with R version 4.3.3 (R Foundation, Vienna, Austria).
Data gathering
The workflow of DL model training, validation, and testing in the present study is shown in Figure 1. Initially, sagittal, axial, and coronal planes of intracranial T1W, T1-Gd, T2W, and FLAIR MRI images were obtained. A total of 8,345 images were gathered, and patient-wise splitting was employed to prevent data leakage (21-23). In this method, all images from a single patient were grouped together and randomly assigned as a unit to either the training, validation, or test set. This ensures that no information from the same patient appears across multiple datasets, which is crucial for simulating real-world deployment in which the model encounters entirely unseen patients. The dataset was split such that 70% of patients were assigned to the training set, 20% to the validation set, and 10% to the test set. The training dataset consisted of 5,852 images, whereas the validation and test datasets had 1,667 and 826 images, respectively.
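The patient-wise splitting described above can be sketched as follows. This is an illustrative sketch only; the function and variable names are hypothetical, not taken from the study's code.

```python
import random

def patient_wise_split(images_by_patient, seed=42,
                       train_frac=0.70, val_frac=0.20):
    """Assign ALL images of each patient to exactly one split (70/20/10),
    so no patient's images leak across training, validation, and test sets."""
    patients = sorted(images_by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Flatten each patient group into a per-split image list.
    return {name: [img for pid in pids for img in images_by_patient[pid]]
            for name, pids in groups.items()}

# Toy example: 10 patients with varying numbers of slices.
data = {f"P{i:03d}": [f"P{i:03d}_slice{j}.png" for j in range(i + 1)]
        for i in range(10)}
splits = patient_wise_split(data)
```

Splitting at the patient level (rather than shuffling individual images) is what prevents near-duplicate adjacent slices of one tumor from appearing in both training and test sets.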
Image preprocessing and augmentation
To standardize input dimensions and improve model generalization, image transformations were used. For the training dataset, whole-slice images were first resized to a fixed resolution of 224×224 pixels to match the input requirements of DL architectures. A random horizontal flip was then used as a data augmentation method to simulate different image orientations and limit the risk of model overfitting. Each image was then transformed into a PyTorch tensor, with pixel values scaled to the [0,1] range. Subsequently, standardization was performed using the mean and SD derived exclusively from the training set to maintain independence of validation and test data. These training-derived normalization parameters were then consistently applied to the validation and test sets to prevent data leakage.
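The leakage-safe normalization step can be illustrated with a small sketch. The study used PyTorch transforms; here NumPy arrays stand in for resized slices already scaled to [0, 1], and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for 224x224 slices with pixel values in [0, 1].
train = rng.random((100, 224, 224))
val = rng.random((20, 224, 224))
test = rng.random((10, 224, 224))

# Statistics come from the training set ONLY, so no information from
# validation/test images leaks into preprocessing.
mu, sigma = train.mean(), train.std()

normalize = lambda x: (x - mu) / sigma
train_n, val_n, test_n = normalize(train), normalize(val), normalize(test)
```

Applying the training-derived `mu` and `sigma` to all three splits mirrors the paper's stated procedure for preventing data leakage.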
Model architectures
Four DL architectures were used for classification: CNN, VGG-16, ViT, and ResNet. In each architecture, the final classification layer was replaced with a fully connected layer with three output neurons corresponding to the three WHO grades.
CNN architecture
The architecture comprised three successive convolutional blocks with increasing channel depths (32, 64, and 128), each followed by a Rectified Linear Unit (ReLU) activation and a 2×2 max-pooling operation. The resulting feature maps were flattened and passed through two fully connected layers: the first with 512 neurons, ReLU activation, and a dropout rate of 0.5 to reduce overfitting, and the final layer with three output neurons representing the WHO grades. A softmax activation was implicitly applied through the cross-entropy loss function to enable multi-class classification.
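A minimal PyTorch sketch of the CNN described above. The input channel count is not stated in the text, so 3-channel 224×224 inputs are assumed here to match the pretrained architectures used elsewhere in the study:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Three conv blocks (32/64/128 channels) -> FC-512 -> 3 WHO grades."""
    def __init__(self, num_classes=3, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # 224 -> 112 -> 56 -> 28 after three 2x2 poolings.
            nn.Linear(128 * 28 * 28, 512), nn.ReLU(), nn.Dropout(dropout),
            # No explicit softmax: CrossEntropyLoss applies it implicitly.
            nn.Linear(512, num_classes),
        )
    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
logits = model(torch.randn(2, 3, 224, 224))
```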
VGG architecture
We utilized a transfer learning approach by fine-tuning the pre-trained VGG16 CNN for the task of the three WHO grades. The VGG16 model, which was originally trained on the ImageNet dataset, was loaded with its gained weights, and the final fully connected classification layer was replaced with a new linear layer having three output neurons representing the three WHO grades.
ViT architecture
The pre-trained model was loaded using the ViT-B/16 model. To adapt the model for our classification task, we replaced the final classification head with a new linear layer comprising three output units, corresponding to the three WHO grades. This transformer-based architecture leverages self-attention mechanisms to capture discriminative features across the entire image, making it particularly suitable for complex MRI image analysis.
ResNet architecture
The ResNet-18 model was initialized with pre-trained weights from ImageNet to leverage transfer learning and accelerate convergence. To tailor the network for our three-class classification task, the final fully connected layer was replaced with a new linear layer comprising three output neurons, each corresponding to one of the WHO grades.
Training procedure
We performed hyperparameter tuning using GridSearch to identify the optimal configuration for each model architecture. Key parameters, such as batch size, dropout rate, and learning rate, were included in the search space. The final configuration was selected on the basis of validation accuracy, and all models were then trained using the Adam optimizer with the selected learning rate of 0.0001. To determine the optimal number of training epochs and reduce overfitting, early stopping was applied based on validation loss with a defined patience threshold. Although training was allowed for up to 40 epochs, it often terminated earlier when no further improvement in validation loss was observed. The final training was performed using the best epoch identified through this procedure, with a batch size of 32. Additionally, class-balanced focal loss was applied to mitigate the effects of class imbalance, particularly for WHO grade 3 tumors.
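The text does not specify the exact focal-loss formulation; a common choice for "class-balanced focal loss" is the effective-number re-weighting of Cui et al. (2019), sketched here with hypothetical hyperparameters `beta` and `gamma`:

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class,
                              beta=0.9999, gamma=2.0):
    """Focal loss with per-class weights from the 'effective number of
    samples' (1 - beta^n_c); rare classes receive larger weights."""
    counts = torch.as_tensor(samples_per_class, dtype=torch.float32)
    effective = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective
    weights = weights / weights.sum() * len(counts)  # mean weight ~ 1

    log_p = F.log_softmax(logits, dim=1)
    # Per-sample weighted cross-entropy.
    ce = F.nll_loss(log_p, targets, weight=weights, reduction="none")
    # Focal term (1 - p_t)^gamma down-weights easy examples.
    pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1 - pt) ** gamma * ce).mean()

# Toy batch; class counts mimic the study's grade imbalance (177/57/5).
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
loss = class_balanced_focal_loss(logits, targets, [177, 57, 5])
```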
During each training epoch, the model was set to training mode and processed small batches of images. The computed loss was backpropagated to update model parameters via gradient descent, and metrics such as training accuracy and loss were recorded.
Validation was performed at the end of each epoch using a separate validation set. Validation accuracy and loss were calculated without gradient tracking to ensure computational efficiency. After training, learning curves were plotted to show the training and validation losses, as well as the classification accuracy over epochs.
Model evaluation and performance metrics
Based on validation accuracy, the best-performing model was evaluated using a confusion matrix. Moreover, we calculated several standard classification metrics, including accuracy, sensitivity (recall), specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score for each tumor grade. Furthermore, the overall performance of the DL models for WHO grade classification was calculated using a micro-averaging approach (21).
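The per-grade metrics can be computed from one-vs-rest confusion counts; a small illustrative sketch with toy labels follows (the grade values and predictions are invented for demonstration):

```python
def one_vs_rest_metrics(y_true, y_pred, cls):
    """Sensitivity, specificity, PPV, NPV, and F1 for one class vs the rest."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
        "npv": tn / (tn + fn) if tn + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }

# Toy test-set labels for grades 1-3. In multi-class single-label
# problems, micro-averaging pools all one-vs-rest TP/FP/FN counts,
# so micro-averaged F1 coincides with overall accuracy.
y_true = [1, 1, 1, 2, 2, 3, 1, 2, 1, 3]
y_pred = [1, 1, 2, 2, 2, 1, 1, 2, 1, 1]
per_class = {c: one_vs_rest_metrics(y_true, y_pred, c) for c in (1, 2, 3)}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note how a rare class (grade 3 here) can have high specificity and NPV while its sensitivity is 0.0, mirroring the pattern reported in Table 2.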
To evaluate the classification models' discriminatory performance across the three WHO meningioma grades, receiver operating characteristic (ROC) curves were created for each class using a one-vs-rest approach (22), with true labels converted into binary format to facilitate multi-class ROC analysis. The DeLong test was used for statistical comparison of AUCs between models (24). Additionally, precision-recall (PR) curves and decision curve analysis (DCA) were generated to evaluate performance under class imbalance and to assess clinical utility (25,26). The DL architectures were created and validated with Python software and Keras version 2.15.0 (Python Software Foundation) through Google Colaboratory (Google). The Python scripts are available at: https://github.com/thara7640/meningioma_grading.
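The one-vs-rest AUC underlying each ROC curve can be computed directly via the rank (Mann-Whitney) statistic; a dependency-free sketch with invented toy scores:

```python
def auc_one_vs_rest(y_true, scores, cls):
    """One-vs-rest AUC as the probability that a randomly chosen positive
    case outscores a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == cls]
    neg = [s for t, s in zip(y_true, scores) if t != cls]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy softmax scores for the grade-1 output column; binarizing the true
# labels as "grade 1 vs rest" reproduces the one-vs-rest ROC setting.
y_true = [1, 1, 2, 3, 1, 2]
grade1_scores = [0.9, 0.8, 0.3, 0.85, 0.7, 0.6]
auc_g1 = auc_one_vs_rest(y_true, grade1_scores, 1)
```

Repeating this per class (and per model) yields the AUC values compared with the DeLong test in Table 3.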
Ethical considerations
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This research was approved by the Human Research Ethics Committee of the Faculty of Medicine, Prince of Songkla University (REC.67-412-10-1). As a retrospective analysis, the study did not require informed patient consent. To ensure confidentiality, all patient identity numbers were encoded before the training process.
Results
Characteristics of patients in the present cohort
There were 239 patients with meningioma in the present study. The majority of the patients were female, with an average age of 52.5±12.8 years, as shown in Table 1. Common sites of meningioma were the convexity (28.9%), suprasellar region (16.3%), and parasagittal region (15.5%). Meningothelial meningioma was the most prevalent histological subtype (48.1%), followed by atypical (21.8%) and fibrous (15.1%) meningiomas. Additionally, WHO grade 1 was observed in 74.1% of cases, while WHO grades 2 and 3 were found in 23.8% and 2.1% of cases, respectively.
Table 1
| Factor | Value |
|---|---|
| Gender | |
| Male | 53 (22.2) |
| Female | 186 (77.8) |
| Age (years) | 52.0 [17] |
| Location | |
| Convexity | 69 (28.9) |
| Parasagittal | 37 (15.5) |
| Falcine | 12 (5.0) |
| Intraventricular | 3 (1.3) |
| Olfactory groove | 13 (5.4) |
| Suprasellar | 39 (16.3) |
| Sphenoid wing | 28 (11.7) |
| Clinoid | 3 (1.3) |
| Cavernous | 4 (1.7) |
| Cerebellar convexity | 1 (0.4) |
| Tentorial | 12 (5.0) |
| Petroclival | 10 (4.2) |
| Foramen magnum | 3 (1.3) |
| Lateralization | |
| Left | 103 (43.1) |
| Right | 93 (38.9) |
| Midline | 40 (16.7) |
| Bilateral | 3 (1.3) |
| Histology subtype | |
| Meningothelial | 115 (48.1) |
| Fibrous | 36 (15.1) |
| Transitional | 20 (8.4) |
| Microcystic | 3 (1.3) |
| Angiomatous | 2 (0.8) |
| Psammomatous | 1 (0.4) |
| Atypical | 52 (21.8) |
| Chordoid | 5 (2.1) |
| Clear cell | 1 (0.4) |
| Anaplastic | 3 (1.3) |
| Rhabdoid | 1 (0.4) |
| WHO grade | |
| 1 | 177 (74.1) |
| 2 | 57 (23.8) |
| 3 | 5 (2.1) |
Data are presented as median [interquartile range] or n (%). WHO, World Health Organization.
DL model development and validation
The loss and accuracy metrics were evaluated over 40 training epochs. Across all models, the training loss consistently decreased, reflecting effective learning on the training data. Concurrently, training accuracy steadily increased and eventually plateaued, demonstrating convergence. These trends are illustrated in Figure 2 for each model architecture.
Using unseen MRI images from the test dataset, multi-class classification performance was assessed for the four DL architectures, as shown in Table 2. For WHO grade 1, ResNet yielded the highest sensitivity [0.92; 95% confidence interval (CI): 0.90–0.94] and F1-score (0.86; 95% CI: 0.84–0.88), while VGG demonstrated the highest specificity (0.72; 95% CI: 0.67–0.77) and PPV (0.84; 95% CI: 0.81–0.87). ViT and ResNet showed accuracies of 0.76 and 0.82, respectively. In the classification of WHO grade 2, VGG outperformed the other models across all metrics, achieving a sensitivity of 0.74 (95% CI: 0.69–0.79), specificity of 0.86 (95% CI: 0.83–0.89), PPV of 0.72, and F1-score of 0.73. ResNet followed closely, with an accuracy of 0.81 and a sensitivity of 0.64 (95% CI: 0.58–0.69). For WHO grade 3, all models exhibited near-perfect specificity (0.99–1.00) and high NPV (0.95–0.96); however, none of the models correctly identified any grade 3 case, with sensitivities of 0.00 across all architectures. This resulted in F1-scores and PPVs of 0.00, highlighting the challenge of detecting grade 3 meningiomas, most likely due to their limited representation in the dataset. Overall, VGG and ResNet consistently showed superior performance for WHO grades 1 and 2, while all models struggled with grade 3. Furthermore, Figure 3 shows the multi-class confusion matrices for the DL models.
Table 2
| Architectures | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) | Accuracy (95% CI) | F1-score (95% CI) |
|---|---|---|---|---|---|---|
| Classification performances for WHO grade 1 | ||||||
| CNN | 0.86 (0.83–0.89) | 0.41 (0.36–0.47) | 0.72 (0.68–0.75) | 0.63 (0.57–0.70) | 0.70 (0.67–0.73) | 0.78 (0.76–0.81) |
| VGG | 0.88 (0.85–0.90) | 0.72 (0.67–0.77) | 0.84 (0.81–0.87) | 0.77 (0.72–0.82) | 0.82 (0.79–0.84) | 0.86 (0.84–0.88) |
| ViT | 0.88 (0.85–0.90) | 0.54 (0.49–0.60) | 0.77 (0.73–0.80) | 0.72 (0.66–0.78) | 0.76 (0.73–0.78) | 0.82 (0.80–0.84) |
| ResNet | 0.92 (0.90–0.94) | 0.63 (0.57–0.68) | 0.81 (0.78–0.84) | 0.83 (0.77–0.87) | 0.82 (0.79–0.84) | 0.86 (0.84–0.88) |
| Classification performances for WHO grade 2 | ||||||
| CNN | 0.39 (0.34–0.45) | 0.85 (0.82–0.88) | 0.57 (0.50–0.64) | 0.74 (0.70–0.77) | 0.70 (0.67–0.73) | 0.47 (0.41–0.52) |
| VGG | 0.74 (0.69–0.79) | 0.86 (0.83–0.89) | 0.72 (0.67–0.77) | 0.87 (0.84–0.90) | 0.82 (0.79–0.85) | 0.73 (0.69–0.77) |
| ViT | 0.57 (0.51–0.62) | 0.87 (0.84–0.89) | 0.68 (0.62–0.74) | 0.80 (0.77–0.83) | 0.77 (0.74–0.80) | 0.62 (0.56–0.66) |
| ResNet | 0.64 (0.58–0.69) | 0.90 (0.88–0.93) | 0.77 (0.71–0.82) | 0.83 (0.80–0.86) | 0.81 (0.79–0.84) | 0.69 (0.65–0.74) |
| Classification performances for WHO grade 3 | ||||||
| CNN | 0.00 (0.00–0.00) | 0.99 (0.98–0.99) | 0.00 (0.00–0.32) | 0.96 (0.95–0.98) | 0.96 (0.94–0.97) | 0.00 (0.00–0.00) |
| VGG | 0.00 (0.00–0.12) | 1.00 (1.00–1.00) | 0.00 (0.00–0.00) | 0.96 (0.95–0.98) | 0.96 (0.95–0.98) | 0.00 (0.00–0.00) |
| ViT | 0.00 (0.00–0.12) | 1.00 (1.00–1.00) | 0.00 (0.00–0.00) | 0.96 (0.95–0.98) | 0.96 (0.95–0.98) | 0.00 (0.00–0.00) |
| ResNet | 0.00 (0.00–0.12) | 0.99 (0.99–1.00) | 0.00 (0.00–0.49) | 0.96 (0.95–0.98) | 0.96 (0.94–0.97) | 0.00 (0.00–0.00) |
CI, confidence interval; CNN, convolutional neural network; NPV, negative predictive value; PPV, positive predictive value; ResNet, residual neural network; VGG, visual geometry group; ViT, vision transformer; WHO, World Health Organization.
Figure 4 illustrates the comparative diagnostic performance of the four DL architectures for WHO grading of meningiomas. All models demonstrated their highest discriminative performance for WHO grade 1. The ResNet model achieved the highest AUC for WHO grade 1 (AUC =0.91, 95% CI: 0.89–0.92) and grade 2 (AUC =0.89, 95% CI: 0.87–0.91), while VGG also performed well for grades 1 and 2, with an AUC of 0.87 for both. The ViT model had balanced AUCs of 0.83 for grades 1 and 2 and an AUC of 0.61 (95% CI: 0.51–0.71) for grade 3. The CNN model had moderate AUCs for grades 1 and 2 but achieved the highest AUC for grade 3 (AUC =0.65, 95% CI: 0.55–0.76). Nevertheless, all models showed markedly reduced performance in detecting grade 3 meningiomas. Furthermore, Table 3 presents the results of the DeLong test, which was used to statistically compare AUCs among the DL architectures across WHO tumor grades. These results show that ResNet consistently outperformed the other models for WHO grades 1 and 2, whereas grade 3 tumors remained difficult to distinguish regardless of the architecture employed.
Table 3
| Architectures | P value | |||
|---|---|---|---|---|
| CNN | VGG | ViT | ResNet | |
| Classification performances for WHO grade 1 | ||||
| CNN (AUC =0.76) | – | <0.001 | <0.001 | <0.001 |
| VGG (AUC =0.87) | <0.001 | – | <0.001 | <0.001 |
| ViT (AUC =0.83) | <0.001 | <0.001 | – | 0.009 |
| ResNet (AUC =0.91) | <0.001 | <0.001 | 0.009 | – |
| Classification performances for WHO grade 2 | ||||
| CNN (AUC =0.75) | – | <0.001 | <0.001 | <0.001 |
| VGG (AUC =0.87) | <0.001 | – | <0.001 | <0.001 |
| ViT (AUC =0.83) | 0.002 | <0.001 | – | 0.003 |
| ResNet (AUC =0.89) | <0.001 | <0.001 | 0.003 | – |
| Classification performances for WHO grade 3 | ||||
| CNN (AUC =0.65) | – | 0.002 | 0.002 | 0.009 |
| VGG (AUC =0.52) | 0.002 | – | 0.26 | 0.71 |
| ViT (AUC =0.61) | <0.001 | 0.26 | – | 0.13 |
| ResNet (AUC =0.58) | 0.009 | 0.71 | 0.13 | – |
AUC, area under the receiver operating characteristic curve; CNN, convolutional neural network; ResNet, residual neural network; VGG, visual geometry group; ViT, vision transformer; WHO, World Health Organization.
Furthermore, PR curves were generated to assess performance under class imbalance, as shown in Figure S1. Across all models, the best area under the PR curve (AP) was observed for WHO grade 1, with values ranging from 0.84 to 0.94. For WHO grade 2, ViT and ResNet also performed robustly, with APs of 0.72 and 0.77, respectively. However, classification of WHO grade 3 tumors remained challenging across all architectures, with extremely low AP values, indicating a high rate of false positives and a limited ability to recall true grade 3 cases. These results highlight the class imbalance and the difficulty of detecting grade 3 meningiomas, despite overall high performance for lower grades. As shown in Figure S2, DCA was used to evaluate the clinical net benefit of each model across a range of probability thresholds for WHO grades 1, 2, and 3. The net benefit was highest for grade 1 classification in all models, particularly with ResNet and VGG, across most threshold probabilities. In contrast, WHO grade 3 classification provided no net benefit in any model, as the curves overlapped with or fell below the "Treat None" line. Overall, ResNet and ViT demonstrated more consistent net benefit for grade 2 classification than CNN and VGG. These findings suggest that ResNet and ViT may offer more clinically meaningful predictions for WHO grade 1 and 2 meningiomas, while challenges persist in identifying WHO grade 3 tumors.
Discussion
In the present study, we compared the performance of four DL models for automated multi-class classification of meningioma WHO grades using preoperative MRI images. Our findings suggest that the ResNet and VGG models achieved the highest classification performance for WHO grades 1 and 2, while all models struggled substantially to identify WHO grade 3 tumors. Although WHO grades 2 and 3 are both considered high-grade, they have different prognoses and therapeutic consequences. Multi-class image classification is challenging because the distinctions between grades 1 and 2, as well as between grades 2 and 3, can be very subtle on imaging; these differences often require fine-grained feature extraction, which DL models can perform efficiently (24,25). Zhu et al. investigated three-class WHO grade classification using the LeNet model on T1-Gd MRI scans and reported a test accuracy of 0.833 (20), whereas the present study found that several DL models performed well in multi-class WHO grading of meningioma. In the broader multi-class literature, Tummala et al. employed a ViT model to classify three types of brain tumor (glioma, pituitary tumor, and meningioma) and reported an overall test accuracy of 0.982 (26). Kakarla et al. utilized a CNN model for the same three-class task, reporting an overall test accuracy of 0.974 (27). In addition, Srinivasan et al. investigated CNN, VGG, and ResNet models for five-class brain tumor classification (glioma, metastasis, benign tumor, pituitary tumor, and meningioma); the reported accuracies were 0.938 for CNN, 0.891 for VGG, and 0.764 for ResNet, highlighting performance variation across architectures (28).
The superior performance of the ResNet architecture in the present study, especially in classifying WHO grade 1 and 2 meningiomas, can be attributed to several architectural advantages that are particularly well suited to MRI-based brain tumor classification. ResNet introduces skip (residual) connections, which address the degradation problem seen in very deep networks by allowing the model to learn residual functions instead of direct mappings. This improves gradient flow and enables the training of deeper models, which is crucial for learning complex feature hierarchies (27,28). In brain MRI classification, subtle differences across meningioma grades, such as peritumoral edema, necrosis, or irregular borders, may span different image regions and require deep contextual understanding. ResNet's depth and ability to learn hierarchical representations make it particularly adept at capturing such multi-level spatial dependencies (29). Its strong generalization across WHO grades 1 and 2, where tumor appearance can vary due to cellular atypia, mitotic figures, or invasion, may stem from this structural advantage.
Furthermore, the VGG architecture allows for progressive feature abstraction, capturing both low- and mid-level visual patterns, such as edges, shapes, and textures, reliably and without excessive complexity (30,31). In meningiomas, distinguishing WHO grade 1 from grade 2 tumors often depends on subtle differences in enhancement pattern, margins, and heterogeneity, which VGG is well suited to capture.
Clinically, accurate preoperative grading is critical for neurosurgical decision-making, especially when high-grade tumors may warrant aggressive resection or adjuvant therapy (32). The demonstrated ability of DL models to predict WHO grade 2 tumors with reasonable accuracy could assist in planning surgical approaches, anticipating recurrence risk, and optimizing patient counseling. However, the inability to detect grade 3 tumors limits current clinical utility and suggests that imaging alone may not capture all discriminative features, possibly due to histological heterogeneity or the absence of overt radiological correlates in aggressive subtypes (33,34). One clinical challenge in WHO grading of meningiomas is that radiologic differentiation between grades can substantially influence preoperative surgical planning; for instance, a tumor may be difficult to access surgically, or vascular encasement may make resection more difficult (35). Accurately predicting tumor grade before surgery is essential for determining the surgical approach, anticipating intraoperative risks, and balancing the trade-off between neurological preservation and maximal tumor resection (2). In real-world scenarios, neurosurgeons must weigh the potential benefit of achieving gross total resection for low- and high-grade meningiomas against the risk of complications, including cranial nerve deficits, brain edema, or vessel injury (36,37). Recent studies have demonstrated the potential of DL in addressing these challenges; for example, Tunthanathip et al. used DL models to preoperatively classify germinoma and non-germinoma in pineal region tumors, aiding in surgical planning and treatment stratification (8).
Several limitations should be acknowledged. Firstly, WHO grade 1 meningiomas are much more common than grades 2 and 3, resulting in class imbalance for high-grade tumors. This imbalance can cause the model to favor the majority class, decreasing sensitivity and predictive accuracy for high-grade tumors (34,35). To better represent performance across imbalanced classes, we used the F1-score and PR curves with APs in the present study (25,38,39). Future studies could develop larger multi-center datasets with more balanced class distributions, or ensure adequate representation of WHO grades 2 and 3 through focused data collection. Secondly, the study relied on imaging data from a single institution, which may limit the model's applicability to other neuroimaging scanners, protocols, or populations; multi-center validation is essential for clinical translation (40,41). Thirdly, the retrospective design may reduce performance when DL models are deployed in real-world settings. Prospective evaluation in clinical settings is therefore required to determine the true utility and temporal validity of the models (33).
Future investigations of external validation and generalizability should therefore be conducted as multi-center, cross-institutional studies to confirm robustness and reduce overfitting to local imaging characteristics (8). Moreover, because WHO grading incorporates mitotic index, brain invasion, and genetic features (42), future research should investigate integrating imaging with molecular data to improve predictive accuracy. Additionally, given its high performance in distinguishing WHO grade 1 from grade 2 meningiomas, the ResNet architecture holds considerable potential as a decision-support tool in clinical workflows. One practical application is a web-based tool that allows physicians to upload preoperative MRI images and obtain an automated tumor grade prediction (25,40); this could serve as an adjunct to radiological assessment and help guide surgical planning.
Conclusions
ResNet and VGG architectures showed excellent performance for automated grading of WHO grade 1 and 2 meningiomas from preoperative MRI. These findings suggest their potential as decision-support tools in neurosurgical planning. However, detection of grade 3 tumors remains a major challenge. Future studies should prioritize balanced datasets, external validation, and integration into web-based applications to support real-world clinical deployment.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-153/rc
Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-153/dss
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-153/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-153/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This research was approved by the Human Research Ethics Committee of the Faculty of Medicine, Prince of Songkla University (REC.67-412-10-1). As a retrospective analysis, the study did not require informed patient consent.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Yarabarla V, Mylarapu A, Han TJ, et al. Intracranial meningiomas: an update of the 2021 World Health Organization classifications and review of management with a focus on radiation therapy. Front Oncol 2023;13:1137849. [Crossref] [PubMed]
- Bumrungrachpukdee P, Pruphetkaew N, Phukaoloun M, et al. Recurrence of intracranial meningioma after surgery: analysis of influencing factors and outcome. J Med Assoc Thai 2014;97:399-406.
- Ammendola S, Bariani E, Eccher A, et al. The histopathological diagnosis of atypical meningioma: glass slide versus whole slide imaging for grading assessment. Virchows Arch 2021;478:747-56. [Crossref] [PubMed]
- Rogers CL, Perry A, Pugh S, et al. Pathology concordance levels for meningioma classification and grading in NRG Oncology RTOG Trial 0539. Neuro Oncol 2016;18:565-74. [Crossref] [PubMed]
- Priya S, Liu Y, Ward C, et al. Machine learning based differentiation of glioblastoma from brain metastasis using MRI derived radiomics. Sci Rep 2021;11:10478. [Crossref] [PubMed]
- Bo L, Zhang Z, Jiang Z, et al. Differentiation of Brain Abscess From Cystic Glioma Using Conventional MRI Based on Deep Transfer Learning Features and Hand-Crafted Radiomics Features. Front Med (Lausanne) 2021;8:748144. [Crossref] [PubMed]
- Jaruenpunyasak J, Duangsoithong R, Tunthanathip T. Deep learning for image classification between primary central nervous system lymphoma and glioblastoma in corpus callosal tumors. J Neurosci Rural Pract 2023;14:470-6. [Crossref] [PubMed]
- Tunthanathip T, Kaewborisutsakul A, Supbumrung S. Comparative analysis of deep learning architectures for performance of image classification in pineal region tumors. J Med Artif Intell 2025;8:13.
- Noori Mirtaheri P, Akhbari M, Najafi F, et al. Performance of deep learning models for automatic histopathological grading of meningiomas: a systematic review and meta-analysis. Front Neurol 2025;16:1536751. [Crossref] [PubMed]
- Shwetha V, Madhavi CHR. MR image-based brain tumor classification with deep learning neural networks. WSEAS Transactions on Systems and Control 2022;17:193-200.
- Anita JN, Kumaran S. A Deep Learning Architecture for Meningioma Brain Tumor Detection and Segmentation. J Cancer Prev 2022;27:192-8. [Crossref] [PubMed]
- Srinivasan S, Francis D, Mathivanan SK, et al. A hybrid deep CNN model for brain tumor image multi-classification. BMC Med Imaging 2024;24:21. [Crossref] [PubMed]
- Mahmoud A, Awad NA, Alsubaie N, et al. Advanced deep learning approaches for accurate brain tumor classification in medical imaging. Symmetry 2023;15:571.
- Gurunathan A, Krishnan B. A hybrid CNN-GLCM classifier for detection and grade classification of brain tumor. Brain Imaging Behav 2022;16:1410-27. [Crossref] [PubMed]
- Holleczek B, Zampella D, Urbschat S, et al. Incidence, mortality and outcome of meningiomas: A population-based study from Germany. Cancer Epidemiol 2019;62:101562. [Crossref] [PubMed]
- Kowalchuk RO, Shepard MJ, Sheehan K, et al. Treatment of WHO Grade 2 Meningiomas With Stereotactic Radiosurgery: Identification of an Optimal Group for SRS Using RPA. Int J Radiat Oncol Biol Phys 2021;110:804-14. [Crossref] [PubMed]
- Tosefsky K, Rebchuk AD, Wang JZ, et al. Grade 3 meningioma survival and recurrence outcomes in an international multicenter cohort. J Neurosurg 2024;140:393-403. [Crossref] [PubMed]
- Lemée JM, Corniola MV, Da Broi M, et al. Extent of Resection in Meningioma: Predictive Factors and Clinical Implications. Sci Rep 2019;9:5944. [Crossref] [PubMed]
- Musigmann M, Akkurt BH, Krähling H, et al. Analysis of the Predictability of Postoperative Meningioma Resection Status Based on Clinical Features. Cancers (Basel) 2024;16:3751. [Crossref] [PubMed]
- Zhu H, Fang Q, He H, et al. Automatic Prediction of Meningioma Grade Image Based on Data Amplification and Improved Convolutional Neural Network. Comput Math Methods Med 2019;2019:7289273. [Crossref] [PubMed]
- Pachetti E, Colantonio S. 3D-Vision-Transformer Stacking Ensemble for Assessing Prostate Cancer Aggressiveness from T2w Images. Bioengineering (Basel) 2023;10:1015. [Crossref] [PubMed]
- Ashames MMA, Demir A, Gerek ON, et al. Are deep learning classification results obtained on CT scans fair and interpretable? Phys Eng Sci Med 2024;47:967-79. [Crossref] [PubMed]
- Packhäuser K, Gündel S, Münster N, et al. Deep learning-based patient re-identification is able to exploit the biometric nature of medical chest X-ray data. Sci Rep 2022;12:14851. [Crossref] [PubMed]
- Demler OV, Pencina MJ, D'Agostino RB Sr. Misuse of DeLong test to compare AUCs for nested models. Stat Med 2012;31:2577-87. [Crossref] [PubMed]
- Tunthanathip T, Phuenpathom N, Jongjit A. Web-based calculator using machine learning to predict intracranial hematoma in geriatric traumatic brain injury. J Hosp Manag Health Policy 2023;7:16.
- Tummala S, Kadry S, Bukhari SAC, et al. Classification of Brain Tumor from Magnetic Resonance Imaging Using Vision Transformers Ensembling. Curr Oncol 2022;29:7498-511. [Crossref] [PubMed]
- Kakarla J, Isunuri BV, Doppalapudi KS, et al. Three-class classification of brain magnetic resonance images using average-pooling convolutional neural network. Int J Imaging Syst Technol 2021;31:1731-40.
- Boucherit I, Kheddar H. Reinforced Residual Encoder–Decoder Network for Image Denoising via Deeper Encoding and Balanced Skip Connections. Big Data Cogn Comput 2025;9:82.
- Bilotta G, Bibbo L, Meduri GM, et al. Deep Learning Innovations: ResNet Applied to SAR and Sentinel-2 Imagery. Remote Sens 2025;17:1961.
- Grigorian AA. Understanding VGG Neural Networks: Architecture and Implementation [Internet]. Medium. 2020 Dec 1 [cited 2025 Aug 8]. Available online: https://thegrigorian.medium.com/understanding-vgg-neural-networks-architecture-and-implementation-400d99a9e9ba
- Hamano Y, Nagasaka S, Shouno H. A multi-scale vision transformer network for brain tumor segmentation. Neural Networks 2023;168:300-12. [Crossref] [PubMed]
- Tunthanathip T, Oearsak T. Comparison of predicted survival curves and personalized prognosis among cox regression and machine learning approaches in glioblastoma. J Med Artif Intell 2023;6:10.
- Jitchanvichai J, Tunthanathip T. Cost-effectiveness of intracranial pressure monitoring in severe traumatic brain injury in Southern Thailand. Acute Crit Care 2025;40:69-78. [Crossref] [PubMed]
- Tunthanathip T, Kaewborisutsakul A, Supbumrung S. Comparative analysis of deep learning architectures for performance of image classification in pineal region tumors. J Med Artif Intell 2025;8:1.
- Schwartz TH, McDermott MW. The Simpson grade: abandon the scale but preserve the message. J Neurosurg 2021;135:488-95. [Crossref] [PubMed]
- Brokinkel B, Spille DC, Brokinkel C, et al. The Simpson grading: defining the optimal threshold for gross total resection in meningioma surgery. Neurosurg Rev 2021;44:1713-20. [Crossref] [PubMed]
- Salunke P, Singh A, Kamble R, et al. Vascular involvement in anterior clinoidal meningiomas: Biting the 'artery' that feeds. Clin Neurol Neurosurg 2019;184:105413. [Crossref] [PubMed]
- Kaewborisutsakul A, Tunthanathip T. Development and internal validation of a nomogram for predicting outcomes in children with traumatic subdural hematoma. Acute Crit Care 2022;37:429-37. [Crossref] [PubMed]
- Singh J, Beeche C, Shi Z, et al. Batch-balanced focal loss: a hybrid solution to class imbalance in deep learning. J Med Imaging (Bellingham) 2023;10:051809. [Crossref] [PubMed]
- Tunthanathip T, Sae-Heng S, Oearsakul T, et al. Economic impact of a machine learning-based strategy for preparation of blood products in brain tumor surgery. PLoS One 2022;17:e0270916. [Crossref] [PubMed]
- Taweesomboonyat T, Kaewborisutsakul A, Tunthanathip T, et al. Necessity of in-hospital neurological observation for mild traumatic brain injury patients with negative computed tomography brain scans. J Health Sci Med Res 2020;38:267-74.
- Olar A, Wani KM, Sulman EP, et al. Mitotic Index is an Independent Predictor of Recurrence-Free Survival in Meningioma. Brain Pathol 2015;25:266-75. [Crossref] [PubMed]
Cite this article as: Netwong S, Tunthanathip T, Jaruenpunyasak J, Thuptimdang W. Deep learning architectures for multi-class World Health Organization grading of meningiomas. J Med Artif Intell 2026;9:24.