Design and evaluation of transfer learning models for glioma grade prediction
Highlight box
Key findings
• Attention-based fusion significantly improves patient-level glioma grade classification across diverse convolutional neural network (CNN) architectures.
• DenseNet201 with soft voting achieved the best overall performance (F1-score: 0.902, recall: 0.976).
• Lightweight models like NASNetMobile showed near-parity with larger models while using significantly fewer parameters.
• RadImageNet-pretrained models underperformed, suggesting domain-specific pretraining may not always generalize well to magnetic resonance imaging (MRI) tasks.
What is known and what is new?
• Prior studies have used deep learning for glioma classification from MRI, but typically rely on single models or standard majority voting across slices.
• This study systematically evaluates 16 CNN architectures with three aggregation strategies, demonstrating that attention-based fusion offers consistent, statistically significant gains in F1-score. It also reveals that compact models can match or outperform deeper architectures when paired with adaptive fusion.
What is the implication, and what should change now?
• The findings show that carefully chosen aggregation strategies, particularly attention-based methods, are critical for enhancing model performance in patient-level glioma grading. The success of efficient architectures opens up possibilities for faster and more scalable clinical deployment.
• Future glioma classification systems should move beyond traditional majority voting and incorporate attention-based aggregation. Clinical applications should consider lightweight models like NASNetMobile for real-time inference, and reevaluate reliance on medical-domain pretraining unless modality alignment is ensured.
Introduction
Gliomas are the most prevalent malignant brain tumors in adults, originating from glial cells that provide structural and metabolic support to neurons within the central nervous system (CNS) (1). Based on histopathological features, gliomas are categorized into low-grade (less aggressive) and high-grade (more aggressive) tumors, with the grade directly influencing prognosis and therapeutic strategies (2,3). High-grade gliomas (HGGs), particularly glioblastoma multiforme (GBM), are associated with dismal survival rates despite aggressive interventions, including surgery, radiotherapy, and chemotherapy (1,4-7). Patients diagnosed with GBM typically have a median survival of only 12–16 months following diagnosis (6,8).
Currently, glioma grading is primarily performed via histopathological examination of tissue obtained through invasive biopsy procedures (9). While this approach remains the gold standard, it is time-intensive, costly, and subject to inter-observer variability. Consequently, there is a growing need for non-invasive, automated, and accurate alternatives for glioma grade prediction, with magnetic resonance imaging (MRI) emerging as a critical modality due to its superior soft-tissue contrast and multi-parametric capabilities (10-14).
Traditional machine learning (ML) methods have been employed to predict glioma grades using radiomic features extracted from MRI. Skogen et al. achieved an area under the curve (AUC) of 0.91 using histogram-based texture analysis (15), while Tian et al. and Kumar et al. reported accuracies exceeding 97% using support vector machines and random forests applied to handcrafted features (16,17). However, these approaches depend heavily on manual feature engineering, which may fail to capture the intricate, hierarchical patterns embedded in medical images.
Deep learning (DL), particularly convolutional neural networks (CNNs), has emerged as a powerful alternative due to its ability to learn discriminative features directly from raw image data (18,19). Several studies have demonstrated the promise of CNNs in glioma grading. Ertosun and Rubin achieved 96% accuracy using a CNN-based approach (20), while Anaraki et al. combined CNNs with genetic algorithms to reach 90.9% accuracy (21).
Despite their success, training CNNs from scratch often requires large, annotated datasets that are scarce in medical imaging. Transfer learning offers a powerful alternative by leveraging models pre-trained on large-scale datasets such as ImageNet or domain-specific datasets like RadImageNet. These pre-trained models encode generic visual features that can be fine-tuned to target tasks with relatively little data, enabling better generalization, faster convergence, and reduced risk of overfitting. In glioma grading, transfer learning is particularly beneficial because it enables the use of deep, high-capacity models without the need for millions of labeled medical images. In recent years, pre-trained architectures such as ResNet, Inception, and EfficientNet have been effectively adapted to brain tumor classification tasks (22,23). Yang et al. (23) achieved 90% accuracy using a transfer learning-based method, while Lo et al. (24) demonstrated that transfer learning with data augmentation can outperform both handcrafted methods and CNNs trained from scratch. Hao et al. (25) further advanced this approach by integrating active learning to reduce annotation costs while improving accuracy. These findings collectively underscore the potential of transfer learning to enhance glioma classification, particularly when data availability is limited.
In this work, we present a predictive modeling pipeline that includes both the development and validation of deep transfer learning models for glioma grade classification. Our goal is to systematically evaluate and compare multiple pre-trained CNN architectures, along with different patient-level aggregation strategies, to identify optimal configurations for clinical deployment. Our evaluation spans multiple categories:
- Lightweight models: ResNet50, MobileNetV2, EfficientNetB0, NASNetMobile;
- Mid-size models: DenseNet121, DenseNet201, InceptionV3;
- High-performance models: ResNet101, ResNet152, EfficientNetB3, EfficientNetB4, ConvNeXt;
- Radiology-specific models (RadImageNet-pretrained): DenseNet121, InceptionResNetV2, InceptionV3, ResNet50.
We also investigate three ensemble-based prediction strategies (majority voting, soft voting, and attention-based fusion) to aggregate slice-level outputs into patient-level classifications. Our comprehensive analysis provides new insights into the performance trade-offs of different model families and fusion techniques, contributing to the development of accurate, non-invasive tools for glioma grading in clinical settings.
Imaging data description
The dataset used in this work is obtained from the Brain Tumor Segmentation Challenge 2020 (BraTS’20) (15,26-28). It consists of pre-operative, multimodal MRI scans from 369 patients, including 293 with HGGs and 76 with low-grade gliomas (LGGs). The imaging data were retrospectively collected from 19 institutions between 2001 and 2018, with follow-up periods varying by institutional protocols. Since this dataset contains fully de-identified imaging data with no human subjects directly involved, this study is exempt from institutional ethical approval.
Each patient scan includes four MRI sequences; however, our study used only contrast-enhanced T1-weighted (T1CE), T2-weighted (T2), and T2 fluid-attenuated inversion recovery (FLAIR). These sequences provide superior contrast for tumor boundary delineation and edema characterization, which are crucial for accurate glioma grading. We excluded the T1 sequence based on quantitative evidence from prior studies (29,30). In brain metastasis segmentation, models using T1 alone achieved a median Dice similarity coefficient (DSC) of 0.70, whereas T1CE-only models reached 0.96, a substantial 26-point performance gap (29). Likewise, in glioma segmentation, using only T1CE, T2, and FLAIR achieved performance equivalent to the full four-sequence models while reducing scan time by approximately 33% (30). These results support the exclusion of T1 in favor of more informative sequences. All images in the BraTS dataset are co-registered to a common anatomical space, skull-stripped, and interpolated to a uniform resolution of 1 mm³. Additionally, tumor subregions, including enhancing tumor, peritumoral edema, and necrotic core, are manually segmented by clinical experts following standard annotation protocols (27,31).
A key strength of the BraTS dataset is its diversity. MRI scans were acquired from 19 different institutions using a range of scanners and imaging protocols. This heterogeneity makes the dataset well-suited for training machine learning models that generalize across different clinical settings.
Figure 1 illustrates representative axial slices from two glioma cases, one LGG and one HGG, across the three MRI sequences: T1CE, T2, and FLAIR. In the LGG case (top row), the tumor appears less aggressive, with limited peritumoral edema. In contrast, the HGG case (bottom row) shows a large, heterogeneously enhancing mass with extensive edema and mass effect. These visual distinctions highlight the complementary nature of different MRI sequences and their importance in capturing tumor heterogeneity and progression.
Preprocessing
A detailed overview of the study design, patient inclusion criteria, preprocessing steps, and model development pipeline is provided in Figure 2. The initial cohort included 369 patients from the BraTS 2020 dataset. Inclusion criteria for this study were: (I) availability of all three required MRI sequences (T1CE, T2, and FLAIR); (II) confirmed diagnosis of HGG or LGG as annotated in the BraTS 2020 dataset; and (III) presence of complete segmentation labels for tumor subregions. Exclusion criteria included: (I) scans affected by severe motion artifacts or poor image quality; (II) corrupted or unreadable image files; and (III) missing one or more of the required MRI sequences. Following a manual quality review, 12 HGG cases were excluded based on these criteria, yielding a final dataset of 357 patients, comprising 281 HGG and 76 LGG cases. To prevent data leakage, we performed a stratified patient-level split of the full dataset (N=357) into 80% training (n=286) and 20% testing (n=71). The training set was further divided into 80% training (n=229) and 20% validation (n=57) subsets. The validation set was used for early stopping. This consistent and balanced splitting strategy ensured a fair comparison across all 16 model architectures and robust model evaluation while limiting overfitting and computational overhead.
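For illustration, the patient-level split can be reproduced with scikit-learn; the snippet below is a minimal sketch assuming patient identifiers and grade labels are held in parallel lists (the identifiers, variable names, and random seed are ours, not from the study's code).

```python
from sklearn.model_selection import train_test_split

# Hypothetical cohort: 281 HGG and 76 LGG patients, as in the curated dataset.
patient_ids = [f"BraTS20_{i:03d}" for i in range(357)]
grades = ["HGG"] * 281 + ["LGG"] * 76

# 80/20 split at the patient level, stratified by grade, so that no
# patient's slices can appear in both training and test sets.
train_ids, test_ids, y_train, y_test = train_test_split(
    patient_ids, grades, test_size=0.20, stratify=grades, random_state=42)

# A further 80/20 split of the training patients yields the validation
# set used for early stopping (sizes match the paper's up to rounding).
train_ids, val_ids, y_train, y_val = train_test_split(
    train_ids, y_train, test_size=0.20, stratify=y_train, random_state=42)
```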
Given that many slices within a volumetric MRI scan may contain minimal or no visible tumor, we employed a slice selection strategy to retain only the most informative slices. Specifically, for each patient, we utilized pre-identified start and end indices corresponding to the range of axial slices that contained visible tumor regions across the sequences. Slices outside this range were discarded. After slice selection, the dataset comprised 13,868 HGG slices and 3,626 LGG slices, drawn from 281 HGG and 76 LGG patients, respectively. This step reduced background noise and ensured that the training process focused on tumor-relevant anatomy. The dataset is imbalanced, with HGG cases outnumbering LGG cases by approximately 3.7 to 1. No data augmentation or under-sampling of HGG cases was applied, as reducing samples from this clinically important class could hinder generalization. Instead, we addressed class imbalance through stratified splits and by evaluating models with metrics such as F1-score, precision, and recall, which are more informative than accuracy in imbalanced settings.
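A minimal sketch of this slice selection step follows; the function name and the dummy volume geometry (155 axial slices of 240×240 voxels, the native BraTS dimensions) are illustrative assumptions.

```python
import numpy as np

def select_tumor_slices(volume, start_idx, end_idx):
    """Keep only the axial slices within the pre-identified tumor-bearing
    range; slices outside [start_idx, end_idx] are discarded."""
    return volume[start_idx:end_idx + 1]

# Dummy volume: 155 axial slices of 240x240 voxels (native BraTS geometry).
vol = np.random.rand(155, 240, 240).astype(np.float32)
kept = select_tumor_slices(vol, start_idx=60, end_idx=95)
print(kept.shape)  # (36, 240, 240)
```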
For each patient, we extracted corresponding 2D axial slices from the T1CE, T2, and FLAIR sequences and combined them by stacking along the channel dimension, resulting in a three-channel input per slice. All models were trained on these stacked, multi-sequence inputs. We did not train separate models for each individual sequence, as our goal was to leverage the complementary information across sequences within a unified model architecture. Depending on the input requirements of each model architecture, the slices were resized using bilinear interpolation to one of the following standard input dimensions: 224×224, 299×299, or 300×300 pixels. Finally, all pixel intensity values were normalized to the [0, 1] range using min–max scaling applied at the slice level, where each 2D input was normalized independently using its own minimum and maximum pixel values to emphasize local contrast while ensuring numerical stability during model training. Sixteen pre-trained CNNs spanning four architecture families (lightweight, mid-size, high-performance, and RadImageNet-pretrained) were trained using 2D axial slices. Patient-level predictions were generated using three aggregation strategies: majority voting, soft voting, and attention-based fusion. Evaluation was based on accuracy, precision, recall, F1-score, and 95% confidence intervals.
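The per-slice preprocessing can be sketched as follows; we assume OpenCV for bilinear resizing, and apply min-max scaling to each sequence channel independently, which is one plausible reading of the slice-level normalization described above.

```python
import numpy as np
import cv2  # OpenCV, used here for bilinear interpolation

def preprocess_slice(t1ce, t2, flair, size=224):
    """Stack co-registered T1CE/T2/FLAIR slices into one 3-channel input,
    resize bilinearly, and min-max scale each channel independently."""
    channels = []
    for s in (t1ce, t2, flair):
        s = cv2.resize(s.astype(np.float32), (size, size),
                       interpolation=cv2.INTER_LINEAR)
        lo, hi = float(s.min()), float(s.max())
        channels.append((s - lo) / (hi - lo + 1e-8))  # epsilon for flat slices
    return np.stack(channels, axis=-1)  # shape: (size, size, 3)
```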
Methods
We evaluated 16 pre-trained CNN architectures for glioma grade classification using multi-sequence brain MRI. Each model was trained on slice-level data from the BraTS 2020 dataset, and its performance was assessed using a held-out test set.
Model architectures
To evaluate the impact of transfer learning on glioma grade classification from MRI, we explored a diverse set of CNN architectures spanning four categories: lightweight, mid-size, high-performance, and RadImageNet-pretrained models. Model selection within each category was based on architectural diversity, demonstrated effectiveness in medical imaging, and variation in parameter size and computational cost. Lightweight models offer efficiency for resource-constrained settings; mid-size models balance accuracy and complexity; high-performance models reflect state-of-the-art depth and capacity; and RadImageNet-pretrained models enable evaluation of domain-specific transfer learning. This stratified approach enabled a comprehensive comparison of performance across architectures with differing design goals and computational requirements.
Lightweight models
Lightweight models are designed to offer fast inference and reduced computational overhead, making them well-suited for deployment in resource-constrained environments. While these architectures have fewer parameters and lower capacity, they are effective in many medical imaging tasks when paired with transfer learning.
ResNet50 (32): a widely adopted CNN architecture that introduces residual skip connections to ease the training of deep networks and mitigate vanishing gradients.
MobileNetV2 (33): employs depthwise separable convolutions and inverted residual bottlenecks, significantly reducing the number of computations while maintaining performance.
EfficientNetB0 (34): introduces a compound scaling method that uniformly scales depth, width, and resolution using a fixed set of scaling coefficients.
NASNetMobile (35): a mobile-friendly architecture discovered using neural architecture search (NAS), designed to optimize accuracy under parameter and latency constraints.
The input images were resized to 224×224 to match the model requirements, and these models served as efficient baselines for comparison with their larger counterparts.
Mid-size models
Mid-size models provide a balance between architectural depth and computational feasibility, often delivering strong performance without the resource demands of very deep networks.
DenseNet121 and DenseNet201 (36): utilize dense connectivity patterns where each layer receives input from all preceding layers, promoting feature reuse and reducing the number of parameters.
InceptionV3 (37): applies multi-scale processing using parallel convolutional filters of different sizes, factorized convolutions, and aggressive dimensionality reduction to improve computational efficiency.
These architectures were particularly selected for their proven utility in medical imaging classification and segmentation tasks.
High-performance models
This group consists of deeper and more complex networks with greater representational power, often achieving state-of-the-art results on large-scale image classification benchmarks.
ResNet101 and ResNet152 (32): deeper versions of ResNet50 that allow hierarchical feature learning through additional residual blocks.
EfficientNetB3 and EfficientNetB4 (34): employ the same compound scaling strategy as EfficientNetB0, but with increased depth and resolution (300×300 and 380×380, respectively), enabling better feature extraction.
ConvNeXt-Base (38): a modernized CNN architecture inspired by vision transformers. It replaces traditional convolutions with larger kernels, incorporates layer normalization, and uses Gaussian Error Linear Unit (GELU) activation functions to improve learning dynamics.
These high-capacity models were included to evaluate whether increased depth and representational complexity translate to better glioma classification performance.
RadImageNet-pretrained models
To explore the benefits of domain-specific transfer learning, we evaluated models pretrained on RadImageNet, a large-scale radiology dataset containing over 1.3 million annotated medical images across multiple modalities (39). These models are hypothesized to offer improved feature generalization for medical imaging tasks compared to standard ImageNet-pretrained networks.
The RadImageNet-based models utilized in this work include: RadImageNet-DenseNet121, RadImageNet-InceptionV3, RadImageNet-InceptionResNetV2, and RadImageNet-ResNet50.
Each of these architectures was fine-tuned using the same preprocessing and slice selection pipeline applied to their ImageNet counterparts, ensuring a fair comparison.
Model customization
For all models, the original classification head was removed and replaced with a custom head consisting of a global average pooling layer, a dense layer with 128 ReLU-activated units, and a dropout layer with a 0.5 rate to reduce neuron co-adaptation. A final dense layer with a single sigmoid-activated neuron was used for binary classification (HGG vs. LGG).
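The study does not name its implementation framework; assuming TensorFlow/Keras, the head replacement described above might look like the sketch below, with DenseNet201 as an example backbone (the He normal initialization anticipates the training procedure described next).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(224, 224, 3)):
    # Pre-trained backbone with the original classification head removed.
    base = tf.keras.applications.DenseNet201(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(128, activation="relu",
                     kernel_initializer="he_normal")(x)  # He normal init
    x = layers.Dropout(0.5)(x)                           # reduce co-adaptation
    out = layers.Dense(1, activation="sigmoid")(x)       # HGG vs. LGG
    return models.Model(inputs=base.input, outputs=out)
```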
Training procedure
All models were trained using a standardized pipeline to ensure fair comparison across architectures and aggregation strategies. Binary classification at the slice level was supervised with the binary cross-entropy loss, matching the single sigmoid output of the classification head. All models were trained using the Adam optimizer, which adapts learning rates for each parameter to stabilize convergence. For all models, the custom classification head was initialized using He normal initialization to facilitate stable gradient flow during training. To mitigate overfitting, early stopping and dropout were applied: training was halted if the F1-score on the validation set did not improve for 5 consecutive epochs. For each model, three aggregation strategies (majority voting, soft voting, attention-based fusion) were independently evaluated on the trained model's slice-level predictions.
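Continuing the sketch above, training with Adam and F1-based early stopping might look as follows. The learning rate, batch size, and epoch cap are illustrative assumptions not stated in the paper, and `tf.keras.metrics.F1Score` requires TensorFlow 2.13 or later; the data arrays are placeholders.

```python
import numpy as np
import tensorflow as tf

model = build_model()  # from the customization sketch above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # LR assumed
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.F1Score(average="micro", threshold=0.5,
                                      name="f1")],
)

# Early stopping: halt when validation F1 fails to improve for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_f1", mode="max", patience=5, restore_best_weights=True)

# Placeholder data; labels must be shaped (n, 1) for F1Score.
x_train = np.zeros((8, 224, 224, 3), np.float32)
y_train = np.zeros((8, 1), np.float32)
x_val, y_val = x_train, y_train

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop])
```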
Patient-level aggregation
Glioma grading is fundamentally a patient-level (3D) task, while our models operate on individual 2D axial slices. Since tumor characteristics, such as size, location, and shape, can vary across slices, it is important to capture this broader 3D context. To achieve this, we used three aggregation strategies to combine slice-level predictions into a single patient-level decision: majority voting, soft voting, and attention-based fusion. These methods allow the model to account for variations across slices while maintaining the computational efficiency of 2D processing. All aggregation strategies were applied consistently across model architectures to ensure fair and comparable evaluation.
Majority voting
In this simple yet widely used method, the model independently predicts a class label for each 2D slice. The final patient-level label is determined by the majority class among the predicted slice labels. In cases where the numbers of predicted high-grade and low-grade slices are equal, the patient is assigned the high-grade label to reflect clinical caution:

$$\hat{y}_{\text{patient}} = \begin{cases} 1 \ (\text{HGG}), & \text{if } \sum_{i=1}^{N} \hat{y}_i \ge N/2 \\ 0 \ (\text{LGG}), & \text{otherwise} \end{cases}$$

where $\hat{y}_i \in \{0, 1\}$ is the predicted label for the $i$-th slice and $N$ is the total number of slices used for that patient.
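The rule above reduces to a few lines of code; this sketch is ours and mirrors the tie-breaking convention described in the text.

```python
import numpy as np

def majority_vote(slice_labels):
    """Patient-level label from per-slice hard predictions (1 = HGG, 0 = LGG).
    A tie is resolved in favor of HGG, reflecting clinical caution."""
    votes = np.asarray(slice_labels)
    return int(votes.sum() >= len(votes) / 2)

print(majority_vote([1, 0, 1, 0]))  # tie -> 1 (HGG)
```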
Soft voting
Instead of hard classification labels, soft voting uses the predicted class probabilities for each slice.
The final patient prediction is based on the average softmax probabilities across all slices, and the class with the highest mean probability is chosen:

$$\hat{y}_{\text{patient}} = \arg\max_{c} \frac{1}{N} \sum_{i=1}^{N} p_i^{(c)}$$

where $p_i$ is the predicted softmax probability vector for slice $i$, $p_i^{(c)}$ is its component for class $c$, and $N$ is the number of selected slices for the patient.
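A corresponding sketch for the binary case (our simplification, working directly with the per-slice HGG probability rather than a full probability vector):

```python
import numpy as np

def soft_vote(slice_probs, threshold=0.5):
    """Average the per-slice HGG probabilities and threshold the mean."""
    p_mean = float(np.mean(slice_probs))
    return int(p_mean >= threshold), p_mean

print(soft_vote([0.9, 0.4, 0.8]))  # (1, 0.7)
```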
Attention-based fusion
In addition to majority and soft voting, we implemented an attention fusion strategy to improve patient-level prediction by assigning dynamic importance to individual slices. This method computes attention weights directly from the slice-level prediction probabilities, prioritizing slices with higher model confidence.
Let $p_i$ denote the predicted probability of HGG for the $i$-th slice in a given patient scan, where $i = 1, \ldots, N$ and $N$ is the total number of slices. The attention weight $\alpha_i$ for each slice is computed using a softmax function:

$$\alpha_i = \frac{\exp(p_i)}{\sum_{j=1}^{N} \exp(p_j)}$$
This normalization ensures that all attention weights are non-negative and sum to one. Slices with higher predicted probabilities are thus given greater influence in the final decision.
The patient-level probability is computed as the weighted sum of slice probabilities:

$$p_{\text{patient}} = \sum_{i=1}^{N} \alpha_i \, p_i$$
Finally, the binary class label is determined by thresholding the aggregated probability:

$$\hat{y}_{\text{patient}} = \begin{cases} 1 \ (\text{HGG}), & \text{if } p_{\text{patient}} \ge 0.5 \\ 0 \ (\text{LGG}), & \text{otherwise} \end{cases}$$
This attention-based strategy is intuitive, non-parametric, and computationally efficient. It allows the model to focus more on slices that are individually more predictive of tumor grade while still leveraging the contextual information across the entire tumor volume.
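The full attention-based fusion pipeline, from slice probabilities to patient label, can be sketched as follows; the 0.5 decision threshold is the conventional choice and is assumed here.

```python
import numpy as np

def attention_fusion(slice_probs, threshold=0.5):
    """Softmax over slice-level HGG probabilities yields attention weights;
    the patient probability is the weighted sum of slice probabilities."""
    p = np.asarray(slice_probs, dtype=np.float64)
    weights = np.exp(p) / np.exp(p).sum()   # non-negative, sum to one
    p_patient = float(np.dot(weights, p))   # weighted aggregate probability
    return int(p_patient >= threshold), p_patient

label, prob = attention_fusion([0.95, 0.40, 0.88, 0.10])
print(label, round(prob, 3))  # confident slices dominate the decision
```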
Evaluation metrics
We evaluated model performance at the patient level using standard classification metrics, including accuracy, precision, recall, and F1-score. Given the clinical importance of detecting HGGs, emphasis was placed on recall and F1-score. Reported performance values are point estimates obtained from the test dataset. The confidence intervals followed the same pattern across models and were omitted for simplicity. To provide a more comprehensive assessment, we also included calibration plots to evaluate the agreement between predicted probabilities and observed outcomes. Additionally, decision curve analysis (DCA) was used to assess the clinical utility across a range of threshold probabilities.
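These patient-level metrics and the calibration curve can be computed with scikit-learn; the snippet below is a sketch using placeholder labels and probabilities in place of the study's outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from sklearn.calibration import calibration_curve

# Placeholder patient-level ground truth (1 = HGG) and aggregated probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=71)
y_prob = np.clip(y_true * 0.6 + rng.uniform(0, 0.4, size=71), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# Calibration: observed HGG fraction per predicted-probability bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```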
Ethics
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This work used the publicly available, anonymized BraTS 2020 dataset. No institutional review board approval or informed consent was required.
Results
The final analysis was conducted on 357 patients after excluding 12 HGG cases with severe artifacts or poor image quality. This curated dataset served as the basis for evaluating the performance of 16 CNN architectures across three aggregation strategies.
Performance of lightweight models
Table 1 summarizes the performance of lightweight models. Among the lightweight models, NASNetMobile with attention-based fusion achieved the highest overall performance, with an F1-score of 0.898 and near-perfect recall (0.988). MobileNetV2 also attained perfect recall (1.000) across all aggregation strategies, though at the expense of lower precision (0.787), indicating a tendency to over-predict high-grade cases. ResNet50 maintained balanced performance across strategies (F1-score: 0.870), while EfficientNetB0 showed the largest improvement with attention-based fusion, increasing its F1-score from 0.809 to 0.862. Overall, attention-based fusion consistently enhanced recall and F1-scores compared to majority and soft voting, underscoring its effectiveness for patient-level aggregation in lightweight architectures.
Table 1
| Model | Aggregation strategy | Accuracy | Precision | Recall | F1-score | Train F1-score |
|---|---|---|---|---|---|---|
| ResNet50 | Majority | 0.787 | 0.837 | 0.906 | 0.870 | 0.911 |
| | Soft | 0.787 | 0.837 | 0.906 | 0.870 | 0.909 |
| | Attention | 0.787 | 0.837 | 0.906 | 0.870 | 0.909 |
| MobileNetV2 | Majority | 0.787 | 0.787 | 1.000 | 0.881 | 0.883 |
| | Soft | 0.787 | 0.787 | 1.000 | 0.881 | 0.883 |
| | Attention | 0.787 | 0.787 | 1.000 | 0.881 | 0.883 |
| EfficientNetB0 | Majority | 0.685 | 0.774 | 0.847 | 0.809 | 0.872 |
| | Soft | 0.685 | 0.774 | 0.847 | 0.809 | 0.882 |
| | Attention | 0.759 | 0.786 | 0.953 | 0.862 | 0.876 |
| NASNetMobile | Majority | 0.815 | 0.828 | 0.976 | 0.896 | 0.913 |
| | Soft | 0.815 | 0.828 | 0.976 | 0.896 | 0.917 |
| | Attention | 0.824 | 0.824 | 0.988 | 0.898 | 0.913 |
Performance of mid-size models
The performance of mid-size models is presented in Table 2. DenseNet201 achieved the best overall performance in this group, with an F1-score of 0.902 and recall of 0.976 using both majority and soft voting, and a slightly lower F1-score (0.898) with attention-based fusion. DenseNet121 showed moderate gains with attention-based fusion (F1-score: 0.869, recall: 0.894), while InceptionV3 also benefited from soft and attention-based fusion, both yielding an F1-score of 0.868 and high recall (0.965). Across mid-size models, attention-based fusion generally improved recall and F1-score, with DenseNet201 emerging as the strongest performer.
Table 2
| Model | Aggregation strategy | Accuracy | Precision | Recall | F1-score | Train F1-score |
|---|---|---|---|---|---|---|
| DenseNet121 | Majority | 0.750 | 0.837 | 0.847 | 0.842 | 0.900 |
| | Soft | 0.769 | 0.849 | 0.859 | 0.854 | 0.895 |
| | Attention | 0.787 | 0.844 | 0.894 | 0.869 | 0.903 |
| DenseNet201 | Majority | 0.833 | 0.838 | 0.976 | 0.902 | 0.926 |
| | Soft | 0.833 | 0.838 | 0.976 | 0.902 | 0.930 |
| | Attention | 0.824 | 0.824 | 0.988 | 0.898 | 0.924 |
| InceptionV3 | Majority | 0.759 | 0.786 | 0.953 | 0.862 | 0.914 |
| | Soft | 0.769 | 0.788 | 0.965 | 0.868 | 0.918 |
| | Attention | 0.769 | 0.788 | 0.965 | 0.868 | 0.916 |
Performance of high-performance models
Table 3 reports the performance of high-capacity CNN architectures. ResNet152 with attention-based fusion achieved the highest F1-score in this group (0.886), balancing strong precision (0.857) and recall (0.918). ResNet101 performed consistently well across strategies, reaching an F1-score of 0.885 with majority and soft voting. EfficientNetB4 with attention-based fusion delivered the highest recall (0.988) in the group, though with slightly lower precision. ConvNeXt also demonstrated balanced performance (F1-score: 0.884) with attention-based fusion. Overall, attention-based fusion improved performance for most high-performance models, particularly in enhancing recall for high-grade cases.
Table 3
| Model | Aggregation strategy | Accuracy | Precision | Recall | F1-score | Train F1-score |
|---|---|---|---|---|---|---|
| ResNet101 | Majority | 0.796 | 0.818 | 0.965 | 0.885 | 0.914 |
| | Soft | 0.796 | 0.818 | 0.965 | 0.885 | 0.912 |
| | Attention | 0.787 | 0.812 | 0.941 | 0.872 | 0.912 |
| ResNet152 | Majority | 0.787 | 0.837 | 0.906 | 0.870 | 0.917 |
| | Soft | 0.796 | 0.837 | 0.918 | 0.875 | 0.917 |
| | Attention | 0.815 | 0.857 | 0.918 | 0.886 | 0.915 |
| EfficientNetB3 | Majority | 0.759 | 0.800 | 0.882 | 0.839 | 0.698 |
| | Soft | 0.796 | 0.837 | 0.918 | 0.875 | 0.680 |
| | Attention | 0.796 | 0.837 | 0.918 | 0.875 | 0.742 |
| EfficientNetB4 | Majority | 0.796 | 0.837 | 0.918 | 0.875 | 0.894 |
| | Soft | 0.796 | 0.837 | 0.918 | 0.875 | 0.891 |
| | Attention | 0.787 | 0.792 | 0.988 | 0.880 | 0.894 |
| ConvNeXt | Majority | 0.787 | 0.818 | 0.929 | 0.870 | 0.883 |
| | Soft | 0.796 | 0.818 | 0.941 | 0.875 | 0.883 |
| | Attention | 0.796 | 0.824 | 0.953 | 0.884 | 0.883 |
Performance of RadImageNet-pretrained models
Table 4 summarizes the performance of RadImageNet-pretrained models. Across RadImageNet-pretrained models, InceptionResNetV2 achieved the highest performance, peaking at an F1-score of 0.751 with attention-based fusion. RadImageNet-InceptionV3 and RadImageNet-ResNet50 showed modest gains with soft and attention-based fusion but remained below an F1-score of 0.710. RadImageNet-DenseNet121 underperformed substantially (F1-score: 0.431 with majority voting), highlighting poor generalization to this MRI-based task. These results suggest that, despite medical-domain pretraining, the modality mismatch between RadImageNet’s training data and MRI limited performance compared to ImageNet-pretrained counterparts.
Table 4
| Model | Aggregation strategy | Accuracy | Precision | Recall | F1-score | Train F1-score |
|---|---|---|---|---|---|---|
| Rad-DenseNet121 | Majority | 0.389 | 0.806 | 0.294 | 0.431 | 0.364 |
| | Soft | 0.444 | 0.765 | 0.376 | 0.503 | 0.364 |
| | Attention | 0.481 | 0.749 | 0.435 | 0.550 | 0.442 |
| Rad-InceptionV3 | Majority | 0.639 | 0.786 | 0.612 | 0.689 | 0.715 |
| | Soft | 0.648 | 0.792 | 0.624 | 0.698 | 0.726 |
| | Attention | 0.657 | 0.801 | 0.635 | 0.709 | 0.811 |
| Rad-IncResNetV2 | Majority | 0.685 | 0.803 | 0.671 | 0.731 | 0.000 |
| | Soft | 0.694 | 0.809 | 0.682 | 0.740 | 0.000 |
| | Attention | 0.703 | 0.818 | 0.694 | 0.751 | 0.000 |
| Rad-ResNet50 | Majority | 0.620 | 0.786 | 0.588 | 0.672 | 0.493 |
| | Soft | 0.630 | 0.793 | 0.594 | 0.680 | 0.493 |
| | Attention | 0.648 | 0.805 | 0.612 | 0.695 | 0.620 |
Summary of models
Table 5 summarizes the performance of models from various categories. DenseNet201 with soft voting achieved the highest overall F1-score (0.902) and strong recall (0.976), making it the top-performing single model. EfficientNetB4 paired with attention-based fusion delivered the highest recall (0.988), while ResNet152 with attention-based fusion offered the best precision-recall balance (precision: 0.857, recall: 0.918, F1-score: 0.886). InceptionV3 with soft voting also achieved high recall (0.965) with solid overall performance (F1-score: 0.868). Among lightweight models, NASNetMobile with attention-based fusion matched the performance of much larger architectures (F1-score: 0.898, recall: 0.988), highlighting its suitability for resource-constrained settings. In contrast, RadImageNet-pretrained DenseNet121 performed poorly (F1-score: 0.431), underscoring that domain-specific pretraining does not guarantee improved results when modality mismatch exists. Overall, the findings reinforce the advantages of attention-based fusion and carefully tuned ImageNet-pretrained models for non-invasive glioma grading.
Table 5
| Model | Aggregation strategy | Precision | Recall | F1-score | Train F1-score | Notes |
|---|---|---|---|---|---|---|
| DenseNet201 | Soft | 0.838 | 0.976 | 0.902 | 0.930 | Best overall F1 |
| EfficientNetB4 | Attention | 0.792 | 0.988 | 0.880 | 0.894 | Highest recall |
| ResNet152 | Attention | 0.857 | 0.918 | 0.886 | 0.915 | Balanced metrics |
| InceptionV3 | Soft | 0.788 | 0.965 | 0.868 | 0.918 | Strong recall |
| NASNetMobile | Attention | 0.824 | 0.988 | 0.898 | 0.913 | Compact model |
| Rad-DenseNet121 | Majority | 0.806 | 0.294 | 0.431 | 0.364 | Low generalization |
Comparison of aggregation strategies
Across all 16 evaluated models, attention-based fusion consistently delivered the highest average performance, achieving a mean F1-score of 0.872, compared to 0.861 for soft voting and 0.848 for majority voting. The advantage of attention-based fusion was most evident in models where recall is critical, such as NASNetMobile, EfficientNetB4, and InceptionV3, without substantially compromising precision. Soft voting also outperformed majority voting in most cases, benefiting from probabilistic averaging that captures slice-level confidence, whereas majority voting often oversimplified decision-making and disregarded uncertainty.
To assess whether the differences in F1-score performance across aggregation strategies were statistically significant, we conducted a Friedman test on the test F1-scores of all 16 models under majority voting, soft voting, and attention-based fusion. The test yielded a chi-squared statistic of 11.49 with a P value of 0.0032, indicating a statistically significant difference in performance among the three strategies (P<0.01). This result confirms that the choice of aggregation strategy has a measurable impact on classification performance. Specifically, it supports the observation that attention-based fusion consistently outperforms majority and soft voting in terms of F1-score. These findings highlight the importance of selecting an effective fusion technique when aggregating slice-level predictions for patient-level glioma grading.
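The Friedman test is available in SciPy; the sketch below uses placeholder F1-scores, since only the study's real per-model values reproduce the reported statistic.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Placeholder per-model test F1-scores (16 models x 3 strategies); in the
# study, the real values yielded chi-squared = 11.49 and P = 0.0032.
rng = np.random.default_rng(0)
f1_majority = rng.uniform(0.80, 0.90, size=16)
f1_soft = f1_majority + rng.uniform(0.00, 0.02, size=16)
f1_attention = f1_majority + rng.uniform(0.01, 0.03, size=16)

stat, p = friedmanchisquare(f1_majority, f1_soft, f1_attention)
print(f"chi-squared = {stat:.2f}, P = {p:.4f}")
```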
Model efficiency vs. accuracy
When considering both predictive performance and computational cost, DenseNet201 and NASNetMobile emerged as the most efficient architectures (Figure 3). DenseNet201 achieved the highest F1-score (0.902) with a moderate parameter count (20.2M), while NASNetMobile matched the performance of larger models (F1-score: 0.898) with only 5.3M parameters, making it well-suited for real-time or resource-limited deployment. In contrast, high-capacity models such as ConvNeXt (88.6M parameters) and ResNet152 (60.2M) delivered only marginal performance gains relative to their substantial size. Lightweight models like MobileNetV2 (3.4M) and EfficientNetB0 (5.3M) also achieved competitive results, especially when paired with attention-based fusion, further highlighting the role of aggregation strategies in maximizing efficiency. Notably, RadImageNet-pretrained models, despite large parameter counts, underperformed due to modality mismatch, demonstrating that domain-specific pretraining is not inherently advantageous. These findings emphasize that optimal model selection for glioma grading requires balancing accuracy, recall, and computational efficiency, with attention-based fusion enabling compact architectures to rival or surpass deeper, more resource-intensive networks.
Calibration and DCA
To complement the standard classification metrics, we evaluated the calibration and clinical utility of the best-performing model, DenseNet201 with soft voting, using calibration plots and DCA. The calibration plot (Figure 4, left) demonstrates that the model is well-calibrated, especially in higher predicted probability ranges, with predicted probabilities closely matching observed outcome frequencies. The DCA plot (Figure 4, right) shows that DenseNet201 consistently provides higher net benefit compared to both the “treat all” and “treat none” strategies across a wide range of threshold probabilities (0.1–0.85). These results support the model’s reliability and potential clinical value for non-invasive glioma grading.
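The net benefit underlying the DCA plot follows the standard decision-curve formula, net benefit = TP/n − FP/n × pt/(1 − pt) at threshold probability pt; the sketch below is ours, with placeholder labels and probabilities.

```python
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    """Net benefit at each threshold pt: TP/n - FP/n * pt / (1 - pt)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    nb = []
    for pt in thresholds:
        pred = y_prob >= pt
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        nb.append(tp / n - fp / n * pt / (1 - pt))
    return np.array(nb)

thresholds = np.linspace(0.10, 0.85, 16)
y_true = np.array([1, 1, 0, 1, 0, 1])               # placeholder labels
y_prob = np.array([0.9, 0.7, 0.4, 0.8, 0.2, 0.95])  # placeholder probabilities
model_nb = net_benefit(y_true, y_prob, thresholds)
treat_all_nb = net_benefit(y_true, np.ones_like(y_prob), thresholds)
# "Treat none" has net benefit 0 at every threshold by definition.
```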
Discussion
This study presents a comprehensive development and internal validation of 16 CNN models for glioma grade classification using multi-sequence MRI data. We evaluated four model families (lightweight, mid-size, high-performance, and domain-specific RadImageNet-pretrained models) and compared three aggregation strategies: majority voting, soft voting, and attention-based fusion. Across nearly all architectures, attention-based fusion consistently outperformed other aggregation methods, achieving the highest average F1-score (0.872). The best-performing individual model was DenseNet201 with soft voting (F1-score: 0.902, recall: 0.976), followed closely by NASNetMobile with attention-based fusion (F1-score: 0.898, recall: 0.988). Despite being compact, lightweight models like NASNetMobile demonstrated strong performance, suggesting their suitability for real-time or resource-constrained deployment.
Strengths of this study include: (I) a large-scale comparative analysis of diverse CNN architectures; (II) use of multi-sequence MRI inputs, which better reflect clinical imaging practice; and (III) introduction and systematic evaluation of an attention-based aggregation strategy that dynamically weights slice-level predictions to improve patient-level classification. However, the study has several limitations. First, all results are based on internal validation using a held-out test set from the BraTS 2020 dataset, and no external validation was conducted. This limits conclusions about generalizability across institutions or imaging protocols. Second, although we combined T1CE, T2, and FLAIR modalities, we did not assess the individual contribution of each sequence or explore alternative fusion methods. Third, no interpretability techniques such as gradient-weighted class activation mapping (Grad-CAM) or attention heatmaps were applied, limiting insight into model decision-making. Finally, although all scans in the BraTS dataset were preprocessed, we did not evaluate model robustness to low-quality, noisy, or corrupted images, which is crucial for real-world deployment.
Our results compare favorably to existing DL studies in glioma grading. For instance, Yang et al. (23) reported 90% accuracy using transfer learning with ResNet50 on conventional MR images, while Anaraki et al. (21) achieved 90.9% accuracy using CNNs combined with genetic algorithms. Compared to these studies, our approach offers higher F1-scores and recall, especially when using attention-based aggregation. Moreover, prior works often rely on a single architecture or majority voting, whereas our study provides a broader and more systematic evaluation of model families and fusion strategies. Importantly, our findings challenge the assumption that medical-domain pretraining (e.g., RadImageNet) always outperforms general pretraining (ImageNet), as the former underperformed in our MRI-based task, likely due to the domain shift from CT/X-ray to MRI.
Attention-based fusion improved classification by adaptively weighting slices according to model confidence, amplifying informative inputs and reducing noise from less relevant ones. This is especially important in glioma grading, where tumor heterogeneity means that only a subset of slices captures the most clinically relevant features. Compared to majority voting, which oversimplifies decisions, and soft voting, which dilutes the influence of key slices by treating all equally, attention-based fusion provides a more effective patient-level synthesis. Among architectures, DenseNet201 with soft voting achieved the highest single-model performance (F1-score: 0.902, recall: 0.976), while attention-based fusion consistently delivered gains across all 16 models. Notably, lightweight networks such as NASNetMobile performed nearly on par with larger architectures (F1-score: 0.898, recall: 0.988), underscoring their suitability for resource-limited deployment. In contrast, RadImageNet-pretrained models generalized poorly, likely due to the mismatch between their training data (primarily X-ray/CT) and the MRI-based task. Overall, DenseNet201 with soft voting represents the top individual model, but attention-based fusion should be regarded as the most robust and generalizable aggregation strategy.
These findings have significant implications for clinical applications in neurology, neuro-oncology, and radiology. Accurate, non-invasive glioma grading using MRI could support treatment planning, biopsy decisions, and prognosis without requiring histopathology. Lightweight models like NASNetMobile could be embedded in real-time inference systems for resource-limited or point-of-care settings. However, several challenges must be addressed before clinical deployment. Future models should incorporate quality control steps to detect and manage poor-quality or incomplete MR scans. Moreover, clinical acceptance will depend on model interpretability, robustness, and external validation. To ensure broader applicability, future work should focus on prospective validation across diverse institutions, imaging protocols, and patient demographics. Integrating interpretability methods (e.g., Grad-CAM) can build clinician trust. It will also be important to train models to spot or ignore poor-quality images, or to adjust their confidence based on image quality, to ensure safe integration into radiological workflows.
Conclusions
This work systematically evaluated 16 CNN architectures and three aggregation strategies for glioma grade classification. We found that attention-based aggregation consistently enhanced performance by prioritizing informative slices, confirming its statistical and practical advantages over majority and soft voting. DenseNet201 with soft voting achieved the best single-model performance, whereas attention-based fusion emerged as the most reliable strategy across diverse architectures, highlighting its potential as the default choice in clinical applications. Our results also revealed that RadImageNet-pretrained models did not outperform standard ImageNet-pretrained models, suggesting that domain-specific pretraining may not always confer an advantage, especially when modality differences exist. Overall, this work underscores the importance of model selection, aggregation strategy, and transfer learning choices in building accurate and efficient DL systems for non-invasive glioma grading.
Acknowledgments
None.
Footnote
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-154/prf
Funding: None.
Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-154/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This work used the publicly available, anonymized BraTS 2020 dataset. No institutional review board approval or informed consent was required.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Wen PY, Kesari S. Malignant gliomas in adults. N Engl J Med 2008;359:492-507. [Crossref] [PubMed]
- Lapointe S, Perry A, Butowski NA. Primary brain tumours in adults. Lancet 2018;392:432-46. [Crossref] [PubMed]
- Ostrom QT, Gittleman H, Truitt G, et al. CBTRUS Statistical Report: Primary Brain and Other Central Nervous System Tumors Diagnosed in the United States in 2011-2015. Neuro Oncol 2018;20:iv1-iv86. [Crossref] [PubMed]
- Cairncross G, Wang M, Shaw E, et al. Phase III trial of chemoradiotherapy for anaplastic oligodendroglioma: long-term results of RTOG 9402. J Clin Oncol 2013;31:337-43. [Crossref] [PubMed]
- Weller M, van den Bent M, Tonn JC, et al. European Association for Neuro-Oncology (EANO) guideline on the diagnosis and treatment of adult astrocytic and oligodendroglial gliomas. Lancet Oncol 2017;18:e315-29. Erratum in: Lancet Oncol 2017;18:e642. [Crossref] [PubMed]
- Behin A, Hoang-Xuan K, Carpentier AF, et al. Primary brain tumours in adults. Lancet 2003;361:323-31. [Crossref] [PubMed]
- Ostrom QT, Cioffi G, Gittleman H, et al. CBTRUS Statistical Report: Primary Brain and Other Central Nervous System Tumors Diagnosed in the United States in 2012-2016. Neuro Oncol 2019;21:v1-v100. [Crossref] [PubMed]
- Chen J, McKay RM, Parada LF. Malignant glioma: lessons from genomics, mouse models, and stem cells. Cell 2012;149:36-47. [Crossref] [PubMed]
- Jiang B, Chaichana K, Veeravagu A, et al. Biopsy versus resection for the management of low-grade gliomas. Cochrane Database Syst Rev 2017;4:CD009319. [Crossref] [PubMed]
- Xiao T, Hua W, Li C, et al. Glioma grading prediction by exploring radiomics and deep learning features. In: Proceedings of the Third International Symposium on Image Computing and Digital Medicine (ISICDM 2019). Association for Computing Machinery, New York, NY, USA; 2019:208-13.
- Gao M, Huang S, Pan X, et al. Machine Learning-Based Radiomics Predicting Tumor Grades and Expression of Multiple Pathologic Biomarkers in Gliomas. Front Oncol 2020;10:1676. [Crossref] [PubMed]
- Cheng J, Liu J, Yue H, et al. Prediction of Glioma Grade Using Intratumoral and Peritumoral Radiomic Features From Multiparametric MRI Images. IEEE/ACM Trans Comput Biol Bioinform 2022;19:1084-95. [Crossref] [PubMed]
- Patel A, Silverberg C, Becker-Weidman D, et al. Understanding body MRI sequences and their ability to characterize tissues. Univers. J. Med. Sci. 2016;4:1-9.
- Revett K. An introduction to magnetic resonance imaging: From image acquisition to clinical diagnosis. In: Kwaśnicka H, Jain LC. (eds). Innovations in Intelligent Image Analysis. Studies in Computational Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011:127-61.
- Skogen K, Schulz A, Dormagen JB, et al. Diagnostic performance of texture analysis on MRI in grading cerebral gliomas. Eur J Radiol 2016;85:824-9. [Crossref] [PubMed]
- Tian Q, Yan LF, Zhang X, et al. Radiomics strategy for glioma grading using texture features from multiparametric MRI. J Magn Reson Imaging 2018;48:1518-28. [Crossref] [PubMed]
- Kumar R, Gupta A, Arora HS, et al. CGHF: A computational decision support system for glioma classification using hybrid radiomics-and stationary wavelet-based features. IEEE Access 2020;8:79440-58.
- LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278-324.
- LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44. [Crossref] [PubMed]
- Ertosun MG, Rubin DL. Automated Grading of Gliomas using Deep Learning in Digital Pathology Images: A modular approach with ensemble of convolutional neural networks. AMIA Annu Symp Proc 2015;2015:1899-908.
- Anaraki AK, Ayati M, Kazemi F. Magnetic resonance imaging-based brain tumor grades classification and grading via convolutional neural networks and genetic algorithms. Biocybern Biomed Eng 2019;39:63-74.
- Valbuena Rubio S, García-Ordás MT, García-Olalla Olivera O, et al. Survival and grade of the glioma prediction using transfer learning. PeerJ Comput Sci 2023;9:e1723. [Crossref] [PubMed]
- Yang Y, Yan LF, Zhang X, et al. Glioma Grading on Conventional MR Images: A Deep Learning Study With Transfer Learning. Front Neurosci 2018;12:804. [Crossref] [PubMed]
- Lo CM, Chen YC, Weng RC, et al. Intelligent glioma grading based on deep transfer learning of MRI radiomic features. Appl. Sci. 2019;9:4926.
- Hao R, Namdar K, Liu L, et al. A Transfer Learning-Based Active Learning Framework for Brain Tumor Classification. Front Artif Intell 2021;4:635766. [Crossref] [PubMed]
- Weninger L, Rippel O, Koppers S, et al. Segmentation of brain tumors and patient survival prediction: Methods for the brats 2018 challenge. In: Crimi A, Bakas S, Kuijf H, et al., (eds). Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Cham: Springer International Publishing; 2019:3-12.
- Menze BH, Jakab A, Bauer S, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging 2015;34:1993-2024. [Crossref] [PubMed]
- Bakas S, Reyes M, Jakab A, et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629.
- Buchner JA, Peeken JC, Etzel L, et al. Identifying core MRI sequences for reliable automatic brain metastasis segmentation. Radiother Oncol 2023;188:109901. [Crossref] [PubMed]
- Ruffle JK, Mohinta S, Gray R, et al. Brain tumour segmentation with incomplete imaging data. Brain Commun 2023;5:fcad118. [Crossref] [PubMed]
- Bakas S, Akbari H, Sotiras A, et al. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci Data 2017;4:170117. [Crossref] [PubMed]
- He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, 2016, pp. 770-8.
- Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, 2018, pp. 4510-20.
- Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning; 2019:6105-14.
- Zoph B, Vasudevan V, Shlens J, et al. Learning transferable architectures for scalable image recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 pp. 8697-8710.
- Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 4700-8.
- Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA; 2016, pp. 2818-2826.
- Liu Z, Mao H, Wu CY, et al. A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA; 2022 pp. 11976-11986.
- Mei X, Liu Z, Robson PM, et al. RadImageNet: An Open Radiologic Deep Learning Research Dataset for Effective Transfer Learning. Radiol Artif Intell 2022;4:e210315. [Crossref] [PubMed]
Cite this article as: Gutta S, Yathirajam SS. Design and evaluation of transfer learning models for glioma grade prediction. J Med Artif Intell 2026;9:31.

