miRNA biomarkers-based diagnosis of diffuse large B-cell lymphoma using machine learning
Highlight box
Key findings
• The study developed a machine-learning (ML) model using miRNA expression profiles to diagnose diffuse large B-cell lymphoma (DLBCL) with high accuracy. By analyzing 54 miRNAs associated with DLBCL and integrating them with 54 random miRNAs, the ML classifiers achieved an accuracy of 93.52% with the training dataset. Validation with three independent datasets from tumor samples demonstrated accuracies ranging from 86.36% to 100%. Furthermore, the study identified several enriched biological pathways, including PI3K and FoxO signaling, implicated in DLBCL pathogenesis.
What is known and what is new?
• Traditional diagnostic methods for DLBCL, such as excisional biopsy, often have high false-negative rates and sampling errors. miRNAs are known for their stability, and various dysregulated miRNAs in DLBCL patients have been documented. However, research has primarily focused on individual miRNAs rather than comprehensive profiling.
• This study introduces the use of ML to analyze miRNA expression profiles for diagnosing DLBCL. By creating a dataset of 54 DLBCL-associated miRNAs combined with 54 random miRNAs, we developed ML models that achieved high diagnostic accuracy.
What is the implication, and what should change now?
• ML models based on miRNA expression profiles offer a highly accurate and reliable diagnostic tool, surpassing traditional methods plagued by sampling errors and high false-negative rates. This approach should be adopted in clinical settings to improve early and accurate detection of DLBCL, leading to better patient outcomes. The identification of key pathways involved in DLBCL pathogenesis suggests new targets for therapeutic intervention.
Introduction
The optimal practice for diagnosing diffuse large B-cell lymphoma (DLBCL) involves conducting an excisional biopsy on a lymph node that is unusually enlarged and appears abnormal, as determined through clinical assessment and imaging studies (1). When sufficient tissue is available, a diagnosis is confirmed when tissue samples reveal large, transformed B cells characterized by prominent nucleoli, a diffuse growth pattern, and a high rate of cell proliferation (2). These cells commonly display a range of B-cell-specific antigens such as CD19, CD20, CD22, CD79a, and CD45 (1). Although excisional biopsy is the preferred method, fine needle aspiration (FNA) and core needle biopsy (CNB) are frequently employed for initial diagnoses. Both techniques can lead to sampling errors and a high rate of false negatives because lymph node tissue is heterogeneous. It is impossible to guarantee that the cells obtained through aspiration accurately reflect the entire pathological condition (3). In a study by Sandhaus (3), the false-negative rate of FNA in diagnosing lymphoma was calculated to be as high as 16%, with a rebiopsy rate of 33.4%. Another study reported that FNA has a sensitivity of only 74% (4). CNB faces many of the same challenges as FNA, despite yielding a larger tissue sample. A meta-analysis examining the effectiveness of CNB in identifying lymphoma within cervical lymph nodes reported wide variability in the rate of actionable diagnoses, ranging from 30% to 96.3%. In a separate analysis of 457 biopsies, which included 339 excisional and 118 CNB samples, only about 56.8% of CNB samples were adequate for diagnostic purposes, in stark contrast to the 96.8% adequacy rate observed in excisional biopsies (4).
Diagnosing conditions solely through the visual inspection of tissue samples from FNA can be problematic, particularly when the samples are small or lack structural context. However, studies that focus on miRNA expression in FNA samples face fewer such issues. Notably, in-situ hybridization (ISH) techniques, which utilize locked nucleic acid (LNA)-modified DNA probes, enable precise measurement of miRNA expression at the level of single cells (5). Additionally, miRNA molecules demonstrate resilience against extreme temperatures, variations in pH, and the processes involved in formalin-fixed paraffin-embedding (FFPE) (6). Improved methods for extracting RNAs from FFPE tissues opened the possibility for conducting large retrospective studies with archival tissue blocks (5).
Since the discovery of miRNAs’ role in promoting B-cell lymphoma in 2005 (7), extensive research has deepened our knowledge of how miRNAs contribute to B-cell transformation and their viability as biomarkers for diagnosis and prognosis. For example, Liu and colleagues found a marked elevation in miR-21 expression in DLBCL tissues relative to normal tissues (8). Their findings suggest that Bcl-2 is a target of miR-21, proposing that miR-21 enhances Bcl-2 expression by directly binding to the 3'UTR region of the Bcl-2 mRNA.
Despite the growing number of miRNA biomarkers discovered from expression data, further miRNA profiling that includes features such as target genes may reveal new biomarkers more relevant to the molecular mechanism of DLBCL pathogenesis. Thus, there is a compelling need for more reliable methods of identifying DLBCL miRNA biomarkers. The goal of our research is to analyze miRNA sets implicated in previous studies (9-11), describe their distinguishing and common descriptors, and create a machine-learning (ML) model able to distinguish DLBCL patients from healthy control subjects. Furthermore, the target genes can be cross-referenced in pathway analysis to identify oncogenic pathways in which the miRNAs participate.
Methods
We used the following programs and databases for miRNA analysis, ML model training and testing, and pathway analysis: miRDB (12,13), Waikato Environment for Knowledge Analysis (WEKA) (14), Database for Annotation, Visualization, and Integrated Discovery (DAVID) (15,16), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (17-19). The flowchart of methods is shown in Figure 1.
PubMed
PubMed was searched for information on miRNA and DLBCL with the criteria: (miRNA OR microRNA) AND ((Diffuse Large B-Cell Lymphoma) OR (DLBCL) OR (B-Cell non-Hodgkin Lymphoma)) AND ((review[Filter]) AND (2019/1/1:3000/12/12[pdat])). The process began with the selection of miRNAs reported as significantly dysregulated in biopsied tissue specimens from individuals diagnosed with DLBCL and confirmed in at least two research papers. A control set of randomly selected miRNAs that have not been implicated in DLBCL was extracted from the downloaded file of all human miRNAs from miRBase (20).
miRDB
miRDB is a database for miRNA target prediction. By searching selected miRNAs in the miRDB, we could attribute their corresponding features: predicted target genes with a target score of 97 or above. The control set of random miRNAs was also attributed to predicted target genes from miRDB with a score of 97 or above. We selected a target score threshold of 97 to prioritize highly correlated miRNA-target gene interactions. To facilitate the analysis, we developed a table where each miRNA is represented as a row and each unique gene as a column (a descriptor). For each miRNA-gene pair in the table, we assign a ‘yes’ if the miRNA targets the respective gene, and ‘no’ otherwise.
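The table construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `build_target_table`, the input format (tuples of miRNA, gene, score), and the example miRNA and gene names are all assumptions; only the score threshold of 97 and the ‘yes’/‘no’ encoding come from the text.

```python
from collections import defaultdict

SCORE_THRESHOLD = 97  # target-score cutoff stated in the text


def build_target_table(predictions, threshold=SCORE_THRESHOLD):
    """Turn (mirna, gene, score) predictions into a yes/no table.

    Returns (genes, table): `genes` is the sorted list of column names and
    `table[mirna][gene]` is 'yes' if the miRNA targets the gene with a
    score at or above the threshold, and 'no' otherwise.
    """
    targets = defaultdict(set)
    for mirna, gene, score in predictions:
        targets[mirna]  # create the row even if no target passes the cutoff
        if score >= threshold:
            targets[mirna].add(gene)
    genes = sorted({g for hits in targets.values() for g in hits})
    table = {
        mirna: {g: ("yes" if g in hits else "no") for g in genes}
        for mirna, hits in targets.items()
    }
    return genes, table


# Hypothetical predictions; names and scores are placeholders, not miRDB output.
example_predictions = [
    ("miR-A", "GENE1", 99),
    ("miR-A", "GENE2", 97),
    ("miR-B", "GENE2", 98),
    ("miR-B", "GENE3", 80),  # below the cutoff, so GENE3 never becomes a column
]
genes, table = build_target_table(example_predictions)
```

Each unique gene that passes the cutoff for at least one miRNA becomes one descriptor column, which is why the attribute count grows quickly with the number of miRNAs.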
miRDB also provides miRNA sequence information, which we used to generate sequence-based attributes following a method developed by Kang et al. (21). The sequence of a miRNA is important because it determines the complementary binding to target genes through which the miRNA inhibits their expression (22). Table 1 shows the sequence-based attributes. NA is the number of adenine bases, NC the number of cytosine bases, NG the number of guanine bases, NU the number of uracil bases, and N the total number of bases.
Table 1
Sequence-based attribute | Description |
---|---|
Bases in miRNA sequence | N |
Frequency of each base | NA/N, NC/N, NG/N, NU/N |
Mean mass of bases | (135.1 (NA)+111.1 (NC)+151.1 (NG)+112.1 (NU))/N |
Number of hydrogen bonds | 2 (NA + NU)+3 (NC+ NG) |
2 base motifs | Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not |
3 base motifs | Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not |
4 base motifs | Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not |
NA is the amount of Adenine bases, NC is the amount of Cytosine bases, NG is the amount of Guanine bases, NU is the amount of Uracil bases and N is the total amount of bases.
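The attributes in Table 1 can be computed directly from a sequence string. The following is a minimal sketch under stated assumptions: the function name `sequence_attributes` and the key `H_BONDS` are hypothetical, the input sequence is a toy example rather than a real miRNA, and the base masses are the ones given in Table 1.

```python
from itertools import product

# Base masses as listed in Table 1
BASE_MASS = {"A": 135.1, "C": 111.1, "G": 151.1, "U": 112.1}


def sequence_attributes(seq, motif_lengths=(2, 3, 4)):
    """Compute Table 1 style attributes for one RNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = {base: seq.count(base) for base in "ACGU"}
    attrs = {
        "NUM_OF_BASES": n,
        "MEAN_MASS": sum(BASE_MASS[b] * counts[b] for b in "ACGU") / n,
        # 2 hydrogen bonds per A/U base, 3 per C/G base
        "H_BONDS": 2 * (counts["A"] + counts["U"]) + 3 * (counts["C"] + counts["G"]),
    }
    for base in "ACGU":
        attrs[f"RATIO_{base}"] = counts[base] / n
    # One Yes/No attribute per possible motif of each length
    for k in motif_lengths:
        for motif in map("".join, product("ACGU", repeat=k)):
            attrs[motif] = "Yes" if motif in seq else "No"
    return attrs


example = sequence_attributes("AUGGAC")  # toy sequence, not a real miRNA
```

With motif lengths 2, 3, and 4 this yields 16 + 64 + 256 motif attributes per miRNA in addition to the numeric ones, before any attribute selection is applied.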
The Database for Annotation, Visualization, and Integrated Discovery (DAVID)
DAVID uses Over-Representation Analysis (ORA) to identify pathways that are significantly enriched in a given gene list (15,16). In our case, we input the target genes from all selected miRNA biomarkers into DAVID, which uses Fisher’s Exact Test to compute a P value for each associated pathway. We selected the pathways with a P value of less than 0.05. The lower the P value, the more associated the pathway is with the target genes from each miRNA. Consequently, we attributed the miRNAs with associated pathways. We added each unique pathway as a column (descriptor) in the miRNA data table. For each miRNA-pathway pair in the table, we assigned a ‘yes’ if the miRNA is associated with the pathway, and ‘no’ otherwise.
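To make the over-representation test concrete, the sketch below computes a one-sided Fisher's exact (hypergeometric tail) P value using only the Python standard library. It is an illustration of the statistic, not DAVID's implementation: the function name and the toy numbers are assumptions, and DAVID reports a more conservative modified value (the EASE score), so its numbers would not match this calculation exactly.

```python
from math import comb


def ora_p_value(hits, list_size, pathway_size, background_size):
    """One-sided over-representation P value.

    Probability of seeing at least `hits` pathway genes in a gene list of
    `list_size` genes drawn from a background of `background_size` genes,
    of which `pathway_size` belong to the pathway.
    """
    max_hits = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Toy numbers chosen for easy hand-checking, not taken from the study:
# background of 10 genes, 4 in the pathway, gene list of 3 with 2 pathway hits.
p_toy = ora_p_value(hits=2, list_size=3, pathway_size=4, background_size=10)
```

For the toy case the tail sums to (6·6 + 4·1)/120 = 1/3, which can be verified by hand; a pathway would be retained under the threshold used here only if the corresponding value fell below 0.05.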
WEKA
Developed by the University of Waikato, WEKA is primarily used for data mining through a machine-learning approach. The application consists of multiple machine-learning tools, but this study mostly utilized the Preprocess, Classify, and Auto-WEKA tools. We combined the set of miRNAs implicated in DLBCL and the control set of random miRNAs into the training set used to develop the ML model. With the training set, after running CfsSubsetEval attribute selection to reduce attributes from 9,241 to 34, we tested the performance of four different ML classifiers in WEKA: Bayes network, logistic regression (LR), multi-layer perceptron (MLP), and random forest. The Bayes network classifier was found through the Auto-WEKA plug-in. We then analyzed the profiles of significantly dysregulated miRNAs in DLBCL patients published by Lawrie et al. (23), Beheshti et al. (24), and Sun et al. (25). These miRNA sets became the independent datasets used to test the effectiveness of our trained ML models. We attributed these miRNAs with the target genes and sequence-based attributes from miRDB and the associated pathways from DAVID, just as we did for the training set. The three resulting sets were tested against the trained ML models to verify how accurately the models classify the miRNAs.
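For readers unfamiliar with correlation-based feature selection (CFS), the idea behind CfsSubsetEval can be illustrated with the subset "merit" heuristic from the CFS literature: a subset scores well when its features correlate with the class but not with each other. The sketch below only evaluates that formula for given average correlations; the function name and the example values are assumptions, and it does not reproduce WEKA's search strategy or its correlation measure.

```python
from math import sqrt


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a k-feature subset.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature correlation.
    """
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    numerator = k * mean_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Illustrative values only: four equally predictive features score higher
# when they are uncorrelated with each other than when they are redundant.
independent = cfs_merit(k=4, mean_feature_class_corr=0.5, mean_feature_feature_corr=0.0)
redundant = cfs_merit(k=4, mean_feature_class_corr=0.5, mean_feature_feature_corr=1.0)
```

This penalty on redundancy is one plausible reason a very large descriptor table can shrink to a few dozen attributes, since many gene and motif columns carry overlapping information.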
KEGG
With the pathways identified using DAVID, we then used KEGG to map the miRNA target genes in the pathways and illustrate the number of miRNAs associated with the genes in each pathway.
Results
The miRNAs dysregulated in DLBCL were selected from the studies referenced in Table S1. Several systematic reviews (9-11) also reported their roles in the pathogenesis of DLBCL. We selected only miRNAs that had been confirmed in at least two studies. A total of 54 miRNAs were selected and are listed in Table S1. This 54-miRNA set was then combined with 54 random miRNAs (not implicated in DLBCL) to form the training dataset for the ML classifiers.
ML analysis
The DLBCL miRNA training dataset was used to develop the ML models with four different classifiers in WEKA: Bayes network, LR, MLP, and random forest. We then conducted 10-fold cross-validation to produce average accuracy rates. When predicting whether a miRNA was associated with DLBCL based on its attributes of target genes, pathways, and sequence properties, these selected WEKA classifiers yielded accuracy rates that varied from 86.11% to 93.52%, compared with the benchmark value of 46.30% given by ZeroR, a baseline classifier that always predicts the majority class of its training data (see Figure 2 for a visual chart of the classifiers and their accuracy rates).
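A ZeroR accuracy below 50% on a balanced dataset can look surprising. ZeroR predicts the majority class of its training data, so under cross-validation a test fold that holds slightly more of one class leaves the training folds with a majority of the other class. The sketch below illustrates this effect; the fold composition is hypothetical (the per-fold class counts are not reported in the article), and the tie-breaking rule is an assumption.

```python
def zero_r_cv_accuracy(folds):
    """Cross-validated accuracy of a majority-class (ZeroR-style) baseline.

    `folds` is a list of (n_yes, n_no) class counts, one pair per test fold.
    """
    total_yes = sum(n_yes for n_yes, _ in folds)
    total_no = sum(n_no for _, n_no in folds)
    correct = 0
    for test_yes, test_no in folds:
        train_yes = total_yes - test_yes
        train_no = total_no - test_no
        # Predict the majority class of the training folds (ties go to "yes")
        predicts_yes = train_yes >= train_no
        correct += test_yes if predicts_yes else test_no
    return correct / (total_yes + total_no)


# Hypothetical 10-fold split of 54 "yes" and 54 "no" instances
hypothetical_folds = [(6, 5)] * 4 + [(5, 6)] * 4 + [(5, 5)] * 2
accuracy = zero_r_cv_accuracy(hypothetical_folds)
print(f"{accuracy:.2%}")  # prints 46.30%
```

With this assumed split the baseline gets 5 instances right in every fold, 50 of 108 overall, which equals the reported 46.30%; a different fold assignment could give a slightly different figure.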
The final training dataset consisted of 108 miRNAs, 54 of which are related to DLBCL and 54 of which were randomly selected. After attribute selection, there are 33 final attributes: 15 target genes (AKAP13, AKR1B10, ASB4, BTG2, CASP8AP2, CELSR1, CREBRF, KCNMA1, MMGT1, NPTX1, PPP1R11, RPS6KA5, TRIM71, ZBTB34, ZNF512B), 4 sequence-based attributes (number of nucleotide bases, frequency of G, frequency of U, mean mass), 3 three-base motifs (CAU, GGA, UGG), 10 four-base motifs (AAUC, AUAU, AUCU, CGGG, GAGC, GCGG, GGAC, UACA, UGGA, UUGU), and 1 pathway attribute (pathways in cancer, hsa05200). In the motifs, A stands for adenine, C for cytosine, G for guanine, and U for uracil.
We performed feature importance analysis to better understand the contribution of each miRNA attribute to the model’s predictions. For random forest, WEKA’s built-in attribute importance function based on Gini impurity metric was used, and the results are shown in Table 2.
Table 2
Attribute name | Average impurity | Number of nodes |
---|---|---|
MEAN_MASS | 0.38 | 194 |
PPP1R11 | 0.37 | 19 |
CASP8AP2 | 0.35 | 27 |
TRIM71 | 0.34 | 50 |
hsa05200 | 0.33 | 142 |
RATIO_G | 0.33 | 225 |
CAU | 0.31 | 72 |
ASB4 | 0.31 | 23 |
RATIO_U | 0.30 | 231 |
AAUC | 0.30 | 55 |
NUM_OF_BASES | 0.30 | 176 |
UUGU | 0.28 | 75 |
NPTX1 | 0.28 | 22 |
AKAP13 | 0.27 | 16 |
UGG | 0.27 | 82 |
ZNF512B | 0.25 | 40 |
BTG2 | 0.25 | 36 |
CGGG | 0.25 | 32 |
UACA | 0.24 | 23 |
GCGG | 0.24 | 65 |
CREBRF | 0.22 | 26 |
GGA | 0.22 | 59 |
ZBTB34 | 0.19 | 21 |
MMGT1 | 0.18 | 19 |
AKR1B10 | 0.18 | 15 |
UGGA | 0.17 | 33 |
CELSR1 | 0.17 | 7 |
GGAC | 0.17 | 42 |
AUAU | 0.17 | 24 |
GAGC | 0.17 | 45 |
AUCU | 0.16 | 41 |
KCNMA1 | 0.13 | 33 |
RPS6KA5 | 0.11 | 22 |
With these trained models, we tested three independent miRNA datasets. All three test sets were derived from studies (23-25) specifically focused on identifying dysregulated miRNAs in DLBCL patients, which is why they are imbalanced. We intentionally used the full set of miRNAs from the studies as they were reported, which resulted in some overlap with the training set. To address this issue, we conducted additional experiments where the overlapping miRNAs were removed from the test sets to ensure a fair evaluation. The results for both the full set and the “clean” set (with no overlap) are presented in Figures 3-5.
The rationale for including the results with the full set is based on the real-world scenario where miRNA profiles from patients may contain miRNAs that were already part of the training data. By testing with the full miRNA collection, we aim to evaluate how the model performs in a setting that reflects practical clinical applications, where some miRNAs may be known. However, the “clean” set results offer a more stringent assessment, and we present both to provide a comprehensive evaluation of classifier performance.
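The "clean" test sets amount to a set difference between each test set and the training set. A minimal sketch follows; the function name is hypothetical, and the short lists are illustrative stand-ins (the complete training list is in Table S1 and is not reproduced here).

```python
def remove_overlap(test_set, training_set):
    """Split a test set into (clean, overlap) relative to the training set.

    Names are compared case-insensitively; the order of `test_set` is kept.
    """
    training_names = {name.lower() for name in training_set}
    clean = [m for m in test_set if m.lower() not in training_names]
    overlap = [m for m in test_set if m.lower() in training_names]
    return clean, overlap


# Illustrative lists only; membership of the training list is an assumption here.
training_example = ["hsa-mir-21", "hsa-mir-155", "hsa-mir-17"]
test_example = ["hsa-mir-21", "hsa-mir-130b", "hsa-mir-155", "hsa-mir-7"]
clean, overlap = remove_overlap(test_example, training_example)
```

Evaluating the models on `clean` only avoids crediting them for instances they have already seen during training.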
Independent dataset #1, comprising 22 miRNAs, was taken from a study of 64 DLBCL patients by Lawrie et al. (23). In their study, biopsy samples from 64 patients with de novo DLBCL were analyzed using tonsillar samples from healthy individuals as a common reference. They found 22 differentially expressed miRNAs, listed in Table 3. With independent set #1, the models predicted the miRNAs associated with DLBCL with an accuracy of 81.82% to 86.36%. After removing the 12 miRNAs common to the independent test set and the training set, the accuracy was 60% to 70%. The results are depicted in Figure 3.
Table 3
miRNA test dataset #1 |
hsa-mir-200c |
hsa-mir-638 |
hsa-mir-518a |
hsa-mir-199a* |
hsa-mir-93* |
hsa-mir-22 |
hsa-mir-34a* |
hsa-mir-362 |
hsa-mir-206 |
hsa-mir-451* |
hsa-mir-636 |
hsa-mir-92* |
hsa-mir-27b* |
hsa-mir-199b |
hsa-mir-27a* |
hsa-mir-24* |
hsa-mir-106a* |
hsa-mir-20a* |
hsa-mir-19b* |
hsa-mir-99a |
hsa-mir-18b* |
hsa-mir-100 |
DLBCL, diffuse large B-cell lymphoma.
Independent dataset #2, comprising 9 miRNAs, was taken from a study of 86 DLBCL patients by Beheshti et al. (24). In this study, the researchers analyzed serum from 86 DLBCL patients and recorded miRNAs with higher circulating levels, listed in Table 4. With independent set #2, all models predicted the miRNAs with an accuracy of 100%. After removing the 5 miRNAs common to the independent set and the training set, the accuracy was also 100%. The results are depicted in Figure 4.
Table 4
miRNA test dataset #2 |
hsa-mir-10b |
hsa-mir-155* |
hsa-let-7c |
hsa-let-7b |
hsa-mir-130a |
hsa-mir-24* |
hsa-mir-27a* |
hsa-mir-18a* |
hsa-mir-15a* |
Independent dataset #3, comprising 8 miRNAs, was taken from a study of 20 DLBCL patients by Sun et al. (25). This study collected serum samples at diagnosis from 20 newly diagnosed DLBCL patients and analyzed the samples by miRNA array. The authors reported the miRNAs with a mean fold change greater than 2.5 and a P value less than 0.05, listed in Table 5. With independent set #3, the models predicted the miRNAs with an accuracy of 75.00% to 87.50%. After removing the 2 miRNAs common to the independent set and the training set, the accuracy was 66.67% to 83.33%. The results are depicted in Figure 5.
Table 5
miRNA test dataset #3 |
hsa-mir-21* |
hsa-mir-130b |
hsa-mir-155* |
hsa-mir-7 |
hsa-mir-28 |
hsa-mir-128 |
hsa-mir-424 |
hsa-mir-454 |
The independent test sets contain only “Yes” (DLBCL-associated) miRNAs, and thus the model should ideally classify all instances as “Yes”. However, the model misclassified instances as “No” in independent sets #1 and #3. This indicates that despite the model’s overall strong performance, it exhibited uncertainty with certain “Yes” instances, potentially due to feature interactions or model structure issues.
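Because every instance in these test sets is a "Yes", accuracy on them is simply the fraction of miRNAs predicted "Yes", that is, the sensitivity; specificity cannot be assessed from these sets. The sketch below shows the arithmetic. The counts are back-calculated from the reported percentages and set sizes, so they are an inference rather than figures stated in the article, and the function name is hypothetical.

```python
def positive_only_accuracy(n_predicted_yes, n_total):
    """Accuracy (in %) on a test set whose instances are all 'Yes'.

    On such a set, accuracy equals sensitivity: correct predictions are
    exactly the instances predicted 'Yes'.
    """
    if n_total <= 0 or not 0 <= n_predicted_yes <= n_total:
        raise ValueError("counts out of range")
    return 100 * n_predicted_yes / n_total


# (predicted "Yes", set size); counts inferred from the reported percentages
inferred_counts = {
    "set #1, best": (19, 22),
    "set #1, worst": (18, 22),
    "set #3, best": (7, 8),
    "set #3 clean, worst": (4, 6),
}
accuracies = {
    name: round(positive_only_accuracy(hit, total), 2)
    for name, (hit, total) in inferred_counts.items()
}
```

With sets this small, a single misclassified miRNA shifts the accuracy by 4.5 to 17 percentage points, which is worth keeping in mind when comparing classifiers on these data.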
We will investigate this further by analyzing feature importance to identify whether specific features are leading to these misclassifications. Additionally, we plan to refine the model through parameter tuning and potentially explore alternative classifiers to improve its ability to correctly identify all “Yes” instances in future experiments.
Pathway analysis
We used DAVID to identify enriched pathways most relevant to the target genes of the 54 training miRNAs. Specifically, we input a total of 1,100 target genes from all 54 miRNAs into DAVID for pathway analysis. We found various pathways with a P value ≤0.05 and listed them in Table 6. The pathways with the lowest P values include signaling pathways regulating pluripotency of stem cells (P<0.001; 23 target genes) and endocrine resistance (P<0.001; 18 target genes). Because DLBCL miRNA target genes are over-represented in these pathways, the pathways may be associated with the disease and should be examined further. Pathways with more than 20 genes targeted by the miRNA biomarkers are depicted in Figure 6.
Table 6
Term | Count of target genes | −Log10(P) | P value |
---|---|---|---|
Signaling pathways regulating pluripotency of stem cells | 23 | 4.6021 | <0.001 |
Endocrine resistance | 18 | 4.3468 | <0.001 |
FoxO signaling pathway | 21 | 4.1805 | <0.001 |
Pathways in cancer | 53 | 3.8861 | <0.001 |
Axon guidance | 25 | 3.8539 | <0.001 |
Proteoglycans in cancer | 27 | 3.8539 | <0.001 |
Neurotrophin signaling pathway | 19 | 3.7696 | <0.001 |
mTOR signaling pathway | 22 | 3.5686 | <0.001 |
Cellular senescence | 22 | 3.5686 | <0.001 |
PI3K-Akt signaling pathway | 38 | 3.4318 | <0.001 |
EGFR tyrosine kinase inhibitor resistance | 14 | 3.2291 | <0.001 |
Prolactin signaling pathway | 13 | 3.1871 | <0.001 |
MAPK signaling pathway | 33 | 3.1427 | <0.001 |
Focal adhesion | 25 | 3.1427 | <0.001 |
Protein digestion and absorption | 16 | 3.0555 | <0.001 |
Glutamatergic synapse | 17 | 3.0044 | <0.001 |
Autophagy - animal | 19 | 2.8539 | 0.001 |
Regulation of actin cytoskeleton | 26 | 2.7447 | 0.001 |
Human papillomavirus infection | 33 | 2.4949 | 0.003 |
Lysine degradation | 11 | 2.4815 | 0.003 |
Transcriptional misregulation in cancer | 22 | 2.3768 | 0.004 |
AMPK signaling pathway | 16 | 2.3565 | 0.004 |
Tight junction | 20 | 2.3565 | 0.004 |
Ras signaling pathway | 25 | 2.2596 | 0.005 |
Phospholipase D signaling pathway | 18 | 2.2518 | 0.005 |
Renin secretion | 11 | 2.1938 | 0.006 |
ErbB signaling pathway | 12 | 2.0000 | 0.01 |
p53 signaling pathway | 11 | 2.0000 | 0.01 |
Phosphatidylinositol signaling system | 13 | 2.0000 | 0.01 |
Wnt signaling pathway | 19 | 1.9586 | 0.01 |
Hedgehog signaling pathway | 9 | 1.8239 | 0.01 |
cGMP-PKG signaling pathway | 18 | 1.7447 | 0.01 |
Oocyte meiosis | 15 | 1.6990 | 0.02 |
Endocytosis | 24 | 1.6778 | 0.02 |
cAMP signaling pathway | 22 | 1.6576 | 0.02 |
Inositol phosphate metabolism | 10 | 1.6021 | 0.02 |
Apelin signaling pathway | 15 | 1.4949 | 0.03 |
ECM-receptor interaction | 11 | 1.4685 | 0.03 |
Relaxin signaling pathway | 14 | 1.4202 | 0.03 |
C-type lectin receptor signaling pathway | 12 | 1.3979 | 0.04 |
GnRH signaling pathway | 11 | 1.3565 | 0.04 |
Thyroid hormone signaling pathway | 13 | 1.3098 | 0.049 |
TGF-beta signaling pathway | 12 | 1.3010 | 0.05 |
DAVID, Database for Annotation, Visualization, and Integrated Discovery; FoxO, Forkhead box O; mTOR, mammalian target of rapamycin; PI3K, phosphoinositide 3-kinase; Akt, Ak strain transforming; EGFR, epidermal growth factor receptor; MAPK, mitogen-activated protein kinase; AMPK, AMP-activated protein kinase; AMP, adenosine 3',5'-monophosphate; Ras, rat sarcoma; ErbB, erythroblastic oncogene B; cGMP, cyclic guanosine monophosphate; PKG, protein kinase G; cAMP, cyclic adenosine 3',5'-monophosphate; ECM, extracellular matrix; GnRH, gonadotropin-releasing hormone; TGF, transforming growth factor.
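The two numeric columns of Table 6 carry the same information, since the P value equals 10 raised to the negative of the −Log10(P) value. A short sketch of the conversion, with hypothetical function names, using two values from the table:

```python
from math import log10


def neg_log10(p_value):
    """Convert a P value to its -log10 representation."""
    return -log10(p_value)


def p_from_neg_log10(value):
    """Recover the P value from a -log10(P) entry."""
    return 10 ** (-value)


# Values taken from Table 6
p_top = p_from_neg_log10(4.6021)  # pluripotency of stem cells, about 2.5e-5
p_tgf = p_from_neg_log10(1.3010)  # TGF-beta signaling pathway, about 0.05
```

Working from the −Log10(P) column avoids the loss of resolution in entries shown only as "<0.001".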
Furthermore, after submitting the target genes of the DLBCL miRNA training data to KEGG, we created a map of pathways in cancer and highlighted the number of target genes in each pathway, as shown in Figure 7. These results show that the miRNAs participate in significant sections of several main cancer pathways.
Discussion
The performance of the classifiers was evaluated using confusion matrices and the following metrics: true positive rate (TPR) (sensitivity), false positive rate (FPR), precision, recall, F-measure (harmonic mean of precision and recall), Matthews correlation coefficient (MCC), area under the receiver operating characteristic (ROC) curve (AUC) (the balance between the TPR and FPR), and area under the precision-recall (PR) curve (AUC-PR) (the balance between precision and recall). Table 7 presents a comparison of the four classifiers examined in this study: Bayes network, LR, MLP, and random forest.
Table 7
Classifier | Accuracy | TP rate | FP rate | Precision | Recall | F-measure | MCC | AUC | AUC-PR |
---|---|---|---|---|---|---|---|---|---|
Bayes network | 0.917 | 0.917 | 0.083 | 0.917 | 0.917 | 0.917 | 0.833 | 0.967 | 0.967 |
Logistic regression | 0.935 | 0.935 | 0.065 | 0.937 | 0.935 | 0.935 | 0.872 | 0.973 | 0.974 |
Multi-layer perceptron | 0.880 | 0.880 | 0.120 | 0.886 | 0.880 | 0.879 | 0.766 | 0.963 | 0.962 |
Random forest | 0.861 | 0.861 | 0.139 | 0.861 | 0.861 | 0.861 | 0.772 | 0.941 | 0.944 |
TP, true positive; FP, false positive; MCC, Matthews correlation coefficient; AUC, area under receiver operating characteristic curve; AUC-PR, area under precision-recall curve.
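The threshold-based metrics in Table 7 all follow from the four cells of a confusion matrix. The sketch below computes them for the "Yes" class. The counts used in the example are hypothetical: they were chosen so that 101 of 108 instances are correct, matching the LR accuracy, but the article does not report the actual matrices. Table 7 lists a TP rate equal to the accuracy in every row, which is what class-weighted averages produce, so per-class values such as those below need not match the table row for row.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate (sensitivity)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical counts for 54 "Yes" and 54 "No" miRNAs (not from the article)
metrics = binary_metrics(tp=52, fn=2, fp=5, tn=49)
```

AUC and AUC-PR are not included because they depend on the ranking of predicted probabilities rather than on a single confusion matrix.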
All classifiers exhibited an accuracy exceeding 85%, with LR achieving the highest, 93.5%, in the cross-validation experiments. LR had a TPR (sensitivity) of 93.5%, which corresponds to a false-negative rate of 6.5%, compared with the false-negative rate of up to 16% reported for FNA in Sandhaus’s study (3) and the 74% FNA sensitivity reported by Paquin et al. (4); its FPR was also 6.5%. Compared with CNB, for which only 56.8% of samples were adequate for diagnosis (4), our miRNA-based diagnosis also showed superior performance. Notably, our best-performing LR model approached the 96.8% adequacy rate reported for excisional biopsies (4), demonstrating the potential to deliver high accuracy without being constrained by sample size limitations.
We also investigated the 15 target gene attributes used in the model after attribute selection: AKAP13, AKR1B10, ASB4, BTG2, CASP8AP2, CELSR1, CREBRF, KCNMA1, MMGT1, NPTX1, PPP1R11, RPS6KA5, TRIM71, ZBTB34, ZNF512B. Some of the genes, such as AKAP13 (26), CASP8AP2 (27), CREBRF (28), and BTG2 (29), have been implicated in DLBCL. For example, the dysregulation of CASP8AP2 has been associated with DLBCL and may impact apoptotic pathways, which play a crucial role in cancer development and progression (27). The other 11 genes have not been implicated in DLBCL but have been implicated in the development of other cancers. For instance, AKR1B10 plays a role in metabolism, and its elevated expression is linked to unfavorable prognoses across several cancer types (30). Meanwhile, ZBTB34 functions as a transcriptional repressor and has been implicated in hematological malignancies (31). The fact that all 15 target genes identified through attribute selection in our model are implicated in cancer pathogenesis is unlikely to be coincidental and suggests that our model effectively identifies cancer-associated genes.
The methodology outlined in this study holds promise for miRNA analysis in various other types of cancer or diseases, provided that relevant miRNA data is available. In a clinical setting, sequencing data from a patient’s biopsy or serum sample would first be processed to obtain the miRNA expression levels. These expression levels would then be transformed into the feature space used by the ML models, which includes attributes such as target genes, enriched pathways, and sequence-based properties. Once the sequencing profiles of an individual patient are mapped to these features, the trained models can be applied to predict whether the miRNA profile is associated with DLBCL. Thus, the patient miRNA data would function like a new independent test set for the trained model. This prediction could assist clinicians in making decisions regarding diagnosis by identifying miRNAs linked to DLBCL.
Conclusions
We developed a machine-learning system for the diagnosis of DLBCL with 54 miRNAs, reaching a best accuracy of 93.52% with the training dataset and best accuracies of 86.36%, 100%, and 87.50% with three independent datasets extracted from miRNA sequence data of actual tumor samples. This supports our hypothesis that a diagnosis can be made from miRNA sequencing data of tumor samples using known miRNA biomarkers and their target-gene descriptors.
Furthermore, we identified the enriched pathways in which the miRNAs have a significant presence. Some of the pathways [e.g., PI3K signaling pathway (32,33)] have already been discussed in several publications, but others (e.g., FoxO signaling pathway) have been investigated less intensively and could yield new insights into the pathogenesis of DLBCL.
Acknowledgments
Funding: None.
Footnote
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-282/prf
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-282/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. IRB approval and informed consent are waived as there is no human subject involved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Liu Y, Barta SK. Diffuse large B-cell lymphoma: 2019 update on diagnosis, risk stratification, and treatment. Am J Hematol 2019;94:604-16. [Crossref] [PubMed]
- Harris LJ, Patel K, Martin M. Novel Therapies for Relapsed or Refractory Diffuse Large B-Cell Lymphoma. Int J Mol Sci 2020;21:8553. [Crossref] [PubMed]
- Sandhaus LM. Fine-needle aspiration cytology in the diagnosis of lymphoma. The next step. Am J Clin Pathol 2000;113:623-7. [Crossref] [PubMed]
- Paquin AR, Oyogoa E, McMurry HS, et al. The diagnosis and management of suspected lymphoma in general practice. Eur J Haematol 2023;110:3-13. [Crossref] [PubMed]
- Sempere LF, Azmi AS, Moore A. microRNA-based diagnostic and therapeutic applications in cancer medicine. Wiley Interdiscip Rev RNA 2021;12:e1662. [Crossref] [PubMed]
- Precazzini F, Detassis S, Imperatori AS, et al. Measurements Methods for the Development of MicroRNA-Based Tests for Cancer Diagnosis. Int J Mol Sci 2021;22:1176. [Crossref] [PubMed]
- He L, Thomson JM, Hemann MT, et al. A microRNA polycistron as a potential human oncogene. Nature 2005;435:828-33. [Crossref] [PubMed]
- Liu K, Du J, Ruan L. MicroRNA-21 regulates the viability and apoptosis of diffuse large B-cell lymphoma cells by upregulating B cell lymphoma-2. Exp Ther Med 2017;14:4489-96. [Crossref] [PubMed]
- Fuertes T, Ramiro AR, de Yebenes VG. miRNA-Based Therapies in B Cell Non-Hodgkin Lymphoma. Trends Immunol 2020;41:932-47. [Crossref] [PubMed]
- Getaneh Z, Asrie F, Melku M. MicroRNA profiles in B-cell non-Hodgkin lymphoma. EJIFCC 2019;30:195-214. [PubMed]
- Larrabeiti-Etxebarria A, Lopez-Santillan M, Santos-Zorrozua B, et al. Systematic Review of the Potential of MicroRNAs in Diffuse Large B Cell Lymphoma. Cancers (Basel) 2019;11:144. [Crossref] [PubMed]
- Chen Y, Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic Acids Res 2020;48:D127-31. [Crossref] [PubMed]
- Liu W, Wang X. Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data. Genome Biol 2019;20:18. [Crossref] [PubMed]
- Frank E, Hall MA, Witten IH. The WEKA Workbench, Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”. Morgan Kaufmann, Fourth Edition, 2016. The University of Waikato; 2016.
- Sherman BT, Hao M, Qiu J, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 2022;50:W216. [Crossref] [PubMed]
- Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44-57. [Crossref] [PubMed]
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27-30. [Crossref] [PubMed]
- Kanehisa M. Toward understanding the origin and evolution of cellular organisms. Protein Sci 2019;28:1947-51. [Crossref] [PubMed]
- Kanehisa M, Furumichi M, Sato Y, et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 2023;51:D587-92. [Crossref] [PubMed]
- Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res 2019;47:D155-62. [Crossref] [PubMed]
- Kang W, Kouznetsova VL, Tsigelny IF. miRNA in Machine-learning-based Diagnostics of Cancers. Cancer Screening and Prevention 2022;1:132-8. [Crossref]
- Macfarlane LA, Murphy PR. MicroRNA: Biogenesis, Function and Role in Cancer. Curr Genomics 2010;11:537-61. [Crossref] [PubMed]
- Lawrie CH, Chi J, Taylor S, et al. Expression of microRNAs in diffuse large B cell lymphoma is associated with immunophenotype, survival and transformation from follicular lymphoma. J Cell Mol Med 2009;13:1248-60. [Crossref] [PubMed]
- Beheshti A, Stevenson K, Vanderburg C, et al. Identification of Circulating Serum Multi-MicroRNA Signatures in Human DLBCL Models. Sci Rep 2019;9:17161. [Crossref] [PubMed]
- Sun R, Zheng Z, Wang L, et al. A novel prognostic model based on four circulating miRNA in diffuse large B-cell lymphoma: implications for the roles of MDSC and Th17 cells in lymphoma progression. Mol Oncol 2021;15:246-61. [Crossref] [PubMed]
- Luo X, Shi F, Qiu H, et al. Identification of potential key genes associated with diffuse large B-cell lymphoma based on microarray gene expression profiling. Neoplasma 2017;64:824-33. [Crossref] [PubMed]
- Lan Q, Morton LM, Armstrong B, et al. Genetic variation in caspase genes and risk of non-Hodgkin lymphoma: a pooled analysis of 3 population-based case-control studies. Blood 2009;114:264-7. [Crossref] [PubMed]
- D’Auria F, Di Pietro R. Role of CREB protein family members in human haematological malignancies, Cancer Treatment-Conventional and Innovative Approaches. IntechOpen 2013. doi: 10.5772/55368.
- Guo D, Hong L, Ji H, et al. The Mutation of BTG2 Gene Predicts a Poor Outcome in Primary Testicular Diffuse Large B-Cell Lymphoma. J Inflamm Res 2022;15:1757-69. [Crossref] [PubMed]
- Banerjee S. Aldo Keto Reductases AKR1B1 and AKR1B10 in Cancer: Molecular Mechanisms and Signaling Networks. Adv Exp Med Biol 2021;1347:65-82. [Crossref] [PubMed]
- Liu Z, Jin D, Wei X, et al. ZBTB34 is a hepatocellular carcinoma-associated protein with a monopartite nuclear localization signal. Aging (Albany NY) 2023;15:8487-500. [Crossref] [PubMed]
- Majchrzak A, Witkowska M, Smolewski P. Inhibition of the PI3K/Akt/mTOR signaling pathway in diffuse large B-cell lymphoma: current knowledge and clinical significance. Molecules 2014;19:14304-15. [Crossref] [PubMed]
- Miao Y, Medeiros LJ, Xu-Monette ZY, et al. Dysregulation of Cell Survival in Diffuse Large B Cell Lymphoma: Mechanisms and Therapeutic Targets. Front Oncol 2019;9:107. [Crossref] [PubMed]
Cite this article as: Tang S, Tsigelny IF, Kesari S, Kouznetsova VL. miRNA biomarkers-based diagnosis of diffuse large B-cell lymphoma using machine learning. J Med Artif Intell 2025;8:18.