miRNA biomarkers-based diagnosis of diffuse large B-cell lymphoma using machine learning

Summer Tang; Igor F. Tsigelny; Santosh Kesari; Valentina L. Kouznetsova

doi:10.21037/jmai-24-282

Original Article

Summer Tang¹, Igor F. Tsigelny²,³,⁴, Santosh Kesari⁵, Valentina L. Kouznetsova²,⁴

¹Mentor Assistance Program, University of California San Diego, La Jolla, CA, USA; ²San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA; ³Department of Neurosciences, University of California San Diego, La Jolla, CA, USA; ⁴CureScience Institute, San Diego, CA, USA; ⁵Pacific Neuroscience Institute, Santa Monica, CA, USA

Contributions: (I) Conception and design: All authors; (II) Administrative support: None; (III) Provision of study materials or patients: S Tang; (IV) Collection and assembly of data: S Tang, IF Tsigelny; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Igor F. Tsigelny, PhD. Department of Neurosciences, University of California San Diego, Gilman Dr. 9500, MS-505, La Jolla, CA 92093-0505, USA; San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA; CureScience Institute, San Diego, California, USA. Email: itsigeln@ucsd.edu.

Background: Traditional diagnostic methods like excisional biopsy, fine needle aspiration (FNA), and core needle biopsy (CNB) are often challenged by sampling errors and high false-negative rates. Our research shifts focus to microRNA (miRNA) expression profiling, leveraging the stability of miRNA molecules and advanced RNA extraction methods. Although the oncogenic potential of miRNAs in B-cell lymphoma has been studied since 2005, and various dysregulated miRNAs in diffuse large B-cell lymphoma (DLBCL) patients have been reported in the scientific literature, there has been limited research investigating these miRNAs using machine-learning (ML) algorithms.

Methods: This study presents an innovative approach to the diagnosis of DLBCL using a machine-learning (ML) system based on miRNA analysis. We first identified 54 miRNAs associated with DLBCL, combining them with 54 random miRNAs to create a training dataset for ML classifiers. This dataset was processed using various ML classifiers through the Waikato Environment for Knowledge Analysis (WEKA) software. In addition to miRNA profiling, our study also explored the biological pathways associated with these miRNAs using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases.

Results: Our training model achieved a notable accuracy of 93.52%. The performance was further validated with three independent datasets derived from actual tumor samples, showing best accuracies from 86.36% to 100%. We identified several enriched pathways, such as the PI3K and FoxO signaling pathways, that are significantly implicated in DLBCL. These findings not only validate known associations but also reveal potential new avenues for understanding DLBCL pathogenesis.

Conclusions: Our paper demonstrates that ML-assisted miRNA analysis can serve not only as a diagnostic tool for the onset of DLBCL but also as a discovery tool to predict specific genes, pathways, and sequence motifs as targets for further investigation.

Keywords: MicroRNA (miRNA); machine-learning (ML); lymphoma; diagnostics

Received: 17 August 2024; Accepted: 27 October 2024; Published online: 27 December 2024.

Highlight box

Key findings

• The study developed a machine-learning (ML) model using miRNA expression profiles to diagnose diffuse large B-cell lymphoma (DLBCL) with high accuracy. By analyzing 54 miRNAs associated with DLBCL and integrating them with 54 random miRNAs, the ML classifiers achieved an accuracy of 93.52% with the training dataset. Validation with three independent datasets from tumor samples demonstrated accuracies ranging from 86.36% to 100%. Furthermore, the study identified several enriched biological pathways, including PI3K and FoxO signaling, implicated in DLBCL pathogenesis.

What is known and what is new?

• Traditional diagnostic methods for DLBCL, such as excisional biopsy, often have high false-negative rates and sampling errors. miRNAs are known for their stability, and various dysregulated miRNAs in DLBCL patients have been documented. However, research has primarily focused on individual miRNAs rather than comprehensive profiling.

• This study introduces the use of ML to analyze miRNA expression profiles for diagnosing DLBCL. By creating a dataset of 54 DLBCL-associated miRNAs combined with 54 random miRNAs, we developed ML models that achieved high diagnostic accuracy.

What is the implication, and what should change now?

• ML models based on miRNA expression profiles offer a highly accurate and reliable diagnostic tool, surpassing traditional methods plagued by sampling errors and high false-negative rates. This approach should be adopted in clinical settings to improve early and accurate detection of DLBCL, leading to better patient outcomes. The identification of key pathways involved in DLBCL pathogenesis suggests new targets for therapeutic intervention.

Introduction

The optimal practice for diagnosing diffuse large B-cell lymphoma (DLBCL) involves conducting an excisional biopsy on a lymph node that is unusually enlarged and appears abnormal, as determined through clinical assessment and imaging studies (1). With enough tissue material, a diagnosis is confirmed when tissue samples reveal large, transformed B cells characterized by prominent nucleoli, a diffuse growth pattern, and a high rate of cell proliferation (2). These cells commonly display a range of B-cell-specific antigens such as CD19, CD20, CD22, CD79a, and CD45 (1). Although excisional biopsy is the preferred method, fine needle aspiration (FNA) and core needle biopsy (CNB) are frequently employed for initial diagnoses. Both techniques can lead to sampling errors and a high rate of false negatives because lymph node tissue is heterogeneous, and it is impossible to guarantee that the cells obtained through aspiration accurately reflect the entire pathological condition (3). Sandhaus (3) calculated that the false-negative rate of FNA in diagnosing lymphoma was as high as 16%, with a rebiopsy rate of 33.4%. Another study reported that FNA has a sensitivity of only 74% (4). CNB faces many of the same challenges as FNA, despite yielding a larger tissue sample. A meta-analysis examining the effectiveness of CNB in identifying lymphoma within cervical lymph nodes reported wide variability in the rate of actionable diagnoses, ranging from 30% to 96.3%. In a separate analysis of 457 biopsies, which included 339 excisional and 118 CNB samples, only about 56.8% of CNB samples were adequate for diagnostic purposes, in stark contrast to the 96.8% adequacy rate observed in excisional biopsies (4).

Diagnosing conditions solely through the visual inspection of tissue samples from FNA can be problematic, particularly when the samples are small or lack structural context. However, studies that focus on miRNA expression in FNA samples face fewer such issues. Notably, in-situ hybridization (ISH) techniques, which utilize locked nucleic acid (LNA)-modified DNA probes, enable precise measurement of miRNA expression at the level of single cells (5). Additionally, miRNA molecules demonstrate resilience against extreme temperatures, variations in pH, and the processes involved in formalin-fixed paraffin-embedding (FFPE) (6). Improved methods for extracting RNAs from FFPE tissues opened the possibility for conducting large retrospective studies with archival tissue blocks (5).

Since the discovery of miRNAs’ role in promoting B-cell lymphoma in 2005 (7), extensive research has deepened our knowledge of how miRNAs contribute to B-cell transformation and their viability as biomarkers for diagnosis and prognosis. For example, Liu and colleagues found a marked elevation in miR-21 expression in DLBCL tissues relative to normal tissues (8). Their findings suggest that Bcl-2 is a target of miR-21, proposing that miR-21 enhances Bcl-2 expression by directly binding to the 3'UTR region of the Bcl-2 mRNA.

Although expression-based studies have identified a growing number of miRNA biomarkers, further miRNA profiling that includes features such as target genes may yield new biomarkers more relevant to the molecular mechanism of DLBCL pathogenesis. Thus, there is a compelling need to develop more reliable methods to identify DLBCL miRNA biomarkers. The goal of our research is to analyze miRNA sets implicated in previous studies (9-11), identify their distinguishing and common descriptors, and create a machine-learning (ML) model that can distinguish DLBCL patients from healthy control subjects. Furthermore, the target genes can be cross-referenced in pathway analysis to identify oncogenic pathways in which the miRNAs participate.

Methods

We used the following programs and databases for miRNA analysis, ML model training and testing, and pathway analysis: miRDB (12,13), Waikato Environment for Knowledge Analysis (WEKA) (14), Database for Annotation, Visualization, and Integrated Discovery (DAVID) (15,16), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (17-19). The flowchart of methods is shown in Figure 1.

Figure 1 Flowchart depicting the tools used during this study. First, miRNAs implicated in DLBCL and unrelated, random miRNAs were combined to form the miRNA training dataset. Each miRNA was annotated with three types of descriptors: predicted target genes, sequence-based attributes, and associated pathways. The training dataset went through correlation-based feature selection (CfsSubsetEval) in WEKA to reduce the number of descriptors. The data were then used to develop the ML model using four classifiers in WEKA that were tested through 10-fold cross-validation. Next, miRNAs reported in actual DLBCL samples were collected and annotated with the same three types of attributes: target genes, sequence-based attributes, and pathways. We then tested three independent datasets in the trained classifier models to predict DLBCL involvement. Finally, through DAVID and KEGG, the pathways significantly associated with the miRNAs were identified. DLBCL, diffuse large B-cell lymphoma; ML, machine-learning; DAVID, Database for Annotation, Visualization, and Integrated Discovery; WEKA, Waikato Environment for Knowledge Analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

PubMed

PubMed was used to search for literature on miRNA and DLBCL with the following criteria: (miRNA OR microRNA) AND ((Diffuse Large B-Cell Lymphoma) OR (DLBCL) OR (B-Cell non-Hodgkin Lymphoma)) AND ((review[Filter]) AND (2019/1/1:3000/12/12[pdat])). The process began with the selection of miRNAs that were significantly dysregulated in biopsied tissue specimens from individuals diagnosed with DLBCL and that were confirmed in at least two research papers. A control set of randomly selected miRNAs that have not been implicated in DLBCL was extracted from the downloaded file of all human miRNAs from miRBase (20).

miRDB

miRDB is a database for miRNA target prediction. By searching for the selected miRNAs in miRDB, we annotated each of them with its predicted target genes that had a target score of 97 or above. The control set of random miRNAs was annotated in the same way, using predicted target genes from miRDB with a score of 97 or above. We selected a target score threshold of 97 to prioritize high-confidence miRNA-target gene interactions. To facilitate the analysis, we developed a table in which each miRNA is represented as a row and each unique gene as a column (a descriptor). For each miRNA-gene pair in the table, we assigned a ‘yes’ if the miRNA targets the respective gene, and ‘no’ otherwise.
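The table construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the miRNA names, gene names, and scores are invented, and the dictionary layout is a hypothetical stand-in for whatever export of miRDB predictions is actually used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical miRDB-style predictions: miRNA -> {gene: target score}.
# All names and scores here are invented for illustration only.
predictions: Dict[str, Dict[str, int]] = {
    "miR-A": {"GENE1": 99, "GENE2": 97, "GENE3": 80},
    "miR-B": {"GENE2": 98, "GENE4": 96},
    "miR-C": {"GENE1": 97, "GENE4": 100},
}

SCORE_THRESHOLD = 97  # threshold stated in the text


def build_target_table(
    preds: Dict[str, Dict[str, int]], threshold: int
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return the gene columns and one 'yes'/'no' row per miRNA."""
    # Keep only the predicted targets at or above the score threshold.
    kept: Dict[str, Set[str]] = {
        mirna: {gene for gene, score in targets.items() if score >= threshold}
        for mirna, targets in preds.items()
    }
    # Every gene that survives for at least one miRNA becomes a column.
    columns = sorted(set().union(*kept.values()))
    table = {
        mirna: ["yes" if gene in genes else "no" for gene in columns]
        for mirna, genes in kept.items()
    }
    return columns, table


columns, table = build_target_table(predictions, SCORE_THRESHOLD)
for mirna, row in table.items():
    print(mirna, dict(zip(columns, row)))
```

Genes that never reach the threshold for any miRNA (GENE3 in this toy input) produce no column, which is why the number of descriptors depends on the threshold chosen.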

miRDB also provides miRNA sequence information, which we used to generate sequence-based attributes following a method developed by Kang et al. (21). The sequence of a miRNA is important because it determines the complementary binding to the target genes through which the miRNA inhibits their expression (22). Table 1 shows the sequence-based attributes. N_A is the number of adenine bases, N_C is the number of cytosine bases, N_G is the number of guanine bases, N_U is the number of uracil bases, and N is the total number of bases.

Table 1

Sequence-based attributes and their descriptions.

Sequence-based attribute	Description
Bases in miRNA sequence	N
Frequency of each base	N_A/N, N_C/N, N_G/N, N_U/N
Mean mass of bases	(135.1 N_A + 111.1 N_C + 151.1 N_G + 112.1 N_U)/N
Number of hydrogen bonds	2 (N_A + N_U) + 3 (N_C + N_G)
2 base motifs	Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not
3 base motifs	Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not
4 base motifs	Each motif is a separate attribute, and the miRNA is assigned “Yes” if it has the motif, “No” if not

N_A is the number of adenine bases, N_C is the number of cytosine bases, N_G is the number of guanine bases, N_U is the number of uracil bases, and N is the total number of bases.
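The attributes in Table 1 can be computed directly from a sequence string. The following is a minimal sketch, not the implementation used in the study: the base masses and formulas are taken from Table 1, the names NUM_OF_BASES, RATIO_G, RATIO_U, and MEAN_MASS follow Table 2, while RATIO_A, RATIO_C, H_BONDS, and the example sequence are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, Tuple, Union

# Base masses as given in Table 1.
BASE_MASS = {"A": 135.1, "C": 111.1, "G": 151.1, "U": 112.1}


def sequence_attributes(
    seq: str, motif_lengths: Tuple[int, ...] = (2, 3, 4)
) -> Dict[str, Union[int, float, str]]:
    """Compute the Table 1 attributes for one miRNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = {base: seq.count(base) for base in "ACGU"}

    attrs: Dict[str, Union[int, float, str]] = {"NUM_OF_BASES": n}
    for base in "ACGU":
        attrs[f"RATIO_{base}"] = counts[base] / n
    attrs["MEAN_MASS"] = sum(BASE_MASS[b] * counts[b] for b in "ACGU") / n
    # A-U pairs contribute 2 hydrogen bonds, C-G pairs contribute 3.
    attrs["H_BONDS"] = 2 * (counts["A"] + counts["U"]) + 3 * (counts["C"] + counts["G"])

    # One Yes/No attribute per possible motif of each length.
    for k in motif_lengths:
        for motif in map("".join, product("ACGU", repeat=k)):
            attrs[motif] = "Yes" if motif in seq else "No"
    return attrs


# Invented 9-base sequence, used only to exercise the function.
example = sequence_attributes("UGGAAUCAU")
print(example["NUM_OF_BASES"], example["H_BONDS"], round(example["MEAN_MASS"], 2))
print(example["UGG"], example["AAUC"], example["CGGG"])
```

With motif lengths 2, 3, and 4 there are 16 + 64 + 256 = 336 motif attributes per miRNA, which together with the target-gene columns explains why attribute selection is needed before training.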

The Database for Annotation, Visualization, and Integrated Discovery (DAVID)

DAVID uses over-representation analysis (ORA) to identify pathways that are significantly enriched in a given gene list (15,16). In our case, we input the target genes from all selected miRNA biomarkers into DAVID, which uses Fisher’s exact test to compute a P value for each associated pathway. We selected the pathways with a P value of less than 0.05. The lower the P value, the stronger the evidence that the pathway is over-represented among the miRNA target genes. We then annotated each miRNA with its associated pathways: each unique pathway was added as a column (descriptor) in the miRNA data table, and for each miRNA-pathway pair we assigned a ‘yes’ if the miRNA is associated with the pathway, and ‘no’ otherwise.
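To make the over-representation test concrete, the sketch below computes a one-sided Fisher's exact test P value as the upper tail of a hypergeometric distribution. It is an illustration only: DAVID's own scoring and choice of background gene set may differ in detail, and the counts in the example are invented rather than taken from this study.

```python
from math import comb


def ora_p_value(hits: int, list_size: int, pathway_size: int, background: int) -> float:
    """One-sided enrichment P value, P(X >= hits).

    X is the number of pathway genes found when `list_size` genes are drawn
    without replacement from `background` genes, of which `pathway_size`
    belong to the pathway (hypergeometric distribution).
    """
    upper = min(list_size, pathway_size)
    total = comb(background, list_size)
    tail = sum(
        comb(pathway_size, i) * comb(background - pathway_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Invented toy numbers: 20 background genes, 5 in the pathway,
# a gene list of 6, of which 3 fall in the pathway.
p = ora_p_value(hits=3, list_size=6, pathway_size=5, background=20)
print(round(p, 4))
```

A pathway would be retained under the rule stated above when this P value is below 0.05; in the toy example it is not.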

WEKA

Developed by the University of Waikato, WEKA is primarily used for data mining through a machine-learning approach. The application consists of multiple machine-learning tools, but this study mostly utilized the Preprocess, Classify, and Auto-WEKA tools. We combined the set of miRNAs implicated in DLBCL and the control set of random miRNAs to form the training set used to develop the ML model. With the training set, after running CfsSubsetEval attribute selection to reduce attributes from 9,241 to 34, we tested the performance of four different ML classifiers in WEKA: Bayes network, logistic regression (LR), multi-layer perceptron (MLP), and random forest. The Bayes network classifier was found through the Auto-WEKA plug-in. We then analyzed the significantly dysregulated miRNA profiles of DLBCL patients published by Lawrie et al. (23), Beheshti et al. (24), and Sun et al. (25). These miRNA sets became the independent datasets used to test the effectiveness of our trained ML models. We annotated these miRNAs with target genes and sequence-based attributes from miRDB and with associated pathways from DAVID, just as we did for the training set. Three such sets were obtained and tested against the trained ML models to verify how accurately the models classify the miRNAs.
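Correlation-based feature selection scores whole subsets of attributes rather than single attributes. The sketch below shows the merit heuristic from Hall's correlation-based feature selection, on which CfsSubsetEval is understood to be based; treating it as the exact computation performed inside WEKA is an assumption, and the correlation values are invented.

```python
from math import sqrt


def cfs_merit(k: int, mean_feature_class_corr: float, mean_feature_feature_corr: float) -> float:
    """Merit of a subset of k attributes.

    High average attribute-class correlation raises the merit, while high
    average correlation among the attributes themselves (redundancy)
    lowers it.
    """
    if k < 1:
        raise ValueError("subset must contain at least one attribute")
    numerator = k * mean_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Invented values: four attributes that are equally predictive.
print(cfs_merit(4, 0.5, 0.0))  # mutually uncorrelated attributes
print(cfs_merit(4, 0.5, 1.0))  # fully redundant attributes
```

Under this heuristic, four fully redundant attributes score no better than one of them alone, which is the behavior that allows a very large descriptor table to be reduced to a small subset.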

KEGG

With the pathways identified using DAVID, we then used KEGG to map the miRNA target genes onto the pathways and to illustrate the number of miRNAs associated with each gene in the pathways.

Results

miRNAs dysregulated in DLBCL were selected from the studies referenced in Table S1. Several systematic reviews (9-11) have also reported their roles in the pathogenesis of DLBCL. We selected only miRNAs that had been confirmed in at least two studies. A total of 54 miRNAs were selected and are listed in Table S1. This 54-miRNA set was then combined with 54 random miRNAs (not implicated in DLBCL) to form the training dataset for the ML classifiers.

ML analysis

The DLBCL miRNA training dataset was used to develop ML models with four different classifiers in WEKA: Bayes network, LR, MLP, and random forest. We then conducted 10-fold cross-validation to produce average accuracy rates. When predicting whether a miRNA was associated with DLBCL based on its target-gene, pathway, and sequence attributes, the selected WEKA classifiers yielded accuracy rates from 86.11% to 93.52%, compared with a benchmark of 46.30% given by ZeroR, a baseline classifier that always predicts the majority class of its training data (see Figure 2 for a chart of the classifiers and their accuracy rates).
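A majority-class baseline can score below 50% on a perfectly balanced dataset under stratified cross-validation, because the class that is over-represented in a test fold is under-represented in the corresponding training folds. The sketch below reproduces this effect for 54 "yes" and 54 "no" instances. The round-robin fold assignment is an assumption and is not WEKA's actual implementation, so this is offered as one plausible explanation of the 46.30% benchmark rather than a confirmed one.

```python
from collections import Counter
from typing import List, Tuple


def stratified_folds(labels: List[str], k: int) -> List[List[int]]:
    """Group indices by class, then deal them out to k folds round-robin."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    folds: List[List[int]] = [[] for _ in range(k)]
    for position, index in enumerate(order):
        folds[position % k].append(index)
    return folds


def majority_baseline_cv(labels: List[str], k: int = 10) -> Tuple[int, float]:
    """Cross-validated accuracy of a classifier that predicts the training majority."""
    correct = 0
    for test in stratified_folds(labels, k):
        held_out = set(test)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test)
    return correct, correct / len(labels)


labels = ["yes"] * 54 + ["no"] * 54
correct, accuracy = majority_baseline_cv(labels)
print(correct, round(100 * accuracy, 2))  # 50 46.3
```

Each fold contributes exactly five correct predictions regardless of how ties are broken, giving 50 of 108, which matches the reported benchmark to two decimal places.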

Figure 2 A visual chart of the five different classifiers and their accuracies for the training miRNA data set. MLP, multi-layer perceptron.

The final training dataset consisted of 108 miRNAs, 54 related to DLBCL and 54 randomly selected. After attribute selection, 33 attributes remained: 15 target genes (AKAP13, AKR1B10, ASB4, BTG2, CASP8AP2, CELSR1, CREBRF, KCNMA1, MMGT1, NPTX1, PPP1R11, RPS6KA5, TRIM71, ZBTB34, ZNF512B), 4 sequence-based attributes (number of nucleotide bases, frequency of G, frequency of U, mean mass), 3 three-base motifs (CAU, GGA, UGG), 10 four-base motifs (AAUC, AUAU, AUCU, CGGG, GAGC, GCGG, GGAC, UACA, UGGA, UUGU), and 1 pathway attribute (pathways in cancer, hsa05200). In the motifs, A stands for adenine, C for cytosine, G for guanine, and U for uracil.

We performed feature importance analysis to better understand the contribution of each miRNA attribute to the model’s predictions. For random forest, WEKA’s built-in attribute importance function based on Gini impurity metric was used, and the results are shown in Table 2.

Table 2

Attribute importance based on average impurity decrease (and number of nodes using that attribute)

Attribute name	Average impurity decrease	Number of nodes
MEAN_MASS	0.38	194
PPP1R11	0.37	19
CASP8AP2	0.35	27
TRIM71	0.34	50
hsa05200	0.33	142
RATIO_G	0.33	225
CAU	0.31	72
ASB4	0.31	23
RATIO_U	0.30	231
AAUC	0.30	55
NUM_OF_BASES	0.30	176
UUGU	0.28	75
NPTX1	0.28	22
AKAP13	0.27	16
UGG	0.27	82
ZNF512B	0.25	40
BTG2	0.25	36
CGGG	0.25	32
UACA	0.24	23
GCGG	0.24	65
CREBRF	0.22	26
GGA	0.22	59
ZBTB34	0.19	21
MMGT1	0.18	19
AKR1B10	0.18	15
UGGA	0.17	33
CELSR1	0.17	7
GGAC	0.17	42
AUAU	0.17	24
GAGC	0.17	45
AUCU	0.16	41
KCNMA1	0.13	33
RPS6KA5	0.11	22

With these trained models, we tested three independent miRNA datasets. All three test sets were derived from studies (23-25) specifically focused on identifying dysregulated miRNAs in DLBCL patients, which is why they are imbalanced. We intentionally used the full set of miRNAs from the studies as they were reported, which resulted in some overlap with the training set. To address this issue, we conducted additional experiments where the overlapping miRNAs were removed from the test sets to ensure a fair evaluation. The results for both the full set and the “clean” set (with no overlap) are presented in Figures 3-5.

Figure 3 The accuracy rates of independent dataset #1. Both the full set and the clean set (without overlap with miRNAs found in the training set) are presented. MLP, multi-layer perceptron.

Figure 4 The accuracy rates of independent dataset #2. Both the full set and the clean set (without overlap with miRNAs found in the training set) are presented. MLP, multi-layer perceptron.

Figure 5 The accuracy rates of independent dataset #3. Both the full set and the clean set (without overlap with miRNAs found in the training set) are presented. MLP, multi-layer perceptron.

The rationale for including the results with the full set is based on the real-world scenario where miRNA profiles from patients may contain miRNAs that were already part of the training data. By testing with the full miRNA collection, we aim to evaluate how the model performs in a setting that reflects practical clinical applications, where some miRNAs may be known. However, the “clean” set results offer a more stringent assessment, and we present both to provide a comprehensive evaluation of classifier performance.
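The construction of a "clean" set amounts to a set difference between a test list and the training list. In the sketch below, the test names are those listed for independent dataset #2 (Table 4); the training list is a hypothetical stand-in containing only the five miRNAs marked as overlapping there, since the full training set is given in Table S1 and is not reproduced here.

```python
from typing import Iterable, List


def clean_test_set(test_mirnas: Iterable[str], training_mirnas: Iterable[str]) -> List[str]:
    """Drop every test miRNA that also appears in the training set."""
    training = set(training_mirnas)
    return [m for m in test_mirnas if m not in training]


# Independent dataset #2 as listed in Table 4.
test_set_2 = [
    "hsa-mir-10b", "hsa-mir-155", "hsa-let-7c", "hsa-let-7b", "hsa-mir-130a",
    "hsa-mir-24", "hsa-mir-27a", "hsa-mir-18a", "hsa-mir-15a",
]

# Hypothetical stand-in for the training miRNAs: only the entries
# marked with an asterisk in Table 4 are included.
training_subset = ["hsa-mir-155", "hsa-mir-24", "hsa-mir-27a", "hsa-mir-18a", "hsa-mir-15a"]

clean = clean_test_set(test_set_2, training_subset)
overlap_pct = 100 * (len(test_set_2) - len(clean)) / len(test_set_2)
print(clean, round(overlap_pct, 2))
```

Because the clean sets are small, a single misclassified miRNA shifts the reported accuracy by many percentage points, which should be kept in mind when comparing the full and clean results.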

Independent dataset #1, comprising 22 miRNAs, was taken from a study of 64 DLBCL patients by Lawrie et al. (23). In that study, biopsy samples from 64 patients with de novo DLBCL were analyzed using tonsillar samples from healthy individuals as a common reference. The authors found 22 differentially expressed miRNAs, listed in Table 3. On independent set #1, the models predicted the miRNAs associated with DLBCL with an accuracy from 81.82% to 86.36%. After removing the 12 miRNAs common to the independent test set and the training set, the accuracy ranged from 60% to 70%. The results are depicted in Figure 3.

Table 3

Independent dataset #1 has 22 miRNAs related to DLBCL. Twelve miRNAs, 54.55%, in this dataset are overlapping with the training dataset and are marked with an asterisk (*)

miRNA test dataset #1

hsa-mir-200c

hsa-mir-638

hsa-mir-518a

hsa-mir-199a*

hsa-mir-93*

hsa-mir-22

hsa-mir-34a*

hsa-mir-362

hsa-mir-206

hsa-mir-451*

hsa-mir-636

hsa-mir-92*

hsa-mir-27b*

hsa-mir-199b

hsa-mir-27a*

hsa-mir-24*

hsa-mir-106a*

hsa-mir-20a*

hsa-mir-19b*

hsa-mir-99a

hsa-mir-18b*

hsa-mir-100

DLBCL, diffuse large B-cell lymphoma.

Independent dataset #2, comprising 9 miRNAs, was taken from a study of 86 DLBCL patients by Beheshti et al. (24). In that study, the researchers analyzed serum from 86 DLBCL patients and recorded the miRNAs with higher circulating levels, listed in Table 4. On independent set #2, all models predicted the miRNAs with an accuracy of 100%. After removing the 5 miRNAs common to the independent set and the training set, the accuracy remained 100%. The results are depicted in Figure 4.

Table 4

Independent dataset #2 has 9 miRNAs. Five miRNAs, 55.56%, in this dataset are overlapping with the training dataset and are marked with an asterisk (*)

miRNA test dataset #2

hsa-mir-10b

hsa-mir-155*

hsa-let-7c

hsa-let-7b

hsa-mir-130a

hsa-mir-24*

hsa-mir-27a*

hsa-mir-18a*

hsa-mir-15a*

Independent dataset #3, comprising 8 miRNAs, was taken from a study of 20 DLBCL patients by Sun et al. (25). That study collected serum samples from 20 newly diagnosed DLBCL patients at diagnosis and analyzed them by miRNA array. The authors reported the miRNAs with a mean fold change greater than 2.5 and a P value less than 0.05, listed in Table 5. On independent set #3, the models predicted the miRNAs with an accuracy from 75.00% to 87.50%. After removing the 2 miRNAs common to the independent set and the training set, the accuracy ranged from 66.67% to 83.33%. The results are depicted in Figure 5.

Table 5

Independent dataset #3 has 8 miRNAs. Two miRNAs, 25.00%, in this dataset are overlapping with the training dataset and are marked with an asterisk (*)

miRNA test dataset #3

hsa-mir-21*

hsa-mir-130b

hsa-mir-155*

hsa-mir-7

hsa-mir-28

hsa-mir-128

hsa-mir-424

hsa-mir-454

The independent test sets contain only “Yes” (DLBCL-associated) miRNAs, and thus the model should ideally classify all instances as “Yes”. However, the model misclassified instances as “No” in independent sets #1 and #3. This indicates that despite the model’s overall strong performance, it exhibited uncertainty with certain “Yes” instances, potentially due to feature interactions or model structure issues.
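Because every instance in these test sets is a "Yes", accuracy on them equals the fraction of miRNAs classified as "Yes", and each reported percentage corresponds to a small whole number of correct predictions. The sketch below recovers those counts from the percentages and set sizes given above; the counts are inferred arithmetic, not values reported by the source studies or by WEKA.

```python
from typing import List


def counts_matching(pct: float, n: int) -> List[int]:
    """Numbers of correct predictions out of n that round to pct (2 decimals)."""
    return [k for k in range(n + 1) if abs(round(100 * k / n, 2) - pct) < 1e-9]


# (label, reported accuracy in %, number of miRNAs in the set)
reported = [
    ("set #1 full, low", 81.82, 22),
    ("set #1 full, high", 86.36, 22),
    ("set #1 clean, low", 60.00, 10),
    ("set #1 clean, high", 70.00, 10),
    ("set #3 full, low", 75.00, 8),
    ("set #3 full, high", 87.50, 8),
    ("set #3 clean, low", 66.67, 6),
    ("set #3 clean, high", 83.33, 6),
]

for label, pct, n in reported:
    print(f"{label}: {counts_matching(pct, n)} of {n}")
```

Read this way, the best classifier on set #1 misclassified 3 of 22 miRNAs, and on the clean version of set #3 a single miRNA accounts for a difference of about 17 percentage points.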

We will investigate this further by analyzing feature importance to identify whether specific features are leading to these misclassifications. Additionally, we plan to refine the model through parameter tuning and potentially explore alternative classifiers to improve its ability to correctly identify all “Yes” instances in future experiments.

Pathway analysis

We used DAVID to identify enriched pathways most relevant to the target genes of the 54 training miRNAs. Specifically, we input a total of 1,100 target genes from all 54 miRNAs into DAVID for pathway analysis. We found various pathways with a P value ≤0.05 and listed them in Table 6. Notable pathways with the lowest P values and the highest significance include signaling pathways regulating pluripotency of stem cells with a P value <0.001 and 23 target genes. Additionally, the endocrine resistance pathway had 18 target genes with a P value <0.001. Accordingly, since DLBCL miRNA target genes fall on these pathways, these pathways may be associated with the cancer and should be further examined. Pathways with more than 20 genes targeted by the miRNA biomarkers are depicted in Figure 6.

Table 6

Enriched pathways identified through DAVID analysis of 1,100 miRNA target genes, with P values ≤0.05

Term	Count of target genes	−Log₁₀(P)	P value
Signaling pathways regulating pluripotency of stem cells	23	4.6021	<0.001
Endocrine resistance	18	4.3468	<0.001
FoxO signaling pathway	21	4.1805	<0.001
Pathways in cancer	53	3.8861	<0.001
Axon guidance	25	3.8539	<0.001
Proteoglycans in cancer	27	3.8539	<0.001
Neurotrophin signaling pathway	19	3.7696	<0.001
mTOR signaling pathway	22	3.5686	<0.001
Cellular senescence	22	3.5686	<0.001
PI3K-Akt signaling pathway	38	3.4318	<0.001
EGFR tyrosine kinase inhibitor resistance	14	3.2291	<0.001
Prolactin signaling pathway	13	3.1871	<0.001
MAPK signaling pathway	33	3.1427	<0.001
Focal adhesion	25	3.1427	<0.001
Protein digestion and absorption	16	3.0555	<0.001
Glutamatergic synapse	17	3.0044	<0.001
Autophagy - animal	19	2.8539	0.001
Regulation of actin cytoskeleton	26	2.7447	0.001
Human papillomavirus infection	33	2.4949	0.003
Lysine degradation	11	2.4815	0.003
Transcriptional misregulation in cancer	22	2.3768	0.004
AMPK signaling pathway	16	2.3565	0.004
Tight junction	20	2.3565	0.004
Ras signaling pathway	25	2.2596	0.005
Phospholipase D signaling pathway	18	2.2518	0.005
Renin secretion	11	2.1938	0.006
ErbB signaling pathway	12	2.0000	0.01
p53 signaling pathway	11	2.0000	0.01
Phosphatidylinositol signaling system	13	2.0000	0.01
Wnt signaling pathway	19	1.9586	0.01
Hedgehog signaling pathway	9	1.8239	0.01
cGMP-PKG signaling pathway	18	1.7447	0.01
Oocyte meiosis	15	1.6990	0.02
Endocytosis	24	1.6778	0.02
cAMP signaling pathway	22	1.6576	0.02
Inositol phosphate metabolism	10	1.6021	0.02
Apelin signaling pathway	15	1.4949	0.03
ECM-receptor interaction	11	1.4685	0.03
Relaxin signaling pathway	14	1.4202	0.03
C-type lectin receptor signaling pathway	12	1.3979	0.04
GnRH signaling pathway	11	1.3565	0.04
Thyroid hormone signaling pathway	13	1.3098	0.049
TGF-beta signaling pathway	12	1.3010	0.05

DAVID, Database for Annotation, Visualization, and Integrated Discovery; FoxO, Forkhead box O; mTOR, mammalian target of rapamycin; PI3K, phosphoinositide 3-kinase; Akt, Ak strain transforming; EGFR, epidermal growth factor receptor; MAPK, mitogen-activated protein kinase; AMPK, AMP-activated protein kinase; AMP, adenosine 3',5'-monophosphate; Ras, rat sarcoma; ErbB, erythroblastic oncogene B; cGMP, cyclic guanosine monophosphate; PKG, protein kinase G; cAMP, cyclic adenosine 3',5'-monophosphate; ECM, extracellular matrix; GnRH, gonadotropin-releasing hormone; TGF, transforming growth factor.
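The −Log₁₀(P) column of Table 6 can be converted back to a P value as P = 10^(−x), which is useful because the rounded P value column hides the ordering among the most significant pathways. The sketch below does this for four rows copied from Table 6 and applies the "more than 20 target genes" filter used for Figure 6; the selection of rows is arbitrary and purely illustrative.

```python
from typing import List, Tuple


def p_from_neg_log10(x: float) -> float:
    """Invert x = -log10(P)."""
    return 10 ** (-x)


# (term, count of target genes, -log10(P)) copied from Table 6.
rows: List[Tuple[str, int, float]] = [
    ("Signaling pathways regulating pluripotency of stem cells", 23, 4.6021),
    ("FoxO signaling pathway", 21, 4.1805),
    ("PI3K-Akt signaling pathway", 38, 3.4318),
    ("TGF-beta signaling pathway", 12, 1.3010),
]

for term, count, neg_log in rows:
    print(f"{term}: genes={count}, P={p_from_neg_log10(neg_log):.2e}")

# Pathways that pass the gene-count filter described for Figure 6.
plotted = [term for term, count, _ in rows if count > 20]
print(plotted)
```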

Figure 6 Pathways with a P value of less than 0.05 and more than 20 target genes. The pathways are graphed as follows: the x-axis is the number of target genes of DLBCL miRNAs in the pathway, and the y-axis is the negative common logarithm of the P value of each enriched pathway. The pathways with the highest negative common logarithm and the most target genes should ideally be investigated further as pathways related to DLBCL. FoxO, Forkhead box O; mTOR, mammalian target of rapamycin; PI3K, phosphoinositide 3-kinase; Akt, Ak strain transforming; MAPK, mitogen-activated protein kinase; Ras, rat sarcoma; cAMP, cyclic adenosine 3',5'-monophosphate; DLBCL, diffuse large B-cell lymphoma.

Furthermore, after submitting the target genes of the DLBCL miRNA training data to KEGG, we created a map of pathways in cancer that highlights the number of target genes in each pathway, as shown in Figure 7. These results show that the miRNAs participate in significant sections of several main cancer pathways.

Figure 7 KEGG pathways in cancer visual. The target genes are marked in varying shades of color. The most targeted genes are rendered in yellow as illustrated in the color keys in the upper left corner. KEGG, Kyoto Encyclopedia of Genes and Genomes.

Discussion

The performance of the classifiers was evaluated using metrics derived from confusion matrices: true positive rate (TPR) (sensitivity), false positive rate (FPR), precision, recall, F-measure (harmonic mean of precision and recall), Matthews correlation coefficient (MCC), area under the receiver operating characteristic (ROC) curve (AUC) (the balance between the TPR and FPR), and area under the precision-recall (PR) curve (AUC-PR) (the balance between precision and recall). Table 7 presents a comparison of the four classifiers examined in this study: Bayes network, LR, MLP, and random forest.

Table 7

Performance comparison of the classifiers

Classifier	Accuracy	TP rate	FP rate	Precision	Recall	F-measure	MCC	AUC	AUC-PR
Bayes network	0.917	0.917	0.083	0.917	0.917	0.917	0.833	0.967	0.967
Logistic regression	0.935	0.935	0.065	0.937	0.935	0.935	0.872	0.973	0.974
Multi-layer perceptron	0.880	0.880	0.120	0.886	0.880	0.879	0.766	0.963	0.962
Random forest	0.861	0.861	0.139	0.861	0.861	0.861	0.772	0.941	0.944

TP, true positive; FP, false positive; MCC, Matthews correlation coefficient; AUC, area under receiver operating characteristic curve; AUC-PR, area under precision-recall curve.
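The threshold-based metrics in Table 7 all derive from the four cells of a binary confusion matrix. The sketch below computes them for a single positive class; the counts are invented for illustration and are not the confusion matrix of any classifier in this study. WEKA's summary rows are typically averages over both classes, so values computed this way for one class need not coincide with Table 7.

```python
from math import sqrt
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Metrics for the positive class of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)  # true positive rate, sensitivity
    precision = tp / (tp + fp)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Invented counts for a balanced 54/54 dataset.
metrics = binary_metrics(tp=50, fn=4, fp=6, tn=48)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC and AUC-PR are not included because they require the ranked prediction scores rather than a single confusion matrix.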

All classifiers exhibited an accuracy exceeding 85%, with LR achieving the highest at 93.5% in the cross-validation experiments. LR demonstrated a low FPR of 6.5%, well below the 16% false-negative rate reported for FNA biopsies in Sandhaus’s study (3). Moreover, LR exhibited a sensitivity of 93.5%, notably surpassing the 74% sensitivity reported in Paquin’s FNA study (4). Compared with CNB, which had a diagnostic rate of 56.8% (4), our miRNA-based diagnosis also showed superior performance. Notably, our best-performing LR model approached the accuracy reported for excisional biopsies in other studies (4), demonstrating the potential to deliver high accuracy without being constrained by sample size limitations.

We also investigated the 15 target gene attributes used in the model after attribute selection: AKAP13, AKR1B10, ASB4, BTG2, CASP8AP2, CELSR1, CREBRF, KCNMA1, MMGT1, NPTX1, PPP1R11, RPS6KA5, TRIM71, ZBTB34, ZNF512B. Some of the genes, such as AKAP13 (26), CASP8AP2 (27), CREBRF (28), and BTG2 (29), have been implicated in DLBCL. For example, dysregulation of CASP8AP2 has been associated with DLBCL and may impact apoptotic pathways, which play a crucial role in cancer development and progression (27). The other 11 genes have not been implicated in DLBCL but have been implicated in the development of other cancers. For instance, AKR1B10 plays a role in metabolism and its elevated expression is linked to unfavorable prognoses across several cancer types (30). Meanwhile, ZBTB34 functions as a transcriptional repressor and has been implicated in hematological malignancies (31). The fact that all 15 target genes identified through attribute selection in our model are implicated in cancer pathogenesis is not coincidental. This serves as evidence that our model effectively identifies oncogenes.

The methodology outlined in this study holds promise for miRNA analysis in various other types of cancer or diseases, provided that relevant miRNA data is available. In a clinical setting, sequencing data from a patient’s biopsy or serum sample would first be processed to obtain the miRNA expression levels. These expression levels would then be transformed into the feature space used by the ML models, which includes attributes such as target genes, enriched pathways, and sequence-based properties. Once the sequencing profiles of an individual patient are mapped to these features, the trained models can be applied to predict whether the miRNA profile is associated with DLBCL. Thus, the patient miRNA data would function like a new independent test set for the trained model. This prediction could assist clinicians in making decisions regarding diagnosis by identifying miRNAs linked to DLBCL.

Conclusions

We developed a machine-learning system for the diagnosis of DLBCL with 54 miRNAs, reaching a best performance of 93.52% on the training dataset and best accuracies of 86.36%, 100%, and 87.50% on three independent datasets extracted from miRNA sequence data of actual tumor samples. These results support our hypothesis that a diagnosis can be made from miRNA sequencing data of tumor samples using known miRNA biomarkers and their target-gene descriptors.

Furthermore, we identified the enriched pathways in which the miRNAs have a significant presence. Some of these pathways [e.g., PI3K signaling pathway (32,33)] have already been discussed in several publications, but others (e.g., FoxO signaling pathway) have been investigated less intensively and could yield new insights into the pathogenesis of DLBCL.

Acknowledgments

Funding: None.

Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-282/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-282/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. IRB approval and informed consent are waived as there is no human subject involved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Liu Y, Barta SK. Diffuse large B-cell lymphoma: 2019 update on diagnosis, risk stratification, and treatment. Am J Hematol 2019;94:604-16.
2. Harris LJ, Patel K, Martin M. Novel Therapies for Relapsed or Refractory Diffuse Large B-Cell Lymphoma. Int J Mol Sci 2020;21:8553.
3. Sandhaus LM. Fine-needle aspiration cytology in the diagnosis of lymphoma. The next step. Am J Clin Pathol 2000;113:623-7.
4. Paquin AR, Oyogoa E, McMurry HS, et al. The diagnosis and management of suspected lymphoma in general practice. Eur J Haematol 2023;110:3-13.
5. Sempere LF, Azmi AS, Moore A. microRNA-based diagnostic and therapeutic applications in cancer medicine. Wiley Interdiscip Rev RNA 2021;12:e1662.
6. Precazzini F, Detassis S, Imperatori AS, et al. Measurements Methods for the Development of MicroRNA-Based Tests for Cancer Diagnosis. Int J Mol Sci 2021;22:1176.
7. He L, Thomson JM, Hemann MT, et al. A microRNA polycistron as a potential human oncogene. Nature 2005;435:828-33.
8. Liu K, Du J, Ruan L. MicroRNA-21 regulates the viability and apoptosis of diffuse large B-cell lymphoma cells by upregulating B cell lymphoma-2. Exp Ther Med 2017;14:4489-96.
9. Fuertes T, Ramiro AR, de Yebenes VG. miRNA-Based Therapies in B Cell Non-Hodgkin Lymphoma. Trends Immunol 2020;41:932-47.
10. Getaneh Z, Asrie F, Melku M. MicroRNA profiles in B-cell non-Hodgkin lymphoma. EJIFCC 2019;30:195-214.
11. Larrabeiti-Etxebarria A, Lopez-Santillan M, Santos-Zorrozua B, et al. Systematic Review of the Potential of MicroRNAs in Diffuse Large B Cell Lymphoma. Cancers (Basel) 2019;11:144.
12. Chen Y, Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic Acids Res 2020;48:D127-31.
13. Liu W, Wang X. Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data. Genome Biol 2019;20:18.
14. Frank E, Hall MA, Witten IH. The WEKA Workbench, Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”. Morgan Kaufmann, Fourth Edition, 2016. The University of Waikato; 2016.
15. Sherman BT, Hao M, Qiu J, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 2022;50:W216.
16. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44-57.
17. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27-30.
18. Kanehisa M. Toward understanding the origin and evolution of cellular organisms. Protein Sci 2019;28:1947-51.
19. Kanehisa M, Furumichi M, Sato Y, et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 2023;51:D587-92.
20. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res 2019;47:D155-62.
21. Kang W, Kouznetsova VL, Tsigelny IF. miRNA in Machine-learning-based Diagnostics of Cancers. Cancer Screening and Prevention 2022;1:132-8.
22. Macfarlane LA, Murphy PR. MicroRNA: Biogenesis, Function and Role in Cancer. Curr Genomics 2010;11:537-61.
23. Lawrie CH, Chi J, Taylor S, et al. Expression of microRNAs in diffuse large B cell lymphoma is associated with immunophenotype, survival and transformation from follicular lymphoma. J Cell Mol Med 2009;13:1248-60.
24. Beheshti A, Stevenson K, Vanderburg C, et al. Identification of Circulating Serum Multi-MicroRNA Signatures in Human DLBCL Models. Sci Rep 2019;9:17161.
25. Sun R, Zheng Z, Wang L, et al. A novel prognostic model based on four circulating miRNA in diffuse large B-cell lymphoma: implications for the roles of MDSC and Th17 cells in lymphoma progression. Mol Oncol 2021;15:246-61.
26. Luo X, Shi F, Qiu H, et al. Identification of potential key genes associated with diffuse large B-cell lymphoma based on microarray gene expression profiling. Neoplasma 2017;64:824-33.
27. Lan Q, Morton LM, Armstrong B, et al. Genetic variation in caspase genes and risk of non-Hodgkin lymphoma: a pooled analysis of 3 population-based case-control studies. Blood 2009;114:264-7.
28. D’Auria F, Di Pietro R. Role of CREB protein family members in human haematological malignancies, Cancer Treatment-Conventional and Innovative Approaches. IntechOpen 2013. doi: 10.5772/55368.
29. Guo D, Hong L, Ji H, et al. The Mutation of BTG2 Gene Predicts a Poor Outcome in Primary Testicular Diffuse Large B-Cell Lymphoma. J Inflamm Res 2022;15:1757-69.
30. Banerjee S. Aldo Keto Reductases AKR1B1 and AKR1B10 in Cancer: Molecular Mechanisms and Signaling Networks. Adv Exp Med Biol 2021;1347:65-82.
31. Liu Z, Jin D, Wei X, et al. ZBTB34 is a hepatocellular carcinoma-associated protein with a monopartite nuclear localization signal. Aging (Albany NY) 2023;15:8487-500.
32. Majchrzak A, Witkowska M, Smolewski P. Inhibition of the PI3K/Akt/mTOR signaling pathway in diffuse large B-cell lymphoma: current knowledge and clinical significance. Molecules 2014;19:14304-15.
33. Miao Y, Medeiros LJ, Xu-Monette ZY, et al. Dysregulation of Cell Survival in Diffuse Large B Cell Lymphoma: Mechanisms and Therapeutic Targets. Front Oncol 2019;9:107.

doi: 10.21037/jmai-24-282
Cite this article as: Tang S, Tsigelny IF, Kesari S, Kouznetsova VL. miRNA biomarkers-based diagnosis of diffuse large B-cell lymphoma using machine learning. J Med Artif Intell 2025;8:18.
