AI-driven discovery of EZH2 as a core target and potential inhibitors for breast cancer
AI-driven discovery of EZH2 as a core target and potential inhibitors for breast cancer
Original Article
AI-driven discovery of EZH2 as a core target and potential inhibitors for breast cancer
Zifu Wang1, Jundi Li2, Bingxi Lei3,4
1School of Computing, Asia Pacific University of Technology & Innovation, Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur, Malaysia;
2Beijing Institute of Remote Sensing Equipment, China Aerospace Science and Industry Corporation Limited, Beijing, China;
3Department of Neurosurgery, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China;
4Department of Public Health, International College, Krirk University, Bangkok, Thailand
Contributions: (I) Conception and design: Z Wang, B Lei; (II) Administrative support: B Lei; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: Z Wang, J Li; (V) Data analysis and interpretation: Z Wang, J Li; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.
Correspondence to: Bingxi Lei, PhD. Department of Neurosurgery, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, No. 107 Yanjiang West Road, Yuexiu District, Guangzhou 510123, China; Department of Public Health, International College, Krirk University, 3 Ram Inthra Rd, Anusawari, Bang Khen, 10220, Bangkok, Thailand. Email: leibingxi@mail.sysu.edu.cn; leibingxi@163.com.
Background: Breast cancer (BC) is a highly heterogeneous disease whose development involves multiple molecular mechanisms. Despite progress in the early diagnosis and treatment of BC, individualised clinical treatment remains a challenge. Bioinformatics, machine learning (ML), and computational medicinal chemistry have shown significant potential in identifying early diagnostic biomarkers for BC, discovering therapeutic targets, implementing precision treatment, and assessing prognosis, they have also provided effective tools for evaluating the feasibility of drug targets. This study aimed to integrate multi-cohort transcriptomic data with large-scale ML and computational drug discovery to (I) identify robust diagnostic biomarkers and core therapeutic targets for BC and (II) prioritize potential small-molecule inhibitors against the key target through systematic in silico screening and molecular docking validation.
Methods: This study obtained data from the GEO database. After quality control and standardisation, the gene expression matrices of GSE70947 and GSE248830 were combined into a training model. Then, differentially expressed genes (DEGs) screened through cross-gene hybridisation and the most strongly correlated module genes obtained through weighted gene co-expression network (WGCNA) analysis were used to obtain cross genes. Key genes were obtained through PPI analysis, and a diagnostic model was constructed using 127 combined ML algorithms. Hub genes were finally obtained through SHapley Additive exPlanations (SHAP) interpretability analysis and validation on two external datasets (GSE30670 and GSE70905). Based on the ChEMBL database, compounds were screened using the Lipinski five-rule method, PAINS filtering, and ML activity prediction. Molecular docking was used to assess binding affinity and interaction forces.
Results: The study identified 262 DEGs, and WGCNA analysis showed that the MEturquoise module was significantly related to the phenotype of BC. The ML model screened out 11 characteristic genes, among which enhancer of zeste homolog 2 (EZH2) showed stable diagnostic efficacy in the training set (area under the curve >0.84) and the independent verification set. Functional enrichment analysis reveals that characteristic genes are significantly involved in neurogenesis regulation, microRNA regulation and other pathways. Three high-affinity EZH2 inhibitors (CHEMBL5206842, CHEMBL5188577, and CHEMBL4644363) were selected from 731 initial compounds, and their binding free energy reached −10.6 to −10.1 kcal/mol, and stable binding is achieved through a hydrogen bond network and hydrophobic interaction.
Conclusions: By combining 127 ML algorithms with multi-omics integration and computational biology methods, EZH2 was established as a core diagnostic biomarker and therapeutic target for BC, and candidate inhibitors with high binding affinity were screened.
Keywords: Breast cancer (BC); enhancer of zeste homolog 2 (EZH2); machine learning algorithms (ML algorithms); drug discovery; targeted treatment
Received: 16 July 2025; Accepted: 19 January 2026; Published online: 27 March 2026.
doi: 10.21037/jmai-2025-167
Introduction
Background
Breast cancer (BC) is a complex disease composed of a variety of disease entities with different pathological characteristics, prognosis results and metastasis (1). In order to gain an in-depth understanding of the heterogeneity of BC, global transcription spectrum analysis revealed at least five different intrinsic molecular subtypes of BC, including human epidermal growth factor receptor 2 positive (HER2+), luminal A, luminal B, normal-like and basal-like BC (BLBC). Among them, the survival results of patients with the Luminal B subtype are poor, partly due to the lack of effective treatment methods (2). Therefore, developing an effective treatment for this subtype is one of the important directions of BC research at present (3). For this reason, researchers are committed to defining the regulatory mechanisms of different transcription networks in each subtype, especially how to carry out unique regulation through genetic and epigenetic mechanisms (4).
Trimethylation (H3K27me3) of methyltransferase enhancer of zeste homolog 2 (EZH2) on histone H3, as part of polycomb inhibition complex 2 (PRC2), is an important mechanism for gene silencing (5). In BC, the abnormal expression of EZH2 has been widely observed, and the increase in its expression is associated with poor prognosis and low survival rate. Nevertheless, the research community has not reached a consensus on the relationship between its carcinogenic function and only being a by-product of cell proliferation for the specific role of EZH2 in driving the process of BC (6). The context dependence of EZH2 function is also thought to be closely related to the origin of tumour cells and/or early transformation events. Therefore, it is particularly important to evaluate the role of EZH2 in different molecular subtypes of BC, especially in intrinsic BC subtypes of different developmental origins. Although the level of H3K27me3 in luminal B BC is significantly higher than that of BLBC or HER2+, the functional correlation between its increased global histone methylation status and tumour progression is not yet clear (7).
Rationale and knowledge gap
BC is characterized not only by its unique epidemiological pattern but also by the significant heterogeneity of the disease. As one of the leading causes of malignant tumour-related deaths in women around the world, the molecular subtype of BC is increasingly refined, which greatly enhances our understanding of BC and promotes the development of precise treatment (8). The mechanism of occurrence and progression of BC, involving many factors such as tumour stem cells, tumour microenvironment and circadian rhythm, has always been the core topic of scientific research. With the progress of technology, especially the application of artificial intelligence technology in the detection and diagnosis of BC, the diagnostic accuracy of BC has been significantly improved (9).
Objective
This study aims to systematically analyze the functional role of EZH2 in different BC subtypes by integrating 127 types of machine learning (ML), multi-group data and biocomputing methods, explore its potential as a diagnostic marker and potential treatment target, and provide a new theoretical basis and strategic direction for promoting the precise treatment of BC (10). We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-167/rc).
Methods
Data sources and preliminary analysis
Differentially expressed genes (DEGs) screening and data acquisition
This study incorporated two BC datasets (GSE70947, GSE248830, GSE30670, and GSE70905) from the GEO database (11). GSE70947 and GSE248830 served as the training set, GSE30670 and GSE70905 served as the external validation set. GSE70947 contains primary BC (n=148) with normal tissue adjacent to the cancer (n=148), which provides the basis for identifying molecules related to tumorigenesis, GSE248830 contains primary foci (n=11) with matched brain metastases (n=11), which provides a unique perspective for resolving metastatic mechanisms. GSE30670 contains BC (n=8) and normal tissue (n=4) samples. GSE70905 contains BC (n=148) and normal tissue (n=148) samples, providing an independent cohort for external validation of the study results. After normalization based on the integrated expression matrix, the “ComBat” algorithm was used to correct for batch effects in the source of the dataset (12), and the batch correction was validated by principal component analysis (PCA) (13). The corrected data were analyzed for differential expression using a linear model of limma (version 3.66.0) combined with an empirical Bayesian approach with the “sva” (version 3.58.0) package, setting |logfold change (FC)| >0.585 and FDR <0.05 as the significance thresholds to screen for DEGs (14). The differential expression results were visualized by heatmaps and volcano maps.
Gene co-expression network study
To construct the WGCNA analysis of DEGs, we used the WGCNA (version 1.73.0) package of R. The objective was to identify gene modules significantly linked to BC. Quality control and sample clustering were performed on the gene expression data to exclude abnormal samples. The soft threshold function was used to determine the optimal power value, construct the neighbor-joining matrix between genes, and further calculate the topological overlap matrix (TOM) (15). Hierarchical clustering was performed based on the dissimilarity of the TOM matrix, and gene co-expression modules were identified by a dynamic shear tree algorithm (deepSplit =2, minModuleSize =50). Modules with highly correlated feature genes were merged to obtain the final, independent modules. Key modules with significant phenotypic associations were screened by calculating the correlation between module feature genes and clinical traits, and the genes that intersect between the significant coexpressed genes identified through WGCNA and DEG were analyzed for protein-protein interactions (PPI) analysis (16).
Construction and analysis of PPI networks
In this study, we constructed a protein interaction network (confidence threshold ≥400) based on STRINGdb (version 2.22.0) and systematically calculated topological metrics such as degree centrality, median centrality, proximity centrality, eigenvector centrality, and PageRank value by the igraph package (version 2.2.1) (17). A weighted integration strategy was used to obtain a comprehensive score of nodes and screen key target genes. The Louvain algorithm is further applied to identify the network functional modules and visualized by stress layout. The final established core genes will be used for subsequent ML analysis (18).
Model building and SHapley Additive exPlanations (SHAP)
In this study, an integrated ML framework was used to systematically analyze BC data based on the GSE70947 and GSE248830 datasets. 127 model combinations were constructed by 12 algorithms [including LASSO, Ridge, Stepglm, support vector machine (SVM), RF, XGBoost, etc.], and each combination used a cross-validation framework, with the prelude algorithm being responsible for feature screening and the subsequent algorithms constructing the predictive model (19). After the training data are standardized, various algorithms such as logistic regression, SVM, and gradient boosting machine are applied for variable selection and model training (20).
The optimal model identifies the key feature genes by the SHAP analysis method and calculates the feature importance by using the random forest (RF) algorithm (21). The model performance was systematically evaluated by area under the curve (AUC) heatmap and stability analysis plot, and various visualization methods, such as bar graphs, bee plots, and dependency plots, were used to deeply resolve the contribution of key genes to the model prediction and their relationship with BC prognosis. Key genes were subsequently incorporated into the enrichment function analysis.
Enrichment analysis followed by hub gene identification and validation
To systematically analyze the biological functions of key genes, this study used the R package “clusterProfiler” (version 4.18.1) to perform a comprehensive functional annotation analysis based on the key genes obtained from the SHAP algorithm (22), including gene ontology (GO) functional enrichment analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis, and disease ontology (DO) analysis (23). In the model validation phase, high-quality models with AUC values greater than 0.8 in the training set were screened for subsequent validation by systematically evaluating the ROC curves of 127 algorithm combinations. These preferred models were rigorously validated in an independent external dataset, GSE70905 and GSE30670, with consistency index (C-index) and average AUC as the core evaluation metrics, and subsequently the diagnostic models with the best generalization performances were established, and accordingly the most predictive hub genes were screened (24). Based on these hub genes, compound-related analyses were further conducted to explore potential therapeutic agents.
Compound screening
Preliminary screening of candidate compounds
To identify potentially active compounds targeting hub genes, this study systematically searched the information of small molecule ligands associated with the UniProt ID corresponding to each hub gene based on the ChEMBL database (https://www.ebi.ac.uk/chembl/) (25,26). In the compound screening process, we strictly followed the drug similarity evaluation criteria and adopted a multi-stage screening strategy: the candidate compounds were initially screened by Lipinski’s rule of five (27), the specific parameters of which included molecular weight (MW ≤500 Da), the number of hydrogen-bonding donors (HBD ≤5), the number of hydrogen-bonding acceptors (HBA ≤10), and the lipid-water partition coefficient (LogP ≤5), followed by the application of a PAINS (pan-assay interference compounds) filter, which was then applied to exclude molecular structures that could generate false positive signals (28). This comprehensive screening programme ensures that the obtained compounds have both good drug-like properties and specific binding potential and lays a reliable molecular foundation for the subsequent accurate prediction of active compounds based on ML (29).
Compound optimization and protein structure screening and modeling
Based on the candidate compounds obtained from the previous ChEMBL, their pharmacokinetic profiles were further systematically evaluated in the present study. The absorption, distribution, metabolism, and excretion properties of the compounds were comprehensively examined using ADME analysis. To construct the feature matrix required for the ML model (30), we integrated the molecular fingerprint information with the activity data: the MACCS key fingerprints and Morgan fingerprints (with radius 2/3) were calculated separately, and together with the compound activity index pIC50 values, we formed the feature vectors (31,32).
In the modelling method, we used four ML algorithms, namely, RF, SVM, artificial neural network and convolutional neural network, to establish the model for predicting compound-target interactions. Based on the model prediction results, high-affinity candidate compounds with pIC50 ≥9.0 were screened and obtained for subsequent molecular docking validation (33).
Molecular docking and visualization
The AutoDock Vina-based Smina tool was used for molecular docking in this study (34). Human crystal structures corresponding to UniProt IDs of hub genes were obtained from the RCSB PDB database (https://www.rcsb.org/) (35), and the screening criteria were Homo sapiens origin, Molecular weight >100 Da, and monomer conformation with resolution ≤3.0 Å (36). The protein structure was prepared by using OpenBabel. The protein structures were preprocessed using OpenBabel to remove water molecules, heteroatoms and irrelevant ligands, and protonation and charge optimisation were completed at physiological pH 7.4 to generate the final receptor PDBQT files (37). The ligands were converted to PDBQT format from their SMILES expressions via the MMFF94s force field for 3D conformational generation, hydrogenation and charge assignment, and were also converted to PDBQT format (38). The docking binding pocket is defined by an outward expansion of 4.0 Å to the search space, using the geometrical centre of the protocrystalline ligand as its origin. A systematic sampling strategy with an exhaustive degree of 16 was used to evaluate the ligand conformation and output the nine poses with optimal binding affinity. To verify the prediction reliability, redocking calculations were performed on the top-ranked conformations, and the RMSD value below 2.0 Å was used as the conformational consistency threshold (39). Ultimately, high-affinity and validated compounds were screened for in-depth protein-ligand interaction analysis.
Data analysis and tools
The data analysis flowchart of the study is shown in Figure 1. The raw data were obtained from the GEO database, including four BC cohorts, GSE70947, GSE248830, GSE30670 and GSE70905, which were used for model training, feature screening and external independent validation, respectively. The protein 3D structure data were obtained from the RCSB PDB database, and the crystal structure of Homo sapiens with a ligand mass of ≥100 Da and a resolution of ≤3.0 Å (PDB ID: 4MIO) was selected for molecular docking simulation by combining with the UniProt ID.
Figure 1 Flowchart of the study. ADME, absorption, distribution, metabolism, and excretion; DEGs, differentially expressed genes; EZH2, enhancer of zeste homolog 2; GEO, Gene Expression Omnibus; ML, machine learning; PDB, protein data bank; PPI, protein-protein interaction; SHAP, SHapley Additive exPlanations; WGCNA, weighted gene co-expression network analysis.
All the data processing and modelling were done in Perl (version 5.34.1), R (version 4.4.1), and Python (version 3.8.15). Differential gene screening was based on limma (3.66.0), batch effect correction was performed using the ComBat method of the sva package, and co-expression network construction was implemented by WGCNA (1.73.0). STRINGdb (2.22.0) and igraph (2.2.1) were used to construct and parse protein interaction networks. The ML framework integrates 12 algorithms to identify key genes using the SHAP method, and the functional enrichment analysis is done by clusterProfiler (4.18.1). Compound screening was performed using the ChEMBL database, and molecular docking was performed by Smina (version 2020.12.10) and OpenBabel (version 0.2.2). All scripts were rigorously tested to ensure the reproducibility and scientific reliability of the analytical process. All figures (Figures 1-7) are presented in the main text, and supporting tables (Tables S1-S4) are provided as supplementary files.
Figure 2 Data preprocessing and DEGs screening. (A) Before batch calibration. (B) After batch calibration. (C) Volcano map, heat map, and expression levels of top DEGs. DEGs, differentially expressed genes; PCA, principal component analysis.
Figure 3 WGCNA analysis. (A) Analysis of the scale-free index for various soft-threshold powers (β). (B) Analysis of the mean connectivity for various soft-threshold powers. (C) WGCNA clustering tree showing co-expressed gene modules, branches represent gene clusters, and color bars identify modules defined by dynamic shearing. (D) Heatmap of correlation between module characterization genes and expression levels. (E) Module trait relationship and module features of BC. (F) Intersection gene between module gene and DEGs in WGCNA analysis. BC, breast cancer; DEGs, differentially expressed genes; WGCNA, weighted gene co-expression network analysis.
Figure 4 PPI network and identification of characteristic genes using 127 machine-learning models and SHAP analysis. (A) The top 20 pivotal genes in the PPI network. (B) 127 ML algorithm combinations evaluated via 10-fold cross-validation. (C) SHAP Analysis of the Best Model (Beeswarm Plot for Feature Contributions Across Samples, Waterfall and dependence graphs of feature contributions for a single instance). ML, machine learning; PPI, protein-protein interaction; SHAP, SHapley Additive exPlanations.
Figure 5 Enrichment analysis and validate the hub gene. (A) GO analysis. (B) KEGG analysis. (C) DO analysis. (D) Model prediction and screening of hub genes. AUC, area under the curve; BP, biological process; CC, cellular component; DO, disease ontology; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MF, molecular function; ROC, receiver operating characteristic.
Figure 6 Molecular screening and highly active EZH2 inhibitors. (A) The range of pIC50 values spreads widely within the database of compounds. (B) Compounds that pass the Ro5 are basic requirements of drug oral bioavailability. (C) Compounds that are missed from the selected criteria. (D) The occurrence of substructures in the compound dataset. (E) Lead compound structures (upper) and their flagged PAINS substructures (lower) against the EZH2 target background. (F) ML model performance for active compound prediction: ROC curves (left), loss function curves across batch sizes 16, 32, and 64 (middle), and true versus predicted pIC50 scatter plot (right). (G) The range of highly active compounds (pIC50 ≥9.0) in the dataset. (H) The structure of the target protein 4MIO that binds to the EZH2 core gene. AUC, area under the curve; EZH2, enhancer of zeste homolog 2; PAINS, pan-assay interference compounds; pIC50, negative logarithm of half maximal inhibitory concentration; ROC, receiver operating characteristic; SVM, support vector machine.
Figure 7 Molecular docking and ligand-protein interactions of candidate compounds with EZH2. (A) Plot of CHEMBL5206842 docked with EZH2. (B) Plot of CHEMBL5188577 docked with EZH2. (C) Plot of CHEMBL4644363 docked with EZH2. EZH2, enhancer of zeste homolog 2.
Ethics
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study uses only publicly available de-identified data and therefore does not require institutional ethical approval.
Results
Data sources and preliminary analysis
Screening for DEGs
In this study, two datasets (GSE70947 & GSE248830) were selected as training set. Preliminary PCA showed that there was a significant batch effect between the different datasets (Figure 2A). Batch correction was performed using the ComBat algorithm to integrate the data. The post-correction PCA results indicated that the sample aggregation status was significantly improved and the inter-batch differences were effectively eliminated (Figure 2B). Subsequently, differential expression analysis was carried out based on the linear model of the limma package combined with empirical Bayesian correction, and the screening threshold was set at a false discovery rate of <0.05 and |log2FC| >0.585. A total of 262 DEGs were identified, of which 157 were up-regulated genes and 105 were down-regulated genes (Figure 2C).
WGCNA analysis
The core of WGCNA is to identify gene modules with synergistic expression characteristics and further resolve potential associations between these modules and phenotypes. The algorithm does not only focus on the level of gene expression but also portrays the similarity of expression dynamics among genes at a systemic level. Gene co-expression networks usually exhibit scale-free properties, so the analysis process is based on the Pearson correlation coefficient, and the neighbour-joining matrix is constructed by soft-threshold weighting, based on which the TOM is computed so as to obtain a more robust gene association structure (Figure 3A-3C).
Based on the expression matrix of the training set, we constructed the co-expression network, and after forming the complete neighbour relationship and network topology, we separated the key modules based on the average connection hierarchical clustering tree and dynamic shear algorithm (Figure 3D). The analysis results indicated that the MEturquoise module had the most significant correlation with BC, which contained 420 genes (Figure 3E). After cross-validation screening the genes of this module with the set of DEGs, 175 candidate genes were obtained, which provided a reliable feature base for subsequent ML modelling (Figure 3F).
Based on the integration and SHAP screening of PPI network analysis and ML, 11 characteristic genes of BC were obtained
The network structure was systematically quantified using the igraph package, and a composite score containing five categories of topological attributes was calculated, from which an interactive gene association network was constructed to identify key nodes with central regulatory roles in the network. The 20 most representative core genes were selected based on the composite score, and their expression profiles were further combined with a ML algorithm for feature integration, which was used to identify key genes closely related to BC (Figure 4A).
A total of 127 prediction models were constructed in this study, and the training and test sets were comprehensively evaluated using a rigors cross-validation framework. Comparison of model performance showed that the integrated strategy of “LASSO + RF” obtained the highest average AUC value (0.977) in both data sets, demonstrating optimal classification discrimination (Figure 4B,4C) (available online: https://cdn.amegroups.cn/static/public/jmai-2025-167-1.xlsx). Based on this result, the combined model is considered the most robust prediction method.
After identifying the optimal classifiers, SHAP global feature interpretation analysis was performed on the model, which yielded a set of 11 characteristic genes that contributed most to BC prediction: EZH2, MKI67, RAD51, PTPN11, DNMT1, SPP1, FGF13, PIK3R1, CXCL12, KIT, and CDH1 (Figure 4D) (Table S1).
Enrichment analysis and diagnostic biomarkers
To explore the potential biological functions of the 11 characterised genes, systematic enrichment analysis of GO, KEGG and DO was performed. (I) The results showed that in the BP entries of GO, the genes were significantly associated with regulation of neurogenesis, regulation of nervous system development, regulation of neurone projection development and axonogenesis, suggesting that these genes have prominent roles in the regulation of neurogenesis and nervous system development. axonogenesis, suggesting that these genes have prominent roles in the regulation of neurogenesis and nervous system development. At the CC level, the enrichment results were concentrated in the chromosomal region, at the MF level, a significant signal related to protein tyrosine kinase binding was observed (Figure 5A). (II) KEGG pathway analysis further showed that the contributions genes (e.g., EZH2) were mainly involved in key cancer-related biological pathways such as microRNAs in cancer and lysine degradation, suggesting their multilevel functions in epigenetic regulation and tumorigenesis (Figure 5B). (III) DO enrichment analysis revealed that the disease associations of EZH2 were mainly concentrated in BC, urinary system cancer and other tumour categories (Figure 5C), which provided an evidence base for elucidating its potential driving role in multi-system tumours.
To assess the diagnostic efficacy of the 11 characterised genes, ROC curve analyses were performed in this study for the training sets and the test cohort, respectively. The results showed that the AUC values of EZH2, MKI67 and RAD51 in the training sets exceeded 0.84 (Figure 5D) (Table S2), indicating that all three had excellent ability to distinguish BC from control samples. Based on these results, EZH2, MKI67 and RAD51 were defined as core BC genes and used as potential diagnostic biomarkers. The diagnostic accuracy, relevance and expression patterns were further verified in the GSE30670 and GSE70905 external validation sets, and the results showed that despite the heterogeneity of the sample sources, the EZH2 genes maintained a stable high diagnostic efficacy with a significant up-regulation trend (P<0.001). The comprehensive analysis identified EZH2 as the final core gene (Tables S3,S4).
Compound and protein structure screening and modeling analysis
To identify potential EZH2 inhibitors, 731 candidate compounds and their corresponding IC50 values were collected in this study based on the UniProt ID (Q15910) of EZH2 in the ChEMBL database (Figure 6A). Subsequently, based on oral bioavailability and pharmacokinetic properties, the above compounds were evaluated using Lipinski’s rule of five (Ro5), of which 224 compounds were excluded because they did not meet the Ro5 criteria, and 507 candidate molecules were finally retained.
The analysis indicated that the average molecular weight of the compounds that did not pass the Ro5 screening was about 650.62 Da, while the average molecular weight of the compounds that met the Ro5 criteria was 486.78 Da. In addition, there were significant differences between the two classes of compounds in terms of the number of hydrogen-bonded acceptors (n_hba), the number of hydrogen-bonded donors (n_hbd), and the lipid-water partition coefficient (logP) (Figure 6B,6C).
To further enhance the drug-like properties of the compounds, the PAINS filtering method was employed in this study to exclude 11 molecules with potentially interfering structures and obtain 496 ideal structural candidates. Subsequently, 68 compounds containing problematic structures were further excluded by comparing 104 substructures reported in the literature that are prone to cause nonspecific binding, and 429 high-confidence candidate molecules were finally retained.
Statistical analysis of the substructures contained in the dataset (Figure 6D) showed that the substructures with high frequency of occurrence included the Michael receptor (18 times), aliphatic long chain (15 times), isolated olefin (12 times) and 2-pyridine derivatives (9 times).
The candidate compounds were categorized into active (228) and inactive (201) classes using pIC50 ≥7.2 as the threshold for activity classification. Representative lead compound structures, including CHEMBL5178022, CHEMBL4639983, and CHEMBL4635809, together with their associated PAINS substructures (two Michael-acceptor motifs and one aliphatic long-chain motif), are presented against the EZH2 target background (Figure 6E). To enhance the efficiency of active compound screening, multiple ML models were used for prediction. Further analysis shows that the RF and SVM models performed best in the identification of active compounds, both achieving an AUC value of 0.83, which is better than ANN (AUC =0.79) and CNN (AUC =0.76); the loss function curves at batch sizes of 16, 32, and 64 further confirmed the convergence of the models, and the scatter plot of the actual versus predicted pIC50 values demonstrated reliable predictive accuracy (Figure 6F).
On the basis of ML prediction, 20 highly active candidate compounds were further screened using a more stringent activity criterion (pIC50 ≥9.0) (Figure 6G). Given their potential potent inhibitory ability against EZH2, these compounds were selected for subsequent molecular docking and experimental validation. Figure 6H demonstrated the structure of the EZH2 target protein (PDB ID: 4MIO) as the basis for subsequent molecular docking.
Molecular docking and protein-ligand interaction
To better evaluate the binding potential of the candidate compounds, molecular docking of 20 candidate compounds with the EZH2 target protein (PDB: 4MIO) was performed. The results showed that compounds CHEMBL5206842, CHEMBL5188577 and CHEMBL4644363 exhibited the highest theoretical binding affinity for EZH2 with binding free energies (ΔG) of −10.6, −10.5 and −10.1 kcal·mol−1, respectively (available online: https://cdn.amegroups.cn/static/public/jmai-2025-167-2.xlsx).
The compound achieves stable binding to the EZH2 active site through multiple non-covalent interactions. The hydrophobic interactions involve VAL73, PHE75 and TRP274 residues, which form an extensive network of hydrophobic contacts. Hydrogen bonding analysis revealed that the compound acts as a hydrogen bonding acceptor, forming four key hydrogen bonds with the main chain amide nitrogen atoms: interacting with the GLY11 (2.52 Å), ALA12 (2.56 Å), LYS97 (2.47 Å) side chain and TYR282 (3.22 Å) side chain hydroxyl groups, respectively.
This ligand binds to EZH2 via a hydrogen bonding network. It acts as a hydrogen bond acceptor to form five hydrogen bonds with the main chain amide nitrogen atoms of GLY11, ALA12, MET13, GLY14, and PHE75, with bond lengths ranging from 2.01 to 3.15 Å. At the same time, the ligand acts as a hydrogen bond donor to form a sixth hydrogen bond with the TRP274 side chain (3.02 Å). The hydrophobic interaction involves seven residues, ALA12, MET13, ASP37, VAL73, GLU96, LYS97, and TRP274, which together form a tight hydrophobic binding pocket.
The binding of this compound to EZH2 is dominated by hydrophobic interactions involving eight hydrophobic contacts, including GLU96, LYS97, ALA161, TYR164 and TRP274. Among them, the TRP274 residue forms four independent hydrophobic interactions with the ligand. Hydrogen bonding analysis indicated that the ligand acts as a hydrogen bonding acceptor, forming only one hydrogen bond (3.07 Å) with the ASN157 side chain.
Discussion
In this study, we systematically identified the core diagnostic marker and therapeutic target of BC, EZH2, by integrating bioinformatics (40), ML and computational medicinal chemistry methods, and established a comprehensive computational framework from disease target discovery to therapeutic drug design (41), which provides new ideas and candidate drug molecules for the precise diagnosis and treatment of BC (40).
Key findings
In this study, 11 characteristic genes were identified by SHAP analysis, among which EZH2, MKI67 and RAD51 showed the best diagnostic efficacy. Based on the results of comprehensive analysis, EZH2 was selected as the hub gene in this study, which has sufficient scientific basis: firstly, EZH2 showed stable and high diagnostic accuracy in both the training set and the two independent validation sets, which proved its robustness and reproducibility as a diagnostic marker.
Functional enrichment analysis revealed the multiple biological functions of the characterised genes, and the GO analysis showed that the genes were significantly enriched in the nervous system-related pathways, which was consistent with the results of recent studies on tumour neurobiology, suggesting that innervation in the tumour microenvironment plays an important role in cancer progression. KEGG pathway analysis showed that the characterised genes were mainly involved in microRNA regulation and lysine degradation and other epigenetic pathways, which was consistent with the mechanism of action of EZH2 and other epigenetic regulators. DO analysis also revealed that these genes may be pan-cancer therapeutic targets with broader clinical application.
In this study, we systematically screened the small molecule inhibitor channel for EZH2 targets, and 429 high-confidence candidate molecules were obtained from 731 initial compounds obtained from the ChEMBL database by multistage screening. The reliability of the virtual screening was verified by the activity prediction model constructed based on ML (RF and SVM ≥0.83). Molecular docking studies identified CHEMBL5206842, CHEMBL5188577 & CHEMBL4644363 as the compounds with the highest binding affinity (ΔG −10.6 to −10.1 kcal/mol).
Strengths and limitations
Strengths of research methodology
The integrated analysis strategy adopted in this study has multiple advantages. First, the integration of multiple datasets and batch effect correction improved the reliability and generalisability of the analysis results and avoided the possible selection bias of a single cohort. By integrating two independent datasets, GSE70947 and GSE248830, and eliminating the batch effect using the ComBat algorithm, the data quality and reliability of the subsequent analysis were ensured (42).
Secondly, the combination of WGCNA and differential expression analysis retained the expression change information at the single-gene level and incorporated the system biology features of gene networks, which could reveal the molecular mechanism of BC more comprehensively. The 175 candidate genes identified in this study not only showed significant differences in expression levels but also exhibited a high degree of synergy in the gene co-expression network, which significantly improved the biological reliability of the candidate genes (43).
Third, systematic comparison of 127 model combinations ML and SHAP interpretability analysis not only improved the prediction accuracy but also enhanced the clinical acceptability of the models (44). The integrated LASSO + RF strategy obtained the highest average AUC value (0.977) in both training and test sets and demonstrated excellent discriminative ability, while LASSO regression achieves feature selection and dimensionality reduction through L1 regularisation, and RF reduces the risk of overfitting through integrated learning. The combination of the two forms a complementary prediction framework, providing methodological support for the practical application of precision medicine.
Fourth, the combination of bioinformatics target discovery and computational medicinal chemistry screening realises the complete research chain from disease mechanism research to drug development, which reflects the research concept of translational medicine. In this study, we successfully identified EZH2 as a key diagnostic marker and obtained three high-affinity candidate inhibitors based on a multilevel screening strategy.
Fifth, the screening system covered Lipinski’s rule of five evaluation, PAINS filtering, ML activity prediction, molecular docking validation, and protein-ligand interaction force analysis, and three lead molecules with good drug-like properties and high binding activities were gradually optimised from 731 initial compounds, which provided reliable candidates for subsequent experimental studies and drug development.
Research limitations
The retrospective analysis based on the public database lacks prospective clinical cohort verification and large-scale independent data set support, and the clinical application value of the model needs to be further confirmed, the study did not conduct a stratified analysis of different molecular subtypes, and the applicability of markers in specific subtypes needs to be verified, transcriptome data only reflects the mRNA table. To reach the level, the correlation between EZH2 protein expression and clinical pathological characteristics needs to be confirmed by immunohistochemistry, Western blot or mass spectrometry (45). There are differences between molecular docking results and actual biological activity, which need to be systematically verified in combination with molecular dynamics simulation and in vitro enzyme activity, cell experiments and animal models (46). The drug resistance mechanism of EZH2 inhibitors has not been explored, and drug resistance is a key constraint for the clinical application of targeted treatment.
Comparison with similar research
This study differs from previous EZH2-associated BC studies by integrating a diverse dataset and a more comprehensive analysis process. Earlier studies focused on specific subtypes or limited experimental cohorts, such as Hirukawa et al. (47) and Fontes-Sousa et al. (7), which were relatively restricted in their scope. In contrast, the present study incorporates multiple GEO datasets to cover a wider range of molecular heterogeneity, thereby significantly improving the representativeness and statistical efficacy of the results. Unlike previous ML studies, which mostly used a single or a small number of algorithms, this study systematically evaluated 127 model combinations and combined them with SHAP interpretability analysis to improve the reliability and transparency of predictive models. In terms of computational drug discovery, while previous work has often screened based on smaller libraries of compounds, this study comprehensively evaluated 731 candidate compounds through a multi-stage screening process combining Lipinski’s law, PAINS filtering, ML activity prediction, and optimized molecular docking validation. The integration of multi-omics data, the introduction of PPI network topology metrics, and the implementation of stringent batch effect correction make the analytical results more robust and reproducible than previous methods. This study establishes a more complete, systematic and highly translatable analytical framework than existing studies.
Explanations of findings
The biological significance of core markers
EZH2, as the core catalytic subunit of multi-comb repressor complex 2 (MCRP2), plays a key role in the development of BC through the epigenetic mechanism of gene transcription. EZH2 can silence tumor suppressor genes such as BRCA1, PTEN, CDH1, etc., promote tumour cell proliferation and invasion, and play an important role in maintaining BC stem cell properties, promoting epithelial-mesenchymal transition, and mediating endocrine therapy resistance.
The Ki-67 protein encoded by MKI67 is a classical cell proliferation marker, which has been widely used in clinical practice to assess tumour malignancy and guide the choice of treatment, and the overexpression of RAD51, as a core recombinase of the DNA homologous recombination repair pathway, correlates with the resistance to platinum drugs and the efficacy of PARP inhibitors, which provides a new therapeutic idea of the synthetic lethality strategy.
Feasibility of EZH2 as a drug target
Secondly, EZH2 plays a central driving role in the pathogenesis of BC and has clear biological functions. Thirdly, EZH2 has good druggability, and several EZH2 inhibitors have entered the clinical research stage, such as Tazemetostat, AZD9496 and palbociclib, which has been approved by the FDA for the treatment of epithelioid sarcoma and follicular lymphoma, which provides an important reference for the application of EZH2 inhibitors in the treatment of BC (48).
Screening and evaluation of candidate compounds
The analysis of the interaction patterns indicated that the three compounds achieved stable binding to the active site of EZH2 by forming multiple hydrogen bonds with hydrophobic contacts, constructing a complex hydrogen bonding network, and dominating hydrophobic interactions, respectively, which reflected the structural diversity of the ligands and the fitness characteristics of the target cavities (49).
Implications and actions needed
Future research direction
Based on the findings and existing limitations of this study, future work should focus on the following directions: systematically validate the diagnostic efficacy, prognostic value, and correlation with clinicopathological features of biomarkers such as EZH2 in large-scale, multicentre prospective cohorts, and clarify their expression patterns and clinical significance in different molecular subtypes, elucidate the functional mechanisms of EZH2 in BC occurrence, progression, and metastasis using cell lines and patient-derived xenograft models, systematically evaluate the target selectivity, pharmacokinetic properties, and in vivo efficacy of candidate compounds, focusing on optimising key drug properties such as solubility, metabolic stability, hERG inhibition risk, CYP450 metabolism predictions, and blood-brain barrier permeability (50). Explore combination strategies of EZH2 inhibitors with chemotherapy, endocrine therapy, targeted drugs, and immunotherapy, with particular attention to their synergistic effects with CDK4/6 or PARP inhibitors (51), and conduct in-depth research on EZH2 inhibitor resistance mechanisms, including target mutations, bypass signalling activation, and epigenetic remodelling, and carry out structural optimisation and new drug design for known resistance mutations such as Y641 and A677 (52).
Conclusions
In this study, EZH2 was established as a core diagnostic marker and therapeutic target for BC through differential expression analysis, WGCNA co-expression network construction, and 127 ML model screenings, and three candidate inhibitors with high binding affinity were obtained based on multi-level virtual screening, which provided the theoretical basis and the foundation of the lead compounds for the development of precision diagnosis and targeted drugs for BC.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study uses only publicly available de-identified data and therefore does not require institutional ethical approval.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
Xiong X, Zheng LW, Ding Y, et al. Breast cancer: pathogenesis and treatments. Signal Transduct Target Ther 2025;10:49. [Crossref] [PubMed]
Testa U, Castelli G, Pelosi E. Breast Cancer: A Molecularly Heterogenous Disease Needing Subtype-Specific Treatments. Med Sci (Basel) 2020;8:18. [Crossref] [PubMed]
Zhu Z, Jiang L, Ding X. Advancing Breast Cancer Heterogeneity Analysis: Insights from Genomics, Transcriptomics and Proteomics at Bulk and Single-Cell Levels. Cancers (Basel) 2023;15:4164. [Crossref] [PubMed]
Hirukawa A, Smith HW, Zuo D, et al. Targeting EZH2 reactivates a breast cancer subtype-specific anti-metastatic transcriptional program. Nat Commun 2018;9:2547. [Crossref] [PubMed]
Liu Y, Yang Q. The roles of EZH2 in cancer and its inhibitors. Med Oncol 2023;40:167. [Crossref] [PubMed]
Zimmerman SM, Lin PN, Souroullas GP. Non-canonical functions of EZH2 in cancer. Front Oncol 2023;13:1233953. [Crossref] [PubMed]
Fontes-Sousa M, Lobo J, Lobo S, et al. Digital imaging-assisted quantification of H3K27me3 immunoexpression in luminal A/B-like, HER2-negative, invasive breast cancer predicts patient survival and risk of recurrence. Mol Med 2020;26:22. [Crossref] [PubMed]
Fumagalli C, Barberis M. Breast Cancer Heterogeneity. Diagnostics (Basel) 2021;11:1555. [Crossref] [PubMed]
Ahn JS, Shin S, Yang SA, et al. Artificial Intelligence in Breast Cancer Diagnosis and Personalized Medicine. J Breast Cancer 2023;26:405-35. [Crossref] [PubMed]
Rai SN, Das S, Pan J, et al. Multigroup prediction in lung cancer patients and comparative controls using signature of volatile organic compounds in breath samples. PLoS One 2022;17:e0277431. [Crossref] [PubMed]
Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 2005;33:D562-6. [Crossref] [PubMed]
Zhang S, Shao J, Yu D, et al. MatchMixeR: a cross-platform normalization method for gene expression data integration. Bioinformatics 2020;36:2486-91. [Crossref] [PubMed]
Kuligowski J, Pérez-Guaita D, Lliso I, et al. Detection of batch effects in liquid chromatography-mass spectrometry metabolomic data using guided principal component analysis. Talanta 2014;130:442-8. [Crossref] [PubMed]
Zhang S, Wu Z, Xie J, et al. DNA methylation exploration for ARDS: a multi-omics and multi-microarray interrelated analysis. J Transl Med 2019;17:345. [Crossref] [PubMed]
Botía JA, Vandrovcova J, Forabosco P, et al. An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks. BMC Syst Biol 2017;11:47. [Crossref] [PubMed]
Athanasios A, Charalampos V, Vasileios T, et al. Protein-Protein Interaction (PPI) Network: Recent Advances in Drug Discovery. Curr Drug Metab 2017;18:5-10. [Crossref] [PubMed]
Bader JS, Chaudhuri A, Rothberg JM, et al. Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol 2004;22:78-85. [Crossref] [PubMed]
Nave O, Elbaz M. Artificial immune system features added to breast cancer clinical data for machine learning (ML) applications. Biosystems 2021;202:104341. [Crossref] [PubMed]
Huang JC, Tsai YC, Wu PY, et al. Predictive modeling of blood pressure during hemodialysis: a comparison of linear model, random forest, support vector regression, XGBoost, LASSO regression and ensemble method. Comput Methods Programs Biomed 2020;195:105536. [Crossref] [PubMed]
Musa AB. Comparative study on classification performance between support vector machine and logistic regression. Int J Mach Learn Cyber 2013;4:13-24.
Ponce-Bobadilla AV, Schmitt V, Maier CS, et al. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin Transl Sci 2024;17:e70056. [Crossref] [PubMed]
Piibor J. Proteome profile of bovine uterine extracellular vesicles during the oestrous cycle and in endometritis. Estonian University of Life Sciences; 2024.
Sarhan EM. Forskolin as a multi-target modulator of the oncogenic PI3K/Akt signaling pathway: An in silico integrative GO/KEGG enrichment analysis and network pharmacology. bioRxiv. 2025:2025.05. 11.653344.
Lu R, Shen J, Jiang A, et al. Towards artificial intelligence-assisted digital pathology: A systematic evaluation of multimodal generative artificial intelligence in clear cell renal cell carcinoma assessment. Interdisciplinary Medicine 2025;3:e70047.
UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res 2025;53:D609-17. [Crossref] [PubMed]
Hunter FMI, Ioannidis H, Bento AP, et al. Drug and Clinical Candidate Drug Data in ChEMBL. J Med Chem 2025;68:19800-27. [Crossref] [PubMed]
Chen X, Li H, Tian L, et al. Analysis of the Physicochemical Properties of Acaricides Based on Lipinski’s Rule of Five. J Comput Biol 2020;27:1397-406. [Crossref] [PubMed]
Bolz SN, Adasme MF, Schroeder M. Toward an Understanding of Pan-Assay Interference Compounds and Promiscuity: A Structural Perspective on Binding Modes. J Chem Inf Model 2021;61:2248-62. [Crossref] [PubMed]
Sadybekov AV, Katritch V. Computational approaches streamlining drug discovery. Nature 2023;616:673-85. [Crossref] [PubMed]
Laskar YB, Mazumder PB, Talukdar AD. Hibiscus sabdariffa anthocyanins are potential modulators of estrogen receptor alpha activity with favourable toxicology: a computational analysis using molecular docking, ADME/Tox prediction, 2D/3D QSAR and molecular dynamics simulation. J Biomol Struct Dyn 2023;41:611-33. [Crossref] [PubMed]
Campbell JW, Vogiatzis KD. Augmenting MACCS Keys with Persistent Homology Fingerprints for Protein-Ligand Binding Classification. J Chem Inf Model 2025;65:8113-26. [Crossref] [PubMed]
Ishfaq M, Halawa MI, Ahmad A, et al. Generation of Chemical Space of Compounds for Prostate Cancer Treatment: Biological Activity Prediction, Clustering, and Visualization of Chemical Space. ACS Omega 2023;8:39408-19. [Crossref] [PubMed]
Alcázar JJ, Sánchez I, Merino C, et al. A Simple Machine Learning-Based Quantitative Structure-Activity Relationship Model for Predicting pIC(50) Inhibition Values of FLT3 Tyrosine Kinase. Pharmaceuticals (Basel) 2025;18:96. [Crossref] [PubMed]
Masters L, Eagon S, Heying M. Evaluation of consensus scoring methods for AutoDock Vina, smina and idock. J Mol Graph Model 2020;96:107532. [Crossref] [PubMed]
Burley SK, Bhikadiya C, Bi C, et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res 2023;51:D488-508. [Crossref] [PubMed]
Burendei B, Kaplan JP, Orellana GM, et al. Structure-guided discovery of Otopetrin 1 inhibitors reveals druggable binding sites at the intrasubunit interface. Nat Commun 2025;16:9362. [Crossref] [PubMed]
IftikharH.Redesigning Fluoroquinolones: A Hybrid Strategy to Reduce Cardiotoxicity and Enable Neuroprotective Repurposing.ChemRxiv 2025. DOI: .
Wang J, Luo H, Qin R, et al. 3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model. Chem Sci 2025;16:637-48. [Crossref] [PubMed]
Cao H, Li X, Wan X, et al. Quantifying binding stability by clustering conformations to enhance binding prediction accuracy. Phys Chem Chem Phys 2025;27:21683-94. [Crossref] [PubMed]
Yu F, Li L, Zhang M, et al. Phosphorylation of EZH2 differs HER2-positive breast cancer invasiveness in a site-specific manner. BMC Cancer 2023;23:948. [Crossref] [PubMed]
Zhao Z, Chen X, Pang H, et al. Safety profile of EZH2 inhibitors for cancer: a systematic review and meta-analysis. PeerJ 2025;13:e18871. [Crossref] [PubMed]
Wang J, Pan C, Zhang Q. Double weighted combat data quality evaluation method based on CVF optimized FAHP. Sci Rep 2025;15:2516. [Crossref] [PubMed]
Figueroa-Martínez J, Saz-Navarro DM, López-Fernández A, et al. Computational ensemble gene co-expression networks for the analysis of cancer biomarkers. Informatics 2024;11:14.
Parisineni SRA, Pal M. Enhancing trust and interpretability of complex machine learning models using local interpretable model agnostic shap explanations. Int J Data Sci Anal 2024;18:457-66.
Ellis MJ, Lekka C, Holden KL, et al. Identification of high-performing antibodies for the reliable detection of Tau proteoforms by Western blotting and immunohistochemistry. Acta Neuropathol 2024;147:87. [Crossref] [PubMed]
Shu Y, Yue J, Li Y, et al. Development of human lactate dehydrogenase a inhibitors: high-throughput screening, molecular dynamics simulation and enzyme activity assay. J Comput Aided Mol Des 2024;38:28. [Crossref] [PubMed]
Hirukawa A, Singh S, Wang J, et al. Reduction of Global H3K27me(3) Enhances HER2/ErbB2 Targeted Therapy. Cell Rep 2019;29:249-257.e8. [Crossref] [PubMed]
Nave O, Shor Y, Bar R, et al. A new treatment for breast cancer using a combination of two drugs: AZD9496 and palbociclib. Sci Rep 2024;14:1307. [Crossref] [PubMed]
Omoboyowa DA, Aribigbola TC, Aiyeku B, et al. Deep learning and molecular dynamics reveal promising EZH2 inhibitors for epigenetic cancer targeting. Comput Biol Chem 2026;120:108784. [Crossref] [PubMed]
Xiao Z, Hirao H. Deep Learning Models for Predicting Human Cytochrome P450 Inhibition and Induction. J Chem Inf Model 2025;65:9947-61. [Crossref] [PubMed]
Brandariz J, de Llobet L, Esquefa V, et al. Harnessing Senolytics and PARP Inhibition to Expand the Antitumor Activity of CDK4/6 Inhibitors in Prostate Cancer. Mol Cancer Ther 2025;24:1959-76. [Crossref] [PubMed]
Shu F, Xiao H, Li QN, et al. Epigenetic and post-translational modifications in autophagy: biological functions and therapeutic targets. Signal Transduct Target Ther 2023;8:32. [Crossref] [PubMed]
doi: 10.21037/jmai-2025-167 Cite this article as: Wang Z, Li J, Lei B. AI-driven discovery of EZH2 as a core target and potential inhibitors for breast cancer. J Med Artif Intell 2026;9:32.