Original Article
Evaluating large language models for title/abstract screening: a systematic review and meta-analysis & development of new tool
Jin Kyu Kim1,2, Mandy Rickard2, Pankaj Dangle1, Nikhil Batra1, Michael E. Chua2, Adree Khondker2, Konrad M. Szymanski1, Rosalia Misseri1, Armando J. Lorenzo2
1Department of Pediatric Urology, Riley Hospital for Children, Indiana University Health, Indianapolis, Indiana, USA;
2Division of Urology, Department of Surgery, The Hospital for Sick Children, Toronto, Canada
Contributions: (I) Conception and design: JK Kim, M Rickard, N Batra, A Khondker, AJ Lorenzo; (II) Administrative support: ME Chua, R Misseri, AJ Lorenzo; (III) Provision of study materials or patients: JK Kim, N Batra, A Khondker; (IV) Collection and assembly of data: JK Kim, N Batra, A Khondker, KM Szymanski; (V) Data analysis and interpretation: JK Kim, ME Chua, P Dangle, KM Szymanski; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.
Correspondence to: Jin Kyu Kim, MD, FRCSC. Department of Urology, Riley Hospital for Children, 702 Barnhill Drive, Indianapolis, IN 46202, USA. Email: jjk.kim@mail.utoronto.ca.
Background: Screening titles and abstracts for systematic reviews (SRs) is a labor-intensive and error-prone process. With the advent of large language models (LLMs), there is potential to automate this process and improve its efficiency. This study aims to develop an LLM-based screening tool and compare its performance with that of existing LLM-based models in terms of sensitivity, specificity, and precision.
Methods: An SR was conducted in August 2024, searching the MEDLINE, Embase, and Scopus databases for studies reporting performance metrics (sensitivity, specificity, precision) of LLM models for automated title/abstract screening. Studies were included if they reported sufficient data to create a confusion matrix, while those requiring human intervention were excluded. Data extraction followed PRISMA guidelines, and performance metrics were pooled and summarized using forest plots and summary receiver operating characteristic (SROC) curves. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. The authors created a screening tool utilizing the GPT-4o-mini model and tested its utility using existing SRs.
Results: The SR identified 14 LLM-based models, with varying performance metrics. The SROC analysis indicated overall excellent performance with sensitivity and specificity approaching 90%, with overall low risk of bias. In comparison, our developed model achieved 100% sensitivity but lower specificity (81%) and precision (14%). It processed entries at a speed of 1.7 seconds per entry with minimal cost.
Conclusions: LLM-based models offer a promising, cost-effective alternative to manual SR screening. Although these models achieve high sensitivity, the trade-off in specificity requires careful consideration. The integration of LLMs in SR processes could enhance efficiency while maintaining reliability in academic research.
Keywords: Large language model (LLM); systematic review (SR); abstract screening; title screening; automation
Received: 04 November 2024; Accepted: 24 March 2025; Published online: 24 April 2025.
doi: 10.21037/jmai-24-408
Introduction
Traditionally, the screening process for systematic reviews (SRs) involves manually reviewing hundreds to thousands of citations to determine their relevance to the research question, a task that can take weeks or even months to complete (1). Previous attempts to automate SR screening utilized traditional machine learning approaches, including support vector machines and random forests, often combined with bag-of-words or Term Frequency-Inverse Document Frequency (TF-IDF) text representations (2-5). While these methods showed some success, they struggled with the nuanced and context-dependent nature of academic abstracts.
Large language models (LLMs), with their advanced semantic understanding and contextual awareness, offer a new paradigm for such tasks. These models, trained on vast corpora of text data, have the potential to revolutionize the SR process by automating the initial screening of titles and abstracts. Interest in using LLMs for title and abstract screening has grown particularly since 2023, when ChatGPT-3.5 from OpenAI was made public (6). LLM performance has improved rapidly, and several LLM-based tools built on newer models, including GPT-4, have been reported to automate title and abstract screening. Some authors have reported sensitivities greater than 90% using LLM-based tools built on GPT-3.5-turbo, GPT-4, and open-source models such as Llama-2 (7-9). Nonetheless, while there are many reported successes with title and abstract screening using LLMs, there is no clear guideline on whether it is acceptable to use LLMs for SRs. Moreover, there are limited open-source tools available for use, which may increase barriers to timely literature review. Herein, we aim to: (I) evaluate the current performance of LLM-based title and abstract screening tools reported in the literature to assess their reliability; and (II) develop our own LLM-based tool using a recent cost-effective model in an open-source manner and compare its performance against other LLM-based tools. We present this article in accordance with the PRISMA reporting checklist (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-408/rc).
Methods
SR and meta-analysis
In July 2024, we performed an SR and searched the MEDLINE, Embase, and Scopus databases (search strategy available from Appendix 1) (10). All identified records were screened by two independent reviewers. All studies reporting information on the performance (precision/sensitivity/specificity) of an LLM model for automatically screening titles/abstracts were included for assessment by a single reviewer and verified by a second reviewer. Studies that required any human intervention for the screening process were excluded, as were studies published on the same dataset (e.g., a pre-print of a peer-reviewed article). Study characteristics were collected and summarized for included studies, including the LLM model used, number of titles/abstracts screened, F1 score, precision, recall/sensitivity, and specificity. True positives were defined as instances in which the automated tool included an article that the human authors had included, and true negatives as instances in which the tool excluded an article that the human authors had excluded. For studies reporting enough data to formulate a confusion matrix of true/false positives/negatives, model performances were pooled for meta-analysis using forest plots and a summary receiver operating characteristic (SROC) curve. Risk of bias for each included study was assessed using the QUADAS-2 tool (11).
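For clarity, the extracted performance metrics relate to each study's confusion matrix as shown below; this is a minimal illustration with hypothetical counts, not the pooling code used for the meta-analysis.

```python
# Minimal illustration: how sensitivity, specificity, precision, and F1
# are derived from a study's confusion matrix. Counts below are hypothetical.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)   # recall: relevant articles correctly included
    specificity = tn / (tn + fp)   # irrelevant articles correctly excluded
    precision = tp / (tp + fp)     # proportion of included articles that are relevant
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical example: 100 relevant and 900 irrelevant records
print(screening_metrics(tp=95, fp=90, fn=5, tn=810))
```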
Model development
We also developed our own abstract screening tool as a Flask app (12), which utilizes application programming interface (API) of ‘GPT-4o-mini’ model (OpenAI; code/app design available from: https://github.com/kimjk4/AutoAI/tree/main). GPT-4o-mini was chosen for our purpose as it is the latest cost-effective model. Given that there may be hundreds to thousands of titles/abstracts that may be screened, using a smaller and faster model was deemed appropriate. The model was created using Python and allows uploading of RIS files to the Flask app for title/abstract extraction. The user provides the inclusion criteria for a study on the app, which should include population, intervention (if applicable), comparison (if applicable), and outcomes of the study [e.g., “Studies comparing (intervention) to (control) in (population) and reporting (outcomes)”]. This input is then used to zero-shot prompt the model using our prompt (Appendix 2). Following this, the title/abstract is screened by GPT-4o model and a decision for inclusion/exclusion is made as ‘include’, ‘unclear’, or ‘exclude’, along with a summary for why that decision was made. These results are downloadable by the user as a comma-separated values (.csv) file (Figures S1,S2). We tested our model on the title/abstracts from two prior SRs performed by the corresponding author, using the original Research Information Systems (RIS) files from those studies. The first SR evaluated the safety and efficacy of ß3-adrenoceptor agonists in the treatment of bladder dysfunction in children (13), and the second SR compared omental procedures to no intervention during peritoneal dialysis catheter insertions (14).
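To illustrate the workflow, the sketch below shows the core zero-shot screening loop under stated assumptions: it uses the openai (>=1.0) and rispy Python packages, the system prompt is a simplified stand-in for the actual prompt in Appendix 2, and the Flask routing and user interface are omitted (the full implementation is available at the GitHub repository above).

```python
# Sketch of the screening loop (not the full Flask app). Assumes the `openai`
# (>=1.0) and `rispy` packages; the prompt is a simplified stand-in for Appendix 2.
import csv
import rispy
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screen_entry(criteria: str, title: str, abstract: str) -> str:
    """Ask GPT-4o-mini for an include/unclear/exclude decision with a brief reason."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen titles/abstracts for a systematic review. "
                        "Answer 'include', 'unclear', or 'exclude', then give a brief reason."},
            {"role": "user",
             "content": f"Inclusion criteria: {criteria}\n\nTitle: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content

# User-supplied inclusion criteria, as entered in the app
criteria = ("Studies comparing (intervention) to (control) in (population) "
            "and reporting (outcomes)")

# Parse the uploaded RIS file and write decisions to a downloadable CSV
with open("records.ris", encoding="utf-8") as f:
    entries = rispy.load(f)

with open("decisions.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "decision"])
    for entry in entries:
        title = entry.get("title") or entry.get("primary_title", "")
        abstract = entry.get("abstract", "")
        writer.writerow([title, screen_entry(criteria, title, abstract)])
```

Setting the temperature to 0 keeps decisions as deterministic as the API allows, which aids reproducibility across screening runs.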
Results
SR and meta-analysis
From our SR, a total of 15 LLM-based model performances were identified, including our own (Figure 1) (7-9,15-25). Some models had excellent performance, with sensitivities up to 100%. However, achieving high sensitivity often required a trade-off in precision and specificity. The performance of each model is available in Table 1 and Figure 2.
Figure 1 PRISMA flow diagram of the literature search.
Table 1 Summary of performance measures for identified models. *, if multiple independent evaluations were performed, they were reported separately.
Figure 2 Performance of abstract screening summarized in sensitivity/specificity. (A) SROC curve of each large language model-based title/abstract screening tools; (B) forest plot of sensitivity/specificity. CI, confidence interval; FN, false negative; FP, false positive; SROC, summary receiver operating characteristic; TN, true negative; TP, true positive.
The SROC plot of studies with available confusion matrix values shows excellent performance overall, with an area under the receiver operating characteristic curve (AUROC) of 0.922 [summary sensitivity 0.812 (95% CI: 0.617–0.920), false positive rate 0.110 (95% CI: 0.052–0.220); Figure 2A; summary of sensitivity/specificity available in Figure 2B]. Heterogeneity (I2), assessed using sample size-unadjusted and -adjusted pooling approaches, was 91.6–99.7% for the unadjusted and 2.5–3.5% for the adjusted approach. Risk of bias assessment using the QUADAS-2 tool showed generally low risk of bias and low concerns for applicability (Table 2, Figure 3).
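Because the SROC summary point is reported as a false positive rate, the corresponding summary specificity follows directly (a restatement of the values above, not an additional estimate):

\[
\text{specificity} = 1 - \text{FPR} = 1 - 0.110 = 0.890,
\qquad 95\%\ \text{CI: } 1 - 0.220 \text{ to } 1 - 0.052 = 0.780 \text{ to } 0.948
\]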
Figure 3 QUADAS-2 tool risk of bias assessment summary. QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies 2.
Developed model performance
Our model achieved 100% sensitivity/recall in identifying studies that were included in our selected SRs among the 733 individual articles it screened. However, its specificity was lower at 81%, and its precision was 14%. It took an average of 1.7 seconds to process a single entry, at an average cost of 0.00008 USD per entry.
Discussion
For a comprehensive SR, it would be more important to prioritize sensitivity over specificity to ensure no relevant articles are missed. Assuming models can achieve close to 100% sensitivity (indicating 0% false negatives), human reviewers would only need to focus on differentiating true positives from false positives. Other authors, such as Wang et al., have also prioritized recall/sensitivity, achieving excellent sensitivity at the cost of specificity (7). Using our GPT-4o-mini-based model, we identified all 15 relevant articles (100% sensitivity) and screened out approximately 75% of the entries as true negatives, potentially reducing the workload 4-fold. Other studies have reported reductions in person-hours by a factor of 6–10, or an overall reduction of 10 human hours (21,26). As other models have also achieved close to 100% sensitivity in screening titles/abstracts, there may be an evolving role for LLM-based models in screening out irrelevant articles and improving efficiency for SRs.
Previously, LLM-based title/abstract review was expensive, time-consuming, and possibly lacking in performance, preventing its routine use. However, with newer models and the competition between proprietary and open-access models available for public use, the cost of performing automated screening for both experimental and practical purposes has decreased significantly. Guo et al. reported a cost of around 25 USD for screening 14,771 titles and 354 abstracts with the GPT-4 API (15). With GPT-4o-mini, a model designed to handle lighter tasks, our tool can screen 1,000 entries in less than 30 minutes for approximately 8 cents. Compared with the human-hours and costs that SR title/abstract screening would otherwise incur, LLM-based models may be a cost-effective alternative to traditional screening methods.
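These figures follow directly from the per-entry speed and cost reported in the Results; a back-of-the-envelope projection, assuming entries are processed sequentially:

```python
# Back-of-the-envelope projection from the per-entry figures in the Results
# (1.7 seconds and 0.00008 USD per entry); assumes sequential processing.
seconds_per_entry = 1.7
usd_per_entry = 0.00008
n_entries = 1000

total_minutes = n_entries * seconds_per_entry / 60  # ~28 minutes
total_usd = n_entries * usd_per_entry               # ~0.08 USD (8 cents)
print(f"{n_entries} entries: ~{total_minutes:.0f} min, ~${total_usd:.2f}")
```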
There are limitations to this study. The pooled summary of performance does not reflect all models reported in the literature, because some studies lacked the information needed to reconstruct a confusion matrix. However, models that were not included, such as those of Cai et al. and Wang et al., had close to 100% sensitivity and may have further improved the outlook on LLM-based models (7,8). Specific to our model, while it had excellent sensitivity, the specificity was lower; we observe a similar pattern in studies that reported high sensitivity, including the models of Issaiy et al. and Wang et al. In an ideal setting, a model would have 100% sensitivity and specificity. Achieving 100% sensitivity reassures users that a research item that should be included in an SR will not be missed, making an LLM-based tool reliable. These tools are not perfect but continue to improve with newer iterations, such as OpenAI's o1 model, which uses chain-of-thought reasoning to process queries in a stepwise manner (27). Currently, such models are not widely available to the general public via API and are costly to use, which makes them less fitting for SR screening purposes. Nonetheless, they will soon be more widely available and may improve the accuracy and efficiency of the automated screening process, potentially overcoming the trade-off between sensitivity and specificity. Moreover, the 'gold standard' for screening was human screening; despite layered safeguards against errors, human errors may still occur in SR processes, which may affect the perceived performance of LLM-based models. As LLMs are relatively new tools, real-world studies that have performed abstract screening using LLM models alone are lacking. However, we hope to provide an overview of the performance of current LLM-based tools compared with human comparators, who serve as the 'gold standard' of the SR process.
While the horizon is optimistic that LLM-based abstract screening will become an acceptable and standard practice in the future, any users of LLM-based models should be cognizant of the implications of using these models, as SRs represent the highest level of evidence we have for a specific topic in academia. We urge users to adhere to proposed guidelines for using LLMs in academia, including the use of peer-reviewed, externally validated tools and transparent reporting of LLM use (28).
Conclusions
This study highlights the potential of LLMs to automate the screening process for SRs. Some models, including our developed model, achieved near-perfect sensitivity. Despite reduced specificity leading to false positives, LLMs may significantly reduce workload and processing time, making them cost-effective alternatives to manual methods. With careful implementation, these models could transform SRs by balancing efficiency with reliability in academic research.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
Clark J, Glasziou P, Del Mar C, et al. A full systematic review was completed in 2 weeks using automation tools: a case study. J Clin Epidemiol 2020;121:81-90. [Crossref] [PubMed]
van Dinter R, Tekinerdogan B, Catal C. Automation of Systematic Literature Reviews: A Systematic Literature Review. Information and Software Technology. 2021;136:1-16. [Crossref]
Feng L, Chiam YK, Lo SK. Text-Mining Techniques and Tools for Systematic Literature Reviews: A Systematic Literature Review. In: Proceedings of the 24th Asia-Pacific Software Engineering Conference (APSEC); 2017:41-50.
Ros R, Bjarnason E, Runeson P. A Machine Learning Approach for Semi-Automated Search and Selection in Literature Studies. In: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering; 2017:118-127.
Cohen AM, Hersh WR, Peterson K, et al. Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc 2006;13:206-19. [Crossref] [PubMed]
Ofori-Boateng R, Aceves-Martins M, Wiratunga N, et al. Towards the automation of systematic reviews using natural language processing, machine learning, and deep learning: a comprehensive review. Artif Intell Rev 2024;57:200. [Crossref]
Wang S, Scells H, Zhuang S, et al. Zero-Shot Generative Large Language Models for Systematic Review Screening Automation. In: Goharian N, et al., editors. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Cham: Springer; 2024. doi: 10.1007/978-3-031-56027-9_25.
Cai X, Geng Y, Du Y, et al. Utilizing ChatGPT to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation. medRxiv 2023. doi: 10.1101/2023.09.06.23295072.
Issaiy M, Ghanaati H, Kolahi S, et al. Methodological insights into ChatGPT's screening performance in systematic reviews. BMC Med Res Methodol 2024;24:78. [Crossref] [PubMed]
Kim JK, Rickard M, Dangle P, et al. Evaluating large language models for title/abstract screening: a systematic review and meta-analysis. OSF; 18 Dec 2024.
Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36. [Crossref] [PubMed]
Kim JK, De Jesus MJ, Lee MJ, et al. β3-Adrenoceptor Agonist for the Treatment of Bladder Dysfunction in Children: A Systematic Review and Meta-Analysis. J Urol 2022;207:524-33. [Crossref] [PubMed]
Kim JK, Lolas M, Keefe DT, et al. Omental Procedures During Peritoneal Dialysis Insertion: A Systematic Review and Meta-Analysis. World J Surg 2022;46:1183-95. [Crossref] [PubMed]
Guo E, Gupta M, Deng J, et al. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. J Med Internet Res 2024;26:e48996. [Crossref] [PubMed]
Lin Y, Li J, Xiao H, et al. Automatic literature screening using the PAJO deep-learning model for clinical practice guidelines. BMC Med Inform Decis Mak 2023;23:247. [Crossref] [PubMed]
Khraisha Q, Put S, Kappenberg J, et al. Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res Synth Methods 2024;15:616-26. [Crossref] [PubMed]
Attri S, Kaur R, Singh B, et al. Transforming systematic literature reviews: unleashing the potential of GPT-4, a cutting-edge large language model, to elevate research synthesis. Presented at: ISPOR 2024; 2024 May 5-8; Atlanta, GA.
Du J, Soysal E, Wang D, et al. Machine learning models for abstract screening task - A systematic literature review application for health economics and outcome research. BMC Med Res Methodol 2024;24:108. [Crossref] [PubMed]
Datta S, Lee K, Paek H, et al. Optimizing systematic literature reviews in endometrial cancer: leveraging AI for real-time article screening and data extraction in clinical trials. Presented at: ISPOR 2024; 2024 May 5-8; Atlanta, GA.
Royer J, Wu EQ, Ayyagari R, et al. Prospects for automation of systematic literature reviews (SLRs) with artificial intelligence and natural language processing. Presented at: ISPOR Europe 2023; 2023 Nov 12-15; Copenhagen, Denmark.
Kaur R, Rai P, Attri S, et al. Revolutionizing systematic literature reviews: harnessing the power of large language model (GPT-4) for enhanced research synthesis. Presented at: ISPOR 2024; 2024 May 5-8; Atlanta, GA.
Rai P, Kaur R, Pandey S, et al. Advancing systematic literature reviews: the integration of AI-powered NLP models in data collection processes. Presented at: ISPOR 2024; 2024 May 5-8; Atlanta, GA.
Huotala A, Kuutila M, Ralph P, et al. The promise and challenges of using LLMs to accelerate the screening process of systematic reviews. Presented at: EASE 2024; 2024 Jun 18-21; Salerno, Italy.
Dennstädt F, Zink J, Putora PM, et al. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev 2024;13:158. [Crossref] [PubMed]
Herbst P, Baars H. Accelerating literature screening for systematic literature reviews with Large Language Models – development, application, and first evaluation of a solution. Presented at: Learning, Knowledge, Data, Analysis 2023; October 09–11, 2023; Marburg, Germany. CEUR Workshop Proceedings (CEUR-WS.org); 2023.
Kim JK, Chua M, Rickard M, et al. ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine. J Pediatr Urol 2023;19:598-604. [Crossref] [PubMed]
doi: 10.21037/jmai-24-408 Cite this article as: Kim JK, Rickard M, Dangle P, Batra N, Chua ME, Khondker A, Szymanski KM, Misseri R, Lorenzo AJ. Evaluating large language models for title/abstract screening: a systematic review and meta-analysis & development of new tool. J Med Artif Intell 2025;8:34.