Toward computational assessment of scientific justification across biomedical research lifecycles
Original Article

Sankalpa Ghose1, Insoo Hyun1,2,3

1Centre for Biomedical Ethics, National University of Singapore, Singapore, Singapore; 2Center for Bioethics, Harvard Medical School, Boston, MA, USA; 3The Hastings Center for Bioethics, Garrison, NY, USA

Contributions: (I) Conception and design: Both authors; (II) Administrative support: Both authors; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: S Ghose; (V) Data analysis and interpretation: Both authors; (VI) Manuscript writing: Both authors; (VII) Final approval of manuscript: Both authors.

Correspondence to: Sankalpa Ghose, PhD candidate. Centre for Biomedical Ethics, National University of Singapore, Blk MD 11, 10 Medical Dr, #02-03 Singapore, Singapore. Email: sankalpa@u.nus.edu; sg@alethic.ai.

Background: Scientific justification is the foundation of institutional science, ensuring that research and experimentation are pursued on the basis of knowledge contribution, methodological rigor, ethical responsibility, and public benefit. In biomedicine, research investigators, academic departments, funders, and oversight bodies including Institutional Review Boards (IRBs), Institutional Animal Care and Use Committees (IACUCs), Scientific Review Boards (SRBs), and Stem Cell Research Oversight (SCRO) committees each take part in this process—yet evidence shows wide variability in how justification standards are applied across biomedical research lifecycles, with significant consequences for the practice of the scientific method and public commitments to it. Accordingly, this study aims to develop a more precise and practically applicable account of scientific justification.

Methods: This study explores whether computational tools can assist in structuring and evaluating aspects of scientific justification in biomedical research. We propose a categorized and quantified rubric for evaluating justification across research lifecycles, broken down into seven stages of experimental and institutional operation, each assessed against seven core components of justificatory relevance. We develop a data architecture and processing pipeline for computing this in a parameterized schema with prompt-based workflows.

Results: This computable framework is implemented in SciJust.AI, an artificial intelligence (AI)-enabled prototype that uses large language models (LLMs) to score research-related inputs, generate structured feedback, and visualize justification profiles. Constructed to support rather than replace human judgment, SciJust.AI advances a model in which investigators and reviewers can be equipped with an experimental assessment tool that could be used at key stages, or on an ad hoc basis, throughout the design, execution, and evaluation of biomedical research—with the aim of promoting greater clarity, consistency, and comparability from protocol to practice. Both manual and AI-assisted evaluation are made publicly available for early-stage testing, and we present illustrative example evaluations with the goal of generating discussion and gathering feedback for future improvements and validation efforts.

Conclusions: By reframing scientific justification as a measurable, improvable dimension of the scientific method, this approach puts forward a practical path for better aligning experimental rigor with confidence in results.

Keywords: Scientific justification; biomedical research lifecycle; artificial intelligence (AI); AI for Science


Received: 18 November 2025; Accepted: 12 May 2026; Published online: 22 June 2026.

doi: 10.21037/jmai-2025-1-246


Highlight box

Key findings

• Propose a multi-dimensional rubric for structured evaluation of scientific justification across biomedical research lifecycles.

• Develop an end-to-end computational architecture for executing the rubric and implement it in SciJust.AI, a prototype we make available for testing and feedback. This allows for both manual and artificial intelligence (AI)-aided use, with large language models (LLMs) evaluating research inputs and producing structured justification evaluation outputs.

• This includes 67 unique criterion scores, across seven levels and seven components of scientific justification along the biomedical research lifecycle, as well as a generated explanation for each score. These are presented in an interactive interface that displays each score organized by lifecycle stage, along with dynamically updated visual plots and score tables. These materials are additionally available for user download as a structured report.

• We include a set of illustrative evaluations with the aim of generating discussion and gathering feedback for future work and validation trials.

What is known and what is new?

• Biomedical research lifecycles require scientific justification, yet empirical studies show substantial variability and underspecification in how justification standards are tracked, assessed, and applied across stages of the biomedical research lifecycle, including experimental design, institutional review, funding, and public and policy communication.

• This study introduces a multi-dimensional rubric and a computational prototype (SciJust.AI) that show how scientific justification criteria across research lifecycle stages can be modeled and analyzed, both conceptually and practically, using interactive computational tools that incorporate LLMs.

What is the implication, and what should change now?

• Research institutions and investigators should explore the use of scientific justification rubrics and computational assessment aids across the biomedical research lifecycle. Scientific justification should be treated as a measurable and improvable dimension of the scientific method. Responsible use of AI could support AI for Science practices across the scientific enterprise to help make this possible, amplifying human deliberation and enabling meaningful improvements in process and outcomes.


Introduction

Background

Scientific justification is what gives legitimacy and authority to institutional science, especially in biomedicine

Science, as a practice for discovering knowledge of the way the world works, is fundamentally premised on what is commonly known as the scientific method. This principled approach, implemented by individuals and instituted by groups, consists of practicing commitment to “systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses” (1).

That said, the procedural approach of how experiments are themselves formulated and implemented is not static—it is dynamic and ongoing. The development of new instruments and techniques for conducting experiments, and the advancement of new approaches for analyzing, reporting, reviewing, and publishing observed results, is how the scientific method is continually updated. Indeed, a central commitment of the scientific method is to improve the scientific method—and it is in this sense that we say that an experiment and its observed results are “based on presently valid scientific methods” (1).

Despite this importance, comparatively little attention has traditionally been given to improving how experiments are formulated and implemented by the organizing bodies of institutional science, although this has been changing in recent years in response to concerns about reproducibility, validity, and innovation; increased prominence of meta-scientific topics and methods (2-4); and renewed discussion around the role, aims, and productivity of major scientific programs (5) and institutions (6). This matters because these elements undergird the improvement of the scientific enterprise as an organized institutional practice and directly correspond to which professional research goals the public commits to, and with what degree of credence.

Indeed, ever since the formation of the first Scientific Societies, systematic processes and oversight bodies for experimental design, review, and reporting have helped guide the integrity and aims of institutional science—from protocol to publication, and from funding to oversight. The practical business of fact finding is grounded in operationalized judgments about which investigations are worth pursuing, and how.

This is especially true in biomedicine, where over time an infrastructure of review and approval has developed to legitimate, govern, and evaluate scientific research across biomedical institutions worldwide. Institutional Review Boards (IRBs), Institutional Animal Care and Use Committees (IACUCs), Scientific Review Boards (SRBs), and Stem Cell Research Oversight (SCRO) committees—one or more of their determinations are necessary for most experiments to proceed (7,8). Determinations of scientific justification are also foundational to initial grant reviews and allocations at private foundations, state, and federal funding agencies; to industry investment along the clinical translation pipeline; and to the approvals processes of the Food and Drug Administration (FDA) for ultimately bringing therapies to the public for patient use. Moreover, more informal processes of investigative merit and lifecycle justification occur within academic departments and amongst researchers themselves in deciding how best to design, conduct, and present their proposed, ongoing, and completed work across their own priorities and those of their stakeholders.

What is the purpose of all this consideration, communication, and demonstration? Together it constitutes the system for institutional determinations of scientific justification—of whether an experimental protocol is worth conducting in commitment to the scientific method and the public results of knowledge production hoped to be achieved by it. Different stakeholders examine different dimensions of such assessment—including methodological rigor, ethical permissibility, feasibility, regulatory compliance, resource utilization, and broader scientific or societal value—and cumulatively these processes shape research designs and institutional determinations. Moreover, in some contexts of authorization and approval, especially where committees must evaluate whether the risks of a study are justified by its anticipated benefits, these deliberations often depend on how the scientific rationale and expected knowledge value of a proposed study are interpreted within the reviewing institution.

Overall, then, it is essential to the integrity and success of biomedical research to ensure what the National Institutes of Health (NIH) calls “the strict application of the scientific method to ensure robust and unbiased experimental design, methodology, analysis, interpretation, and reporting of results” (9), and what the International Society for Stem Cell Research (ISSCR) identifies as the need for “adequate and appropriate scientific justification [that] is required for performing research using specified materials” (10).

Indeed, this is the very foundation of science policy. The National Academies of Sciences, Engineering, and Medicine have addressed these issues across several landmark reports underscoring the importance of systematic, institutionally supported processes for maintaining research integrity and advancing the scientific method (11,12).

Rationale and knowledge gap

The scientific method, as institutionally organized, should itself practice the scientific method

Despite this mandate, numerous studies and meta-reviews reveal significant inconsistencies in determinations of scientific justification (13-16). In particular, the deliberations and decisions of institutional review bodies very often vary in their application of standards of scientific justification, not because of a protocol itself but because of different interpretations of standards across institutions and contexts. Left unaddressed, this risks affecting public trust in institutional science, undermining the validity and reproducibility of scientific findings, and holding back efforts to systematically assess and improve the productivity, safety, and accountability of public investments in the goals of biomedicine (17,18).

Indeed, this is acknowledged as a serious issue—for example, the U.S. Government Accountability Office’s 2023 report, “Institutional Review Boards: Actions Needed to Improve Federal Oversight and Examine Effectiveness”, underscores the urgency of addressing the variability and inconsistency of these bodies and their processes (19). More explosively, political debates surrounding governance, expertise, and authority—particularly in public health institutions and biomedical research authorities—have brought processes of scientific justification to center stage.

Institutional review processes are responsible for assessing more than informed consent language or baseline ethical compliance. Just as important, they evaluate whether research meets standards of scientific merit and of public interest; investigator qualifications and research environment adequacy; and methodological rigor and necessity within the field.

These are not abstract aspirations but real-life challenges. It is common for IRBs—and even IACUCs and SCROs—to have to assess the scientific merit and justification of research protocols (20), especially in cases where the proposed research is industry sponsored or department sponsored. In these situations, there is no requirement of an external committee, such as an outside funding agency, to perform independent scientific review. And even when research is externally funded, each proposal must be determined by an institutional committee to have a favorable risk/benefit ratio to be approved. Ultimately, no matter how the research is funded, the upshot is that review committees cannot assess risk/benefit without considering the scientific justification of the study in question (21)—which is a formidable task. According to the Belmont Report, risk-benefit analysis should strive to be accurate and should support transparent and nonarbitrary ethical decisions, and “the nature, probability and magnitude of risk should be distinguished with as much clarity as possible”.

How can high-standard clarity and rigor be better promoted? Inconsistencies in how scientific justification is institutionally adjudicated raise concerns about the integrity of institutional review processes. A fundamental question emerges: is such variability itself experimentally and ethically justified?

It could be argued that different bodies may reasonably reach different conclusions about the permissibility of research protocols across different dimensions of relevance. Even granting the need for such flexibility, the current variability in practices of scientific justification is often characterized by:

  • Underspecification—ambiguity in justification criteria, leading to inconsistent evaluations.
  • Incommensurability—challenges in comparing justification factors across different research domains and methodologies in which there is no common unit of measurement (22).

As a result, biomedical research lifecycles can be said to lack a shared analytic framework and common measures for the wide range of factors involved in evaluating scientific justification, both at key stages and over the entirety of the institutional lifecycle. This affects all manner of stakeholders, from research investigators developing protocols to oversight bodies, peer reviewers, and funding agencies, and on to policy makers and the public. Most frustratingly, the absence of robust justification standards undermines the procedural accessibility and auditability of the scientific method, and therefore of Science itself, as an offered worldview and vehicle of public reason and action.

In a practical sense, therefore, we argue that there is nothing more fundamental for future scientific progress than improving institutional structures of scientific justification. While social and political contexts will change, institutional science’s commitments to representing the public interest will remain a central basis of its public support and public funding. That these are being called into question in the present political moment is alarming from multiple angles, but it is also an opportunity to act constructively to strengthen the scientific method as institutionally practiced (23).

For those who believe institutional science is an essential public good, we propose improving its processes of justification with greater:

  • Specification—establishing explicit evaluation criteria.
  • Formalization—defining justification components on a measurable scale.
  • Computation—implementing digital tools, including artificial intelligence (AI), to assist in evaluation.

In this sense we are trying to better conceptualize the theory and practice of determining biomedical research as scientifically justified—insofar as this might be best measured along its key processes and assessed over its institutional entirety.

In the most foundational sense, including that of philosophy and philosophy of science, justification means having an acceptable reason for doing something, e.g., believing something to be true, delivering judgement, proceeding with a committed aim, and so forth (24). Whether or not some action is justified depends crucially on (I) what counts as a reason and (II) what makes that reason an acceptable one.

Therefore, in analyzing the notion of scientific justification—especially toward a usable framework for biomedical research lifecycle processes, such as experimental design, protocol assessment, funding allocation, and public communication—it is important to spell out the reasons for conducting a proposed scientific activity and to identify the different dimensions of justification that such reasons ought to be measured on to meet institutional and investigative standards of process-based acceptability and outcome-based performance.

Objective

Toward a framework for institutional scientific justification

Accordingly, we propose a framework for tracking and evaluating the different senses of justification in institutional science. This covers the spectrum of stakeholders and processes—from the experimenters themselves, to scientific peers and overseeing committees, to universities and institutions, and ultimately to society and the public—each involving distinct reasons and criteria that can be assessed to support the processes of institutional review.


Methods

Scientific justification framework

To operationalize the concept of scientific justification across institutional science, we develop a structured framework that maps different dimensions across the research lifecycle.

We identify seven levels of the experimental protocol process:

  • Initial Research Conceptualization (IRU);
  • Ethical and Regulatory Approval (ERA);
  • Funding and Resource Allocation (FRA);
  • Preclinical and Clinical Trials (PCA);
  • Data Collection and Analysis (DCA);
  • Peer Review and Publication (PRP);
  • Public Communication and Policy Impact (PCPI).

Each level can be evaluated across seven justification components:

  • Scientific Validity and Rigor (SVR);
  • Feasibility and Resource Utilization (FRU);
  • Innovation and Impact Potential (IIP);
  • Ethical Integrity and Patient Safety (EIPS);
  • Alignment with Public Health and Policy Goals (PHPG);
  • Transparency, Accountability, and Compliance (TAC);
  • Social Responsibility and Public Trust (SRPT).

The seven levels identified correspond to major stages through which biomedical research typically progresses—from initial conceptualization and ethical approval to funding, experimentation, publication, and eventual public or policy impact.

The seven justification components—scientific validity and rigor, feasibility and resource utilization, innovation and impact potential, ethical integrity and patient safety, alignment with public health and policy goals, transparency and compliance, and social responsibility—correspond to recurring evaluation criteria that we identified as arising across the biomedical research lifecycle among research investigators, department evaluators, SRBs, funders, institutional review boards, and related bodies. These seven components are also expounded upon in professional guidelines aimed at disparate research communities, such as guidance resources for IRBs and IACUCs, and professional society standards for stem cell and genome editing scientists and relevant stakeholders (25-27).

The result is a scientific justification rubric in which each level’s justification components are composed of criteria that can be evaluated and scored.
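The level-component-criteria structure can be pictured as nested records. The following is a minimal sketch, not the SciJust.AI implementation: the class names, the helper `empty_rubric`, and the single example criterion with its maximum of 8 are hypothetical and introduced only for illustration. Only the level and component names are taken from the lists above.

```python
from dataclasses import dataclass, field

LEVELS = [
    "Initial Research Conceptualization",
    "Ethical and Regulatory Approval",
    "Funding and Resource Allocation",
    "Preclinical and Clinical Trials",
    "Data Collection and Analysis",
    "Peer Review and Publication",
    "Public Communication and Policy Impact",
]

COMPONENTS = [
    "Scientific Validity and Rigor",
    "Feasibility and Resource Utilization",
    "Innovation and Impact Potential",
    "Ethical Integrity and Patient Safety",
    "Alignment with Public Health and Policy Goals",
    "Transparency, Accountability, and Compliance",
    "Social Responsibility and Public Trust",
]


@dataclass
class Criterion:
    text: str       # what is being assessed
    max_score: int  # highest score this criterion can receive


@dataclass
class Cell:
    """One level-component pairing and the criteria scored within it."""
    level: str
    component: str
    criteria: list[Criterion] = field(default_factory=list)

    @property
    def max_score(self) -> int:
        # A component's maximum at a level is the sum of its criterion maxima.
        return sum(c.max_score for c in self.criteria)


def empty_rubric() -> dict[tuple[str, str], Cell]:
    """Build one empty cell per (level, component) pair."""
    return {(lv, co): Cell(lv, co) for lv in LEVELS for co in COMPONENTS}


rubric = empty_rubric()
# Hypothetical criterion and maximum, for illustration only.
rubric[(LEVELS[0], COMPONENTS[0])].criteria.append(
    Criterion("Hypothesis is clearly stated", 8)
)
```

Keying cells by the (level, component) pair keeps each of the 49 pairings addressable on its own terms, which matches the rubric's emphasis on criterion-level detail over a single aggregate.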

For illustrative purposes, we present a preliminary version of our scientific justification rubric below (Figure 1)—with relative weights for each component at each level (noting that actual use would require institutional deliberation and testing to determine applicability and accuracy).

Figure 1 Scientific justification rubric. Only the high-level component scores are shown here to illustrate the varying numerical distribution of component score maxima per level (sub-criteria per component are detailed further below). We present this as a preliminary version representing possible rather than validated importance of different components per level. Actual component importance in institutional use would require extensive stakeholder deliberation and iteration.

Each level sums to 100, enabling evaluation across components to a maximum score of 100 per level, and a maximum total score of 700. The weighting of components within each level is grounded in the priority tasks of that stage of the biomedical research lifecycle. For each level, the component with the greatest score maximum corresponds to the function that stakeholders of that phase are principally motivated or mandated to advance. For example, at the Ethical and Regulatory Approval level, the component of “Ethical Integrity and Patient Safety” carries the highest weight (33/100), followed by the component of “Transparency, Accountability, and Compliance” (22/100), reflecting how institutional review gives particular focus to these in comparative priority to other component considerations like “Feasibility and Resource Utilization” (5/100). By contrast, at the level of Funding and Resource Allocation, it is exactly the component of “Feasibility and Resource Utilization” that dominates (40/100), reflecting that stakeholders at this stage are primarily concerned with whether the research plan is executable and resources well-deployed. In this sense, the weights are not assigned arbitrarily but encode a structured baseline of how stakeholders distribute evaluative emphasis across the research lifecycle. We present the numerical score maxima as preliminary, representing possible rather than validated importance of different components per level—which is to say, as a theoretically motivated starting point for demonstrating the conceptual and operational framing of our proposed method. Identifying the accurate and appropriate weights for institutional use would require further iteration.

For example, in our preliminary rubric, at the level of Initial Research Conceptualization, the components of “Scientific Validity and Rigor” and “Innovation and Impact Potential” are prioritized equally (24/100 each), reflecting our theoretical assessment that at this stage the two most important considerations are whether the proposed research is scientifically sound and novel. Whether this should be equally so might be contested or may depend on the institutional context—and as such, validation and actual institutional use would require empirical refinement, for example through expert consultation, analysis of assessment and review precedents, and alignment with specific institutional policies and priorities.
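The constraint that component maxima sum to 100 within each level is simple to check mechanically. The sketch below uses the three Ethical and Regulatory Approval values stated above (33, 22, and 5); the even split of the remaining points across the other four components is a hypothetical placeholder, not the rubric's actual distribution, and `check_level` is an illustrative helper name.

```python
def check_level(weights: dict[str, int], expected_total: int = 100) -> bool:
    """Confirm that a level's component maxima sum to the expected total."""
    total = sum(weights.values())
    if total != expected_total:
        raise ValueError(f"component maxima sum to {total}, expected {expected_total}")
    return True


# Values stated in the text for the Ethical and Regulatory Approval level.
era_stated = {"EIPS": 33, "TAC": 22, "FRU": 5}

# Points left over for the four components whose maxima are not given in the text.
remaining = 100 - sum(era_stated.values())

# Hypothetical even split of the remainder, for illustration only.
era_example = {**era_stated, "SVR": 10, "IIP": 10, "PHPG": 10, "SRPT": 10}
check_level(era_example)
```

A check of this kind would let an institution revise individual weights during deliberation while guarding against a level drifting away from the 100-point normalization.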

The overall point of the preliminary rubric and its distributions is to put forward a level-component-criteria framework that allows for greater tracking and accounting of the multidimensional aspects of scientific justification across biomedical research lifecycles. It is important to note that this granularity is the primary locus of assessment. That said, in order to assist in interpretation of scored outputs, we additionally ascribe high-level evaluative judgements to the aggregate scores, as follows: 0–20% (0–140/700) = terrible; 21–40% (141–280/700) = poor; 41–60% (281–420/700) = fair; 61–80% (421–560/700) = good; 81–100% (561–700/700) = excellent. These are intended as descriptive orientation aids and should be interpreted in the context of the full criterion-level breakdown rather than as standalone verdicts of overdetermined net comparison across what very well may be vastly different contexts and studies. Nonetheless, we aver that part of measurement is enabling ease of comparative assessment, and so such normalized normativity is a design principle we advance.
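The band boundaries above translate directly into a lookup on the 700-point total. This is a minimal sketch under the assumption that totals are integers; the function name `band_for_total` is ours, and the labels and cut-offs are those given in the text.

```python
# Upper bound of each band on the 700-point scale, with its label.
BANDS = [
    (140, "terrible"),
    (280, "poor"),
    (420, "fair"),
    (560, "good"),
    (700, "excellent"),
]


def band_for_total(total: int) -> str:
    """Map an aggregate justification score (0 to 700) to its descriptive band."""
    if not 0 <= total <= 700:
        raise ValueError("total must lie between 0 and 700")
    for upper, label in BANDS:
        if total <= upper:
            return label
    raise AssertionError("unreachable: the final band ends at 700")
```

For example, `band_for_total(141)` falls in the "poor" band because the "terrible" band ends at 140. As the text notes, the label is an orientation aid and is meant to be read alongside the criterion-level breakdown.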

That said, the reality of the incommensurability of values demands that justification not become an overdetermined function of flattened trade-offs between different levels, components, or criteria. As such, our approach is about maximizing each on its own terms within its own level, component, and criteria in full maintenance of the context of all other factors—and only thereby tracking the legitimacy of the whole. As with other multidimensional scoring and judgement systems, the idea is to do better in every sense, and across all multidimensional relations therein, with as much specification as possible to guide practical analysis and utilization.

An analogy here can be made to scoring rubrics in figure skating or gymnastics—which include components like Technical merit and Aesthetic merit, each with defined sub-criteria. Each of the criteria within each of the components is given principled weight of importance (maximum criterion score), which aggregates to the component score and then to the total combined component scores. In such contexts, the relative weights assigned to different components are not intended to represent trade-offs between incommensurable values, but rather to illustrate a structured heuristic for modeling how different aspects of importance are typically emphasized in accordance with the priorities of each stage. This produces a multi-dimensional accounting and total score for the performance—of the skating or gymnastics routine, or by the same logic, for the biomedical research lifecycle being evaluated.

While it might be said that this numerical distribution is an inherent balancing act, that is a quality of all informed judgement, and our approach is not put forward to reduce consideration to pure number, let alone a single aggregate one, but to enrich qualitative consideration with a multi-dimensional assay of numerical adaptability that helps track standard levels, components, and criteria of scientific justification across the biomedical research lifecycle. The simple fact is that such complex judgement is a daily reality; we believe it is possible that a shared qualitative and quantitative framework for managing such complexity could be of value and use, including for better judgements and ultimately better results.

Given that framing of intent, we further detail our scientific justification rubric, providing the criteria and the sub-score maximum for each, in each component, across each level (Figure 2).

Figure 2 Criteria-expanded version of scientific justification rubric (4-14,16-19,21,22). This shows the criteria per component, and the maximum sub-scores for each. This is a preliminary model of a fully-specified scientific justification rubric, and the one utilized for demonstration and execution purposes across this study. Scores are normalized to total 100 per level, enabling evaluation across components to a maximum justification score of 100 per level.

Overall, the number of justificatory elements makes a structured approach to evaluation both necessary and productive. As in other performance-driven domains, where systematic measurement enables improvement, evaluative clarity is the essential target.

For this reason, our justification rubric operates on a simple yet powerful architecture: Levels × Components → Criteria + Scores.

We present this as a preliminary model in hopes of illustrating our proposed approach and inviting consideration, testing, and feedback on how it might be improved or made more accurate.

Computational operationalization

To operationalize the rubric computationally, we designed a data architecture and processing pipeline that maps research inputs to a parameterized schema using large language models (LLMs). The inputs—provided as entered text, uploaded documents, Uniform Resource Locators (URLs), or generated examples—are evaluated through a structured prompt workflow that instructs the model to assess each criterion and return criterion-specific explanations and integer scores.

Concretely, this is executed as follows: for each evaluation run, the prototype provides a selected LLM with the research inputs and instructs it to follow structured prompts that specify all criteria in fixed order, each with its explicit score maximum, thereby supplying the entirety of the rubric’s Level, Component, Criteria, and Score assessment parameters. The model is then instructed to return a strictly formatted JavaScript Object Notation (JSON) array of 67 strings, each embedding an integer score bounded by that criterion’s maximum and a concise justification. These criterion-level scores are then summed within each component to produce component scores, and component scores are summed within each level to produce level scores out of the maximum of 100 per level. Level scores are finally aggregated to a total justification score out of the maximum 700.
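The parse-and-sum step described above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: the three-entry `RUBRIC` stands in for the full 67-criterion list, its maxima are hypothetical, and the "score: justification" string layout is an assumed format, since the text specifies only that each string embeds an integer score and a concise justification.

```python
import json
import re
from collections import defaultdict

# Hypothetical miniature rubric in fixed order: (level, component, criterion maximum).
RUBRIC = [
    ("ERA", "EIPS", 10),
    ("ERA", "EIPS", 5),
    ("ERA", "TAC", 8),
]


def parse_scores(raw: str, rubric=RUBRIC) -> list[tuple[int, str]]:
    """Parse a JSON array of strings into (score, justification) pairs."""
    items = json.loads(raw)
    if not isinstance(items, list) or len(items) != len(rubric):
        raise ValueError("expected exactly one string per criterion")
    parsed = []
    for item, (_, _, max_score) in zip(items, rubric):
        if not isinstance(item, str):
            raise ValueError("each array entry must be a string")
        # Assumed layout: leading integer, then ':' or '-', then the justification.
        match = re.match(r"\s*(\d+)\s*[:\-]\s*(.*)", item, flags=re.S)
        if match is None:
            raise ValueError(f"no leading integer score in {item!r}")
        score = int(match.group(1))
        if score > max_score:
            raise ValueError(f"score {score} exceeds maximum {max_score}")
        parsed.append((score, match.group(2).strip()))
    return parsed


def aggregate(parsed, rubric=RUBRIC):
    """Sum criterion scores into component scores, level scores, and a total."""
    component_scores = defaultdict(int)
    level_scores = defaultdict(int)
    for (score, _), (level, component, _) in zip(parsed, rubric):
        component_scores[(level, component)] += score
        level_scores[level] += score
    return dict(component_scores), dict(level_scores), sum(level_scores.values())


raw = '["8: consent process is described", "3: monitoring plan is thin", "6: registration is stated"]'
components, levels, total = aggregate(parse_scores(raw))
```

Rejecting an array of the wrong length or a score above its maximum, rather than silently clipping, keeps malformed model output from entering the recorded evaluation data.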

In effect, the LLM is provided with the concepts and scoring framework for the entire rubric and each of its elements, and is instructed to assess and score each criterion of each component within each level in consideration of the research inputs. Once these scores (and their accompanying justification descriptions and improvement suggestions) are generated, the rubric is automatically filled, allowing all individual and cumulative scores to be displayed and recorded as the evaluation data of that run.
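A fixed-order prompt of the kind described can be assembled from the rubric itself. The sketch below is hypothetical throughout: the instruction wording, the function name `build_prompt`, and the example criteria and maxima are ours and are not the prompts used in SciJust.AI. It builds only the prompt text and makes no model call.

```python
def build_prompt(research_text: str, criteria: list[tuple[str, str, str, int]]) -> str:
    """Assemble a scoring prompt that lists every criterion in fixed order.

    Each entry in `criteria` is (level, component, criterion text, maximum score).
    """
    lines = [
        "Score the research input against each criterion below, in order.",
        f"Return only a JSON array of {len(criteria)} strings.",
        "Each string must start with an integer score no greater than the stated maximum,",
        "followed by a concise justification.",
        "",
    ]
    for i, (level, component, criterion, max_score) in enumerate(criteria, start=1):
        lines.append(f"{i}. [{level} / {component}] {criterion} (max {max_score})")
    lines += ["", "Research input:", research_text]
    return "\n".join(lines)


# Hypothetical criteria and maxima, for illustration only.
example_criteria = [
    ("ERA", "EIPS", "Risks to participants are identified and minimized", 10),
    ("ERA", "TAC", "Protocol registration and reporting plan are stated", 8),
]
prompt = build_prompt("A pilot study of ...", example_criteria)
```

Generating the numbered list from the same ordered structure that is later used for parsing means the position of each returned string can be matched back to its criterion without relying on the model to repeat criterion names.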

LLMs were selected for this pipeline for several reasons. LLMs are capable of interpreting natural language research descriptions and producing structured assessments—as shown to at least some degree of viability in domains including biomedical research evaluation (28) and rubric-based scoring of complex written outputs (29,30). Rubric-anchored LLM evaluation offers the potential to achieve productive alignment with expert human raters and to provide scalability advantages over entirely manual review processes.

Importantly, LLM responses in evaluative contexts should not be assumed to be generic or unsubstantive. Inherent to their construction is pretraining on vast scientific and biomedical corpora, through which concepts and relationships from an enormous body of prior work and relevant literature are encoded in a highly compressed, parametric form—such that the model’s judgments about research design, ethical adequacy, and methodological rigor, as well as relevant scientific knowledge, are structured predictions informed by the distillation of patterns as invoked by context-specific inputs and prompt-based elicitation. The point is not that LLMs automate consideration or judgements; rather it is that they might be put into constructive workflow for aiding and improving our evaluative capabilities of scientific justification across biomedical research lifecycles.

That said, to ensure machine-readable outputs, the model is required to produce results in a structured JSON format aligned to the rubric (Levels × Components → Criteria + Scores). This parameterized schema enforces a fixed ordering of rubric elements and allows outputs to be parsed deterministically into criterion-level explanations and scores, which are then aggregated into component-level and overall justification measures (see Appendix 1 for additional technical detail on the computational architecture, including model settings, reproducibility controls, prompt structure, and versioning). The use of a structured schema also supports inspectability, auditing, and visualization of justification profiles, although output variability is inherent in LLMs (see Appendix 2) and questions of repeat-run reproducibility and context-sensitive accuracy will be essential to address in future work.
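Because output variability is inherent in LLMs, one simple way to inspect repeat-run behavior is to summarize the spread of level scores across several runs on the same input. The sketch below is an assumption about how such a check might look, not a procedure reported in this study; the function name `run_spread` and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def run_spread(runs: list[dict[str, int]]) -> dict[str, tuple[float, float]]:
    """Return (mean, population standard deviation) of each level score across runs."""
    if not runs:
        raise ValueError("at least one run is required")
    summary = {}
    for level in runs[0]:
        scores = [run[level] for run in runs]
        summary[level] = (mean(scores), pstdev(scores))
    return summary


# Hypothetical level scores from three runs on the same input.
runs = [{"ERA": 70}, {"ERA": 74}, {"ERA": 72}]
spread = run_spread(runs)
```

A summary of this kind could help decide whether a difference between two evaluations exceeds the variation seen when the same input is simply scored again.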

Overall, this computational architecture is intended not to automate institutional scientific processes or decision-making but to structure the specification and evaluation of justificatory reasoning across the biomedical research lifecycle.

The study involved the construction of a computational prototype that can be used for manual or AI-aided assessment of research inputs, which could include human-subject and animal-subject experimental protocols or other study-related documents. In this study, however, no materials entered into the system required ethical approval or involved patient consent or permissions. Further, no prospective or retrospective experiment requiring such review or approval has been developed as part of this study. Because the system could in future be used across the biomedical research lifecycle, the authors note that those responsible for any such use must ensure that all necessary ethical requirements are met.


Results

SciJust.AI tool

To implement this computational architecture, we developed SciJust.AI—a digital tool that leverages AI to support structured evaluation of scientific justification in biomedical research lifecycles.

The prototyped tool, which is publicly available at www.scijust.ai, provides users with a direct interface (Figure 3) for interacting with the institutional scientific justification rubric and exploring its potential use for different evaluative purposes. It is important to note that this is a first version of the tool, presented to initiate an iterative process of user testing, feedback, and improvement.

Figure 3 SciJust.AI interface: (I) in the left sidebar, the user can input experimental and research lifecycle inputs (in the form of uploaded document files, URLs, or entered text). These are then evaluated and the evaluation text is parsed into a standardized JSON object representing justificatory criteria across seven levels of the research process and seven components, derived from the SciJust rubric. These are outputted into the main interface in expandable sections with per-criterion explanation. (II) The outputs also dynamically update a justification plot, scores table, and report. All of these materials can be downloaded in a timestamped format. JSON, JavaScript Object Notation; URL, Uniform Resource Locator.

At its most basic level, users can describe and manually self-evaluate an experimental protocol in accordance with the justification rubric. They can enter the Research Title, Objectives, Methods, and Outcomes—and then go through the rubric level by level and criterion by criterion. As they score each justification criterion, the interface dynamically updates an in-app visual chart, allowing users to see their evaluation results across the rubric. This functionality reflects the core aim of the framework: enabling structured evaluation of the diverse dimensions of scientific justification across the biomedical research lifecycle.

Building upon this, a user can upload an experimental protocol or other research-related document into the tool. SciJust.AI will then automatically work through the entire justification rubric, scoring the input materials against each criterion, providing a justification for each score, and visualizing the full set of results in a justification plot. Execution is rapid, generally taking less than thirty seconds.

Users can also provide relevant text or files via a URL, or submit a prompt from which to generate an example protocol—and then the SciJust.AI system will conduct a rubric-assessed evaluation of it. The resulting analysis can be downloaded as a structured report along with the generated justification plot that visualizes the entire scored rubric.

The overall result is an interactive tool that incorporates both manual and AI-assisted automated scoring to provide assessments of scientific rigor, feasibility, innovation, ethics, public health alignment, transparency, and social responsibility across the stages of the biomedical research lifecycle.

SciJust.AI evaluations

In this section, we present illustrative demonstrations of SciJust.AI evaluations. These examples are intended to show how the rubric and prototype operate when assessing different types of research descriptions and experimental contexts.

As the goal of this study is to introduce the proposed framework and prototype tool, these cases are not intended as definitive evaluations of the original protocols themselves nor as validation of the tool in institutional review, assessment, or decision-making contexts. Importantly, we are not suggesting that the tool would replace or automate processes integral to institutional science; rather, we hope to foster productive discussion of how the core aims of such processes might be developed to better standards of assessment, as well as how emerging computational and AI-based tools might best assist in that.

Moreover, and importantly, the prototype is not presented as providing an accurate evaluation in all aspects and in all cases. There could be innumerable obvious or subtle failures of evaluation, and users must be aware that the system may get something partly or even entirely wrong. Assessing these inaccuracies, misalignments, or fundamental misconceptions will require testing, feedback, iteration, calibration, and expert validation. While aspects of this have been conducted to bring the prototype to this stage, increasing accuracy across contexts would require a formal benchmarking process incorporating in-app user-assessed accuracy and failure reporting, along with ongoing fine-tuning in alignment with domain-specific users, expert panels, process-oriented trials, and institutional deployments. This entails a product roadmap of ongoing design and development cycles and deployed improvements. We hope to proceed in such directions in future work.

At this stage, then, we put forward what is explicitly a prototype for testing and feedback, not a product with guaranteed performance. We do not claim that it is finished or that it has solved scientific justification. The examples included in this study therefore serve to illustrate how the rubric structures analysis of scientific justification across diverse scenarios and how the resulting evaluations are presently interpreted across different justification components and research lifecycle stages. We present an end-to-end implementation of our scientific justification rubric, executed through an AI-aided computational architecture in a user-facing prototype that we are publicly releasing for open testing. While these evaluations may prompt speculation about possible uses, including valuable deployment, any reliance on the tool for such uses will require further advances and work.

Accordingly, the illustrative examples are generated by inputting into SciJust.AI a range of publicly available historical studies, published research descriptions, and prompted protocol summaries. These were selected to span notable and controversial examples across different biomedical contexts to show how the rubric executes and to spark consideration of how it might be applied across different stages of the biomedical research lifecycle. In the initial phase it was not possible to acquire actual research protocols due to institutional requirements and protections. In future work, we hope to partner with institutional bodies to access such materials and to evaluate them across different contexts and uses.

Our illustrative set includes:

  • Poliovirus vaccine field trial (uploaded BMJ review paper) (31);
  • Framingham heart study (provided Wikipedia URL) (32);
  • Inner speech in motor cortex and implications for speech neuroprostheses (provided PubMed URL) (33);
  • Public perceptions of free-roaming dogs and cats in India and the United States (uploaded JAAWS research paper) (34);
  • Evaluating the efficacy of a new antihypertensive drug in adults (provided prompt: “draft protocol that is pretty good but could use some improvement”);
  • Cytokine storm in a phase 1 trial of the anti-CD28 monoclonal antibody TGN1412 (uploaded NEJM paper) (35);
  • Rofecoxib (Vioxx) Clinical trials and post-marketing surveillance (provided prompt: “Vioxx”);
  • Birth of twins after genome editing for HIV resistance (uploaded MIT Tech Rev article) (36);
  • Nazi medical experiments (provided prompt: “Nazi medical experiments”).

Across these cases—which include landmark successes (poliovirus and Framingham heart study), a widely recognized failure of design and monitoring (TGN1412), a case of serious ethical violation (CRISPR-edited babies), cases from diverse research types (animal welfare, speech neuroprosthetics), as well as prompts involving recalled pharmaceuticals, historically criminal medical experiments, and loosely entered example protocol generation—the SciJust.AI tool executes to produce structured evaluations that translate varying research descriptions into scored justification criteria in accordance with the structure of our rubric.

The resulting outputs include:

  • Generated scores and textual explanations for every criterion—accessible by expanding the relevant level and component section within the interface.
  • A summary of the main components of the study.
  • A summary of the key takeaways of the evaluation—including aggregate score, points of relevance, and potential areas of improvement.
  • A justification plot providing a visual overview of scores, identified per level and per component and color-coded with a labeled key.

In addition to the interactive interface, the results can be downloaded as a visualized plot, a cumulative spreadsheet of all criteria and scores, and a comprehensive text document structured as a report. The report includes the title of the research evaluated, the source of the input material, the timestamp of the evaluation, the aggregate score, key takeaways and improvement points, a research summary, and a complete listing of all 67 criteria organized by level and by component, each with its evaluated score and generated explanation.

Together these illustrative evaluation materials aim to demonstrate how the framework and prototype already produce structured output in web-based interactive form and in downloadable report and data file formats. These may have the potential to support assessment processes for different stakeholders across the biomedical research lifecycle, especially if tested and calibrated in future validated deployments.

Below, Figures 4,5 and Table 1 present a single-run evaluation of the 1954 poliovirus vaccine field trials, and Figure 6 and Table 2 present a single-run evaluation of the 2018 genome-edited babies case.

Figure 4 SciJust.AI evaluation for 1954 poliovirus vaccine field trials research review. In the left sidebar is an automatically generated summary of key components (research objective, research methods, expected outcomes) from the uploaded paper. On the right is an expanded view of the initial research conceptualization section, showing two of the ten criteria in this section, and the dynamically updated evaluated score and explanation.
Figure 5 SciJust.AI evaluation for 1954 poliovirus vaccine field trials research review. At top is the dynamically updated justification plot, with overall score composed of per-criteria component scores at each research level. Underneath that is a generated summary of the overall evaluation with key points of evaluation and potential improvement.

Table 1

Evaluation scores for all criteria, 1954 poliovirus vaccine field trials research review

Section Criterion Option Score Max score
Initial research conceptualization Scientific validity and rigor Clearly defined research question 10 10
Scientific validity and rigor Strong theoretical framework 7 7
Scientific validity and rigor Literature review conducted 7 7
Feasibility and resource utilization Feasibility of the research plan 7 7
Feasibility and resource utilization Availability of resources 4 4
Innovation and impact potential Novelty of research question 13 13
Innovation and impact potential Potential for significant impact 11 11
Ethical integrity and patient safety Consideration of ethical implications 7 7
Alignment with public health and policy goals Alignment with public health goals 13 13
Transparency, accountability, and compliance Transparent research objectives 8 8
Social responsibility and public trust Research addresses societal needs 13 13
Ethical and regulatory approval Scientific validity and rigor Ethical approval sought 5 5
Scientific validity and rigor Adherence to scientific rigor 5 5
Feasibility and resource utilization Ethical review of resource utilization 5 5
Innovation and impact potential Ethical review of innovative methods 4 4
Ethical integrity and patient safety Thorough ethical review conducted 17 17
Ethical integrity and patient safety Patient safety considered 16 16
Alignment with public health and policy goals Compliance with public health regulations 12 12
Transparency, accountability, and compliance Compliance with ethical standards 14 14
Transparency, accountability, and compliance Clear accountability mechanisms 8 8
Social responsibility and public trust Ethical review includes social responsibility 14 14
Funding and resource allocation Scientific validity and rigor Resources allocated based on scientific need 6 6
Feasibility and resource utilization Efficient use of allocated funds 22 22
Feasibility and resource utilization Minimization of resource waste 18 18
Innovation and impact potential Funding allocated to innovative aspects 8 8
Ethical integrity and patient safety Ethical use of funds 6 6
Alignment with public health and policy goals Funding aligns with public health priorities 14 14
Transparency, accountability, and compliance Transparent allocation of resources 10 10
Social responsibility and public trust Socially responsible use of funds 16 16
Preclinical and clinical trials Scientific validity and rigor Trials designed with scientific rigor 6 6
Scientific validity and rigor Clear criteria for trial success 6 6
Feasibility and resource utilization Feasibility of trial implementation 9 9
Feasibility and resource utilization Resource management during trials 8 8
Innovation and impact potential Innovative trial designs 8 8
Innovation and impact potential Potential for high-impact results 7 7
Ethical integrity and patient safety Ethical conduct during trials 9 13
Ethical integrity and patient safety Safety protocols in place 11 13
Alignment with public health and policy goals Trials address public health needs 9 11
Transparency, accountability, and compliance Accountability in trial conduct 6 8
Social responsibility and public trust Socially responsible trial conduct 10 11
Data collection and analysis Scientific validity and rigor Data collection methods are scientifically sound 7 7
Scientific validity and rigor Statistical methods are appropriate 6 6
Feasibility and resource utilization Feasibility of data collection methods 9 10
Innovation and impact potential Use of innovative data collection methods 6 7
Innovation and impact potential Novel data analysis techniques 6 6
Ethical integrity and patient safety Ethical data collection methods 9 10
Ethical integrity and patient safety Protection of patient data 11 13
Alignment with public health and policy goals Data collection supports public health 10 13
Transparency, accountability, and compliance Transparent data collection methods 9 9
Transparency, accountability, and compliance Compliance with data standards 11 11
Social responsibility and public trust Data collection considers social impact 6 8
Peer review and publication Scientific validity and rigor Peer-reviewed publication targeted 13 13
Scientific validity and rigor High-impact journal targeted 11 11
Feasibility and resource utilization Resources available for publication process 6 6
Innovation and impact potential Targeting innovative research outlets 6 9
Ethical integrity and patient safety Ethical considerations in publication 10 10
Alignment with public health and policy goals Publication relevant to public health 19 19
Transparency, accountability, and compliance Transparent publication process 11 11
Social responsibility and public trust Publication considers public trust 6 21
Public communication and policy impact Scientific validity and rigor Scientific communication to public is clear 6 6
Feasibility and resource utilization Feasibility of public communication plan 6 6
Innovation and impact potential Potential for policy impact 8 12
Innovation and impact potential Innovative public communication strategies 11 11
Ethical integrity and patient safety Ethical public communication 11 13
Alignment with public health and policy goals Policy impact is a focus 19 19
Transparency, accountability, and compliance Transparent public communication 14 14
Social responsibility and public trust Public communication fosters trust 6 19
Figure 6 SciJust.AI evaluation for 2018 birth of twins after genome editing for HIV resistance research. At top is the dynamically updated evaluation plot, with overall score composed of per-criteria component scores at each research level. Underneath that is a generated summary of the overall evaluation with key points of evaluation and potential improvement. HIV, human immunodeficiency virus.

Table 2

Evaluation scores for all criteria for 2018 birth of twins after genome editing for HIV resistance research

Section Criterion Option Score Max score
Initial research conceptualization Scientific validity and rigor Clearly defined research question 3 10
Scientific validity and rigor Strong theoretical framework 2 7
Scientific validity and rigor Literature review conducted 2 7
Feasibility and resource utilization Feasibility of the research plan 2 7
Feasibility and resource utilization Availability of resources 3 4
Innovation and impact potential Novelty of research question 5 13
Innovation and impact potential Potential for significant impact 4 11
Ethical integrity and patient safety Consideration of ethical implications 2 7
Alignment with public health and policy goals Alignment with public health goals 2 13
Transparency, accountability, and compliance Transparent research objectives 3 8
Social responsibility and public trust Research addresses societal needs 2 13
Ethical and regulatory approval Scientific validity and rigor Ethical approval sought 1 5
Scientific validity and rigor Adherence to scientific rigor 2 5
Feasibility and resource utilization Ethical review of resource utilization 1 5
Innovation and impact potential Ethical review of innovative methods 1 4
Ethical integrity and patient safety Thorough ethical review conducted 1 17
Ethical integrity and patient safety Patient safety considered 1 16
Alignment with public health and policy goals Compliance with public health regulations 1 12
Transparency, accountability, and compliance Compliance with ethical standards 1 14
Transparency, accountability, and compliance Clear accountability mechanisms 1 8
Social responsibility and public trust Ethical review includes social responsibility 1 14
Funding and resource allocation Scientific validity and rigor Resources allocated based on scientific need 2 6
Feasibility and resource utilization Efficient use of allocated funds 2 22
Feasibility and resource utilization Minimization of resource waste 2 18
Innovation and impact potential Funding allocated to innovative aspects 3 8
Ethical integrity and patient safety Ethical use of funds 1 6
Alignment with public health and policy goals Funding aligns with public health priorities 1 14
Transparency, accountability, and compliance Transparent allocation of resources 1 10
Social responsibility and public trust Socially responsible use of funds 1 16
Preclinical and clinical trials Scientific validity and rigor Trials designed with scientific rigor 1 6
Scientific validity and rigor Clear criteria for trial success 1 6
Feasibility and resource utilization Feasibility of trial implementation 2 9
Feasibility and resource utilization Resource management during trials 2 8
Innovation and impact potential Innovative trial designs 2 8
Innovation and impact potential Potential for high-impact results 2 7
Ethical integrity and patient safety Ethical conduct during trials 1 13
Ethical integrity and patient safety Safety protocols in place 1 13
Alignment with public health and policy goals Trials address public health needs 1 11
Transparency, accountability, and compliance Accountability in trial conduct 1 8
Social responsibility and public trust Socially responsible trial conduct 1 11
Data collection and analysis Scientific validity and rigor Data collection methods are scientifically sound 2 7
Scientific validity and rigor Statistical methods are appropriate 2 6
Feasibility and resource utilization Feasibility of data collection methods 2 10
Innovation and impact potential Use of innovative data collection methods 2 7
Innovation and impact potential Novel data analysis techniques 2 6
Ethical integrity and patient safety Ethical data collection methods 1 10
Ethical integrity and patient safety Protection of patient data 1 13
Alignment with public health and policy goals Data collection supports public health 1 13
Transparency, accountability, and compliance Transparent data collection methods 1 9
Transparency, accountability, and compliance Compliance with data standards 1 11
Social responsibility and public trust Data collection considers social impact 1 8
Peer review and publication Scientific validity and rigor Peer-reviewed publication targeted 1 13
Scientific validity and rigor High-impact journal targeted 1 11
Feasibility and resource utilization Resources available for publication process 1 6
Innovation and impact potential Targeting innovative research outlets 1 9
Ethical integrity and patient safety Ethical considerations in publication 1 10
Alignment with public health and policy goals Publication relevant to public health 1 19
Transparency, accountability, and compliance Transparent publication process 1 11
Social responsibility and public trust Publication considers public trust 1 21
Public communication and policy impact Scientific validity and rigor Scientific communication to public is clear 1 6
Feasibility and resource utilization Feasibility of public communication plan 1 6
Innovation and impact potential Potential for policy impact 1 12
Innovation and impact potential Innovative public communication strategies 1 11
Ethical integrity and patient safety Ethical public communication 1 13
Alignment with public health and policy goals Policy impact is a focus 1 19
Transparency, accountability, and compliance Transparent public communication 1 14
Social responsibility and public trust Public communication fosters trust 1 19

HIV, human immunodeficiency virus.

(For full evaluation materials, including plots, tables, and reports for all nine illustrative examples, please see Appendix 3 and the supplementary material available at: https://cdn.amegroups.cn/static/public/jmai-2025-1-246-2.zip).


Discussion

Key findings

The intersection of biomedical knowledge and AI is a central storyline of our times. While there is immense interest in using AI in the life sciences and healthcare, it is only recently that this has begun to extend into the processes of institutional science itself (37-41). These possibilities may be practically relevant in institutional review environments and procedural settings where committees must synthesize methodological, ethical, and risk-benefit considerations under conditions of limited time, variable expertise, or inconsistent evaluative structure, and they should also be explored for use across biomedical research lifecycles. Our method for computational assessment of scientific justification across all levels of operation and impact, from experimental design to protocol assessment to funding reviews to public and policy communication, puts forward a proposed framework for AI-assisted processes and outcomes across institutional science. We believe this is essential if the full promise of AI for Science is to be realized (42,43).

Strengths and limitations

Importantly, the quantification of justification does not aim to replace the human deliberative process but to augment it with a clear, structured reference that can be established and managed by institutions themselves. While we present one version of the rubric above, the specific scoring breakdowns and the criteria that comprise the different components and levels of justification relevant to scientific protocols can be updated and adapted. This does not mean that today’s variability would necessarily be reproduced. Rather, our point is that by providing a structured justification rubric, a preliminary model of levels, components, criteria, and scores, and an easy-to-test tool, we hope to provide a useful way for experimenters, evaluators, and institutions to work toward common scientific justification standards in practice. To that end, we are actively engaging partners to trial, validate, and refine the methods introduced here.

There are, however, important limitations and areas that require further investigation and validation. AI-assisted scoring depends on both the precision of the rubric and the model’s capacity to accurately interpret and evaluate research inputs. The granularity of the rubric-based definitions of scoring criteria, including as instructed by prompt in the computational processing of the prototype, is fundamental. While LLMs have powerful capabilities of high-level conceptual alignment, a more detailed description of what each criterion’s scores mean, from minimum to maximum, could make the scoring process more transparent, objective, and accurate. At the same time, overly rigid evaluative criteria might risk stifling methodological innovation, just as overly broad criteria may leave inconsistency unresolved. Striking the right balance between conceptual guidance and criteria description will be important to investigate in future work, especially to identify what is most productive for human deliberation with AI-aided use. It should also be said that AI outputs may reflect embedded biases in training data or rubric design, underscoring the need for human oversight. As already noted, run-to-run variation may occur in scoring, so repeated runs should be conducted and comparatively tracked to identify broad evaluation patterns indicative of consistency, and to distinguish acceptable from unacceptable variation. Implementing multi-run comparability should be a top priority in future work developing the prototype toward a product.

Furthermore, while SciJust.AI is not an automated solution for resolving the incommensurability of values across different justification factors, there is a risk that its presentation of 0–100 normalized level scores, 0–700 aggregate scores, and interpretive labels for aggregate scores (“terrible” to “excellent”) could encourage reductive thinking or be treated as shorthand for decision-making. As such, deliberators using SciJust must make transparent which overarching values and which complex of multidimensional criteria they are appealing to when comparing different evaluations (44). Otherwise, there could be issues of evaluative overdetermination if, for example, different protocols whose overlapping scores derive from different experimental and procedural components were not properly distinguished beyond their aggregate scores. For these reasons, institutional scientific justification rubrics and AI-assisted institutional review should be treated as procedural aids for tracking the complexities of such considerations, not as reductive substitutes that automate deciding authority. That is to say: like a magnifying lens, not a rubber stamp.
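The risk of reading aggregate scores as shorthand can be made concrete with a small sketch. It assumes, as the score ranges above imply, that the aggregate is the sum of seven level scores, each normalized to 0–100. The two profiles are hypothetical values invented for illustration, not SciJust.AI outputs, and the function names are likewise assumptions.

```python
# Level names as used in the rubric tables.
LEVELS = [
    "Initial research conceptualization",
    "Ethical and regulatory approval",
    "Funding and resource allocation",
    "Preclinical and clinical trials",
    "Data collection and analysis",
    "Peer review and publication",
    "Public communication and policy impact",
]

# Hypothetical level scores (0-100 each), invented for illustration only.
PROFILE_A = dict(zip(LEVELS, [70, 70, 70, 70, 70, 70, 70]))
PROFILE_B = dict(zip(LEVELS, [90, 10, 90, 70, 80, 75, 75]))


def aggregate_score(profile):
    """Sum the seven normalized level scores into a 0-700 aggregate."""
    if set(profile) != set(LEVELS):
        raise ValueError("profile must contain exactly the seven rubric levels")
    for level, value in profile.items():
        if not 0 <= value <= 100:
            raise ValueError(f"level score out of range: {level!r}")
    return sum(profile[level] for level in LEVELS)


def largest_gaps(a, b, top=1):
    """Return the levels where two profiles differ most, as (level, a - b)."""
    ranked = sorted(LEVELS, key=lambda level: abs(a[level] - b[level]), reverse=True)
    return [(level, a[level] - b[level]) for level in ranked[:top]]


print(aggregate_score(PROFILE_A), aggregate_score(PROFILE_B))
print(largest_gaps(PROFILE_A, PROFILE_B))
```

Both profiles total 490 of 700, yet the second scores only 10 at the ethical and regulatory approval level. A comparison that stops at the aggregate would miss this difference, which is why the per-level and per-component profile should accompany any aggregate.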

Additional limitations arise from the current stage of prototype development and deployment.

The framework and prototype depend on the quality and completeness of the research descriptions provided as inputs. Incomplete or selectively reported information may influence the resulting evaluation profiles, including in ways not exposed to users. For example, URL-based inputs retrieve and extract content from the provided link, but not all websites function equally well with this feature as currently implemented, which may limit evaluation depth, cause misrecognition (if incorrect content is retrieved), or block it entirely (in which case an error message is provided). This could be addressed by providing more extensive input guidance or content retrieval notifications to users within the prototype interface.

Further, because the implementation relies on LLMs, outputs will exhibit variability depending on prompt formulation, model configuration, and evolving system capabilities. This is inherent in their use, and so future work should involve enabling multi-run execution with comparative analysis for identifying patterns of stability, variation, and other loci of interpretive interest.
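A minimal sketch of such multi-run comparison follows, assuming each run is available as a mapping from criterion name to score. The run data, the `summarize_runs` helper, and the tolerance of two points are all hypothetical; what counts as acceptable variation has not been defined in this study and would need to be calibrated.

```python
from statistics import mean, pstdev

# Hypothetical scores from three repeated runs on the same input (invented values).
RUNS = [
    {"Patient safety considered": 15, "Novelty of research question": 9},
    {"Patient safety considered": 16, "Novelty of research question": 12},
    {"Patient safety considered": 15, "Novelty of research question": 5},
]


def summarize_runs(runs, max_range=2):
    """Summarize per-criterion spread across runs and flag unstable criteria.

    max_range is an assumed tolerance: a criterion is treated as stable when
    the gap between its highest and lowest score does not exceed it.
    """
    if len(runs) < 2:
        raise ValueError("at least two runs are needed to compare")
    names = list(runs[0])
    if any(set(run) != set(names) for run in runs):
        raise ValueError("all runs must score the same criteria")
    summary = {}
    for name in names:
        values = [run[name] for run in runs]
        spread = max(values) - min(values)
        summary[name] = {
            "mean": mean(values),
            "stdev": pstdev(values),  # population standard deviation of the runs
            "range": spread,
            "stable": spread <= max_range,
        }
    return summary


summary = summarize_runs(RUNS)
unstable = [name for name, stats in summary.items() if not stats["stable"]]
print(unstable)
```

Reporting the range alongside the mean keeps a single outlying run visible, which a mean alone could hide.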

The prototype does not currently incorporate automated retrieval, assessment, or citation of literature relevant to the study or research input materials. Doing so as an additional part of the evaluative workflow should be investigated and could further strengthen justification assessments in future versions, as well as provide increased confidence to users. Additionally, domain-specific ethical, regulatory, funding, or process-related frameworks, for example the precise logic of human and animal subject protections, are not explicitly instructed or specified in the prototype. Future versions might implement this and also enable citation-related justifications.

This speaks to a particularly important point: the prototype is not presented as getting the right answer in all cases. There could be innumerable obvious or subtle failures of evaluation. Such failures may help reveal the current performance of LLMs on questions of scientific justification, an important task in itself, and may inform approaches by which both the underlying models and AI-enabled decision-aid applications could be improved for greater conceptual accuracy and practical utility.

Lastly, the current user interface aims to present a large amount of input and output data in a legible and navigable format, but there are limitations of both design and technical execution in the current prototype. These might be improved so that the tool is easier to understand, scrutinize, and use.

With all this noted, it is clear that iterative development, stakeholder feedback, and process-integrated calibration, including through validation trials with institutional partners, will be required to ensure applicability, adaptability, and accuracy across the research lifecycle and domain-specific contexts. Our work presents an initial end-to-end framework and prototype that provides a plausible foundation for such efforts and improvements.

Comparison with similar research

AI is being increasingly applied across the scientific enterprise, with significant developments in areas such as structural biology and drug discovery (45,46). More recently, applications to research ethics review and protocol evaluation have been put forward (37-41). These efforts have focused primarily on potential use or procedural review tasks at particular stages of the research process (47). We believe SciJust.AI and such approaches could ultimately be complementary, or even integrated. We nonetheless take a parallel and more expansive approach in this research because we believe that (I) scientific justification can be better delineated in its conceptual dimensions; (II) this delineation could be applied across the biomedical research lifecycle to a greater degree than is currently the case, in which different senses of justification are often procedurally siloed; and (III) LLMs and AI-aided evaluation might be implemented in a prototyped way that could help start a process of open testing and feedback on what works and what does not, with the aim of identifying iterative improvements and motivating further development, deployment, and validation efforts, including with institutional partners.

To the best of our knowledge, prior work has yet to propose a comprehensive, multi-dimensional, evaluative framework for institutional scientific justification that is computationally operationalized. Our study presents a novel, preliminary approach (model, rubric, and tool) toward this goal, across the full biomedical research lifecycle, with initial evaluation results demonstrating its feasibility and highlighting areas for future refinement.

Explanations of findings

SciJust.AI is at an early stage of development, but even in its preliminary conceptual model and prototyped application, it aims to demonstrate the potential viability of a decision-support aid for investigators, reviewers, funders and institutions across the biomedical research lifecycle. For experimenters, the goal is to help improve experimental design by fostering structured reflection on why a study should be conducted and what would justify it. In doing so, the tool aims to support the development of stronger, more rigorous experimental designs from the outset and over the duration of projects—helping scientists model and develop the most defensible and well-justified version of their biomedical research. Applicants could then iteratively assess their proposals against the rubric, identify weaknesses, and strengthen their justification before submission. Reviewers, funders, institutions, and policymakers could use the same framework to align their evaluations, promoting shared standards of what counts as a well-justified scientific study. In this way, AI-assisted review processes could reduce bureaucratic burden while helping raise the quality, not just the compliance, of experimental science.

Example use-cases include evaluating justification strength at the proposal stage, before and after review, and during and after experimental execution; identifying weak, unclear, or underspecified elements throughout; assisting committee and stakeholder deliberation; benchmarking justification over research project milestones; and training students and researchers in high quality experimental design.
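As a purely illustrative sketch of two of these use-cases (flagging weak elements and benchmarking across milestones), the rubric can be pictured as a grid of seven stages by seven components. The stage and component names below are hypothetical placeholders, the 0 to 5 scale and the threshold are assumptions made for the example, and none of this is presented as the SciJust.AI implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder labels: the actual stage and component names
# are defined by the rubric itself, not by this sketch.
STAGES = [f"stage_{i}" for i in range(1, 8)]
COMPONENTS = [f"component_{j}" for j in range(1, 8)]

# One score per (stage, component) cell; a 0-5 scale is assumed here.
Scores = Dict[Tuple[str, str], float]


def weak_elements(scores: Scores, threshold: float) -> List[Tuple[str, str, float]]:
    """Return (stage, component, score) for cells below threshold, weakest first."""
    flagged = [(s, c, v) for (s, c), v in scores.items() if v < threshold]
    return sorted(flagged, key=lambda cell: cell[2])


def stage_means(scores: Scores) -> Dict[str, float]:
    """Average the component scores within each stage that has any scores."""
    means: Dict[str, float] = {}
    for stage in STAGES:
        values = [v for (s, _), v in scores.items() if s == stage]
        if values:
            means[stage] = sum(values) / len(values)
    return means


if __name__ == "__main__":
    # Made-up example data: every cell scores 4.0 except two weak cells.
    example = {(s, c): 4.0 for s in STAGES for c in COMPONENTS}
    example[("stage_2", "component_3")] = 1.0
    example[("stage_5", "component_1")] = 2.0
    for stage, component, value in weak_elements(example, threshold=3.0):
        print(stage, component, value)
    print(stage_means(example))
```

Recomputing the per-stage means at successive project milestones would give one simple way to track how justification changes over time, under the same assumptions.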

Over time, longitudinal use of this rubric has the potential to reveal systemic patterns in justification practices, providing feedback not only to researchers and reviewers, but also to funders, policy makers, and institutional stakeholders seeking to improve the quality and impact of scientific research. Ultimately, the goal of well-instituted scientific justification is to drive greater consideration of how to design and execute good experiments with the most beneficial results.

Implications and actions needed

Next steps include calibrating and testing the scientific justification rubric, identifying and implementing stakeholder feedback, piloting versions of the SciJust.AI tool with institutional partners, and publishing validation studies examining the reliability and performance of AI-assisted scoring across different operational contexts. Planned technical expansions for SciJust.AI include expanded user and team capabilities, workspace integration with protocol submission systems, and analytic dashboards. This product-led approach aims to position scientific justification as a continuous and measurable dimension of the scientific method itself, and to work to develop an instrument for helping people improve it.
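To illustrate what a reliability check in such validation studies might involve, the sketch below computes Cohen's kappa between human and AI-assisted scores assigned to the same rubric items. Cohen's kappa is one common agreement statistic chosen here for illustration; the text above does not specify which metrics will be used, and the score lists in any real study would come from actual reviewers and tool outputs.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning categorical scores to the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    # If both raters used one identical category throughout, agreement is
    # trivially perfect and the usual formula would divide by zero.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up scores for six rubric items, for illustration only.
    human_scores = [1, 2, 3, 3, 2, 1]
    ai_scores = [1, 2, 3, 2, 2, 1]
    print(cohens_kappa(human_scores, ai_scores))
```

Because rubric scores are ordinal, a weighted kappa or an intraclass correlation may be a better fit in practice; the unweighted form is used here only to keep the sketch short.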


Conclusions

Scientific justification is the foundation upon which the authority and legitimacy of institutional science rest. In biomedicine, where decisions directly affect lives, the consistency and transparency of justification processes across the entirety of research lifecycles, from initial concept to ultimate impact, shape both research quality and public trust.

We put forward a preliminary evaluative framework for science to systematically measure and compare the justification of experimental protocols across domains, timeframes, and institutions. By providing a shared and adaptable means to track justification across experimental protocols, such a framework could help improve investigative aims, enrich and inform institutional review, and enable more considered and efficient decision-making for stakeholders, including researchers, oversight committees, funding agencies, and the public.

We argue for the systematic specification, formalization, and computation of scientific justification. The scientific justification rubric and SciJust.AI tool presented here offer a path toward making justification more measurable, comparable, and applicable at scale. Embedding such tools in institutional workflows could strengthen decision-making and support productive alignments of professional and public reasoning.

A central commitment of the scientific method is to improve itself by operationalizing better investigations that more reliably produce accurate results. Scientific justification, in all its components and criteria, and across all levels of research, is not separate or detached from this objective of experimental science—it is its core principle.

The true aim of scientific justification is to serve as the guidance mechanism for targeting scientific knowledge. It is the source from which validity derives its veridicality. It is in this sense, as a philosophical, methodical, and practical commitment to the principles of the scientific method, that scientific justification should be developed as a domain of scientific inquiry and application in its own right.

If developed and implemented productively, the approach presented here could contribute to improving the degree to which research carried out in the name of science is justified in practice.


Acknowledgments

Artificial intelligence tools were used for assistance in identifying references and in identifying areas for manuscript revision based on reviewer comments.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-1-246/prf

Funding: None.

Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-1-246/coif). I.H. serves as an unpaid editorial board member of Journal of Medical Artificial Intelligence from June 2025 to May 2027. I.H. was invited to speak about the ethics of AI for pharmaceutical drug development and the use of AI chatbots for an internal company meeting at Bayer International. This was an online meeting, and he received a modest honorarium for his time. The other author has no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study involved the construction of a computational prototype which can be used for manual or AI-aided assessment of research inputs, which could include human subject and animal subject experiment protocols or other study-related documents; however, in this study, no materials inputted into this system required ethical approval or involved patient consent or permissions; further, at this stage there is no prospective or retrospective experiment that has been developed as part of this study that would require such review or approval; as the system could in the future have use across the biomedical research lifecycle, the authors note that it will be important to ensure any and all necessary ethical requirements in case of such utilization are adhered to by those responsible.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

A recent study (21), conducted by a team including one of the authors, examined IRB chairs’ attitudes about the perceived difficulty of, and their preparedness for, assessing risk-benefit analyses of early phase and first-in-human neurology trials. A key finding was that one-third of IRB chairs did not believe their IRB was “very prepared” to conduct risk-benefit analysis for early phase trials. Furthermore, more than two-thirds of IRB chairs reported that it would be mostly or very helpful to have additional resources when conducting risk-benefit analysis for early phase trials in neurology. For example, 77% felt that having a standardized process for conducting risk-benefit analysis would be mostly or very helpful. The same percentage (77%) felt similarly about having a computer program that summarizes the published literature on the benefits and potential risks of the drugs or biologics being studied, and about having specific guidance from the Office for Human Research Protections (OHRP). We take this as indicative of the desirability of computer-aided assessment of such relevant dimensions of scientific justification along the biomedical research lifecycle, including at procedural stages of review and approval.


References

  1. "Scientific Method". New Oxford American Dictionary. 3rd ed. Oxford University Press; 2010.
  2. Metascience can improve science - but it must be useful to society, too. Nature 2025;643:304.
  3. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124.
  4. Munafò MR, Nosek BA, Bishop DVM, et al. A manifesto for reproducible science. Nat Hum Behav 2017;1:0021.
  5. Anonymous [Senior NIH-funded investigator]; Buck S, ed. A top scientist’s ideas as to NIH. The Good Science Project. November 22, 2025. Available online: https://goodscience.substack.com/p/a-top-scientists-ideas-as-to-nih
  6. Mahajan A. Things I learned talking to the new breed of scientific institution. Owl Posting [Good Science Project]. August 26, 2024. Available online: https://www.owlposting.com/p/things-i-learned-talking-to-the-new
  7. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. Washington, DC: U.S. Department of Health, Education, and Welfare; 1979.
  8. U.S. Department of Health and Human Services. Protection of Human Subjects. 45 CFR 46. 2018 revision. Available online: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-A/part-46
  9. National Institutes of Health. Guidance: Rigor and Reproducibility in Grant Applications. 2024. Available online: https://grants.nih.gov/policy-and-compliance/policy-topics/reproducibility/guidance
  10. International Society for Stem Cell Research. Laboratory-based human embryonic stem cell research, embryo research, and related research activities. In: 2025 Guidelines for Stem Cell Research and Clinical Translation. ISSCR; 2025.
  11. National Academies of Sciences, Engineering, and Medicine. Fostering Integrity in Research. Washington, DC: National Academies Press; 2017.
  12. National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. Washington, DC: National Academies Press; 2019.
  13. Abbott L, Grady C. A systematic review of the empirical literature evaluating IRBs: what we know and what we still need to learn. J Empir Res Hum Res Ethics 2011;6:3-19.
  14. Lynch HF, Abdirisak M, Bogia M, et al. Evaluating the Quality of Research Ethics Review and Oversight: A Systematic Analysis of Quality Assessment Instruments. AJOB Empir Bioeth 2020;11:208-22.
  15. Scherzinger G, Bobbert M. Evaluation of Research Ethics Committees: Criteria for the Ethical Quality of the Review Process. Account Res 2017;24:152-76.
  16. Nicholls SG, Hayes TP, Brehaut JC, et al. A Scoping Review of Empirical Research Relating to Quality and Effectiveness of Research Ethics Review. PLoS One 2015;10:e0133639.
  17. Lynch HF, Eriksen W, Clapp JT. "We measure what we can measure": Struggles in defining and evaluating institutional review board quality. Soc Sci Med 2022;292:114614.
  18. Silberman G, Kahn KL. Burdens on research imposed by institutional review boards: the state of the evidence and its implications for regulatory reform. Milbank Q 2011;89:599-627.
  19. U.S. Government Accountability Office. Institutional Review Boards: Actions Needed to Improve Federal Oversight and Examine Effectiveness. GAO-23-104721. Washington, DC: GAO; 2023.
  20. Jonlin EC, Fujita M, Isasi R, et al. What does "appropriate scientific justification" mean for the review of human pluripotent stem cell, embryo, and related research? Stem Cell Reports 2025;20:102479.
  21. Baugh CM, Bolcic-Jankovic D, Fedyk M, et al. Challenges of Conducting Risk-Benefit Analysis of Early Phase Clinical Trials: Results of a National Survey of IRB Chairs. Ethics Hum Res 2025;47:2-12.
  22. Chang R, ed. Incommensurability, Incomparability, and Practical Reason. Cambridge, MA: Harvard University Press; 1997.
  23. Oreskes N. Why Trust Science? Princeton, NJ: Princeton University Press; 2019.
  24. Daniels N. Justice and Justification: Reflective Equilibrium in Theory and Practice. Cambridge University Press; 1996.
  25. National Academies of Sciences, Engineering, and Medicine. Human Genome Editing: Science, Ethics, and Governance. Washington, DC: National Academies Press; 2017.
  26. Coleman CH, Menikoff JA, Goldner JA, et al. The Ethics and Regulation of Research with Human Subjects. 2nd ed. Durham, NC: Carolina Academic Press; 2015.
  27. Applied Research Ethics National Association (ARENA); Office of Laboratory Animal Welfare (OLAW), NIH. Institutional Animal Care and Use Committee Guidebook. 2nd ed. Bethesda, MD: National Institutes of Health; 2002. Available online: https://grants.nih.gov/grants/olaw/guidebook.pdf
  28. Pitre T, Jassal T, Talukdar JR, et al. ChatGPT for assessing risk of bias of randomized trials using the RoB 2.0 tool: a methods study. medRxiv 2023.
  29. Chiang CH, Lee H. Can large language models be an alternative to human evaluations? In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. ACL; 2023:15607-31.
  30. Kim S, Shin J, Cho Y, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In: Proceedings of the Twelfth International Conference on Learning Representations. ICLR; 2024.
  31. Meldrum M. "A calculated risk": the Salk polio vaccine field trials of 1954. BMJ 1998;317:1233-6.
  32. Wikipedia contributors. Framingham Heart Study. Wikipedia. Available online: https://en.wikipedia.org/wiki/Framingham_Heart_Study
  33. Suntharalingam G, Perry MR, Ward S, et al. Cytokine storm in a phase 1 trial of the anti-CD28 monoclonal antibody TGN1412. N Engl J Med 2006;355:1018-28.
  34. Sensharma R, Reinhard CL, Powell L, et al. Public perceptions of free-roaming dogs and cats in India and the United States. J Appl Anim Welf Sci 2025;28:563-77.
  35. Kunz EM, Abramovich Krasa B, Kamdar F, et al. Inner speech in motor cortex and implications for speech neuroprostheses. Cell 2025;188:4658-4673.e17.
  36. Regalado A. China's CRISPR babies: Read exclusive excerpts from the unseen original research. MIT Technology Review. December 3, 2019. Available online: https://www.technologyreview.com/2019/12/03/131752/chinas-crispr-babies-read-exclusive-excerpts-he-jiankui/
  37. Ho KYS, Sam V, Shah J, et al. AI/ML applications for scientific review assistance. U.S. Food and Drug Administration, Center for Scientific Review; 2025.
  38. Ho KYS, Sam V, Shah J, et al. AI/ML applications for scientific review assistance. [research poster]. Leidos and National Institutes of Health, Center for Scientific Review.
  39. U.S. Food and Drug Administration. FDA launches agency-wide AI tool to optimize performance for the American people. FDA Press Announcements; June 2, 2025.
  40. Porsdam Mann S, Seah JJ, Latham S, et al. Chat-IRB? How application-specific language models can enhance research ethics review. J Med Ethics 2025;jme-2025-110845.
  41. Checco A, Bracciale L, Loreti P, et al. AI-assisted peer review. Humanit Soc Sci Commun 2021;8:25.
  42. Wang H, Fu T, Du Y, et al. Scientific discovery in the age of artificial intelligence. Nature 2023;620:47-60.
  43. Gao S, Fang A, Huang Y, et al. Empowering biomedical discovery with AI agents. Cell 2024;187:6125-51.
  44. Hyun I. Bioethics and the Future of Stem Cell Research. Cambridge University Press; 2013.
  45. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583-9.
  46. Atanasov D, Zanichelli N, Denain JS. Expanding our analysis of biological AI models. Epoch AI Technical Report. February 20, 2026. Available online: https://epoch.ai/blog/expanding-our-analysis-of-biological-ai-models
  47. Zhao C. Ethicists flirt with AI for reviewing human research. Science 2025;389:1281-2.
doi: 10.21037/jmai-2025-1-246
Cite this article as: Ghose S, Hyun I. Toward computational assessment of scientific justification across biomedical research lifecycles. J Med Artif Intell 2026;9:52.
