Artificial Intelligence in Toxicological Risk Assessment
Toxicology
Jun 4, 2026 | Published by Dean Hatt
Why expert scientific judgement remains irreplaceable in the age of AI
There is a version of this article that opens by celebrating artificial intelligence. The tools are genuinely impressive. Large language models can process thousands of scientific publications in the time it takes a human analyst to read a handful. They can extract structured data, identify trends across heterogeneous datasets, and produce well-formatted summaries at a speed that was unimaginable five years ago. For the parts of toxicological risk assessment that are fundamentally about handling large volumes of information, AI is a real and useful tool.
This is not, however, what this article is about. This article is about what AI cannot do, and about the commercial and regulatory risks that arise when enthusiasm for AI outpaces the evidence. There is a growing tendency to treat AI-generated outputs as a substitute for expert scientific judgement. They are not.
What is AI?
For the purposes of this article, AI means the Large Language Model (LLM), a statistical system trained on enormous volumes of text. This architecture explains both where LLMs excel and where they fail. Tasks that involve processing and restructuring large volumes of text (literature screening, data extraction, evidence structuring, gap identification, test prioritisation, dossier formatting) map directly onto what these systems are built to do. That is a genuine contribution to a workflow that has historically consumed enormous amounts of specialist time.
Tasks requiring formal logical reasoning, causal inference, mechanistic understanding, or the weighting of scientific evidence against established knowledge are a fundamentally different matter. An LLM has no internal model of biological mechanisms. It has no understanding of dose-response relationships, no concept of toxicological plausibility, and no reliable capacity to recognise when the conclusion it is generating is scientifically incoherent. It has learned that certain words tend to appear together in scientific contexts and produces text reflecting those associations. That is not understanding. In risk assessment, the difference matters enormously.
There is a further problem that is less often discussed: reproducibility. Ask the same question twice and you may get meaningfully different answers. Rephrase the question and the output can shift substantially even when the underlying scientific question has not changed. A well-constructed prompt can elicit a more complete response while a differently framed but equally valid question may not. In a regulatory context where reproducibility is a requirement rather than a preference, this is a structural weakness.
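The reproducibility concern can be made concrete with a simple check: ask the same question several times, collect the answers, and measure how often they agree. The sketch below is illustrative only. The function names and the three example responses are hypothetical, and exact matching after light normalisation is a deliberately crude measure that an expert would need to supplement with a reading of the substance of each answer.

```python
from collections import Counter


def normalise(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def agreement_rate(responses: list[str]) -> float:
    """Fraction of responses that match the most common normalised answer."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(normalise(r) for r in responses)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(responses)


# Hypothetical outputs from asking the same question three times.
runs = [
    "No evidence of genotoxicity.",
    "no evidence of  genotoxicity.",
    "Equivocal evidence of genotoxicity.",
]

print(f"Agreement rate: {agreement_rate(runs):.2f}")
```

An agreement rate below 1.0 does not say which answer is right. It only shows that the tool has not given a single answer, and deciding what to do about that remains a matter of expert judgement.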
Carcinogenicity, reproductive toxicity, developmental neurotoxicity, and immunotoxicity are each associated with multiple mechanistic variants involving different molecular initiating events, different key event sequences, and different adverse outcome pathways converging on superficially similar endpoints by entirely different routes. An AI system navigating this landscape without expert direction cannot reliably distinguish between pathways, identify which variant is relevant for a given chemical and exposure scenario, or recognise when evidence from one pathway cannot be extrapolated to another.
The Steps AI Cannot Take
Toxicological risk assessment follows a well-established framework of hazard identification, dose-response assessment, risk characterisation, and risk management. Each step involves interpretation that is contextual, mechanism-dependent, and species-specific. These are precisely the areas where current AI systems are least reliable, and where getting it wrong has the most serious consequences.
Hazard identification requires a toxicologist or risk assessor to make judgements about whether available data, which is often incomplete, conflicting, or generated under conditions not directly relevant to the exposure of concern, supports a conclusion about potential harm. A positive carcinogen classification or data suggesting reproductive toxicity or mutagenic potential requires expert evaluation in the context of the product, the route of exposure, likely metabolism, and the weight of wider toxicity findings. Often data is lacking and the assessor must use computational tools such as QSAR modelling to predict outcomes from structure. Even then, not all predictions are clean, and an expert opinion is needed. An AI system presented with the same literature will not make a scientifically defensible judgement.
Dose-response assessment goes well beyond pattern recognition. The EFSA re-evaluation of bisphenol A in 2023 reduced the tolerable daily intake by a factor of 20,000, from 4 µg/kg/day to 0.2 ng/kg/day. This decision required years of expert deliberation, mechanistic interpretation, and structured uncertainty analysis across multiple endpoints. It is not a judgement an AI system could reliably or defensibly reproduce.
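Because the two intake values are quoted in different units, the size of the reduction is easy to misread. A quick unit check, using only the two figures given above, confirms the factor. The variable names are illustrative.

```python
# Unit check for the tolerable daily intake (TDI) change quoted in the text.
NG_PER_UG = 1_000  # 1 microgram = 1,000 nanograms

old_tdi_ug = 4.0  # previous TDI in micrograms per kg per day (from the text)
new_tdi_ng = 0.2  # revised TDI in nanograms per kg per day (from the text)

# Convert the old value to nanograms so both figures share a unit.
old_tdi_ng = old_tdi_ug * NG_PER_UG

# Ratio of old to new TDI.
reduction_factor = old_tdi_ng / new_tdi_ng

print(f"Old TDI: {old_tdi_ng:,.0f} ng/kg/day")
print(f"New TDI: {new_tdi_ng} ng/kg/day")
print(f"Reduction factor: {reduction_factor:,.0f}")
```

The arithmetic is the trivial part. The years of deliberation went into deciding which endpoints and which uncertainties justified the new value.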
Regulatory transparency presents a further structural problem. Decisions issued by agencies such as ECHA carry legal weight and must withstand judicial scrutiny. AI systems whose black-box reasoning makes it difficult or impossible to determine how an output was produced are fundamentally incompatible with that requirement. A risk assessment conclusion that cannot be explained, audited, or repeated cannot be defended in a regulatory or legal context.
Where Industry Is Going
Industry is not standing still. Toxicological safety assessment is in genuine transition toward hypothesis-driven, weight-of-evidence frameworks that integrate multiple data streams and ask more targeted, mechanistically relevant questions rather than simply repeating outdated and previously mandated studies. AI, new approach methodologies (NAMs), in silico models, and bioactivity data are all playing pivotal roles in that shift.
Progress in specific areas is real. Repeat dose toxicity classification using in silico and bioactivity data is advancing. The binary carcinogen or non-carcinogen model is under active scientific scrutiny. NAMs for endocrine disruption assessment, particularly within the oestrogenic, androgenic and thyroid (EAT) modalities, are showing genuine development. Newer areas such as metabolic disruption remain at early stages.
The consensus across the regulatory toxicology community is consistent. AI integration into toxicology workflows is legitimate and growing. Its roles in literature review, data extraction, evidence structuring, gap identification, and test prioritisation are recognised. Its inability to replace domain expertise or scientific judgement is equally recognised. The infrastructure surrounding the tool (traceable inputs and outputs, robust audit trails, and active expert oversight) matters as much as the tool itself. When used responsibly within that infrastructure, AI can be a powerful enabler. When used as a shortcut around it, the output is worthless and potentially dangerous.
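To give a sense of what traceable inputs and outputs might look like in practice, the sketch below builds a minimal audit record for a single AI-assisted step. It is an assumption-laden illustration, not a regulatory template: the field names, the `audit_record` helper, and the model identifier are all hypothetical. The idea is simply that the prompt and output are fingerprinted, the tool is identified, and the record stays unsigned until a qualified scientist reviews it.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def sha256_hex(text: str) -> str:
    """Return a SHA-256 fingerprint of the text, so later changes are detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(prompt: str, output: str, model_id: str,
                 reviewer: Optional[str] = None) -> dict:
    """Build one audit entry for an AI-assisted step (hypothetical field names)."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(output),
        # Stays None until a qualified scientist has reviewed the output.
        "reviewed_by": reviewer,
    }


record = audit_record(
    prompt="Summarise repeat-dose findings for substance X.",
    output="(model output text)",
    model_id="example-model-v1",  # hypothetical identifier
)
print(json.dumps(record, indent=2))
```

A record like this does not make an output correct. It only makes it possible to show what was asked, what came back, which tool produced it, and who took responsibility for it.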
The next significant challenge is ensuring regulatory frameworks adapt and keep pace with scientific progress without compromising the public health protections those frameworks exist to provide. Science is moving faster than regulation. That gap needs careful management, and it will not be managed by technology alone.
What should we do?
For companies navigating this landscape, a few principles stand out:
- Do not confuse AI capability with regulatory acceptability. A tool that processes literature rapidly, or a model that generates plausible-looking outputs, is not automatically fit for regulatory submission. The regulatory standing of any method needs verification and validation before resources are committed.
- Maintain qualified scientific oversight as a fixed cost rather than a variable one. AI may reduce the time specialists spend on certain tasks. It does not reduce the need for them. The scientist who reviews, interprets, and takes responsibility for AI-assisted outputs is not an optional add-on. They are the reason those outputs mean anything.
- Engage regulatory agencies early. If NAMs or AI-assisted data are intended for a submission, the relevant competent authority should be involved at the study design stage. Early dialogue reduces the risk of costly rejection.
- Apply endpoint-specific logic. No single NAM or AI tool performs reliably across all toxicological endpoints. What is validated and accepted for skin sensitisation is not transferable to developmental neurotoxicity. The question is not whether AI or NAMs are useful in general. The question is which tool, for which endpoint, under what conditions of expert oversight and documented audit trail.
- Select your AI system carefully for the task you want it to perform. Systems are commonly categorised by capability, function, and underlying approach. Some are generalised, some are designed for a specific task, some offer machine learning and, who knows, some may one day become self-aware. It is unlikely that a single system you subscribe to will meet all of your AI needs.
The Bottom Line
Chemical risk assessment depends on qualified scientific expertise. AI does not change that. What AI changes is the efficiency of certain supporting tasks, and that is genuinely valuable. But efficiency in data extraction or literature screening is not the same as the scientific expertise that gives those tasks meaning. Invest in AI where it helps and where the regulatory pathway is clear.
An AI tool that produces a fluent, structured summary of toxicological literature has not assessed the hazard. It has summarised text. The hazard assessment begins when a qualified scientist reads that summary, evaluates its accuracy, applies mechanistic judgement, and takes responsibility for what it means. Removing that step does not accelerate the process. It breaks it.
The field is evolving quickly, with a strong emphasis on scientific judgement, transparency, and regulatory confidence. That emphasis is well placed. AI, used properly, can help deliver on all three. Used improperly, it undermines all three. The difference lies not in the technology but in the rigour of the human framework around it.