Can AI help in Validation? Cutting Through the Hype

AI in URS and FS writing, and in traceability matrix generation

As interest in AI continues to grow, some vendors are promoting AI tools as capable of generating qualification deliverables such as User Requirements Specifications (URS), Functional Specifications (FS), and traceability matrices with minimal human input. This article provides a grounded, practical look at what AI can and cannot do in the context of URS and FS creation and traceability matrix automation. It explains how large language models (LLMs) work; what URS and FS documents are, and how they fit into the qualification process; it presents the tests we ran with specialized LLMs (sentence transformer models) to automate traceability matrix generation, and the results obtained; and it suggests where AI may offer limited support – while highlighting why human ownership and input are still required for critical tasks like URS and FS development.


Introduction: Opportunity vs. Overpromise

The rise of AI tools like ChatGPT has sparked broad interest across regulated industries. In pharmaceutical, biotech, medical device, and cosmetics sectors, companies are exploring how these tools might reduce documentation workload, increase consistency, and support compliance efforts.

However, alongside this enthusiasm, some vendors have begun promoting the idea that large language models (LLMs) can generate qualification documents – such as User Requirements Specifications (URS), Functional Specifications (FS), or even test scripts – with a simple prompt.

While these claims are designed to attract attention, they risk oversimplifying what these qualification documents really are, how they are developed – and where, if at all, AI actually fits into the process.

This article explains the role and requirements of key qualification documents, explores how LLMs like ChatGPT work, and examines where AI tools can – and cannot – be appropriately applied to support the qualification process.

What a URS Actually Does – and Why It Matters

A URS (User Requirements Specification) is a critical foundation for any qualified entity. It defines what the user needs the system, equipment, utility, or software to do – in terms that are clear, measurable, and auditable.

A robust URS is not a generic checklist. It must be developed by knowledgeable stakeholders who understand:

  • The intended use of the system,
  • Key process and operational parameters (e.g., throughput, control logic, data handling),
  • The regulatory framework and internal quality expectations,
  • Integration points with other systems (e.g., MES, SCADA, LIMS),
  • Environmental, safety, and compliance constraints.

A well-defined URS supports:

  • Effective vendor selection and procurement,
  • A clear basis for design and testing,
  • Alignment with qualification protocols and traceability,
  • Compliance with GxP expectations for “fitness for intended use.”

When a URS is vague or incomplete, the consequences can be significant: unsuitable equipment, misaligned vendor deliverables, inefficient qualification efforts, or gaps that lead to inspection findings and remediation. Poor input at the URS stage can therefore compromise the entire lifecycle of the system – including its qualification.

How LLMs Like ChatGPT Actually Work

How LLMs function is often misunderstood. Tools like ChatGPT do not think, reason, or understand. Instead, they generate text by predicting the most statistically likely next word in a sentence, based on patterns learned from massive volumes of publicly available internet text.

In practice, this means:

  • They do not understand context in the way a human expert does,
  • They do not verify accuracy,
  • And they do not apply regulatory logic or domain-specific judgment.
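To make the prediction mechanism concrete, here is a deliberately over-simplified toy sketch in Python. The vocabulary and probabilities are invented for illustration; a real model scores tens of thousands of candidate tokens using billions of learned parameters, not a hand-written table.

    # Toy illustration of next-word prediction. The probability table below is
    # invented; a real LLM computes these scores from learned parameters.
    next_word_probs = {
        "The system shall be cleaned": {
            "in": 0.41,          # e.g. "... in place"
            "manually": 0.27,
            "daily": 0.18,
            "thoroughly": 0.14,
        }
    }

    prompt = "The system shall be cleaned"
    candidates = next_word_probs[prompt]

    # The most statistically likely continuation is emitted - no understanding of
    # cleaning requirements, regulations, or the actual equipment is involved.
    best = max(candidates, key=candidates.get)
    print(prompt, best)  # -> "The system shall be cleaned in"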

Critically, high-quality URSs, FSs, and qualification scripts are typically internal and proprietary – not available in the public domain. As a result, these documents are not well represented in the training data used to build general-purpose language models.

This creates a risk of ‘hallucination’: that is, AI can sometimes confidently generate text that sounds correct but is factually inaccurate or entirely fabricated. In regulated environments, this presents an obvious risk – especially when such content is used to support compliance activities.

Can LLMs Write a URS? What’s the Real Limitation?

When asked to generate a URS, an LLM can produce something that looks well-formatted – but that document is likely to be too generic or vague to be useful in practice.

For example, asking ChatGPT to generate a URS for a “sifter” will not trigger the LLM to ask clarifying questions on aspects such as:

  • Process requirements: What is the role of the sifter in the process? For example, it could be particle size separation, de-lumping, or scalping. Which screening motion – centrifugal, vibratory, or gyratory – is best suited for your materials? What are the in-process material characteristics (e.g., moisture content, particle size distribution)? How will the sifter integrate with upstream or downstream equipment, like granulators or tablet presses – via gravity-fed or vacuum transfer systems? Is the sifter intended for batch processing or continuous operation?
  • Performance requirements: What throughput is needed? Are there maximum allowable noise or vibration levels to comply with workplace safety standards? What is the acceptable level of material loss or retention during the process?
  • Design and construction requirements: What are the material of construction (MOC) requirements? What specific surface finish (e.g., Ra < 0.8 µm) is required? What are the requirements for welding and polishing? What utilities are available, like compressed air pressure, electrical supply, or vacuum systems, to support its operation? Does it need tool-less disassembly for cleaning? What are the cleaning needs (e.g., CIP, SIP, manual)?
  • Controls and automation: Will the sifter be integrated with a Supervisory Control and Data Acquisition (SCADA) system or Programmable Logic Controller (PLC)? Are alarms, interlocks, or data logging required? Should it support electronic batch records or interface with Manufacturing Execution System (MES)? Is Process Analytical Technology (PAT)-driven monitoring required?
  • Environmental or containment needs: Is operator protection (e.g., for potent APIs) required? What classification zone applies? For example, if OEB 5 containment is required, pneumatic clamping mechanisms have to be put in place.

These are essential details – and they define far more than the basic equipment type. They shape how the system will be designed, selected, installed, qualified, and operated. Without them, the URS cannot serve its purpose as a regulatory and procurement document.

Even if LLMs are fine-tuned on customer data, key questions remain:

  • Who validates the quality and compliance of the source material?
  • Is the model version-controlled?
  • Is the data private – or potentially exposed across users or systems?
  • Will the AI output reflect your company’s unique process – or generalize from unrelated inputs?

And if subject matter experts must still correct or rebuild the draft, it’s worth asking: is the AI truly reducing effort, or simply introducing another layer of review?

Functional Specifications: A Vendor’s Responsibility

Once the URS is approved, the FS (Functional Specification) is typically authored by the vendor, service provider, or internal engineering team. The FS outlines how the solution will meet the defined user requirements – through system design, control logic, automation behavior, configuration, and integration.

Authoring an FS requires:

  • Proprietary system knowledge,
  • Awareness of design constraints,
  • Understanding of operational and regulatory requirements,
  • Collaboration across quality, engineering, IT, and qualification functions.

The FS is the vendor’s responsibility – it represents their response to the URS. And this raises an important point: why would a vendor delegate that responsibility to an AI tool that doesn’t understand their product? Even with access to a URS, an LLM cannot independently determine how a system meets those requirements without being explicitly told.

The same applies to qualification test scripts. These are step-by-step instructions used to verify whether the system meets its documented requirements. They must be traceable, reproducible, and inspection-ready – and they require domain knowledge to be meaningful. LLMs cannot reliably generate test descriptions because they don’t understand how the system functions, what needs to be tested, or how to design a test that produces meaningful, validated results.

If generating test scripts were as simple as prompting an AI, most qualification tools would be fully automated by now. But they are not – because test design remains a technical, judgment-based process.

Can LLMs Automate the Process of Filling a Traceability Matrix?

Some vendors have been marketing their ability to automate traceability – namely, by using LLMs to identify which test script aligns with a given URS requirement. But can AI really perform this task reliably?

One approach would be to use a type of LLM known as a sentence transformer model, a class of AI designed to compare entire sentences rather than individual words. These models can capture semantic relationships between phrases – for example, comparing the intent of a URS requirement with the description in a test script (be it an IQ, OQ, or PQ) – and then rank the closest matches accordingly, to select which script number matches each URS requirement.
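As an illustration of this approach, below is a minimal sketch using the open-source sentence-transformers library. The requirement and the candidate test descriptions are invented examples, not taken from a real URS or protocol.

    # Minimal sketch of the sentence-transformer matching approach described above.
    # The requirement and test descriptions are invented for illustration only.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-distilroberta-v1")

    urs_requirement = "The sifter shall achieve a throughput of at least 200 kg/h."
    test_descriptions = [
        "Verify the achievable material throughput at maximum feed rate and record the result in kg/h.",
        "Confirm that all product-contact parts are made of 316L stainless steel.",
        "Verify that an alarm is raised when the screen motor current exceeds its set limit.",
    ]

    # Encode both sides into embeddings and rank the candidate tests by cosine similarity
    req_emb = model.encode(urs_requirement, convert_to_tensor=True)
    test_embs = model.encode(test_descriptions, convert_to_tensor=True)
    scores = util.cos_sim(req_emb, test_embs)[0]

    # Print candidates from best to worst match
    for idx in scores.argsort(descending=True):
        print(f"{scores[idx].item():.2f}  {test_descriptions[idx]}")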

Transformer models are trained on billions of data points and can encode the meaning of phrases or entire sentences, not just individual words. However, their reported accuracy raises questions.

Here are the published accuracy figures for some commonly used transformer models:

Model                      Reported Accuracy
all-distilroberta-v1       ~85%
all-MiniLM-L12-v2          ~80%
paraphrase-MiniLM-L6-v2    ~70%

To assess performance in a qualification context, we tested two of these models – all-distilroberta-v1 and all-MiniLM-L12-v2 – by asking them to match 50 URS statements against 50 corresponding functional requirements.

Despite both documents being clearly written and well structured, the models matched 11 of the 50 statements to the wrong requirements – an error rate of more than 20%. Even when combining different transformer models to improve performance, we never reached 90% accuracy. In real-world scenarios, where URSs and scripts often vary in phrasing, specificity, or formatting, this represents a serious challenge to the idea that LLMs can automate traceability.
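For readers interested in how such a check can be scripted, here is a minimal sketch of the matching-accuracy measurement, assuming two aligned lists where the i-th URS statement should map to the i-th functional requirement. The statements shown are short, invented placeholders, not our actual test set.

    # Minimal sketch of a 1:1 matching accuracy check. In our actual test each
    # list contained 50 statements; urs_statements[i] should map to fs_statements[i].
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L12-v2")

    urs_statements = [
        "The system shall record all operator interventions in an audit trail.",
        "The sifter shall support cleaning without tool-based disassembly.",
        "Alarm conditions shall be logged with a date and time stamp.",
    ]
    fs_statements = [
        "All user actions are written to a 21 CFR Part 11 compliant audit trail.",
        "The screen assembly can be removed by hand for cleaning.",
        "Alarms are stored in the event log together with a timestamp.",
    ]

    urs_embs = model.encode(urs_statements, convert_to_tensor=True)
    fs_embs = model.encode(fs_statements, convert_to_tensor=True)

    # For each URS statement, select the functional requirement with the highest similarity
    similarity = util.cos_sim(urs_embs, fs_embs)   # shape: (n_urs, n_fs)
    predicted = similarity.argmax(dim=1)

    correct = sum(int(predicted[i].item() == i) for i in range(len(urs_statements)))
    print(f"Matched {correct}/{len(urs_statements)} statements to the expected requirement")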

For matching URS requirements to test scripts, the accuracy shortfall would become even more pronounced, as scripts are written in less descriptive and more procedural language than FSs – simply describing the steps to be executed, not their intent. Sentence transformer models are not yet capable of reliably inferring test intent from such procedural language. And if human experts still need to manually verify each AI-suggested match, what, if any, are the efficiency gains?

It’s also worth noting that these models are more specialized than general-purpose LLMs like ChatGPT or Mistral. Therefore, if transformer-based models – trained specifically for sentence matching – struggle with this task, non-specialized models will likely perform worse.

Where LLMs Might Help

Despite these limitations, LLMs can still support certain tasks – particularly when used under SME supervision in low-risk contexts. Potential use cases include:

  • Drafting boilerplate content (e.g., introductory or regulatory reference sections),
  • Improving clarity, grammar, or formatting,
  • Suggesting document structures or outlines.

These kinds of low-risk, repeatable, and time-consuming activities are suitable areas for automation, with AI acting as a support tool. However, even in these cases, outputs need to be reviewed and approved by qualified personnel to ensure regulatory accuracy and contextual appropriateness.

Final Thoughts

The importance of accuracy in qualification, together with the limitations of LLMs, means human input is still essential. While LLMs can support low-risk, repetitive tasks like drafting boilerplate content or improving formatting, their role in generating qualification documents remains limited.

By definition, a URS must reflect what human users need from the asset or service to be purchased – something no AI can independently determine. These documents require expert judgment, deep process understanding, and context-specific decisions that LLMs, at their current level of maturity, simply cannot replicate.

Similarly, the use of AI for automating traceability – such as matching URS statements to test steps – still falls short. Even advanced sentence transformer models show accuracy rates that are too low for compliance-critical tasks. In regulated environments, a 20% error rate isn’t just inefficient – it’s dangerous.

Errors in qualification can have serious consequences: from costly equipment failures to regulatory violations or, worse, risks to patient safety. That’s why experienced professionals must remain in control of the process.

In short, LLMs may enhance productivity in certain areas, but when it comes to URS and FS development and traceability matrix generation, human expertise, oversight, and responsibility are irreplaceable.