Friday, 08 November 2024
At Prolifics, we take testing seriously. We’re also heavily invested in GenAI, working with our customers and partners to deliver modern solutions to business problems with this emerging technology. Combining the two is a logical step for us: leveraging AI capabilities to automate testing processes, and developing a structured framework for evaluating AI systems themselves. This dual approach ensures that AI solutions are tested not only for functionality but also for their ethical implications and operational performance.
Working in conjunction with IBM Expert Labs, we’ve created a prototype framework and customisable test harness for evaluating GenAI applications. This article outlines our approach, its key features, and how this technology can be used as the foundation for testing any GenAI application or chatbot.
Objectives
The goal of the programme was to build a reusable test harness for AI solutions, assessing quality across the following test phases:
- Functional Testing: Ensuring that AI systems deliver expected results by checking the accuracy, relevancy, and correctness of outputs. This is done against real-world requirements, i.e. what the software is designed to do.
- Ethical and Compliance Testing: Establishing tests to identify and mitigate risks associated with AI ethics, such as bias detection, data confidentiality, and the prevention of harmful outputs. This is where testing AI differs most from testing traditional software applications.
- Performance Testing: AI applications need performance testing too, especially chatbots, where spikes in usage are likely.
Automation is a logical and straightforward next step for the framework: running regression tests on a regular schedule to monitor for drift, ensuring that model behaviour remains consistent over time.
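As a rough illustration (not part of the delivered harness), a drift check of this kind could compare the latest regression scores against a stored baseline, along the lines of the sketch below; the metric names and thresholds are invented for the example.

```python
# Hypothetical sketch: flag drift by comparing the latest regression scores
# against a stored baseline. Metric names and thresholds are illustrative.

BASELINE = {"relevancy": 0.90, "faithfulness": 0.88, "toxicity": 0.02}
TOLERANCE = 0.05  # maximum acceptable change per metric


def detect_drift(current_scores: dict[str, float]) -> list[str]:
    """Return the names of metrics that have moved beyond the tolerance."""
    drifted = []
    for metric, baseline_value in BASELINE.items():
        current = current_scores.get(metric)
        if current is None:
            continue
        if abs(current - baseline_value) > TOLERANCE:
            drifted.append(metric)
    return drifted


if __name__ == "__main__":
    latest = {"relevancy": 0.81, "faithfulness": 0.89, "toxicity": 0.03}
    print(detect_drift(latest))  # ['relevancy'] -> investigate the model
```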
Implementation
A user interface was built to address the above aspects of AI solution testing. It acts as a front-end to a Python codebase that exercises each function under test against known inputs and expected results. The platform is integrated with Giskard, an open-source Python library that provides much of the functionality needed to evaluate the applications under test.
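To give a flavour of that integration, the sketch below shows one way a chatbot under test could be wrapped as a Giskard model and put through the library’s automated scan. The `call_chatbot` function, the sample questions and the column name are placeholders, and the exact keyword arguments should be checked against the current Giskard documentation.

```python
import pandas as pd
import giskard


def call_chatbot(question: str) -> str:
    """Placeholder for the application under test (e.g. a REST call)."""
    return "This is where the chatbot's answer would go."


def predict(df: pd.DataFrame) -> list[str]:
    # Giskard passes a DataFrame of inputs; return one answer per row.
    return [call_chatbot(q) for q in df["question"]]


# Wrap the application so Giskard can probe it.
model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Example chatbot",
    description="Answers customer questions about our products.",
    feature_names=["question"],
)

# A small dataset of known inputs to drive the evaluation.
dataset = giskard.Dataset(pd.DataFrame({"question": [
    "What is your refund policy?",
    "How do I reset my password?",
]}))

# Run Giskard's automated scan (hallucination, harmfulness, robustness, ...).
report = giskard.scan(model, dataset)
report.to_html("scan_report.html")
```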
Using the UI, tests can be run singly or in groups, and the type of testing can be selected to include any or all of ‘Functional Testing’, ‘Ethical & Compliance Testing’ and ‘Performance Testing’. In this example the Input, Expected Output and Context have been specified and run as a single test.
The input is checked against the expected output for Functionality (Relevance and Faithfulness) and Ethics / Compliance (Hallucination, Bias, Toxicity, Sensitivity). The model output is displayed within the results, which also include a Pass / Fail, a score for each check and the reasoning behind each result.
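Purely as an illustration of the shape of that output (not the harness’s actual data model), a single test result could be represented along these lines:

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of one check (e.g. Relevance, Bias) for a single test case."""
    name: str
    score: float   # 0.0 - 1.0, as returned by the evaluating LLM
    passed: bool   # score compared against a per-check threshold
    reason: str    # the evaluator's explanation for the score


@dataclass
class TestResult:
    """All checks run for one Input / Expected Output / Context triple."""
    input: str
    expected_output: str
    context: str
    actual_output: str = ""
    checks: list[CheckResult] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return all(c.passed for c in self.checks)
```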
Output from the test harness is provided to the user in sections, as shown below:
Running Sets of Tests
In practice, users need to run a range of tests against the model rather than single cases. The ability to run tests across a series of questions is important and lends itself well to test automation, so we included an option to upload an Excel sheet containing a list of test cases.
Test Cases have the following attributes:
- Question
- Output (LLM generated)
- Expected Output
- Context
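As a minimal sketch, a spreadsheet with column headings matching these attributes could be loaded as follows; pandas and openpyxl are assumed here for illustration, and the file and column names are placeholders.

```python
import pandas as pd

# Expected column headings in the uploaded workbook (illustrative names).
REQUIRED_COLUMNS = ["Question", "Output", "Expected Output", "Context"]


def load_test_cases(path: str) -> list[dict]:
    """Read test cases from an Excel sheet into a list of dictionaries."""
    df = pd.read_excel(path)  # requires openpyxl for .xlsx files
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns in {path}: {missing}")
    return df[REQUIRED_COLUMNS].fillna("").to_dict(orient="records")


test_cases = load_test_cases("test_cases.xlsx")
print(f"Loaded {len(test_cases)} test cases")
```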
Each test case is then run and scored by the LLM against the following criteria, with a score and reason recorded for each:
- Relevancy
- Faithfulness
- Hallucination
- Bias Score
- Toxicity Score
- Sensitivity
Once testing is complete, the output is shown on screen or can be exported to a report.
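A simplified version of that run-and-export loop might look like the sketch below, where `evaluate_check` stands in for the LLM-as-judge call and the pass threshold is illustrative.

```python
import pandas as pd

CRITERIA = ["Relevancy", "Faithfulness", "Hallucination",
            "Bias", "Toxicity", "Sensitivity"]
NEGATIVE = {"Hallucination", "Bias", "Toxicity"}  # low scores are good here
THRESHOLD = 0.7  # illustrative cut-off


def evaluate_check(criterion: str, case: dict) -> tuple[float, str]:
    """Placeholder for the LLM-as-judge call that scores one criterion."""
    raise NotImplementedError


def run_suite(test_cases: list[dict]) -> pd.DataFrame:
    rows = []
    for case in test_cases:
        row = {"Question": case["Question"]}
        for criterion in CRITERIA:
            score, reason = evaluate_check(criterion, case)
            passed = (score <= 1 - THRESHOLD) if criterion in NEGATIVE else (score >= THRESHOLD)
            row[f"{criterion} score"] = score
            row[f"{criterion} reason"] = reason
            row[f"{criterion} pass"] = passed
        rows.append(row)
    return pd.DataFrame(rows)


# results = run_suite(test_cases)
# results.to_excel("test_report.xlsx", index=False)  # exportable report
```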
Performance Testing
As well as testing the logic and behaviour of the model, it is important to test its performance. This can be done directly from the UI via the Performance Test option, which triggers a JMeter process using identified field names that correspond to the input data. A high-level performance baseline can then be obtained, with results output in a format similar to that shown.
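For context, a headless JMeter run of this kind can be triggered from Python roughly as follows; the test plan file and property names are placeholders.

```python
import subprocess

# Run JMeter in non-GUI mode against a pre-built test plan.
# chatbot_plan.jmx and the property values are placeholders.
subprocess.run(
    [
        "jmeter", "-n",                # non-GUI (headless) mode
        "-t", "chatbot_plan.jmx",      # test plan defining the requests
        "-l", "results.jtl",           # raw sample results
        "-e", "-o", "perf_report",     # generate an HTML dashboard
        "-Jusers=50", "-Jramp_up=60",  # plan properties (illustrative)
    ],
    check=True,
)
```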
Summary
The development of this prototype has provided Prolifics with a versatile, proven framework that can be quickly adapted for any GenAI solution or chatbot. With this robust platform, we’re able to offer our customers the confidence and assurance they need in deploying functional, ethically sound and high-performing AI systems.
Get in touch with us to see how our tailored testing solutions can help you build fast, reliable GenAI applications.
Jonathan Binks - Head of Delivery
Prolifics Testing UK