Evaluating Artificial Intelligence (AI) Systems

Kim Deloria
October 1, 2018

Artificial Intelligence, or AI, has long fascinated us. Once confined to the realm of science fiction and movies, this alluring concept has become a reality thanks to recent advances in computing power and data storage.

Examples of today’s AI applications include chatbots, Google’s predictive searching, Apple’s Natural Language Processing (“NLP”) of queries through Siri, smart email categorization, and route-based pricing in ride-sharing apps like Uber or Lyft, among others.

Our current AI capability, despite its revolutionary effect on modern life, is far from a perfected or apex technology and has its share of controversy. There are valid concerns regarding AI. These range from the problem of ‘context’, the eventual displacement of human workers in some industries, and the capability of AI systems to make ‘moral decisions’, to the potential application of AI in lethal weapons systems and, perhaps most controversial of all, the possibility of a runaway AI system taking over the world. It may be that some of these controversies are more myth than fact, but at the pace technology is progressing, it may be only a matter of time before some manifestation of our unease becomes a reality.

When Artificial Intelligence goes awry

Recent events show that ambitious AI systems, even those created with the best of intentions, can sometimes go awry.

When Microsoft created Tay, a hip millennial bot that could interact through social media, the software giant’s goal was to do research on conversational understanding. Tay would continually learn from the conversations that she had with the public on Twitter. Unfortunately, most of these conversations were with trolls and troublemakers, and it was not long before she displayed disturbing racist behavior. Microsoft eventually shut her down for repairs.

Meanwhile, Facebook contributed to a minor scare among conspiracy theorists when chatbots it was designing to talk to humans created their own indecipherable language. The researchers chose not to let the bots keep using it, since the whole point was machine-to-human communication.

In another high-publicity case, a self-driving shuttle bus was involved in a collision with a van on its first day of operation. Although the driver of the van was at fault for crashing into the shuttle, the fact that the shuttle’s ‘decision’ was simply to stop suggests that the AI system had difficulty processing situations that were not part of its original parameters.

All of the above instances indicate that although AI may hold a lot of promise, there are still some rough spots that need to be smoothed out. This poses an intriguing question for software testing: how does one evaluate the design, engineering, and specifically, the testing process, of artificially intelligent systems?

Testing AI versus Testing Traditional Systems

Requirements bind a non-AI software system because they form the basis of its creation. A clear set of documentation defines and sets boundaries for what the system can do. This set of requirements is also what allows the system to be evaluated.

An AI system has the same set of documentation or requirements; however, it has the potential to exhibit behavior not captured in the documentation. Unlike traditional systems, whose testing relies heavily on documented requirements, AI systems can evolve beyond their original programming, even surpassing reasonable extrapolations, so the testing of AI systems needs to go beyond traditional testing methodologies.

Testing of systems, whether AI or non-AI, will always be an essential activity. Both types of systems will need to undergo the requisite testing. The main difference between the two is that the behavior of AI systems is much harder to predict, whereas traditional testing mandates that the ‘expected result’ be defined or known. Furthermore, the inherent complexity of AI systems means that many other prerequisites must be in place before testing can be done.

A non-exhaustive list of considerations that need to be addressed when testing AI systems includes:

  • What are the inputs that will be used to train and test the system?
  • What is the potential of the input data to be biased?
  • How will the system handle exceptions?
  • How good is the training and testing data quality?
  • Does the project have enough data?
  • Are there any ethical or legal implications?
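Some of the data questions above can be given a rough first answer with very simple checks. The sketch below is only an illustration: the record layout, the field names, and the helper functions `label_shares` and `missing_rate` are assumptions made for this example, not part of any particular tool. It reports how the labels are distributed (a crude signal of possible bias) and how often a field is missing (a crude signal of data quality).

```python
from collections import Counter


def label_shares(records, label_key):
    """Return each label's share of the data set (a first look at imbalance)."""
    counts = Counter(r[label_key] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def missing_rate(records, field):
    """Return the fraction of records in which `field` is absent or None."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


# Hypothetical, made-up records used purely for illustration.
sample = [
    {"text": "great ride", "label": "positive"},
    {"text": "driver was late", "label": "negative"},
    {"text": None, "label": "positive"},
    {"text": "ok", "label": "positive"},
]

shares = label_shares(sample, "label")
text_missing = missing_rate(sample, "text")
print(shares)
print(text_missing)
```

Checks like these do not prove that a data set is unbiased or sufficient; they only flag obvious problems early, before more specialized analysis.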

The table below underscores the differences between testing AI systems and testing traditional IT systems.

Activity | Testing traditional systems | Testing AI systems
-------- | --------------------------- | ------------------
Create test cases | Determine the acceptance criteria based on the requirements | Essentially the same as testing traditional systems, but with additional considerations such as the sufficiency and quality of input and training data
Test execution | Perform test execution based on test scripts or test charters | Specialized forms of testing are needed, such as A/B testing, metamorphic testing, and additional non-functional testing
Test reporting | Reporting of test metrics and recommendations | Reporting of test metrics and AI behavior
Defect management | Defect triages and code fixes | Defect triages, code fixes, data changes, data optimization

When testing AI systems, a sufficient number of quality data sets is essential for testing to be successful. Additional types of testing also need to be performed to fully test an AI system.
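Metamorphic testing, one of the specialized forms mentioned in the table, is useful when the exact ‘expected result’ is not known in advance. Instead of checking an output against a fixed answer, the tester checks a relation that should hold between outputs, for example that a change in letter case or surrounding whitespace should not alter a sentiment prediction. The sketch below assumes a toy keyword-based function, `toy_sentiment`, as a stand-in for a real trained model; it and the helper `invariance_violations` are hypothetical names used only for illustration.

```python
def toy_sentiment(text):
    """Stand-in for a trained model (hypothetical): keyword-based sentiment."""
    words = text.lower().split()
    positives = sum(w in {"good", "great"} for w in words)
    negatives = sum(w in {"bad", "late"} for w in words)
    return "positive" if positives - negatives >= 0 else "negative"


def invariance_violations(model, inputs, transform):
    """Return the inputs whose prediction changes under `transform`.

    The metamorphic relation being checked is: model(x) == model(transform(x)).
    """
    return [x for x in inputs if model(x) != model(transform(x))]


def add_padding(text):
    """A transform that should not change the meaning of the text."""
    return "  " + text.upper() + "  "


inputs = ["great ride", "driver was late", "bad route but good driver"]
violations = invariance_violations(toy_sentiment, inputs, add_padding)
print(violations)
```

An empty list means the relation held for every input tried; any input returned is a candidate defect to triage, whether the fix turns out to be in the code or in the training data.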

New QA Skills Needed

It quickly becomes apparent that additional skills are needed when it comes to testing AI systems.

The traditional skills of a QA resource are analyzing user requirements, creating test scenarios, executing tests, analyzing defects, and reporting. These skills are still essential when it comes to AI testing, but they are no longer sufficient. Even the additional technical skills of a Quality Engineer, such as writing and understanding computer code, automation, and networking, are not enough.

Testing AI systems requires a deep understanding of data manipulation and quality, a working knowledge of the varied machine learning techniques, and a good foundation in mathematics and statistics. The AI testing resource should therefore be a technologist who has a solid grasp of the theory behind AI.

Since AI systems are designed to emulate a form of human behavior or capability, there can even be situations where certain non-technical skills are needed. Where NLP is involved, for example, language majors need to work with technologists to properly capture the essence of the data that makes the AI work. This could help avoid a debacle like the Microsoft one mentioned in an earlier section. Likewise, in systems where the AI is expected to interact with groups of people in a culturally appropriate way, the input of psychologists and anthropologists would be essential in determining the acceptance criteria of the AI system.


The possibilities and impact of AI mean that more AI systems will be created in the future. For these AI systems to gain widespread acceptance and unlock their true potential, it is important that they be constructed correctly and thoroughly evaluated.

Testing is an important activity in the evaluation process. Due to the complex nature of AI systems, traditional methods of testing are insufficient when applied to AI. Testing therefore needs to be supplemented by newer types of activities specifically geared towards AI. This newer set of activities in turn implies that the resources doing the testing need additional skill sets that enable them to perform these activities.

So in a manner of speaking, we do need to write our own science fiction to harness the power and potential of AI.