This series is a reprint of an article by Scott Sehlhorst, written for developer.* in March 2006. A recent article on dailytech about “new” methods for software testing points to some very interesting research by the NIST (the National Institute of Standards and Technology) Information Technology Lab. We’ve split the original article into three articles to be more consistent in length with other Tyner Blain articles.

This is part 2 of a 3 part series.

## Article Overview

- Part 1 of this article explores the problem of getting good testing coverage of complex software.
- Part 2 of this article discusses solution approaches (including those identified in the NIST research).
- Part 3 of this article explores an approach to improving on the “best” solution identified in part 2.

## Solving the problem

There are a lot of applications that have millions or billons of combinations of inputs. They have automated testing. They have solutions to this problem. We just finished discussing how impractical it is to test exhaustively, so how do companies test their complex software?

## Random sampling

Early on in the software testing world, someone realized that by randomly checking different combinations of inputs, they would eventually find the bugs. Imagine software that has one million possible combinations of inputs (half as complex as our previous example). Each random sample would give us 0.000001% coverage of all possible user sessions. If we run 1000 tests, we would still only have 0.001% coverage of the application.

Thankfully, statistics can help us make statements about our quality levels. But we can’t use “coverage” as our key measurement of quality. We have to think about things a little bit differently. What we want to do is express a level of confidence about a level of quality. We need to determine the sample size, or number of tests, that we need to run to make a statistical statement about the quality of the application.

First we define a quality goal: we want to assure that our software is 99% bug free. That means that up to 1% of the user sessions would exhibit a bug. To be 100% confident that this statement is true, we would need to test at least 99% of the possible user sessions, or over 990,000 tests.

By adding a level of confidence to our analysis, we can use sampling (selecting a subset of the whole, and extrapolating those results as being characteristic of the whole) to describe the quality of our software. We will leverage the mathematical work that has been developed to determine how to run polls.

We define our goal to be that we have 99% confidence that the software is 99% bug free. The 99% *level of *confidence means that if we ran our sample repeatedly, 99% of the time, the results would be within the margin of error. Since our goal is 99% bug free code, we will test for 100% passing of tests, with a 1% margin of error.

How many samples do we need, if there are one million combinations, to identify the level of quality with a 99% confidence, and a 1% margin of error? The math for this is readily available, and calculators for determining sample size are online and free. Using this polling approach, we find that the number of samples we require to determine the quality level with a 1% error and 99% confidence is 16,369.

If we test 16,369 user sessions and find 100% success, we have established a 99% confidence that our quality is at least at a 99% level. We only have 99% quality, because we have found 100% quality in our tests, with a 1% margin of error.

This approach scales for very large numbers of combinations. Consider the following table, where our goal is to establish 99% confidence in a 99% quality level. Each row in the following table represents an increasingly complex software application. Complexity is defined as the number of unique combinations of possible inputs).

We can see that the very few additional tests have to be run to achieve the same level of quality for increasingly complex software. When we have a modest quality goal, such as 99/99 (99% confidence in 99% quality), this approach is very effective.

Where this approach doesn’t scale well is with increasing levels of quality. Consider the quest for “five nines” (99.999% bug free code). With each increase in the desired level of quality, the number of tests we have to run grows. It quickly becomes an almost exhaustive test suite.

Each row in the following table represents an increasingly stringent quality requirement, with the complexity of the software staying constant at one million possible input combinations.

The random sampling approach does not provide a benefit over exhaustive testing when our quality goals are high.

## Pairwise testing of input variables

Studies have shown that bugs in software tend to be the results of the combination of variables, not individual variables.

Consider a very simple laptop configuration page, having three selectable controls: CPU, Memory, and Storage. Each control has three possible values as shown in the table below.

We successfully pass tests of each of the different values available in the CPU control. However, we discover that the software test fails if the user selects a CPU value of “Consumer” *and* selects a Storage value of “Huge.” This highlights an unknown dependency between the CPI and Storage controls.

Pair-wise testing is designed to get coverage of every possible combination of two variables, without testing every possible combination of all the variables. For this example, there are 27 unique combinations of all of the selections. The following table shows the first 9 combinations. An additional 9 combinations are needed for each of the other CPU selections.

Exhaustive pair-wise testing will make sure that every unique combination of any two variables will be covered. The next table shows the combinations for this example.

With just 9 tests, we are able to exhaustively cover every unique pair of CPU and Memory, CPU and Storage, and Memory and Storage. Pair-wise testing allows us to get full coverage of the combinations of every two variables, with a minimal number of tests.

Pair-wise testing not only gives us full coverage of every pair of values, it also gives us (redundant) coverage of every single value for each control.

If we look back at our previous examples of laptop-configuration, we can calculate the numbers of tests required to get full pair-wise coverage. For the entry level laptop configurator, there are 32,256 possible unique combinations of inputs. We can test every unique combination of two variables with 31 tests. For the high-end laptop configurator, there are 2,322,432 unique combinations of inputs. We can test every unique combination of two variables with 36 tests.

## N-wise testing

The concept of pair-wise testing can be extended to N-wise testing – looking at every combination of N possible inputs.

The following table shows how many tests are required to get full coverage of each N-wise combination of inputs for both the low-end and high-end laptops configurators.

This is a much more manageable situation. Using N=3, we get 179 tests versus 2.3 million tests for exhaustive coverage! Existing studies have consistently shown that N=3 creates on the order of 90% code coverage with test suites, although the number will vary from application to application. We will use N=3, based on practical experience that N=4 tests rarely uncover bugs that were missed with N=3.

This approach only works when users are forced to enter values in a proscribed sequence, and in cases where the sequence of entry is irrelevant. This set of tests won’t give us representative coverage of what the users will do when they are allowed to make selections in arbitrary but relevant order.

## Order relevance and statistical testing

There’s been an assumption implicit in all of our calculations so far – that the order of selection in the controls is irrelevant. The available N-wise test calculation tools do not incorporate order of selection in their permutations – explicitly, they assume a fixed order of operations. When we test an API we have control over the order of processing – there are a fixed number of arguments, in a fixed order. N-wise testing People, however, do not always interact with the controls in a fixed order. And web service architectures may not be able to depend upon a predetermined sequence of events.

With 5 controls in an interface, we have 5! (factorial – see definition) or 120 possible sequences in which selections can be made by a person. Although the user interface may incorporate dynamic filtering that prevents some subsets of out-of-sequence selection, N-wise testing is blackbox testing, and will not have access to that information.

For an interface with M possible controls, each script created by an N-wise test generator will have to be tested in M! sequences to get exhaustive coverage. If the controls are split across multiple screens, then we can reduce the number of sequences. For example, if there are 5 controls on the first screen, and five controls on the next screen, instead of considering 10! (3.6 million sequences), we can consider all first-screen-sequences in combination with all second-screen sequences (5! * 5! = 120 * 120 = 14,400 script sequences).

In our example laptop configurators, there are 11 and 13 controls (all on the same page) for the low-end and high-end laptops respectively. This would imply 11! and 13! possible sequences (40 million and 6 billion).

We do not need to do exhaustive coverage of the sequencing permutations. An N-wise test is specifically analyzing the interdependence of any combination of N controls. As a lower-bound, we would only need N! sequences for each generated script. So our 179-script suite for the high-end laptop (with N=3) would need 3! (6) * 179 = 1,074 scripts to cover the product.

Here’s the table for the lower-bound of scripts required to account for different values of N for both laptop-configurators.

This is a lower bound, because it assumes a perfect efficiency in combining unique sequences of each group of N controls. Existing N-wise testing tools do not (to the author’s knowledge) take order of operations into account. For N=2, this is trivial – just duplicate the set of tests, in the exact reverse order.

We can take order of operations into account by treating the sequence as an additional input. We use the mathematical formula “X choose Y” which tells us the number of different combinations of Y values from a set of X values. The formula for calculating “X choose Y” is X!/(Y!*(X-Y)!) where X is the number of inputs and Y is the dimension of the desired N-wise test.

Here’s the table of the number of combinations for each N, for both the low-end and high-end laptop configuration screens we’ve been discussing.

Here are the values, generally, for varying numbers of inputs.

We would then calculate the N-wise testing using a value of N+1 as an input to the test-generation tool, and include the number of unique sequences as if it were a control input.

Unfortunately, we don’t have a solver capable of handling single dimensions larger than 52. This limits our ability to create a test suite for N=3 to a maximum of 7 controls.

To show the impact of sequencing on the test suite, consider an interface with 7 controls, each having 5 possible values. N=3 would require 236 tests if order is irrelevant. We then include sequence of selection as a parameter (by adding an 8^{th} control with 35 possible values, and testing for N=4), In this case, N=3 (with sequencing) requires 8,442 scripts. Our theoretical lower bound would be 236 * 35 = 8260.

## Article Overview

- Part 1 of this article explores the problem of getting good testing coverage of complex software.
- Part 2 of this article discusses solution approaches (including those identified in the NIST research).
- Part 3 of this article explores an approach to improving on the “best” solution identified in part 2.