
Article Information

  • Title: Validation and further application of software reliability growth models - HP's Lake Stevens Instrument Div - technical
  • Author: Gregory A. Kruger
  • Journal: Hewlett-Packard Journal
  • Print ISSN: 0018-1153
  • Year: 1989
  • Issue: April 1989
  • Publisher: Hewlett-Packard Co.

Validation and Further Application of Software Reliability Growth Models

Gregory A. Kruger

At HP's Lake Stevens Instrument Division, a software reliability growth model has demonstrated its applicability to projects ranging in size from 6 KNCSS to 150 KNCSS (thousand lines of noncomment source statements), and in function from instrument firmware to application software. Reliability modeling curves have been used to estimate the duration of system integration testing, to contribute to the release-to-sales decision, and to estimate field reliability. Leveraging from the basic model, project managers are beginning to plan staffing adjustments as the QA effort moves through the defect-fixing-limited phase and into the defect-finding-limited phase.

Basic Model

In the fall of 1986, a software reliability growth model's good fit to historical data on a previous firmware product led to the development of a set of release criteria, with defects per system test hour (QA hour) as the principal quality measure. The model and release criteria were then applied in real time to a new application product. The modeling effort aided in predicting when the product was ready for release to customer shipments and provided estimates for the number of defects that might be found in the field.

The basic exponential model is based upon the theory that the software defect detection and removal effort will follow a nonhomogeneous Poisson process. In this process the defect arrival rate is assumed to decrease with every hour of testing (or at least with every correction). The model has two components.

The cumulative number of defects found by time t is given by m(t) = a(1 − e^(−(k/a)t)), and the instantaneous new defect-finding rate at time t is given by λ(t) = k·e^(−(k/a)t).

Fitting the model requires the estimation of parameters k, the initial defect discovery rate, and a, the total number of defects. The data required is obtained by recording on a daily or weekly basis the time spent executing the software and the resulting number of defects discovered. The model parameters may be estimated by the least squares, nonlinear least squares, or maximum likelihood method. In most cases, the maximum likelihood method is preferred.
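
To make the fitting step concrete, here is a minimal sketch in Python (not the LSID tool) that fits the exponential model to cumulative QA-hour/defect data using nonlinear least squares, one of the estimation methods mentioned above; the tooling described in this article preferred maximum likelihood. All data values below are hypothetical.

```python
# Minimal sketch: fit m(t) = a * (1 - exp(-(k/a) * t)) to cumulative
# (QA hours, defects found) data by nonlinear least squares.
# The data below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def cumulative_defects(t, a, k):
    """m(t): expected cumulative defects after t hours of testing."""
    return a * (1.0 - np.exp(-(k / a) * t))

def finding_rate(t, a, k):
    """lambda(t): instantaneous new-defect-finding rate at time t."""
    return k * np.exp(-(k / a) * t)

# Hypothetical weekly totals: cumulative QA hours and cumulative defects found.
hours   = np.array([30, 60, 90, 120, 150, 180, 210, 240], dtype=float)
defects = np.array([28, 52, 70,  85,  96, 104, 110, 114], dtype=float)

# Initial guesses: a somewhat above the defects found so far, k from the first week.
(a_hat, k_hat), _ = curve_fit(cumulative_defects, hours, defects,
                              p0=[defects[-1] * 1.5, defects[0] / hours[0]])

print(f"estimated total defects a  = {a_hat:.0f}")
print(f"initial finding rate    k  = {k_hat:.2f} defects/hour")
print(f"current finding rate       = {finding_rate(hours[-1], a_hat, k_hat):.3f} defects/hour")
print(f"estimated residual defects = {a_hat - cumulative_defects(hours[-1], a_hat, k_hat):.0f}")
```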

Considering typical software development and system testing practices, the assumptions necessary for the applicability of Poisson theory would seem to negate the use of the model. Key assumptions of the model and the corresponding realities are:

* Assumption: All functionality is completed before the start of system testing.

Reality: Many products enter system testing without all the features in place.

* Assumption: Testing can be considered to be repeated random samples from the entire input domain.

Reality: There is some random testing, but typically testers are more structured and systematic in the selection of test cases.

* Assumption: Defects found are removed with certainty and no new defects are introduced (a perfect repair).

Reality: A defect repair may introduce new defects.

* Assumption: The times between failures are independent.

Reality: When a defect is found in a particular area of the software, the suspicion that more defects lurk in the same area leads testers to probe that area further. This usually finds more defects, which is good, but it makes the arrival rate of defects dependent on when the last one was found.

As has been said, with such a set of assumptions, it would seem unlikely that this model would fit real-world data. However, some aspects of the testing process at Lake Stevens approximate these conditions. First, our life cycle calls for all functionality to be completed by the time we start formal system integration testing. Typical projects have 95% or more of their functionality complete by this time. Second, the entire set of functionality is subdivided and assigned to different individuals of the testing team. Therefore, while the testing process cannot be considered to be repeated random samples from the input domain, it is at least sampling from the entire functionality set as time progresses. This is in contrast to a testing process wherein some subset of the functionality is vigorously tested to the exclusion of all others before moving on to another subset and so on. Regarding the third assumption, strict revision control procedures at least maintain some control over the rate of defect introduction. Finally, nothing about the Lake Stevens development process justifies the assumption that the times between failures are independent. After finding a serious defect in a portion of the product, testing effort often intensifies in that area, thus shortening the next time to failure.

The model's success in describing the projects at LSID demonstrates some degree of robustness to these assumptions. Our past and continued application of software reliability theory is based not on a fundamental belief in the validity of the assumptions, but on the empirical validation of the model. Therefore, we have continued to use software reliability growth models with the following objectives in mind:

* To standardize the application of the model to all software products produced at LSID

* To put in place a set of tools to capture and manage the data and obtain the best fit curves

* To use the defect-finding rate and the estimated defect density to define the release goal

* To predict the duration of the QA phase before its start

* To understand the relationship between model estimates and field results.

Standardized Application

To date, software reliability growth modeling has been conducted on eleven projects that have since been released for customer shipment. Two demonstrated an excellent fit to the model, two a very good fit, four a fair conformance, and three a poor fit. Fig. 1 shows the curves for one of the projects on which the model gave an excellent fit. Contrast these results with the model's performance on the project shown in Fig. 2. Note that time in this case is measured in calendar days rather than test hours. Here the cumulative defects begin to taper off only to start up again. These results reflect inconsistent testing effort, which is not picked up by simply measuring calendar days of testing. The curves in Fig. 2 were obtained by independently fitting the basic model before and after the change in testing effort; these two best-fit models were then tied together to form the piecewise curves shown.
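
The piecewise treatment used for Fig. 2 can be sketched in the same way: fit the basic model separately to the data before and after the (assumed known) change in testing effort, then tie the two best-fit curves together at the change point. The data and change point below are hypothetical.

```python
# Hedged sketch of a piecewise fit around a change in testing effort.
# All data are hypothetical; the change point is assumed to be known.
import numpy as np
from scipy.optimize import curve_fit

def m(t, a, k):
    """Basic exponential model: expected cumulative defects after t units of testing."""
    return a * (1.0 - np.exp(-(k / a) * t))

days = np.arange(1, 21, dtype=float)                 # calendar days of testing
cum  = np.array([7, 13, 18, 22, 25, 28, 30, 32, 34, 35,
                 44, 51, 58, 62, 67, 70, 73, 75, 77, 78], dtype=float)
change = 10                                           # index of the last day before the change

# Segment 1: fit the model to the data up to the change point.
(a1, k1), _ = curve_fit(m, days[:change], cum[:change],
                        p0=[cum[change - 1] * 1.5, cum[0]])

# Segment 2: restart the clock and the defect count at the change point, then refit.
t2 = days[change:] - days[change - 1]
d2 = cum[change:] - cum[change - 1]
(a2, k2), _ = curve_fit(m, t2, d2, p0=[d2[-1] * 1.5, d2[0]])

def piecewise(t):
    """Tie the two best-fit curves together at the change point."""
    t = np.asarray(t, dtype=float)
    before = m(t, a1, k1)
    after  = cum[change - 1] + m(t - days[change - 1], a2, k2)
    return np.where(t <= days[change - 1], before, after)

print(piecewise(days))
```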

Tools

The defect tracking system (DTS), an internal defect tracking tool, is used by all project teams to log defects found during system testing. In software reliability modeling it is important to record all time spent exercising the software under test regardless of whether a defect is discovered. DTS has proven to be unsatisfactory for capturing QA hours that do not produce a software defect. Therefore, project teams separately log test hours at the end of each day.

The DTS data is loaded into an Informix data base so that it can be sorted and retrieved as desired. On projects using DTS for tracking QA time as well as defect statistics, Informix reports generate files with weekly (or daily) QA hour and defect total data pairs. On projects tracking QA time separately, the weekly (or daily) defect totals are retrieved from the Informix data base and matched with the appropriate QA hours. In either case, the file of cumulative QA hours and cumulative defects found is submitted to a program that obtains the best-fit model parameters by the method of maximum likelihood. At the present time, plots for distribution are generated using Lotus 1-2-3. Future plans call for using S, a statistical package that runs in the HP-UX environment, to generate the graphics, thereby conducting the data manipulation, analysis, and plotting all on one system.

Release Goal

The software modeling process provides two related metrics that help support a release-to-customer-shipments decision: the defect-finding rate and the estimated number of unfound defects. A specific goal for one of these two metrics must be established if the model is to be used for predicting the conclusion of system testing.

The defect-finding rate is a statistic you can touch and feel. It can be validated empirically--for example, 100 hours of test revealed four defects. On the other hand, one can never really measure the number of defects remaining; this metric can only be estimated. Although the two measures are related, it is not true that two projects releasing at the same defect-finding-rate goal will have the same estimated number of defects remaining. Couple this fact with the recognition that the size of the product has no bearing on the model fit or on the resulting estimated number of residual defects, and it is clear that two projects releasing at the same find rate could have quite different estimated residual defect densities. Because of its observability, the defect-finding rate has been used as the principal release goal on all projects to date except one. However, both the failure rate and the estimated residual defect density are monitored and used in aiding the release decision.
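
A small, purely hypothetical calculation makes the distinction concrete: two projects that hit the same finding-rate goal can carry very different estimated residual defect densities, because the residual count at that rate depends on the fitted total a and the product size, not on the rate alone. All figures below are illustrative and not taken from the projects discussed in this article.

```python
# Hypothetical illustration: same release-rate goal, different residual densities.
import numpy as np

def residual_defects(a, k, t):
    """a - m(t): defects estimated to remain after t hours of testing."""
    return a * np.exp(-(k / a) * t)

def hours_to_reach(rate_goal, a, k):
    """Solve lambda(t) = k * exp(-(k/a) * t) = rate_goal for t."""
    return (a / k) * np.log(k / rate_goal)

goal = 0.08  # defects per test hour at release
for name, a, k, kncss in [("Project 1", 300.0, 1.0, 20.0),
                          ("Project 2", 900.0, 1.0, 30.0)]:
    t = hours_to_reach(goal, a, k)
    r = residual_defects(a, k, t)
    print(f"{name}: {t:.0f} QA hours to reach {goal}/hour, "
          f"~{r:.0f} residual defects ({r / kncss:.1f} per KNCSS)")
```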

The Project E Experience

The one project to date using a goal of ending system test with a certain residual defect density will serve as a good illustration of the contributions and limitations of software reliability growth models. Project E is an application software product of 156 KNCSS. This project represents a new release of a previously developed product and is roughly two-thirds reused or leveraged code. The stated goal at the start of system integration testing was to achieve an estimated residual defect density of 0.37 defects per KNCSS, a goal derived from the performance of the first release of this product. Such a goal means that the best-fit model should be estimating 58 residual defects (0.37 × 156 ≈ 58).

A team of engineers was assembled to conduct testing while the project team fixed defects. The data was plotted at roughly 30-hour testing intervals and the model refit each week. The most recent curve was used to estimate the QA hours required to achieve the objective and these estimates were plotted weekly with statistical confidence limits as shown in Fig. 3. In mid-April, the decision was made to release the project for customer shipments and to continue further testing and refinements for a final release in June. The team had all but reached the goal and the data had tracked the model very well. At this point, the engineers on the testing team disbanded and returned to their original project assignments. The design team then took on the task of conducting both continued testing and defect resolution. With only the designers looking, the defect discovery rate jumped up rather than continuing to follow the curve as can be seen in Fig. 4. The designers were testing specific areas of the code (directed testing), so an hour of testing now was not equivalent in intensity to an hour of testing with the previous test team. The testing process was not meeting the assumption that testing can be considered to be repeated random samples from the entire user input domain.

What is clear from this project is that the failure rate data and curves are modeling more than the software product alone. They are modeling the entire process of testing. The estimates of failure rates and residual defect densities are only as good as the testing process itself. The degree to which these statistics match field results will depend upon the degree to which the testing matches the customer's use profile. The identification of the customer's use profile and the incorporation of that information into the testing strategy is a topic for further investigation.

Before QA Begins

Naturally we would like to estimate the duration of the QA phase before it begins. But fitting a model to do estimation must wait for testing to begin and for enough data to be collected before an effective statistical analysis can be conducted. However, it is possible to use results from past projects to estimate the two model parameters a and k.

In preparation for testing a recent software product, Project F, we reviewed the total number of defects discovered during system integration testing on past projects. Defect densities appeared to fall between 12 and 20 defects per KNCSS. Project F had 28.5 KNCSS, so the likely range for the first model parameter, a, was calculated to be 342 to 570 defects. Again looking at past projects, the initial defect discovery rate averaged around one defect per hour, so the other model parameter, k, could be set to one. Given a goal for the failure rate of 0.08 defects per hour, an expected range of 864 to 1440 QA hours was calculated.
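
The arithmetic behind that range follows directly from the model: with k = 1 defect per hour, the QA hours needed to drive the finding rate down to the goal come from solving λ(t) = k·e^(−(k/a)t) = 0.08 for t. A small script reproducing the figures quoted above:

```python
# Worked version of the Project F estimate, using the figures given in the text.
import math

kncss         = 28.5          # size of Project F
density_range = (12.0, 20.0)  # defects per KNCSS, from past projects
k             = 1.0           # initial finding rate, defects per QA hour
rate_goal     = 0.08          # release goal, defects per QA hour

for density in density_range:
    a = density * kncss                        # expected total defects (model parameter a)
    t = (a / k) * math.log(k / rate_goal)      # QA hours until lambda(t) reaches the goal
    print(f"{density:.0f} defects/KNCSS -> a = {a:.0f} defects, ~{t:.0f} QA hours")

# Prints roughly 342 defects / 864 hours and 570 defects / 1440 hours,
# matching the range quoted above.
```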

Management ultimately needs an estimated date of completion, so the expected QA hours required for system testing must be converted to calendar time. To accomplish this we again reviewed the data on past projects and discovered an amazing consistency: four QA hours per day per person doing full-time testing, and an average of 2.3 defects fixed per day per person doing full-time fixing. Given the number of team members capable of fixing, the number capable of finding, and those qualified to do both, the required QA hours for testing could now be converted to calendar time. Fig. 5 shows the final QA projections for Project F and the staffing levels used to convert the QA hours into calendar time. Note that the staffing levels given correspond to the midrange assumption of 16 defects per KNCSS.
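
As an illustration of that conversion, the sketch below turns required QA hours into working days using the averages quoted above (four QA hours per day per full-time tester, 2.3 defect fixes per day per full-time fixer). The staffing levels are hypothetical, and the real projection in Fig. 5 also shifted people between finding and fixing over time rather than holding them constant.

```python
# Simplified sketch: convert required QA hours into calendar (working) days.
# Staffing levels are hypothetical; productivity averages are from the text.
import math

qa_hours_needed  = 1152    # midrange estimate: 16 defects/KNCSS * 28.5 KNCSS, k = 1, goal 0.08/hour
defects_expected = 456     # 16 defects per KNCSS * 28.5 KNCSS
testers, fixers  = 4, 3    # hypothetical full-time staffing

testing_days = qa_hours_needed / (testers * 4.0)   # days if limited by finding
fixing_days  = defects_expected / (fixers * 2.3)   # days if limited by fixing

# Calendar time can be no shorter than the slower of the two activities.
calendar_days = max(testing_days, fixing_days)
print(f"testing-limited: {testing_days:.0f} days, fixing-limited: {fixing_days:.0f} days")
print(f"estimated calendar time: ~{math.ceil(calendar_days)} working days")
```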

Recognize that as testing proceeds, testing and fixing resources will have to be shifted. Early in the process, the project is fixing-constrained because a few testers can find enough defects to keep all available fixers busy. Over time this changes until, late in testing, the project is finding-constrained, since it takes many people looking for defects to keep only a few fixers working. Also, the finders cannot be allowed to outstrip the fixers, creating a large backlog of unresolved defects. Such a situation only causes frustration for the finders because of testing roadblocks created by defects already found.

Our experience to date with using the model to estimate the duration of the QA phase before its start demonstrates the difficulty of estimating the two required model parameters without actual system test data. Project F concluded system integration testing with a total QA effort that was 225% of the original estimate. Over twice the expected number of defects were found and resolved. Not all of this error in estimation can be blamed on the failure of the model. In hindsight, we completely disregarded the fact that this was not a stand-alone project. Many of the problems encountered arose because Project F had to be integrated with another 130-KNCSS product.

These results indicate that adjustments are necessary in the way values for the model parameters are derived. For instance, currently the values for the parameters are averaged from a heterogeneous population of previous software projects. Fixing this problem means that if we want to estimate the QA duration for project X then we must base the model parameters on projects similar to project X. Project characteristics such as complexity, team experience, and the development language must be formally factored into any early estimates of QA duration.

Field Failure Results

Although the modeling process is helping to monitor progress through system testing and is aiding in the release decision, limited defect data from our field defect tracking system suggests that the curves may be overestimating the number of defects customers will discover by as much as a factor of forty. However, it is likely that only a portion of the actual field failures find their way into the tracking system.

Our experience testing one firmware project that was an enhanced version of an old instrument puts an interesting perspective on estimated residual defect densities. This particular product had been shipped for ten years at an average volume of over 200 units per month. Since a market opportunity existed for an updated version of that product, both hardware and firmware enhancements were incorporated into a new version. During system test, an obscure defect was discovered in a math routine that had existed not only in the original product since its introduction but also in several other products shipped over the last ten years. To the best of our knowledge, no customer or HP personnel had previously found that failure. Its existence was further evidence that the information coming back from the field does not give a true picture of residual defect densities. Not only do customer-observed failures go unreported, but it is highly likely that some failures will never be encountered during operation. It was reassuring to note that LSID's current testing process was uncovering defects so obscure as to be unobservable in ten years of field use.

Conclusion

With data collected during system integration testing, we have been able to use a software reliability model to estimate total testing effort and aid in assessing a project's readiness for release to customer shipments. Although the model appears to be somewhat robust to its underlying assumptions, future success will depend upon the integration of customer representative testing techniques into our existing testing process. In addition, there remains the challenge of using the model to estimate test duration before system integration begins. This will require a thorough analysis of data on past projects and key information on the current project to derive better early estimates of the model's parameters. Our ultimate objective remains to achieve validation of the modeling process through accurate field failure data. All of these areas will continue to be investigated because they are important in determining project schedules and estimating product quality.

COPYRIGHT 1989 Hewlett Packard Company
COPYRIGHT 2004 Gale Group
