
Article Information

  • Title: Project management using software reliability growth models - Goel-Okumoto software reliability growth model - technical
  • Author: Gregory A. Kruger
  • Journal: Hewlett-Packard Journal
  • Print ISSN: 0018-1153
  • Year: 1988
  • Issue: June 1988
  • Publisher: Hewlett-Packard Co.

Project Management Using Software Reliability Growth Models

Gregory A. Kruger

SIGNIFICANT IMPROVEMENT in software reliability calls for innovative methods for developing software, determining its readiness for release, and predicting field performance. This paper focuses on three supporting strategies for improving software quality. First, there is a need for a metric or a set of metrics to help make the decision of when to release the product for customer shipments. Second, accurately estimating the duration of system testing, while not directly contributing to reliability, makes for a smoother introduction of the product to the marketplace.

Finally, achieving significant improvement is easier given the ability to predict field failure rates, or perhaps more realistically, to compare successive software products upon release. Although this third strategy will be discussed in this paper, the emphasis will be on choosing the right software reliability metric and confidently managing the testing effort with the aid of software reliability growth models. For a more thorough discussion of estimating software field failure rates, see Drake and Wolting.[1]

Defect per Hour as the Quality Metric

Two principal reliability measures are:

* Defect density expressed as defects per thousand lines of noncomment source statements (KNCSS)

* Number of new defects found per hour of system testing.

The first measure will show progress on successive software projects, while the second measure is more helpful for real-time project management.

The user is unlikely to perceive a software product in terms of lines of code, since this is an invisible dimension of the product. The customer perceives quality in terms of how often a problem is encountered. This is analogous to a hardware product. We speak of the mean time between failures (MTBF) for repairable instruments, and in fact, failure rate and its reciprocal, MTBF, are appropriate measures of software quality. It is likely that two software products with the same failure rate during operation will be perceived by the customer as equally reliable, however different their sizes in lines of code.

Although ascertaining the customer-observed failure rate is difficult at best, measuring the failure rate during system testing is straightforward. For each successive time period, a week for example, record both the engineering hours invested in testing and the number of new defects found. Dividing the number of defects by the hours spent testing gives an estimate of the "instantaneous" new-defect-finding rate. As defects are found and removed from the software, this rate declines. The decreasing new-defect-finding rate and the corresponding increasing mean time between finding new defects provide the criteria by which progress and the ultimate conclusion of system testing can be judged.
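The bookkeeping just described is simple enough to sketch in a few lines. The weekly figures below are hypothetical, invented for illustration; only the procedure (divide new defects by hours tested, take the reciprocal for MTBF) comes from the text:

```python
# Estimate the instantaneous new-defect-finding rate from weekly test logs.
# Each record is (engineering hours of testing, new defects found) for one week.
# The numbers are illustrative, not taken from the projects in this paper.
weekly_logs = [(200, 52), (210, 40), (190, 27), (205, 18), (198, 11)]

cumulative_hours = 0.0
for hours, defects in weekly_logs:
    cumulative_hours += hours
    rate = defects / hours    # new defects per hour of testing
    mtbf = hours / defects    # mean hours between finding new defects
    print(f"t = {cumulative_hours:6.0f} h  rate = {rate:.3f}/h  MTBF = {mtbf:.1f} h")
```

As defects are removed, the rate column falls week over week and the MTBF column rises, which is exactly the progress signal used to judge system testing.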

Reliability Growth Modeling

There is considerable statistical literature on modeling the reliability growth process of finding and fixing defects in a software product. Available models can be classified into four broad categories:[2]

* Input-domain-based models

* Fault-seeding models

* Time-between-failures models

* Fault-count models.

Input-domain-based models can be characterized as a basic sampling theory approach. One defines the input domain of a software system to be the set of all possible input data the software will encounter during operational use. The reliability can then be estimated by taking a representative sample from the input domain and looking at the resultant failure rate when the sample data is input to the system for execution. Although the input domain sampling procedure will not be strictly followed here, the concepts embodied in this technique turn out to be helpful in designing the testing process and understanding the limitations of making estimates of field performance based on defect-finding rates observed during system testing.

The approach in fault-seeding models is to seed a known number of defects into a software product. The software is then tested using a process that presumably has equal probability of finding a seeded or an indigenous defect. The numbers of indigenous and seeded faults found are used to estimate the system reliability. This concept has several implementation problems and draws considerable skepticism from project managers and designers alike.

The preferred data in time-between-failures models is the observed running times between successive failures. Typically it is assumed that the time between two consecutive failures follows a distribution whose parameters are dependent upon the number of defects remaining in the software during that time interval. These distribution parameters can be estimated from the observed time-between-failures data. However, the actual running time between successive failures may be a difficult measure to extract from the software testing process.

In fault-count models, the variable of interest is the number of defects observed per specified time interval. The basic idea is that the number of failures observed per time interval can be modeled according to a Poisson process. The Poisson distribution is widely used to model the number of occurrences of some event in a time or space interval. As previously noted, the defect-finding rate statistic is relatively easy to capture. For this reason, the focus of this paper will be on the category of fault-count models.

John Musa's execution-time model, discussed by Drake and Wolting,[1] is an example of a fault-count model. Although their derivation differs from Musa's, Goel and Okumoto[3] propose essentially the same model. Because of its simplicity and intuitiveness and this author's preference for the derivation of the model from the underlying assumption of a Poisson process, the Goel-Okumoto model will be used here.

The Goel-Okumoto Model

The model assumes that between software changes, the number of defects observed during each hour of test will follow the Poisson distribution with a constant average, λ. Note that this defect rate includes the occurrence of both new and repeat defects. If the observed defects are fixed immediately, the Poisson parameter λ will be decreasing every hour of testing (or at least decreasing with every code correction). In this case, we have a nonhomogeneous Poisson process and can expect the defect-finding rate to decrease exponentially until software changes cease. Conversely, as the defect rate declines, the cumulative number of defects found will increase and asymptotically approach a constant. Since there is a lag in the correction of defects, the ideal state of instantaneously fixing defects when found can be approximated by counting only the new defects found each hour.

The functional form of the model is as follows. The cumulative number of defects found by time t is given by

m(t) = α(1 - e^(-βt))

and the instantaneous new-defect-finding rate at time t is given by

l(t) = m'(t) = αβe^(-βt).

In this model, α is the expected total number of defects in the software, β is the initial defect-finding rate divided by the expected total number of defects in the software, t is the cumulative hours of system testing, and m'(t) denotes the derivative of m(t).

Figure 1 shows the typical shape of l(t) and m(t).
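The two functions are a one-liner each. As a minimal sketch, the code below evaluates them with the Project B parameter estimates reported later in the paper (α = 1239, β = 0.000484); the choice of evaluation points is illustrative:

```python
import math

# Goel-Okumoto model: m(t) is the expected cumulative number of defects
# found by test hour t; l(t) = m'(t) is the instantaneous rate at hour t.
def m(t, alpha, beta):
    return alpha * (1 - math.exp(-beta * t))

def l(t, alpha, beta):
    return alpha * beta * math.exp(-beta * t)

# Project B estimates from later in this paper.
alpha, beta = 1239, 0.000484
print(l(0, alpha, beta))      # initial rate: alpha * beta, about 0.6/hour
print(l(5760, alpha, beta))   # rate after Project B's 5760 test hours
print(m(5760, alpha, beta))   # cumulative defects expected by that point
```

Evaluating l(t) at 5760 hours gives roughly 0.037 defects per hour, consistent with Project B reaching its 0.04-per-hour release criterion at about that point.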

It should be noted that the parameter α is actually an estimate of the total number of defects the system testing process is capable of detecting. The input domain concept is useful for understanding this point. If there is some stratum of the input domain that the system testing process is not covering, the parameter α is actually an estimate of the total number of defects minus those generated by data from that specific stratum.

There are three key assumptions underlying this model:

* All new coding is completed before the start of system testing.

* Defects are removed with certainty and without introducing new defects.

* Testing methods and effort are homogeneous, that is, the last hour of testing is as intense as the first.

The first assumption ensures that the defect-finding rate follows a monotonic pattern. While the second assumption is difficult to guarantee, instituting strict security measures will minimize the rate of defect introduction. The phasing of distinct test activities, the introduction of new testing tools, the addition of new team members, and software testing burnout make meeting the third assumption difficult, as well. However, good planning of the testing process and averaging the defect-finding rate over a calendar week will tend to smooth out differences in testing effort.

Fitting the Model on a Project

Consider the application of the Goel-Okumoto model to a 90-KNCSS firmware product. This product had completed system testing eighteen months earlier, but records were available from which a data set could be constructed to test the feasibility of the model. For each of 43 weeks of system testing, the cumulative test hours, the instantaneous new-defect-finding rate, and the cumulative number of defects found were calculated. However, because the number of hours invested in system test during the first several weeks of testing was highly variable, the first 19 weeks of data were condensed into five data points, each representing approximately 200 hours of testing. This left 29 sets of data values, each representing at least 200 consecutive hours of testing. The nonlinear least squares fit of the l(t) model to this data appears in Fig. 2. The obvious departure of the actual defect-finding rate data from the model at about 9000 cumulative hours appears to be explained by the addition of several experienced designers to the testing team. This is a clear violation of the assumption that the testing effort remains homogeneous. Using the parameter estimates obtained from fitting l(t), the m(t) model and cumulative data are plotted in Fig. 3. Parameter estimates and standard errors from fitting l(t) are:

Estimates: α = 2307, β = 0.000384
Standard errors: σ_α = 362, σ_β = 0.000102

There are some statistical advantages to fitting l(t) to the defect-finding rate data and using the resulting parameter estimates to plot m(t). By the nature of cumulative data, the residuals from fitting m(t) are serially correlated. However, the residuals from fitting l(t) to the instantaneous defect-finding rates are well-behaved, more closely conforming to the standard assumption of normally distributed errors with mean zero. On the other hand, l(t) can only be fit if there are enough hours of testing and defects found to generate reasonable estimates of the instantaneous defect-finding rate. If a project will be undergoing a small number of test hours, so that blocks of time (such as a week) cannot be aggregated to give an estimate of the defect-finding rate, then it is difficult to obtain parameter estimates by the method of least squares. In this case, maximum likelihood estimation should be used to fit the m(t) function to the cumulative data. This should not be interpreted to mean that maximum likelihood estimation is an inferior technique to be employed when the least squares method is not feasible. On the contrary, it can be demonstrated that maximum likelihood estimates are among the best. The intent here is to show what is possible with simple least squares procedures.

Using the Model in Real Time

The project discussed above, project A, demonstrates that real software testing data does follow the nonhomogeneous Poisson process model. Another project, project B, an application software product consisting of 100 KNCSS, demonstrates the potential for using the defect-finding rate metric and the Goel-Okumoto model in managing the system testing process.

Setting the Release Criteria. First, a defect-finding rate objective must be incorporated into the existing set of software release requirements. The total release checklist would then include such things as all functionality complete, all performance specifications met, no known serious defects, and the defect-finding rate criterion. The question remaining is, "What would be an appropriate value for this defect rate?" Project A concluded system testing with a rate of 0.03 defects per hour. As it turned out, the firmware proved to have extremely high reliability in the field. However, it is not clear that a 0.03 defect-finding rate is appropriate for all projects, particularly on software rather than firmware projects. In addition, arguments can be made that the relative severity of each defect found should be used to express the release criteria in terms of a weighted defect-finding rate. Ultimately, based upon work at other HP divisions and the belief that the defect weighting would be quite severe, a release requirement of 0.04 weighted defects per test hour was established. A team of four project engineers reviewed all defects found and scored each on a 1-9 scale, with 1 being a cosmetic flaw and 9 being most serious. These scores were then normalized by dividing by nine. The team recognized that the software reliability modeling process would be dependent upon their ability to weight the defects consistently throughout the duration of the project.

Applying the Model. Once the release requirement is set, the Goel-Okumoto model becomes useful as a real-time management tool. Having fit the model to the most recent data, an estimate for the total test hours required to release is obtained by calculating where the curve crosses the release requirement. The management team can then review weekly plots of the defect-finding rate and estimates of the total test hours to release. These plots enable the team to predict the project's conclusion more accurately and to judge progress along the way. In addition, the engineering staff conducting system testing may find the data to be morale building. Without some measure of progress, system testing is often perceived to be a never-ending process.
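Finding where the rate curve crosses the release requirement has a closed form: setting l(t) = αβe^(-βt) equal to a release rate r and solving gives t = ln(αβ/r)/β. A sketch, using Project B's final parameter estimates and its 0.04-per-hour criterion as illustrative inputs:

```python
import math

# Test hours at which the Goel-Okumoto rate curve l(t) = alpha*beta*e^(-beta*t)
# falls to the release criterion r, from solving l(t) = r for t.
def hours_to_release(alpha, beta, r):
    return math.log(alpha * beta / r) / beta

# Illustrative inputs: Project B's final estimates and the 0.04/hour criterion.
alpha, beta, r = 1239, 0.000484, 0.04
print(hours_to_release(alpha, beta, r))   # ~5600 hours; Project B took 5760
```

In practice this calculation would be redone each week as the fit is refreshed, giving the weekly estimates of total test hours to release described above.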

Project B ultimately reached the release goal after 5760 hours of testing over 22 calendar weeks. Figs. 4 and 5 show the final plots of the l(t) and m(t) functions fit to Project B system testing data using nonlinear least squares. The parameter estimates and standard errors from fitting l(t) are as follows:

Estimates: α = 1239, β = 0.000484
Standard errors: σ_α = 116, σ_β = 0.000069

One can simplify the analysis by first taking the natural log of the defect-finding rate data. A quick check of the l(t) function verifies that taking the log transformation makes the model linear. Fig. 6 shows the regression line fit to the logged data. The benefit of operating with a linear model is that standard techniques can be used to set confidence bounds on the estimates of total system test hours needed to reach the release requirement. As can be seen from Fig. 7, the estimates of total test hours proved to be both consistent and accurate. The project concluded system testing within 300 hours of the estimate made eleven weeks previously. The accuracy of these estimates far exceeded our expectations.
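Since ln l(t) = ln(αβ) - βt, an ordinary straight-line least squares fit to the logged rates recovers β from the slope and α from the intercept. The sketch below checks this on noise-free synthetic data generated from known parameters; real rate data would of course scatter around the line:

```python
import math

# ln l(t) = ln(alpha*beta) - beta*t is linear in t, so ordinary least
# squares on the logged rates recovers the Goel-Okumoto parameters.
def fit_log_linear(ts, rates):
    logs = [math.log(r) for r in rates]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(logs) / n
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, logs))
             / sum((t - t_bar) ** 2 for t in ts))
    intercept = y_bar - slope * t_bar
    beta = -slope                        # slope is -beta
    alpha = math.exp(intercept) / beta   # intercept is ln(alpha*beta)
    return alpha, beta

# Self-check on exact data generated from known parameters.
true_alpha, true_beta = 1239, 0.000484
ts = [500.0 * i for i in range(1, 12)]
rates = [true_alpha * true_beta * math.exp(-true_beta * t) for t in ts]
print(fit_log_linear(ts, rates))   # recovers (1239, 0.000484) up to rounding
```

With the model in linear form, textbook regression confidence intervals on the slope and intercept translate directly into the confidence bounds on total test hours mentioned above.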

Estimating Field Failures

Two different but related estimates can be made concerning the number of defects the customer will observe in the first year after release. The first is the expected number of times the user will experience a defect in the system including the recurrence of a previously observed defect. The second is the expected number of unique defects encountered.

New and Repeat Defect Rate. Including repeat defects in estimates of a software system's failure rate can be considered as taking the perspective of the customer. The customer perceives quality in terms of how often a defect is encountered. The user may not be able to tell if two or more defects are actually caused by the same fault in the code. Estimating the combined new and repeat defect rate may be possible using the l(t) model fit during system testing. Recall that the defect-finding rate declines because defects are removed from the software as they are found. Once QA ends, defects are no longer being fixed. Therefore, defects should now be observed according to a Poisson process with a fixed average number of defects per hour of use. The best estimate available for this is the defect-finding rate observed at the conclusion of system testing--0.04 weighted defects per hour for Project B. The question is, how good an estimate is this for what the customer will experience?

The answer is dependent upon the difference between a customer's and an HP engineer's defect-finding efficiency. It is likely that the test engineer, rather than the customer, is the more efficient at finding defects. If an acceleration factor, K, could be identified expressing the ratio of customer hours to one hour of system testing, the customer's defect-finding rate could be estimated as 1/K times the defect rate at the conclusion of system testing. The simplifying assumption here is that the customer's defect-finding efficiency is constant. Evidence to date suggests that a value of K = 50 might be a reasonable estimate of the acceleration factor. The estimated combined new and repeat defect-finding rate would then be 0.0008 weighted defects per hour.

New-Defect-Finding Rate. Considering the number of unique defects in the software is taking the perspective of the company; we want to know if code revisions are likely to be necessary. When considering repeat failures, we stopped on the exponential failure rate curve and extended a fixed horizontal line at the defect-finding rate observed at the end of system testing. To estimate unique failures, we continue down the l(t) curve for the first year after release to shipment. Conversely, the cumulative number of defects found should continue up the m(t) curve.

It is a trivial matter to estimate the number of unique defects the system testing team would find in an additional t hours of testing. This is simply m(T + t) - (total found so far), where T is the total number of test hours accumulated to date. In fact, the total number of defects remaining in the software can be estimated by α - (total found so far).

For the Project B software, 1239 - 1072 = 167 weighted defects.

An estimate of how many of these remaining defects will be found in the field can be accomplished by defining a new m(t) function for the customer. The α parameter can be set equal to the estimated number of defects remaining. The β parameter can be obtained if we are once again willing to use an acceleration factor. For the Project B software, the cumulative number of unique defects found after t hours of customer use is given by m(t) = 167(1 - e^(-0.000484t/50)).

For example, in 2000 hours of customer use we could expect 3.2 unique weighted defects to be found.
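This field estimate is a direct evaluation of the new m(t). A sketch using the Project B figures from the text (167 weighted defects remaining, β = 0.000484, acceleration factor K = 50):

```python
import math

# Expected unique defects found after t hours of customer use, modeled by
# restarting m(t) with the defects remaining at release and slowing beta
# by the acceleration factor K (customer hours per hour of system testing).
def field_defects(t, remaining, beta, K):
    return remaining * (1 - math.exp(-beta * t / K))

# Project B figures from the text: 167 weighted defects remaining, K = 50.
print(field_defects(2000, 167, 0.000484, 50))   # ~3.2 weighted defects
```

The same function evaluated at other usage horizons gives the curve the company can use to judge whether a code revision is likely to be needed.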

Field Failure Rates in Perspective. Estimates of customer defect rates and defect totals should be taken with a healthy degree of skepticism. It is a risky proposition to propose acceleration rates and attempt to estimate customer observed defect rates. Experience to date clearly shows that our customers are not finding as many defects as estimated by the model. However, these two projects have not provided sufficient insight for us to estimate confidently the relationship between model projections and customer experience. After more products are released according to this process, comparisons will be drawn between the defect rate observed in-house and the subsequent rate in the field. An empirical estimate of the acceleration factor may be obtained in this way. To date, the results from these two projects are as follows.

Recall that on Project A the analysis was conducted on unweighted data--all defects were counted equally regardless of severity. From the model, the number of unweighted defects estimated to be remaining in Project A is 2307 - 2167 = 140. After 18 months in the field, there have been three customer reported defects. All of these have been resolved to the satisfaction of our customers. Three defects found out of an estimated 140 remaining suggests a ratio of estimated to actual defects of 47 to 1.

As has been stated, the estimated number of weighted defects remaining on Project B is 167. After 12 months in the field, there have been four defects identified, all of which have been resolved in a subsequent release. After applying the same weighting criteria that were used during system testing, these four defects represent 2.4 weighted defects. Since this product has not been in the field as long as Project A, we could multiply by 1.5 to give 3.6 weighted defects after 18 months. The ratio of estimated to actual defects would then be 46 to 1--amazingly close to the results from Project A but still conjecture at this point.

Conclusions

The defect-finding rate metric and the Goel-Okumoto model proved, on one project, to be a real contribution to managing the system testing phase of the software development process. Judging from the fit to historical data, the model would have been helpful on an earlier project as well. Through weekly plots of the data and the model, the management team received feedback on testing progress and information to aid resource planning. Ultimately, there was a clearer decision as to whether to release the product for customer shipments.

In cases with limited test hours, the straightforward application of the Goel-Okumoto model would be to fit the m(t) model to the unweighted cumulative defect data using the method of maximum likelihood. If there are enough test hours to allow reasonable estimates of the instantaneous defect-finding rates, fitting l(t) using least squares techniques is possible, although not superior to the maximum likelihood procedure. Based upon the results from these two projects, it appears that the modeling process works for both unweighted and weighted defect data. Extrapolations from the model to field conditions have proved to overestimate the number of defects customers will encounter. Further study is required to estimate confidently the relationship between model estimates and customer observed failure rates. However, we are continuing to tighten the failure rate release criterion as new products enter final system test.

An alternative approach to making the release-to-production decision would be to plot the mean time between finding new defects, 1/l(t). This metric could then be used to make an economic decision on when continued testing would be too costly.
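The mean time between finding new defects is simply the reciprocal of the fitted rate. A sketch, again using the Project B estimates as illustrative inputs:

```python
import math

# Mean time between finding new defects under the Goel-Okumoto model:
# 1/l(t), which grows exponentially as testing and defect removal proceed.
def mtbf(t, alpha, beta):
    return 1.0 / (alpha * beta * math.exp(-beta * t))

alpha, beta = 1239, 0.000484   # Project B estimates
print(mtbf(0, alpha, beta))      # under 2 hours between new defects at the start
print(mtbf(5760, alpha, beta))   # roughly 27 hours by release
```

Plotted weekly, this curve conveys the same information as l(t) but in units (hours of testing per new defect) that map directly onto the cost of continued testing.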

Acknowledgments

I would like to thank Doug Howell, Karen Kafadar, and Tim Read for their suggestions for making the statistical portions of this paper clear. I am also indebted to Peter Piet for the long telephone conversations in which we worked through the technical details of estimating field failure rates.

COPYRIGHT 1988 Hewlett Packard Company
COPYRIGHT 2004 Gale Group
