
Article information

  • Title: Metrics, targets and performance.
  • Authors: Stevens, Philip; Stokes, Lucy; O'Mahony, Mary
  • Journal: National Institute Economic Review
  • Print ISSN: 0027-9501
  • Year of publication: 2006
  • Issue: July
  • Language: English
  • Publisher: National Institute of Economic and Social Research
  • Abstract: Keywords: Public sector performance; Productivity; Acute Trusts. JEL classifications: H11; H14; I12; I18
  • Keywords: Industrial productivity; Public sector

Metrics, targets and performance.


Stevens, Philip; Stokes, Lucy; O'Mahony, Mary


The setting and use of targets in the public sector has generated a growing amount of interest in the UK. This has occurred at a time when more analysts and policymakers are grasping the nettle of measuring performance in and of the public sector. We outline a typology of performance indicators and a set of desiderata. We compare the outcome of a performance management system--star ratings for acute hospital trusts in England--with a productivity measure analogous to those used in the analysis of the private sector. We find that the two are almost entirely unrelated. Although this may be the case for entirely proper reasons, it does raise questions as to the appropriateness of such indicators of performance, particularly over the long term.

Keywords: Public sector performance; Productivity; Acute Trusts

JEL classifications: H11; H14; I12; I18

1. Introduction

The setting and use of targets in recent years has generated a growing amount of interest (OXREP, 2003; JRSS, 2004), especially in high profile government services such as health and education. Public Service Agreement (PSA) targets were first established in 1998 as part of the Comprehensive Spending Review. The framework was based on four principles: outcome-focused goals; devolution of responsibility to public service providers themselves; arrangements for audit and inspection to improve accountability; and transparency (Hill, 2003). A number of changes have taken place since their introduction, including simplification and reducing the number of targets set and a greater emphasis on outcome-based measures.

The aim of such targets is to improve the quality, performance and accountability of public services. However, in the beginning, many targets were set without a clear idea of how to measure them. More recently, therefore, in the 2004 Spending Review, it was stressed that targets must be measurable.

Reliable measures of performance are required to examine whether targets have been met and to assess their impact on the provision of public services. It is important to have robust performance measures that cover the broad range of services delivered, rather than being specific to targeted areas. It is also important that the measures are appropriate for the job in hand.

In this paper we consider an example from the health sector, where targets have received much attention in the media as well as among health specialists in research and practice (Cutler, 2002; Kmietowicz, 2003; Miller, 2002; Rowan et al., 2004; Snelling, 2003). The PSA targets for health cover a range of areas, with four main objectives set out in the 2004 Spending Review PSAs: improving population health; improving health outcomes for individuals with long-term conditions; better access to health services; and an enhanced patient experience. Various targets are specified within these objectives, for example: a 10 per cent fall in health inequalities by 2010; a maximum 18 week wait for hospital treatment from receipt of a GP referral by 2008; and a decline in the adult smoking rate to less than 21 per cent by 2010.

Health is a high profile public service, seldom far from the top of the policy or public agenda. Measuring performance in health has generated considerable interest and is politically sensitive. One such measure of performance that was introduced in 2000 is the 'star ratings' whereby NHS Trusts are awarded between zero stars (worst performers) and three stars (best performers). These star ratings reflect how well NHS Trusts are performing in relation to a set of 'key targets' (set by government) and a larger set of broader performance indicators (the so-called 'balanced scorecard' indicators). This system of measuring performance has provoked much discussion, but there is little evidence to date on how star ratings compare to alternatives (an exception being the simple comparison of ratings with clinical outcomes of Rowan et al., 2004). More rigorous analysis is required in order to understand their effects, both on their intended 'targets' and their unintended consequences in other areas.

In this paper we compare the star ratings of NHS Acute Trusts with a measure familiar in analysis of the private sector, labour productivity. This builds on recent work on the measurement of public sector productivity (O'Mahony and Stevens, 2002, 2003; Pritchard, 2003; Mai, 2004; Atkinson, 2005) and for the health sector in particular (Pritchard, 2004; Hemingway, 2004; Lee, 2004; Dawson et al., 2005).

In section 2 we discuss the use of indices in measuring performance in public services. In section 3 we outline the issues relating to the measurement of productivity in the public sector and the health sector in particular. Section 4 outlines briefly the star rating system for acute trusts. In section 5 we set out the empirical method used to calculate our output, input and productivity indices. Our results are presented in section 6 and section 7 concludes.

2. The use of indices in measuring performance in public services

There exists a plethora of performance indices of one type or another. Even when these purport to measure the same thing, they are often in fact subtly and importantly different. Many are designed for specific purposes (and may or may not be fit for these), but are often pressed into alternative uses. The differences between indices and their fitness for a particular purpose are often unclear and so in this section we set out a typology of performance indices. These categories are not mutually exclusive and often indices that claim or are designed to be one type of index will in fact be something rather different altogether.

The utility of a performance index depends on its fitness for purpose and a number of more technical factors. We discuss these desiderata in this section, against which we can set any indices we encounter.

2.1 A typology of performance indices

We distinguish four types of indices used to measure performance:

* Output Index--how much of a service is being produced;

* Welfare Index--what the value is to final users;

* Performance Management Index--how the services are being provided;

* Composite Index--includes elements of the above three.

Each of these indices relates to a different concept of what the organisation is doing. In the case of a single service, the distinction between these is often plain. Consider the treatment of arthritis, for example. The output is the number of patients treated; welfare is reduction in pain and increase in mobility; and performance management might include the extent to which doctors are using appropriate treatments or how long patients wait for them. With multiple services there is a need to apply weights to each service, often combining indicators that are measured in non-comparable units. Most public sector organisations offer a number of different services or services with a number of different aspects. This is at the heart of the problem of public sector assessment (Stevens, 2005) and raises a number of potential problems, which we discuss in section 3 below.

2.2 Index desiderata

In order to employ indices to measure performance it is useful to begin by asking: what are the ideal properties of such indices? We would expect indices to satisfy the following list of properties:

A. Fit for purpose

In developing indices of performance, we should heed the advice of Hamlet to 'suit the action to the word, the word to the action'. Determining which index to use will depend on the question being addressed. If the purpose is to consider the contribution of the service to aggregate economic growth then an output index (with quality adjustment) should be employed. Such an index is required by the ONS for the sector division of constant price output in the national accounts, and is useful to bodies like the Treasury and Bank of England in understanding the performance of the economy. If the purpose is to measure consumers' well-being, then a welfare index is required. This is of interest to government departments providing the service as well as to central government. Departments as providers of services will also require performance management indices to gauge the processes being employed by producing units under their control. Finally auditors--e.g. the Audit Commission, Healthcare Commission etc.--will require composite indices to gauge overall performance.

B. Theoretical framework for defining domains

The indices should be grounded in theory to define what types of services it is appropriate to include in each index. Thus we should also heed the advice of Einstein to count what counts and not just what can be counted. This should feed in to the data requirements in order to enable the index to be calculated.

C. Theoretical framework for aggregation

The weights employed to aggregate should be based on a theoretical framework which sets out the conditions under which weighting schemes are appropriate. For example under competitive market conditions cost weights may be used to aggregate outputs, but may need modification if the underlying assumptions are not met. The marginal impact of the service on consumers' utility can be used as weights in a welfare index, but may also be employed in a quality adjusted output index. Costs or expected impacts on utility may also be used for management performance indices.

D. Technical properties of index numbers

Indices should satisfy a number of properties that allow logical consistency, including transitivity, independence of irrelevant alternatives and path independence.

Transitivity requires that if organisation A is ranked above B, and B is ranked above C, then A should be ranked above C.

Independence of irrelevant alternatives requires that if A is ranked above B in a comparison involving A, B, C and D, adding a new producing unit E to the comparison should not make B preferred to A.

Path independence requires that the comparison between A and B is not dependent on the ordering of the components of the comparison; using one body of information followed by a second should yield the same result as using the second followed by the first. This is extremely important from the point of view of transparency. More simply, however, one must question the validity of methods that exhibit path dependence.
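The independence property can be made concrete with a small sketch. It assumes a composite built as the equal-weighted mean of min-max normalised indicators, which is a common construction but not one taken from this paper; the units, indicator values and the function name `minmax_composite` are all hypothetical.

```python
def minmax_composite(units):
    """Equal-weighted mean of min-max normalised indicators for each unit."""
    names = list(units)
    n_indicators = len(next(iter(units.values())))
    scores = {name: 0.0 for name in names}
    for k in range(n_indicators):
        values = [units[name][k] for name in names]
        lo, hi = min(values), max(values)
        for name in names:
            scores[name] += (units[name][k] - lo) / (hi - lo) / n_indicators
    return scores

# Hypothetical units, each scored on two indicators (illustrative values only).
units = {"A": (4, 5), "B": (6, 2), "C": (0, 0), "D": (10, 10)}
before = minmax_composite(units)

# Adding unit E stretches the range of the second indicator only.
after = minmax_composite({**units, "E": (5, 40)})

print(before["A"] > before["B"])  # ranking of A against B among A to D
print(after["A"] > after["B"])    # ranking of A against B once E is added
```

In this illustration the scores of A and B on each raw indicator never change, yet their relative position does, because the normalisation depends on which other units are in the comparison.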


2.3 Composite indicators

In what circumstances is it valid to employ a composite index? One way of approaching this question is to consider how information on one of the first three types can be used to modify the others. For example an output index requires quality adjustment--how much is produced at what level. The results from work conducted for the Department of Health (Dawson et al., 2005) suggest that one can use outcome measures to quality adjust outputs. An argument could also be made that information on processes might be useful in quality adjusting, in the absence of information on outcomes.

Most composite indices, however, do not start from the above but rather put together information useful for different purposes into a mixing bowl without any regard to what properties the resulting mix satisfies. This is due usually to expediency rather than any darker motives, but nevertheless can often cause more problems than it solves and create obfuscation rather than illumination.

3. Productivity analysis in the public sector

In recent years, economists have begun assessing the performance of public sector organisations through the use of productivity indices, more common in the analysis of the private sector. These indices compare aggregate output to aggregate input use (O'Mahony and Stevens, 2003; Dawson et al., 2005). The use of productivity indices comparable to those used in the private sector encounters a number of problems that must be overcome when applying them to the provision of public services. Their use to measure productivity in the total NHS was explored jointly by NIESR and the Centre for Health Economics at the University of York, and is discussed in some depth in Dawson et al. (2005).

Broadly speaking, the main problems relate to the aggregation required to obtain input and output indices and accounting for quality. In the private sector, it is assumed that the market price of a good or service measures the consumers' marginal valuation of the bundle of characteristics from consuming the output. In the public sector this is not generally true. One way to overcome this is the use of a cost weighted activity index (CWAI), a measure currently employed by the Department of Health (DH) in measuring NHS performance and by the ONS for a range of public services. Cost weighting does not however account for increases in quality of the service produced.

Aggregation issues on the input side are potentially easier to deal with than those on the output side, as there are market prices for inputs. Thus by separating doctors by type (consultants, registrars, house officers) and weighting by their shares in total wage bills, it is possible to adjust labour input for quality.

One problem with productivity measures is that they do not measure performance relative to un-rationed demand. It is possible to imagine a hospital trust, for example, which has used its resources efficiently but fails to meet the needs of potential patients. Despite this, the productivity index approach is potentially useful in answering questions on what is being provided by public services and how inputs are utilised to achieve this. As such they represent a useful benchmark against which to set officially produced performance indicators. In this paper, we calculate input, output and productivity indices using the hospital episode, reference cost, and financial statistics for NHS Acute Trusts outlined below. By exploring differences across Trusts and those before and after changes in target regimes we can get a comprehensive idea of how such changes impact on services in a way not possible by confining attention to a small number of performance indicators.

3.1 Activities, outputs, characteristics and outcomes

In measuring outputs in the public sector in general, and the health sector in particular, it is useful to distinguish between four different concepts: (1)

* Activities (e.g. operative procedures, diagnostic tests, outpatient visits)

* Outputs (e.g. courses of treatment--may require a bundle of activities)

* Characteristics (e.g. the aspects of output of value to individuals)

* Outcomes (e.g. the value of characteristics to individuals)

In the measurement of private sector productivity, the focus is on outputs, rather than the characteristics that they embody. This is because, under certain assumptions, the market price of the output measures the consumers' marginal valuation of the bundle of characteristics.

Bureaucracies tend to concentrate on activities. Put simply, this is because these are the easiest of the four concepts to count. The costs of activities may also be simpler to work out. NHS productivity measures have been based upon estimates of the number of particular types of activities (procedures, consultations etc) or the number of patients treated in various institutional settings (see Dawson et al., 2005).

There are advantages with this framework. For example, in instances where care for a patient with a particular condition is provided entirely within one setting, aggregation within the setting is equivalent to aggregation by patient pathway or disease group. It ensures compatibility with current NHS reporting systems and is likely to prove amenable to analysis at a disaggregated level. It can be a useful means for monitoring and managing lower level units within the NHS. Further, the approach would ensure consistency with other policy initiatives, most notably the Financial Flows reforms (Department of Health, 2002a).

Notwithstanding the short-run benefits of using activity based measures, in order for a performance index to be representative, reliable and robust, it is more appropriate to consider output indices, particularly those with quality adjustment, where possible. Outputs are the actual goods or services provided and may consist of a number of activities bundled together. For example, when patients are diagnosed with heart problems, their treatment may consist of one or more instances of surgery, medical management, the prescription of drugs and possibly a number of other procedures. It is this bundle of activities which should properly be considered to make up the output of the health service with respect to the patient concerned. Thus, the unit of analysis is not the finished consultant episode (FCE) but rather the continuous inpatient spell (CIPS).

Whilst this is relatively simple, at least conceptually, at the aggregate sector level, it raises some problems at more disaggregate levels. In our example of acute trusts, a course of treatment may involve the provision of services by more than one trust. To overcome this we concentrate on what is known as a provider spell (PS), that is the course of treatment that is done within one organisation (although note that an acute trust may have more than one site). The use of provider spell does not throw up practical problems in our analysis, but there are potential incentive implications if one concentrates on one part of what can be potentially a multiple-provider service. If such indicators are linked to reward or punishment regimes, they may create incentives to 'farm out' difficult or problematic aspects of processes, or to refuse them when other trusts refer them. Any system of management creates incentives; if we do not consider the full set of incentives created by a regime, we cannot know if it will have the intended consequences.

One problem with the use of outputs (or activities for that matter) in the absence of market transactions is that we cannot account for variations in quality that are usually reflected in prices. Even when prices do exist--for example prescriptions and private sector comparisons--there are a number of reasons why these may not be the appropriate measures. Prescription costs explicitly do not reflect the costs of producing or value to the consumer of the items to which they relate. This is because such pricing would contravene the principles of equity on which the NHS and many other systems of health provision are based. Prescription costs have more to do with cost-sharing as a method of overcoming the over-consumption that occurs when goods have zero prices than with the outcome of some pseudo market system of allocation. Private sector comparisons are tenuous in the UK healthcare setting because of the private sector's relative size, the comparability of the service offered (i.e. the sets of characteristics) and the demographics of the patients (and therefore the demands placed on the provider in terms of health problems occurring, different receptiveness to treatment and values placed on characteristics of outputs etc.).

What is required is an alternative method of quality adjustment. In order to find this we need to consider what determines the final value of goods and services in both private and public sectors. The prices which producers are willing to accept for providing services reflect the costs of provision. What is of interest to us is what determines the price consumers are willing to pay for such services. The value of goods and services depends upon their characteristics, which bundled together produce an outcome. Consider for example a hi-fi audio system. The final outcome produced by the system is the ability to listen to music and other audio material. A number of factors contribute to this, such as the ability to play music in a number of different formats, sound quality, reliability, size and ergonomics. Variations in prices of hi-fi systems can be explained in terms of these characteristics. Such variation allows one to calculate the valuations put on these characteristics by the market.

Whilst we do not have the prices with which to undertake such 'hedonic analysis', consideration of the outcomes produced by public sector organisations, and the characteristics of outputs that provide these do allow us to consider the question of quality in public sector service provision. At the very least, such a consideration focuses us on the raison d'etre of the public services in question, which is no bad thing. As we have already mentioned, often the focus is an unquestioning one on activities because these are 'what the organisation does' (rather than what it does them for).

The primary outcome of the public health sector is clear, to increase the health of the population. This is in part pro-active, through public health and the like, and partly remedial, through the treatment of illness. The focus of acute trusts is the latter. For treatment to be valued, it needs to have a positive impact on expected health, either by removing or mitigating illness where possible or palliating incurable illness. This can be done by extending life expectancy (increasing the quantity of life) or by reducing pain, increasing mobility etc. (increasing the quality of life). These two aspects of the health impact of the health service can be combined in a measure called quality-adjusted life years (QALY). A number of other characteristics of health provision can be considered in terms of their effect on the quantity and quality of life. Waiting times can be considered as time that could be spent enjoying the enhanced, post-treatment quality of life. In-treatment mortality can be considered as the years of life the patient could have enjoyed if they had not died.

The QALY measure is a widely, although not universally, accepted instrument with which to measure the impact of treatment on patients' health. The operation of the instrument is essentially in two parts. First, the impact of treatment on a number of aspects of health, such as pain and mobility, is calculated. Second, surveys are conducted to place values on these characteristics. Although the QALY instrument is a means whereby the value of health services to their consumers can be calculated, we are a long way short of populating the full set, or even the majority of treatments provided by modern health services.

In the absence of QALY measures for the majority of treatments, we must utilise some other valuation methodology. Dawson et al. (2005) conducted the first extensive analysis of the issues surrounding the calculation of input, output and productivity indices for publicly-provided health services. They concluded that until such time as we can actually measure what the National Health Service does to its patients and how well it does it, our best measure of the output of the service is to use some kind of quality-adjusted cost-weighted output index. This is the method we shall be using in this paper as a comparator to the star ratings system of performance indicators.

There are a number of methods whereby one might quality-adjust the measure of output of an acute trust. Here we consider a survival-adjusted cost-weighted activity index, i.e. adjusting for the life that could have been enjoyed if the patient had survived treatment. Of course, while the proportion of people who die in hospitals is relatively small, the cost is not insignificant.

4. Star ratings system for acute trusts

We will not describe the star ratings system in great detail here as the system in general is discussed elsewhere in this issue in Gwyn Bevan's paper. The star rating system for acute trusts was introduced in the financial year 2000/1. Its final year was 2004/5 as it is to be replaced by the 'annual health check' for 2005/6. The star rating system consists of a small set of 'Key Targets', which drive the outcomes, and a larger set of 'balanced scorecard' indicators, which have a much lower impact (Department of Health, 2001, 2002b; Commission for Health Improvement, 2003; Healthcare Commission, 2004). The key targets are outlined in table 1; details of the balanced scorecard indicators are given in the appendix (table A.1). The outcome of the star rating system is a rating between no stars (trusts with the poorest level of measured performance) and three stars (trusts with the highest levels of performance in the measured areas). The two intermediate cases are one star (trusts where there is some cause for concern regarding particular areas of measured performance) and two stars (trusts with mostly high levels of performance, but which are not consistent across all measured areas).

The most obvious thing to notice from the targets outlined in the table is that the majority relate to waiting times. If we consider the raison d'etre of an acute trust, it is to improve the health of its patients. There are no measures of how many patients the trust treats, nor how well they are treated, or indeed if they gain any benefit at all. Long waiting times are a symptom of a system that is perhaps operating inefficiently or unproductively and as such this is a management tool that seeks to reduce these symptoms. In the typology suggested above, what we have is a composite performance management index. We have no information on the weights given to each of these indicators in the final index. (2)

Is this the most appropriate index? In the spirit of what we have said above, it is best perhaps to consider the purpose of the index (as well as the purpose of the acute trusts). In this issue of the Review, Bevan places the star ratings in a political context, suggesting that they are a typical instrument of a command and control government, i.e. they are a management tool. But as with any management tool, one has to ask 'management for what purpose?' In the private sector, we would ask 'do our management practices achieve their outcomes?' or rather 'what is their effect on the bottom line, i.e. profits or shareholder value?' In the public sector, one would hope that the bottom line is social welfare. What is required to assess management tools is a quality-adjusted output measure.

In this context of the star ratings system as a performance management tool, we can deduce that the indicators are of importance to the target setters. This may be because these were perceived to have been problem areas that needed changing, in which case they could be a legitimate component in the set of performance indicators that sought to influence change in this area. Bevan suggests that if the aim was to reduce waiting times the system certainly achieved its goal. However, Bevan also suggests that the system in practice created three tiers of outputs. At the top tier are the things that matter most--the 'key targets'; next are those things that are of secondary importance--the 'balanced scorecard' indicators; finally we have the things that do not count--i.e. everything else. Thus, in achieving this goal, the price may have been paid in adjusting the incentives within the system. We therefore raise the question once more in a different way: is there a relationship between star ratings and productivity that suggests that those with high ratings are actually the least productive because they bought their 'stars' with respect to waiting times (and improving working lives and cleanliness)?

It is useful therefore to set the star ratings system against alternative measures of performance and ask the question 'are acute trusts with high star ratings more/less productive?' That is not to say that this is the only measure against which they should be gauged. As we have stated above, a performance management tool is not a welfare index, it may have a different purpose, and as such should not be expected to produce identical results. Nevertheless, if the ultimate aim of the Government is to increase the welfare of the nation and of the health service to improve its health, it is legitimate to ask if the performance management tool is related to the ultimate aim of the management (in this case the Government). If it is not, use of such tools beyond the short period (where they might be used as a remedial measure), would have serious distortionary ramifications on the achievement of government's ultimate goal.

4.1 Did the stars move?

We can consider the issue of the success of the star ratings system as a performance management tool by considering the changes in the star ratings. Table 2 outlines the relationship between the star ratings of the acute trusts that we could match from 2000/1 and 2003/4. (3) We can see that 49 trusts (37 per cent) improved their star rating over the period. Over half of the trusts achieved two stars in 2000/1 and over half of these increased to three stars, with two trusts rising from zero to three stars. The rating of 26 trusts fell over the period (20 per cent). If the system intended to incentivise trusts to increase their performance with regard to the key targets (and to a lesser extent the background indicators), then it has on the face of it met with some success.

However, this sanguine interpretation makes two not inconsequential assumptions. First, the system operated in a consistent way. That is, the monitoring did not become more lenient and trusts did not become better at gaming the system. Second, the effect on other areas of the trusts' operations was modest or positive. The assessment of the first is beyond the scope of this paper; the second is not.
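The kind of tabulation described in this subsection can be sketched as follows. The matched pairs of ratings are invented for illustration and do not reproduce the paper's data; `matched` and the other variable names are hypothetical.

```python
from collections import Counter

# Hypothetical matched ratings per trust: (stars in 2000/1, stars in 2003/4).
matched = [(2, 3), (2, 2), (0, 3), (3, 2), (1, 2), (2, 1), (3, 3), (1, 1)]

transitions = Counter(matched)
improved = sum(1 for start, end in matched if end > start)
fell = sum(1 for start, end in matched if end < start)
unchanged = len(matched) - improved - fell

# Rows: rating in 2000/1; columns: rating in 2003/4.
for start in range(4):
    print(start, [transitions[(start, end)] for end in range(4)])
print(f"improved {improved / len(matched):.0%}, fell {fell / len(matched):.0%}")
```

A table of this form shows movements between ratings only; it says nothing about whether the assessment regime stayed constant between the two years.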

5. Empirical model

We construct labour productivity indices using a basic (i.e. non-quality-adjusted) cost-weighted output index and a mortality-adjusted cost-weighted output index. Labour is by far the most important input in producing health services, accounting for 71 per cent of expenditure in the hospital sector. (4)

In the absence of quality adjustments for changes in QALYs, this paper presents measures based on cost-weighting activities. If the unit of analysis is the patient spell (or patient pathway, i.e. the bundle of activities a patient receives in order to be treated), a cost-weighted output index (CWOI) can be calculated. Leaving out the time subscripts for clarity, this is given by:

$$ O = \sum_{j} \sum_{s} x_{js} c_{js} = \sum_{l} x_{l} c_{l} \qquad (1) $$

where $j$ indexes spells, $s$ indexes the activities within a spell, and $x_l$ and $c_l$ are the output and cost of spells beginning with activity type $l$ respectively.

Our quality-(mortality-) adjusted output index is:

$$ O_M = \sum_{j} \left[ \left( \sum_{s} x_{js} c_{js} \right) (1 - m_j) \right] = \sum_{l} x_{l} c_{l} (1 - m_l) \qquad (2) $$

where m is a dummy variable that equals one if the patient dies and zero otherwise; (1 - m) represents the survival rate. The spells are created from data on individual patients' spells. If the index were calculated using episodes we could use the mean value of (1 - m), which is the survival rate for the particular episode type (or, to be more precise, the admission-health resource group combination), and multiply this by the number of individuals treated in that group and by the cost. However, in our quality-adjusted index [[??].sub.l] is the mean number of individuals whose spell began with episode type l and ended in death. From (2) we can see the difference between the activity and output (course of treatment/bundle of episodes) methods. For a spell of treatment, we need to adjust the whole sum of episodes in that spell, weighted by their costs, by whether any of the episodes ended in death (note that an individual could be receiving a number of treatments at the same time). Thus, if an individual went in for one procedure but then went on to have others--either as part of a bundle of treatments for a particular ailment or as a result of misdiagnosis or error by medical staff--the whole spell is quality-adjusted by the fact that the individual died. For example, if a patient went into hospital for a spell of treatment for a heart condition that included an operation followed by, say, a course of drugs, it is the treatment of the heart condition that ended in a death, rather than just the course of drugs. As a corollary of this, consider a patient who went into hospital for a treatment that is not considered life-threatening--say having an in-growing toenail removed--but due to mismanagement was given the incorrect drugs and suffered a heart failure. The death must be linked to the course of treatment that began with an in-growing toenail treatment, not merely the treatment for heart failure. Part of the risk of having an in-growing toenail treated is the (one would hope extremely small) risk that the patient would be mistreated and die.
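The spell-based adjustment in (2) can be illustrated with a small sketch. It is not the authors' code: the spell records, type labels and cost values are hypothetical, and it assumes a single reference cost per first-episode type l, so that the spell-level sum and the grouped sum in (2) coincide.

```python
from collections import defaultdict

# Hypothetical reference costs c_l by first-episode type l (illustrative values only).
COST = {"heart_elective": 6000.0, "toenail_elective": 300.0}

# Hypothetical spells: (first-episode type l, m_j), where m_j = 1 if the spell ended in death.
# The last spell began as a toenail treatment, so its death is charged to that type.
SPELLS = [
    ("heart_elective", 0),
    ("heart_elective", 1),
    ("toenail_elective", 0),
    ("toenail_elective", 0),
    ("toenail_elective", 1),
]

def output_by_spell(spells, cost):
    """Spell-level form: sum over spells j of c_l(j) * (1 - m_j)."""
    return sum(cost[l] * (1 - m) for l, m in spells)

def output_by_group(spells, cost):
    """Grouped form: sum over types l of x_l * c_l * (1 - mean mortality of type l)."""
    counts = defaultdict(int)
    deaths = defaultdict(int)
    for l, m in spells:
        counts[l] += 1
        deaths[l] += m
    return sum(counts[l] * cost[l] * (1 - deaths[l] / counts[l]) for l in counts)

# Cost-weighted output without any quality adjustment, for comparison.
unadjusted = sum(COST[l] for l, _ in SPELLS)
adjusted = output_by_spell(SPELLS, COST)
```

In this toy data the whole value of a spell is removed when the patient dies, whichever episode the death occurred in, which is the point made in the toenail example.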

One objection that might be levelled at such indices is that they do not include factors such as cleanliness (except for extreme effects such as those on mortality, e.g. MRSA) and waiting times. However, most people are likely to give these a low weighting relative to being treated and not dying during treatment. (5)

The labour input index is a standard weighted average of m staff types (6)

\[ L = \sum_m n_m w_m \tag{3} \]

where $n_m$ is the number of staff of type m and the weight $w_m$ is equal to the aggregate expenditure share of labour type m. Thus our basic cost-weighted productivity index is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4)

Our mortality-adjusted productivity index is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (5)
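Since expressions (4) and (5) did not survive reproduction, the following sketch only illustrates the general idea. It assumes that each productivity index is the relevant output index divided by the labour input index of (3), scaled to a base-year average of one as in the results tables. All staff numbers, shares, output values and the `base_ratio` argument are hypothetical.

```python
# Hypothetical whole-time-equivalent staff numbers n_m for one trust.
STAFF_WTE = {"nursing": 400.0, "medical": 50.0, "admin_estates": 150.0}

# Hypothetical aggregate expenditure shares w_m (chosen to sum to one).
EXPENDITURE_SHARE = {"nursing": 0.5, "medical": 0.3, "admin_estates": 0.2}

def labour_input(staff_wte, expenditure_share):
    """Equation (3): L = sum over staff types m of n_m * w_m."""
    return sum(n * expenditure_share[m] for m, n in staff_wte.items())

def productivity_index(output, labour, base_ratio):
    """Assumed form: output per unit of labour input, scaled so base_ratio maps to 1.0."""
    return (output / labour) / base_ratio

L = labour_input(STAFF_WTE, EXPENDITURE_SHARE)

# Hypothetical output indices for the same trust; base_ratio stands in for the
# sector-wide average output/labour ratio in the base year.
basic = productivity_index(12900.0, L, base_ratio=50.0)
mortality_adjusted = productivity_index(12600.0, L, base_ratio=50.0)
```

Because the denominator is the same in both cases, any gap between the two indices for a trust comes entirely from the mortality adjustment to output.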

The data for our analysis come from a number of sources. The counts of outputs (spells) come from the Hospital Episodes Statistics (HES) database of all elective and non-elective day patients, with the exception of regular day and night attenders. This gives us over 10 million provider spells every year. The cost weights are derived by DH and come from the reference costs database. (7) Reference costs are indexed by the 'healthcare resource group' of the first episode in a spell and whether the first episode was an elective or non-elective admission. This gives us over 1,000 combinations (i.e. J or l > 1,000). (8)

Staff numbers are whole time equivalent (WTE) numbers from the NHS Census. The earnings used for the expenditure share calculations come from the NHS Earnings Survey. For more information on these data sources and how to construct the system-wide analogue of provider spells (continuous inpatient spells), see the data appendix to Dawson et al. (2005).

6. Results

6.1 Relationship between star ratings and productivity

The results for the basic CWOI are set out in table 3. The tables are standardised such that the overall average level of productivity in 2000/1 equals one. Thus the first entry states that zero star-rated trusts in 2000/1 achieved 2 per cent higher than average productivity levels in that year. In 2000/1, three-star trusts do have the highest levels of labour productivity. However, both the between-star rating and between-year variation are swamped by the within-group variation--standard deviations are considerably higher than the ratios of the indices across group or year. Thus we cannot differentiate between the productivity of any of the groupings with any degree of certainty. Indeed in 2001/2 there is, if anything, a negative relationship between the star rating the trust achieved and its labour productivity index.
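A rough way to see the 'swamping' is to set the largest gap between group means against the smallest within-group standard deviation. The sketch below does this for 2000/1 using the means and standard deviations reported in table 3. The function name is hypothetical, and the comparison is a crude illustration rather than a formal statistical test.

```python
# Means and standard deviations of the basic CWOI for 2000/1, copied from table 3.
# Keys are star ratings; values are (mean, standard deviation).
CWOI_2000_1 = {0: (1.02, 0.19), 1: (0.99, 0.22), 2: (1.01, 0.23), 3: (1.05, 0.30)}

def spread_vs_noise(groups):
    """Return (largest gap between group means, smallest within-group SD)."""
    means = [mean for mean, _ in groups.values()]
    sds = [sd for _, sd in groups.values()]
    return max(means) - min(means), min(sds)

between, smallest_sd = spread_vs_noise(CWOI_2000_1)
print(f"largest gap between group means: {between:.2f}")   # 0.06
print(f"smallest within-group SD:        {smallest_sd:.2f}")  # 0.19
```

Even the widest gap between star-rating groups is about a third of the tightest group's standard deviation, which is why the groupings cannot be told apart with any confidence.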

The same is true when we quality-adjust the output index to account for differences in mortality across acute trusts. The patterns are similar--an increase in productivity between 2000/1 and 2001/2 followed by a levelling out, and a negative relationship between star ratings and labour productivity in 2001/2--but they are once more swamped by the within-group variation. Once again we can say that there is little or no relationship between the star rating the acute trust received and its productivity.

Our results cannot discriminate between a number of competing theses. As we noted above, one might not expect there to be a positive relationship between labour productivity and star ratings if the only difference between hospitals were the waiting lists and other key indicators (and, possibly, balanced scorecard indicators). What about the suggestion that trusts 'bought' lower scores in other key indicators (i.e. waiting times) by foregoing the financial management indicator (which referred to whether they achieved their financial plan without unplanned support) (Bevan, 2006)? This would lead to an increase in both the numerator (output), as trusts cleared their waiting lists, and the denominator (labour input), as trusts increased staff numbers or overtime to achieve this. Thus the effect on productivity would be indeterminate. Moreover, the fact that trusts with long waiting lists were no more or less productive than others may also suggest that the problem is due not so much to an individual trust's inefficiency, but rather to an excess of demand (i.e. inefficient strategic management at the level of between-trust resource allocation).
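The indeterminacy in the 'buying' thesis can be made explicit with a short piece of arithmetic. Assume labour productivity is the ratio P = O/L of the output index to the labour input index, and let g_O and g_L (symbols introduced here for illustration only) be the proportional increases in output and labour input as a trust clears its waiting list:

```latex
\[
\frac{P'}{P} \;=\; \frac{O(1+g_O)}{L(1+g_L)} \bigg/ \frac{O}{L} \;=\; \frac{1+g_O}{1+g_L}
\]
```

Productivity therefore rises only if g_O > g_L. If, say, output and staffing both rise by 10 per cent, the ratio is exactly one and measured productivity is unchanged, even though the trust's waiting-time indicators improve.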

6.2 Aggregate productivity

The overall CWOI and mortality-adjusted productivity indices for the acute trust sector along with uncertainty bounds at plus and minus one standard deviation are shown in figure 1. (9) From the figure it is clear that although there was an increase in both indices between 2000/1 and 2001/2, this is dwarfed by the uncertainty about their value. There is a slight divergence between the basic CWOI and the mortality-adjusted index; this is because of the slight increase in mortality. However, once we consider the error bounds we can see that the difference is dwarfed by uncertainty about the aggregate productivity values themselves.

[FIGURE 1 OMITTED]

7. Discussion

The setting and use of targets in the public sector has generated a growing amount of interest in the UK. This has occurred at a time when analysts and policymakers are increasingly grasping the nettle of measuring performance in the public sector. The discussion in this paper highlights the fact that performance indicators should satisfy certain desiderata. In particular they should be fit for the purpose, i.e. the correct type of indicator for the job at hand. Indicators that are put to uses not foreseen by their designers are seldom put to good use. In addition the indicators should cover an appropriate domain(s) of service under examination. Management indices that include aspects that are beyond the control of the managed will present a misleading picture of performance. Aggregation methods are also important--if an index comprises more than one indicator (as is usual) the weights used to aggregate them should represent their relative value and not some arbitrary scheme, such as giving each equal weight without consideration of the alternatives.

As an empirical application of our methodology, we have compared one type of performance indicator with another. The star ratings system for acute hospital trusts in England is an example of a performance management system. It is part of a mechanism designed to improve, or at least change, the sector with regard to a small number of key targets, largely relating to waiting times. We have contrasted this with a measure of labour productivity analogous to those used in the analysis of the private sector.

We find that the two methods of assessment are almost entirely unrelated. That is to say, there is no relationship between an acute trust's star rating and its productivity. Our results are the same whether one uses a simple cost-weighted output index, or whether one quality-adjusts the output index for variations in patient mortality. Whatever the merits of the star rating system in terms of a management tool--and the increase in the number of trusts achieving high star ratings suggests that it has been, on its own terms, successful--it does not provide an indicator of the performance of trusts in terms of their productiveness in providing health care for patients.

The results also do not say that high achieving trusts have obtained their high star ratings at the cost of neglecting their primary purpose of treating patients. They are, however, consistent with the thesis that since the key indicators included only one financial one and ten or so others, trusts could quite literally buy their way out of trouble by missing one target (the financial one) in order to meet the remainder (mainly on waiting times). (10) This is no more than tentative evidence, however, and it would need further research to determine whether this were true.

The lack of any correspondence between the star ratings and productivity of acute trusts does raise questions as to the appropriateness of such indicators of performance, particularly over the long term. Whilst they may have achieved certain goals deemed important by central government (i.e. waiting times), the focus they place on a small subset of what is the role of acute trusts--i.e. the speed with which services are provided rather than the services themselves--may become detrimental. The fact that the star rating system, having achieved what might be considered to be one of its primary objectives, has been replaced, gives one hope that this is not the case, although the jury is out on 'annual health checks'.

Appendix

Table A.1. Balanced scorecard indicators used in star ratings for acute trusts

 2000/1 2001/2 2002/3 2003/4

Clinical focus indicators
Clinical negligence: compliance
 against CNST risk management
 standards x x x x
Emergency readmission to hospital
 following discharge (adults) x x x x
Emergency readmission to hospital
 following discharge (children) x x
Emergency readmission following
 treatment for a fractured hip x x x
Emergency readmission following
 treatment for a stroke x x
Deaths in hospital within
 30 days of non-elective surgery x x x x
Deaths within 30 days
 of heart bypass operation x x x
Returning home following
 treatment for a fractured hip x
Returning home following
 treatment for stroke x
Infection control x x
MRSA bacteraemia: improvement score x
Thrombolysis treatment time x x
Compliance to recommended child
 protection systems & procedures x
Clinical governance
 composite indicator x
Extent of participation in
 selected clinical audits x
% stroke patients spending time
 on a specialist stroke unit x
"Winning Ways" processes
 and procedures x

Patient focus indicators
% patients waiting less than
 6 months for an inpatient
 appointment x x x x
Patients seen within 13 weeks
 of GP written referral for
 first outpatient appointment x x x x
Trolley waits of more than
 4 hours (% non-elective FFCEs) x
Resolution of written complaints x x x
No. of patients waiting
 for an inpatient appointment
 (% planned target achieved) x x
Total time spent in A&E x
% patients not admitted within one
 month of last minute cancellation x
% patients treated within one month
 of diagnosis of breast cancer x x x
% patients treated for breast
 cancer within two months of
 urgent GP referral for suspected
 cancer x
% patients whose discharge from
 hospital was delayed x
Inpatient survey: co-ordination
 of care x
Inpatient survey: environment and
 facilities x
Inpatient survey: information and
 education x
Inpatient survey: physical and
 emotional needs x
Inpatient survey: prompt access x
Inpatient survey: respect and
 dignity x
% patients admitted to hospital
 via A&E within 4 hours of
 decision to admit x x
Hospital food (whole trust score) x x
% elective admissions cancelled at
 the last minute for non-clinical
 reasons x x
% of day cases that were pre-booked x x
% patients whose transfer of care
 from hospital was delayed x x
Heart operation waits x x x
Outpatient A&E survey:
 access and waiting x
Outpatient A&E survey:
 better information, more choice x
Outpatient A&E survey: building
 relationships x
Outpatient A&E survey: clean,
 comfortable and friendly place
 to be x
Outpatient A&E survey: safe, high
 quality co-ordinated care x
Paediatric outpatient
 non-attendance rates x
Privacy and dignity: compliance
 with objectives x
% patients with new onset
 chest pain seen in RACPCs
 within 2 weeks of GP referral x
Adult inpatient and young patient
 surveys: access and waiting x
Adult inpatient and young patient
 surveys: better information,
 more choice x
Adult inpatient and young patient
 surveys: building closer
 relationships x
Adult inpatient and young patient
 surveys: clean, comfortable
 and friendly place to be x
Adult inpatient and
 young patient surveys: safe,
 high quality co-ordinated care x

Capacity and capability
 focus indicators *
Sickness/absence rate for
 directly employed NHS staff x x x
Compliance with the New Deal
 on junior doctors' hours x x x x
Consultant vacancy rates x
Qualified nursing, midwifery and
 health visiting staff vacancy
 rates x
Qualified AHPs vacancy rates x
Data quality x x x
Staff satisfaction with employer x x
Information governance x x x
Fire, health and safety backlog x
% consultants completing annual
 appraisal x x
Staff opinion survey: health,
 safety and incidents x
Staff opinion survey: human
 resource management x
Staff opinion survey:
 staff attitudes x

Note: * This set of indicators was referred
to as 'staff focus' indicators in 2000/1.

Table A.2. Occupational categories used in construction of labour input index

1997, 1998, 1999 2000 2001
2002, 2003

Non-medical staff

Qualified Nursing, Qualified Qualified
nursing, midwifery and nursing, nursing,
midwifery and health visiting midwifery and midwifery and
health visiting staff health visiting health visiting
staff staff staff

Qualified Scientific, Scientific, Qualified
scientific, therapeutic and therapeutic and Allied Health
therapeutic and technical staff technical staff Professionals
technical staff

Qualified Ambulance staff Ambulance staff Ambulance staff
ambulance staff

Support to Health care Health care Health care
clinical staff assistants and assistants and assistants and
 other support other support other support
 staff staff staff

NHS Admin & estates Admin & estates Admin & estates
infrastructure staff staff staff
support
 Nursing, Other nursing, Other nursing,
 midwifery and midwifery and midwifery and
 health visiting health visiting health visiting
 learners staff staff

 Other staff Other staff Other staff
 Other STT staff
Medical staff
Associate specialist and staff grade
Consultants
Dental Officers
House Officers
Hospital practitioners and clinical assistants
Other community health services
Other hospital (1997 to 2000 only)
Registrars
Senior Dental Officers
Senior House Officers

Notes: The Department of Health's Annual Workforce Census (for further details see http://www.dh.gov.uk/PublicationsAndStatistics/Statistics/StatisticalWorkAreas/StatisticalWorkforce/fs/en) provided data on numbers employed by occupational group at the NHS Acute Trust level. However, it should be noted that there are changes to the occupational categories in the workforce census over time, particularly for non-medical staff. The occupational groups used in calculating the index of labour input used in this paper are summarised in table A.2 above.


REFERENCES

Atkinson, A. (2005), 'Measurement of government output and productivity for the national accounts', Atkinson Review: Final Report, HMSO.

Bevan, G. (2006), 'Setting targets for health care performance. Lessons from a case study of the English NHS', National Institute Economic Review, 197.

Commission for Health Improvement (2003), 'NHS performance ratings acute trusts, specialist trusts, ambulance trusts 2002/2003', London, Commission for Health Improvement.

Cutler, T. (2002), 'Star or black hole?', Community Care, pp. 40-41.

Dawson, D., Gravelle, H., O'Mahony, M., Street, A., Weale, M., Castelli, A., Jacobs, R., Kind, P., Loveridge, P., Martin, S., Stevens, P. and Stokes, L. (2005), 'Developing new approaches to measuring NHS outputs and productivity', NIESR Discussion Paper No. 264 and CHE Research Paper No. 6, available at: http://www.niesr.ac.uk/pdf/nhsoutputsprod.pdf.

Department of Health (2001), 'NHS performance ratings: acute trusts 2000/01', London, Department of Health.

--(2002a), 'Reforming NHS financial flows: introducing payment by results', London, Department of Health.

--(2002b), 'NHS performance ratings and indicators: acute trusts, specialist trusts, ambulance trusts, mental health trusts 2001/02', London, Department of Health.

Hemingway, J. (2004), 'Sources and methods for public service productivity: health', Economic Trends, 613, pp. 82-90.

Healthcare Commission (2004), '2004 performance ratings', London, Healthcare Commission.

Hill, A. (2003), 'The UK government's public service agreement framework', Background Paper, HM Treasury.

JRSS (2004), 'Special issue on performance monitoring and surveillance', Journal of the Royal Statistical Society, 167, 3.

Kmietowicz, Z. (2003), 'Star rating system fails to reduce variation', British Medical Journal, 327, p. 184.

Lee, P. (2004), 'Public service productivity: health', Economic Trends, 613, pp. 38-59.

Mai, N. (2004), 'Measuring health care output in the UK', Economic Trends, 610, pp. 64-73.

Miller, N. (2002), 'Missing the target', Community Care, 21-27 November, pp. 36-8.

O'Mahony, M. (2006), 'Outputs, inputs and productivity in the NHS', presentation to NIESR/ESRC conference on 'Public Sector Performance', London, British Academy.

O'Mahony, M. and Stevens, P.A. (2002), 'Measuring international comparative performance in the provision of public services: a review', report to Evidence Based Policy Fund.

--(2003), 'International comparisons of performance in the provision of public services: outcome based measures for education', Presentation to NIESR conference on 'Productivity and Performance in the Provision of Public Services', London, British Academy.

OXREP (2003), 'Special issue on financing and managing public services', Oxford Review of Economic Policy, 19, 2.

Pritchard, A. (2003), 'Understanding government output and productivity', Economic Trends, 596, pp. 27-40.

--(2004), 'Measuring government health services outputs in the UK national accounts: the new methodology and further analysis', Economic Trends, 613, pp. 69-81.

Rowan, K., Harrison, D., Brady, A. and Black, N. (2004), 'Hospitals' star ratings and clinical outcomes: ecological study', British Medical Journal, 328, pp. 924-5.

Snelling, I. (2003), 'Do star ratings really reflect hospital performance?' Journal of Health Organization and Management, 17, pp. 210-23.

Stevens, P. (2005), 'Assessing the performance of local government', National Institute Economic Review, pp. 90-101.

NOTES

(1) Other authors such as Stevens (2005) and Dawson et al. (2005) consider these as three (activities, outputs and outcomes); here we differentiate more clearly between the characteristics of outputs and their effect on outcomes.

(2) Although this is possible and we are currently undertaking statistical analysis to determine the implicit weights.

(3) Note that there has been some movement in the population, with trusts amalgamating.

(4) Ideally we would calculate multi-factor productivity measures that account also for capital and intermediate inputs but doing so at the trust level is beyond the scope of this paper.

(5) Certainly, in their discussion of the effects of waiting times on quality adjustment, Dawson et al. (2005) point out that compared to the total time over which improved health due to treatment might be enjoyed, the weeks or even months of waiting are relatively small.

(6) The staff types used in the construction of the labour input index are detailed in table A.2 of the appendix.

(7) NHS reference costs are available online at: http://www.dh.gov.uk/PolicyAndGuidance/OrganisationPolicy/FinanceAndPlanning/NHSReferenceCosts/fs/en.

(8) Note that they represent average costs across trusts, rather than unit costs for individual trusts. Aside from concerns regarding their reliability, the use of average unit costs calculated at the trust level would create two problems. First, given that the cost measures the relative 'importance' of a particular output, it is unclear why this should vary by trust. Second, it could create a perverse incentive for trusts to increase provision in areas where they are relatively expensive or to inflate costs in areas where they are big producers. With trust-level weights, a hospital that spends greater than average amounts providing, say, hip replacements would appear more productive in our index, even though the greater expense might well reflect poor management of service provision.

(9) These figures were aggregated using expenditure weights.

(10) A trust could underachieve on one target and still get three stars, but not significantly underachieve. For 2003/4 on this target, 'underachieve' was defined as 'Adverse variance from financial plan of up to 1% of turnover or less than £1 million without unplanned financial support', while 'significantly underachieved' was defined as 'Adverse variance from financial plan greater than 1% of turnover or greater than £1 million or unplanned financial support'.

Philip Stevens *, Lucy Stokes ** and Mary O'Mahony ***

* National Institute of Economic and Social Research and Medium Term Strategy Group, Ministry of Economic Development, New Zealand. email: p.stevens@niesr.ac.uk. **National Institute of Economic and Social Research. ***National Institute of Economic Research and Birmingham Business School. The research reported here was funded by the ESRC, Grant no. RES-153-25-00 44. Thanks go to Gwyn Bevan, Rowena Jacobs, Martin Weale and participants at the 4th NIESR Public Sector Performance Conference, January 2006, for helpful comments, although the usual caveats apply.

Table 1. Key targets used in star ratings for acute trusts

 2000/01 2001/02 2002/03

Shorter inpatient waiting lists x
Inpatients waiting longer than x x x
 the standard
Reduction in outpatient waiting x
Outpatients waiting longer than x x x
 the standard
Outpatient and elective
 (inpatient and day-case) booking
Cancer: % seen within 2 weeks * x x
Financial management x x x
12 hour waits for emergency
 admission via A&E following
 decision to admit x x x
Total time in A&E: 4 hours or less x
Cancelled operations x x x
Improving working lives x x x
Hospital cleanliness x x x

 2003/04 2004/05

Shorter inpatient waiting lists
Inpatients waiting longer than x x
 the standard
Reduction in outpatient waiting
Outpatients waiting longer than x x
 the standard
Outpatient and elective x x
 (inpatient and day-case) booking
Cancer: % seen within 2 weeks x x
Financial management x x
12 hour waits for emergency
 admission via A&E following
 decision to admit x x
Total time in A&E: 4 hours or less x x
Cancelled operations
Improving working lives x
Hospital cleanliness x x

Note: * Balanced scorecard only.

Table 2. Movement of star ratings

                      2003/4 rating
2000/1 rating      0      1      2      3   Total

0                  1      1      6      2      10
1                  4      7      4      2      17
2                  2     11     31     34      78
3                  0      4      5     18      27
Total              7     23     46     56     132

Table 3. CWOI by star rating

Star rating   2000/1   2001/2   2002/3   2003/4

0              1.02     1.34     1.09     1.20
              (0.19)   (0.23)   (0.20)   (0.15)
1              0.99     1.17     1.15     1.12
              (0.22)   (0.21)   (0.17)   (0.21)
2              1.01     1.12     1.14     1.12
              (0.23)   (0.23)   (0.20)   (0.17)
3              1.05     1.14     1.10     1.15
              (0.30)   (0.32)   (0.16)   (0.16)
Total          1.00     1.13     1.09     1.11
              (0.23)   (0.25)   (0.18)   (0.17)

Notes: Standard deviations in parentheses.
All scores indexed to 2000/1 average.

Table 4. Mortality-adjusted CWOI by star rating

Star rating   2000/1   2001/2   2002/3   2003/4

0              1.00     1.32     1.08     1.18
              (0.19)   (0.22)   (0.20)   (0.15)
1              0.98     1.15     1.13     1.11
              (0.21)   (0.21)   (0.17)   (0.21)
2              0.99     1.11     1.12     1.11
              (0.22)   (0.23)   (0.20)   (0.17)
3              1.03     1.12     1.08     1.13
              (0.29)   (0.32)   (0.16)   (0.16)
Total          1.00     1.13     1.11     1.12
              (0.23)   (0.26)   (0.18)   (0.17)

Notes: Standard deviations in parentheses.
All scores indexed to 2000/1 average.