Metrics, targets and performance.
Stevens, Philip; Stokes, Lucy; O'Mahony, Mary et al.
The setting and use of targets in the public sector has generated a
growing amount of interest in the UK. This has occurred at a time when
more analysts and policymakers are grasping the nettle of measuring
performance in and of the public sector. We outline a typology of
performance indicators and a set of desiderata. We compare the outcome
of a performance management system--star ratings for acute hospital
trusts in England--with a productivity measure analogous to those used
in the analysis of the private sector. We find that the two are almost
entirely unrelated. Although this may be the case for entirely proper
reasons, it does raise questions as to the appropriateness of such
indicators of performance, particularly over the long term.
Keywords: Public sector performance; Productivity; Acute Trusts
JEL classifications: H11; H14; I12; I18
1. Introduction
The setting and use of targets in recent years has generated a
growing amount of interest (OXREP, 2003; JRSS, 2004), especially in high
profile government services such as health and education. Public Service
Agreement (PSA) targets were first established in 1998 as part of the
Comprehensive Spending Review. The framework was based on four
principles: outcome-focused goals; devolution of responsibility to
public service providers themselves; arrangements for audit and
inspection to improve accountability; and transparency (Hill, 2003). A
number of changes have taken place since their introduction, including
simplification, a reduction in the number of targets set and a greater
emphasis on outcome-based measures.
The aim of such targets is to improve the quality, performance and
accountability of public services. However, in the beginning, many
targets were set without a clear idea of how to measure them. More
recently, therefore, in the 2004 Spending Review, it was stressed that
targets must be measurable.
Reliable measures of performance are required to examine whether
targets have been met and to assess their impact on the provision of
public services. It is important to have robust performance measures
that cover the broad range of services delivered, rather than being
specific to targeted areas. It is also important that the measures are
appropriate for the job in hand.
In this paper we consider an example from the health sector, where
targets have received much attention in the media as well as among
health specialists in research and practice (Cutler, 2002; Kmietowicz,
2003; Miller, 2002; Rowan et al., 2004; Snelling, 2003). The PSA targets
for health cover a range of areas, with four main objectives set out in
the 2004 Spending Review PSAs: improving population health; improving
health outcomes for individuals with long-term conditions; better access
to health services; and an enhanced patient experience. Within these
objectives, various targets are specified. Examples of targets within
these objectives include: a 10 per cent fall in health inequalities by
2010; a maximum 18 week wait for hospital treatment from receipt of a GP
referral by 2008; and a decline in the adult smoking rate to less than
21 per cent by 2010.
Health is a high profile public service, seldom far from the top of
the policy or public agenda. Measuring performance in health has
generated considerable interest and is politically sensitive. One such
measure of performance that was introduced in 2000 is the 'star
ratings' whereby NHS Trusts are awarded between zero stars (worst
performers) and three stars (best performers). These star ratings
reflect how well NHS Trusts are performing in relation to a set of
'key targets' (set by government) and a larger set of broader
performance indicators (the so-called 'balanced scorecard'
indicators). This system of measuring performance has provoked much
discussion, but there is little evidence to date on how star ratings
compare to alternatives (an exception being the simple comparison of
ratings with clinical outcomes of Rowan et al., 2004). More rigorous
analysis is required in order to understand their effects, both on their
intended 'targets' and their unintended consequences in other
areas.
In this paper we compare the star ratings of NHS Acute Trusts with
a measure familiar in analysis of the private sector, labour
productivity. This work builds on recent work on the measurement of
public sector productivity (O'Mahony and Stevens, 2002, 2003;
Pritchard, 2003; Mai, 2004; Atkinson, 2005) and for the health sector in
particular (Pritchard, 2004; Hemingway, 2004; Lee, 2004; Dawson et al.,
2005).
In section 2 we discuss the use of indices in measuring performance
in public services. In section 3 we outline the issues relating to the
measurement of productivity in the public sector and the health sector
in particular. Section 4 outlines briefly the star rating system for
acute trusts. In section 5 we set out the empirical method used to
calculate our output, input and productivity indices. Our results are
presented in section 6 and section 7 concludes.
2. The use of indices in measuring performance in public services
There exist a plethora of performance indices of one type or
another. Even when these purport to measure the same thing, they are
often in fact subtly and importantly different. Many are designed for
specific purposes (and may or may not be fit for these), but are often
pressed into alternative uses. The differences between indices and their
fitness for a particular purpose are often unclear and so in this
section we set out a typology of performance indices. These categories
are not mutually exclusive and often indices that claim or are designed
to be one type of index will in fact be something rather different
altogether.
The utility of a performance index depends on its fitness for
purpose and a number of more technical factors. We discuss these
desiderata in this section, against which we can set any indices we
encounter.
2.1 A typology of performance indices
We distinguish four types of
indices used to measure performance:
* Output Index--how much of a service is being produced;
* Welfare Index--what the value is to final users;
* Performance Management Index--how the services are being
provided;
* Composite Index--includes elements of the above three.
Each of these indices relates to a different concept of what the
organisation is doing. In the case of a single service, the distinction
between these is often plain. Consider the treatment of arthritis, for
example. The output is the number of patients treated; welfare is
reduction in pain and increase in mobility; and performance management
might include the extent to which doctors are using appropriate
treatments or how long patients wait for them. With multiple services
there is a need to apply weights to each service, often combining
indicators that are measured in non-comparable units. Most public sector
organisations offer a number of different services or services with a
number of different aspects. This is at the heart of the problem of
public sector assessment (Stevens, 2005) and raises a number of
potential problems, which we discuss in section 3 below.
2.2 Index desiderata
In order to employ indices to measure performance it is useful to
begin by asking: what are the ideal properties of such indices? We would
expect indices to satisfy the following list of properties:
A. Fit for purpose
In developing indices of performance, we should heed the advice of
Hamlet to 'suit the action to the word, the word to the
action'. Determining which index to use will depend on the question
being addressed. If the purpose is to consider the contribution of the
service to aggregate economic growth then an output index (with quality
adjustment) should be employed. Such an index is required by the ONS for
the sector division of constant price output in the national accounts,
and is useful to bodies like the Treasury and Bank of England in
understanding the performance of the economy. If the purpose is to
measure consumers' well-being, then a welfare index is required.
This is of interest to government departments providing the service as
well as to central government. Departments as providers of services will
also require performance management indices to gauge the processes being
employed by producing units under their control. Finally auditors--e.g.
the Audit Commission, Healthcare Commission etc.--will require composite
indices to gauge overall performance.
B. Theoretical framework for defining domains
The indices should be grounded in theory to define what types of
services it is appropriate to include in each index. Thus we should also
heed the advice of Einstein to count what counts and not just what can
be counted. This should feed in to the data requirements in order to
enable the index to be calculated.
C. Theoretical framework for aggregation
The weights employed to aggregate should be based on a theoretical
framework which sets out the conditions under which weighting schemes
are appropriate. For example under competitive market conditions cost
weights may be used to aggregate outputs, but may need modification if
the underlying assumptions are not met. The marginal impact of the
service on consumers' utility can be used as weights in a welfare
index, but may also be employed in a quality adjusted output index.
Costs or expected impacts on utility may also be used for management
performance indices.
D. Technical properties of index numbers
Indices should satisfy a
number of properties that allow logical consistency, including
transitivity, independence of irrelevant alternatives and path
independence.
Transitivity requires that if organisation A is ranked
above B, and B is ranked above C, then A should be
ranked above C.
Independence of irrelevant alternatives requires
that if A is ranked above B in a comparison involving
A, B, C and D, adding a new producing unit E to the
comparison should not make B preferred to A.
Path independence requires that the comparison
between A and B is not dependent on the ordering of
the components of the comparison; using one body of
information followed by a second should yield the
same result as using the second followed by the first.
This is extremely important from the point of view of
transparency. More simply, however, one must question
the validity of methods that exhibit path dependence.
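The transitivity requirement can be checked mechanically for any pairwise comparison rule. The following is a minimal sketch in Python, not part of the original analysis: the function name `is_transitive`, the unit labels and the scores are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def is_transitive(units, prefers):
    """Return True if the pairwise relation prefers(a, b) is transitive over units."""
    for a, b, c in permutations(units, 3):
        if prefers(a, b) and prefers(b, c) and not prefers(a, c):
            return False
    return True


# Hypothetical scalar scores for three producing units (illustrative only).
scores = {"A": 0.9, "B": 0.7, "C": 0.4}


def by_score(a, b):
    # Ranking by a single scalar index.
    return scores[a] > scores[b]


# Hypothetical pairwise verdicts that cycle: A over B, B over C, C over A.
cycle = {("A", "B"), ("B", "C"), ("C", "A")}


def by_cycle(a, b):
    return (a, b) in cycle


print(is_transitive(scores, by_score))
print(is_transitive(scores, by_cycle))
```

A ranking derived from one scalar index passes the check by construction; a set of pairwise verdicts assembled from different criteria need not.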
2.3 Composite indicators
In what circumstances is it valid to employ a composite index? One
way of approaching this question is to consider how information on one
of the first three types can be used to modify the others. For example
an output index requires quality adjustment--how much is produced at
what level. The results from work conducted for the Department of Health
(Dawson et al., 2005) suggest that one can use outcome measures to
quality adjust outputs. An argument could also be made that information
on processes might be useful in quality adjusting, in the absence of
information on outcomes.
Most composite indices, however, do not start from the above but
rather put together information useful for different purposes into a
mixing bowl without any regard to what properties the resulting mix
satisfies. This is due usually to expediency rather than any darker
motives, but nevertheless can often cause more problems than it solves
and create obfuscation rather than illumination.
3. Productivity analysis in the public sector
In recent years, economists have begun assessing the performance of
public sector organisations through the use of productivity indices,
more common in the analysis of the private sector. These indices compare
aggregate output to aggregate input use (O'Mahony and Stevens,
2003; Dawson et al., 2005). The use of productivity indices comparable
to those used in the private sector encounters a number of problems that
must be overcome when applying them to the provision of public services.
Their use to measure productivity in the total NHS was explored jointly
by NIESR and the Centre for Health Economics at the University of York,
and is discussed in some depth in Dawson et al. (2005).
Broadly speaking, the main problems relate to the aggregation
required to obtain input and output indices and accounting for quality.
In the private sector, it is assumed that the market price of a good or
service measures the consumers' marginal valuation of the bundle of
characteristics from consuming the output. In the public sector this is
not generally true. One way to overcome this is the use of a cost
weighted activity index (CWAI), a measure currently employed by the
Department of Health (DH) in measuring NHS performance and by the ONS
for a range of public services. Cost weighting does not however account
for increases in quality of the service produced.
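As an illustration of cost weighting, the sketch below computes a volume index in which activity counts in two periods are both valued at base-period unit costs. This Laspeyres-type form is an assumption made here for illustration and is not claimed to be the exact formula used by DH or the ONS; the function name, activity labels and numbers are hypothetical.

```python
def cost_weighted_activity_index(base_counts, current_counts, base_unit_costs):
    """Ratio of current to base activity, both valued at base-period unit costs."""
    base_value = sum(base_counts[k] * base_unit_costs[k] for k in base_counts)
    current_value = sum(current_counts[k] * base_unit_costs[k] for k in base_counts)
    return current_value / base_value


# Hypothetical activity counts and unit costs (illustrative numbers only).
base_counts = {"outpatient_visit": 100, "operation": 10}
current_counts = {"outpatient_visit": 110, "operation": 10}
base_unit_costs = {"outpatient_visit": 50.0, "operation": 1000.0}

index = cost_weighted_activity_index(base_counts, current_counts, base_unit_costs)
print(round(index, 4))
```

Nothing in the calculation responds to how well the activities were carried out, which is the limitation noted in the text: a change in quality with unchanged counts leaves the index unchanged.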
Aggregation issues on the input side are potentially easier to deal
with than those on the output side, as there are market prices for
inputs. Thus by separating doctors by type (consultants, registrars,
house officers) and weighting by their shares in total wage bills, it is
possible to adjust labour input for quality.
One problem with productivity measures is that they do not measure
performance relative to un-rationed demand. It is possible to imagine a
hospital trust, for example, which has used its resources efficiently
but fails to meet the needs of potential patients. Despite this, the
productivity index approach is potentially useful in answering questions
on what is being provided by public services and how inputs are utilised
to achieve this. As such they represent a useful benchmark against which
to set officially produced performance indicators. In this paper, we
calculate input, output and productivity indices using the hospital
episode, reference cost, and financial statistics for NHS Acute Trusts
outlined below. By exploring differences across Trusts and those before
and after changes in target regimes we can get a comprehensive idea of
how such changes impact on services in a way not possible by confining attention to a small number of performance indicators.
3.1 Activities, outputs, characteristics and outcomes
In measuring outputs in the public sector in general, and the
health sector in particular, it is useful to distinguish between four
different concepts: (1)
* Activities (e.g. operative procedures, diagnostic tests,
outpatient visits)
* Outputs (e.g. courses of treatment--may require a bundle of
activities)
* Characteristics (e.g. the aspects of output of value to
individuals)
* Outcomes (e.g. the value of characteristics to individuals)
In the measurement of private sector productivity, the focus is on
outputs, rather than the characteristics that they embody. This is
because, under certain assumptions, the market price of the output
measures the consumers' marginal valuation of the bundle of
characteristics.
Bureaucracies tend to concentrate on activities. Put simply, this
is because these are the easiest of the four concepts to count. The
costs of activities may also be simpler to work out. NHS productivity
measures have been based upon estimates of the number of particular
types of activities (procedures, consultations etc) or the number of
patients treated in various institutional settings (see Dawson et al.,
2005).
There are advantages with this framework. For example, in instances
where care for a patient with a particular condition is provided
entirely within one setting, aggregation within the setting is
equivalent to aggregation by patient pathway or disease group. It
ensures compatibility with current NHS reporting systems and is likely
to prove amenable to analysis at a disaggregated level. It can be a
useful means for monitoring and managing lower level units within the
NHS. Further, the approach would ensure consistency with other policy
initiatives, most notably the Financial Flows reforms (Department of
Health, 2002a).
Notwithstanding the short-run benefits of using activity based
measures, in order for a performance index to be representative,
reliable and robust, it is more appropriate to consider output indices,
particularly those with quality adjustment, where possible. Outputs are
the actual goods or services provided and may consist of a number of
activities bundled together. For example, when patients are diagnosed
with heart problems, their treatment may consist of one or more
instances of surgery, medical management, the prescription of drugs and
possibly a number of other procedures. It is this bundle of activities
which should properly be considered to make up the output of the health
service with respect to the patient concerned. Thus, the unit of
analysis is not the finished consultant episode (FCE) but rather the
continuous inpatient spell (CIPS).
Whilst this is relatively simple, at least conceptually, at the
aggregate sector level, it raises some problems at more disaggregate levels. In our example of acute trusts, a course of treatment may
involve the provision of services by more than one trust. To overcome
this we concentrate on what is known as a provider spell (PS), that is
the course of treatment that is done within one organisation (although
note that an acute trust may have more than one site). The use of
provider spell does not throw up practical problems in our analysis, but
there are potential incentive implications if one concentrates on one
part of what can be potentially a multiple-provider service. If such
indicators are linked to reward or punishment regimes, they may create
incentives to 'farm out' difficult or problematic aspects of
processes, or to refuse them when other trusts refer them. Any system of
management creates incentives; if we do not consider the full set of
incentives created by a regime, we cannot know if it will have the
intended consequences.
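The idea of a provider spell can be sketched as a grouping of episode records. The Python below is a minimal illustration under stated assumptions: the record fields (`patient`, `trust`, `admitted`, `order`, `cost`) and the grouping key are hypothetical and do not describe the layout of any actual dataset.

```python
from collections import defaultdict

# Hypothetical episode records; field names and values are illustrative assumptions.
episodes = [
    {"patient": "p1", "trust": "T1", "admitted": "2003-01-05", "order": 2, "cost": 200.0},
    {"patient": "p1", "trust": "T1", "admitted": "2003-01-05", "order": 1, "cost": 3000.0},
    {"patient": "p1", "trust": "T2", "admitted": "2003-01-20", "order": 1, "cost": 900.0},
    {"patient": "p2", "trust": "T1", "admitted": "2003-02-11", "order": 1, "cost": 150.0},
]


def provider_spells(records):
    """Group episodes into provider spells keyed by (patient, trust, admission date)."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["patient"], rec["trust"], rec["admitted"])].append(rec)
    # Sort episodes within each spell so that the first episode can label the spell.
    return {key: sorted(eps, key=lambda e: e["order"]) for key, eps in grouped.items()}


spells = provider_spells(episodes)
for key, eps in spells.items():
    print(key, [e["order"] for e in eps], sum(e["cost"] for e in eps))
```

In this toy data patient p1 is treated by two trusts, so the same course of treatment appears as two provider spells. That is the feature which gives rise to the incentive concerns discussed above.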
One problem with the use of outputs (or activities for that matter)
in the absence of market transactions is that we cannot account for
variations in quality that are usually reflected in prices. Even when
prices do exist--for example prescriptions and private sector
comparisons--there are a number of reasons why these may not be the
appropriate measures. Prescription costs explicitly do not reflect the
costs of producing or value to the consumer of the items to which they
relate. This is because such pricing would contravene the principles of
equity on which the NHS and many other systems of health provision are
based. Prescription costs have more to do with cost-sharing as a method
of overcoming over-consumption that occurs when goods have zero prices
than with the outcome of some pseudo market system of allocation.
Private sector comparisons are tenuous in the UK healthcare setting
because of its relative size, the comparability of the service offered
(i.e. the sets of characteristics) and the demographics of the patients
(and therefore the demands placed on the provider in terms of health
problems occurring, different receptiveness to treatment and values
placed on characteristics of outputs etc.).
What is required is an alternative method of quality adjustment. In
order to find this we need to consider what determines the final value
of goods and services in both private and public sectors. The prices
which producers are willing to accept for providing services reflect the
costs of provision. What is of interest to us is what determines the
price consumers are willing to pay for such services. The value of goods
and services depends upon their characteristics, which bundled together
produce an outcome. Consider for example a hi-fi audio system. The final
outcome produced by the system is the ability to listen to music and
other audio material. A number of factors contribute to this, such as
the ability to play music in a number of different formats, sound
quality, reliability, size and ergonomics. Variations in prices of hi-fi
systems can be explained in terms of these characteristics. Such
variation allows one to calculate the valuations put on these
characteristics by the market.
Whilst we do not have the prices with which to undertake such
'hedonic analysis', consideration of the outcomes produced by
public sector organisations, and the characteristics of outputs that
provide these do allow us to consider the question of quality in public
sector service provision. At the very least, such a consideration
focuses us on the raison d'etre of the public services in question,
which is no bad thing. As we have already mentioned, often the focus is
an unquestioning one on activities because these are 'what the
organisation does' (rather than what it does them for).
The primary outcome of the public health sector is clear: to
increase the health of the population. This is in part pro-active,
through public health and the like, and partly remedial, through the
treatment of illness. The focus of acute trusts is the latter. For
treatment to be valued, it needs to have a positive impact on expected
health, either by removing or mitigating illness where possible or
palliating incurable illness. This can be done by extending life
expectancy (increasing the quantity of life) or by reducing pain,
increasing mobility etc. (increasing the quality of life). These two
aspects of the health impact of the health service can be combined in a
measure called quality-adjusted life years (QALY). A number of other
characteristics of health provision can be considered in terms of their
effect on the quantity and quality of life. Waiting times can be
considered as time that could be spent enjoying the enhanced,
post-treatment quality of life. In-treatment mortality can be considered
as the years of life the patient could have enjoyed if they had not
died.
The QALY measure is a widely, although not universally, accepted
instrument with which to measure the impact of treatment on
patients' health. The operation of the instrument is essentially in
two parts. First, the impact of treatment on a number of aspects of
health, such as pain and mobility, is calculated. Second, surveys are
conducted to place values on these characteristics. Although the QALY
instrument is a means whereby the value of health services to their
consumers can be calculated, we are a long way short of populating the
full set, or even the majority of treatments provided by modern health
services.
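The arithmetic behind a QALY comparison can be illustrated with a short sketch. It assumes, for simplicity, undiscounted QALYs computed as years of life multiplied by a quality weight between zero and one; the function name, the health profiles and the weights are hypothetical and are not drawn from any survey.

```python
def qalys(profile):
    """Sum of years multiplied by quality weight over a list of (years, weight) pairs."""
    return sum(years * weight for years, weight in profile)


# Hypothetical health profiles (illustrative numbers only).
without_treatment = [(2.0, 0.5)]             # two years at a quality weight of 0.5
with_treatment = [(1.0, 0.6), (4.0, 0.8)]    # one year recovering, then four better years

gain = qalys(with_treatment) - qalys(without_treatment)
print(round(gain, 2))
```

A wait before treatment could be represented in the same way, as additional time spent at the lower pre-treatment weight before the improved profile begins.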
In the absence of QALY measures for the majority of treatments, we
must utilise some other valuation methodology. Dawson et al. (2005)
conducted the first extensive analysis of the issues surrounding the
calculation of input, output and productivity indices for
publicly-provided health services. They concluded that until such time
as we can actually measure what the National Health Service does to its
patients and how well it does it, our best measure of the output of the
service is to use some kind of quality-adjusted cost-weighted output
index. This is the method we shall be using in this paper as a
comparator to the star ratings system of performance indicators.
There are a number of methods whereby one might quality-adjust the
measure of output of an acute trust. Here we consider a
survival-adjusted cost-weighted activity index, i.e. adjusting for the
life that could have been enjoyed if the patient had survived treatment.
Of course, while the proportion of people who die in hospitals is
relatively small, the cost is not insignificant.
4. Star ratings system for acute trusts
We will not describe the star ratings system in great detail here
as the system in general is discussed elsewhere in this issue in Gwyn
Bevan's paper. The star rating system for acute trusts was
introduced in the financial year 2000/1. Its final year was 2004/5 as it
is to be replaced by the 'annual health check' for 2005/6. The
star rating system consists of a small set of 'Key Targets'
(which drive the outcomes) and a larger set of 'balanced
scorecard' indicators, which have a much lower impact (Department of
Health, 2001, 2002b; Commission for Health Improvement, 2003; Healthcare
Commission, 2004). The key targets are outlined in table 1;
details of the balanced scorecard indicators are given in the appendix
(table A.1). The outcome of the star rating system is a rating between
no stars (trusts with the poorest level of measured performance) and
three stars (trusts with the highest levels of performance in the
measured areas). The two intermediate cases are one star (trusts where
there is some cause for concern regarding particular areas of measured
performance) and two stars (trusts with mostly high levels of
performance, but which are not consistent across all measured areas).
The most obvious thing to notice from the targets outlined in the
table is that the majority relate to waiting times. If we consider the
raison d'etre of an acute trust, it is to improve the health of its
patients. There are no measures of how many patients the trust treats,
nor how well they are treated, or indeed if they gain any benefit at
all. Long waiting times are a symptom of a system that is perhaps
operating inefficiently or unproductively and as such this is a
management tool that seeks to reduce these symptoms. In the typology
suggested above, what we have is a composite performance management
index. We have no information on the weights given to each of these
indicators in the final index. (2)
Is this the most appropriate index? In the spirit of what we have
said above, it is best perhaps to consider the purpose of the index (as
well as the purpose of the acute trusts). In this issue of the Review,
Bevan places the star ratings in a political context, suggesting that
they are a typical instrument of a command and control government, i.e.
they are a management tool. But as with any management tool, one has to
ask 'management for what purpose?' In the private sector, we
would ask 'do our management practices achieve their
outcomes?' or rather 'what is their effect on the bottom line,
i.e. profits or shareholder value?' In the public sector, one would
hope that the bottom line is social welfare. What is required to assess
management tools is a quality-adjusted output measure.
In this context of the star ratings system as a performance
management tool, we can deduce that the indicators are of importance to
the target setters. This may be because these were perceived to have
been problem areas that needed changing, in which case they could be a
legitimate component in the set of performance indicators that sought to
influence change in this area. Bevan suggests that if the aim was to
reduce waiting times the system certainly achieved its goal. However,
Bevan also suggests that the system in practice created three tiers of
outputs. At the top tier are the things that matter most--the 'key
targets'; next are those things that are of secondary
importance--the 'balanced scorecard' indicators; finally we
have the things that do not count--i.e. everything else. The goal may
therefore have been achieved at the price of distorting the
incentives within the system. We therefore raise the question once more
in a different way: is there a relationship between star ratings and
productivity that suggests that those with high ratings are actually the
least productive because they bought their 'stars' by concentrating on
waiting times (and on improving working lives and cleanliness)?
It is useful therefore to set the star ratings system against
alternative measures of performance and ask the question 'are acute
trusts with high star ratings more/less productive?' That is not to
say that this is the only measure against which they should be gauged.
As we have stated above, a performance management tool is not a welfare
index, it may have a different purpose, and as such should not be
expected to produce identical results. Nevertheless, if the ultimate aim
of the Government is to increase the welfare of the nation and of the
health service to improve its health, it is legitimate to ask if the
performance management tool is related to the ultimate aim of the
management (in this case the Government). If it is not, use of such
tools beyond the short period (where they might be used as a remedial
measure), would have serious distortionary ramifications on the
achievement of government's ultimate goal.
4.1 Did the stars move?
We can consider the issue of the success of the star ratings system
as a performance management tool by considering the changes in the star
ratings. Table 2 outlines the relationship between the star ratings of
the acute trusts that we could match from 2000/1 and 2003/4. (3) We can
see that 49 trusts (37 per cent) improved their star rating over the
period. Over half of the trusts achieved two stars in 2000/1 and over
half of these increased to three stars, with two trusts rising from zero
to three stars. The rating of 26 trusts fell over the period (20 per
cent). If the system intended to incentivise trusts to increase their
performance with regard to the key targets (and to a lesser extent the
background indicators), then it has on the face of it met with some
success.
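Summary shares of this kind can be read off a transition matrix of ratings. The sketch below shows the calculation on made-up counts; the matrix is hypothetical and does not reproduce table 2, and the function name is an assumption.

```python
def summarise_transitions(matrix):
    """Count trusts whose rating rose, stayed the same or fell.

    matrix[i][j] is the number of trusts moving from i stars to j stars.
    """
    improved = unchanged = fell = 0
    for start, row in enumerate(matrix):
        for end, count in enumerate(row):
            if end > start:
                improved += count
            elif end < start:
                fell += count
            else:
                unchanged += count
    total = improved + unchanged + fell
    return {
        "improved": improved,
        "unchanged": unchanged,
        "fell": fell,
        "share_improved": improved / total,
        "share_fell": fell / total,
    }


# Hypothetical counts (rows: initial stars 0-3, columns: final stars 0-3).
toy_matrix = [
    [1, 2, 1, 0],
    [0, 3, 4, 1],
    [1, 2, 6, 5],
    [0, 1, 2, 4],
]

summary = summarise_transitions(toy_matrix)
print(summary)
```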
However, this sanguine interpretation makes two not inconsequential assumptions. First, that the system operated in a consistent way: that is,
the monitoring did not become more lenient and trusts did not become
better at gaming the system. Second, that the effect on other areas of the
trusts' operations was modest or positive. The assessment of the
first is beyond the scope of this paper; the second is not.
5. Empirical model
We construct labour productivity indices using a basic (i.e.
non-quality-adjusted) cost-weighted output index and a
mortality-adjusted cost-weighted output index. Labour is by far the most
important input in producing health services, accounting for 71 per cent
of expenditure in the hospital sector. (4)
In the absence of quality adjustments for changes in QALYs, this
paper presents measures based on cost-weighting activities. If the unit
of analysis is the patient spell (or patient pathway, i.e. the bundle of
activities a patient receives in order to be treated), a cost-weighted
output index (CWOI) can be calculated. Leaving out the time subscripts
for clarity, this is given by:
$$O = \sum_{j}\sum_{s} x_{js} c_{js} = \sum_{l} x_{l} c_{l} \qquad (1)$$
where s is the activities within a spell and $x_l$ and $c_l$ are the
output and cost of spells beginning with activity type l respectively.
Our quality-(mortality-) adjusted output index is:
$$O_M = \sum_{j}\Big[\Big(\sum_{s} x_{js} c_{js}\Big)(1 - m_j)\Big] = \sum_{l} x_{l} c_{l} (1 - m_l) \qquad (2)$$
where m is a dummy variable that equals one if the patient dies and
zero otherwise; (1-m) represents the survival rate. The spells are
created from data on individual patients' spells. If the index were
calculated using episodes we could use the mean value of (1-m), which is
the survival rate for the particular episode type (or to be more
precise, the admission-health resource group combination) and multiply
this by the number of individuals treated in that group and cost.
However, in our quality-adjusted index $m_l$ is the mean number
of individuals whose spell began with episode type l and ended in death.
From (2) we can see the difference between the activity and output
(course of treatment/ bundle of episodes) methods. For a spell of
treatment, we need to adjust the whole sum of episodes in that spell,
weighted by their costs, by whether any of the episodes ended in death
(note that an individual could be receiving a number of treatments at
the same time). Thus, if an individual went in for one procedure but
then went on to have others--either as part of a bundle of treatments
for a particular ailment or as a result of misdiagnosis or error by
medical staff--the whole spell is quality-adjusted by the fact that the
individual died. For example, if a patient went into hospital for a
spell of treatment for a heart condition that included an operation
followed by, say, a course of drugs, it is the treatment of the heart
condition that ended in a death, rather than just the course of drugs.
As a corollary of this, consider a patient who went into hospital for a
treatment that is not considered life-threatening--say having an
in-growing toe-nail removed--but due to mismanagement was given the
incorrect drugs and suffered a heart failure. The death must be linked
to the course of treatment that began with an in-growing toe-nail
treatment, not merely the treatment for heart failure. Part of the risk
of having an in-growing toe-nail is the (one would hope extremely small)
risk that the patient would be mistreated and die.
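The spell-level adjustment in (2) can be sketched in a few lines of Python. This is a minimal illustration only: the spell records, episode type labels and unit costs below are hypothetical rather than HES or reference cost data, and the function names are chosen purely for the example.

```python
from collections import defaultdict

# Hypothetical spells: (first episode type l, [(activity count x, unit cost c), ...], died m).
# None of these figures come from HES or the reference costs database.
spells = [
    ("toenail_elective", [(1, 500.0)], 0),
    ("toenail_elective", [(1, 500.0), (1, 3000.0)], 1),  # complication; patient died
    ("heart_nonelective", [(1, 4000.0), (1, 1000.0)], 0),
    ("heart_nonelective", [(1, 4000.0), (1, 1000.0)], 1),
]


def spell_value(activities):
    """Cost-weighted sum of the activities in one spell: sum over s of x_js * c_js."""
    return sum(x * c for x, c in activities)


def cwoi(spells):
    """Unadjusted cost-weighted output, summing every spell in full."""
    return sum(spell_value(acts) for _, acts, _ in spells)


def mortality_adjusted_cwoi(spells):
    """Adjusted output: the whole spell is scaled by (1 - m_j), not just its last episode."""
    return sum(spell_value(acts) * (1 - died) for _, acts, died in spells)


def adjusted_by_first_episode(spells):
    """Adjusted output grouped by the episode type that began the spell."""
    totals = defaultdict(float)
    for first_type, acts, died in spells:
        totals[first_type] += spell_value(acts) * (1 - died)
    return dict(totals)


print(cwoi(spells))
print(mortality_adjusted_cwoi(spells))
print(adjusted_by_first_episode(spells))
```

In this sketch the second toe-nail spell contributes nothing to adjusted output, including the cost of the later emergency treatment, because the death is attributed to the spell as a whole and hence to the episode type with which it began.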
One objection that might be levelled at such indices is that they
do not include factors such as cleanliness (except for the extreme
effects such as those on mortality, e.g. MRSA) and waiting times.
However, most people are likely to give these a low
weighting relative to being treated and not dying during treatment. (5)
The labour input index is a standard weighted average over the staff
types m (6)
$L = \sum_{m} n_{m} w_{m}$ (3)
where $n_{m}$ is the number of staff of type m and the weight $w_{m}$ is
equal to the aggregate expenditure share of labour type m. Thus our basic
cost-weighted productivity index is:
$P = \frac{O}{L} = \frac{\sum_{l} x_{l} c_{l}}{\sum_{m} n_{m} w_{m}}$ (4)
Our mortality-adjusted productivity index is
$P_{M} = \frac{O_{M}}{L} = \frac{\sum_{l} x_{l} c_{l} (1 - m_{l})}{\sum_{m} n_{m} w_{m}}$ (5)
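As a rough sketch of how the labour input index in (3) feeds into the productivity indices, the following Python fragment assumes that productivity is the ratio of the output index to the labour input index. The staff categories, staff numbers, pay bills and output values are all hypothetical and serve only to show the arithmetic.

```python
# Hypothetical inputs: whole-time-equivalent staff numbers for one trust and
# sector-wide pay bills by staff type. None of these figures come from the paper.
staff_wte = {"consultants": 100.0, "nurses": 600.0, "admin": 300.0}
sector_pay_bill = {"consultants": 40.0, "nurses": 40.0, "admin": 20.0}


def expenditure_shares(pay_bill):
    """Weight w_m: each staff type's share of aggregate labour expenditure."""
    total = sum(pay_bill.values())
    return {staff_type: bill / total for staff_type, bill in pay_bill.items()}


def labour_input(staff, weights):
    """Labour input index as in (3): L = sum over m of n_m * w_m."""
    return sum(staff[m] * weights[m] for m in staff)


def productivity(output_index, labour_index):
    """Assumed form of the productivity index: output per unit of labour input."""
    return output_index / labour_index


weights = expenditure_shares(sector_pay_bill)
labour = labour_input(staff_wte, weights)

# Hypothetical output indices for the same trust: basic and mortality-adjusted.
basic_output, adjusted_output = 6800.0, 6460.0
print(round(labour, 2))
print(round(productivity(basic_output, labour), 2))
print(round(productivity(adjusted_output, labour), 2))
```

Because the weights are aggregate expenditure shares rather than trust-specific ones, two trusts with the same staff mix receive the same labour input regardless of what each happens to pay.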
The data for our analysis come from a number of sources. The counts
of outputs (spells) come from the Hospital Episodes Statistics (HES)
database of all elective and non-elective day patients, with the
exception of regular day and night attenders. This gives us over 10
million provider spells every year. The cost weights are derived by DH
and come from the reference costs database. (7) Reference costs are
indexed by the 'healthcare resource group' of the first
episode in a spell and whether the first was an elective or non-elective
admission. This gives us over 1000 combinations (i.e. J or l >
1,000). (8)
Staff numbers are whole time equivalent (WTE) numbers from the NHS
Census. The earnings used for the expenditure share calculations come
from the NHS Earnings Survey. For more information on these data sources
and how to construct the system-wide analogue of provider spells
(continuous inpatient spells), see the data appendix to Dawson et al.
(2005).
6. Results
6.1 Relationship between star ratings and productivity
The results for the basic CWOI are set out in table 3. The tables
are standardised such that the overall average level of productivity in
2000/1 equals one. Thus the first entry states that zero star-rated
trusts in 2000/1 achieved 2 per cent higher than average productivity
levels in that year. In 2000/1, three-star trusts do have the highest
levels of labour productivity. However, both the between-star rating and
between-year variation are swamped by the within-group
variation--standard deviations are considerably higher than the ratios
of the indices across group or year. Thus we cannot differentiate
between the productivity of any of the groupings with any degree of
certainty. Indeed in 2001/2 there is, if anything, a negative
relationship between the star rating the trust achieved and its labour
productivity index.
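The scale of the within-group variation can be illustrated with a rough check on the figures in table 3. The sketch below is not a formal test: comparing the gap between three-star and zero-star trusts with the standard deviation in the Total row is a simplification adopted here for illustration, and the function name is hypothetical.

```python
# (mean, standard deviation) for zero-star trusts, three-star trusts and all
# trusts, copied from table 3.
TABLE3 = {
    "2000/1": {"zero": (1.02, 0.19), "three": (1.05, 0.30), "all": (1.00, 0.23)},
    "2001/2": {"zero": (1.34, 0.23), "three": (1.14, 0.32), "all": (1.13, 0.25)},
    "2002/3": {"zero": (1.09, 0.20), "three": (1.10, 0.16), "all": (1.09, 0.18)},
    "2003/4": {"zero": (1.20, 0.15), "three": (1.15, 0.16), "all": (1.11, 0.17)},
}


def gap_in_sd_units(year):
    """Three-star minus zero-star mean, divided by the all-trust standard deviation."""
    zero_mean, _ = TABLE3[year]["zero"]
    three_mean, _ = TABLE3[year]["three"]
    _, overall_sd = TABLE3[year]["all"]
    return (three_mean - zero_mean) / overall_sd


for year in TABLE3:
    print(year, round(gap_in_sd_units(year), 2))
```

On these figures the gap is smaller than one standard deviation in every year, and it is negative in 2001/2 and 2003/4.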
The same is true when we quality-adjust the output index to account
for differences in mortality across acute trusts (table 4). The patterns are
similar--an increase in productivity between 2000/1 and 2001/2 followed
by a levelling out, and a negative relationship between star ratings and
labour productivity in 2001/2--but they are once more swamped by the
within-group variation. Once again we can say that there is little or no
relationship between the star rating the acute trust received and its
productivity.
Our results cannot discriminate between a number of competing
theses. As we noted above, one might not expect there to be a positive
relationship between labour productivity and star ratings if the only
difference between hospitals were the waiting lists and other key
indicators (and, possibly, balanced scorecard indicators). What about
the suggestion that trusts 'bought' lower scores in other key
indicators (i.e. waiting times) by foregoing the financial management
indicator (which referred to whether they achieved their financial plan
without unplanned support) (Bevan, 2006)? This would lead to an increase
in both the numerator (output), as trusts cleared their waiting lists,
and the denominator (labour input), as trusts increased staff numbers or
overtime to achieve this. Thus the effect on productivity would be
indeterminate. Moreover, the fact that trusts with long waiting lists
were no more or less productive than others may also suggest that the
problem is due not so much to an individual trust's inefficiency,
but rather to an excess of demand (i.e. inefficient strategic management
at the level of between-trust resource allocation).
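The indeterminacy can be made explicit with a short derivation. Writing productivity as the ratio of output O to labour input L (the symbol P is introduced here only for illustration), the change in log productivity is the difference between the changes in log output and log labour input:

```latex
\begin{align*}
P &= \frac{O}{L} \\
\ln P &= \ln O - \ln L \\
\Delta \ln P &= \Delta \ln O - \Delta \ln L
\end{align*}
```

Productivity therefore rises only if output grows proportionately faster than labour input. If a trust clears its waiting list by adding staff or overtime in the same proportion, P is unchanged.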
6.2 Aggregate productivity
The overall CWOI and mortality-adjusted productivity indices for
the acute trust sector along with uncertainty bounds at plus and minus
one standard deviation are shown in figure 1. (9) From the figure it is
clear that although there was an increase in both indices between 2000/1
and 2001/2, this is dwarfed by the uncertainty about their value. There
is a slight divergence between the basic CWOI and the mortality-adjusted
index; this is because of the slight increase in mortality. However,
once we consider the error bounds we can see that the difference is
dwarfed by uncertainty about the aggregate productivity values
themselves.
[FIGURE 1 OMITTED]
7. Discussion
The setting and use of targets in the public sector has generated a
growing amount of interest in the UK. This has occurred at a time when
analysts and policymakers are increasingly grasping the nettle of
measuring performance in the public sector. The discussion in this paper
highlights the fact that performance indicators should satisfy certain
desiderata. In particular they should be fit for the purpose, i.e. the
correct type of indicator for the job at hand. Indicators that are put
to uses not foreseen by their designers are seldom put to good use. In
addition the indicators should cover an appropriate domain(s) of service
under examination. Management indices that include aspects that are
beyond the control of the managed will present a misleading picture of
performance. Aggregation methods are also important--if an index
comprises more than one indicator (as is usual) the weights used to
aggregate them should represent their relative value and not some
arbitrary scheme, such as giving each equal weight without consideration
of the alternatives.
As an empirical application of our methodology, we have compared
one type of performance indicator with another. The star ratings system
for acute hospital trusts in England is an example of a performance
management system. It is part of a mechanism designed to improve, or at
least change, the sector with regard to a small number of key targets,
largely relating to waiting times. We have contrasted this with a
measure of labour productivity analogous to those used in the analysis
of the private sector.
We find that the two methods of assessment are almost entirely
unrelated. That is to say, there is no relationship between an acute
trust's star rating and its productivity. Our results are the same
whether one uses a simple cost-weighted output index, or whether one
quality-adjusts the output index for variations in patient mortality.
Whatever the merits of the star rating system in terms of a management
tool--and the increase in the number of trusts achieving high star
ratings suggests that it has been, on its own terms, successful--it does
not provide an indicator of the performance of trusts in terms of their
productiveness in providing health care for patients.
The results also do not say that high achieving trusts have
obtained their high star ratings at the cost of neglecting their primary
purpose of treating patients. They are, however, consistent with the
thesis that since the key indicators included only one financial one and
ten or so others, trusts could quite literally buy their way out of
trouble by missing one target (the financial one) in order to meet the
remainder (mainly on waiting times). (10) This is no more than tentative
evidence, however, and it would need further research to determine
whether this were true.
The lack of any correspondence between the star ratings and
productivity of acute trusts does raise questions as to the
appropriateness of such indicators of performance, particularly over the
long term. Whilst they may have achieved certain goals deemed important
by central government (i.e. waiting times), the focus they place on a
small subset of the role of acute trusts--i.e. the speed with
which services are provided rather than the services themselves--may
become detrimental. The fact that the star rating system, having
achieved what might be considered to be one of its primary objectives,
has been replaced, gives one hope that this is not the case, although
the jury is out on 'annual health checks'.
Appendix
Table A.1. Balanced scorecard indicators
used in star ratings for acute trusts
2000/1 2001/2 2002/3 2003/4
Clinical focus indicators
Clinical negligence: compliance
against CNST risk management
standards x x x x
Emergency readmission to hospital
following discharge (adults) x x x x
Emergency readmission to hospital
following discharge (children) x x
Emergency readmission following
treatment for a fractured hip x x x
Emergency readmission following
treatment for a stroke x x
Deaths in hospital within
30 days of non-elective surgery x x x x
Deaths within 30 days
of heart bypass operation x x x
Returning home following
treatment for a fractured hip x
Returning home following
treatment for stroke x
Infection control x x
MRSA bacteraemia: improvement score x
Thrombolysis treatment time x x
Compliance to recommended child
protection systems & procedures x
Clinical governance
composite indicator x
Extent of participation in
selected clinical audits x
% stroke patients spending time
on a specialist stroke unit x
"Winning Ways" processes
and procedures x
Patient focus indicators
% patients waiting less than
6 months for an inpatient
appointment x x x x
Patients seen within 13 weeks
of GP written referral for
first outpatient appointment x x x x
Trolley waits of more than
4 hours (% non-elective FFCEs) x
Resolution of written complaints x x x
No. of patients waiting
for an inpatient appointment
(% planned target achieved) x x
Total time spent in A&E x
% patients not admitted within one
month of last minute cancellation x
% patients treated within one month
of diagnosis of breast cancer x x x
% patients treated for breast
cancer within two months of
urgent GP referral for suspected
cancer x
% patients whose discharge from
hospital was delayed x
Inpatient survey: co-ordination
of care x
Inpatient survey: environment and
facilities x
Inpatient survey: information and
education x
Inpatient survey: physical and
emotional needs x
Inpatient survey: prompt access x
Inpatient survey: respect and
dignity x
% patients admitted to hospital
via A&E within 4 hours of
decision to admit x x
Hospital food (whole trust score) x x
% elective admissions cancelled at
the last minute for non-clinical
reasons x x
% of day cases that were pre-booked x x
% patients whose transfer of care
from hospital was delayed x x
Heart operation waits x x x
Outpatient A&E survey:
access and waiting x
Outpatient A&E survey:
better information, more choice x
Outpatient A&E survey: building
relationships x
Outpatient A&E survey: clean,
comfortable and friendly place
to be x
Outpatient A&E survey: safe, high
quality co-ordinated care x
Paediatric outpatient
non-attendance rates x
Privacy and dignity: compliance
with objectives x
% patients with new onset
chest pain seen in RACPCs
within 2 weeks of GP referral x
Adult inpatient and young patient
surveys: access and waiting x
Adult inpatient and young patient
surveys: better information,
more choice x
Adult inpatient and young patient
surveys: building closer
relationships x
Adult inpatient and young patient
surveys: clean, comfortable
and friendly place to be x
Adult inpatient and
young patient surveys: safe,
high quality co-ordinated care x
Capacity and capability
focus indicators *
Sickness/absence rate for
directly employed NHS staff x x x
Compliance with the New Deal
on junior doctors' hours x x x x
Consultant vacancy rates x
Qualified nursing, midwifery and
health visiting staff vacancy
rates x
Qualified AHPs vacancy rates x
Data quality x x x
Staff satisfaction with employer x x
Information governance x x x
Fire, health and safety backlog x
% consultants completing annual
appraisal x x
Staff opinion survey: health,
safety and incidents x
Staff opinion survey: human
resource management x
Staff opinion survey:
staff attitudes x
Note: * This set of indicators was referred
to as 'staff focus' indicators in 2000/1.
Table A.2. Occupational categories used
in construction of labour input index
1997, 1998, 1999 2000 2001
2002, 2003
Non-medical staff
Qualified Nursing, Qualified Qualified
nursing, midwifery and nursing, nursing,
midwifery and health visiting midwifery and midwifery and
health visiting staff health visiting health visiting
staff staff staff
Qualified Scientific, Scientific, Qualified
scientific, therapeutic and therapeutic and Allied Health
therapeutic and technical staff technical staff Professionals
technical staff
Qualified Ambulance staff Ambulance staff Ambulance staff
ambulance staff
Support to Health care Health care Health care
clinical staff assistants and assistants and assistants and
other support other support other support
staff staff staff
NHS Admin & estates Admin & estates Admin & estates
infrastructure staff staff staff
support
Nursing, Other nursing, Other nursing,
midwifery and midwifery and midwifery and
health visiting health visiting health visiting
learners staff staff
Other staff Other staff Other staff
Other STT staff
Medical staff
Associate specialist and staff grade
Consultants
Dental Officers
House Officers
Hospital practitioners and clinical assistants
Other community health services
Other hospital (1997 to 2000 only)
Registrars
Senior Dental Officers
Senior House Officers
Notes: The Department of Health's Annual Workforce Census (for further
details see
http://www.dh.gov.uk/PublicationsAndStatistics/Statistics/StatisticalWorkAreas/StatisticalWorkforce/fs/en)
provided data on numbers employed by occupational group at the NHS Acute
Trust level. However, it should be noted that there are changes to the
occupational categories in the workforce census over time, particularly
for non-medical staff. The occupational groups used in calculating the
index of labour input used in this paper are summarised in table A.2 above.
REFERENCES
Atkinson, A. (2005), 'Measurement of government output and
productivity for the national accounts', Atkinson Review: Final
Report, HMSO.
Bevan, G. (2006), 'Setting targets for health care
performance. Lessons from a case study of the English NHS',
National Institute Economic Review, 197.
Commission for Health Improvement (2003), 'NHS performance
ratings acute trusts, specialist trusts, ambulance trusts 2002/2003',
London, Commission for Health Improvement.
Cutler, T. (2002), 'Star or black hole?' Community Care,
pp. 40-41.
Dawson, D., Gravelle, H., O'Mahony, M., Street, A., Weale, M.,
Castelli, A., Jacobs, R., Kind, P., Loveridge, P., Martin, S., Stevens,
P. and Stokes, L. (2005), 'Developing new approaches to measuring
NHS outputs and productivity', NIESR Discussion Paper No. 264 and
CHE Research Paper No. 6, available at:
http://www.niesr.ac.uk/pdf/nhsoutputsprod.pdf.
Department of Health (2001), 'NHS performance ratings: acute
trusts 2000/01', London, Department of Health.
--(2002a), 'Reforming NHS financial flows: introducing payment
by results', London, Department of Health.
--(2002b), 'NHS performance ratings and indicators: acute
trusts, specialist trusts, ambulance trusts, mental health trusts
2001/02', London, Department of Health.
Hemingway, J. (2004), 'Sources and methods for public service
productivity: health', Economic Trends, 613, pp. 82-90.
Healthcare Commission (2004), '2004 performance ratings',
London, Healthcare Commission.
Hill, A. (2003), 'The UK government's public service
agreement framework', Background Paper, HM Treasury.
JRSS (2004), 'Special issue on performance monitoring and
surveillance', Journal of the Royal Statistical Society, 167, 3.
Kmietowicz, Z. (2003), 'Star rating system fails to reduce
variation', British Medical Journal, 327, p. 184.
Lee, P. (2004), 'Public service productivity: health',
Economic Trends, 613, pp. 38-59.
Mai, N. (2004), 'Measuring health care output in the UK',
Economic Trends, 610, pp. 64-73.
Miller, N. (2002), 'Missing the target', Community Care,
21-27 November, pp. 36-8.
O'Mahony, M. (2006), 'Outputs, inputs and productivity in
the NHS', presentation to NIESR/ESRC conference on 'Public
Sector Performance', London, British Academy.
O'Mahony, M. and Stevens, P.A. (2002), 'Measuring
international comparative performance in the provision of public
services: a review', report to Evidence Based Policy Fund.
--(2003), 'International comparisons of performance in the
provision of public services: outcome based measures for
education', Presentation to NIESR conference on 'Productivity
and Performance in the Provision of Public Services', London,
British Academy.
OXREP (2003), 'Special issue on financing and managing public
services', Oxford Review of Economic Policy, 19, 2.
Pritchard, A. (2003), 'Understanding government output and
productivity', Economic Trends, 596, pp. 27-40.
--(2004), 'Measuring government health services outputs in the
UK national accounts: the new methodology and further analysis',
Economic Trends, 613, pp. 69-81.
Rowan, K., Harrison, D., Brady, A. and Black, N. (2004),
'Hospitals' star ratings and clinical outcomes: ecological
study', British Medical Journal, 328, pp. 924-5.
Snelling, I. (2003), 'Do star ratings really reflect hospital
performance?' Journal of Health Organization and Management, 17,
pp. 210-23.
Stevens, P. (2005), 'Assessing the performance of local
government', National Institute Economic Review, pp. 90-101.
NOTES
(1) Other authors such as Stevens (2005) and Dawson et al. (2005)
consider these as three (activities, outputs and outcomes); here we
differentiate more clearly between the characteristics of outputs and
their effect on outcomes.
(2) Although this is possible and we are currently undertaking
statistical analysis to determine the implicit weights.
(3) Note that there has been some movement in the population, with
trusts amalgamating.
(4) Ideally we would calculate multi-factor productivity measures
that account also for capital and intermediate inputs but doing so at
the trust level is beyond the scope of this paper.
(5) Certainly, in their discussion of the effects of waiting times
on quality adjustment, Dawson et al. (2005) point out that compared to
the total time over which improved health due to treatment might be
enjoyed, the weeks or even months of waiting are relatively small.
(6) The staff types used in the construction of the labour input
index are detailed in table A.2 of the appendix.
(7) NHS reference costs are available online at:
http://www.dh.gov.uk/PolicyAndGuidance/OrganisationPolicy/FinanceAndPlanning/NHSReferenceCosts/fs/en.
(8) Note that they represent average costs across trusts, rather
than unit costs for individual trusts. Aside from concerns regarding
their reliability, the use of average unit costs calculated at the trust
level would create two problems. First, given that the cost is measuring
the relative 'importance' of a particular output, it is unclear
why this should vary by trust. Second, it could create a perverse
incentive for trusts to increase provision in areas where they are
relatively expensive or inflate costs in areas where they are big
producers. With trust-level costs, a hospital that spends greater than
average amounts providing, say, hip replacements would appear more
productive in our index, even though the greater expense might well
reflect poor management of service provision.
(9) These figures were aggregated using expenditure weights.
(10) A trust could underachieve on one target and still get three
stars, but not significantly underachieve. For 2003/4 on this target,
'underachieve' was defined as 'Adverse variance from financial plan
of up to 1% of turnover or less than £1 million without
unplanned financial support', while 'significantly underachieved' was
defined as 'Adverse variance from financial plan greater than 1% of
turnover or greater than £1 million or unplanned
financial support'.
Philip Stevens *, Lucy Stokes ** and Mary O'Mahony ***
* National Institute of Economic and Social Research and Medium
Term Strategy Group, Ministry of Economic Development, New Zealand.
email: p.stevens@niesr.ac.uk. **National Institute of Economic and
Social Research. ***National Institute of Economic Research and
Birmingham Business School. The research reported here was funded by the
ESRC, Grant no. RES-153-25-0044. Thanks go to Gwyn Bevan, Rowena
Jacobs, Martin Weale and participants at the 4th NIESR Public Sector
Performance Conference, January 2006, for helpful comments, although the
usual caveats apply.
Table 1. Key targets used in star ratings for acute trusts
2000/01 2001/02 2002/03
Shorter inpatient waiting lists x
Inpatients waiting longer than x x x
the standard
Reduction in outpatient waiting x
Outpatients waiting longer than x x x
the standard
Outpatient and elective
(inpatient and day-case) booking
Cancer: % seen within 2 weeks * x x
Financial management x x x
12 hour waits for emergency
admission via A&E following
decision to admit x x x
Total time in A&E: 4 hours or less x
Cancelled operations x x x
Improving working lives x x x
Hospital cleanliness x x x
2003/04 2004/05
Shorter inpatient waiting lists
Inpatients waiting longer than x x
the standard
Reduction in outpatient waiting
Outpatients waiting longer than x x
the standard
Outpatient and elective x x
(inpatient and day-case) booking
Cancer: % seen within 2 weeks x x
Financial management x x
12 hour waits for emergency
admission via A&E following
decision to admit x x
Total time in A&E: 4 hours or less x x
Cancelled operations
Improving working lives x
Hospital cleanliness x x
Note: * Balanced scorecard only.
Table 2. Movement of star ratings
                      2003/4 rating
2000/1 rating      0      1      2      3   Total
0                  1      1      6      2      10
1                  4      7      4      2      17
2                  2     11     31     34      78
3                  0      4      5     18      27
Total              7     23     46     56     132
Table 3. CWOI by star rating
Star rating    2000/1    2001/2    2002/3    2003/4
0                1.02      1.34      1.09      1.20
               (0.19)    (0.23)    (0.20)    (0.15)
1                0.99      1.17      1.15      1.12
               (0.22)    (0.21)    (0.17)    (0.21)
2                1.01      1.12      1.14      1.12
               (0.23)    (0.23)    (0.20)    (0.17)
3                1.05      1.14      1.10      1.15
               (0.30)    (0.32)    (0.16)    (0.16)
Total            1.00      1.13      1.09      1.11
               (0.23)    (0.25)    (0.18)    (0.17)
Notes: Standard deviations in parentheses.
All scores indexed to 2000/1 average.
Table 4. Mortality-adjusted CWOI by star rating
Star rating    2000/1    2001/2    2002/3    2003/4
0                1.00      1.32      1.08      1.18
               (0.19)    (0.22)    (0.20)    (0.15)
1                0.98      1.15      1.13      1.11
               (0.21)    (0.21)    (0.17)    (0.21)
2                0.99      1.11      1.12      1.11
               (0.22)    (0.23)    (0.20)    (0.17)
3                1.03      1.12      1.08      1.13
               (0.29)    (0.32)    (0.16)    (0.16)
Total            1.00      1.13      1.11      1.12
               (0.23)    (0.26)    (0.18)    (0.17)
Notes: Standard deviations in parentheses.
All scores indexed to 2000/1 average.