Undercounts in offender data and closing the gap between indigenous and other Australians.
Hunter, Boyd ; Ayyar, Aarthi
Introduction
The accuracy of administrative data for Indigenous populations and
ethnic minorities is an important matter for researchers and
policy-makers. For example, if governments do not have reliable data on
who is Indigenous in administrative data that are used to plan services,
then they cannot plan adequately to address Indigenous disadvantage.
Consequently the major thrust of Indigenous policy that attempts to
'Overcome Indigenous Disadvantage' (OID) and 'Close the
Gap' between Indigenous and non-Indigenous life expectancy is
undermined (Steering Committee for the Review of Government Service
Provision (SCRGSP) 2003; 2005; 2007).
Undercounts of Indigenous people are present in almost all data
sets, even in the Census, and hence it is difficult to know the exact
number of people covered by a particular policy. The Australian Bureau
of Statistics (ABS) uses sophisticated techniques to correct for this
tendency towards under-enumeration in the Indigenous population in
Census data, but it is relatively rare to find similar corrections
applied to data collected for administrative purposes by government
agencies. In addition to undermining the credibility of the information
provided from such sources, poor quality data have profound consequences
for the policy settings. For example, the Commonwealth Grants
Commission's (CGC's) horizontal fiscal equalisation formula is
partially based on the Indigenous Estimated Residential Population
(ERP), hence data quality for Indigenous mortality and fertility will
fundamentally affect the distribution of resources in Australia's
federal system.
While data quality is a rather dry topic, it is clearly very
important. The objective of this paper is to present some analysis to
illustrate how one might go about critically examining administrative
data on the Indigenous population. While many administrative data sets
collect information on the Indigenous status of clients, often the
quality of such data is questionable.
One of the largest categories in both population Censuses and
administrative data is that of Indigenous status listed as unknown. The
fundamental premise of this paper is that it is important to understand
this category lest the data quality be fundamentally compromised. If the
unknowns are a substantial component of the population, then one cannot
be certain one has correctly estimated the incidence of phenomena in the
Indigenous and the residual Australian population. That is, if the
unknowns are more like the non-Indigenous population than Indigenous
Australians, then a comparison between the known Indigenous and
non-Indigenous populations would overstate Indigenous disadvantage. Of
course the obverse of this proposition is also true (that is, if the
unknowns are more like those who clearly indicate their indigeneity,
then Indigenous disadvantage is likely to be understated).
It is obviously important to understand who identifies as an
Indigenous person at a particular point in time, but it is arguably even
more important to understand how someone's identity might change
over time. This is because, as alluded to above, administrative data are
increasingly used to ascertain whether Indigenous outcomes are improving
according to various indicators (for example, in OID reports). If data
quality is unreliable and uncertain, then one should question the level
of investment warranted to achieve desired outcomes. Real resources can
be diverted from effective programs because of spurious trends in
outcomes identified using unreliable data, especially when funding
available for social policy is scarce.
This paper illustrates some of the issues involved in using
administrative data from the Re-Offending Database (ROD) provided by the
NSW Bureau of Crime Statistics and Research (BOCSAR) (Snowball &
Weatherburn 2006). This data set has several features that make it
suitable for a study of the quality of Indigenous data: first, it is
collected over a period of time during which it is possible for
Indigenous people to change their identification; and second, data on
Indigenous status are collected from two 'independent' sources,
which allows us to validate our estimates.
The standard approach adopted to estimate the number of people
missing from any particular enumeration is that of a follow-up survey
(Marks et al. 1974). Such a survey is undertaken after major Censuses in
most developed countries; in Australia it is known as the
Post-Enumeration Survey (PES). (1) Another method for estimating the
potential Indigenous population is the Dual System Estimator (DSE),
sometimes referred to as 'dual survey estimators' or
'dual record systems', which can be used to benchmark the
estimate of the Indigenous population within the NSW criminal court
system.
The next section describes broader issues regarding historical
changes in the Indigenous population, followed by a detailed
introduction to the ROD data. It is proposed that a statistical model be
used to predict whether the unknowns are more like the Indigenous or
non-Indigenous population. After presenting a summary of the results
estimated from this model, the penultimate section benchmarks these
results using a simple DSE that has been used to estimate populations
for various groups in many countries (see Hunter & Dungey 2006). The
final section provides some concluding remarks about the utility of the
estimators used and points to future research directions that might
prove useful for policy makers and researchers. While this paper
attempts to build confidence in the data, we also hope to raise basic
issues that can and should be considered by anyone who collects and
analyses data on Indigenous Australians.
The big picture for Indigenous population changes
In general, population levels change over time according to the
demographic balancing equation, which is an accounting identity that
measures the flows into and out of a particular population (Shryock,
Siegel & Associates 1976: 4):
$$ERP_{t+1} = ERP_t + \text{Births}_t - \text{Deaths}_t + \text{Net Migration}_t + \text{Census Procedure}_t + E_t \qquad (1)$$
$ERP_t$ is the time-specific Estimated Residential Population
and $E_t$ is an error or residual term. The term $ERP_t$ takes
into account the tendency to miss some people when counting
populations--that is, net undercount at a particular point of time. The
recognition of this tendency does not deny the existence of double
counting in some circumstances, but the reality is that many Indigenous
people do not identify, or choose not to identify, as Indigenous, for
reasons that could include past experiences of racism or discrimination.
The balancing equation can be characterised as an accounting identity
when one examines the total population because the residual term means
that this equation is always true by definition. For sub-populations
such as the Indigenous and non-Indigenous populations, the changes are
complicated by non-biological population growth (sometimes called an
'error of closure', which is largely embodied in the term $E_t$).
This non-biological growth can include components such as those due to
increased (or decreased) propensity to identify as Indigenous, to
intermarriage between various sub-populations and to identification of
the resulting progeny from such marriages. Another factor that affects
non-biological growth is the change in both coverage of sub-populations
(Guimond 1999) and the Census editing procedures (Ross 1999). The Census
procedures in equation (1) are sometimes included in estimates of the
'error of closure' because it is difficult to get a precise
measure of the effect of collection methodology on estimated
populations.
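As a rough illustration of equation (1), the residual can be backed out
once the other components are known. The following is a minimal Python
sketch under stated assumptions: the function name and every figure are
hypothetical and are not drawn from ABS or ROD data.

```python
def error_of_closure(erp_start, erp_end, births, deaths,
                     net_migration, census_procedure=0.0):
    """Return the residual E_t implied by the balancing equation (1).

    E_t = ERP_{t+1} - (ERP_t + Births_t - Deaths_t
                       + Net Migration_t + Census Procedure_t)
    """
    expected = erp_start + births - deaths + net_migration + census_procedure
    return erp_end - expected


# Hypothetical figures for a sub-population, for illustration only.
residual = error_of_closure(
    erp_start=100_000, erp_end=112_000,
    births=6_000, deaths=1_500, net_migration=500,
)
print(residual)
```

A large positive residual for a sub-population would point to
non-biological growth, such as a change in the propensity to identify,
rather than to births, deaths or migration.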
There are three possible sources of change in responses to ethnic
affiliation, sometimes called Ethnic Mobility (Brown et al. 2010: 46;
also see Westbrooke & Jones 2002):
* unreliability in measurement;
* changes due to alterations in ethnicity questions; and
* conscious changes in ethnic affiliation.
Switching affiliation between ethnic groups can be the result of
changing incentives, both positive and negative (Brown et al. 2010: 46).
Overall, 8 per cent of respondents to New Zealand's Survey of
Family, Income and Employment (SoFIE) changed ethnicity at least once
during three recent waves of that survey between 2002 and 2005 (Carter et al.
2009).
The substantial uncertainty about the size of the Indigenous
population means that trends in related outcomes are particularly
difficult to interpret. This is because changes in an individual's
Indigenous status over time lead to changes in the composition of the
Indigenous population, which is difficult, if not impossible, to take
into account before the population has identified itself at a particular
point in time. Predicting future outcomes for the Indigenous Australian
population is fraught for the same reason: one can never be sure about
the extent of the non-biological growth. The main point is that it is
difficult to evaluate long-running initiatives and changes in a policy
regime because researchers can never be sure that cross-sectional data
measured over time relate to the same group of individuals. That is,
longitudinal data are required in order to make sense of trends in
Indigenous welfare and other outcomes.
The Census counts of Indigenous people, and the related issues of
undercounts and the increasing propensity to identify as Indigenous,
illustrate that many who do not self-identify as Indigenous in one
Census may eventually identify as Indigenous in a later Census.
Two questions arise in this context: 1. who is likely to change
their Indigenous status?; and 2. is the process of increasing propensity
to identify as Indigenous beginning to wane? With respect to the latter
question the non-biological increase in the Indigenous population is
likely never to be finalised because intermarriage between Indigenous
and non-Indigenous people means that the resulting children are likely
to self-identify as Indigenous in some circumstances. There is a high
and increasing rate of intermarriage especially in urban areas.
With respect to question 1, it is worth noting that the level of
non-response to the Indigenous status question in the Censuses is over
twice the size of the number of people who directly identified as
Indigenous. ABS (2007a) reports that 5.7 per cent of Australians did not
respond to the question on Indigenous status in the 2006 Census (that
is, using usual residents counts).
Another general observation that can be made from the ABS (2007a)
results is that the Census questions to which respondents are more
likely not to respond are questions which relate only to part of the
population or for which respondents are uncertain about the appropriate
response (for example, residential status in non-private dwelling,
unpaid domestic work and unpaid assistance to a person with a
disability). The size of the non-response rates for the Indigenous
status question is not unduly large compared to other questions, but the
overall number of non-responses to that question is particularly
significant given the small overall number of people who identified as
Indigenous.
This paper argues that those who do not indicate their Indigenous
status in one year are the most likely part of the population to
indicate they are Indigenous in future. Even if a relatively small
proportion of these 'unknown' respondents change their status
to Indigenous in the future, the Census-based estimates of the
Indigenous population will be measured with considerable error, because
there are so many respondents who are 'unknowns'.
Figure 1 shows the demographic profiles for the Indigenous,
non-Indigenous and unknown-status populations in the Census and for
those with unknown Indigenous status in the court data used in the ROD.
People with unknown Aboriginal and Torres Strait Islander (ATSI) status
in the ROD are actually closer to the Indigenous Census profile than to
that for other Australians. That may be partially driven by the younger
profile of the population involved in the criminal justice system. The
importance of this figure is that it illustrates that it is still worth
asking questions about the unknown categories, and the remainder of this
paper does that. Note that the following analysis refers to both ATSI status
in criminal court data at a particular point in time, and the
consolidated Indigenous status in the ROD. The ROD status indicates
whether a person identified as an Indigenous person at any appearance in
the ROD.
[FIGURE 1 OMITTED]
Another observation that can be made in Figure 1 is that the
demographic profile of the not stated category in the Census is not
substantially different to the self-identified non-Indigenous
population. Given that Indigenous people are such a small minority of
the Australian population this is not unexpected.
Data and Method
Overview of the ROD
The ROD is not a simple data source that has been consistently
collected and collated; it is a compilation of several data sets all
potentially constructed using different criteria. The ROD has evolved
over time in response to data quality issues as they become apparent.
BOCSAR has done an invaluable job in combining the data so that they are
broadly comparable for major socio-demographic characteristics. Data
from the Children's, Local and Higher Courts include a profile of
sex, age, ATSI status and current location (measured at various
levels--including Postcode, Local Government Area and Statistical
Divisions) for each particular court appearance. These data sources also
include information on offence and penalty but such data were not used in
this study since we were primarily interested in a person's
individual characteristics and identity to better estimate the
Indigenous populations. Data from Youth Justice Conferences and
custodial data from the Department of Correctional Services were also
available but were not used for similar reasons.
The courts' ATSI status indicator is sourced from the police
records (so when the matter goes to court, the ATSI status is filled in
from the police file rather than the courts recording it separately).
There is no audit between the court records and police records.
In the remainder of this paper we refer to the 'consolidated
Indigenous identifier' whereas the raw police data will be referred
to as the 'ATSI indicator'. Police data, and hence court data,
on ATSI status vary over time because of the failure of police to record
that information and the difficulty in reconciling names provided at
different points in time. The quality of the police use of this flag
apparently increased after 1995 when the Department began to increase
the emphasis on the gathering of ATSI status (personal communication
from Don Weatherburn, Director of BOCSAR). (2)
The total number of unique individuals appearing each year in the
ROD data increased gradually over the period examined in this paper from
around 90,000 in 1994 to almost 110,000 in 2006.
It is important to understand the structure of the data used in
this study. We did not follow the characteristics of individuals for
every court appearance (at any point in time) in the study period, as
this would have entailed an enormous amount of data that would be
difficult to manage. While BOCSAR had access to all the criminal court
records in the ROD we reduced the size and dimensionality of the data
management exercise by focussing on the characteristics of people for
their first appearance in a court in each year between 1994 and 2006.
Obviously, this means that our analysis did not capture all the
information on the ROD where there were multiple court appearances in a
particular year. However, we would argue that this simplification is
justifiable in that the information used in our study either does not
change or is slow to change over time (demographic characteristics,
Indigenous status and location of residence). By focusing on annual
changes we can capture most of the variation in the data.
Of the 89,945 individuals recorded in our ROD data at some stage in
1994, fewer than 20,000 also appeared in court in 1995. The
number of these individuals in subsequent years declined over time but
was generally over 10,000 people reappearing in any given year (n.b.,
the only exception was 2006 when the number of the original individuals
appearing at least once was 9,579). While the rate of re-offence was
quite high over the period examined, some of the original individuals in
the ROD in 1994 did not appear in later years. Of course, other people
appeared for the first time in the ROD in 1995 and later years. That is,
the ROD data are intrinsically dynamic, and hence it is not easy to
provide a summary overview.
The fluctuations in the percentage of unknowns in ROD provide prima
facie evidence of the data quality issues. Of course, if the percentage
of unknowns in the administrative data is over 50 per cent there is
literally more you do not know about the Indigenous population than you
do know. There is no definitive rule whereby one can establish that
administrative data are reliable, but if you take this time series in
Figure 2 as a starting point, then the level of unknowns decreases
around 1997 or 1998 to around 30 per cent and stays stable at around
20-30 per cent thereafter. This observation accords with anecdotal
evidence of a NSW Police push to use the ATSI indicator more
consistently.
[FIGURE 2 OMITTED]
At the time of writing, a person for whom there is no information
recorded on their ATSI status by the police is treated as non-Indigenous
for the purposes of comparative statistics calculated from the ROD
(Snowball & Weatherburn 2006: 6). The argument of this current paper
is that the unknown category should be interrogated to see if there is
some unused information that would allow more robust and reliable
comparative statistics to be calculated for both Indigenous and
non-Indigenous populations. The simplest way to do this is to randomly
allocate the unknown category between Indigenous and non-Indigenous
groups, using a uniform statistical distribution, on the seemingly
reasonable grounds that we do not know whether such people are more
likely to be one or the other. In other words, random allocation assumes the
same percentage of Indigenous and non-Indigenous persons in both the
known and unknown populations.
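The arithmetic behind this comparison can be sketched in a few lines of
Python. This is an illustrative sketch only: it computes the expected
outcome of random allocation rather than drawing random numbers, and the
function names and counts are hypothetical, not ROD figures.

```python
def share_after_random_allocation(n_indigenous, n_non_indigenous, n_unknown):
    """Expected Indigenous share once unknowns are randomly allocated.

    Assumption: each unknown is Indigenous with probability equal to the
    Indigenous share among people whose status is known.
    """
    known = n_indigenous + n_non_indigenous
    share_known = n_indigenous / known
    expected_indigenous = n_indigenous + share_known * n_unknown
    return expected_indigenous / (known + n_unknown)


def share_unknowns_as_non_indigenous(n_indigenous, n_non_indigenous, n_unknown):
    """Indigenous share when every unknown is treated as non-Indigenous."""
    return n_indigenous / (n_indigenous + n_non_indigenous + n_unknown)


# Hypothetical counts for a single year, for illustration only.
allocated = share_after_random_allocation(100, 900, 1000)
floor = share_unknowns_as_non_indigenous(100, 900, 1000)
wedge_in_points = 100 * (allocated - floor)
print(allocated, floor, wedge_in_points)
```

Under this assumption the allocated share simply equals the share among
the known records, so the wedge between the two estimates grows with the
proportion of unknowns.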
Figure 3 reports the results of this calculation. In 1994, random
allocation produced a 15 percentage point higher estimate of the ATSI
population than that which treated the unknowns as all being
non-Indigenous. The size of this wedge is due to the high levels of
unknowns. Note that the difference in the estimated rate of ATSI is
greatly reduced by 1997 and seems to stabilise at just over two per cent
of the population who appeared in court at some stage during a
particular year. There will always be a wedge between these estimates as
long as there is some non-response to the questions about ATSI status.
Given that we do not have any reason for expecting that the proportion
of ATSI is going to vary much in the short-run, Figure 3 would seem to
add weight to our assertion that the data quality for ATSI status was
not very good before 1997 and one should probably discount results
generated from the ROD for that period.
[FIGURE 3 OMITTED]
Figure 4 compares the consolidated Indigenous identifier, both with
and without the unknowns randomly allocated, against the ATSI indicator
with unknowns randomly allocated. The consolidated identifier usually
implies a higher predicted Indigenous population within the ROD than
would be predicted using the percentage of ATSI estimated after the
random allocation of unknowns. This is understandable since people who
indicate they are ATSI in later (or previous) court appearances, but did
not do so in the current year, are reclassified as Indigenous. However,
once the unknown category is randomly allocated for both the ATSI
indicator and the consolidated Indigenous identifier, there is no
necessary reason why this should be the case. In practice, the only time
when the allocated per cent of ATSI is greater than the per cent with
consolidated Indigenous status is in 1995. It is likely that this was
due to the large number of people
with unknown ATSI status in the court data in that year.
[FIGURE 4 OMITTED]
In addition to providing the best guess of the true Indigenous
population without any sophisticated statistical treatment, Figure 4
confirms that the proportion of people appearing in the ROD with some
Indigenous identity is relatively stable after 1997 or 1998. This is
consistent with the above suggestion that the quality of the ROD data
with respect to Indigenous status is credible and reasonably robust
after 1997.
Geographic Issues
Indigenous Australians are more likely to live in less accessible
areas than other Australians, and hence geographic information may be a
predictor of the true population of Indigenous offenders. Furthermore
many researchers argue that the processes of identification are
fundamentally different in more remote environments (see Ross 1999).
Postcode information is nominally available for the ROD. However, there
are several problems when using it in the context of Indigenous
Australians. The main issue is that there is an insufficient number of
Indigenous Australians in the average Australian postcode for many
statistical purposes (see Hunter 1996). Neighbourhood analyses of
postcode data are sometimes attempted for the total Australian
population, but the lack of credible population estimates for Indigenous
Australians at this geographic level means that such analyses cannot be
conducted for Indigenous communities.
The ABS draws boundaries and estimates the population distribution
for other geographic levels of analysis. Local Government Areas (LGAs),
which can be aggregated from Statistical Local Area (SLA) boundaries
relatively easily, are also revised using the latest Census data and the
changes are recorded in the hierarchical Australian Standard Geographic
Classification (ASGC). The LGA boundaries are relatively large and
stable over time, compared with postcodes, and hence are large enough to
minimise measurement error. (3)
Another way to limit the error introduced by uncertainty over
geographic boundaries is to use a robust indicator that captures broad
spatial differences, such as the ABS's Accessibility/Remoteness
Index of Australia (ARIA). One ARIA-style indicator, the Levels of
Relative Isolation (LORI) index that was used in the Western Australian
Aboriginal Child Health Survey (WAACHS), characterises the accessibility
of local Indigenous communities (Zubrick et al. 2004). Aggregations of
areas classified by LORI also provide a meaningful indication of the
accessibility of services in an area and hence the LORI index values for
SLAs are aggregated to LGAs for use in what follows.
Statistical Model of Unknowns
Figure 1 provides some evidence that individuals in the ROD with
unknown ATSI status may be closer to the Indigenous population in the
Census, at least in terms of their basic demographic age profile.
However, this similarity may be a result of the distinctive nature of
the court data and hence it is necessary to estimate the basic
demographic profiles for those for whom we have direct information
regarding ATSI status. At a minimum one would expect that the
distinctive demographic profiles of males and females also be taken into
account when trying to estimate whether unknown ATSI status profiles are
closer to ATSI or other residents of NSW.
The following analysis uses a simple binary logistic framework to
predict whether an individual with unknown status is ATSI or non-ATSI
using a series of socio-demographic and geographic characteristics for
their first court appearance in a particular year. (4) Logistic
regressions are often used where the dependent variable has two possible
values, zero or one--for example, a person can identify as
having either ATSI or non-ATSI status. To overcome the
fact that this is a limited dependent variable, a logit transformation
is used to ensure that the predicted probabilities lie between zero and
one. The basic formulation of the binomial logistic regression model is
$$\operatorname{logit}(P_i) = \log\left(\frac{P_i}{1 - P_i}\right) = b X_i + e_i$$
where $b$ is a coefficient vector, $X_i$ are the explanatory variables
and $e_i$ are the error terms, which approximate a normal
distribution. See Agresti (1984) and Hosmer and Lemeshow (2000) for
fuller discussions. Logit P, also known as the log odds ratio, is the
dependent variable in the logistic regression. The logistic regression
models are estimated using maximum likelihood estimation techniques.
Often the coefficients of the binomial logistic model are
interpreted using the log odds ratio. Hosmer and Lemeshow (2000) show
that the log odds, or rather the natural log of the odds ratio, equals
the individual coefficient of the respective variables. (5) When the
explanatory variables are also categorical, the coefficients in a
logistic model must be interpreted as relative to a reference person
defined by the omitted categories of the respective groups of
explanatory variables. The reference person, or base case, in the
following analysis is a female, aged 25 to 34 years and living in a
highly accessible SLA. Therefore, if the interest is in the effect of
being male on the probability of being ATSI, then a negative coefficient
implies that males are less likely to be identified as ATSI than females
(that is, the odds ratio is less than one).
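The link between a coefficient, its odds ratio and the implied
probability can be illustrated with a short Python sketch. The base-case
log odds and the coefficient below are hypothetical values chosen for
illustration; they are not estimates from the ROD regressions.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


def probability_from_logit(logit):
    """Invert the logit transformation to recover a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical log odds of being ATSI for the reference person
# (female, aged 25 to 34, highly accessible area).
base_logit = -2.0
# Hypothetical coefficient for males, chosen so the odds ratio is 0.5.
male_coefficient = math.log(0.5)

p_base = probability_from_logit(base_logit)
p_male = probability_from_logit(base_logit + male_coefficient)
print(odds_ratio(male_coefficient), p_base, p_male)
```

A negative coefficient gives an odds ratio below one and a lower
predicted probability than the base case; note that halving the odds
does not halve the probability itself.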
This regression model helped us to determine the differences
between the ATSI and non-ATSI populations in the ROD data given that a
person has some valid ATSI status identified for their records. Hunter and
Ayyar (2009) report the odds ratios, and associated standard errors, for
a basic set of demographic and geographic characteristics. (6) Males are
about half as likely to be identified as ATSI as females. Also, in
general the older a person is the less likely they are to identify as
ATSI, with the youngest age group, 15 to 17 year olds, being around
three to four times as likely to identify as ATSI compared to the base
age group in all years of the analysis. Finally, increasing the level of
remoteness is significantly (and substantially) associated with being
more likely to identify as ATSI.
The next step in the analysis is to provide an
'out-of-sample' prediction of the proportion of ATSI for those
with unknown ATSI status in the respective years. That is, we then ask
whether the people with unknown ATSI status are more like the ATSI or
the non-ATSI populations, and classify the person as such for the
purposes of estimating the true population.
Before reporting the results we reflect briefly on the implications
of the fact that the data used are highly grouped and hence there are
relatively few cells from which to impute ATSI status. As a result the
predicted probabilities for such cells will be classified as either ATSI
or non-ATSI (depending on whether the probability of being ATSI is
greater than 0.5). There is nothing particularly wrong with this
procedure except that the small number of cells means that it will lead
to less reliable estimates than would otherwise be the case.
Consequently, we use an alternative method, whereby we first estimate
the probability of being ATSI for each cell (that is, conditioned on
relevant demographic and geographic characteristics), and then multiply
by the number of those with unknown status in that cell. (7)
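The difference between the two imputation approaches can be seen in a
small Python sketch. The cells, probabilities and counts are hypothetical
and stand in for the predicted probabilities from the regression; the
function names are illustrative only.

```python
def impute_by_threshold(cells, cutoff=0.5):
    """Count unknowns as ATSI only in cells whose probability exceeds cutoff."""
    return sum(n_unknown for p_atsi, n_unknown in cells if p_atsi > cutoff)


def impute_by_expected_count(cells):
    """Sum, over cells, the probability of being ATSI times the unknowns."""
    return sum(p_atsi * n_unknown for p_atsi, n_unknown in cells)


# Each tuple is (predicted probability of ATSI, number with unknown status)
# for one demographic and geographic cell. Values are hypothetical.
cells = [(0.10, 400), (0.30, 200), (0.60, 50)]

print(impute_by_threshold(cells))
print(impute_by_expected_count(cells))
```

With few cells, the threshold rule assigns whole cells to one group and
ignores every cell below the cutoff, whereas the expected-count approach
lets each cell contribute in proportion to its estimated probability.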
Figure 5 reports the levels of reclassification of Indigenous
status using the logistic and random allocation techniques after
restricting the focus solely to those individuals for whom we have
complete information on sex, age and geography. That is, it uses only
the data included in the regression analysis.
[FIGURE 5 OMITTED]
When Figures 3-4 are compared with Figure 5 the estimated per cent
of ATSI is higher in the latter after the random allocation procedure is
performed on the unknown ATSI category. This is mainly because the
latter was confined to those for whom all the relevant demographic and
geographic data were available. For such data the proportion of people
who identified as ATSI increased by around three percentage points,
irrespective of the allocation of the unknown categories.
We surmise from the small size of the difference between the random
allocation and logistic allocation of ATSI status that the diligent
recording of demographic and geographic data by police for the courts is
associated with more accuracy in recording ATSI status. Hence the
auditing of records to improve the overall quality of demographic data
will also improve the reliability of evidence for Indigenous
over-representation in the criminal justice system.
As indicated above, the ROD includes a consolidated Indigenous
identifier that takes into account previous identification patterns in
NSW court data. The above analysis can also be conducted for the
distribution of this indicator. Note that one would expect this
indicator to deliver a higher level of Indigenous identification, with
less likelihood of an undercount. In most circumstances it would be
expected that this measure of Indigenous identification should be closer
to the true Indigenous population within ROD. A priori, however, one
cannot reject the hypothesis that changes in Indigenous status are
occurring in an arbitrary fashion that is unrelated to an
individual's true identity. Figure 5 illustrated that the extent of
reclassification when using regression estimates is only slightly higher
than that done when unknowns are randomly assigned (using a uniform
distribution). Given that there are fewer unknowns to assign when the
consolidated Indigenous identifier is used, it is reasonably certain
that there will not be much difference between the randomly allocated
and regression adjusted estimates of the proportion of Indigenous people
in ROD. Another reason not to estimate the regression adjustments for
the consolidated Indigenous identifier is that these are cross-sectional
regressions and the ROD Indigenous status variable uses information
across time. Therefore any regression adjustments are likely to be
correlated over time, which would certainly induce less reliability into
the resulting estimates and may induce some bias with more recent
offenders having had less time for their consolidated Indigenous
identifier to be updated or 'consolidated'. Notwithstanding
this, as indicated above, the estimates based on the random allocation
of unknowns for the consolidated Indigenous identifier are likely to
provide a close approximation of our best guess for the true Indigenous
population within NSW courts.
The DSE Methodology
The following uses the DSE estimator to validate the above
estimates of the Indigenous offenders appearing in court within the ROD
data. The simplest DSE is a two-sample model. The first
'sample' identifies certain individuals who are returned to
the population of all offenders after the 'survey' is
complete, while the second 'sample' provides an independent
measure of the population. Using the numbers of individuals in both
samples and the numbers identified in just one sample, it is possible to
estimate the number not captured in either sample, thus providing an
estimate of the total population size. The assumptions required for such
an estimate to be valid are that:
1. there is no change to the population during the investigation
(that is, the population is closed);
2. individuals can be matched from one sample to the next;
3. the chance of being in each sample is uncorrelated for each
individual; and
4. The two samples are independent.
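Under these assumptions, the two-sample calculation can be sketched as
follows. This minimal Python example assumes the classical
capture-recapture form of the estimator (total = n1 x n2 / matched),
which may differ in detail from the estimator applied to the ROD; the
function name and counts are hypothetical.

```python
def dual_system_estimate(in_both, only_first, only_second):
    """Two-sample dual system estimate of a closed population.

    Returns (estimated total, estimated number missed by both samples).
    Assumes independent samples and perfect matching of individuals.
    """
    if in_both <= 0:
        raise ValueError("at least one individual must appear in both samples")
    n_first = in_both + only_first
    n_second = in_both + only_second
    total = n_first * n_second / in_both
    missed = only_first * only_second / in_both
    return total, missed


# Hypothetical counts, for illustration only.
total, missed = dual_system_estimate(in_both=80, only_first=20, only_second=40)
print(total, missed)
```

The estimated total equals the three observed groups plus the estimated
number missed by both samples, which is the quantity that cannot be
observed directly.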
Sekar and Deming (1949) were the first to adapt this method for
human populations when they used it to estimate birth and death rates,
and the extent of their registration in 1949, with hospital data from
India. There is also a substantial literature going back to the 1940s,
dealing with the application of the two-sample method to Census data
(Fienberg 1992). By taking another sample in addition to the Census, the
method can be used for estimating undercount by the Census (Hogan 1993).
In terms of the validity of assumptions for estimating the
potential numbers of Indigenous Australians, it was necessary to confine
our attention to closed populations. Even populations with high
mobility, such as people in remote Indigenous communities, may be
considered 'closed' so long as the PES or follow-up survey
takes place shortly after the initial survey or Census (Paradies et al.
2000).
With respect to assumption (2), matching will depend on the quality
of the records and the uniqueness of respondents' names. BOCSAR has
given a detailed assurance that all due care is taken to match the court
data for individuals by spending considerable time and resources
constructing a unique person identifier.
Another of the assumptions required for DSEs to be valid is the
homogeneity of the population (assumption 3 above). That is, all the
Indigenous population should have the same chance of being sampled in
the follow-up survey. This assumption is unlikely to be violated as it
is not a choice for most offenders. Both the police and the court
records report information on ATSI status, although there is obviously room
for doubt about the certainty of the categories, as is evidenced by the
existence of the unknown category for both 'surveys'.
The question of independence is discussed by Sekar and Deming
(1949) in some detail (also see Marks et al. 1974). ROD directly uses
police data on ATSI identification, but subsequent identification is
arguably independent in that the names of offenders are often matched
statistically on ROD because of the systematic difficulties encountered
in ensuring that the record refers to the same person over time.
However, if one does reject the assumption of independence on the
grounds of correlation between 'samples' (that is, police
remember individual offenders and are consistent in their identification
and marking of records), then one should expect that the DSE provides a
small or understated adjustment to the estimate of Indigenous offenders
(because relatively few offenders will change status over time).
Another issue, which can cause problems for the DSE methodology, is
that of coverage. If there are individuals who are not sampled in both
samples, this results in potential upwards bias of the estimates
(Shryock et al. 1976). For Census data and administrative data which in
principle cover the whole population, this source of error should be
relatively small.
The key to the DSE method is an ability to match individual records
on some different criteria (that is, different to the one of immediate
interest), and then check the observation of interest for consistency.
In a two-outcome situation, such as a yes/no question, four potential
outcomes occur, as illustrated in Table 1. First, the record can be
'yes' on both the initial and second surveys, designated by
the cell $x_{11}$. Second, the record can be 'yes' on the
first and 'no' on the second, designated by the cell
$x_{12}$. Third, the record can be 'no' on the first and
'yes' on the second, denoted by cell $x_{21}$, and finally
the record can be 'no' on both surveys, given by $x_{22}$.
This method cannot, of course, pick up information that has been
incorrectly recorded on both surveys (e.g. respondents answering
'yes' on both surveys when the true observation was
'no').
Using the Sekar-Deming (1949) formula, the revised population
estimate is:
$$\hat{N} = x_{11} + x_{12} + x_{21} + \frac{x_{12}\,x_{21}}{x_{11}} \tag{3}$$
If Table 1 refers to the response to a question about Indigenous
status, then only $x_{22}$ people always deny they are Indigenous.
Consequently, Hunter (1998) referred to the potential Indigenous
population as being equal to $x_{11} + x_{12} + x_{21}$. The
consolidated Indigenous identifier on ROD is closely related to this
'potential Indigenous population'. The main difference is that
the ROD estimate is potentially based on repeated appearances in court
and hence takes into account more than two 'surveys'. However,
some of the $x_{22}$ people may also admit to being Indigenous in
other circumstances (that is, if other similar independent surveys were
conducted repeatedly). The fourth term on the right-hand side of equation 3
is the number expected to identify as Indigenous at least once if all
surveys are 'independent' (in statistical terms). (8)
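The arithmetic of equation 3 can be illustrated with a minimal Python sketch. The function name and the cell counts below are hypothetical and chosen purely for illustration; they are not drawn from the ROD data. The sketch also checks the algebraically equivalent form of the estimator, in which the product of the two 'yes' marginal totals is divided by the number of records that are 'yes' on both surveys.

```python
def sekar_deming_estimate(x11, x12, x21):
    """Two-sample estimate of total population size (equation 3).

    x11: records that are 'yes' on both surveys
    x12: records that are 'yes' on the first survey only
    x21: records that are 'yes' on the second survey only
    """
    if x11 <= 0:
        # The fourth term divides by x11, so it must be positive.
        raise ValueError("x11 must be positive for the estimate to be defined")
    potential = x11 + x12 + x21      # the 'potential' population
    never_observed = x12 * x21 / x11  # fourth term of equation 3
    return potential + never_observed


# Hypothetical cell counts, for illustration only.
x11, x12, x21 = 800, 150, 100
n_hat = sekar_deming_estimate(x11, x12, x21)

# Equivalent form: product of the two 'yes' marginals over x11.
n_hat_alt = (x11 + x12) * (x11 + x21) / x11

print(n_hat, n_hat_alt)
```

With these illustrative counts the 'potential' population is 1,050 and the fourth term adds a further 18.75, so the adjustment is small whenever most records agree across the two surveys.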
[FIGURE 6 OMITTED]
Figure 6 reports the DSE of the percentage of people appearing in
the ROD who are Indigenous between 1997 and 2006. The estimates are
remarkably stable and we would argue that this approach provides the
most accurate picture of the true population of Indigenous offenders in
the ROD. It is probably not a coincidence that the DSE estimate based on
any previous appearance is very similar to the estimates based on the
consolidated Indigenous identifier (after random allocation) in 1997.
Prior to that year, the high level of unknowns due in part to poor data
quality of court records, especially with respect to Indigenous status,
led to unreliable estimates for both the consolidated Indigenous
estimates and the DSEs. The DSE is particularly affected in those years
because it is driven by the exceptionally small number of people
specifically identifying as ATSI in the earlier years. (9)
Observant readers will note that there is a gradual decline in the
proportion of people appearing in the ROD estimated to be Indigenous
after the consolidated Indigenous identifier is allocated randomly. We
suspect that this is due to the fact that the consolidated Indigenous
identifier is less likely to assign an Indigenous status if there have
been fewer years over which people could change their status. In a
sense, the data are more affected by right-censoring when individuals
only enter the court system for the first time in the years immediately
leading up to 2006. The DSE seems to be less affected by this distortion
because it is only defined for people who have appeared at least twice
(that is, over several years).
Note that DSE estimates are always higher than the logistic
regression adjusted estimates. One explanation for this is that the
logistic adjustments only use information available in a particular year
(that is, they are cross-sectional in nature) and do not use any of the
information on changes in Indigenous status over time. Another reason
why logistic estimates are lower than the other estimates is that DSE
estimates are only defined for those offenders who go to court rather
than to all offenders identified by police. That is, if Indigenous
offenders are more likely to appear in court than other offenders, then
one should expect the DSE estimates to be higher. Accordingly, it can be
argued that the logistic adjusted estimates provide a conservative
estimate of the Indigenous offenders in the NSW criminal court system.
In summary, the DSE estimates are more stable than the consolidated
Indigenous identifier (after allocation) and the logistic adjusted
estimates for each year, but there is very little difference in 1997.
While the DSE estimates are always higher than the alternative
techniques, the gap between estimates increases over time because of
biases in the other techniques. Notwithstanding, it is not possible to
discount the possibility that DSE over-enumerates the number of
Indigenous offenders in the court system and hence a number of adjusted
estimates can be justified.
Reflections on Knowing Something about the Unknowns
This paper argues that it is important to understand the processes
that determine who is identified as, or rather chooses to identify as,
Indigenous. If nothing else the size of Indigenous involvement in the
criminal justice system may be severely underestimated if no attempt is
made to establish or estimate the true identity of the large number of
people with unknown ATSI status within the criminal justice system. It
is unlikely that any auditing process of administrative data will
entirely remove uncertainty because individuals are likely to have an
incentive to not reveal their Indigenous status given widespread
perceptions of discrimination against Indigenous people. In the presence
of systematic undercounts, statistical procedures like those used in
this paper are likely to be important for estimating the true Indigenous
population of offenders.
One of the main implications of this paper is that Indigenous
disadvantage may be understated if administrative data are not corrected
to account for those who may at some later stage identify as Indigenous
or whose Indigenous status is unrecorded or unknown. The secondary
message is that data quality issues not only decrease the reliability of
resulting estimates for Indigenous and other Australians, but also
result in the potential for systematic biases which could affect
conclusions about the size of Indigenous disadvantage and the ability of
policy makers to 'close the gap'. These observations are
particularly important for the administrative data collections reported
in the Overcoming Indigenous Disadvantage Framework and for the policies
that arise from such statistical reportage (for example, SCRGSP 2007).
Given that the trends in the adjusted estimates of the proportion
of offenders who are Indigenous vary depending on which technique is
used, governments should be particularly cautious about making policy on
the basis of trends in administrative data on Indigenous Australians.
Estimates of trends are more unreliable than is generally appreciated,
as the assumptions underlying such trends are potentially problematic
given the vagaries of the processes of Indigenous identification.
These conclusions are underscored by the significance of geographic
factors in the processes that determine Indigenous status. The
geographic distributions of ATSI status will be systematically biased
with respect to the incidence of unknown status and the incidence of
ATSI (given that ATSI status is 'known') as both are
significantly correlated with accessibility of the local geographic
area. In this case, inferences from the unadjusted ROD data about
relative crime rates for Indigenous and non-Indigenous people will not
be valid.
The reliability of measures of Indigenous disadvantage is further
complicated by the need to estimate local ERPs for the Indigenous and
other populations for use in the calculations of rates of offences in
the respective populations. The calculation of accurate ERPs for the
Indigenous population is itself hotly debated in academic and
administrative areas (Taylor 1997), but the failure to use ERPs for the
local Indigenous population will result in distorted pictures of
Indigenous involvement in the criminal justice system. Given that the
Indigenous ERPs usually experience higher undercounts than is evident
for general ERPs, Indigenous rates are highly likely to be overstated by
more than they are for the rest of the population (ABS 2007b). While
some might argue that one should not worry too much at an aggregate
level as Indigenous disadvantage is such a manifest problem, such
distortions may disproportionately affect certain regions, and hence
administrative data need to be as accurate and reliable as possible.
However, a more accurate estimate of Indigenous offender populations
could be achieved by alternative methods, including the allocation of
unknowns or by using a DSE methodology. Such estimates should be
combined with local ERP estimates for Indigenous and other Australians
to ensure that policy to address relative offence rates is only based on
valid empirical evidence.
When information on Indigenous undercount in court data and ERPs
is taken into account, the overrepresentation of Indigenous offenders
in the NSW criminal justice system more than doubles. The Indigenous
rates of offence increase from 119 to 243 offenders in every 1,000
Indigenous residents (measured by the relevant ERPs for NSW). The
non-Indigenous rates of offence do not change substantially as the
Indigenous population is still an ethnic minority. In summary, the
over-representation of Indigenous people in the criminal justice system
increases from 5.1 (according to the methodology historically used) to
11.5 (according to the DSE methodology)--that is, Indigenous people are
almost 11.5 times more likely to be an offender than non-Indigenous
people in the NSW court data. More importantly, the magnitude of
Indigenous disadvantage in justice outcomes is clearly larger than has
historically been appreciated.
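The relationship between the rates and the over-representation ratios quoted above can be checked with a short Python sketch. The non-Indigenous rates are not reported in this section, so the values computed below are back-calculated from the rounded figures in the text and should be read as approximate, implied quantities rather than published estimates. The function names are hypothetical.

```python
def overrepresentation_ratio(indigenous_rate, non_indigenous_rate):
    """Ratio of offender rates (both expressed per 1,000 residents)."""
    return indigenous_rate / non_indigenous_rate


def implied_comparison_rate(indigenous_rate, ratio):
    """Back out the non-Indigenous rate implied by a rate and a ratio."""
    return indigenous_rate / ratio


# Figures quoted in the text (offenders per 1,000 Indigenous residents).
unadjusted_rate, adjusted_rate = 119, 243
unadjusted_ratio, adjusted_ratio = 5.1, 11.5

# Implied non-Indigenous rates (an assumption derived from rounded inputs).
implied_unadjusted = implied_comparison_rate(unadjusted_rate, unadjusted_ratio)
implied_adjusted = implied_comparison_rate(adjusted_rate, adjusted_ratio)

print(round(implied_unadjusted, 1), round(implied_adjusted, 1))
print(round(adjusted_rate / unadjusted_rate, 2))  # growth in the Indigenous rate
```

The implied non-Indigenous rates come out at roughly 23 and 21 per 1,000, which is consistent with the statement that the non-Indigenous rate does not change substantially while the Indigenous rate roughly doubles.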
There are other methods for estimating offender populations which
could be considered (Collins & Wilson 1990). However, such methods
are only valid for estimating the unobserved Indigenous and
non-Indigenous rates by estimating the Indigenous and other Australians
who do not appear in the court or are identified in police records
(using count data models). While such methods are invaluable for
estimating consistent offender rates in the relevant populations, and
obviate the need to generate consistent and comparable ERPs, the above
analysis is justified solely as an exercise in validating the quality of
the ROD Indigenous identifier, given that a person has been observed in
the administrative data source (for example, the court system).
Much humorous comment has been made about Donald Rumsfeld's
observation that it is intrinsically difficult to 'know the
unknown' (Hunter & Ayyar 2009), but the cost of not attempting
to understand the consequences of the category 'unknown' is
likely to be particularly high given the potential to misallocate
resources when attempting to design effective Indigenous policy.
References
ABS (Australian Bureau of Statistics) (2007a) Non-response Rates,
AUST 2006 Usual Residence and Place of Enumeration, Cat. No.
2914.0.55.001.
ABS (Australian Bureau of Statistics) (2007b) Population
Distribution, Aboriginal and Torres Strait Islander Australians 2006,
Cat. No. 4705.0.
Agresti, A. (1984) Analysis of Ordinal Categorical Data, New York,
Wiley.
Altman, J.C., Biddle, N. & Hunter, B.H. (2008) 'Prospects
for "closing the gap" in socioeconomic outcomes for Indigenous
Australians?', Australian Economic History Review, 49 (3), 225-51.
Brown, P., Callister, P., Carter, K. & Engler, R. (2010)
'Ethnic mobility: Is it important for research and policy
analysis?', Policy Quarterly, 6 (3), 45-51.
Carter, K.N., Hayward, M., Blakely, T. & Shaw, C. (2009)
'How much and for whom does self-ethnicity change over time in New
Zealand? Results from a longitudinal study', Social Policy Journal
of New Zealand, 36, 32-45.
Collins, M.E. & Wilson, R.M. (1990) 'Automobile theft:
Estimating the size of the criminal population', Journal of
Quantitative Criminology, 6 (4), 395-409.
Fienberg, S.E. (1992) 'Bibliography on capture-recapture
modeling with application to Census undercount adjustment', Survey
Methodology, 18 (1), 143-54.
Greene, W.H. (2000) Econometric Analysis (4th edn), New Jersey,
Prentice Hall.
Guimond, E. (1999) Ethnic Mobility and the Demographic Growth of
Canada's Aboriginal Populations from 1986 to 1996, Ottawa, Cat. No.
91-209-XPE, Statistics Canada.
Hogan, H. (1993) 'The 1990 post-enumeration survey: Operations
and results', Journal of the American Statistical Association, 88,
1047-60.
Hosmer, D. & Lemeshow, S. (2000) Applied Logistic Regression
(2nd edn), New York, John Wiley & Sons.
Hunter, B.H. (1996) Indigenous Australians and the Socioeconomic
Status of Urban Neighbourhoods, CAEPR Discussion Paper No. 106,
Canberra, CAEPR, ANU.
Hunter, B.H. (1998) Assessing the Utility of 1996 Census Data on
Indigenous Australians, CAEPR Discussion Paper No. 154, Canberra, CAEPR,
ANU.
Hunter, B.H. & Ayyar, A. (2009) Some Reflections on the Quality
of Administrative Data for Indigenous Australians: The Importance of
Knowing Something about the Unknown(s), Canberra, CAEPR, ANU
http://www.anu.edu.au/caepr/.
Hunter, B.H. & Dungey, M.H. (2006) 'Creating a sense of
"CLOSURE": Providing confidence intervals on some recent
estimates of indigenous populations', Canadian Studies in
Population, 33 (1), 1-23.
Marks, E.S., Seltzer, W. & Krotki, K.J. (1974) Population
Growth Estimates: A Handbook of Vital Statistics Measurement, New York,
The Population Council.
Paradies, Y., Huppatz, S., Warnsey, J. & Barnes, T. (2000)
'Population and globalisation: Australia in the 21st century'.
Paper delivered at the 10th Biennial Conference of the Australian
Population Association, Melbourne.
Ross, K. (1999) Occasional Paper: Population Issues, Indigenous
Australians, Cat. No. 4708.0. Canberra, Australian Bureau of Statistics.
Sekar, C., & Deming, E.W. (1949) 'On a method of
estimating birth and death rates and extent of registration',
Journal of the American Statistical Association, 44 (1), 101-15.
Shryock, H.S., Siegel, J.S., & Associates (1976) The Methods
and Materials of Demography, London, Academic Press.
Snowball, L. & Weatherburn, D. (2006) 'Indigenous
over-representation in prison: The role of offender
characteristics', Crime and Justice Bulletin, 99 (September), 1-20.
SCRGSP (Steering Committee for the Review of Government Service
Provision) (2003) Overcoming Indigenous Disadvantage: Key Indicators
2003 Report, Melbourne, Productivity Commission.
SCRGSP (Steering Committee for the Review of Government Service
Provision) (2005) Overcoming Indigenous Disadvantage: Key Indicators
2005 Report, Melbourne, Productivity Commission.
SCRGSP (Steering Committee for the Review of Government Service
Provision) (2007) Overcoming Indigenous Disadvantage: Key Indicators
2007 Report, Melbourne, Productivity Commission.
Taylor, J. (1997) 'The contemporary demography of Indigenous
Australians', Journal of the Australian Population Association, 14
(1), 77-114.
Westbrooke, I. & Jones, L. (2002) 'Imputation of Maori
Descent for Electoral Calculations in New Zealand', Australian and
New Zealand Journal of Statistics, 44 (3), 257-65.
Zubrick, S.R., Lawrence, D.M., Silburn, S.R., Blair, E., Milroy,
H., Wilkes, T. et al. (2004) The Western Australian Aboriginal Child
Health Survey: The Health of Aboriginal Children & Young People,
Perth, Telethon Institute for Child Health Research.
Endnotes
(1.) The Australian PES is an interviewer-based survey conducted
three weeks after Census night which allows comparison of the responses
in the Census and the PES to identify whether they have changed. A
matched sample of those who responded to both the Census and the PES is
used. It is also possible that the PES may pick up some population
uncounted in the Census, as both samples are drawn from the population
as a whole. Information is collected to determine whether persons have
been missed or double counted in the Census and whether dwellings were
missed. The PES collects personal information on Indigenous origin, age,
sex, marital status and birthplace. Note that there are several
differences between the Census and PES collections. For example, the
Census question on Indigenous status is based on self-identification
whereas the PES involves an interviewer. In addition there were slight
differences in the wording of the question. More importantly, the PES
question is asked of the entire household whereas the Census is asked of
each person individually.
(2.) Notwithstanding, from July 2003, police stopped asking ATSI
status for alleged offenders in traffic offences (as opposed to serious
driving matters). This change in procedure and the resultant increase in
unknown ATSI status for a group that comprised a significant proportion
of all court data was the reason why BOCSAR implemented the "ever
identified ATSI" variable in ROD.
(3.) In contrast to postal areas, LGA boundaries are relatively
slow to change over time. LGA-level ROD data are estimated by BOCSAR.
(4.) It could be argued that the analysis of the unknown category
in ROD should investigate all the information available on the unknown
category that is associated with Indigenous status. For example, the
incidence of being recorded as unknown Indigenous status seems highly
associated with offence-type and the propensity to offend. The inclusion
of such factors may ultimately enhance the specification, but it is also
possible that the processes for recording Indigenous status are
correlated to the processes for recording offence type (especially
driving offences) and the statistical processes that drive the propensity
to offend. In econometrics this would raise the possibility of
simultaneous equation bias (Greene 2000: 710). In an attempt to avoid
such issues, this paper uses an extremely parsimonious
specification that should be accurate on average. A more sophisticated
analysis might be able to discount the possibility of simultaneity bias
and hence could confidently use additional information on offence type
and propensity to offend.
(5.) See Hosmer and Lemeshow (2000) for details of the
interpretation of these ratios.
(6.) Concordance statistics were estimated to provide an indication
of the adequacy of the models for prediction in the respective years
(available from the authors on request). The statistic gives the percent of all
possible pairs of cases in which the model assigns a higher probability
to a correct case than to an incorrect case. Hosmer and Lemeshow (2000:
162) provide guidelines for the concordance statistic, which indicate that
any statistic over the value of 0.7 is evidence that the model is
adequate. All reported results are based on adequate models according to
this criterion.
(7.) While this should provide a reliable and robust estimate, it
is rather difficult to estimate the standard error for the overall
estimates. This does not matter excessively since this exercise is
designed to illustrate the potential importance of the issue. However,
given the large number of unknowns and the associated low standard
errors for the estimates, it is anticipated that this estimator provides
a highly accurate estimate of the number of unknowns re-classified as
ATSI.
(8.) The variance of N can be estimated using the standard binomial
approach (see Sekar & Deming 1949).
(9.) In the context of this DSE, the process of predicting whether
unknowns could be assigned to ATSI or non-ATSI is more complex than the
cross-sectional DSE estimates reported above. Assigning unknowns in the
DSE estimates requires a more sophisticated technique and hence is left
for another paper.
Table 1: A two-outcome example of DSE methodology
                                  Response A
Response B   Yes                 No                  Total
Yes          $x_{11}$            $x_{12}$            $x_{11} + x_{12}$
No           $x_{21}$            $x_{22}$            $x_{21} + x_{22}$
Total        $x_{11} + x_{21}$   $x_{12} + x_{22}$   $x_{11} + x_{12} + x_{21} + x_{22}$