Sources of error in each phase

This section illustrates the possible sources of error the framework can identify.

Li-Chun Zhang's 2012 framework built on earlier work by Groves and Bakker (2004), which focused more on survey-only collections. Figures 1 and 2 are based on the steps that connect the abstract, ideal measurements (or objects) to the final data actually obtained. To apply the error framework, we need to clearly define the boxed terms. Once they are defined, the issues that arise in each step are listed and categorised into the error types (the ovals pointing to each transition).

Phase 1 errors

Phase 1 applies to a single dataset in isolation. For a complex statistical output from many different datasets, carry out the phase 1 evaluation separately for each source dataset. The framework can be used for both administrative and survey data. Once we have a comprehensive list of phase 1 errors, we determine the effect of these errors on the final output we want to produce using the phase 2 framework.

Figure 1
Phase 1 error framework showing the different types of error that can arise

In phase 1, the target concept and target set are defined by the dataset’s original purpose, whether it is a stand-alone sample survey or a transactions database held by a retail store. For a traditional sample survey designed to produce a statistical output, phase 1 errors mean the final outputs are not perfect estimates of the true population values – there is always some uncertainty due to sampling error, non-response, imputation, and other issues. For an administrative source, we focus on evaluating how well the source meets the purpose intended by the collecting business or agency.

Below we explain the terms used in figure 1.

Measurement (variables) terms

The measurement side describes the path from an abstract target concept to a final edited value for a concretely defined variable.

Validity error

Measurement begins with the target concept, or ‘the ideal information that is sought about an object’. To obtain this information, we must define a variable or measure that can be observed in practice. Validity error indicates misalignment between the ideal target information and the operational target measure used to collect it. Typically, administrative variables are collected for a definite purpose and defined in a very concrete way. The error arising in this step refers to the translation from an abstract target concept to a concrete target measure; it doesn’t include errors such as misunderstanding terms on a form.

Measurement error

Once the target measure is defined, we collect actual data values. The values for specific units are the obtained measures. When the data is obtained from people responding to a survey or filling out a form for a government agency, many errors can occur. People may misremember details or interpret questions differently from what was intended.

Note: for some administrative sources the objects that data is being collected about may not be ‘respondents’ in a traditional sense.

An example: a retail chain might record the values and times of all transactions made in its stores. In this case the object is a transaction whose value is recorded automatically, but measurement error could still occur. For example, a fault in the reporting system that delays processing of a week’s transactions and records them all as occurring on the day the system was fixed would be a measurement error.

Processing error

The edited measure is the final value recorded in the administrative or survey dataset, after any processing, validation, or other checks. These checks might correct errors in the values originally obtained, but they can also introduce new errors – for example, dividing a survey response by 10 because the respondent appears to have made a magnitude error, when in fact the original response was correct.
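
As a toy illustration only (not an actual Stats NZ editing rule – the benchmark comparison and the 8–12 ratio window are invented), the Python sketch below shows how such a correction rule can fix genuine slips while also corrupting responses that were correct to begin with.

  def edit_magnitude(reported, expected):
      """Divide a reported value by 10 if it looks roughly 10 times an expected benchmark."""
      if expected > 0 and 8 <= reported / expected <= 12:
          return reported / 10   # assumed magnitude slip by the respondent
      return reported            # otherwise keep the reported value

  print(edit_magnitude(5200, 500))    # 520.0 - an apparent extra zero is 'corrected'
  print(edit_magnitude(5200, 5000))   # 5200  - plausible value left unchanged
  print(edit_magnitude(4800, 480))    # 480.0 - a genuinely large, correct response is wrongly edited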

Representation (objects)

The representation side of the flowchart deals with defining and creating ‘objects’ – the basic elements of the population being measured.

Frame error

The target set is similar to the target concept – it is the set of all objects the data producer would ideally have data on. An important distinction between ‘objects’ in this context and the usual statistical concept of ‘units’ is that in some administrative datasets the base records could describe individual events (eg transactions with customers). Statistically, we may want to create a list by customer that links many transaction events into one statistical unit, but the administrative agency may only care about the events themselves. To avoid confusion, we say that the final dataset after all phase 1 transformations is organised into ‘objects’ rather than ‘units’.

Frame error refers to the difference between the ideal target set of objects and the accessible set, the set from which we can take measurements in theory. These concepts are clarified under ‘Selection error’ below.

Selection error

Many collections have objects in the accessible set that don’t end up in the data. For instance, our accessible set could be all people eligible to vote, but the accessed set – the set we actually obtain information about – includes only people who have registered on the electoral roll. The missing, unregistered people are a source of selection error.

The distinction between frame error and selection error can be confusing, especially when the collection is designed with restrictions already in mind.

An example: crime statistics. If the target set is all crimes committed, and the accessed set is all crimes reported in the Police database, then it is probably best to treat unreported crimes as selection errors, and ‘unreportable’ crimes (crimes that could never be reported even in theory – if there are any) as the frame error.

Another example: a retail chain wants to produce statistics on the transactions across all its stores, but its system can only record purchases made using electronic cards. Cash transactions could be said to be ‘inaccessible’ since they will never be in the database – they cause a frame error. However, if a store manager forgets to run the reporting tool for a week, the transactions missing from the dataset due to that mistake will be selection errors: they were accessible, but were not accessed and do not appear in the dataset.

Missing/redundancy error

The observed set comprises the objects in the final, verified dataset. Most checks an agency does are likely to remove objects that should not have been in the selected set to begin with (eg someone under 18 trying to enrol to vote); these are selection errors. Errors where the agency’s own processing mistakenly rejects or duplicates objects are fairly rare, but this category exists so that we can keep them distinct from reporting-type errors.

Phase 2 errors

Phase 2 of the error framework covers errors arising when existing data is used to produce an output that meets a certain statistical purpose. Often this involves combining different datasets for different parts of the population, or integrating several datasets together. However, phase 2 can also be valuable when a single administrative dataset is used to produce an output on its own – the process allows us to distinguish between quality problems in the original data and errors resulting from trying to make the data measure something it wasn't intended to. In phase 2, the reference points are the statistical population we would ideally access, and the statistical concepts we want to measure for the units in that population.

Figure 2
Phase 2 error framework showing the different types of error that can arise

Measurement in phase 2 is concerned with how to reconcile variables from each source dataset, which may differ from the target concept or from each other. Representation is about creating a set of statistical units from the objects in the original datasets.

Figure 2 indicates possible sources of error in phase 2. Note that errors arising in phase 1 can also propagate through to the final data, and that the flow through the framework is not necessarily tied to specific or sequential steps in a statistical production process. We need to carefully consider the effect phase 1 errors have on the final data, which depends on the intended statistical purpose.

Below we explain the terms used in figure 2.

Measurement (variables)

Relevance error

The target concept in phase 2 is similar to that in phase 1 (the ideal information sought about the statistical units). The harmonised measures are the practical measures decided on in designing the statistical output, such as a survey question aligned with a standard classification. In some cases they could be the same measures as in one of the datasets to be combined, but the harmonised measures may also be a standardised statistical measure that does not align perfectly with variables in the original datasets.

As in phase 1, relevance errors are entirely conceptual, and don’t arise from actual data or values. Harmonisation can be thought of as “consist[ing] for the greater part of the formulation of decision rules, in which the measurement of a concept is determined as precisely as possible, given the existing information in the data sources” (Bakker, 2010).

Mapping error

We transform measures in the source datasets into harmonised variable values. The values we assign in this process are called reclassified measures. Practical difficulties encountered in this stage lead to mapping errors.

An example: our building consents data, where the ‘job description’ field of the consent must be assigned to a specific code in the building type classification. The job description the builder enters is free text and may be ambiguous or unclear – the resulting reclassified measure may not be the correct one.
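
As a rough sketch of such a decision rule (the keywords and codes below are invented for illustration – they are not the actual building type classification), a simple keyword match shows how ambiguous free text can end up with the wrong code:

  RULES = [
      ("garage", "OUTBUILDING"),
      ("sleepout", "OUTBUILDING"),
      ("dwelling", "RESIDENTIAL"),
      ("house", "RESIDENTIAL"),
      ("warehouse", "COMMERCIAL"),
  ]

  def classify(job_description):
      """Return the code of the first keyword found in the free-text job description."""
      text = job_description.lower()
      for keyword, code in RULES:
          if keyword in text:
              return code
      return "UNKNOWN"   # no rule fired - would need manual coding

  print(classify("New three-bedroom dwelling"))   # RESIDENTIAL
  print(classify("Warehouse extension"))          # RESIDENTIAL - 'house' matches inside 'warehouse': a mapping error
  print(classify("Alterations to shed"))          # UNKNOWN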

Following from Bakker’s description of harmonisation above, mapping errors may result from the decision rules chosen, which won’t work perfectly in every case. Mapping error also includes ‘modelling errors’.

See ‘Understanding errors arising from modelling’ (below) for more on this important source of error.

Comparability error

Regardless of how the reclassified measures are derived, we may need editing and imputation to obtain consistent outputs. The final values after these processes are our adjusted measures. In addition to the usual imputation for units with missing variables in the source datasets, we may need extra checks to reconcile values that are correct in each individual dataset but disagree with each other for the output measure.

An example: someone loses their job and applies for a benefit just before their employer refiles their employee tax returns. If we link the benefit data with the tax data, the person could be recorded in both, since they were paid taxable income but also registered as unemployed in the reference period. Both datasets are individually correct, but we would need to resolve the inconsistency for our final data (eg by looking at application and filing dates).
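
A minimal sketch of such a reconciliation rule, assuming we can see the benefit application date and the last date the person was paid (the function, field names, and dates are invented for illustration):

  from datetime import date

  def reconcile_status(benefit_application_date, last_paid_date):
      """Pick one status for the output: unemployed if the benefit application follows the last pay date."""
      if benefit_application_date >= last_paid_date:
          return "unemployed"
      return "employed"

  # Employer's refiled return shows pay up to 20 March; the benefit application is dated 25 March.
  print(reconcile_status(date(2023, 3, 25), date(2023, 3, 20)))   # unemployed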

Representation

Representation in phase 2 deals with creating a list of statistical units to include in the output data, based on the source data’s objects. This is where the object/unit distinction is most important – the individual datasets may be based on transactions or events that we need to connect and then place into newly created statistical units relating to customers, stores, or other entities of interest in the statistical target population.

An example: someone whose hiring was recorded on a register of jobs, but whose dismissal from the job was not recorded. If the jobs register is based on events, failing to record the dismissal is a selection error in the jobs register. In phase 2, if we define a harmonised employment measure that classifies people as employed if they were hired and not dismissed, then we misclassify this person’s employment status – this is a mapping error. The distinction between phase 1 and phase 2 allows us to understand complex situations such as this.

Coverage error

The target population is fairly familiar from survey statistics – it is the ‘set of statistical units that the statistics should cover’. The linked sets are the units that are connected across the relevant datasets. Note that these units will not necessarily be the final statistical units of the output. For instance, the target population might be households, but the linked sets could be individuals we link using an address variable across different administrative datasets.

Coverage errors are the differences between the units actually linked in practice and the full set of units we include in the (ideal) target population. These errors arise in several ways. For instance, the datasets themselves may not cover the whole target population, or linking errors may mean we don’t identify some members of the linked sets. Measurement errors in the source data can also cause coverage errors.

An example: if the date-of-birth variable on an administrative dataset is not of good quality and we filter on age to select our population, we could end up with undercoverage even though the units aren't missing from the source data.

Working through an example: imagine we want to build a dataset that includes qualifications and income for every person living in New Zealand, to study how these variables are related. Our source datasets are individual Inland Revenue tax records and university enrolment data. The target population would be all people in New Zealand. If we try to link people based on name, date of birth, and address across these two datasets, coverage errors could occur, including:

  • out-of-date addresses, spelling mistakes in names, or other errors, so we can't link a person's Inland Revenue record with any enrolment data (missed links)
  • people who studied overseas, so have a qualification but don't appear in the New Zealand enrolment data (undercoverage)
  • people who are linked in the two sets but have moved overseas, so are not actually part of the target population (overcoverage).
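
A toy exact-match linkage (all records and field values below are invented) shows how the first of these, a missed link, can arise from nothing more than an out-of-date address:

  tax_records = [
      {"name": "Jane Smith", "dob": "1990-05-01", "address": "12 Aroha St"},
      {"name": "Tom Lee",    "dob": "1985-11-23", "address": "4 Kauri Rd"},
  ]
  enrolment_records = [
      {"name": "Jane Smith", "dob": "1990-05-01", "address": "7 Rata Ave"},   # has moved house
      {"name": "Tom Lee",    "dob": "1985-11-23", "address": "4 Kauri Rd"},
  ]

  def key(record):
      return (record["name"], record["dob"], record["address"])

  enrolment_keys = {key(r) for r in enrolment_records}
  linked = [r for r in tax_records if key(r) in enrolment_keys]
  missed = [r for r in tax_records if key(r) not in enrolment_keys]

  print([r["name"] for r in linked])   # ['Tom Lee']
  print([r["name"] for r in missed])   # ['Jane Smith'] - a missed link caused by the out-of-date address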

Identification error

Depending on the type of units in the linked sets, we may want to create ‘composite units’, which are made up of one or more ‘base units’.

An example: the Quarterly Building Activity Survey, where our target units are construction jobs, but we receive data on individual consent approvals. Usually one consent corresponds to one building job, but some complex jobs file separate consents for different stages of the job. Conversely, some consents can be for two buildings of different types, which we would like to have separate statistical units for. We can consider the aligned sets as a table that records the consents relating to each construction job. Failure to recognise a consent as the next stage of a job already in progress, and not recording it as related to the previous consent in the job, would be an identification error.
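
A sketch of this ‘aligned sets’ table, using invented consent and job identifiers, might look like the following; treating consent C-103 as a new job, instead of as the next stage of job J-1, is the identification error described above.

  # Correct alignment: job J-1 was filed as two stage consents.
  aligned_sets = {
      "J-1": ["C-101", "C-103"],
      "J-2": ["C-102"],
  }

  # Identification error: the stage consent is not recognised and starts a new job.
  aligned_sets_with_error = {
      "J-1": ["C-101"],
      "J-2": ["C-102"],
      "J-3": ["C-103"],
  }

  print(len(aligned_sets), len(aligned_sets_with_error))   # 2 jobs versus 3 jobs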

In more complex cases, different datasets may conflict and we must decide how to resolve this.

An example: we have person-level data linked by a common identifier across several datasets, and want to form groups of people living at the same address. If the different datasets contain different addresses for the same person we may make identification errors when we are forced to decide on a single address for each person.

Unit error

The final statistical units in the output dataset could be created from scratch, without a direct correspondence to any of the units in the source datasets. In the example above about addresses, we may create a dwelling unit that consists of all the people living at each unique address. The conceptual difference between linking errors and unit errors is that we are not just connecting people to a known list of addresses – we are simultaneously determining which addresses should actually be given a dwelling unit and which people should be connected to each dwelling unit.
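
A minimal sketch of this unit-creation step (the person identifiers and addresses are invented): each unique address becomes a dwelling unit and linked people are attached to it, so choosing the wrong address for a person changes both the set of dwellings created and who belongs to them.

  from collections import defaultdict

  people = [
      {"person_id": 1, "address": "12 Aroha St"},
      {"person_id": 2, "address": "12 Aroha St"},
      {"person_id": 3, "address": "4 Kauri Rd"},   # another source records '7 Rata Ave' for this person
  ]

  dwellings = defaultdict(list)
  for person in people:
      dwellings[person["address"]].append(person["person_id"])

  print(dict(dwellings))   # {'12 Aroha St': [1, 2], '4 Kauri Rd': [3]}
  # If '7 Rata Ave' was in fact correct, one dwelling is missing and person 3 sits in the wrong one - a unit error.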

Understanding errors arising from modelling

The variables in administrative datasets typically differ from the ideal data we would like to use to measure our statistical target concepts. In his error framework, Li-Chun Zhang gives examples that involve reclassifying the raw values of an administrative variable, such as a free-text ‘job title’ field, into an official statistics occupation classification. Any errors arising from this process are mapping errors.

A conceptually similar, but often more complex situation arises when we want to estimate a numerical target variable from one or more administrative variables that don’t precisely capture the information we really want.

An example: using GST returns from businesses to estimate the sales and purchase variables as defined on our subannual business surveys. One way we do this is to calculate the ratio of survey sales to GST sales for the larger units we survey, and use this ratio to estimate sales for small, non-surveyed units (for which we only have GST data).
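
A minimal sketch of this ratio calculation, with invented figures (the real method involves more detail than this, but the arithmetic is the same idea):

  # Larger, surveyed units: both survey-defined sales and GST sales are observed.
  surveyed = [
      {"survey_sales": 1200.0, "gst_sales": 1000.0},
      {"survey_sales": 2300.0, "gst_sales": 2000.0},
  ]
  small_unit_gst_sales = 150.0   # small, non-surveyed unit: GST data only

  ratio = sum(u["survey_sales"] for u in surveyed) / sum(u["gst_sales"] for u in surveyed)
  estimated_sales = ratio * small_unit_gst_sales

  print(round(ratio, 3), round(estimated_sales, 1))   # 1.167 175.0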

Many sources of error arise from this kind of modelling. Because they generally occur in the step from harmonised measures to reclassified measures in phase 2 of the framework, they come under mapping errors.

Measuring and minimising these errors is crucial to deciding how to make more use of administrative data, and how much survey data we might still require in an ‘administrative data first’ design.

For example, we need to answer questions such as “for which units does the model perform poorly?”, “how stable are the model parameters and how can we monitor them over time?”, and “how large is the uncertainty in our modelled estimates?”

To help understand modelling errors, we consider two types of error that can arise when we use a statistical model to estimate a target variable:

  • Model structure error – the model specification chosen may not capture the real relationship between the variables. For example, we might use a simple linear model to predict one variable from another, when in reality the relationship between these variables is non-linear. Common techniques for assessing this type of error include goodness-of-fit tests and residual plots.
  • Parameter uncertainty – when we estimate the values of the parameters in a model, there is always some uncertainty. We need to measure parameter uncertainty and propagate it through to the final results that rely on the model. Techniques such as bootstrapping or Bayesian estimation are often used.
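
As a sketch of the second point, the toy example below bootstraps the survey-to-GST ratio from a handful of invented unit records to get a rough interval for the estimated parameter; it illustrates the idea of propagating parameter uncertainty rather than a production method.

  import random

  # Invented survey and GST sales for a few surveyed units.
  survey_sales = [1200.0, 2300.0, 900.0, 1800.0, 1500.0]
  gst_sales    = [1000.0, 2000.0, 800.0, 1500.0, 1400.0]

  def ratio(indices):
      return sum(survey_sales[i] for i in indices) / sum(gst_sales[i] for i in indices)

  random.seed(1)
  n = len(survey_sales)
  boot = sorted(ratio([random.randrange(n) for _ in range(n)]) for _ in range(1000))

  print(round(boot[25], 3), round(boot[974], 3))   # rough 95% interval for the estimated ratio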

We also consider whether an overall model uncertainty can be determined. If we have more than one possible model, we might combine the results of the different models to provide an overall measure of uncertainty. Bayesian model averaging is one way of doing this.
