Appendix 1: Glossary

Here are clear definitions and explanations for the technical terms used in this guide.

 Term  Explanation
 Accessed set  The set of objects for which measurements are obtained in practice. For example, the electoral roll doesn’t include people who fail to enrol despite being legally entitled, or whose forms get lost in the mail.
 Accessibility  Statistics are presented in a clear and understandable manner and are widely disseminated. One of the six quality dimensions.
 Accessible set  The set of objects from which measurements can be taken in theory.
 Accuracy  Source data and statistical techniques are sound and statistical outputs sufficiently portray the reality they are designed to represent. One of the six quality dimensions.
 Admin(istrative) data  Admin data is all data collected by government agencies or private organisations in conducting their business or services. Such data is not collected primarily for statistical purposes. Rather, it is collected or captured for operations such as delivering a service, registering members, events, or activities, or as legally required records.
 Aligned sets  The groups of base units that are determined (after linking and other processing) to belong to each composite unit in a final output dataset. For instance, we might create household units based on dwelling units and person units – the aligned sets could be represented by a table containing all these relationships (eg household 1 consists of dwelling A and persons X, Y, Z; household 2 consists of dwelling B and person W).
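The household example above can be held as a simple relationship table linking each composite unit to its base units. A minimal sketch in Python; the unit identifiers and the `members` helper are hypothetical, not from any Stats NZ system:

```python
# Aligned sets: which base units (dwellings, persons) belong to each
# composite household unit, held as one relationship table.
aligned_sets = [
    # (household_id, unit_type, unit_id)
    (1, "dwelling", "A"),
    (1, "person", "X"),
    (1, "person", "Y"),
    (1, "person", "Z"),
    (2, "dwelling", "B"),
    (2, "person", "W"),
]

def members(household_id, unit_type=None):
    """Return the base units aligned to one composite household,
    optionally restricted to one unit type."""
    return [u for h, t, u in aligned_sets
            if h == household_id and (unit_type is None or t == unit_type)]

print(members(1, "person"))  # the persons aligned to household 1
```

In practice such a table would sit alongside the linked datasets, so composite units can be rebuilt or revised without re-linking the base units.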
 Base dataset  Where data integration is carried out by linking one or more datasets to a single large dataset, we call this central dataset the base dataset.
 Base units  The lowest-level units created after linking within and across datasets. These units often represent individual people, businesses, or dwellings.
 Comparability error  An error arising from editing and other treatment methods applied to values obtained from reclassified measures – to correct for missing values, inconsistencies, or invalid values.
 Composite unit  A unit made up of one or more base units. These are not necessarily the final statistical units in the output: intermediate composite units may be created and further combined or arranged into the final statistical units.
 Consistency  Statistics are consistent and coherent within the dataset, over time, and with other major datasets. One of the six quality dimensions.
 Coverage error  The differences between the units actually included in the linked datasets in practice (linked set) and the full set of units included in the (ideal) target population. Coverage errors can arise in several ways. For example, the datasets themselves may not cover the whole target population, or linking errors may mean some members of the linked sets are not identified.

This error type may also be caused by measurement errors. For example, if the date of birth variable on an admin dataset is not of good quality and we filter on age to select our population, we could end up with undercoverage even though the units aren’t missing from the source data.
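The age-filter example can be sketched as follows. The records and the data-quality problem are invented for illustration; the point is only that a filter on a poor-quality variable silently drops a unit that the source data does contain:

```python
from datetime import date

# Toy admin records: one person has a missing date of birth
# (a measurement error in the DOB variable).
records = [
    {"id": 1, "dob": date(1990, 5, 1)},
    {"id": 2, "dob": date(2010, 3, 15)},
    {"id": 3, "dob": None},
]

def age_on(dob, ref):
    """Whole-year age at the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))

ref = date(2024, 6, 30)

# Selecting adults by filtering on age drops record 3 entirely,
# creating undercoverage even though the unit exists in the source data.
adults = [r for r in records
          if r["dob"] is not None and age_on(r["dob"], ref) >= 18]
print([r["id"] for r in adults])
```

A coverage indicator that compares the filtered count against the raw record count would flag this kind of loss.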
 Dimensions of quality  A guide to help a national statistical office manage quality in its operations, to ensure customers can have confidence in the statistics published. These dimensions are: accuracy, relevance, timeliness, accessibility, consistency, and interpretability.
 Edited measure  The final values recorded in an admin or survey dataset, after any processing, validation, and other checks. This term is only relevant in phase 1 of the error framework.
 Final dataset  An output micro-dataset after all processing and checks are completed.
 Frame error  The difference between the ideal target set of objects and the accessible set. These errors refer to objects that are inaccessible, even in principle. In a survey context the accessible set is the sampling frame. For an admin source, objects may be inaccessible for many reasons.
 Harmonised measure  The operational measures decided on in designing the statistical output to capture the target concepts. They include elements such as questions, classifications, and variable definitions. For example, a survey question aligned with a standard classification.
 Identification error  Misalignment between the linked set and the aligned set. This type of error also includes situations where the target statistical units cannot be adequately represented using combinations of base units. For example, to measure the economic activity of all manufacturing businesses by industry, we would ideally have separate statistical units to capture different types of manufacturing done by a single company. However, in practice we might have to define statistical units via legal entities. Changes in company or legal structures might result in statistical units being absorbed into others, despite no real-world change in economic activity occurring.
 Indicator  A numerical or descriptive value that can be used to measure or report on an aspect of quality. An indicator can be either a quantitative or qualitative measure.
 Interpretability  Processes and methods used to produce official statistics, including measures of quality such as estimated measurement errors, are fully documented and available so customers can understand the data and determine whether it meets their needs. One of the six quality dimensions.
 Input dataset  Any datasets assessed in phase 1 that are combined and processed to produce the final statistical dataset.
 Input quality  Aspects of the quality of an original data source at the point where it is finalised by the admin agency. The quality of a dataset is assessed to determine any treatment required for it to be used in the statistical production of outputs.

Input quality is best assessed against the data’s original purpose. This gives the dataset a single assessment that anyone can draw on when considering the dataset for a different purpose, since quality issues have different effects depending on what is done with the original data.
 Li-Chun Zhang’s framework  The framework developed by Li-Chun Zhang provides a well-defined list of errors that can occur when producing statistics using a given dataset or a combination of datasets.

Phase 1 of Li-Chun Zhang’s model allows a single data source to be evaluated for the purpose for which data was collected. This evaluation is entirely for the input dataset itself, and does not depend on what we intend to do with the data. In phase 1 the focus is on 'objects', which could be events, transactions, or other entries in an admin dataset.

Phase 2 of the error framework covers errors that arise when existing data is used to produce an output that meets a certain statistical purpose. In phase 2 the reference point is the statistical population we would ideally have access to, and the statistical concepts we want to measure about the units in the population.
 Linked set  Includes all the basic objects from across all source datasets that are matched together to make base units. These units will not necessarily be the output’s final statistical units.
 Mapping error  These arise from transforming variables on the input datasets into defined output variables (the harmonised measures). Such transformations include:

  • Reclassifying from a non-standard classification, or coding a free text field.
  • Deriving a numerical variable from a source dataset, such as removing GST from a transaction value.
  • Modelling a target variable using a combination of several variables on a source dataset, and some model parameters.

In each instance the value of the output variable may differ from the true value – these differences are mapping errors.
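The GST example above can be made concrete. A minimal sketch, assuming the standard NZ GST rate of 15%; the function name and the flat-rate assumption are illustrative only. If the rate actually applied to a transaction differed (for example, zero-rated supplies), the derived value is wrong: that difference is a mapping error.

```python
# Deriving a GST-exclusive value from a GST-inclusive transaction value.
# Assumes a flat 15% rate applied to every transaction; any transaction
# taxed at a different rate would make the derived value a mapping error.
GST_RATE = 0.15

def excl_gst(incl_gst_value):
    """Remove GST from a GST-inclusive dollar amount."""
    return round(incl_gst_value / (1 + GST_RATE), 2)

print(excl_gst(115.00))  # -> 100.0
```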

 Measure  When used in the context of the error framework (eg target or adjusted measure) ‘measure’ refers to the practical definition and method for capturing a variable value. For example, a question on a survey or admin form, including the instructions and definitions given to respondents.

A quality measure is a value derived by analysing a dataset or metadata that captures information about an aspect of the dataset’s quality.
 Measurement (variable) in phase 1  Refers to Li-Chun Zhang’s framework. The measurement side of the diagram is the path from the (possibly abstract) target concept the data is intended to capture, to a final processed value for a concretely defined variable. Sources of error on the measurement side include the degree to which the operational measure used captures the target concept, and how many and what kind of errors are introduced by respondent misunderstanding or processing difficulties.
 Measurement (variables) in phase 2  Refers to Li-Chun Zhang’s framework. The measurement side in phase 2 is concerned with how well the final values of the output statistical variables capture information about the target statistical concept. Measurement errors in phase 2 mostly result from a mismatch (eg in concept, definitions, classifications) between the variables on the original source datasets and the target concept the final output aims for. Ideally, variables are collected using classifications and questions that match what we would use to collect the same information in a specialised survey.
 Measurement error  This occurs when the obtained measure (value actually recorded in the dataset) differs from the measurement intended. Errors could include people misremembering details or interpreting questions differently from their design. In more automated admin systems, such as electronic transaction records, measurement errors could include computer system problems that corrupt some values or introduce ambiguity.
 Missing/redundancy error  Misalignment between the accessed set and the observed set. For example, errors where an agency mistakenly rejects or duplicates objects, due to their own processing, could mean objects are missing from the dataset even though correct data was received about them. This category of error exists so such errors are kept distinct from reporting-type errors. (Compare with selection error.)
 Object  An object could be an event, transaction, or other entry in an admin dataset. The final dataset at the end of all the phase 1 transformations is organised in terms of ‘objects’ rather than ‘units’, to avoid confusion with the statistical units created in phase 2.
 Obtained measure  The values initially received for specific variables against objects in the dataset.
 Observed set  The set of objects that end up in the final, verified dataset after all processing by the source agency.
 Output quality  The quality of the final statistical product for the intended statistical purpose. Reporting on output quality could involve many quantitative measures and explanatory notes about possible limitations.
 Processing error  These arise from editing and other processing done by the source agency to correct or change the initial values received (the obtained measures).

This kind of processing is done to improve the quality of the data for the target concept, but it is important to understand how much improvement it makes, as well as any limitations introduced by the processing.
 Relevance error  These are the phase 2 errors analogous to validity errors. They are errors at a conceptual level that arise from the fact that the concrete harmonised measure usually fails to precisely capture the abstract statistical target concept. For example, we want to find out about personal income but we only measure taxable income – this creates a relevance error, since non-taxable income is part of our target concept but not our harmonised measure.
 Representation (objects) in phase 1  Representation concerns creating the final list of objects in an individual output dataset. This part of the framework deals with how well the objects in the dataset match the objects in the target set (or target population). Ideally every object in the target set has a corresponding object recorded in the data.
 Representation (objects) in phase 2  The representation side of phase 2 deals with creating a list of statistical units to be included in the output data, based on the source data’s objects. Sometimes this list is created directly from the list of objects in a source dataset, but in complex cases different types of linked units created from several datasets might be combined into new statistical composite units.
 Selection error  These errors arise when objects in the accessible set do not appear in the accessed set. For example, if a store manager forgets to run the reporting tool for a week, the transactions missing from the dataset due to that mistake are selection errors: they were accessible, but were not accessed.
 Source agency  The business, organisation, or group originally responsible for the design and creation of an individual dataset.
 Target concept  This is ‘the ideal information sought about an object’ for phase 1, and ‘the ideal information sought about the statistical units’ for phase 2. The target concept is usually connected to the underlying purpose of the collection and may be quite abstract. Examples are: household income, political views, advertising effectiveness, or population counts.
 Target measure  The operational measurement used in practice by a source agency to capture information. A target measure includes elements such as variable definitions, classifications, a questionnaire, or rules and instructions for people filling out forms.
 Target set  The set of all objects the data producer would ideally have data on. For example, people, businesses, events, and transactions.
 Target population  The ideal set of statistical units that a final dataset should cover.
 Timeliness  Data is released within a time period that permits the information to be of value to customers. One of the six quality dimensions.
 Unit error  Creating the final statistical units for the output dataset can introduce unit errors. For instance, to create household units from aligned sets of dwellings and people, we must simultaneously decide which dwellings should have a household created, and which people should go into which household unit. Because the statistical units may not correspond to any of the units in the source data, a variety of errors can arise at this stage.
 Validity error  This error refers to misalignment between the ideal target information and the operational ‘target measure’ used to collect it. The error arises from translating from an abstract target concept (the ideal information sought from the admin dataset about an object) to a concrete target measure, which can actually be observed in practice, and does not include issues such as misunderstanding a term used on a form.
See ‘Available files’ for appendixes 2 and 3 (the quality indicators for phases 1 and 2).