Further technical information

Key data integration concepts

The Data Integration Manual provides a guide to data integration as carried out at Statistics New Zealand.

Integration for statistical and administrative purposes

When data is linked for statistical purposes, individuals (or units) are identified only to enable the link to be made. When the linkage is complete, the identity of the individual (or unit) is no longer of any statistical interest. The linked dataset is used to report statistical findings about the population or sub-populations.

In contrast, when data is linked for administrative purposes, individuals are identified not only to enable the link to be made, but also for administrative use subsequent to the linkage. This may sometimes result in adverse action, such as prosecution, being taken against individuals. Statistics NZ undertakes data integration only for statistical purposes.

Exact linkage and probabilistic linkage

There are two key methods for matching records. Exact linkage involves using a unique identifier (for example, a tax number, passport number, or driver’s licence number) that is present on both files to link records. It is the easiest and most efficient way to link datasets, and standard statistical software such as SAS can be used.
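As a minimal sketch of exact linkage (not the method used in the manual, which mentions SAS), the Python below joins two small files on a shared identifier. The field name tax_number and all record values are made up for illustration, and the identifier is assumed to be unique within each file.

```python
def exact_link(file_a, file_b, key):
    """Link records from two files that share the same value of a unique identifier."""
    # Index file B by the identifier (assumes the identifier is unique within file B).
    index_b = {rec[key]: rec for rec in file_b}
    linked = []
    for rec_a in file_a:
        rec_b = index_b.get(rec_a[key])
        if rec_b is not None:
            # Combine the fields of the two records into one linked record.
            linked.append({**rec_a, **rec_b})
    return linked


# Hypothetical records; "tax_number" stands in for any unique identifier.
enrolments = [
    {"tax_number": "111", "course": "Statistics"},
    {"tax_number": "222", "course": "History"},
]
loans = [
    {"tax_number": "222", "loan_balance": 12000},
    {"tax_number": "333", "loan_balance": 5000},
]

linked = exact_link(enrolments, loans, "tax_number")
print(linked)
```

Only the record whose identifier appears on both files is linked; records found on one file only are left out of the result.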

Where a unique identifier is not available, or is not of sufficient quality or coverage to be relied on alone, probabilistic linkage is employed. This involves the use of other variables common to both files (for example, names, addresses, date of birth, and sex). Probabilistic linking is more complex, and sophisticated data integration software is required to achieve high-quality results.
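To give a feel for how probabilistic linkage weighs several variables together, the sketch below scores a pair of records with agreement weights in the style of the Fellegi-Sunter model. This is one common formulation and an assumption here; the text above does not say which model or software is used. The m- and u-probabilities and the records are invented for illustration, and real software also handles partial agreement, blocking, and missing values.

```python
import math

# Assumed probabilities per linking variable (illustrative values only):
# m = P(values agree | records belong to the same person)
# u = P(values agree | records belong to different people)
M_U = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.97, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b):
    """Sum log2 agreement or disagreement weights over the linking variables."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)  # agreement adds evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement counts against
    return total


a = {"surname": "Smith", "date_of_birth": "1980-02-01", "sex": "F"}
b_same = {"surname": "Smith", "date_of_birth": "1980-02-01", "sex": "F"}
b_other = {"surname": "Jones", "date_of_birth": "1975-07-19", "sex": "F"}

w_same = match_weight(a, b_same)
w_other = match_weight(a, b_other)
print(round(w_same, 2), round(w_other, 2))
```

A pair is then treated as a link when its weight exceeds a chosen threshold. Agreement on sex alone contributes little because it is common among non-matching pairs, while agreement on date of birth contributes a lot.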

Quality assessment

Either linking method can result in two types of errors: false positive matches and false negative matches. A false positive match is where two records are linked together when in reality they do not belong to the same person or unit. A false negative match is where two records are not linked together when they do in fact belong to the same person or unit. Generally there is a trade-off between the two types of errors since, for example, reducing the rate of false positives may increase the rate of false negatives. Thus it is important to consider the consequences of each type of error and to determine whether one is more critical than the other.

An assessment of the size of each of these sources of linkage error should be undertaken as part of the integration, and the results should be made available. Analysis of an integrated dataset should take into account the possible impacts of linkage error.
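The trade-off between the two error types can be illustrated with a small sketch. It assumes that each candidate pair has a match score and that the true status of each pair is known, for instance from a manually checked sample. The scores, truth labels, and thresholds are made up.

```python
def linkage_errors(pairs, threshold):
    """Count false positives and false negatives at a given score threshold.

    pairs: list of (score, is_true_match) tuples for candidate record pairs.
    A pair is linked when its score is at or above the threshold.
    """
    false_pos = sum(1 for score, truth in pairs if score >= threshold and not truth)
    false_neg = sum(1 for score, truth in pairs if score < threshold and truth)
    return false_pos, false_neg


# Hypothetical scores with known truth for each candidate pair.
pairs = [
    (9.5, True), (7.0, True), (4.2, True), (3.8, False),
    (2.5, True), (1.0, False), (-2.0, False), (-6.0, False),
]

lenient = linkage_errors(pairs, threshold=1.0)
strict = linkage_errors(pairs, threshold=5.0)
print(lenient, strict)
```

With these made-up pairs, the lenient threshold links everything that truly matches but lets in two false positives, while the strict threshold removes the false positives at the cost of two false negatives.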

Further details on quality assessment can be found in section 6.4 of the manual.

Data integration scenarios

There are different ways in which two datasets being integrated relate to each other.

Diagram, ways in which two datasets being integrated relate to each other.

Situation 1

This is where every individual on dataset A is also on dataset B and vice versa. For example, dataset A might consist of addresses while dataset B contains rates information for each address.

Situation 2

This is where every individual on dataset B is on dataset A but there are individuals who appear on dataset A who are not on dataset B. For example, dataset A could be student enrolments and dataset B could be information for those students who have student loans.

Situation 3

This is where some individuals appear on both dataset A and dataset B. However, other individuals will appear on only one dataset or the other. For example, dataset A might be Accident Compensation Corporation (ACC) clients, while dataset B could be people who are admitted to hospital.

It should be noted that these are ‘theoretical’ relationships between pairs of files. Real life is rarely that perfect. For example, in situation 1 there could be duplication and omissions within the files, and timing differences between the two files, which mean that they do not have 100 percent overlap.
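The three situations correspond to simple set relationships between the identifiers on each file. The sketch below assumes the files have already been reduced to sets of comparable identifiers; the function name and sample values are hypothetical.

```python
def describe_overlap(ids_a, ids_b):
    """Classify how the individuals on dataset A relate to those on dataset B."""
    a, b = set(ids_a), set(ids_b)
    if a == b:
        return "situation 1"  # every individual is on both datasets
    if b < a:
        return "situation 2"  # everyone on B is on A, but A has extra individuals
    if a < b:
        return "situation 2 (A and B swapped)"
    if a & b:
        return "situation 3"  # some individuals on both, others on only one
    return "no overlap"


print(describe_overlap({"1", "2"}, {"1", "2"}))
print(describe_overlap({"1", "2", "3"}, {"2", "3"}))
print(describe_overlap({"1", "2"}, {"2", "3"}))
```

In practice, duplicates, omissions, and timing differences mean that a pair of files expected to be in situation 1 may well be classified as situation 3 by a check like this.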

There are also different desired results from a pair of integrated datasets: the union or the intersection.

Diagram, results from a pair of integrated datasets.

For example, if dataset A was ACC claims and dataset B was hospitalisations for injury, the intersection would be of interest if statistics were wanted on the number of ACC claimants admitted to hospital as a result of their injury. The union would be of interest if statistics were wanted on the total number of injuries, without double counting injuries represented in both datasets.
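As a rough sketch of this example, assume linkage has already assigned each injury a common identifier on both files (the identifiers below are made up). The intersection and union are then ordinary set operations.

```python
# Hypothetical injury identifiers after linkage.
acc_claims = {"P01", "P02", "P03"}
hospitalisations = {"P02", "P03", "P04"}

# Intersection: injuries with both an ACC claim and a hospital admission.
both = acc_claims & hospitalisations

# Union: all injuries, counting those on both datasets only once.
either = acc_claims | hospitalisations

print(len(both), len(either))
```

Simply adding the two file sizes would give six injuries; the union gives four because the two injuries that appear on both datasets are counted once.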

Sometimes a different combination of records may be required – for example, “all records on B, with information added from A if it is available”.

Diagram, a different combination of records.

Continuing the previous example, this combination may be of interest if hospitalisation costs were to be combined with costs to ACC, for the population of ACC claims.
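The combination "all records on B, with information added from A if it is available" is what database users call a left join, with B as the retained file. The sketch below is written generically; the field names id, cost_a, and cost_b and all values are hypothetical.

```python
def add_if_available(records_b, records_a, key):
    """Keep every record on B, adding fields from A where a matching record exists."""
    index_a = {rec[key]: rec for rec in records_a}
    combined = []
    for rec_b in records_b:
        extra = index_a.get(rec_b[key], {})  # empty if A has no matching record
        # Fields from B take precedence if the same field name is on both files.
        combined.append({**extra, **rec_b})
    return combined


dataset_a = [{"id": "P02", "cost_a": 400}, {"id": "P09", "cost_a": 150}]
dataset_b = [{"id": "P01", "cost_b": 900}, {"id": "P02", "cost_b": 1200}]

combined = add_if_available(dataset_b, dataset_a, "id")
for rec in combined:
    # cost_a is missing (None) where A holds no record for that id.
    print(rec["id"], rec["cost_b"], rec.get("cost_a"))
```

Every record on B appears exactly once in the result, and records that are on A only are dropped.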
