The EpiLink record linkage software: Presentation and results of linkage test on cancer registry files

Cancer Registry Division, Istituto Nazionale per lo Studio e la Cura dei Tumori, Via Venezian 1, 20133 Milan, Italy.
Methods of Information in Medicine (Impact Factor: 2.25). 02/2005; 44(1):66-71. DOI: 10.1267/METH05010066
Source: PubMed


Record linkage, the process of bringing together separately compiled but related records from different databases, is essential in many areas of biomedical research. We developed a record linkage program (EpiLink), which employs a simple mathematical approach. We describe the program and present the results obtained by testing it on a linkage task.
EpiLink was designed to be flexible, with user-friendly settings that tailor linkage and operating parameters to specific linkage tasks, and to employ deterministic, probabilistic or sequential deterministic-probabilistic linkage strategies as required. The user can also standardize the data format, examine linkage results, and accept or discard them. We used EpiLink to link a subset of cases from the Lombardy Cancer Registry (20,724 records) with the Social Security file of the population covered by the registry (1,021,846 records). The linkage strategy was a deterministic step followed by several probabilistic linkage steps.
Manual inspection of the results showed that EpiLink achieved 98.8% specificity and 96.5% sensitivity.
EpiLink is a practical and accurate means of linking records from different databases that can be used by non-statisticians and is efficient in terms of human and financial resources.
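
As a rough illustration of the sequential deterministic-probabilistic strategy described above, the following Python sketch first links records that agree exactly on all identifiers and then scores the remaining records field by field. It is not EpiLink's actual algorithm, which the abstract does not specify: the field names, the weights, the threshold, and the use of a string-similarity ratio are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    record_id: str
    surname: str
    first_name: str
    sex: str
    birth_date: str  # ISO date string, e.g. "1950-03-14"


# Hypothetical field weights and acceptance threshold (illustrative values only).
WEIGHTS = {"surname": 4.0, "first_name": 3.0, "sex": 1.0, "birth_date": 5.0}
THRESHOLD = 10.0


def normalise(value: str) -> str:
    """Standardize a text field: upper-case and collapse whitespace."""
    return " ".join(value.upper().split())


def exact_key(p: Person) -> tuple:
    """Key used by the deterministic step: all identifiers must agree."""
    return (normalise(p.surname), normalise(p.first_name), p.sex.upper(), p.birth_date)


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] after standardization."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def score(a: Person, b: Person) -> float:
    """Weighted agreement score used by the probabilistic step."""
    total = WEIGHTS["surname"] * similarity(a.surname, b.surname)
    total += WEIGHTS["first_name"] * similarity(a.first_name, b.first_name)
    total += WEIGHTS["sex"] * (a.sex.upper() == b.sex.upper())
    total += WEIGHTS["birth_date"] * (a.birth_date == b.birth_date)
    return total


def link(cases, population):
    """Return {case_id: (population_id, step)} for every accepted link."""
    index = {}
    for p in population:
        index.setdefault(exact_key(p), []).append(p)

    links, unmatched = {}, []
    # Step 1: deterministic linkage on a unique exact match.
    for c in cases:
        hits = index.get(exact_key(c), [])
        if len(hits) == 1:
            links[c.record_id] = (hits[0].record_id, "deterministic")
        else:
            unmatched.append(c)

    # Step 2: probabilistic linkage for the records left over.
    for c in unmatched:
        best = max(population, key=lambda p: score(c, p), default=None)
        if best is not None and score(c, best) >= THRESHOLD:
            links[c.record_id] = (best.record_id, "probabilistic")
    return links


population = [
    Person("p1", "Rossi", "Mario", "M", "1950-03-14"),
    Person("p2", "Bianchi", "Anna", "F", "1962-11-02"),
]
cases = [
    Person("c1", "ROSSI", "mario", "m", "1950-03-14"),   # differs only in case
    Person("c2", "Bianki", "Anna", "F", "1962-11-02"),   # misspelled surname
    Person("c3", "Verdi", "Luca", "M", "1971-01-01"),    # not in the population
]
links = link(cases, population)
```

The probabilistic step here compares each unmatched case with every population record, which is only workable for toy data; on a file of about a million records, candidate pairs would normally be restricted first (for example, to records sharing a birth year), and links near the threshold would be left for manual review.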

  • "Open Registry then links the records of the sources files to aggregate information for person. This is done using deterministic and probabilistic methods [13]. Finally data consistency checks are performed, again by ad-hoc routines within Open Registry."
    ABSTRACT: Automated procedures are increasingly used in cancer registration, and it is important that the data produced are systematically checked for consistency and accuracy. We evaluated an automated procedure for cancer registration adopted by the Lombardy Cancer Registry in 1997, comparing automatically generated diagnostic codes with those produced manually over one year (1997). The automatically generated cancer cases were produced by Open Registry algorithms. For manual registration, trained staff consulted clinical records, pathology reports and death certificates. The social security code, present and checked in both databases in all cases, was used to match the files in the automatic and manual databases. The cancer cases generated by the two methods were compared by manual revision. The automated procedure generated 5027 cases: 2959 (59%) were accepted automatically and 2068 (41%) were flagged for manual checking. Among the cases accepted automatically, discrepancies in data items (surname, first name, sex and date of birth) constituted 8.5% of cases, and discrepancies in the first three digits of the ICD-9 code constituted 1.6%. Among flagged cases, cancers of female genital tract, hematopoietic system, metastatic and ill-defined sites, and oropharynx predominated. The usual reasons were use of specific vs. generic codes, presence of multiple primaries, and use of extranodal vs. nodal codes for lymphomas. The percentage of automatically accepted cases ranged from 83% for breast and thyroid cancers to 13% for metastatic and ill-defined cancer sites. Since 59% of cases were accepted automatically and contained relatively few, mostly trivial discrepancies, the automatic procedure is efficient for routine case generation, effectively cutting the workload required for routine case checking by this amount. Among cases not accepted automatically, discrepancies were mainly due to variations in coding practice.
    Full-text · Article · Feb 2006 · Population Health Metrics
  • ABSTRACT: In French national claims databases, claims are currently anonymous, i.e. not linked to individual patients. In order to improve our estimate of the medical activity related to cancer in one French region, a statistical method was developed to use claims data to assess the number of cancer patients hospitalized in acute care. This method used the medical and administrative information available in the claims (i.e. age, primary site, length of stay) to predict an average number of stays per patient, followed by a number of patients. It was based on a two-phase study design using an internal dataset which contained personal identifiers to estimate the model parameters. The predicted number of acute care patients hospitalized in one or several health care centers in one French region was 38,109 with a 95% predictive interval (37,990; 38,228) for the first six months of 2002. A prediction error of 24 per thousand was found. We provide a good estimate of the morbidity in acute care hospitals using claims data that is not linked to individual patients. This estimate reflects the medical activity and can be used to anticipate acute care needs.
    No preview · Article · Feb 2006 · Methods of Information in Medicine
  • ABSTRACT: Birth defects are a leading cause of neonatal and infant mortality in Italy; however, little is known of the etiology of most defects. Improvements in diagnosis have revealed increasing numbers of clinically insignificant defects, while improvements in treatment have increased the survival of those with more serious and complex defects. For etiological studies, prevention, and management, it is important to have population-based monitoring which provides reliable data on the prevalence at birth of such defects. We recently initiated population-based birth defect monitoring in the Provinces of Mantova, Sondrio and Varese of the Region of Lombardy, northern Italy, and report data for the first year of operation (1999). The registry uses all-electronic source files (hospital discharge files, death certificates, regional health files, and pathology reports) and a proven case-generation methodology, which is described. The data were checked manually by consulting clinical records in hospitals. Completeness was checked against birth certificates by capture-recapture. Data on cases were coded according to the four-digit malformation codes of the International Classification of Diseases, Ninth Revision (ICD-9). We present data only on selected defects. We found 246 selected birth defects in 12,008 live births in 1999, 148 among boys and 98 among girls. Congenital heart defects (particularly septal defects) were the most common (90.8/10,000), followed by defects of the genitourinary tract (34.1/10,000) (particularly hypospadias in boys), digestive system (23.3/10,000) and central nervous system (14.9/10,000), orofacial clefts (10.8/10,000) and Down syndrome (8.3/10,000). Completeness was satisfactory: analysis of birth certificates resulted in the addition of two birth defect cases to the registry. This is the first population-based analysis on selected major birth defects in the Region. The high birth prevalences for septal heart defect and hypospadias are probably due to the inclusion of minor defects and lack of coding standardization; the latter problem also seems important for other defects. However, the data produced are useful for estimating the demands made on the health system by babies with birth defects.
    Full-text · Article · Feb 2007 · Population Health Metrics