The EpiLink record linkage software: presentation and results of linkage test on cancer registry files.

Cancer Registry Division, Istituto Nazionale per lo Studio e la Cura dei Tumori, Via Venezian 1, 20133 Milan, Italy.
Methods of Information in Medicine (Impact Factor: 1.08). 02/2005; 44(1):66-71. DOI: 10.1267/METH05010066
Source: PubMed

ABSTRACT Record linkage, the process of bringing together separately compiled but related records from different databases, is essential in many areas of biomedical research. We developed a record linkage program (EpiLink), which employs a simple mathematical approach. We describe the program and present results obtained testing it in a linkage task.
EpiLink was designed to be flexible with user-friendly settings to tailor linkage and operating parameters to specific linkage tasks, and employ deterministic, probabilistic or sequential deterministic-probabilistic linkage strategies as required. The user can also standardize data format, examine linkage results and accept or discard them. We used EpiLink to link a subset of cases of the Lombardy Cancer Registry (20,724 records) with the Social Security file of the population (1,021,846 records) covered by the registry. The linkage strategy was deterministic, followed by several probabilistic linkage steps.
Manual inspection of the results showed that EpiLink achieved 98.8% specificity and 96.5% sensitivity.
EpiLink is a practical and accurate means of linking records from different databases that can be used by non-statisticians and is efficient in terms of human and financial resources.
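
The sequential deterministic-probabilistic strategy described in the abstract can be illustrated with a minimal Python sketch. This is not EpiLink's code or its published formula: the field names, the weights, the 0.85 threshold, and the use of difflib string similarity are all assumptions chosen for illustration. Records are first linked on exact agreement of all standardized fields; the remaining records are then scored against every candidate with a weighted similarity and accepted only above the threshold.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (illustrative only, not EpiLink's settings).
WEIGHTS = {"surname": 4.0, "first_name": 3.0, "birth_date": 5.0, "sex": 1.0}
FIELDS = tuple(WEIGHTS)


def normalize(value):
    """Standardize a field: upper-case and collapse whitespace."""
    return " ".join(str(value).upper().split())


def similarity(a, b):
    """String similarity in [0, 1] between two standardized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def score(rec_a, rec_b):
    """Weighted mean of per-field similarities (a generic score, assumed here)."""
    total = sum(WEIGHTS.values())
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items()) / total


def link(cases, population, threshold=0.85):
    """Deterministic pass on exact keys, then a probabilistic pass on the rest."""
    # Step 1: deterministic linkage on exact agreement of all standardized fields.
    index = {}
    for pid, rec in population.items():
        key = tuple(normalize(rec[f]) for f in FIELDS)
        index.setdefault(key, []).append(pid)

    links, unmatched = {}, []
    for cid, rec in cases.items():
        hits = index.get(tuple(normalize(rec[f]) for f in FIELDS), [])
        if len(hits) == 1:  # accept only unambiguous exact matches
            links[cid] = (hits[0], "deterministic", 1.0)
        else:
            unmatched.append(cid)

    # Step 2: probabilistic linkage for records the exact pass could not resolve.
    for cid in unmatched:
        best_pid, best_score = None, 0.0
        for pid, rec in population.items():
            s = score(cases[cid], rec)
            if s > best_score:
                best_pid, best_score = pid, s
        if best_pid is not None and best_score >= threshold:
            links[cid] = (best_pid, "probabilistic", best_score)
    return links


# Made-up example records (not data from the study).
cases = {
    "c1": {"surname": "Rossi", "first_name": "Maria", "birth_date": "1950-03-12", "sex": "F"},
    "c2": {"surname": "Bianchi", "first_name": "Giovani", "birth_date": "1948-11-02", "sex": "M"},
    "c3": {"surname": "Verdi", "first_name": "Luca", "birth_date": "1970-01-01", "sex": "M"},
}
population = {
    "p1": {"surname": "rossi ", "first_name": "Maria", "birth_date": "1950-03-12", "sex": "F"},
    "p2": {"surname": "Bianchi", "first_name": "Giovanni", "birth_date": "1948-11-02", "sex": "M"},
    "p3": {"surname": "Esposito", "first_name": "Anna", "birth_date": "1931-07-21", "sex": "F"},
}
links = link(cases, population)
for cid, (pid, method, s) in sorted(links.items()):
    print(cid, pid, method, round(s, 3))
```

In this toy data the first case links deterministically after standardization, the second (a misspelled first name) links only in the probabilistic pass, and the third stays unlinked. The exhaustive pairwise comparison in step 2 would be too slow for files of the size reported above; a real run would need some form of candidate restriction (blocking), which this sketch omits.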


Available from: Andrea Tittarelli, Nov 14, 2014
  • Source
    ABSTRACT: Background Medical research networks rely on record linkage and pseudonymization to determine which records from different sources relate to the same patient. To establish informational separation of powers, the required identifying data are redirected to a trusted third party that has, in turn, no access to medical data. This pseudonymization service receives identifying data, compares them with a list of already reported patient records and replies with a (new or existing) pseudonym. We found existing solutions to be technically outdated, complex to implement or not suitable for internet-based research infrastructures. In this article, we propose a new RESTful pseudonymization interface tailored for use in web applications accessed by modern web browsers. Methods The interface is modelled as a resource-oriented architecture, which is based on the representational state transfer (REST) architectural style. We translated typical use-cases into resources to be manipulated with well-known HTTP verbs. Patients can be re-identified in real-time by authorized users' web browsers using temporary identifiers. We encourage the use of PID strings for pseudonyms and the EpiLink algorithm for record linkage. As a proof of concept, we developed a Java Servlet as reference implementation. Results The following resources have been identified: Sessions allow data associated with a client to be stored beyond a single request while still maintaining statelessness. Tokens authorize for a specified action and thus allow the delegation of authentication. Patients are identified by one or more pseudonyms and carry identifying fields. Relying on HTTP calls alone, the interface is firewall-friendly. The reference implementation has proven to be production stable. Conclusion The RESTful pseudonymization interface fits the requirements of web-based scenarios and allows building applications that make pseudonymization transparent to the user using ordinary web technology. The open-source reference implementation implements the web interface as well as a scientifically grounded algorithm to generate non-speaking pseudonyms.
    BMC Medical Informatics and Decision Making 02/2015; 15(1):2. DOI:10.1186/s12911-014-0123-5 · 1.50 Impact Factor
  • Source
    ABSTRACT: High circulating glucose has been associated with increased risk of breast cancer (BC). There may also be a link between serum glucose and prognosis in women treated for BC. We assessed the effect of peridiagnostic fasting blood glucose and body mass index (BMI) on long-term BC prognosis. We retrospectively investigated 1,261 women diagnosed and treated for stage I-III BC at the National Cancer Institute, Milan, in 1996, 1999 and 2000. Data on blood tests and follow-up were obtained by linking electronic archives, with follow-up to end of 2009. Multivariate Cox modelling estimated hazard ratios (HR) with 95 % confidence intervals (CI) for distant metastasis, recurrence and death (all causes) in relation to categorized peridiagnostic fasting blood glucose and BMI. Mediation analysis investigated whether blood glucose mediated the BMI-breast cancer prognosis association. The risks of distant metastasis were significantly higher for all other quintiles compared to the lowest glucose quintile (reference <87 mg/dL) (respective HRs: 1.99 95 % CI 1.23-3.24, 1.85 95 % CI 1.14-3.0, 1.73 95 % CI 1.07-2.8, and 1.91 95 % CI 1.15-3.17). The risk of recurrence was significantly higher for all other glucose quintiles compared to the first. The risk of death was significantly higher than reference in the second, fourth and fifth quintiles. Women with BMI ≥ 25 kg/m² had significantly greater risks of recurrence and distant metastasis than those with BMI < 25 kg/m², irrespective of blood glucose. The increased risks remained invariant over a median follow-up of 9.5 years. Mediation analysis indicated that glucose and BMI had independent effects on BC prognosis. Peridiagnostic high fasting glucose and obesity predict worsened short- and long-term outcomes in BC patients. Maintaining healthy blood glucose levels and normal weight may improve prognosis.
    Breast Cancer Research and Treatment 04/2013; 138(3). DOI:10.1007/s10549-013-2519-9 · 4.20 Impact Factor
  • Source
    ABSTRACT: Automated procedures are increasingly used in cancer registration, and it is important that the data produced are systematically checked for consistency and accuracy. We evaluated an automated procedure for cancer registration adopted by the Lombardy Cancer Registry in 1997, comparing automatically-generated diagnostic codes with those produced manually over one year (1997). The automatically generated cancer cases were produced by Open Registry algorithms. For manual registration, trained staff consulted clinical records, pathology reports and death certificates. The social security code, present and checked in both databases in all cases, was used to match the files in the automatic and manual databases. The cancer cases generated by the two methods were compared by manual revision. The automated procedure generated 5027 cases: 2959 (59%) were accepted automatically and 2068 (41%) were flagged for manual checking. Among the cases accepted automatically, discrepancies in data items (surname, first name, sex and date of birth) constituted 8.5% of cases, and discrepancies in the first three digits of the ICD-9 code constituted 1.6%. Among flagged cases, cancers of the female genital tract, hematopoietic system, metastatic and ill-defined sites, and oropharynx predominated. The usual reasons were use of specific vs. generic codes, presence of multiple primaries, and use of extranodal vs. nodal codes for lymphomas. The percentage of automatically accepted cases ranged from 83% for breast and thyroid cancers to 13% for metastatic and ill-defined cancer sites. Since 59% of cases were accepted automatically and contained relatively few, mostly trivial discrepancies, the automatic procedure is efficient for routine case generation, effectively cutting the workload required for routine case checking by this amount. Among cases not accepted automatically, discrepancies were mainly due to variations in coding practice.
    Population Health Metrics 02/2006; 4:10. DOI:10.1186/1478-7954-4-10 · 2.11 Impact Factor