A Borg

Johannes Gutenberg-Universität Mainz, Mainz, Rhineland-Palatinate, Germany

Publications (5) · 8.95 Total impact

  • Article: A practical framework for data management processes and their evaluation in population-based medical registries.
    ABSTRACT: INTRODUCTION: We present a framework for data management processes in population-based medical registries. Existing guidelines lack the concreteness we deem necessary for them to be of practical use, especially concerning the establishment of new registries. Therefore, we propose adjustments and concretisations with regard to data quality, data privacy, data security and registry purposes. MATERIALS AND METHODS: First, we separately elaborate on the issues to be included into the framework and present proposals for their improvements. Thereafter, we provide a framework for medical registries based on quasi-standard-operation procedures. RESULTS: The main result is a concise and scientifically based framework that tries to be both broad and concrete. Within that framework, we distinguish between data acquisition, data storage and data presentation as sub-headings. We use the framework to categorise and evaluate the data management processes of a German cancer registry. DISCUSSION: The standardisation of data management processes in medical registries is important to guarantee high quality of the registered data, to enhance the realisation of purposes, to increase efficiency and to enable comparisons between registries. Our framework is destined to show how one central impediment for such standardisations - lack of practicality - can be addressed on scientific grounds.
    Informatics for Health and Social Care 01/2013; · 0.87 Impact Factor
  • Article: Bagging, bumping, multiview, and active learning for record linkage with empirical results on patient identity data.
    M Sariyar, A Borg
    ABSTRACT: Record linkage or deduplication deals with the detection and deletion of duplicates in and across files. For this task, this paper introduces and evaluates two new machine-learning methods (bumping and multiview) together with bagging, a tree-based ensemble-approach. Whereas bumping represents a tree-based approach as well, multiview is based on the combination of different methods and the semi-supervised learning principle. After providing a theoretical background of the methods, initial empirical results on patient identity data are given. In the empirical evaluation, we calibrate the methods on three different kinds of training data. The results show that the smallest training data set, which is obtained by a simple active learning strategy, leads to the best results. Multiview can outperform the other methods only when all are calibrated on a randomly sampled training set; in all other cases, it performs worse. The results of bumping do not differ significantly from the overall best performing method bagging. We cautiously conclude that tree-based record linkage methods are likely to produce similar results because of the low-dimensionality (p≪n) and straightforwardness of the underlying problem. Multiview is possibly rather suitable for problems that are more sophisticated.
    Computer methods and programs in biomedicine 09/2012; · 1.14 Impact Factor
  • Article: Active learning strategies for the deduplication of electronic patient data using classification trees.
    ABSTRACT: Supervised record linkage methods often require a clerical review to gain informative training data. Active learning means to actively prompt the user to label data with special characteristics in order to minimise the review costs. We conducted an empirical evaluation to investigate whether a simple active learning strategy using binary comparison patterns is sufficient or if string metrics together with a more sophisticated algorithm are necessary to achieve high accuracies with a small training set. Based on medical registry data with different numbers of attributes, we used active learning to acquire training sets for classification trees, which were then used to classify the remaining data. Active learning for binary patterns means that every distinct comparison pattern represents a stratum from which one item is sampled. Active learning for patterns consisting of the Levenshtein string metric values uses an iterative process where the most informative and representative examples are added to the training set. In this context, we extended the active learning strategy by Sarawagi and Bhamidipaty (2002) [6]. On the original data set, active learning based on binary comparison patterns leads to the best results. When dropping four or six attributes, using string metrics leads to better results. In both cases, not more than 200 manually reviewed training examples are necessary. In record linkage applications where only forename, name and birthday are available as attributes, we suggest the sophisticated active learning strategy based on string metrics in order to achieve highly accurate results. We recommend the simple strategy if more attributes are available, as in our study. In both cases, active learning significantly reduces the amount of manual involvement in training data selection compared to usual record linkage settings.
    Journal of Biomedical Informatics 02/2012; 45(5):893-900. · 1.79 Impact Factor
  • Article: Missing values in deduplication of electronic patient data.
    ABSTRACT: INTRODUCTION: Systematic approaches to dealing with missing values in record linkage are still lacking. This article compares the ad-hoc treatment of unknown comparison values as 'unequal' with other and more sophisticated approaches. An empirical evaluation was conducted of the methods on real-world data as well as on simulated data based on them. MATERIAL AND METHODS: Cancer registry data and artificial data with increased numbers of missing values in a relevant variable are used for empirical comparisons. As a classification method, classification and regression trees were used. On the resulting binary comparison patterns, the following strategies for dealing with missingness are considered: imputation with unique values, sample-based imputation, reduced-model classification and complete-case induction. These approaches are evaluated according to the number of training data needed for induction and the F-scores achieved. RESULTS: The evaluations reveal that unique value imputation leads to the best results. Imputation with zero is preferred to imputation with 0.5, although the latter shows the highest median F-scores. Imputation with zero needs considerably less training data, it shows only slightly worse results and simplifies the computation by maintaining the binary structure of the data. CONCLUSIONS: The results support the ad-hoc solution for missing values 'replace NA by the value of inequality'. This conclusion is based on a limited amount of data and on a specific deduplication method. Nevertheless, the authors are confident that their results should be confirmed by other empirical analyses and applications.
    Journal of the American Medical Informatics Association 10/2011; 19(e1):e76-e82. · 3.61 Impact Factor
  • Article: Evaluation of record linkage methods for iterative insertions.
    ABSTRACT: There have been many developments and applications of mathematical methods in the context of record linkage as one area of interdisciplinary research efforts. However, comparative evaluations of record linkage methods are still underrepresented. In this paper improvements of the Fellegi-Sunter model are compared with other elaborated classification methods in order to direct further research endeavors to the most promising methodologies. The task of linking records can be viewed as a special form of object identification. We consider several non-stochastic methods and procedures for the record linkage task in addition to the Fellegi-Sunter model and perform an empirical evaluation on artificial and real data in the context of iterative insertions. This evaluation provides a deeper insight into empirical similarities and differences between different modelling frames of the record linkage problem. In addition, the effects of using string comparators on the performance of different matching algorithms are evaluated. Our central results show that stochastic record linkage based on the principle of the EM algorithm exhibits best classification results when calibrating data are structurally different to validation data. Bagging, boosting together with support vector machines are best classification methods when calibrating and validation data have no major structural differences. The most promising methodologies for record linkage in environments similar to the one considered in this paper seem to be stochastic ones.
    Methods of Information in Medicine 09/2009; 48(5):429-37. · 1.53 Impact Factor
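
Two ideas from the abstracts above can be illustrated together: the simple active learning strategy, in which every distinct binary comparison pattern is a stratum from which one item is sampled for manual review, and the ad-hoc rule of treating a missing comparison value as unequal. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation. The function names, the field names and the dictionary-based record layout are hypothetical.

```python
import random
from collections import defaultdict


def comparison_pattern(rec_a, rec_b, fields):
    """Return a binary comparison pattern for one record pair.

    An attribute contributes 1 only if both values are present and equal.
    A missing value (None) is treated as 'unequal', i.e. imputed with 0.
    """
    return tuple(
        1 if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f) else 0
        for f in fields
    )


def sample_one_per_pattern(pairs, fields, seed=0):
    """Pick one record pair per distinct comparison pattern (one per stratum).

    The returned pairs are the candidates a reviewer would label by hand
    before a classifier is trained on them.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for rec_a, rec_b in pairs:
        strata[comparison_pattern(rec_a, rec_b, fields)].append((rec_a, rec_b))
    return {pattern: rng.choice(members) for pattern, members in strata.items()}


# Hypothetical example records; field names are assumptions for this sketch.
FIELDS = ["forename", "name", "birthday"]
rec1 = {"forename": "Anna", "name": "Meier", "birthday": "1970-01-01"}
rec2 = {"forename": "Anna", "name": "Meyer", "birthday": None}

candidate_pairs = [(rec1, rec2), (rec1, rec1), (rec2, rec2)]
to_review = sample_one_per_pattern(candidate_pairs, FIELDS)
```

The size of the resulting training set is bounded by the number of distinct patterns (at most 2^k for k attributes), which is why this strategy keeps the manual review effort small.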

Institutions

  • 2009–2013
    • Johannes Gutenberg-Universität Mainz
      • Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI)
      Mainz, Rhineland-Palatinate, Germany