Stephan Dreiseitl

Harvard University, Cambridge, Massachusetts, United States

Publications (44)

  • Dominic Girardi · Sandra Wartner · Gerhard Halmerbauer · [...] · Stephan Dreiseitl
    ABSTRACT: Objective: We introduce a new distance measure that is better suited than traditional methods to detecting similarities in patient records by referring to a concept hierarchy. Materials and methods: The new distance measure improves on distance measures for categorical values by taking the path distance between concepts in a hierarchy into account. We evaluate and compare the new measure on a data set of 836 patients. Results: The new measure shows marked improvements over the standard measures, both qualitatively and quantitatively. Using the new measure for clustering patient data reveals structure that is otherwise not visible. Statistical comparisons of distances within patient groups with similar diagnoses show that the new measure is significantly better at detecting these similarities than the standard measures. Conclusion: The new distance measure is an improvement over the current standard whenever a hierarchical arrangement of categorical values is available.
    Article · Jul 2016 · Journal of Biomedical Informatics
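The abstract above relies on the path distance between concepts in a hierarchy. The following is only a minimal sketch of that one ingredient, not the authors' measure: it counts the edges between two concepts in a tree. The `PARENT` table and its concept names are hypothetical, invented for illustration.

```python
def ancestors(concept, parent):
    """Return the chain from a concept up to the root, concept first."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def path_distance(a, b, parent):
    """Number of edges on the tree path between concepts a and b."""
    up_a = ancestors(a, parent)
    depth_in_a = {c: i for i, c in enumerate(up_a)}
    # Walk up from b until we hit the first concept that is also above a.
    for steps_b, c in enumerate(ancestors(b, parent)):
        if c in depth_in_a:
            return depth_in_a[c] + steps_b
    raise ValueError("concepts do not share a root")


# Hypothetical toy hierarchy (child -> parent), not taken from the paper.
PARENT = {
    "bacterial pneumonia": "pneumonia",
    "viral pneumonia": "pneumonia",
    "pneumonia": "respiratory disease",
    "asthma": "respiratory disease",
}

d_siblings = path_distance("bacterial pneumonia", "viral pneumonia", PARENT)
d_cousins = path_distance("bacterial pneumonia", "asthma", PARENT)
```

A plain categorical distance would treat both pairs as equally different; the path distance separates sibling diagnoses from more distant ones.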
  • ABSTRACT: Background: Next-generation sequencing allows for determining the genetic composition of a mixed sample. For instance, when performing resistance testing for BCR-ABL1 it is necessary to identify clones and define compound mutations; together with an exact quantification this may complement diagnosis and therapy decisions with additional information. Moreover, this applies not only to oncological questions but also to the determination of viral, bacterial or fungal infections. Retrieving multiple haplotypes (more than two) and proportion information from data with conventional software is difficult and cumbersome, and demands multiple manual steps. Results: Therefore, we developed a tool called cFinder that is capable of automatic detection of haplotypes and their accurate quantification within one sample. BCR-ABL1 samples containing multiple clones were used for testing and our cFinder could identify all previously found clones together with their abundance and even refine some results. Additionally, reads were simulated using GemSIM with multiple haplotypes; the detection was very close to linear (R² = 0.96). Our aim is not to deduce haploblocks over statistics, but to characterize one sample's composition precisely. As a result the cFinder reports the connections of variants (haplotypes) with their read count and relative occurrence (percentage). Download is available at . Conclusions: Our cFinder is implemented in an efficient algorithm that can be run on a low-performance desktop computer. Furthermore, it considers paired-end information (if available) and is generally open for any current next-generation sequencing technology and alignment strategy. To our knowledge, this is the first software that enables researchers without extensive bioinformatic support to designate multiple haplotypes and how they constitute a sample.
    Full-text Article · Dec 2015 · BMC Research Notes
  • ABSTRACT: We evaluated the accuracy of diagnoses made from pictures taken with the built-in cameras of mobile phones in a 'real-life' clinical setting. A total of 263 patients took part, who photographed their own lesions where possible, and provided clinical information via a questionnaire. After the teledermatology procedure, each patient was examined face-to-face and a gold standard diagnosis was made. The telemedicine data and pictures were diagnosed by 15 dermatologists. The 299 cases contained 1-22 clinical images each (median 3). Nine dermatologists finished all the cases and the remaining six completed some of them, thus providing 2893 decisions. Overall, 61% of all cases were rated as possible to diagnose and of those, 80% were correct in comparison with the face-to-face diagnosis. Image quality was evaluated and the median was 5 on a 10-point scale. There was a significant correlation between the correct diagnosis and the quality of the photographs taken (P < 0.001). In nearly two-thirds of all cases, a teledermatology diagnosis was possible; however, there was insufficient information to make a telemedicine diagnosis in about one-third of the cases. If applied carefully, mobile phones could be a powerful tool for people to optimize their health care status.
    Full-text Article · Jun 2013 · Journal of Telemedicine and Telecare
  • ABSTRACT: Objective: To develop a birth weight (BW), gestational age (GA), and postnatal-weight gain retinopathy of prematurity (ROP) prediction model in a cohort of infants meeting current screening guidelines. Methods: Multivariate logistic regression was applied retrospectively to data from infants born with BW less than 1501 g or GA of 30 weeks or less at a single Philadelphia hospital between January 1, 2004, and December 31, 2009. In the model, BW, GA, and daily weight gain rate were used repeatedly each week to predict risk of Early Treatment of Retinopathy of Prematurity type 1 or 2 ROP. If risk was above a cut-point level, examinations would be indicated. Results: Of 524 infants, 20 (4%) had type 1 ROP and received laser treatment; 28 (5%) had type 2 ROP. The model (Children's Hospital of Philadelphia [CHOP]) accurately predicted all infants with type 1 ROP; missed 1 infant with type 2 ROP, who did not require laser treatment; and would have reduced the number of infants requiring examinations by 49%. Raising the cut point to miss one type 1 ROP case would have reduced the need for examinations by 79%. Using daily weight measurements to calculate weight gain rate resulted in slightly higher examination reduction than weekly measurements. Conclusions: The BW-GA-weight gain CHOP ROP model demonstrated accurate ROP risk assessment and a large reduction in the number of ROP examinations compared with current screening guidelines. As a simple logistic equation, it can be calculated by hand or represented as a nomogram for easy clinical use. However, larger studies are needed to achieve a highly precise estimate of sensitivity prior to clinical application.
    Article · Dec 2012 · Archives of Ophthalmology
  • M. Osl · M. Netzer · S. Dreiseitl · C. Baumgartner
    ABSTRACT: This chapter provides an overview of emerging bioinformatics methods for the biomarker discovery process and medical decision support. It introduces study design considerations and bioanalytic concepts for generating biomedical data, followed by various data mining and information retrieval procedures such as feature selection, classification, and statistical and clinical validation. The reviewed methods are illustrated by real examples from preclinical and clinical studies, and the application in medical decision making is discussed. This chapter is intended for those with a bioinformatics background as well as biomedical researchers who are interested in the application of computational methods in biomarker discovery and medical decision making.
    Chapter · Jan 2012
  • Stephan Dreiseitl · Melanie Osl
    ABSTRACT: The accurate assessment of the calibration of classification models is severely limited by the fact that there is no easily available gold standard against which to compare a model's outputs. The usual procedures group expected and observed probabilities, and then perform a χ² goodness-of-fit test. We propose an entirely new approach to calibration testing that can be derived directly from the first principles of statistical hypothesis testing. The null hypothesis is that the model outputs are correct, i.e., that they are good estimates of the true unknown class membership probabilities. Our test calculates a p-value by checking how (im)probable the observed class labels are under the null hypothesis. We demonstrate by experiments that our proposed test performs comparably to, and sometimes even better than, the Hosmer-Lemeshow goodness-of-fit test, the de facto standard in calibration assessment.
    Article · Jan 2012 · AMIA Annual Symposium Proceedings
  • Stephan Dreiseitl · Maja Pivec · Michael Binder
    ABSTRACT: To use computer-based eye tracking technology to record and evaluate examination characteristics of the diagnosis of pigmented skin lesions. 16 study participants with varying levels of diagnostic expertise (little, intermediate, superior) were recorded while diagnosing a series of 28 digital images of pigmented skin lesions, obtained by non-invasive digital dermatoscopy, on a computer screen. Eye tracking hardware recorded the gaze track and fixations of the physicians while they examined the lesion images. Analysis of variance was used to test for differences in examination characteristics between physicians grouped according to expertise. There were no significant differences between physicians with little and intermediate levels of expertise in terms of average time until diagnosis (6.61 vs. 6.19 s), gaze track length (6.65 vs. 6.15 kilopixels), number of fixations (23.1 vs. 19.1), and time in fixations (4.91 vs. 4.17 s). The experts were significantly different with 3.17 s time until diagnosis, 4.53 kilopixels gaze track length, 9.9 fixations, and 1.74 s in fixations, respectively. Differentiation between benign and malignant lesions had no effect on examination measurements. The results show that experience level has a significant impact on the way in which lesion images are examined. This finding can be used to construct decision support systems that employ important diagnostic features identified by experts, and to optimize teaching for less experienced physicians.
    Article · Dec 2011 · Artificial Intelligence in Medicine
  • Melanie Osl · Stephan Dreiseitl
    ABSTRACT: Acute myocardial infarction is one of the most common cardiovascular diseases in the Western world. Fortunately, not all myocardial infarctions are fatal. By early diagnosis of acute myocardial infarction based on symptoms at a patient’s presentation in the emergency department, the number of deaths may be further reduced, as life-saving actions can be taken sooner. In this paper, we investigate the application of kernel-based methods to this problem, i.e. we evaluate the performance of support vector machines and kernel logistic regression models and compare these two methods to logistic regression models in terms of discrimination and calibration. The results show that kernel-based methods have higher discriminatory power for early diagnosis of acute myocardial infarction than logistic regression models and that kernel logistic regression models have superior calibration in comparison to logistic regression models and support vector machines.
    Article · Apr 2011
  • Stephan Dreiseitl · Melanie Osl
    ABSTRACT: The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on performing chi-squared goodness-of-fit tests between grouped probabilities and the observations in these groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test with groups based on sorted model outputs. Since this grouping does not represent “natural” groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, k-means clustering, and random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while being able to locate regions of poor calibration in data space.
    Conference Paper · Feb 2011
  • ABSTRACT: To develop an efficient clinical prediction model that includes postnatal weight gain to identify infants at risk of developing severe retinopathy of prematurity (ROP). Under current birth weight (BW) and gestational age (GA) screening criteria, <5% of infants examined in countries with advanced neonatal care require treatment. This study was a secondary analysis of prospective data from the Premature Infants in Need of Transfusion Study, which enrolled 451 infants with a BW < 1000 g at 10 centers. There were 367 infants who remained after excluding deaths (82) and missing weights (2). Multivariate logistic regression was used to predict severe ROP (stage 3 or treatment). Median BW was 800 g (445-995). There were 67 (18.3%) infants who had severe ROP. The model included GA, BW, and daily weight gain rate. Run weekly, an alarm that indicated need for eye examinations occurred when the predicted probability of severe ROP was >0.085. This identified 66 of 67 severe ROP infants (sensitivity of 99% [95% confidence interval: 94%-100%]), and all 33 infants requiring treatment. Median alarm-to-outcome time was 10.8 weeks (range: 1.9-17.6). There were 110 (30%) infants who had no alarm. Nomograms were developed to determine risk of severe ROP by BW, GA, and postnatal weight gain. In a high-risk cohort, a BW-GA-weight-gain model could have reduced the need for examinations by 30%, while still identifying all infants requiring laser surgery. Additional studies are required to determine whether including larger-BW, lower-risk infants would reduce examinations further and to validate the prediction model and nomograms before clinical use.
    Full-text Article · Feb 2011 · Pediatrics
  • S. Dreiseitl · M. Osl
    ABSTRACT: Binary classifier systems that provide class membership probabilities as outputs may be augmented by a reject option to refuse classification for cases that either appear to be outliers, or for which the output probability is around 0.5. We investigated the effect of these two reject options (called "distance reject" and "ambiguity reject", respectively) on the calibration and discriminatory power of logistic regression models. Outliers were found using one-class support vector machines. Discriminatory power was measured by the area under the ROC curve, and calibration by the Hosmer-Lemeshow goodness-of-fit test. Using an artificial data set and a real-world data set for diagnosing myocardial infarction, we found that ambiguity reject increased discriminatory power, while distance reject decreased it. We did not observe any influence of either reject option on the calibration of the logistic regression models.
    Article · Jan 2011
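The "ambiguity reject" option described in the abstract above can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the authors' implementation: the function name, the `margin` parameter and its default of 0.1 are hypothetical, and the "distance reject" option (which the abstract bases on one-class support vector machines) is left out.

```python
def ambiguity_reject(probs, margin=0.1):
    """Return indices of cases that are kept, i.e. not rejected.

    A case is rejected when its predicted probability lies within
    `margin` of 0.5. The margin value is an illustrative assumption.
    """
    return [i for i, p in enumerate(probs) if abs(p - 0.5) >= margin]


# Hypothetical model outputs for five cases.
outputs = [0.05, 0.45, 0.5, 0.62, 0.9]
kept = ambiguity_reject(outputs, margin=0.1)
```

Only the kept cases would then be classified; discrimination and calibration would be measured on that subset.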
  • Melanie Osl · Stephan Dreiseitl · Jihoon Kim · [...] · Lucila Ohno-Machado
    ABSTRACT: The quality of predictive modeling in biomedicine depends on the amount of data available for model building. To study the effect of combining microarray data sets on feature selection and predictive modeling performance. Empirical evaluation of stability of feature selection and discriminatory power of classifiers using three previously published gene expression data sets, analyzed both individually and in combination. Feature selection was not robust for either the individual or the combined data sets. The classification performance of models built on individual and combined data sets was heavily dependent on the data set from which the features were extracted. We identified volatility of feature selection as a contributing factor to some of the problems faced by predictive modeling using microarray data.
    Article · Nov 2010 · AMIA Annual Symposium Proceedings
  • Stephan Dreiseitl · Melanie Osl · Christian Baumgartner · Staal Vinterbo
    ABSTRACT: To evaluate and compare the performance of different rule-ranking algorithms for rule-based classifiers on biomedical datasets. Empirical evaluation of five rule ranking algorithms on two biomedical datasets, with performance evaluation based on ROC analysis and 5 × 2 cross-validation. On a lung cancer dataset, the area under the ROC curve (AUC) of, on average, 14267.1 rules was 0.862. Multi-rule ranking found 13.3 rules with an AUC of 0.852. Four single-rule ranking algorithms, using the same number of rules, achieved average AUC values of 0.830, 0.823, 0.823, and 0.822, respectively. On a prostate cancer dataset, an average of 339265.3 rules had an AUC of 0.934, while 9.4 rules obtained from multi-rule and single-rule rankings had average AUCs of 0.932, 0.926, 0.925, 0.902 and 0.902, respectively. Multi-variate rule ranking performs better than the single-rule ranking algorithms. Both single-rule and multi-rule methods are able to substantially reduce the number of rules while keeping classification performance at a level comparable to the full rule set.
    Article · May 2010 · Artificial Intelligence in Medicine
  • Stephan Dreiseitl · Melanie Osl · Christian Scheibböck · Michael Binder
    ABSTRACT: Medical diagnosis and prognosis using machine learning methods is usually represented as a supervised classification problem, where a model is built to distinguish "normal" from "abnormal" cases. If cases are available from only one class, this approach is not feasible. To evaluate the performance of classification via outlier detection by one-class support vector machines (SVMs) as a means of identifying abnormal cases in the domain of melanoma prognosis. Empirical evaluation of one-class SVMs on a data set for predicting the presence or absence of metastases in melanoma patients, and comparison with regular SVMs and artificial neural networks. One-class SVMs achieve an area under the ROC curve (AUC) of 0.71; two-class algorithms achieve AUCs between 0.5 and 0.84, depending on the available number of cases from the minority class. One-class SVMs offer a viable alternative to two-class classification algorithms if class distribution is heavily imbalanced.
    Article · Jan 2010 · AMIA Annual Symposium Proceedings
  • M Osl · S Dreiseitl · F Cerqueira · [...] · C Baumgartner
    ABSTRACT: The identification of a set of relevant but not redundant features is an important first step in building predictive and diagnostic models from biomedical data sets. Most commonly, individual features are ranked in terms of a quality criterion, out of which the best (first) k features are selected. However, feature ranking methods do not sufficiently account for interactions and correlations between the features. Thus, redundancy is likely to be encountered in the selected features. We present a new algorithm, termed Redundancy Demoting (RD), that takes an arbitrary feature ranking as input, and improves this ranking by identifying redundant features and demoting them to positions in the ranking in which they are not redundant. Redundant features are those that are correlated with other features and not relevant in the sense that they do not improve the discriminatory ability of a set of features. Experiments on two cancer data sets, one melanoma image data set and one lung cancer microarray data set, show that our algorithm greatly improves the feature rankings provided by the methods information gain, ReliefF and Student's t-test in terms of predictive power.
    Article · Jun 2009 · Journal of Biomedical Informatics
  • Stephan Dreiseitl · Michael Binder · Krispin Hable · Harald Kittler
    ABSTRACT: The aim of this study was to evaluate the accuracy of a computer-based system for the automated diagnosis of melanoma in the hands of nonexpert physicians. We performed a prospective comparison between nonexperts using computer assistance and experts without assistance in the setting of a tertiary referral center at a university hospital. Between February and November 2004 we enrolled 511 consecutive patients. Each patient was examined by two nonexpert physicians with low to moderate diagnostic skills who were allowed to use a neural network-based diagnostic system at their own discretion. Every patient was also examined by an expert dermatologist using standard dermatoscopy equipment. The nonexpert physicians used the automatic diagnostic system in 3827 pigmented skin lesions. In their hands, the system achieved a sensitivity of 72% and a specificity of 82%. The sensitivity was significantly lower than that of the expert physician (72 vs. 96%, P = 0.001), whereas the specificity was significantly higher (82 vs. 72%, P < 0.01). Three melanomas were missed because the physicians who operated the system did not choose them for examination. The system as a stand-alone device had an average discriminatory power of 0.87, as measured by the area under the receiver operating characteristic curve, with optimal sensitivities and specificities of 75 and 84%, respectively. The diagnostic accuracy achieved in this clinical trial was lower than that achieved in a previous experimental trial of the same system. In total, the performance of a decision-support system for melanoma diagnosis under real-life conditions is lower than that expected from experimental data and depends upon the physicians who are using the system.
    Article · Apr 2009 · Melanoma Research
  • Stephan Dreiseitl · Melanie Osl
    ABSTRACT: The process of feature selection is an important first step in building machine learning models. Feature selection algorithms can be grouped into wrappers and filters; the former use machine learning models to evaluate feature sets, the latter use other criteria to evaluate features individually. We present a new approach to feature selection that combines advantages of both wrapper and filter approaches, by using logistic regression and the area under the ROC curve (AUC) to evaluate pairs of features. After choosing as starting feature the one with the highest individual discriminatory power, we incrementally rank features by choosing as next feature the one that achieves the highest AUC in combination with an already chosen feature. To evaluate our approach, we compared it to standard filter and wrapper algorithms. Using two data sets from the biomedical domain, we are able to demonstrate that the performance of our approach exceeds that of filter methods, while being comparable to wrapper methods at smaller computational cost.
    Conference Paper · Feb 2009
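The incremental ranking described in the abstract above can be sketched as a greedy loop. This is a minimal illustration under assumptions, not the authors' code: the AUC values, which the abstract obtains from logistic regression models, are supplied here through a hypothetical lookup table and callable, and all names and numbers are made up.

```python
def rank_features(features, single_auc, pair_auc):
    """Greedy ranking driven by single-feature and pairwise AUC scores.

    single_auc: dict mapping a feature to its individual AUC.
    pair_auc:   callable (f, g) -> AUC of a model using both features.
    """
    remaining = list(features)
    # Start with the feature of highest individual discriminatory power.
    first = max(remaining, key=lambda f: single_auc[f])
    ranking = [first]
    remaining.remove(first)
    while remaining:
        # Next feature: best AUC in combination with any already chosen one.
        best = max(remaining,
                   key=lambda f: max(pair_auc(f, g) for g in ranking))
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Hypothetical scores for three features "a", "b", "c".
SINGLE = {"a": 0.80, "b": 0.75, "c": 0.60}
PAIR = {frozenset("ab"): 0.81, frozenset("ac"): 0.90, frozenset("bc"): 0.70}

ranking = rank_features(["a", "b", "c"], SINGLE,
                        lambda f, g: PAIR[frozenset((f, g))])
```

In this toy table, "c" is weak on its own but strong next to "a", so it is ranked ahead of "b"; a purely individual filter ranking would place it last.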
  • Christian Scheibböck · Stephan Dreiseitl · Michael Binder
    Conference Paper · Jan 2009
  • Melanie Osl · Christian Baumgartner · Bernhard Tilg · Stephan Dreiseitl
    ABSTRACT: Classifiers based on parametric or non-parametric learning methods have different advantages and disadvantages. To take advantage of the strengths of both methods, we propose an algorithm that combines a parametric model (logistic regression) with a non-parametric classification method (k-nearest neighbors). This combination is based on a measure of appropriateness that uses a heuristic to decide which of the two components should contribute more to the final classification output. We measure the performance of this combination method on two data sets (one from medical informatics, and one consisting of simulated data) in terms of areas under the ROC curves (AUCs). We are able to demonstrate that our method of combining classifiers exceeds the performance of both individual classifiers taken separately.
    Conference Paper · Dec 2008
  • Melanie Osl · Lucila Ohno-Machado · Christian Baumgartner · [...] · Stephan Dreiseitl
    ABSTRACT: Objective: To improve the calibration of logistic regression (LR) estimates using local information. Background: Individualized risk assessment tools are increasingly being utilized. External validation of these tools often reveals poor model calibration. Methods: We combine a clustering algorithm with an LR model to produce probability estimates that are close to the true probabilities for a particular case. The new method is compared to a standard LR model in terms of calibration, as measured by the sum of absolute differences (SAD) between model estimates and true probabilities, and discrimination, as measured by area under the ROC curve (AUC). Results: We evaluate the new method on two synthetic data sets. SADs are significantly lower (p < 0.0001) in both data sets, and AUCs are significantly higher in one data set (p < 0.01). Conclusion: The results suggest that the proposed method may be useful to improve the calibration of LR models.
    Article · Nov 2008 · AMIA Annual Symposium Proceedings
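The two evaluation measures named in the last abstract, the sum of absolute differences (SAD) and the area under the ROC curve (AUC), can be written down compactly. The sketch below uses the standard pairwise (Mann-Whitney) formulation of the AUC; the function names and the sample numbers are illustrative assumptions, not data from the paper.

```python
def sum_abs_diff(estimates, true_probs):
    """SAD between model estimates and known true probabilities."""
    return sum(abs(e - t) for e, t in zip(estimates, true_probs))


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical values for illustration only.
sad = sum_abs_diff([0.2, 0.7], [0.25, 0.6])
area = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

SAD needs the true probabilities and is therefore only usable on synthetic data, which matches the evaluation setting described in the abstract.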