Stephan Dreiseitl

Medical University of Vienna, Wien, Vienna, Austria

Publications (36) · 29.58 Total impact

  • ABSTRACT: We evaluated the accuracy of diagnoses made from pictures taken with the built-in cameras of mobile phones in a 'real-life' clinical setting. A total of 263 patients took part, who photographed their own lesions where possible, and provided clinical information via a questionnaire. After the teledermatology procedure, each patient was examined face-to-face and a gold standard diagnosis was made. The telemedicine data and pictures were diagnosed by 15 dermatologists. The 299 cases contained 1-22 clinical images each (median 3). Nine dermatologists finished all the cases and the remaining six completed some of them, thus providing 2893 decisions. Overall, 61% of all cases were rated as possible to diagnose and of those, 80% were correct in comparison with the face-to-face diagnosis. Image quality was evaluated and the median was 5 on a 10-point scale. There was a significant correlation between the correct diagnosis and the quality of the photographs taken (P < 0.001). In nearly two-thirds of all cases, a teledermatology diagnosis was possible; however, there was insufficient information to make a telemedicine diagnosis in about one-third of the cases. If applied carefully, mobile phones could be a powerful tool for people to optimize their health care status.
    Journal of telemedicine and telecare 01/2013; 19(4):213-8. · 0.92 Impact Factor
  • Stephan Dreiseitl, Melanie Osl
    ABSTRACT: The accurate assessment of the calibration of classification models is severely limited by the fact that there is no easily available gold standard against which to compare a model's outputs. The usual procedures group expected and observed probabilities, and then perform a χ² goodness-of-fit test. We propose an entirely new approach to calibration testing that can be derived directly from the first principles of statistical hypothesis testing. The null hypothesis is that the model outputs are correct, i.e., that they are good estimates of the true unknown class membership probabilities. Our test calculates a p-value by checking how (im)probable the observed class labels are under the null hypothesis. We demonstrate by experiments that our proposed test performs comparably to, and sometimes even better than, the Hosmer-Lemeshow goodness-of-fit test, the de facto standard in calibration assessment.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:164-9.
  • Stephan Dreiseitl, Maja Pivec, Michael Binder
    ABSTRACT: To use computer-based eye tracking technology to record and evaluate examination characteristics of the diagnosis of pigmented skin lesions. 16 study participants with varying levels of diagnostic expertise (little, intermediate, superior) were recorded while diagnosing a series of 28 digital images of pigmented skin lesions, obtained by non-invasive digital dermatoscopy, on a computer screen. Eye tracking hardware recorded the gaze track and fixations of the physicians while they examined the lesion images. Analysis of variance was used to test for differences in examination characteristics between physicians grouped according to expertise. There were no significant differences between physicians with little and intermediate levels of expertise in terms of average time until diagnosis (6.61 vs. 6.19s), gaze track length (6.65 vs. 6.15 kilopixels), number of fixations (23.1 vs. 19.1), and time in fixations (4.91 vs. 4.17s). The experts were significantly different with 3.17s time until diagnosis, 4.53 kilopixels gaze track length, 9.9 fixations, and 1.74s in fixations, respectively. Differentiation between benign and malignant lesions had no effect on examination measurements. The results show that experience level has a significant impact on the way in which lesion images are examined. This finding can be used to construct decision support systems that employ important diagnostic features identified by experts, and to optimize teaching for less experienced physicians.
    Artificial intelligence in medicine 12/2011; 54(3):201-5. · 1.65 Impact Factor
  • Melanie Osl, Stephan Dreiseitl
    ABSTRACT: Acute myocardial infarction is one of the most common cardiovascular diseases in the Western world. Fortunately, not all myocardial infarctions are fatal. By early diagnosis of acute myocardial infarction based on symptoms at a patient’s presentation in the emergency department, the number of deaths may be further reduced, as life-saving actions can be taken sooner. In this paper, we investigate the application of kernel-based methods to this problem, i.e. we evaluate the performance of support vector machines and kernel logistic regression models and compare these two methods to logistic regression models in terms of discrimination and calibration. The results show that kernel-based methods have higher discriminatory power for early diagnosis of acute myocardial infarction than logistic regression models and that kernel logistic regression models have superior calibration in comparison to logistic regression models and support vector machines.
  • Stephan Dreiseitl, Melanie Osl
    Computer Aided Systems Theory - EUROCAST 2011 - 13th International Conference, Las Palmas de Gran Canaria, Spain, February 6-11, 2011, Revised Selected Papers, Part I; 01/2011
  • ABSTRACT: To evaluate and compare the performance of different rule-ranking algorithms for rule-based classifiers on biomedical datasets. Empirical evaluation of five rule ranking algorithms on two biomedical datasets, with performance evaluation based on ROC analysis and 5 × 2 cross-validation. On a lung cancer dataset, the area under the ROC curve (AUC) of, on average, 14267.1 rules was 0.862. Multi-rule ranking found 13.3 rules with an AUC of 0.852. Four single-rule ranking algorithms, using the same number of rules, achieved average AUC values of 0.830, 0.823, 0.823, and 0.822, respectively. On a prostate cancer dataset, an average of 339265.3 rules had an AUC of 0.934, while 9.4 rules obtained from multi-rule and single-rule rankings had average AUCs of 0.932, 0.926, 0.925, 0.902 and 0.902, respectively. Multi-variate rule ranking performs better than the single-rule ranking algorithms. Both single-rule and multi-rule methods are able to substantially reduce the number of rules while keeping classification performance at a level comparable to the full rule set.
    Artificial intelligence in medicine 05/2010; 50(3):175-80. · 1.65 Impact Factor
  • ABSTRACT: The quality of predictive modeling in biomedicine depends on the amount of data available for model building. To study the effect of combining microarray data sets on feature selection and predictive modeling performance. Empirical evaluation of stability of feature selection and discriminatory power of classifiers using three previously published gene expression data sets, analyzed both individually and in combination. Feature selection was not robust for either the individual or the combined data sets. The classification performance of models built on individual and combined data sets was heavily dependent on the data set from which the features were extracted. We identified volatility of feature selection as a contributing factor to some of the problems faced by predictive modeling using microarray data.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2010; 2010:567-71.
  • ABSTRACT: Medical diagnosis and prognosis using machine learning methods is usually represented as a supervised classification problem, where a model is built to distinguish "normal" from "abnormal" cases. If cases are available from only one class, this approach is not feasible. To evaluate the performance of classification via outlier detection by one-class support vector machines (SVMs) as a means of identifying abnormal cases in the domain of melanoma prognosis. Empirical evaluation of one-class SVMs on a data set for predicting the presence or absence of metastases in melanoma patients, and comparison with regular SVMs and artificial neural networks. One-class SVMs achieve an area under the ROC curve (AUC) of 0.71; two-class algorithms achieve AUCs between 0.5 and 0.84, depending on the available number of cases from the minority class. One-class SVMs offer a viable alternative to two-class classification algorithms if class distribution is heavily imbalanced.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2010; 2010:172-6.
  • ABSTRACT: The identification of a set of relevant but not redundant features is an important first step in building predictive and diagnostic models from biomedical data sets. Most commonly, individual features are ranked in terms of a quality criterion, out of which the best (first) k features are selected. However, feature ranking methods do not sufficiently account for interactions and correlations between the features. Thus, redundancy is likely to be encountered in the selected features. We present a new algorithm, termed Redundancy Demoting (RD), that takes an arbitrary feature ranking as input, and improves this ranking by identifying redundant features and demoting them to positions in the ranking in which they are not redundant. Redundant features are those that are correlated with other features and not relevant in the sense that they do not improve the discriminatory ability of a set of features. Experiments on two cancer data sets, one melanoma image data set and one lung cancer microarray data set, show that our algorithm greatly improves the feature rankings provided by the methods information gain, ReliefF and Student's t-test in terms of predictive power.
    Journal of Biomedical Informatics 06/2009; 42(4):721-5. · 2.13 Impact Factor
  • ABSTRACT: The aim of this study was to evaluate the accuracy of a computer-based system for the automated diagnosis of melanoma in the hands of nonexpert physicians. We performed a prospective comparison between nonexperts using computer assistance and experts without assistance in the setting of a tertiary referral center at a university hospital. Between February and November 2004 we enrolled 511 consecutive patients. Each patient was examined by two nonexpert physicians with low to moderate diagnostic skills who were allowed to use a neural network-based diagnostic system at their own discretion. Every patient was also examined by an expert dermatologist using standard dermatoscopy equipment. The nonexpert physicians used the automatic diagnostic system in 3827 pigmented skin lesions. In their hands, the system achieved a sensitivity of 72% and a specificity of 82%. The sensitivity was significantly lower than that of the expert physician (72 vs. 96%, P = 0.001), whereas the specificity was significantly higher (82 vs. 72%, P < 0.01). Three melanomas were missed because the physicians who operated the system did not choose them for examination. The system as a stand-alone device had an average discriminatory power of 0.87, as measured by the area under the receiver operating characteristic curve, with optimal sensitivities and specificities of 75 and 84%, respectively. The diagnostic accuracy achieved in this clinical trial was lower than that achieved in a previous experimental trial of the same system. In total, the performance of a decision-support system for melanoma diagnosis under real-life conditions is lower than that expected from experimental data and depends upon the physicians who are using the system.
    Melanoma research 04/2009; 19(3):180-4. · 2.06 Impact Factor
  • Stephan Dreiseitl, Melanie Osl
    ABSTRACT: The process of feature selection is an important first step in building machine learning models. Feature selection algorithms can be grouped into wrappers and filters; the former use machine learning models to evaluate feature sets, the latter use other criteria to evaluate features individually. We present a new approach to feature selection that combines advantages of both wrapper and filter approaches, by using logistic regression and the area under the ROC curve (AUC) to evaluate pairs of features. After choosing as starting feature the one with the highest individual discriminatory power, we incrementally rank features by choosing as next feature the one that achieves the highest AUC in combination with an already chosen feature. To evaluate our approach, we compared it to standard filter and wrapper algorithms. Using two data sets from the biomedical domain, we are able to demonstrate that the performance of our approach exceeds that of filter methods, while being comparable to wrapper methods at smaller computational cost.
    Computer Aided Systems Theory - EUROCAST 2009, 12th International Conference, Las Palmas de Gran Canaria, Spain, February 15-20, 2009, Revised Selected Papers; 01/2009
  • Proceedings of the Second International Conference on Health Informatics, HEALTHINF 2009, Porto, Portugal, January 14-17, 2009; 01/2009
  • M. Osl, C. Baumgartner, B. Tilg, S. Dreiseitl
    ABSTRACT: Classifiers based on parametric or non-parametric learning methods have different advantages and disadvantages. To take advantage of the strengths of both methods, we propose an algorithm that combines a parametric model (logistic regression) with a non-parametric classification method (k-nearest neighbors). This combination is based on a measure of appropriateness that uses a heuristic to decide which of the two components should contribute more to the final classification output. We measure the performance of this combination method on two data sets (one from medical informatics, and one consisting of simulated data) in terms of areas under the ROC curves (AUCs). We are able to demonstrate that our method of combining classifiers exceeds the performance of both individual classifiers taken separately.
    Broadband Communications, Information Technology & Biomedical Applications, 2008 Third International Conference on; 12/2008
  • ABSTRACT: Prostate cancer is the most prevalent tumor in males and its incidence is expected to increase as the population ages. Prostate cancer is treatable by excision if detected at an early enough stage. The challenges of early diagnosis require the discovery of novel biomarkers and tools for prostate cancer management. We developed a novel feature selection algorithm termed associative voting (AV) for identifying biomarker candidates in prostate cancer data measured via targeted metabolite profiling MS/MS analysis. We benchmarked our algorithm against two standard entropy-based and correlation-based feature selection methods [Information Gain (IG) and ReliefF (RF)] and observed that, on a variety of classification tasks in prostate cancer diagnosis, our algorithm identified subsets of biomarker candidates that are both smaller and show higher discriminatory power than the subsets identified by IG and RF. A literature study confirms that the highest ranked biomarker candidates identified by AV have independently been identified as important factors in prostate cancer development. The algorithm can be downloaded from the following
    Bioinformatics 10/2008; 24(24):2908-14. · 4.62 Impact Factor
  • ABSTRACT: Objective: To improve the calibration of logistic regression (LR) estimates using local information. Background: Individualized risk assessment tools are increasingly being utilized. External validation of these tools often reveals poor model calibration. Methods: We combine a clustering algorithm with an LR model to produce probability estimates that are close to the true probabilities for a particular case. The new method is compared to a standard LR model in terms of calibration, as measured by the sum of absolute differences (SAD) between model estimates and true probabilities, and discrimination, as measured by area under the ROC curve (AUC). Results: We evaluate the new method on two synthetic data sets. SADs are significantly lower (p < 0.0001) in both data sets, and AUCs are significantly higher in one data set (p < 0.01). Conclusion: The results suggest that the proposed method may be useful to improve the calibration of LR models.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2008;
  • ABSTRACT: The work reported in this paper investigates the use of a decision-support tool for the diagnosis of pigmented skin lesions in a real-world clinical trial with 511 patients and 3827 lesion evaluations. We analyzed a number of outcomes of the trial, such as direct comparison of system performance in laboratory and clinical setting, the performance of physicians using the system compared to a control dermatologist without the system, and repeatability of system recommendations. The results show that system performance was significantly lower in the real-world setting than in the laboratory setting (c-index of 0.87 vs. 0.94, p = 0.01). Dermatologists using the system achieved a combined sensitivity of 85% and combined specificity of 95%. We also show that the process of acquiring lesion images using digital dermoscopy devices needs to be standardized before sufficiently high repeatability of measurements can be assured.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 02/2007;
  • International Conference on Bioinformatics & Computational Biology, BIOCOMP 2007, Volume II, June 25-28, 2007, Las Vegas Nevada, USA; 01/2007
  • Stephan Dreiseitl
    ABSTRACT: Receiver operating characteristic (ROC) curves are a plot of a ranking classifier’s true-positive rate versus its false-positive rate, as one varies the threshold between positive and negative classifications across the continuum. The area under the ROC curve offers a measure of the discriminatory power of machine learning algorithms that is independent of class distribution, via its equivalence to Mann-Whitney U-statistics. This measure has recently been extended to cover problems of discriminating three and more classes. In this case, the area under the curve generalizes to the volume under the ROC surface. In this paper, we show how a multi-class classifier can be trained by directly maximizing the volume under the ROC surface. This is accomplished by first approximating the discrete U-statistic that is equivalent to the volume under the surface in a continuous manner, and then maximizing this approximation by gradient ascent.
    Computer Aided Systems Theory - EUROCAST 2007, 11th International Conference on Computer Aided Systems Theory, Las Palmas de Gran Canaria, Spain, February 12-16, 2007, Revised Selected Papers; 01/2007
  • ABSTRACT: Single nucleotide polymorphisms (SNPs) are locations at which the genomic sequences of population members differ. Since these differences are known to follow patterns, disease association studies are facilitated by identifying SNPs that allow the unique identification of such patterns. This process, known as haplotype tagging, is formulated as a combinatorial optimization problem and analyzed in terms of complexity and approximation properties. It is shown that the tagging problem is NP-hard but approximable within 1 + ln((n² − n)/2) for n haplotypes but not approximable within (1 − ε) ln(n/2) for any ε > 0 unless NP ⊂ DTIME(n^(log log n)). A simple, very easily implementable algorithm that exhibits the above upper bound on solution quality is presented. This algorithm has running time O((np/2)(2m − p + 1)) ≤ O(m(n² − n)/2), where p ≤ min(n, m), for n haplotypes of size m. As we show that the approximation bound is asymptotically tight, the algorithm presented is optimal with respect to this asymptotic bound. The haplotype tagging problem is hard, but approachable with a fast, practical, and surprisingly simple algorithm that cannot be significantly improved upon on a single processor machine. Hence, significant improvement in computational efforts expended can only be expected if the computational effort is distributed and done in parallel.
    BMC Bioinformatics 01/2006; 7:8. · 2.67 Impact Factor
  • ABSTRACT: Logistic regression models are widely used in medicine, but difficult to apply without the aid of electronic devices. In this paper, we present a novel approach to represent logistic regression models as nomograms that can be evaluated by simple line drawings. As a case study, we show how data obtained from a questionnaire-based patient self-assessment study on the risks of developing melanoma can be used to first identify a subset of significant covariates, build a logistic regression model, and finally transform the model to a graphical format. The advantage of the nomogram is that it can easily be mass-produced, distributed and evaluated, while providing the same information as the logistic regression model it represents.
    Journal of Biomedical Informatics 11/2005; 38(5):389-94. · 2.48 Impact Factor
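
Many of the abstracts above report the area under the ROC curve (AUC), and the EUROCAST 2007 abstract notes its equivalence to the Mann-Whitney U-statistic. As a minimal sketch of that equivalence (not code from any of the listed papers; the function name `auc_mann_whitney` and the example scores are made up for illustration), the AUC can be computed as the fraction of (positive, negative) score pairs that a classifier ranks correctly, counting ties as half:

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """Estimate the AUC as the normalized Mann-Whitney U-statistic.

    scores_pos: classifier outputs for cases of the positive class
    scores_neg: classifier outputs for cases of the negative class
    Returns the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; tied pairs count as 0.5.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Hypothetical model outputs, chosen only to exercise the function.
    positives = [0.9, 0.8, 0.4]
    negatives = [0.7, 0.3, 0.2]
    print(auc_mann_whitney(positives, negatives))
```

Because the value depends only on the ordering of scores within pairs, it does not change when the ratio of positive to negative cases changes, which is the independence from class distribution that the abstract mentions. The quadratic pair loop is chosen for clarity; a rank-based computation would be preferable for large data sets.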

Publication Stats

511 Citations
29.58 Total Impact Points


  • 2013
    • Medical University of Vienna
      • Division of General Dermatology
      Wien, Vienna, Austria
  • 2002–2012
    • Fachhochschule Oberösterreich
      Wels, Upper Austria, Austria
  • 2008–2010
    • Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik
      • Institute for Electrical and Biomedical Engineering
      Hall in Tirol, Tyrol, Austria
    • FHG – Zentrum für Gesundheitsberufe Tirol
      Innsbruck, Tyrol, Austria
  • 2009
    • Fachhochschule für Gesundheit Gera
      Gera, Thuringia, Germany
  • 2000–2001
    • Massachusetts Institute of Technology
      • Division of Health Sciences and Technology
      Cambridge, Massachusetts, United States
    • Brigham and Women's Hospital
      • Department of Medicine
      Boston, MA, United States
  • 1999
    • Harvard Medical School
      • Harvard-MIT Division of Health Sciences and Technology
      Cambridge, MA, United States