Stephan Dreiseitl

Fachhochschule Oberösterreich, Wels, Upper Austria, Austria

Publications (21) · 20.29 Total impact

  • Article: Testing the calibration of classification models from first principles.
    Stephan Dreiseitl, Melanie Osl
    ABSTRACT: The accurate assessment of the calibration of classification models is severely limited by the fact that there is no easily available gold standard against which to compare a model's outputs. The usual procedures group expected and observed probabilities, and then perform a χ² goodness-of-fit test. We propose an entirely new approach to calibration testing that can be derived directly from the first principles of statistical hypothesis testing. The null hypothesis is that the model outputs are correct, i.e., that they are good estimates of the true unknown class membership probabilities. Our test calculates a p-value by checking how (im)probable the observed class labels are under the null hypothesis. We demonstrate by experiments that our proposed test performs comparably to, and sometimes even better than, the Hosmer-Lemeshow goodness-of-fit test, the de facto standard in calibration assessment.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:164-9.
  • Article: Differences in examination characteristics of pigmented skin lesions: results of an eye tracking study.
    Stephan Dreiseitl, Maja Pivec, Michael Binder
    ABSTRACT: To use computer-based eye tracking technology to record and evaluate examination characteristics of the diagnosis of pigmented skin lesions. 16 study participants with varying levels of diagnostic expertise (little, intermediate, superior) were recorded while diagnosing a series of 28 digital images of pigmented skin lesions, obtained by non-invasive digital dermatoscopy, on a computer screen. Eye tracking hardware recorded the gaze track and fixations of the physicians while they examined the lesion images. Analysis of variance was used to test for differences in examination characteristics between physicians grouped according to expertise. There were no significant differences between physicians with little and intermediate levels of expertise in terms of average time until diagnosis (6.61 vs. 6.19s), gaze track length (6.65 vs. 6.15 kilopixels), number of fixations (23.1 vs. 19.1), and time in fixations (4.91 vs. 4.17s). The experts were significantly different with 3.17s time until diagnosis, 4.53 kilopixels gaze track length, 9.9 fixations, and 1.74s in fixations, respectively. Differentiation between benign and malignant lesions had no effect on examination measurements. The results show that experience level has a significant impact on the way in which lesion images are examined. This finding can be used to construct decision support systems that employ important diagnostic features identified by experts, and to optimize teaching for less experienced physicians.
    Artificial intelligence in medicine 12/2011; 54(3):201-5. · 1.65 Impact Factor
  • Conference Proceeding: Effects of Data Grouping on Calibration Measures of Classifiers.
    Stephan Dreiseitl, Melanie Osl
    Computer Aided Systems Theory - EUROCAST 2011 - 13th International Conference, Las Palmas de Gran Canaria, Spain, February 6-11, 2011, Revised Selected Papers, Part I; 01/2011
  • Article: An evaluation of heuristics for rule ranking.
    ABSTRACT: To evaluate and compare the performance of different rule-ranking algorithms for rule-based classifiers on biomedical datasets. Empirical evaluation of five rule ranking algorithms on two biomedical datasets, with performance evaluation based on ROC analysis and 5 × 2 cross-validation. On a lung cancer dataset, the area under the ROC curve (AUC) of, on average, 14267.1 rules was 0.862. Multi-rule ranking found 13.3 rules with an AUC of 0.852. Four single-rule ranking algorithms, using the same number of rules, achieved average AUC values of 0.830, 0.823, 0.823, and 0.822, respectively. On a prostate cancer dataset, an average of 339265.3 rules had an AUC of 0.934, while 9.4 rules obtained from multi-rule and single-rule rankings had average AUCs of 0.932, 0.926, 0.925, 0.902 and 0.902, respectively. Multi-variate rule ranking performs better than the single-rule ranking algorithms. Both single-rule and multi-rule methods are able to substantially reduce the number of rules while keeping classification performance at a level comparable to the full rule set.
    Artificial intelligence in medicine 05/2010; 50(3):175-80. · 1.65 Impact Factor
  • Article: Outlier Detection with One-Class SVMs: An Application to Melanoma Prognosis.
    ABSTRACT: Medical diagnosis and prognosis using machine learning methods is usually represented as a supervised classification problem, where a model is built to distinguish "normal" from "abnormal" cases. If cases are available from only one class, this approach is not feasible. To evaluate the performance of classification via outlier detection by one-class support vector machines (SVMs) as a means of identifying abnormal cases in the domain of melanoma prognosis. Empirical evaluation of one-class SVMs on a data set for predicting the presence or absence of metastases in melanoma patients, and comparison with regular SVMs and artificial neural networks. One-class SVMs achieve an area under the ROC curve (AUC) of 0.71; two-class algorithms achieve AUCs between 0.5 and 0.84, depending on the available number of cases from the minority class. One-class SVMs offer a viable alternative to two-class classification algorithms if class distribution is heavily imbalanced.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2010; 2010:172-6.
  • Article: Effect of data combination on predictive modeling: a study using gene expression data.
    ABSTRACT: The quality of predictive modeling in biomedicine depends on the amount of data available for model building. To study the effect of combining microarray data sets on feature selection and predictive modeling performance. Empirical evaluation of stability of feature selection and discriminatory power of classifiers using three previously published gene expression data sets, analyzed both individually and in combination. Feature selection was not robust for either the individual or the combined data sets. The classification performance of models built on individual and combined data sets was heavily dependent on the data set from which the features were extracted. We identified volatility of feature selection as a contributing factor to some of the problems faced by predictive modeling using microarray data.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2010; 2010:567-71.
  • Article: Computer versus human diagnosis of melanoma: evaluation of the feasibility of an automated diagnostic system in a prospective clinical trial.
    ABSTRACT: The aim of this study was to evaluate the accuracy of a computer-based system for the automated diagnosis of melanoma in the hands of nonexpert physicians. We performed a prospective comparison between nonexperts using computer assistance and experts without assistance in the setting of a tertiary referral center at a University hospital. Between February and November 2004 we enrolled 511 consecutive patients. Each patient was examined by two nonexpert physicians with low to moderate diagnostic skills who were allowed to use a neural network-based diagnostic system at their own discretion. Every patient was also examined by an expert dermatologist using standard dermatoscopy equipment. The nonexpert physicians used the automatic diagnostic system in 3827 pigmented skin lesions. In their hands, the system achieved a sensitivity of 72% and a specificity of 82%. The sensitivity was significantly lower than that of the expert physician (72 vs. 96%, P = 0.001), whereas the specificity was significantly higher (82 vs. 72%, P<0.01). Three melanomas were missed because the physicians who operated the system did not choose them for examination. The system as a stand-alone device had an average discriminatory power of 0.87, as measured by the area under the receiver operating characteristic curve, with optimal sensitivities and specificities of 75 and 84%, respectively. The diagnostic accuracy achieved in this clinical trial was lower than that achieved in a previous experimental trial of the same system. In total, the performance of a decision-support system for melanoma diagnosis under real-life conditions is lower than that expected from experimental data and depends upon the physicians who are using the system.
    Melanoma research 04/2009; 19(3):180-4. · 2.06 Impact Factor
  • Article: Demoting redundant features to improve the discriminatory ability in cancer data.
    Journal of Biomedical Informatics. 01/2009; 42:721-725.
  • Conference Proceeding: Feature Selection Based on Pairwise Classification Performance.
    Stephan Dreiseitl, Melanie Osl
    Computer Aided Systems Theory - EUROCAST 2009, 12th International Conference, Las Palmas de Gran Canaria, Spain, February 15-20, 2009, Revised Selected Papers; 01/2009
  • Conference Proceeding: Applicability of Mobile Phones for Tele-dermatology - A Pilot Study.
    Proceedings of the Second International Conference on Health Informatics, HEALTHINF 2009, Porto, Portugal, January 14-17, 2009; 01/2009
  • Article: A new rule-based algorithm for identifying metabolic markers in prostate cancer using tandem mass spectrometry.
    ABSTRACT: Prostate cancer is the most prevalent tumor in males and its incidence is expected to increase as the population ages. Prostate cancer is treatable by excision if detected at an early enough stage. The challenges of early diagnosis require the discovery of novel biomarkers and tools for prostate cancer management. We developed a novel feature selection algorithm termed associative voting (AV) for identifying biomarker candidates in prostate cancer data measured via targeted metabolite profiling MS/MS analysis. We benchmarked our algorithm against two standard entropy-based and correlation-based feature selection methods [Information Gain (IG) and ReliefF (RF)] and observed that, on a variety of classification tasks in prostate cancer diagnosis, our algorithm identified subsets of biomarker candidates that are both smaller and show higher discriminatory power than the subsets identified by IG and RF. A literature study confirms that the highest ranked biomarker candidates identified by AV have independently been identified as important factors in prostate cancer development. The algorithm can be downloaded from http://biomed.umit.at/page.cfm?pageid=516.
    Bioinformatics 10/2008; 24(24):2908-14. · 5.47 Impact Factor
  • Article: Improving calibration of logistic regression models by local estimates.
    ABSTRACT: Objective: To improve the calibration of logistic regression (LR) estimates using local information. Background: Individualized risk assessment tools are increasingly being utilized. External validation of these tools often reveals poor model calibration. Methods: We combine a clustering algorithm with an LR model to produce probability estimates that are close to the true probabilities for a particular case. The new method is compared to a standard LR model in terms of calibration, as measured by the sum of absolute differences (SAD) between model estimates and true probabilities, and discrimination, as measured by area under the ROC curve (AUC). Results: We evaluate the new method on two synthetic data sets. SADs are significantly lower (p < 0.0001) in both data sets, and AUCs are significantly higher in one data set (p < 0.01). Conclusion: The results suggest that the proposed method may be useful to improve the calibration of LR models.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2008;
  • Article: Applying a decision support system in clinical practice: results from melanoma diagnosis.
    ABSTRACT: The work reported in this paper investigates the use of a decision-support tool for the diagnosis of pigmented skin lesions in a real-world clinical trial with 511 patients and 3827 lesion evaluations. We analyzed a number of outcomes of the trial, such as direct comparison of system performance in laboratory and clinical setting, the performance of physicians using the system compared to a control dermatologist without the system, and repeatability of system recommendations. The results show that system performance was significantly lower in the real-world setting than in the laboratory setting (c-index of 0.87 vs. 0.94, p = 0.01). Dermatologists using the system achieved a combined sensitivity of 85% and combined specificity of 95%. We also show that the process of acquiring lesion images using digital dermoscopy devices needs to be standardized before sufficiently high repeatability of measurements can be assured.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 02/2007;
  • Conference Proceeding: A data warehouse for prostate cancer biomarker discovery.
    International Conference on Bioinformatics & Computational Biology, BIOCOMP 2007, Volume II, June 25-28, 2007, Las Vegas Nevada, USA; 01/2007
  • Article: Approximation properties of haplotype tagging.
    ABSTRACT: Single nucleotide polymorphisms (SNPs) are locations at which the genomic sequences of population members differ. Since these differences are known to follow patterns, disease association studies are facilitated by identifying SNPs that allow the unique identification of such patterns. This process, known as haplotype tagging, is formulated as a combinatorial optimization problem and analyzed in terms of complexity and approximation properties. It is shown that the tagging problem is NP-hard but approximable within 1 + ln((n² − n)/2) for n haplotypes, but not approximable within (1 − ε) ln(n/2) for any ε > 0 unless NP ⊆ DTIME(n^(log log n)). A simple, very easily implementable algorithm that exhibits the above upper bound on solution quality is presented. This algorithm has running time O(np/2(2m − p + 1)) ≤ O(m(n² − n)/2), where p ≤ min(n, m), for n haplotypes of size m. As we show that the approximation bound is asymptotically tight, the algorithm presented is optimal with respect to this asymptotic bound. The haplotype tagging problem is hard, but approachable with a fast, practical, and surprisingly simple algorithm that cannot be significantly improved upon on a single processor machine. Hence, significant improvement in computational efforts expended can only be expected if the computational effort is distributed and done in parallel.
    BMC Bioinformatics 01/2006; 7:8. · 2.75 Impact Factor
  • Article: Nomographic representation of logistic regression models: a case study using patient self-assessment data.
    ABSTRACT: Logistic regression models are widely used in medicine, but difficult to apply without the aid of electronic devices. In this paper, we present a novel approach to represent logistic regression models as nomograms that can be evaluated by simple line drawings. As a case study, we show how data obtained from a questionnaire-based patient self-assessment study on the risks of developing melanoma can be used to first identify a subset of significant covariates, build a logistic regression model, and finally transform the model to a graphical format. The advantage of the nomogram is that it can easily be mass-produced, distributed and evaluated, while providing the same information as the logistic regression model it represents.
    Journal of Biomedical Informatics 11/2005; 38(5):389-94. · 1.79 Impact Factor
  • Article: Do physicians value decision support? A look at the effect of decision support systems on physician opinion.
    Stephan Dreiseitl, Michael Binder
    ABSTRACT: Clinical decision support systems are on the verge of becoming routine software tools in clinical settings. We investigate the question of how physicians react when faced with decision support suggestions that contradict their own diagnoses. We used a study design involving 52 volunteer dermatologists who each rated the malignancy of 25 lesion images on an ordinal scale and gave a dichotomous excise/no excise recommendation for each lesion image. After seeing the system's rating and excise suggestions, the physicians could revise their initial recommendations. We observed that in 24% of the cases in which the physicians' diagnoses did not match those of the decision support system, the physicians changed their diagnoses. There was a slight but significant negative correlation between susceptibility to change and experience level of the physicians. Physicians were significantly less likely to follow the decision system's recommendations when they were confident of their initial diagnoses. No differences between the physicians' inclinations to following excise versus no excise recommendations could be observed. These results indicate that physicians are quite susceptible to accepting the recommendations of decision support systems, and that quality assurance and validation of such systems is therefore of paramount importance.
    Artificial Intelligence in Medicine 02/2005; 33(1):25-30. · 1.35 Impact Factor
  • Article: A Comparison of Machine Learning Methods for the Diagnosis of Pigmented Skin Lesions
    ABSTRACT: We analyze the discriminatory power of k-nearest neighbors, logistic regression, artificial neural networks (ANNs), decision trees, and support vector machines (SVMs) on the task of classifying pigmented skin lesions as common nevi, dysplastic nevi, or melanoma. Three different classification tasks were used as benchmarks: the dichotomous problem of distinguishing common nevi from dysplastic nevi and melanoma, the dichotomous problem of distinguishing melanoma from common and dysplastic nevi, and the trichotomous problem of correctly distinguishing all three classes. Using ROC analysis to measure the discriminatory power of the methods shows that excellent results for specific classification problems in the domain of pigmented skin lesions can be achieved with machine-learning methods. On both dichotomous and trichotomous tasks, logistic regression, ANNs, and SVMs performed on about the same level, with k-nearest neighbors and decision trees performing worse.
    Journal of Biomedical Informatics 03/2001; · 1.79 Impact Factor
  • Article: Self-Organizing Maps for Visualization of Medical Data Sets
    Stephan Dreiseitl, Lucila Ohno-Machado
  • Article: Clinical Data Processing Tools: A Machine Learning Resource