Data Mining with Combined Use of Optimization Techniques and Self-Organizing Maps for Improving Risk Grouping Rules: Application to Prostate Cancer Patients

Abstract

Data mining techniques provide a popular and powerful tool set for generating data-driven classification systems. In this paper, we investigate the combined use of self-organizing maps (SOM) and nonsmooth nonconvex optimization techniques to produce a working case of a data-driven risk classification system. The optimization approach strengthens the validity of the SOM results, and the improved classification system increases both the quality of prediction and the homogeneity within the risk groups. Accurate classification of prostate cancer patients into risk groups is important for identifying appropriate treatment paths. We start with the existing rules and aim to improve classification accuracy by using self-organizing maps as a data visualization tool to identify inconsistencies. We then proceed to assign prostate cancer patients to homogeneous groups, with the aim of supporting future clinical treatment decisions. Using the case of grouping prostate cancer patients, we demonstrate the strong potential of data-driven risk classification schemes for addressing risk grouping issues in more general organizational settings.
... There are numerous studies that have used unsupervised learning techniques in general. A few notable ones include Zheng et al. [96], Bockstedt and Goh [7], Visa et al. [85], Churilov et al. [12], Guo et al. [26], and Ivanova and Scholz [34]. Zheng et al. [96] use a semi-supervised ensemble learning embedded with independent component analysis to identify highly influential reviewers. ...
... Bockstedt and Goh [7] use unsupervised learning techniques to identify common seller strategies for the use of discretionary auction attributes. Churilov et al. [12] use undirected knowledge discovery (unsupervised learning methods) to group the patients. Guo et al. [26] explore unsupervised deep learning for personalized point-of-interest recommendation. ...
Article
Online reviews play a significant role in influencing decisions made by users in day-to-day life. The presence of reviewers who deliberately post fake reviews for financial or other gains, however, negatively impacts both users and businesses. Unfortunately, automatically detecting such reviewers is a challenging problem since fake reviews do not seem out-of-place next to genuine reviews. In this paper, we present a fully unsupervised approach to detect anomalous behavior in online reviewers. We propose a novel hierarchical approach for this task in which we (1) derive distributions for key features that define reviewer behavior, and (2) combine these distributions into a finite mixture model. Our approach is highly generalizable and it allows us to seamlessly combine both univariate and multivariate distributions into a unified anomaly detection system. Most importantly, it requires no explicit labeling (spam/not spam) of the data. Our newly developed approach outperforms prior state-of-the-art unsupervised anomaly detection approaches.
... Prostate cancer is a type of cancer in which cancer cells grow uncontrollably in prostate tissue. It is the second most common cancer among men after lung cancer, ranking first in developed countries [1]. Many prostate tumors spread slowly and remain confined to the prostate gland, where they may not cause serious harm. ...
Article
Introduction: Prostate cancer is one of the leading causes of death in men, and early detection can be a significant factor in controlling and managing it. Data mining techniques can extract hidden knowledge from large amounts of data and can help physicians diagnose this disease. This study aims to determine the algorithm with the best performance for diagnosing prostate cancer. Methods: Nine data mining techniques (Support Vector Machine, Decision Tree, Naive Bayes, K-Nearest Neighbors, Neural Network, Random Forest, Deep Learning, Auto-MLP, and Rule Induction) were used to extract hidden patterns from prostate cancer data. The data of 100 patients, comprising eight characteristics, were used, and the RapidMiner Studio environment was employed for modeling. To compare the performance of these approaches, accuracy, recall, precision, AUC, sensitivity, and specificity were calculated and reported for all techniques. Results: The accuracy of the applied algorithms was between 77% and 84%. Across the evaluation criteria, the K-Nearest Neighbors and Neural Network algorithms performed better than the other methods, each with an accuracy of 84%. Sensitivity was 80% for Neural Network and 85% for K-Nearest Neighbors. Conclusion: Data mining techniques can uncover hidden patterns in large amounts of prostate cancer data, which can support early diagnosis of the disease and reduce subsequent costs.
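The evaluation measures named in the abstract above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the study. Recall and sensitivity are the same quantity, so only one is computed. The sketch assumes every denominator is nonzero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return common diagnostic metrics from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives.
    Assumes every denominator below is nonzero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "precision": tp / (tp + fp),       # share of predicted positives that are real
        "sensitivity": tp / (tp + fn),     # also called recall
        "specificity": tn / (tn + fp),     # share of real negatives that are detected
    }


# Hypothetical counts for 100 patients (illustrative only, not the study's data).
metrics = binary_metrics(tp=34, fp=6, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is omitted because it depends on ranked prediction scores rather than on a single confusion matrix.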
... In doing so, we build our research endeavor upon "the idea that research can start with data or data-driven discoveries, rather than with theory" (Müller et al., 2016, p. 291). Several researchers already applied this strategy in service science (Antons & Breidbach, 2018), medicine (Churilov et al., 2005), financial analysis (Sung et al., 1999), and even fishery (Syed & Weber, 2018), to name just a few. We instantiate our data-driven approach to acquire interpretable results, instead of reaching the highest accuracy possible. ...
Article
While the Information Systems (IS) discipline has researched digital platforms extensively, the body of knowledge appertaining to platforms still appears fragmented and lacking conceptual consistency. Based on automated text mining and unsupervised machine learning, we collect, analyze, and interpret the IS discipline’s comprehensive research on platforms—comprising 11,049 papers spanning 44 years of research activity. From a cluster analysis concerning platform concepts’ semantically most similar words, we identify six research streams on platforms, each with their own platform terms. Based on interpreting the identified concepts vis-à-vis the extant research and considering a temporal perspective on the concepts’ application, we present a lexicon of platform concepts, to guide further research on platforms in the IS discipline. Researchers and managers can build on our results to position their work appropriately, applying a specific theoretical perspective on platforms in isolation or combining multiple perspectives to study platform phenomena at a more abstract level.
... The patterns of SOM in a high-dimensional input space are originally very complicated. When projected on a graphical map display, its structure, after clustering, turns out to be not only understandable but more transparent as well [12]. ...
Conference Paper
A case study is presented that applies the RFM (recency, frequency, and monetary) model and clustering techniques in the electronic commerce sector in order to evaluate customers' values. The self-organizing map (SOM) method is first used to determine the best number of clusters, and the K-means method is then applied to classify 730 customers into eight clusters, with R, F, and M as the segmenting variables. Effective marketing strategies are then developed for each cluster. The average values of R, F, and M are computed for each cluster and for all customers, and the RFM variables whose cluster averages exceed the overall averages are identified. The results show that cluster 7 is the most important cluster because its average R, F, and M values are higher than the overall averages. In summary, the purpose of this case study is customer segmentation using the RFM model and clustering algorithms (SOM and K-means) to identify loyal and profitable customers, for maximum benefit and a win-win situation.
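The comparison of cluster averages against the overall average described above can be sketched in a few lines. This is an illustrative sketch only: it assumes cluster labels have already been assigned (for example by K-means), the function names and the five sample customers are hypothetical, and the study's actual data and scoring are not reproduced here.

```python
from collections import defaultdict


def cluster_rfm_means(customers):
    """customers: list of (cluster_id, r, f, m) tuples.

    Returns (overall_mean, {cluster_id: cluster_mean}), where each mean
    is an (R, F, M) tuple.
    """
    n = len(customers)
    overall = tuple(sum(c[k] for c in customers) / n for k in (1, 2, 3))
    groups = defaultdict(list)
    for cluster_id, r, f, m in customers:
        groups[cluster_id].append((r, f, m))
    means = {
        cid: tuple(sum(col) / len(rows) for col in zip(*rows))
        for cid, rows in groups.items()
    }
    return overall, means


def above_average_clusters(customers):
    """Return the clusters whose R, F, and M means all exceed the overall means."""
    overall, means = cluster_rfm_means(customers)
    return sorted(
        cid for cid, mu in means.items()
        if all(a > b for a, b in zip(mu, overall))
    )


# Hypothetical, already-clustered customers: (cluster_id, R, F, M).
customers = [
    (1, 2, 1, 10.0),
    (1, 4, 3, 30.0),
    (2, 8, 6, 90.0),
    (2, 10, 8, 110.0),
    (3, 6, 2, 20.0),
]
print(above_average_clusters(customers))
```

The sketch treats a higher score as better for all three variables, which matches the way the abstract describes the most important cluster.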
... The patterns of SOM in a high-dimensional input space are originally very complicated. When projected on a graphical map display, its structure, after clustering, turns out to be not only understandable but more transparent as well [28]. ...
Article
Given the increase in the number of e-commerce sites, the number of competitors has become very important. This means that companies have to take appropriate decisions in order to meet the expectations of their customers and satisfy their needs. In this paper, we present a case study that applies the LRFM (length, recency, frequency, and monetary) model and clustering techniques in the electronic commerce sector, with a view to evaluating the customer values of Moroccan e-commerce websites and then developing effective marketing strategies. To achieve these objectives, we adopt the LRFM model and apply a two-stage clustering method. In the first stage, the self-organizing map method is used to determine the best number of clusters and the initial centroids. In the second stage, the k-means method is applied to segment 730 customers into nine clusters according to their L, R, F, and M values. The results show that cluster 6 is the most important cluster because its average L, R, F, and M values are higher than the overall averages. In addition, this study considers another variable, the mode of payment used by customers, to improve and strengthen the cluster analysis. The cluster analysis demonstrates that the payment method is one of the key indicators of a new index that makes it possible to assess the level of customers' confidence in the company's website.
Article
This study proposes a novel framework for designing business rule analytics to assist businesses offering digital content in effectively converting free-only users (FOUs) into paying customers. Based on the theory of expected utility, we expand upon traditional frequency-driven rule analytics by integrating three business-relevant factors (target size, conversion profit, and conversion likelihood) into the process of generating recommendations for FOUs in digital content markets. The framework was tested using two different types of empirical analysis. We conducted a field experiment collaborating with a nationwide e-book store to determine how FOUs responded to the recommendations generated under the proposed framework. Furthermore, we analyzed over 5 million transactions collected from the e-book seller and a mobile application provider to examine the impact of customer segmentation on the effectiveness of our approach. Our findings suggest that business analytics derived from the utility-based mechanisms can significantly enhance digital content providers' business performance.
Article
While design science research is contending its position in the information systems community, there is a lack of transparency regarding the recent and impactful information systems design science research (IS-DSR) papers. This arguably poses challenges to an informed discourse and limits our ability to communicate progress achieved by IS-DSR. After providing a map of the impactful IS-DSR papers, we therefore develop a scientometric study to address the lack of insights into factors that affect the scientific impact of IS-DSR papers published in top IS journals. In this study, we focus on IS-specific, active research areas of IS-DSR and consider papers published in the AIS Senior Scholars' Basket of Journals between 2004 and 2014. Specifically, we develop a model that explores a set of factors that affect the scientific impact of IS-DSR papers. Our findings show that scientific impact is significantly explained by theorization and novelty. We discuss the implications of our work and derive recommendations intended to shape future knowledge creation in IS-DSR.
Conference Paper
The main objective of this paper is to introduce a hybrid model of Self-Organizing Maps (SOM) and C4.5 that reduces costs while maintaining acceptable diagnostic performance. In this hybrid model, SOM is used first to form clusters, and then C4.5 trees specific to each cluster are constructed. The proposed hybrid model is tested on multiclass Thyroid Data and compared to a standalone C4.5 tree. Costs were reduced by 22%–27%, and performance varied between 88% and 97% in terms of accuracy and between 90% and 97% in terms of sensitivity. The cost and performance differences between the hybrid model and standalone C4.5 were found to be statistically significant according to the Wilcoxon signed-rank test.
Article
Much of the research on Business Intelligence (BI) has examined the ability of BI systems to help organizations address challenges and opportunities. However, the literature is fragmented and lacks an overarching framework to integrate findings and systematically guide research. Moreover, researchers and practitioners continue to question the value of BI systems. This study reviews and synthesizes empirical Information System (IS) studies to learn what we know, how well we know, and what we need to know about the processes of organizations obtaining business value from BI systems. The study aims to identify which parts of the BI business value process have been studied and are still most in need of research, and to propose specific research questions for the future. The findings show that organizations appear to obtain value from BI systems according to the process suggested by Soh and Markus (1995), as a chain of necessary conditions from BI investments to BI assets to BI impacts to organizational performance; however, researchers have not sufficiently studied the probabilistic processes that link the necessary conditions together. Moreover, the research has not sufficiently covered all relevant levels of analysis, nor examined how the levels link up. Overall, the paper identified many opportunities for researchers to provide a more complete picture of how organizations can and do obtain value from BI.
Conference Paper
Data mining techniques provide a popular and powerful toolset to address both clinical and management issues in the area of health care. This paper describes a study of assigning prostate cancer patients to homogeneous groups with the aim of supporting future clinical treatment decisions. A model based on cluster analysis is suggested, and the application of non-smooth non-convex optimization techniques to solve this model is discussed. It is demonstrated that an optimization-based approach to mining a database of prostate cancer patients can generate a significant amount of new knowledge that can be effectively used to enhance clinical decision making.
Chapter
The feature selection problem involves the selection of a subset of features that will be sufficient for the determination of structures or clusters in a given dataset and in making predictions. This chapter presents an algorithm for feature selection, which is based on the methods of optimization. To verify the effectiveness of the proposed algorithm we applied it to a number of publicly available real-world databases. The results of numerical experiments are presented and discussed. These results demonstrate that the algorithm performs well on the datasets considered.
Article
We present numerical methods for minimizing nonsmooth functions presented as a difference of two Clarke regular functions. These methods are based on continuous approximations to the subdifferential and belong to the derivative free methods of nonsmooth optimization. Then, we consider an application of the proposed methods for the calculation of semi-equilibrium prices in exchange model of economics. Some results of numerical experiments are presented.
Book
From the Publisher: SOMs (Self-Organizing Maps) have proven to be an effective methodology for analyzing problems in finance and economics--including applications such as market analysis, financial statement analysis, prediction of bankruptcies, interest rates, and stock indices. This book covers real-world financial applications of neural networks, using the SOM approach, as well as introducing SOM methodology, software tools, and tips for processing. 106 illus. in color.
Article
Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.
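The k-Median Algorithm mentioned above is commonly described as alternating two steps: assign each point to its nearest center in the 1-norm, then move each center to the coordinate-wise median of its assigned points. The sketch below follows that general description and is not the authors' implementation; the function names, the toy points, and the starting centers are hypothetical.

```python
from statistics import median


def l1_distance(a, b):
    """1-norm (Manhattan) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_median(points, centers, max_iter=100):
    """Alternate nearest-center assignment and coordinate-wise median updates.

    Stops when the assignment no longer changes, which happens after a
    finite number of iterations, or after max_iter passes.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: l1_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignment is stable
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(median(col) for col in zip(*members))
    return centers, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = k_median(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

Using the median rather than the mean makes each center the minimizer of the summed 1-norm distances within its cluster, which is why the two steps never increase the objective.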
Article
This work contains a theoretical study and computer simulations of a new self-organizing process. The principal discovery is that in a simple network of adaptive physical elements which receives signals from a primary event space, the signal representations are automatically mapped onto a set of output responses in such a way that the responses acquire the same topological order as that of the primary events. In other words, a principle has been discovered which facilitates the automatic formation of topologically correct maps of features of observable events. The basic self-organizing system is a one- or two-dimensional array of processing units resembling a network of threshold-logic units, and characterized by short-range lateral feedback between neighbouring units. Several types of computer simulations are used to demonstrate the ordering process as well as the conditions under which it fails.
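As a rough illustration of the self-organizing process described above, the sketch below trains a one-dimensional array of units on scalar inputs. It uses the simplified update rule found in later textbook presentations (find the best-matching unit, then pull it and its array neighbours toward the input), with a Gaussian neighbourhood standing in for the lateral feedback described in the abstract. It is not the simulation from the original work, and the function name, parameters, and decay schedules are all assumptions.

```python
import math
import random


def train_som_1d(data, n_units=10, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 1-D SOM on scalar inputs and return the unit weights.

    lr0 and sigma0 are the initial learning rate and neighbourhood width;
    both decay linearly over training (an assumed schedule).
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            # Best-matching unit: the unit whose weight is closest to the input.
            bmu = min(range(n_units), key=lambda i: abs(weights[i] - x))
            for i in range(n_units):
                # Influence shrinks with distance from the BMU along the array.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[i] += lr * h * (x - weights[i])
            step += 1
    return weights


# Hypothetical input: 200 samples drawn uniformly from [0, 1).
data_rng = random.Random(1)
data = [data_rng.random() for _ in range(200)]
weights = train_som_1d(data)
print([round(w, 3) for w in weights])
```

In runs of this kind the weights typically end up sorted along the array, which is the topological ordering the abstract refers to, although a short training run is not guaranteed to reach it.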