ABSTRACT: Multicollinearity is the most challenging problem caused by the tendency of independent variables in regression analysis to be highly correlated. Multicollinearity reduces the reliability of the estimated regression coefficients. In this study, we introduce a way of deciding the correlation threshold that indicates the severity of multicollinearity. The approach is based on drawing a conflict graph and its minimum vertex cover over the multicollinear variables. The simulation results demonstrate that our proposed algorithm can provide an appropriate threshold for reducing large amounts of uncertainty in the estimated regression coefficients.
Submitted to INOC 2015 7th International Conference on Network Optimization; 12/2014
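The abstract above does not give the algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: variables whose absolute Pearson correlation exceeds a chosen threshold are joined by an edge in a conflict graph, and a vertex cover of that graph is a set of variables whose removal leaves no conflicting pair. The function names, the toy data, and the greedy heuristic (which does not guarantee a minimum cover) are all illustrative and not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def conflict_graph(columns, threshold):
    """Edges (a, b) for variable pairs with |correlation| above the threshold."""
    return {
        (a, b)
        for a, b in combinations(list(columns), 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    }


def greedy_vertex_cover(edges):
    """Heuristic cover (assumption: not the paper's method, not guaranteed minimum).

    Repeatedly removes the variable that touches the most remaining edges.
    """
    remaining = set(edges)
    cover = []
    while remaining:
        degree = {}
        for a, b in remaining:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        # Ties are broken by variable name so the result is deterministic.
        worst = max(sorted(degree), key=degree.get)
        cover.append(worst)
        remaining = {e for e in remaining if worst not in e}
    return cover


# Hypothetical toy data: x1, x2 and x4 move together, x3 does not.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],
    "x3": [5, 1, 4, 2, 3],
    "x4": [1.1, 2.0, 2.9, 4.2, 5.0],
}
edges = conflict_graph(data, threshold=0.9)
dropped = greedy_vertex_cover(edges)
kept = [name for name in data if name not in dropped]
```

Raising or lowering `threshold` changes how many edges appear in the graph and therefore how many variables the cover removes, which is the trade-off the abstract describes.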
ABSTRACT: Feature selection based on an ensemble classifier has been recognized as a crucial technique for modeling high-dimensional data. Feature selection based on the random forests model, which is constructed by aggregating multiple decision tree classifiers, has been widely used. However, a lack of stability and balance in decision trees decreases the robustness of random forests. This limitation motivated us to propose a feature selection method based on newly designed nearest-neighbor ensemble classifiers. The proposed method finds significant features by using an iterative procedure. We performed experiments with 20 datasets of microarray gene expressions to examine the property of the proposed method and compared it with random forests. The results demonstrated the effectiveness and robustness of the proposed method, especially when the number of features exceeds the number of observations.
Expert Systems with Applications 11/2014; · 1.97 Impact Factor
ABSTRACT: From the text: This special volume evolved out of the Institute for Operations Research and Management Sciences (INFORMS) 2009 Workshop on data mining and system informatics, held at the 2009 INFORMS annual meeting.
Annals of Operations Research 05/2014; 216(1). · 1.10 Impact Factor
ABSTRACT: Variable selection has been widely used in regression data mining not only to select informative variables, but also to simplify the statistical model. A computer experiment based optimization approach employs design of experiments and statistical modeling to represent a complex objective function that can only be evaluated pointwise by solving an optimization subproblem. In large-scale applications, the number of variables is huge, and direct use of computer experiments would require an exceedingly large experimental design and, consequently, significant computational effort. Typically, a large portion of the variables have little impact on the objective; thus, there is a need to eliminate these before performing the complete set of optimization subproblem computer experiments. Ideally, variable selection would be conducted after a small number of computer experiment runs, likely fewer runs (n) than the number of variables (p). Conventional variable selection techniques cannot be applied in this "large p and small n" problem. We explore the use of regression trees and a multiple testing procedure based on false discovery rate. Performance of the selected variables is measured using the coefficient of determination (R²) and relative errors. Two real world
Annals of Operations Research 05/2014; · 1.10 Impact Factor
ABSTRACT: Processes characterized by high-dimensional and mixture data challenge traditional statistical process control charts. In this study, we propose a multivariate control chart based on the Gower distance that can handle a mixture of continuous and categorical data. An extensive simulation study was conducted to examine the properties of the proposed control chart under various scenarios and to compare it with some existing multivariate control charts. The simulation results revealed that the proposed control chart outperformed the existing charts as the number of categorical variables increased. Furthermore, we demonstrated the applicability and effectiveness of the proposed control chart through a real case study.
Expert Systems with Applications 03/2014; 41(4):1701–1707. · 1.97 Impact Factor
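For readers unfamiliar with the Gower distance named in the abstract above, here is a small sketch of the standard equal-weight form: each continuous feature contributes its absolute difference scaled by that feature's range, each categorical feature contributes 0 on a match and 1 on a mismatch, and the contributions are averaged. How the paper turns this distance into a monitoring statistic is not stated in the abstract, so that part is left out; the function name and the sample values are assumptions for illustration.

```python
def gower_distance(a, b, ranges):
    """Equal-weight Gower distance between two mixed-type observations.

    ranges[k] is the observed range (max - min) of feature k when it is
    continuous, or None when feature k is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Continuous feature: range-scaled absolute difference.
            total += abs(x - y) / r
        # A continuous feature with zero range is constant and adds nothing.
    return total / len(ranges)


# Hypothetical observations: two continuous features and one categorical.
obs_a = (10.0, 2.0, "A")
obs_b = (20.0, 2.0, "B")
feature_ranges = (40.0, 5.0, None)
d = gower_distance(obs_a, obs_b, feature_ranges)  # (0.25 + 0 + 1) / 3
```

Because every per-feature term lies between 0 and 1, the result is also between 0 and 1 regardless of the units of the continuous features.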
ABSTRACT: We propose a new nonparametric multivariate control chart that integrates a novelty score. The proposed control chart uses as its monitoring statistic a hybrid novelty score, calculated based on the distance to local observations as well as on the distance to the convex hull constructed by its neighbors. The control limits of the proposed control chart were established based on a bootstrap method. A rigorous simulation study was conducted to examine the properties of the proposed control chart under various scenarios and compare it with existing multivariate control charts in terms of average run length (ARL) performance. The simulation results showed that the proposed control chart outperformed both the parametric and nonparametric Hotelling's T² control charts, especially in nonnormal situations. Moreover, experimental results with real semiconductor data demonstrated the applicability and effectiveness of the proposed control chart. To increase the capability to detect small mean shifts, we propose an exponentially weighted hybrid novelty score control chart. Simulation results indicated that the exponentially weighted hybrid novelty score charts outperformed the hybrid novelty score based control charts.
Communications in Statistics - Simulation and Computation 01/2014; 43(1). · 0.29 Impact Factor
ABSTRACT: This paper proposes a research framework for analyzing the voting activities of a national assembly on the basis of member-level voting similarity and provides a case study of the national assembly in South Korea. First, we propose a bill contentiousness measure that gives a higher score to bills for which ayes and noes are more diversified in both conservative and progressive parties. Based on the bill contentiousness measure, the top 5%, 10%, and 20% of bills were identified and used for further analyses. Moreover, we propose a member-level voting similarity measure that compensates for the lower frequency of noes, and evaluate the pair-wise voting similarities for all lawmakers. Then, voting similarity differences to the affiliated/non-affiliated parties were analyzed for the members of the two major parties according to some internal/external key factors. Finally, similar voting groups were identified and their affiliations were investigated based on multi-dimensional scaling (MDS) and network analysis techniques. A case study on the national assembly of South Korea showed that the cohesion of the members of the 'Hanara' party becomes higher than that of the 'Minju' party as the bill contentiousness increases, whereas the number of times elected, local constituency versus proportional representation, and the competition intensity in a local constituency were found to be partially influential on the voting activities of lawmakers. In addition, MDS and network analysis showed that there is a distinctive difference between the two parties when all bills are analyzed, whereas the diversity of parties within the same group increases as the bill contentiousness increases.
Journal of Korean Institute of Industrial Engineers. 01/2014; 40(1).
ABSTRACT: Statistical process control techniques have been widely used to improve processes by reducing variations and defects. In the present paper, we propose a multivariate control chart technique based on a clustering algorithm that can effectively handle a situation in which the distribution of in-control observations is inhomogeneous. A simulation study was conducted to examine the characteristics of the proposed control chart and to compare them with Hotelling’s T² multivariate control charts that are widely used in real-world processes. Moreover, an experiment with real data from the thin film transistor liquid crystal display (TFT-LCD) manufacturing process demonstrated the effectiveness and accuracy of the proposed control chart.
International Journal of Production Research 09/2013; 51(18). · 1.32 Impact Factor
ABSTRACT: Relevant statistical modeling and analysis of dental data can improve diagnostic and treatment procedures. The purpose of this study is to demonstrate the use of various data mining algorithms to characterize patients with dentofacial deformities. A total of 72 patients with skeletal malocclusions who had completed orthodontic and orthognathic surgical treatments were examined. Each patient was characterized by 22 measurements related to dentofacial deformities. Clustering analysis and visualization grouped the patients into three different patterns of dentofacial deformities. A feature selection approach based on a false discovery rate was used to identify a subset of the 22 measurements important in categorizing these three clusters. Finally, classification was performed to evaluate the quality of the measurements selected by the feature selection approach. The results showed that feature selection improved classification accuracy while simultaneously determining which measurements were relevant.
PLoS ONE 08/2013; 8(8):e67862. · 3.53 Impact Factor
ABSTRACT: Multivariate control charts have been widely used in many industries to monitor and diagnose processes characterized by a large number of quality characteristics. Usually, these characteristics are highly correlated with each other. The direct use of conventional multivariate control charts for situations with highly correlated characteristics may lead to increased rates of false alarms. Principal component analysis (PCA) control charts have been widely used to address problems posed by such high correlations by transforming the set of correlated variables into an uncorrelated set of variables and then identifying the principal components (PCs) with the highest contribution, which allows one to reduce dimensionality. However, an assumption that the data are normally distributed underlies the construction of the control limits of traditional PCA control charts. This assumption has limited the use of PCA control charts in nonnormal situations found in many modern systems. This study presents the development of nonparametric PCA control charts that do not require any distributional assumptions for their construction. We propose to use nonparametric techniques, kernel density estimation, and bootstrapping to establish the control limits of these charts. A simulation study was conducted to evaluate the performance of the proposed charts and compare them with traditional PCA control charts. The comparative performance in terms of average run length showed that the proposed nonparametric PCA control charts performed better than the parametric PCA control charts in nonnormal situations.
Expert Systems with Applications 06/2013; 40(8):3044–3054. · 1.97 Impact Factor
ABSTRACT: Extracting useful and meaningful patterns from large volumes of text data is of growing importance. In the present study, we analyze vast amounts of prescription data, generated from a book of oriental medicine, to identify the relationships between symptoms and the medicines used to treat them. The oriental medicine book used in this study (called Bangyakhappyeon) contains a large number of prescriptions to treat about 54 categorized symptoms and lists the corresponding herbal materials. We used an association rule algorithm combined with network analysis and found useful and informative relationships between the symptoms and medicines.
PLoS ONE 03/2013; 8(3):e59241. · 3.53 Impact Factor
ABSTRACT: Control charts have been widely recognised as important tools in system monitoring of abnormal behaviour and quality improvement. Traditional control charts have a major assumption that successive observations are uncorrelated and normally distributed. When this assumption is violated, the traditional control charts do not perform well, but instead show increased false alarm rates. In this study, we propose a data mining model adjustment control chart to address autocorrelation problems for cascade processes. The basic idea of the proposed control chart is to monitor the residuals obtained by data mining models. The data mining models used in this study include support vector regression and artificial neural networks. A simulation study was conducted to evaluate the performance of the proposed control chart and compare it with the standard regression adjustment control chart and the observations-based control chart in terms of average run length performance. The results showed that the proposed data mining model adjustment control charts yielded better performance than the two other methods considered in this study. [Received 8 December 2010; Revised 19 June 2011; Revised 9 September 2011; Accepted 29 November 2011]
European J of Industrial Engineering 01/2013; 7(4):442 - 455. · 1.50 Impact Factor
ABSTRACT: The main objective of the present paper is to characterize smoking behavior among older adults by assessing psychological distress, physical health status, alcohol use, and demographic variables in relation to current smoking. We targeted 466 senior American smokers aged 65 years or older from the 2006 National Survey on Drug Use and Health (NSDUH, 2006). We employed a decision tree algorithm to conduct a classification analysis relating these variables to the average number of cigarettes smoked per day. The results showed that the most important explanatory variable for predicting the average number of cigarettes smoked per day is the age at which the respondent first started smoking cigarettes every day, followed by education level and psychological distress. These results suggest that social workers need to provide more customized and individualized interventions to older adults.
ABSTRACT: Process monitoring and diagnosis have been widely recognized as important and critical tools in system monitoring for detection of abnormal behavior and quality improvement. Although traditional statistical process control (SPC) tools are effective in simple manufacturing processes that generate a small volume of independent data, these tools are not capable of handling the large streams of multivariate and autocorrelated data found in modern systems. As the limitations of SPC methodology become increasingly obvious in the face of ever more complex processes, data mining algorithms, because of their proven capabilities to effectively analyze and manage large amounts of data, have the potential to resolve the challenging problems that are stretching SPC to its limits. In the present study, we attempted to integrate state-of-the-art data mining algorithms with SPC techniques to achieve efficient monitoring in multivariate and autocorrelated processes. The data mining algorithms include artificial neural networks, support vector regression, and multivariate adaptive regression splines. The residuals of data mining models were utilized to construct multivariate cumulative sum control charts to monitor the process mean. Simulation results from various scenarios indicated that data mining model-based control charts perform better than traditional time-series model-based control charts.
ABSTRACT: We propose new multivariate control charts that can effectively deal with massive amounts of complex data through their integration with classification algorithms. We call the proposed control chart the ‘Probability of Class (PoC) chart’ because the values of PoC, obtained from classification algorithms, are used as monitoring statistics. The control limits of PoC charts are established and adjusted by the bootstrap method. Experimental results with simulated and real data showed that PoC charts outperform Hotelling's T² control charts. Further, a simulation study revealed that a small proportion of out-of-control observations is sufficient for PoC charts to achieve the desired performance.
Journal of Statistical Computation and Simulation 12/2011; 81(12):1897-1911. · 0.71 Impact Factor
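Several abstracts in this list set control limits by bootstrapping rather than from a distributional assumption. The abstracts do not spell out the procedure, so the sketch below shows one common percentile variant only as an assumption: resample the in-control monitoring statistics with replacement, take the (1 - alpha) quantile of each resample, and average those quantiles to obtain an upper control limit. The function names, default parameters, and synthetic statistics are hypothetical.

```python
import random


def percentile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_control_limit(stats, alpha=0.01, n_boot=500, seed=0):
    """Upper control limit as the mean of bootstrap (1 - alpha) quantiles.

    stats holds monitoring statistics from in-control data; alpha is the
    target false alarm rate. This is one common variant, assumed here.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(stats)
    limits = []
    for _ in range(n_boot):
        resample = sorted(rng.choices(stats, k=n))  # sample with replacement
        limits.append(percentile(resample, 1 - alpha))
    return sum(limits) / n_boot


# Hypothetical in-control statistics, evenly spread over [0, 2).
in_control = [i / 100 for i in range(200)]
ucl = bootstrap_control_limit(in_control, alpha=0.05)
# A new observation would be flagged when its statistic exceeds ucl.
```

The appeal of this kind of limit is that it depends only on the empirical distribution of the monitoring statistic, which is why it pairs naturally with statistics such as classifier probabilities or novelty scores whose distributions are not known in closed form.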
ABSTRACT: Malaria is a devastating mosquito-borne disease, which affects hundreds of millions of people each year. It is transmitted predominantly by Anopheles gambiae, whose females must be >10 days old to become infective. In this study, cuticular lipids from a laboratory strain of this mosquito species were analyzed using a mass spectrometry method to evaluate their utility for age, sex and mating status differentiation. Matrix-assisted laser desorption/ionization-mass spectrometry (MALDI-MS), in conjunction with an acenaphthene/silver nitrate matrix preparation, was shown to be 100% effective in classifying A. gambiae females into 1, 7-10, and 14 days of age. MALDI-MS analysis, supported by multivariate statistical methods, was also effective in detecting cuticular lipid differences between the sexes and between virgin and mated females. The technique requires further testing, but the obtained results suggest that MALDI-MS cuticular lipid spectra could be used for age grading of A. gambiae females with precision greater than with other available methods.
ABSTRACT: This paper presents a statistical analysis on the environmental impact of airport deicing activities at Dallas-Fort Worth (D/FW) International Airport. The focus of this paper is on aircraft deicing, which typically uses a spray of aircraft deicing and anti-icing fluids (ADAF). ADAF has a high concentration of ethylene/propylene/diethylene glycol, which shears off airfoil surfaces during takeoff and drips to the runways during taxiing. A significant portion of the glycol runs off and mixes with the airport’s receiving waters during heavy deicing periods, resulting in bacterial growth that causes an increase in chemical oxygen demand (COD) and a subsequent reduction in dissolved oxygen (DO) in the receiving waters. In this study, statistical methods for data mining were employed to evaluate the impact of airport deicing activities on COD and DO in the receiving waters immediately surrounding D/FW Airport. In particular, decision tree models were developed to determine important explanatory variables for predicting levels of COD and DO in the airport’s waterways. The decision tree modeling and analysis of COD determined north–south wind, glycol usage at a specific deicing pad, and monitoring site to be significant explanatory variables. The impact of glycol usage on DO was apparent, as every decision tree had at least one group with a median DO below 4.0 mg/l, and these low-DO groups were associated with high glycol usage. These results are crucial to D/FW Airport in its goal to minimize the potential adverse impact of deicing activities on the water quality in waterways proximate to the airport. The advantages of data-driven modeling and analysis are its cost-effectiveness due to its potential to be implemented without making major changes in physical systems, ease of application, and usefulness in making future management decisions.
Expert Systems with Applications 11/2011; 38:14899-14906. · 1.97 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: This pilot study was designed to determine if metabolic effects in different brain regions (left and right parietal lobes, midbrain) caused by 3 d of food consumption without methionine or cysteine could be detected by proton magnetic resonance spectroscopy.
Healthy individuals 18 to 36 y old (n = 8) were studied by magnetic resonance spectroscopy after receiving a diet with adequate sulfur amino acids (SAAs) or with zero SAA for 3 d. Pulse sequences were used to selectively measure glutathione (GSH), and linear combination modeling of spectra was used to measure other high-abundance brain metabolites and expressed relative to creatine (Cr).
Although dietary SAAs are required to maintain GSH, the 3-d SAA insufficiency resulted in no significant change in GSH/Cr in the three brain regions. Principal component analysis of 16 metabolites measured by linear combination modeling showed that the metabolic pattern in the midbrain, but not in the parietal lobes, was distinguished according to the dietary SAAs. Multivariate statistical analysis showed that the major discriminating factors were signals of glutamate/Cr, (glutamate + glutamine)/Cr, and myoinositol/Cr. Correlation analyses between midbrain metabolites and GSH-related metabolites in plasma showed that midbrain glutamate/Cr had an inverse correlation with plasma cystine.
The data show that magnetic resonance spectroscopy is a non-invasive tool suitable for nutritional assessment and suggest that nutritional imbalance caused by 3 d of SAA-free food more selectively affects the midbrain than the parietal lobes.