Seongho Kim

Karmanos Cancer Institute, Detroit, Michigan, United States

Publications (49) · 143.67 Total Impact

  • Seongho Kim
    ABSTRACT: The lack of a general matrix formula hampers extending the semi-partial correlation, also known as the part correlation, to higher-order coefficients, because computing a higher-order semi-partial correlation with the recursive formula requires an enormous number of recursive calculations. To resolve this difficulty, we derive a general matrix formula of the semi-partial correlation for fast computation. The semi-partial correlations are then implemented in the R package ppcor along with the partial correlation. Owing to the general matrix formulas, users can readily calculate the coefficients of both partial and semi-partial correlations without computational burden. The package ppcor further provides the statistical significance level along with the corresponding test statistic.
    No preview · Article · Nov 2015
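    The matrix and recursive formulas behind ppcor can be cross-checked numerically. The sketch below (simulated data in plain Python, not code from the package) compares the precision-matrix formula for the partial correlation against the first-order recursive formula, and computes the corresponding semi-partial (part) correlation:

    ```python
    import math, random

    # simulate three correlated variables (illustration only)
    random.seed(1)
    n = 500
    z = [random.gauss(0, 1) for _ in range(n)]
    x = [zi + 0.5 * random.gauss(0, 1) for zi in z]
    y = [zi + 0.5 * random.gauss(0, 1) for zi in z]

    def corr(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        su = math.sqrt(sum((a - mu) ** 2 for a in u))
        sv = math.sqrt(sum((b - mv) ** 2 for b in v))
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)

    # matrix route: with P = inv(R), pcor_ij = -P_ij / sqrt(P_ii * P_jj).
    # For three variables the cofactors of R give the same ratio (the
    # determinant cancels), so no explicit inversion is needed here.
    p_xx, p_yy, p_xy = 1 - r_yz ** 2, 1 - r_xz ** 2, -(r_xy - r_xz * r_yz)
    pcor_matrix = -p_xy / math.sqrt(p_xx * p_yy)

    # recursive (first-order) formula for the partial correlation
    pcor_rec = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

    # semi-partial (part) correlation: z is removed from y only
    spcor = (r_xy - r_xz * r_yz) / math.sqrt(1 - r_yz ** 2)
    ```

    For three variables the cofactor ratio sidesteps a full inversion; the paper's contribution is the general p-variable matrix formula that avoids the recursion entirely.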
  • Seongho Kim · Elisabeth Heath · Lance Heilbrun
    ABSTRACT: Although the sample size for simple logistic regression can be readily determined using currently available methods, the sample size calculation for multiple logistic regression requires additional information, such as the coefficient of determination (R²) of a covariate of interest with the other covariates, which is often unavailable in practice. The response variable of logistic regression follows a logit-normal distribution, which can be generated by a logistic transformation of a normal distribution. Using this property of logistic regression, we propose new methods of determining the sample size for simple and multiple logistic regressions based on a normal transformation of the outcome measures. Simulation studies and a motivating example show several advantages of the proposed methods over existing methods: (i) no need for R² in multiple logistic regression, (ii) the availability of interim or group-sequential designs, and (iii) a much smaller required sample size.
    No preview · Article · Mar 2015 · Statistical Methods in Medical Research
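    The logit-normal property the paper builds on is easy to illustrate: if Z ~ N(mu, sigma²), then Y = 1/(1 + exp(-Z)) lies in (0, 1), and the logit back-transform recovers the normal scale. A minimal sketch (the parameter values are arbitrary, not from the paper):

    ```python
    import math, random

    # If Z ~ N(mu, sigma^2), then Y = 1 / (1 + exp(-Z)) follows a
    # logit-normal distribution: the property the proposed sample-size
    # methods exploit. mu, sigma, and the sample size are illustration values.
    random.seed(0)
    mu, sigma = 0.5, 1.0
    ys = [1.0 / (1.0 + math.exp(-random.gauss(mu, sigma))) for _ in range(100_000)]

    # every draw is a valid probability-scale response in (0, 1) ...
    lo, hi = min(ys), max(ys)

    # ... and the logit back-transform recovers the underlying normal mean
    zs = [math.log(y / (1.0 - y)) for y in ys]
    mean_z = sum(zs) / len(zs)
    ```

    Working on the transformed normal scale is what makes standard normal-theory sample-size and group-sequential machinery applicable.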
  • [Show abstract] [Hide abstract]
    ABSTRACT: Warfarin is a frequently used oral anticoagulant for the long-term prevention and treatment of thromboembolic events. Because of its narrow therapeutic range and large inter-individual dose-response variability, it is highly desirable to personalize warfarin dosing. However, the complexity of conventional kinetic-pharmacodynamic (K-PD) models hampers the development of personalized dose management. To address this challenge, we propose simplified PD models for the warfarin dose-INR response relationship, motivated by ideas from control theory. The simplified models were applied to longitudinal data from 37 patients undergoing anticoagulation treatment using the standard two-stage approach and then compared with the conventional K-PD models. Data analysis shows that all models have similar predictive ability, but the simplified models are the most parsimonious.
    No preview · Article · Feb 2015 · Biomedical Engineering Applications Basis and Communications
  • Source
    Seongho Kim · Hyejeong Jang
    ABSTRACT: A nonlinear mixed-effects approach is developed for disease progression models that incorporate variation in age in a Bayesian framework. We further generalize the probability model for sensitivity to depend on age at diagnosis, time spent in the preclinical state, and sojourn time. The developed models are then applied to the Johns Hopkins Lung Project data and the Health Insurance Plan for Greater New York data using Bayesian Markov chain Monte Carlo and are compared with an estimation method that does not consider random effects of age. Using the developed models, we obtain not only age-specific individual-level distributions, but also population-level distributions of sensitivity, sojourn time, and transition probability.
    Preview · Article · Jan 2015
  • Source
    ABSTRACT: The goal of metabolic association networks is to identify the topology of a metabolic network for a better understanding of molecular mechanisms. An accurate metabolic association network enables investigation of the functional behavior of metabolites in a cell or tissue. Gaussian graphical model (GGM)-based methods have been widely used in genomics to infer biological networks, but their performance in constructing metabolic association networks remains unknown in metabolomics. We compared the performance of principal component regression (PCR), independent component regression (ICR), shrinkage covariance estimate (SCE), partial least squares regression (PLSR), and extrinsic similarity (ES) methods in constructing metabolic association networks by estimating partial correlation coefficient matrices when the number of variables is larger than the sample size. To do this, the sample size and the network density (complexity) were varied during network construction. Simulation studies show that PCR and ICR are more robust to sample size and network density than SCE and PLSR in terms of F1 scores. These methods were further applied to experimental metabolomics data acquired from metabolite extracts of mouse liver. For the simulated data, PCR and ICR outperform the other methods when the network density is large, while PLSR and SCE perform better when the network density is small. For the experimental data, PCR and ICR discover more significant edges and perform better than PLSR and SCE when the discovered edges are evaluated against KEGG pathways. These results suggest that the metabolic network may be more complex, and that PCR and ICR therefore have an advantage over PLSR and SCE in constructing metabolic association networks.
    Full-text · Article · Nov 2014 · Chemometrics and Intelligent Laboratory Systems
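    The F1 score used to compare the methods above combines the positive predictive value (precision) and the true positive rate (recall) over recovered edges. A minimal sketch, with edges encoded as frozensets of node pairs (a hypothetical encoding, not the paper's):

    ```python
    def edge_f1(true_edges, pred_edges):
        """F1 score for recovered network edges: harmonic mean of the
        positive predictive value (precision) and true positive rate (recall)."""
        tp = len(true_edges & pred_edges)
        if tp == 0:
            return 0.0
        precision = tp / len(pred_edges)
        recall = tp / len(true_edges)
        return 2 * precision * recall / (precision + recall)

    # toy example: 2 of 3 predicted edges match 2 of 3 true edges
    truth = {frozenset(e) for e in [("m1", "m2"), ("m2", "m3"), ("m3", "m4")]}
    pred = {frozenset(e) for e in [("m1", "m2"), ("m2", "m3"), ("m1", "m4")]}
    score = edge_f1(truth, pred)
    ```

    Frozensets make the comparison direction-free, matching the undirected edges of a partial correlation network.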
  • [Show abstract] [Hide abstract]
    ABSTRACT: The universal sequel to chronic kidney condition (CKD) is anemia. Patients of anemia have kidneys that are incapable of performing certain basic functions such as sensing of oxygen levels to secrete erythropoietin when red blood cell counts are low. Under such conditions, external administration of human recombinant erythropoietin (EPO) is administered as alternative to improve conditions of CKD patients by increasing their hemoglobin (Hb) levels to a given therapeutic range.Presently, EPO dosing strategies extensively depend on packet inserts and on “average” responses to the medication from previous patients. Clearly dosage strategies based on these approaches are, at best, nonoptimal to EPO medication and potentially dangerous to patients that do not adhere to the notion of expected “average” response. In this work, a technique called semi-blind robust identification is provided to uniquely identify models of the individual patients of anemia based on their actual Hb responses and EPO administration. Using the a priori information and the measured input-output data of the individual patients, the procedure identifies a unique model consisting of a nominal model and the associated model uncertainty for the patients. By incorporating the effects of unknown system initial conditions, considerably small measurement samples can be used in the modeling process.
    No preview · Article · Oct 2014 · Computer Methods and Programs in Biomedicine
  • Imhoi Koo · Sen Yao · Xiang Zhang · Seongho Kim
    ABSTRACT: The Gaussian graphical model (GGM)-based method, a key approach to reverse engineering biological networks, uses partial correlation to measure the conditional dependence between two variables while controlling for the contributions of the other variables. After estimating partial correlation coefficients, one of the most critical steps in network construction is to control the false discovery rate (FDR) to assess the significant associations among variables. Various FDR methods have been proposed, mainly for biomarker discovery, but it remains unclear which FDR method performs better for network construction. Furthermore, no study has examined the effect of the network structure on network construction. We selected six FDR methods, the linear step-up procedure (BH95), the adaptive linear step-up procedure (BH00), Efron's local FDR (LFDR), Benjamini-Yekutieli's step-up procedure (BY01), Storey's q-value procedure (Storey01), and Storey-Taylor-Siegmund's adaptive step-up procedure (STS04), to evaluate their performance on network construction. We further considered two network structures, random and scale-free networks, to investigate their influence on network construction. Both simulated and real experimental data suggest that STS04 provides the highest true positive rate (TPR) or F1 score, while BY01 has the highest positive predictive value (PPV) in network construction. In addition, no significant effect of the network structure on the FDR methods is found.
    No preview · Article · Aug 2014 · Journal of Bioinformatics and Computational Biology
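    Of the six procedures, the BH95 linear step-up rule is the simplest to state: sort the m p-values and reject the hypotheses with the k smallest, where k is the largest i such that p_(i) ≤ (i/m)·α. A minimal sketch:

    ```python
    def bh_stepup(pvals, alpha=0.05):
        """Benjamini-Hochberg (BH95) linear step-up procedure: reject the
        hypotheses with the k smallest p-values, where k is the largest i
        such that p_(i) <= (i / m) * alpha."""
        m = len(pvals)
        order = sorted(range(m), key=lambda i: pvals[i])
        k = 0
        for rank, idx in enumerate(order, start=1):
            if pvals[idx] <= rank / m * alpha:
                k = rank                      # step-up: keep the largest passing rank
        return {order[i] for i in range(k)}   # indices of rejected hypotheses
    ```

    Note the step-up character: for p-values [0.04, 0.04] at α = 0.05, p_(1) fails its threshold (0.025) but p_(2) passes (0.05), so both hypotheses are rejected.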
  • Seongho Kim · Xiang Zhang
    ABSTRACT: Compound identification is a critical process in metabolomics. The widely used approach for compound identification in gas chromatography-mass spectrometry-based metabolomics is spectrum matching, in which the mass spectral similarity between an experimental mass spectrum and each mass spectrum in a reference library is calculated. While various similarity measures have been developed to improve the overall accuracy of compound identification, little attention has been paid to reducing the false discovery rate. We therefore develop an approach for controlling the false identification rate using the distribution of the difference between the first and second highest spectral similarity scores. We further propose a model-based approach to achieving a desired true positive rate. The developed method is applied to the National Institute of Standards and Technology mass spectral library, and its performance is compared with that of the conventional approach that uses only the maximum spectral similarity score. The results show that the developed method achieves a significantly higher F1 score and positive predictive value than the conventional approach.
    No preview · Article · Aug 2014 · Journal of Chemometrics
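    The margin between the first and second highest similarity scores, the quantity whose distribution the paper models, can be computed as below; the dict-of-peaks spectrum format and the cosine similarity are illustrative assumptions, not the paper's exact setup:

    ```python
    import math

    def cosine(a, b):
        """Cosine similarity between two spectra stored as {m/z: intensity}
        dicts (an illustrative format)."""
        dot = sum(v * b.get(k, 0.0) for k, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    def top_two_margin(query, library):
        """Best similarity score and its margin over the runner-up; a small
        margin flags an ambiguous, likely false, identification."""
        scores = sorted((cosine(query, ref) for ref in library), reverse=True)
        return scores[0], scores[0] - scores[1]

    query = {50: 1.0, 75: 0.5}
    library = [dict(query), {50: 1.0}, {75: 1.0}]
    best, margin = top_two_margin(query, library)
    ```

    A threshold on this margin, rather than on the best score alone, is what separates confident hits from near-ties between library candidates.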
  • Source
    ABSTRACT: We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC×GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines the peak regions. Peaks are then picked using a mixture of probability distributions to handle co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among the peaks within the same peak group. The algorithm is evaluated using experimental data to study the effect of different cutoffs of the conditional Bayes factors and of different mixture models, including Poisson, truncated Gaussian, Gaussian, Gamma, and exponentially modified Gaussian (EMG) distributions, and the optimal version is identified using a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm detects peaks with lower false discovery rates than the existing algorithms, and that a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
    Full-text · Article · Aug 2014 · The Annals of Applied Statistics
  • Source
    ABSTRACT: Childhood obesity has become a national public health crisis in America. Physical inactivity and unhealthy eating behaviors may contribute to the childhood obesity epidemic. School-based healthy lifestyle interventions play a promising role in preventing and controlling childhood obesity. A comprehensive school-based healthy lifestyle intervention was implemented in 4 rural elementary schools in Kentucky. The intervention included 4 goals: improving physical education, health education, family/community involvement, and school wellness policies. Children's physical activity was assessed by pedometer, and nutrition was assessed by a previous day recall survey in January (baseline), February (t1), March (t2), April (t3), and May (t4) of 2011. The intervention had significant effects on increasing the percentages of children meeting physical activity (1% vs 5%, p < .01) and nutrition (15% vs 26%, p < .01) recommendations. The effects of the intervention on physical activity and nutrition depended on school, grade, and age of the children. There was an increasing linear trend of physical activity and an increasing quadratic trend of nutrition over time among children. The intervention had beneficial effects in improving healthy behaviors among children. Further studies are needed to assess its long-term effects and cost-effectiveness.
    Full-text · Article · Apr 2014 · Journal of School Health
  • Source
    Imhoi Koo · Xue Shi · Seongho Kim · Xiang Zhang
    ABSTRACT: We developed a method, iMatch2, for compound identification using retention indices (RI) in the NIST11 library. A three-way ANOVA test and a Kruskal-Wallis test demonstrate that the column class and temperature program type defined by the NIST library are the dominant factors affecting the magnitude of the retention index, while the retention index data type does not cause a significant difference. The developed linear regression transformation, which merges retention indices with different data types but the same column class and temperature program type, reduces the standard deviation of the retention index by up to 8% compared with the simple union approach used in the original iMatch. Among outlier detection methods for removing retention indices that differ greatly from the remaining data of the same compound, the Tietjen-Moore test and the generalized extreme studentized deviate test are the strictest, while methods such as Dixon's test, the Thompson tau approach, and Grubbs' test are more conservative. To improve the accuracy of the retention index window, a compound-specific retention index window is introduced for compounds with a large number of retention indices in the NIST11 library, while the retention index window is calculated from empirical distributions for compounds with a small number of retention indices. Analysis of experimental data from a mixture of compound standards and a metabolite extract from mouse liver shows significant improvement in the retention index quality of the NIST11 library with the new data analysis methods.
    Full-text · Article · Apr 2014 · Journal of Chromatography A
  • Xiaoli Wei · Imhoi Koo · Seongho Kim · Xiang Zhang
    ABSTRACT: We report a compound identification method (SimMR) that simultaneously evaluates the mass spectrum similarity and the retention index distance using an empirical mixture score function for the analysis of GC-MS data. The performance of the developed SimMR method was compared with that of two existing compound identification strategies: the mass spectrum matching method without incorporation of retention index information (SM), and the method that sequentially evaluates the mass spectrum similarity and retention index distance (SeqMR). For comparison purposes, we used the NIST/EPA/NIH Mass Spectral Library 2005. Our study demonstrates that SimMR performs best among the three compound identification methods, improving the overall identification accuracy by up to 1.53% and 4.81% compared with SeqMR and SM, respectively.
    No preview · Article · Mar 2014 · The Analyst
  • Source
    ABSTRACT: A data-dependent peak model (DDPM) based spectrum deconvolution method was developed for the analysis of high-resolution LC-MS data. To construct the extracted ion chromatograms (XICs), a clustering method, density-based spatial clustering of applications with noise (DBSCAN), is applied to all m/z values of an LC-MS data set to group the m/z values into XICs. DBSCAN constructs XICs without the need for a user-defined m/z variation window. After the XIC construction, the peaks of molecular ions in each XIC are detected using both the first and second derivative tests, followed by an optimized chromatographic peak model selection method for peak deconvolution. A total of six chromatographic peak models are considered: Gaussian, log-normal, Poisson, gamma, exponentially modified Gaussian, and a hybrid of exponential and Gaussian models. The abundant non-overlapping peaks are chosen to find the optimal peak models, which are both data- and retention-time-dependent. Analysis of 18 spiked-in LC-MS data sets demonstrates that the proposed DDPM spectrum deconvolution method outperforms the traditional method. On average, the DDPM approach not only detected 58 more chromatographic peaks from each of the test LC-MS data sets, but also improved the retention time and peak area estimates by 3% and 6%, respectively.
    Full-text · Article · Feb 2014 · Analytical Chemistry
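    Applying DBSCAN to the one-dimensional list of m/z values, as described above, groups nearby masses into XICs without a fixed m/z window. A minimal pure-Python sketch (eps and min_pts remain user choices; the values below are arbitrary, and real implementations index the sorted m/z axis rather than scanning it):

    ```python
    def region(values, i, eps):
        # indices of all points within eps of values[i] (includes i itself)
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= eps]

    def dbscan_1d(values, eps, min_pts):
        """Label each m/z value with a cluster id (XIC) or -1 for noise."""
        labels = [None] * len(values)          # None = unvisited
        cid = -1
        for i in range(len(values)):
            if labels[i] is not None:
                continue
            neigh = region(values, i, eps)
            if len(neigh) < min_pts:
                labels[i] = -1                 # noise (may later become a border point)
                continue
            cid += 1
            labels[i] = cid
            seeds = list(neigh)
            while seeds:
                j = seeds.pop()
                if labels[j] == -1:            # former noise absorbed as border point
                    labels[j] = cid
                if labels[j] is None:          # unvisited: claim and maybe expand
                    labels[j] = cid
                    jn = region(values, j, eps)
                    if len(jn) >= min_pts:     # j is a core point
                        seeds.extend(jn)
        return labels

    mz = [100.00, 100.01, 100.02, 200.00, 200.01, 500.0]
    labels = dbscan_1d(mz, eps=0.05, min_pts=2)
    ```

    Density-based grouping is what removes the fixed-window requirement: cluster extent adapts to how tightly the m/z values of one ion actually scatter.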
  • Seongho Kim · Lang Li
    ABSTRACT: The statistical identifiability of nonlinear pharmacokinetic (PK) models with the Michaelis-Menten (MM) kinetic equation is considered using a global optimization approach, particle swarm optimization (PSO). If a model is statistically non-identifiable, the conventional derivative-based estimation approach often terminates early without converging, due to singularity. To circumvent this difficulty, we develop a derivative-free global optimization algorithm that combines PSO with a derivative-free local optimization algorithm to improve the rate of convergence of PSO. We further propose an efficient approach that not only checks the convergence of estimation but also detects the identifiability of nonlinear PK models. PK simulation studies demonstrate that the convergence and identifiability of the PK model can be detected efficiently through the proposed approach. The proposed approach is then applied to clinical PK data along with a two-compartment model.
    No preview · Article · Oct 2013 · Computer methods and programs in biomedicine
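    A bare-bones PSO of the kind used as the global stage above can be sketched as follows; the inertia and acceleration constants are common textbook defaults, not the paper's settings, and the objective here is a toy function rather than a PK likelihood:

    ```python
    import random

    def pso(f, dim, n_particles=30, iters=200, bounds=(-5.0, 5.0),
            w=0.7, c1=1.5, c2=1.5, seed=0):
        """Minimal particle swarm optimizer (minimization); derivative-free,
        so it keeps moving even where gradient-based methods stall."""
        rng = random.Random(seed)
        lo, hi = bounds
        xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
        vs = [[0.0] * dim for _ in range(n_particles)]
        pbest = [x[:] for x in xs]            # per-particle best positions
        pbest_f = [f(x) for x in xs]
        g = min(range(n_particles), key=lambda i: pbest_f[i])
        gbest, gbest_f = pbest[g][:], pbest_f[g]
        for _ in range(iters):
            for i, x in enumerate(xs):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vs[i][d] = (w * vs[i][d]
                                + c1 * r1 * (pbest[i][d] - x[d])
                                + c2 * r2 * (gbest[d] - x[d]))
                    x[d] += vs[i][d]
                fx = f(x)
                if fx < pbest_f[i]:
                    pbest[i], pbest_f[i] = x[:], fx
                    if fx < gbest_f:
                        gbest, gbest_f = x[:], fx
        return gbest, gbest_f

    # toy usage: minimize the 2-D sphere function
    best, val = pso(lambda x: sum(t * t for t in x), dim=2)
    ```

    In the paper's setting, f would be the (possibly singular) PK objective, and a derivative-free local search refines the swarm's best position.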
  • Source
    Seongho Kim · Xiang Zhang
    ABSTRACT: Peak alignment is a critical procedure in mass spectrometry-based biomarker discovery in metabolomics. One approach to peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry (GC×GC-MS) data is peak matching-based alignment, and a key to this approach is the calculation of mass spectral similarity scores. Various mass spectral similarity measures have been developed, mainly for compound identification, but the effect of these measures on the performance of peak matching-based alignment remains unknown. We therefore selected five mass spectral similarity measures, cosine correlation, Pearson's correlation, Spearman's correlation, partial correlation, and part correlation, and examined their effects on peak alignment using two sets of experimental GC×GC-MS data. The results show that the spectral similarity measure does not significantly affect the alignment accuracy for data from less complex samples, while the partial correlation performs much better than the other measures when analyzing experimental data acquired from complex biological samples.
    Preview · Article · Sep 2013 · Computational and Mathematical Methods in Medicine
  • Hyejeong Jang · Seongho Kim · Dongfeng Wu
    ABSTRACT: Lung cancer screening using X-rays has been controversial for many years. A major concern is whether lung cancer screening really brings any survival benefit, which depends on effective treatment after early detection. The problem was analyzed from a different point of view, and estimates of the projected lead time were presented for participants in a lung cancer screening program using the Johns Hopkins Lung Project (JHLP) data. A newly developed method of lead time estimation was applied in which the lifetime T is treated as a random variable rather than a fixed value, so that the number of future screenings for a given individual is also random. Using the actuarial life table available from the United States Social Security Administration, the lifetime distribution was first obtained, and the lead time distribution was then projected using the JHLP data. The analysis shows that, for a male heavy smoker with an initial screening age of 50, 60, or 70, the probability of no early detection with semiannual screens is 32.16%, 32.45%, and 33.17%, respectively, while the mean lead time is 1.36, 1.33, and 1.23 years. The probability of no early detection increases monotonically as the screening interval increases, and it increases slightly as the initial age increases for the same screening interval. The mean lead time and its standard error decrease as the screening interval increases for all age groups, and both decrease as the initial age increases for the same screening interval. The overall mean lead time estimated with a random lifetime T is slightly less than that with a fixed T. These results may help improve current screening programs.
    No preview · Article · Sep 2013
  • Imhoi Koo · Seongho Kim · Xiang Zhang
    ABSTRACT: Compound identification in gas chromatography-mass spectrometry (GC-MS) is usually achieved by matching query spectra to spectra present in a reference library. Although several spectral similarity measures have been developed and compared using a small reference library, it remains unknown how the spectral similarity measure and the size of the reference library affect the identification accuracy as well as the optimal weight factor. We used three reference libraries to investigate the dependence of the optimal weight factor on the spectral similarity measure and the size of the reference library. Our study demonstrates that the optimal weight factor depends not only on the spectral similarity measure but also on the size of the reference library. The mixture semi-partial correlation measure outperforms all existing spectral similarity measures in all tested reference libraries, in spite of its computational expense. Furthermore, the accuracy of compound identification with larger future reference libraries is estimated by varying the size of the reference library. The simulation study indicates that the mixture semi-partial correlation measure will perform best as the reference library grows.
    No preview · Article · May 2013 · Journal of Chromatography A
  • Source
    ABSTRACT: MOTIVATION: Due to the high complexity of the metabolome, comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC×GC-TOF MS) is considered a powerful analytical platform for metabolomics studies. However, applications of GC×GC-TOF MS in metabolomics have been limited by the lack of a bioinformatics system for data analysis. RESULTS: We developed a computational platform, MetPP, for the analysis of metabolomics data acquired on a GC×GC-TOF MS system. Using the peak lists deconvoluted from the instrument data as input, MetPP can perform peak filtering and merging, retention index matching, peak list alignment, normalization, statistical significance tests, and pattern recognition. The performance of the MetPP software was tested with two sets of experimental data acquired in a spike-in experiment and a biomarker discovery experiment, respectively. MetPP not only correctly aligned the spiked-in metabolite standards, but also correctly recognized their concentration differences between sample groups. In the biomarker discovery data, a total of 15 metabolites were recognized with significant concentration differences between the sample groups, and these results agree with the literature results of histological analysis, demonstrating the effectiveness of the MetPP software for disease biomarker discovery. AVAILABILITY: The source code of MetPP is available at CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Full-text · Article · May 2013 · Bioinformatics
  • Source
    Jaesik Jeong · Xiang Zhang · Xue Shi · Seongho Kim · Changyu Shen
    ABSTRACT: Background: Since peak alignment in metabolomics has a large effect on the subsequent statistical analysis, it is considered a key preprocessing step, and many peak alignment methods have been developed. However, existing peak alignment methods do not produce satisfactory results: the lack of accuracy stems from the fact that peak alignment is done separately from other preprocessing steps such as identification. Therefore, a post-hoc approach that integrates both identification and alignment results is needed to increase the accuracy of peak alignment. Results: The proposed post-hoc method was validated with three datasets: a mixture of compound standards, a metabolite extract from mouse liver, and a metabolite extract from wheat. Compared with existing methods, the proposed approach improved peak alignment in terms of various performance measures, and the improvement was confirmed by manual inspection. Conclusions: The proposed approach, which combines the information from metabolite identification and alignment, clearly improves the accuracy of peak alignment in terms of several performance measures. An R package and examples using a dataset are available at
    Full-text · Article · Apr 2013 · BMC Bioinformatics
  • Source
    ABSTRACT: A method employing high-resolution mass spectrometry in combination with in vivo deuterium labeling of metabolites was developed in this study to investigate the effects of alcohol exposure on lipid homeostasis at the white adipose tissue (WAT)-liver axis in a mouse model of alcoholic fatty liver. To differentiate liver lipids synthesized from fatty acids transported back from adipose tissue from lipids synthesized from other sources of fatty acids, a two-stage mouse feeding experiment was performed to incorporate deuterium into metabolites. Lipids extracted from mouse liver, epididymal white adipose tissue (eWAT), and subcutaneous white adipose tissue (sWAT) were analyzed. Thirteen and ten triacylglycerols (TGs) incorporating a given number of deuterium atoms were significantly increased in alcohol-induced fatty liver at two and four weeks of alcohol feeding, respectively, with concentration changes ranging from 1.7- to 6.3-fold increases. A total of 14 deuterated TGs were significantly decreased in both eWAT and sWAT at two and four weeks, with fold changes ranging from 0.19 to 0.77. The increase of deuterium-incorporated TGs in alcohol-induced fatty liver and their decrease in both eWAT and sWAT indicate that alcohol exposure induces a hepatic influx of fatty acids released from WATs. The time course analysis further indicates a mechanistic link between adipose fat loss and hepatic fat gain in alcoholic fatty liver.
    Full-text · Article · Feb 2013 · PLoS ONE

Publication Stats

325 Citations
143.67 Total Impact Points


  • 2014-2015
    • Karmanos Cancer Institute
      Detroit, Michigan, United States
  • 2013-2014
    • Wayne State University
      Detroit, Michigan, United States
  • 2010-2014
    • University of Louisville
      • Department of Pharmacology and Toxicology
      • Department of Bioinformatics and Biostatistics
      Louisville, Kentucky, United States
  • 2008-2010
    • Indiana University-Purdue University Indianapolis
      • Department of Biostatistics
      • Department of Medicine
      Indianapolis, Indiana, United States