Wilson Wen Bin Goh
  • PhD
  • Nanyang Technological University

About

132
Publications
20,314
Reads
2,716
Citations
Current institution
Nanyang Technological University
Additional affiliations
Position
  • Researcher

Publications (132)
Article
Full-text available
Batch effect associated missing values (BEAMs) are batch-wide missingness induced from the integration of data with different coverage of biomedical features. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through...
Article
Full-text available
We apply machine learning techniques to navigate the multifaceted landscape of schizophrenia. Our method entails the development of predictive models, emphasizing peripheral inflammatory biomarkers, which are classified into treatment response subgroups: antipsychotic-responsive, clozapine-responsive, and clozapine-resistant. The cohort comprises 1...
Article
Full-text available
The “similarity of dissimilarities” is an emerging paradigm in biomedical science with significant implications for protein function prediction, machine learning (ML), and personalized medicine. In protein function prediction, recognizing dissimilarities alongside similarities provides a more detailed understanding of evolutionary processes, allowi...
Article
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive co...
Article
Full-text available
Distinguishing stable and fluctuating psychopathological features in young individuals at Ultra High Risk (UHR) for psychosis is challenging, but critical for building robust, accurate, early clinical detection and prevention capabilities. Over a 24-month period, 159 UHR individuals were assessed using the Positive and Negative Symptom Scale (PANSS...
Article
Full-text available
Author summary Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed insights on biological mechanism, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on h...
Preprint
Full-text available
Background: In this research study, we apply machine learning techniques to navigate the multifaceted landscape of schizophrenia. Our method entails the development of predictive models, emphasizing peripheral inflammatory biomarkers, which are classified into treatment response subgroups: antipsychotic-responsive, clozapine-responsive, and clozapin...
Article
Motivation: Deep graph learning (DGL) has been widely employed in the realm of ligand-based virtual screening (LBVS). Within this field, a key hurdle is the existence of activity cliffs (ACs), where minor chemical alterations can lead to significant changes in bioactivity. In response, several DGL models have been developed to enhance ligand bioacti...
Article
Full-text available
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflow...
Article
The paper demonstrates for the first time that a brain-inspired spiking neural network (SNN) architecture can be used not only to learn spatio-temporal data, but also to extract fuzzy spatio-temporal rules from such data and to update these rules incrementally in a transfer learning mode. We propose a method, where a SNN model learns incrementally...
Article
Full-text available
Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence and non-reprod...
Article
This article summarizes the PROTREC method and investigates the impact that the different hyper‐parameters have on the task of missing protein prediction using PROTREC. We evaluate missing protein recovery rates using different PROTREC score selection approaches (MAX, MIN, MEDIAN, and MEAN), different PROTREC score thresholds, as well as different...
Article
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabil...
Preprint
Full-text available
In the process of identifying phenotype-specific or differentially expressed proteins from proteomic data, a standard workflow consists of five key steps: raw data quantification, expression matrix construction, matrix normalization, missing data imputation, and differential expression analysis. However, due to the availability of multiple options...
Article
In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MV...
Article
Full-text available
In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (a peptide whic...
Article
Proteomic studies characterize the protein composition of complex biological samples. Despite recent advancements in mass spectrometry instrumentation and computational tools, low proteome coverage and interpretability remain a challenge. To address this, we developed Proteome Support Vector Enrichment (PROSE), a fast, scalable and lightweight pip...
Article
The robustness of a breast cancer gene signature, the super-proliferation set (SPS), is initially tested and investigated on breast cancer cell lines from the Cancer Cell Line Encyclopaedia (CCLE). Previously, SPS was derived via a meta-analysis of 47 independent breast cancer gene signatures, benchmarked on survival information from clinical data...
Article
Full-text available
Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising...
Article
Full-text available
Finding predictors of social and cognitive impairment in non-transition Ultra-High-Risk individuals (UHR) is critical in prognosis and planning of potential personalised intervention strategies. Social and cognitive functioning observed in youth at UHR for psychosis may be protective against transition to clinically relevant illness. The current st...
Preprint
Full-text available
Statistical analyses in high-dimensional omics data are often hampered by the presence of batch effects (BEs) and missing values (MVs), but the interaction between these two issues is neither well studied nor well understood. MVs may manifest as a BE when their proportions differ across batches. These are termed Batch-Effect Associated Missing values (BE...
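The idea that missing values can behave like a batch effect when their proportions differ across batches can be illustrated with a small sketch. This is not the authors' method: the function name `missing_rate_by_batch`, the use of `None` to mark a missing value, and the toy data are all assumptions made for illustration.

```python
def missing_rate_by_batch(values, batches):
    """Return the fraction of missing (None) entries per batch for one feature."""
    totals, missing = {}, {}
    for value, batch in zip(values, batches):
        totals[batch] = totals.get(batch, 0) + 1
        missing[batch] = missing.get(batch, 0) + (value is None)
    return {batch: missing[batch] / totals[batch] for batch in totals}


# Hypothetical feature measured in batch A but entirely absent from batch B.
feature = [1.2, 0.9, 1.1, None, None, None]
batches = ["A", "A", "A", "B", "B", "B"]

rates = missing_rate_by_batch(feature, batches)
# Batches in which the feature is missing for every sample (batch-wide missingness).
batch_wide = [batch for batch, rate in rates.items() if rate == 1.0]
```

A feature whose missing rate is 0 in one batch and 1 in another is the extreme case; intermediate differences in rates could be screened the same way with a threshold chosen by the analyst.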
Article
Some prediction methods use probability to rank their predictions, while others do not rank their predictions and instead use p-values to support them. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper...
Article
Full-text available
Sentiment Analysis (SA) is a category of data mining techniques that extract latent representations of affective states within textual corpora. This has wide-ranging applications from online reviews to capturing mental states. In this paper, we present a novel SA feature set, Emotional Variance Analysis (EVA), which captures patterns of emotional...
Article
Full-text available
Interpretable machine learning models for gene expression datasets are important for understanding the decision-making process of a classifier and gaining insights on the underlying molecular processes of genetic conditions. Interpretable models can potentially support early diagnosis before full disease manifestation. This is particularly importan...
Preprint
Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is...
Article
Full-text available
Functional doppelgängers (FDs) are independently derived sample pairs that confound machine learning (ML) model performance when assorted across training and validation sets. Here, we detail the use of doppelgangerIdentifier (DI), providing software installation, data preparation, doppelgänger identification, and functional testing steps. We demons...
Article
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapte...
Article
Motivation: Differentiating 12 stages of mouse seminiferous epithelial cycle is vital towards understanding the dynamic spermatogenesis process. However, it is challenging since two adjacent spermatogenic stages are morphologically similar. Distinguishing Stages I-III from Stages IV-V is important for histologists to understand sperm development i...
Article
Full-text available
The essential role of the Reelin gene (RELN) during brain development makes it a prominent candidate in human epigenetic studies of Schizophrenia. Previous literature has reported differing levels of DNA methylation (DNAm) in patients with psychosis. Therefore, this study aimed to (1) examine and compare RELN DNAm levels in subjects at different st...
Article
Full-text available
Background: Progesterone receptor (PGR) is a master regulator of uterine function through antagonistic and synergistic interplays with oestrogen receptors. PGR action is primarily mediated by activation functions AF1 and AF2, but their physiological significance is unknown. Results: We report the first study of AF1 function in mice. The AF1 mutant m...
Article
Full-text available
A scatterplot is often the graph of choice for displaying the relationship between two variables. Scatterplots are useful for exploratory analysis, but can do much more than just identifying correlations. As data sets get larger and more complex, relying on “eye power” alone may cause us to miss interesting associations, or worse, make wrong...
Article
Full-text available
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and what to apply batch corre...
Article
Full-text available
Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in “data holes”. These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. N...
Article
Full-text available
Doppelgänger effects occur when samples exhibit chance similarities such that, when split across training and validation sets, they inflate trained machine learning (ML) model performance. This inflationary effect causes misleading confidence in the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard pra...
Preprint
Full-text available
A scatterplot is often the graph of choice for displaying the relationship between two variables. Scatterplots are useful for exploratory analysis, but can do much more than just identifying correlations. As datasets get larger and more complex, relying on “eye power” alone may cause us to miss interesting associations, or worse, make wrong...
Article
Natural products (NPs) constitute a large reserve of bioactive compounds useful for drug development. Recent advances in high-throughput technologies facilitate functional analysis of therapeutic effects and NP-based drug discovery. However, the large amount of generated data is complex and difficult to analyze effectively. This limitation is incre...
Article
Full-text available
Machine learning (ML) is increasingly deployed on biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact...
Article
Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges. We first cover how BE modeling differs between tradi...
Preprint
Full-text available
Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are particularly endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This...
Article
Full-text available
We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper “PROTREC: A probability-based approach for recovering missing proteins based on biological networks” [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Ac...
Preprint
Full-text available
Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in “data holes”. These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples. This hinders biomarker and drug discovery from proteomics data. Network...
Preprint
Full-text available
Proteomic studies characterize the protein composition of complex biological samples. Despite recent developments in mass spectrometry instrumentation and computational tools, low proteome coverage remains a challenge. To address this, we present Proteome Support Vector Enrichment (PROSE), a fast, scalable, and effective pipeline for scoring protei...
Article
Full-text available
Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrou...
Article
A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods – such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA) – across a variety of proteomics datasets derived from...
Article
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently...
Chapter
Experiential learning is a key development area of artificial intelligence in education (AIEd). It aims to provide learners with intuitive environments for autonomous knowledge formation and discovery through interactive experiences. However, experiential learning in AIEd faces two main challenges. Firstly, measuring learning performances in unstru...
Article
Full-text available
Dendritic cells residing in the skin represent a large family of antigen presenting cells, ranging from long-lived Langerhans cells (LC) in the epidermis to various distinct classical dendritic cell subsets in the dermis. Through genetic fate mapping analysis and single cell RNA sequencing we have identified a novel separate population of LC-indepe...
Article
Full-text available
Data science is about deriving insight, learning and understanding from data. This process may be automated via the use of advanced algorithms or scaffolded cognitively via the use of graphs. While much emphasis is currently placed on machine learning, there is still much to learn about the role of the data scientist, in particular the thinking pro...
Preprint
Full-text available
Dendritic cells residing in the skin represent a large family of antigen presenting cells, ranging from long-lived Langerhans cells (LC) in the epidermis to various distinct classical dendritic cell subsets in the dermis. Through genetic fate mapping analysis and single cell RNA sequencing we have identified a novel separate population of LC-indepe...
Article
Full-text available
We discuss the validation of machine learning models, which is standard practice in determining model efficacy and generalizability. We argue that internal validation approaches, such as cross-validation and bootstrap, cannot guarantee the quality of a machine learning model due to potentially biased training data and the complexity of the validati...
Article
Full-text available
Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation) when applied blindly on whole data s...
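For readers unfamiliar with the technique, a minimal sketch of plain (whole-dataset) quantile normalization follows. It shows only the textbook procedure and is not the class-aware strategy the paper goes on to discuss; the function name `quantile_normalize`, the list-of-lists layout (one list per sample), and the toy values are assumptions for illustration.

```python
from statistics import mean


def quantile_normalize(samples):
    """Quantile-normalize equal-length samples so they share one distribution."""
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of features")
    # Sort each sample, then average across samples at each rank.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [mean(s[i] for s in sorted_samples) for i in range(n)]
    normalized = []
    for sample in samples:
        # Feature indices ordered by value (ties are broken by position).
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured over the same three features.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization every sample contains the same set of values, which is why applying the procedure blindly can erase genuine distributional differences between classes or batches.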
Article
Full-text available
Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning h...
Article
Full-text available
Membrane integrity at the endoplasmic reticulum (ER) is tightly regulated, and its disturbance is implicated in metabolic diseases. Using an engineered sensor that activates the unfolded protein response (UPR) exclusively when normal ER membrane lipid composition is compromised, we identified pathways beyond lipid metabolism that are necessary to m...
Conference Paper
Full-text available
To equip biology students with data literacy skills, this study investigates the utility of retooling the electronic programming tutorial system, Swirl, for teaching applied statistics in a cohort of biology students in a 2-phased study: Phase 1 involved administration of a tutorial-based course, pretest-posttest assessment and preliminary survey, wh...
Conference Paper
Full-text available
In Students-as-partner (SAP), students work in partnership with staff members in higher learning institutions to facilitate deeper learning in students by promoting student engagement. While SAP's impact on student consultants and staff members directly involved in partnership is generally well and widely researched, relatively little is reported a...
Preprint
Membrane integrity at the endoplasmic reticulum (ER) is tightly regulated and is implicated in metabolic diseases when compromised. Using an engineered sensor that exclusively activates the unfolded protein response (UPR) during aberrant ER membrane lipid composition, we identified pathways beyond lipid metabolism that are necessary to maintain ER...
Article
Living longer with sustainable quality of life is becoming increasingly important in aging populations. Understanding associative biological mechanisms has proven daunting because of multigenicity and population heterogeneity. Although Big Data and Artificial Intelligence (AI) could help, naïve adoption is ill-advised. We hold the view that model...
Article
Batch effects are technical sources of variation and can confound analysis. While many performance ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold the viewpoint that the notion of best is context-dependent. Moreover, alternative questions beyond the simplistic notion of "best" are also i...
Article
Cancer is a heterogeneous disease, confounding the identification of relevant markers and drug targets. Network-based analysis is robust against noise, potentially offering a promising approach towards biomarker identification. We describe here the application of two network-based methods, qPSP (Quantitative Proteomics Signature Profiling) and PFSN...
Article
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from new proteomics technology (SWATH) and also check for reproducibility using two independent datasets profiling kidney tissue proteome. We also evaluated...
Article
Mass spectrometry (MS)-based proteomics has undergone rapid advancements in recent years, creating challenging problems for bioinformatics. We focus on four aspects where bioinformatics plays a crucial role (and proteomics is needed for clinical application): peptide-spectra matching (PSM) based on the new data-independent acquisition (DIA) paradig...
Article
Full-text available
Overcoming multidrug resistance has always been a major challenge in cancer treatment. Recent evidence suggested epithelial-mesenchymal transition plays a role in MDR, but the mechanism behind this link remains unclear. We found that the expression of multiple ABC transporters was elevated in concordance with an increased drug efflux in cancer cell...
Article
Artificial intelligence (AI) is profoundly changing biotechnological innovation. Beyond direct application, it is also a useful tool for adaptive learning and forging new conceptual connections within the vast network of knowledge for the advancement of biotechnology. Here, we discuss a new paradigm for biotechnology education that involves coevolu...
Article
Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (...
Article
Random signature superiority (RSS) occurs when random gene signatures outperform published and/or known signatures. Unlike reproducibility and generalizability issues, RSS is relatively underexplored. Yet, understanding it is imperative for better analytical outcome. In breast cancer, RSS correlates strongly with enrichment for proliferation genes...
Article
The Anna Karenina effect is a manifestation of the theory-practice gap that exists when theoretical statistics are applied on real-world data. In the course of analyzing biological data for differential features such as genes or proteins, it derives from the situation where the null hypothesis is rejected for extraneous reasons (or confounders), ra...
Article
Statistical feature selection is used for identification of relevant genes from biological data, with implications for biomarker and drug development. Recent work demonstrates that the t-test p-value exhibits high sample-to-sample p-value variability accompanied by an exaggeration of effect size in the univariate scenario. To deepen understanding,...
Article
A missing protein (MP) is an unconfirmed genetic sequence for which a protein product is not yet detected. Currently, MPs are tiered based on supporting evidence mainly in the form of protein existence (PE) classification. As we discuss here, this definition is overly restrictive because proteins go missing in day-to-day proteomics as a result of l...
Article
Full-text available
The ultra-high risk (UHR) state was originally conceived to identify individuals at imminent risk of developing psychosis. Although recent studies have suggested that most individuals designated UHR do not, they constitute a distinctive group, exhibiting cognitive and functional impairments alongside multiple psychiatric morbidities. UHR characteri...
Article
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package N...
Article
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are s...
Article
Full-text available
Background: In proteomics, batch effects are technical sources of variation that confound proper analysis, preventing effective deployment in clinical and translational research. Results: Using simulated and real data, we demonstrate that existing batch effect-correction methods do not always eradicate all batch effects. Worse still, they may alter data i...
Article
Effective integration and analysis of new high-throughput data, especially gene-expression and proteomic-profiling data, are expected to deliver novel clinical insights and therapeutic options. Unfortunately, technical heterogeneity or batch effects (different experiment times, handlers, reagent lots, etc.) have proven challenging. Although batch e...
Article
The human cytochrome P450 (CYP) enzyme CYP4Z1 is a fatty acid hydroxylase which, among human CYPs, is unique for being much more strongly expressed in the mammary gland than in all other tissues. Moreover, it is strongly overexpressed in all subtypes of breast cancer, and some overexpression has also been found in other types of malignancies, such as ovar...
Article
Full-text available
Background: The hypergeometric enrichment analysis approach typically fares poorly in feature-selection stability due to its upstream reliance on the t-test to generate differential protein lists before testing for enrichment on a protein complex, subnetwork or gene group. Methods: Swapping the t-test in favour of a fuzzy rank-based weight system sim...
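The enrichment step described in the background can be sketched as a one-sided hypergeometric test. This shows only the conventional test applied after a differential protein list has been produced; it does not reproduce the fuzzy rank-based weighting proposed in the paper. The function name `hypergeom_enrichment_p` and the counts in the example are assumptions for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, group_size, selected, overlap):
    """P(X >= overlap) when `selected` proteins are drawn from `population`,
    of which `group_size` belong to the complex or gene group of interest."""
    upper = min(group_size, selected)
    total = comb(population, selected)
    favourable = sum(
        comb(group_size, i) * comb(population - group_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 20 proteins measured, 5 in the complex,
# 4 called differential, 2 of which fall in the complex.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
```

Because the test only sees which proteins crossed the upstream significance cut-off, small changes in the differential list can change the overlap count, which is one way to read the stability problem the abstract describes.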
Article
The brain adapts to dynamic environmental conditions by altering its epigenetic state, thereby influencing neuronal transcriptional programs. An example of an epigenetic modification is protein methylation, catalyzed by protein arginine methyltransferases (PRMT). One member, Prmt8, is selectively expressed in the central nervous system during a cru...
Article
In clinical proteomics, reproducible feature selection is unattainable given the standard statistical hypothesis-testing framework. This leads to irreproducible signatures with no diagnostic power. Instability stems from high P-value variability (p_var), which is inevitable and insolvable. The impact of p_var can be reduced via power increment, for...
Article
Full-text available
Despite the global impact of macrophage activation in vascular disease, the underlying mechanisms remain obscure. Here we show, with global proteomic analysis of macrophage cell lines treated with either IFNγ or IL-4, that PARP9 and PARP14 regulate macrophage activation. In primary macrophages, PARP9 and PARP14 have opposing roles in macrophage act...
Data
Supplementary Figures 1-20 and Supplementary Tables 1-4.
Article
In proteomics, useful signal may be unobserved or lost due to the lack of confident peptide-spectral matches. Selection of differential spectra, followed by associative peptide/protein mapping may be a complementary strategy for improving sensitivity and comprehensiveness of analysis (spectra-first paradigm). This approach is complementary to the s...
Article
Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue exists also in proteomics for commonly used feature-selection methods, e.g. t-test and recursive feature elimination. Mo...
Article
Despite advances in proteomic technologies, idiosyncratic data issues, for example, incomplete coverage and inconsistency, resulting in large data holes, persist. Moreover, because of naïve reliance on statistical testing and its accompanying p values, differential protein signatures identified from such proteomics data have little diagnostic power...
Article
Networks can resolve many analytical problems in proteomics, including incomplete coverage and inconsistency. Despite high expectations, network-related research in proteomics has experienced only modest growth. In practice, most current research examines non-quantitative usages, for example determining physical interactions among proteins or conte...
