
Wilson Wen Bin Goh, PhD
Nanyang Technological University
About
132 Publications
20,314 Reads
2,716 Citations
Additional affiliations
Position
- Researcher
Publications (132)
Batch effect associated missing values (BEAMs) are batch-wide missingness induced from the integration of data with different coverage of biomedical features. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through...
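The BEAM idea above — missingness that is batch-wide rather than sporadic — can be screened for with a simple per-batch missingness comparison. This is a hedged sketch, not the study's method; the function name and the 0.8 threshold are invented for illustration:

```python
# Illustrative screen for batch effect associated missing values (BEAMs):
# a feature that is almost fully missing in one batch but well observed in
# another is a BEAM candidate. Threshold is an assumption, not from the paper.

def beam_candidates(data, batches, threshold=0.8):
    """data: dict feature -> list of values (None = missing);
    batches: batch label per sample column."""
    batch_ids = sorted(set(batches))
    flagged = []
    for feature, values in data.items():
        miss = {}
        for b in batch_ids:
            vals = [v for v, lab in zip(values, batches) if lab == b]
            miss[b] = sum(v is None for v in vals) / len(vals)
        # batch-wide missingness in one batch, near-complete coverage in another
        if max(miss.values()) >= threshold and min(miss.values()) <= 1 - threshold:
            flagged.append(feature)
    return flagged

batches = ["A", "A", "A", "B", "B", "B"]
data = {
    "P1": [1.0, 1.2, 0.9, None, None, None],  # missing only in batch B -> BEAM
    "P2": [1.1, None, 1.0, 0.8, 1.3, 0.7],    # sporadic missingness -> not a BEAM
}
print(beam_candidates(data, batches))  # -> ['P1']
```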
We apply machine learning techniques to navigate the multifaceted landscape of schizophrenia. Our method entails the development of predictive models, emphasizing peripheral inflammatory biomarkers, which are classified into treatment response subgroups: antipsychotic-responsive, clozapine-responsive, and clozapine-resistant. The cohort comprises 1...
The “similarity of dissimilarities” is an emerging paradigm in biomedical science with significant implications for protein function prediction, machine learning (ML), and personalized medicine. In protein function prediction, recognizing dissimilarities alongside similarities provides a more detailed understanding of evolutionary processes, allowi...
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive co...
Distinguishing stable and fluctuating psychopathological features in young individuals at Ultra High Risk (UHR) for psychosis is challenging, but critical for building robust, accurate, early clinical detection and prevention capabilities. Over a 24-month period, 159 UHR individuals were assessed using the Positive and Negative Symptom Scale (PANSS...
Author summary
Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed insight on biological mechanisms, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on h...
Background
In this research study, we apply machine learning techniques to navigate the multifaceted landscape of schizophrenia. Our method entails the development of predictive models, emphasizing peripheral inflammatory biomarkers, which are classified into treatment response subgroups: antipsychotic-responsive, clozapine-responsive, and clozapin...
Motivation
Deep graph learning (DGL) has been widely employed in the realm of ligand-based virtual screening (LBVS). Within this field, a key hurdle is the existence of activity cliffs (ACs), where minor chemical alterations can lead to significant changes in bioactivity. In response, several DGL models have been developed to enhance ligand bioacti...
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflow...
The paper demonstrates for the first time that a brain-inspired spiking neural network (SNN) architecture can be used not only to learn spatio-temporal data, but also to extract fuzzy spatio-temporal rules from such data and to update these rules incrementally in a transfer learning mode. We propose a method, where a SNN model learns incrementally...
Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features, and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence and non-reprod...
This article summarizes the PROTREC method and investigates the impact that the different hyper‐parameters have on the task of missing protein prediction using PROTREC. We evaluate missing protein recovery rates using different PROTREC score selection approaches (MAX, MIN, MEDIAN, and MEAN), different PROTREC score thresholds, as well as different...
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabil...
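ProJect itself is not reproduced here. As a hedged illustration of the classical MVI baselines such methods are compared against, the sketch below shows mean imputation (suited to values missing at random) and minimum imputation (suited to left-censored, low-abundance values); the function name is invented for this example:

```python
# Two classic missing value imputation (MVI) baselines on a single column.
# None marks a missing value.

def impute(column, strategy="mean"):
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed) if strategy == "mean" else min(observed)
    return [fill if v is None else v for v in column]

col = [10.0, None, 6.0, 8.0]
print(impute(col, "mean"))  # -> [10.0, 8.0, 6.0, 8.0]
print(impute(col, "min"))   # -> [10.0, 6.0, 6.0, 8.0]
```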
In the process of identifying phenotype-specific or differentially expressed proteins from proteomic data, a standard workflow consists of five key steps: raw data quantification, expression matrix construction, matrix normalization, missing data imputation, and differential expression analysis. However, due to the availability of multiple options...
In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MV...
In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (a peptide whic...
Proteomic studies characterize the protein composition of complex biological samples. Despite recent advancements in mass spectrometry instrumentation and computational tools, low proteome coverage and interpretability remain challenges. To address this, we developed Proteome Support Vector Enrichment (PROSE), a fast, scalable and lightweight pip...
The robustness of a breast cancer gene signature, the super-proliferation set (SPS), is initially tested and investigated on breast cancer cell lines from the Cancer Cell Line Encyclopaedia (CCLE). Previously, SPS was derived via a meta-analysis of 47 independent breast cancer gene signatures, benchmarked on survival information from clinical data...
Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising...
Finding predictors of social and cognitive impairment in non-transition Ultra-High-Risk individuals (UHR) is critical in prognosis and planning of potential personalised intervention strategies. Social and cognitive functioning observed in youth at UHR for psychosis may be protective against transition to clinically relevant illness. The current st...
Statistical analyses in high-dimensional omics data are often hampered by the presence of batch effects (BEs) and missing values (MVs), but the interaction between these two issues is not well-studied nor understood. MVs may manifest as a BE when their proportions differ across batches. These are termed as Batch-Effect Associated Missing values (BE...
Some prediction methods use probability to rank their predictions, while some other prediction methods do not rank their predictions and instead use [Formula: see text]-values to support their predictions. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper...
Sentiment Analysis (SA) is a category of data mining techniques that extract latent representations of affective states within textual corpuses. This has wide-ranging applications, from online reviews to capturing mental states. In this paper, we present a novel SA feature set, Emotional Variance Analysis (EVA), which captures patterns of emotional...
Interpretable machine learning models for gene expression datasets are important for understanding the decision-making process of a classifier and gaining insights on the underlying molecular processes of genetic conditions. Interpretable models can potentially support early diagnosis before full disease manifestation. This is particularly importan...
Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is...
Functional doppelgängers (FDs) are independently derived sample pairs that confound machine learning (ML) model performance when assorted across training and validation sets. Here, we detail the use of doppelgangerIdentifier (DI), providing software installation, data preparation, doppelgänger identification, and functional testing steps. We demons...
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapte...
Motivation
Differentiating 12 stages of mouse seminiferous epithelial cycle is vital towards understanding the dynamic spermatogenesis process. However, it is challenging since two adjacent spermatogenic stages are morphologically similar. Distinguishing Stages I-III from Stages IV-V is important for histologists to understand sperm development i...
The essential role of the Reelin gene (RELN) during brain development makes it a prominent candidate in human epigenetic studies of Schizophrenia. Previous literature has reported differing levels of DNA methylation (DNAm) in patients with psychosis. Therefore, this study aimed to (1) examine and compare RELN DNAm levels in subjects at different st...
Background
Progesterone receptor (PGR) is a master regulator of uterine function through antagonistic and synergistic interplays with oestrogen receptors. PGR action is primarily mediated by activation functions AF1 and AF2, but their physiological significance is unknown.
Results
We report the first study of AF1 function in mice. The AF1 mutant m...
A scatterplot is often the graph of choice for displaying the relationship between two variables. Scatterplots are useful for exploratory analysis, but can do much more than just identify correlations. As data sets get larger and more complex, relying solely on “eye power” may cause us to miss interesting associations, or worse, make wrong...
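The point that summary statistics alone can mislead is classically made with Anscombe's quartet: datasets with nearly identical Pearson correlations but very different shapes, which only a scatterplot reveals. A minimal, dependency-free check using two of the quartet's datasets:

```python
import math

# Two of Anscombe's quartet datasets share x-values and have near-identical
# Pearson r (~0.816), yet one is roughly linear and the other is curved.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

print(round(pearson(x, y_linear), 3), round(pearson(x, y_curved), 3))  # both ~0.816
```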
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and what to apply batch corre...
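As a hedged illustration of the simplest possible batch effect correction step — per-feature, per-batch mean-centering, a location-only adjustment, not any of the methods evaluated in the article:

```python
import statistics

# Per-batch mean-centering: subtract each batch's mean so batches share a
# common location. Real BECAs also handle scale and covariates; this is a
# deliberately minimal baseline.

def center_by_batch(values, batches):
    means = {b: statistics.mean(v for v, lab in zip(values, batches) if lab == b)
             for b in set(batches)}
    return [v - means[lab] for v, lab in zip(values, batches)]

vals = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]   # batch B shifted by +10
batches = ["A", "A", "A", "B", "B", "B"]
print(center_by_batch(vals, batches))  # -> [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```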
Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in “data holes”. These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. N...
Doppelgänger effects occur when samples exhibit chance similarities such that, when they are split across training and validation sets, trained machine learning (ML) model performance is inflated. This inflationary effect causes misleading confidence in the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard pra...
A scatterplot is often the graph of choice for displaying the relationship between two variables. Scatterplots are useful for exploratory analysis, but can do much more than just identify correlations. As datasets get larger and more complex, relying solely on “eye power” may cause us to miss interesting associations, or worse, make wrong...
Natural products (NPs) constitute a large reserve of bioactive compounds useful for drug development. Recent advances in high-throughput technologies facilitate functional analysis of therapeutic effects and NP-based drug discovery. However, the large amount of generated data is complex and difficult to analyze effectively. This limitation is incre...
Machine learning (ML) is increasingly deployed on biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact...
Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges. We first cover how BE modeling differs between tradi...
Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are particularly endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This...
We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper “PROTREC: A probability-based approach for recovering missing proteins based on biological networks” [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Ac...
Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in “data holes”. These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples. This hinders biomarker and drug discovery from proteomics data. Network...
Proteomic studies characterize the protein composition of complex biological samples. Despite recent developments in mass spectrometry instrumentation and computational tools, low proteome coverage remains a challenge. To address this, we present Proteome Support Vector Enrichment (PROSE), a fast, scalable, and effective pipeline for scoring protei...
Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrou...
A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods – such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA) – across a variety of proteomics datasets derived from...
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently...
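Data doppelgänger screening in this line of work is built on pairwise sample correlation. The sketch below is a hedged minimal version: flag independently derived sample pairs whose Pearson correlation is suspiciously high. The function names and the 0.99 cutoff are illustrative assumptions, not the published tool's defaults:

```python
import math

# Flag sample pairs whose expression profiles are near-identical; such pairs
# inflate cross-validation estimates when split across train and validation.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def doppelganger_pairs(samples, cutoff=0.99):
    names = sorted(samples)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if pearson(samples[a], samples[b]) > cutoff]

samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.1, 3.0, 4.2],   # near-duplicate of s1
    "s3": [4.0, 1.0, 3.5, 0.5],
}
print(doppelganger_pairs(samples))  # -> [('s1', 's2')]
```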
Experiential learning is a key development area of artificial intelligence in education (AIEd). It aims to provide learners with intuitive environments for autonomous knowledge formation and discovery through interactive experiences. However, experiential learning in AIEd faces two main challenges. Firstly, measuring learning performances in unstru...
Dendritic cells residing in the skin represent a large family of antigen presenting cells, ranging from long-lived Langerhans cells (LC) in the epidermis to various distinct classical dendritic cell subsets in the dermis. Through genetic fate mapping analysis and single cell RNA sequencing we have identified a novel separate population of LC-indepe...
Data science is about deriving insight, learning and understanding from data. This process may be automated via the use of advanced algorithms or scaffolded cognitively via the use of graphs. While much emphasis is currently placed on machine learning, there is still much to learn about the role of the data scientist, in particular the thinking pro...
We discuss the validation of machine learning models, which is standard practice in determining model efficacy and generalizability. We argue that internal validation approaches, such as cross-validation and bootstrap, cannot guarantee the quality of a machine learning model due to potentially biased training data and the complexity of the validati...
Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation) when applied blindly on whole data s...
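Quantile normalization forces every sample to share one value distribution, which is exactly why class- or batch-specific signal can be averaged away when it is applied blindly. A minimal sketch of the procedure (ties are ignored for simplicity; real implementations average tied ranks):

```python
# Quantile normalization: sort each sample, average across samples at each
# rank to build a reference distribution, then map each value back to the
# reference value of its rank.

def quantile_normalize(columns):
    """columns: list of equal-length lists, one per sample (no ties assumed)."""
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    ref = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by ascending value
        new = [0.0] * n
        for rank, idx in enumerate(order):
            new[idx] = ref[rank]
        out.append(new)
    return out

a, b = [5.0, 2.0, 3.0], [4.0, 1.0, 9.0]
print(quantile_normalize([a, b]))  # -> [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]
```

After normalization both samples contain exactly the reference values {1.5, 3.5, 7.0}, only in different positions — the within-sample ordering is preserved, the distribution is not.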
Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning h...
Membrane integrity at the endoplasmic reticulum (ER) is tightly regulated, and its disturbance is implicated in metabolic diseases. Using an engineered sensor that activates the unfolded protein response (UPR) exclusively when normal ER membrane lipid composition is compromised, we identified pathways beyond lipid metabolism that are necessary to m...
To equip biology students with data literacy skills, this study investigates the utility of retooling the electronic programming tutorial system, Swirl, for teaching applied statistics in a cohort of biology students in a 2-phased study: Phase 1 involved administration of tutorial-based course, pretest-posttest assessment and preliminary survey, wh...
In Students-as-partners (SAP), students work in partnership with staff members in higher learning institutions to facilitate deeper learning in students by promoting student engagement. While SAP's impact on the student consultants and staff members directly involved in the partnership is widely researched, relatively little is reported a...
Membrane integrity at the endoplasmic reticulum (ER) is tightly regulated and is implicated in metabolic diseases when compromised. Using an engineered sensor that exclusively activates the unfolded protein response (UPR) during aberrant ER membrane lipid composition, we identified pathways beyond lipid metabolism that are necessary to maintain ER...
Living longer with a sustainable quality of life is becoming increasingly important in aging populations. Understanding the associated biological mechanisms has proven daunting because of multigenicity and population heterogeneity. Although Big Data and Artificial Intelligence (AI) could help, naïve adoption is ill advised. We hold the view that model...
Batch effects are technical sources of variation and can confound analysis. While many performance ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold the viewpoint that the notion of best is context-dependent. Moreover, alternative questions beyond the simplistic notion of "best" are also i...
Cancer is a heterogeneous disease, confounding the identification of relevant markers and drug targets. Network-based analysis is robust against noise, potentially offering a promising approach towards biomarker identification. We describe here the application of two network-based methods, qPSP (Quantitative Proteomics Signature Profiling) and PFSN...
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from new proteomics technology (SWATH) and also checked for reproducibility using two independent datasets profiling kidney tissue proteome. We also evaluated...
Mass spectrometry (MS)-based proteomics has undergone rapid advancements in recent years, creating challenging problems for bioinformatics. We focus on four aspects where bioinformatics plays a crucial role (and proteomics is needed for clinical application): peptide-spectrum matching (PSM) based on the new data-independent acquisition (DIA) paradig...
Overcoming multidrug resistance (MDR) has always been a major challenge in cancer treatment. Recent evidence suggests that epithelial-mesenchymal transition plays a role in MDR, but the mechanism behind this link remains unclear. We found that the expression of multiple ABC transporters was elevated in concordance with an increased drug efflux in cancer cell...
Artificial intelligence (AI) is profoundly changing biotechnological innovation. Beyond direct application, it is also a useful tool for adaptive learning and forging new conceptual connections within the vast network of knowledge for the advancement of biotechnology. Here, we discuss a new paradigm for biotechnology education that involves coevolu...
Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (...
Random signature superiority (RSS) occurs when random gene signatures outperform published and/or known signatures. Unlike reproducibility and generalizability issues, RSS is relatively underexplored. Yet, understanding it is imperative for better analytical outcome. In breast cancer, RSS correlates strongly with enrichment for proliferation genes...
The Anna Karenina effect is a manifestation of the theory-practice gap that exists when theoretical statistics are applied on real-world data. In the course of analyzing biological data for differential features such as genes or proteins, it derives from the situation where the null hypothesis is rejected for extraneous reasons (or confounders), ra...
Statistical feature selection is used for identification of relevant genes from biological data, with implications for biomarker and drug development. Recent work demonstrates that the t-test p-value exhibits high sample-to-sample p-value variability accompanied by an exaggeration of effect size in the univariate scenario. To deepen understanding,...
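The sample-to-sample variability described above can be illustrated by repeatedly subsampling two fixed populations and watching the test statistic swing. All numbers below are illustrative assumptions (effect size, sample size, seed), and a Welch-style t statistic stands in for the full p-value computation:

```python
import random
import statistics

# The same two populations, repeatedly subsampled at n=10, give widely
# varying t-statistics (and hence p-values) despite a fixed true effect.

def t_stat(a, b):
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

rng = random.Random(0)
pop_a = [rng.gauss(0.5, 1.0) for _ in range(1000)]  # small true mean shift
pop_b = [rng.gauss(0.0, 1.0) for _ in range(1000)]

stats = [t_stat(rng.sample(pop_a, 10), rng.sample(pop_b, 10)) for _ in range(200)]
print(min(stats), max(stats))  # wide spread despite an unchanging true effect
```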
A missing protein (MP) is an unconfirmed genetic sequence for which a protein product is not yet detected. Currently, MPs are tiered based on supporting evidence mainly in the form of protein existence (PE) classification. As we discuss here, this definition is overly restrictive because proteins go missing in day-to-day proteomics as a result of l...
The ultra-high risk (UHR) state was originally conceived to identify individuals at imminent risk of developing psychosis. Although recent studies have suggested that most individuals designated UHR do not develop psychosis, they constitute a distinctive group, exhibiting cognitive and functional impairments alongside multiple psychiatric morbidities. UHR characteri...
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package N...
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are s...
Background
In proteomics, batch effects are technical sources of variation that confound proper analysis, preventing effective deployment in clinical and translational research.
Results
Using simulated and real data, we demonstrate that existing batch effect-correction methods do not always eradicate all batch effects. Worse still, they may alter data i...
Effective integration and analysis of new high-throughput data, especially gene-expression and proteomic-profiling data, are expected to deliver novel clinical insights and therapeutic options. Unfortunately, technical heterogeneity or batch effects (different experiment times, handlers, reagent lots, etc.) have proven challenging. Although batch e...
The human cytochrome P450 (CYP) enzyme CYP4Z1 is a fatty acid hydroxylase which among human CYPs is unique for being much stronger expressed in the mammary gland than in all other tissues. Moreover, it is strongly overexpressed in all subtypes of breast cancer, and some overexpression has also been found in other types of malignancies, such as ovar...
Background
The hypergeometric enrichment analysis approach typically fares poorly in feature-selection stability due to its upstream reliance on the t-test to generate differential protein lists before testing for enrichment on a protein complex, subnetwork or gene group.
Methods
Swapping the t-test in favour of a fuzzy rank-based weight system sim...
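The enrichment step referenced above reduces to a hypergeometric tail probability: the chance of seeing at least k members of a complex in a differential list of size n, drawn from N proteins of which K belong to the complex. A generic sketch with toy numbers (not values from the study):

```python
from math import comb

# Upper-tail hypergeometric probability P(X >= k): the standard over-
# representation test for a protein complex or gene group.

def hypergeom_enrichment_p(N, K, n, k):
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy numbers: 100 proteins total, complex of 10, differential list of 10,
# overlap of 5 -- far more than the ~1 expected by chance.
p = hypergeom_enrichment_p(N=100, K=10, n=10, k=5)
print(p < 0.01)  # strong enrichment
```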
The brain adapts to dynamic environmental conditions by altering its epigenetic state, thereby influencing neuronal transcriptional programs. An example of an epigenetic modification is protein methylation, catalyzed by protein arginine methyltransferases (PRMT). One member, Prmt8, is selectively expressed in the central nervous system during a cru...
In clinical proteomics, reproducible feature selection is unattainable given the standard statistical hypothesis-testing framework. This leads to irreproducible signatures with no diagnostic power. Instability stems from high P-value variability (p_var), which is inevitable and insolvable. The impact of p_var can be reduced via power increment, for...
Despite the global impact of macrophage activation in vascular disease, the underlying mechanisms remain obscure. Here we show, with global proteomic analysis of macrophage cell lines treated with either IFNγ or IL-4, that PARP9 and PARP14 regulate macrophage activation. In primary macrophages, PARP9 and PARP14 have opposing roles in macrophage act...
Supplementary Figures 1-20 and Supplementary Tables 1-4.
In proteomics, useful signal may be unobserved or lost due to the lack of confident peptide-spectral matches. Selection of differential spectra, followed by associative peptide/protein mapping may be a complementary strategy for improving sensitivity and comprehensiveness of analysis (spectra-first paradigm). This approach is complementary to the s...
Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue also exists in proteomics for commonly used feature-selection methods, e.g. the t-test and recursive feature elimination. Mo...
Despite advances in proteomic technologies, idiosyncratic data issues, for example, incomplete coverage and inconsistency, resulting in large data holes, persist. Moreover, because of naïve reliance on statistical testing and its accompanying p values, differential protein signatures identified from such proteomics data have little diagnostic power...
Networks can resolve many analytical problems in proteomics, including incomplete coverage and inconsistency. Despite high expectations, network-related research in proteomics has experienced only modest growth. In practice, most current research examines non-quantitative usages, for example determining physical interactions among proteins or conte...