Ting-Yi Sung

Taipei Medical University, Taipei, Taipei, Taiwan

Publications (81) · 141.2 Total impact

  • Article: A spectrum-based method to generate good decoy libraries for spectral library searching in peptide identifications.
    ABSTRACT: As spectral library searching has received increasing attention for peptide identification, constructing good decoy spectra from the target spectra is key to correctly estimating the false discovery rate when searching against the concatenated target-decoy spectral library. Several methods have been proposed to construct decoy spectral libraries. Most of them construct decoy peptide sequences and then generate theoretical spectra accordingly. In this paper, we propose a method, called precursor-swap, which constructs decoy spectral libraries directly at the "spectrum level" without generating decoy peptide sequences by swapping the precursors of two spectra selected according to a very simple rule. Our spectrum-based method does not require additional efforts to deal with ion types (e.g., a, b or c ions), fragment mechanism (e.g., CID, or ETD), or unannotated peaks, but preserves many spectral properties. The precursor-swap method is evaluated on different spectral libraries and the results of obtained decoy ratios show that it is comparable to other methods. Notably, it is efficient in time and memory usage for constructing decoy libraries. A software tool called Precursor-Swap-Decoy-Generation (PSDG) is publicly available for download at http://ms.iis.sinica.edu.tw/PSDG/.
    Journal of Proteome Research 04/2013; · 5.11 Impact Factor
  • Article: Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing
    ABSTRACT: Identification of subcellular localization in proteins is crucial to elucidate cellular processes and molecular functions in a cell. However, given a tremendous amount of sequence data generated in the post-genomic era, determining protein localization based on biological experiments can be expensive and time-consuming. Therefore, developing prediction systems to analyze uncharacterised proteins efficiently has played an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between nucleus and cytoplasm based on recognition of nuclear translocation signals, including nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization using sequence features, such as putative NLSs. However, it has been shown that prediction coverage based on the NLSs is very low. In addition, most existing approaches only attained prediction accuracy and Matthews correlation coefficient (MCC) of around 54%-70% and 0.250-0.380, respectively, on an independent test set. Moreover, no predictor can generate sequence motifs to characterize features of potential NESs, whose biological properties are not well understood from existing experimental studies.
    BMC Bioinformatics 12/2012; 13. · 2.75 Impact Factor
  • Article: Label-free Quantitative Proteomics and N-Glycoproteomics Analysis of KRAS-activated Human Bronchial Epithelial Cells.
    ABSTRACT: Mutational activation of KRAS promotes various malignancies, including lung adenocarcinoma. Knowledge of the molecular targets mediating the downstream effects of activated KRAS is limited. Here, we provide the KRAS target proteins and N-glycoproteins using human bronchial epithelial cells with and without the expression of activated KRAS (KRAS(V12)). Using an OFFGEL peptide fractionation and hydrazide method combined with subsequent LTQ-Orbitrap analysis, we identified 5713 proteins and 608 N-glycosites on 317 proteins in human bronchial epithelial cells. Label-free quantitation of 3058 proteins (≥2 peptides; coefficient of variation (CV) ≤ 20%) and 297 N-glycoproteins (CV ≤ 20%) revealed the differential regulation of 23 proteins and 14 N-glycoproteins caused by activated KRAS, including 84% novel ones. An informatics-assisted IPA-Biomarker® filter analysis prioritized some of the differentially regulated proteins (ALDH3A1, CA2, CTSD, DST, EPHA2, and VIM) and N-glycoproteins (ALCAM, ITGA3, and TIMP-1) as cancer biomarkers. Further, integrated in silico analysis of microarray repository data of lung adenocarcinoma clinical samples and cell lines containing KRAS mutations showed positive mRNA fold changes (p < 0.05) for 61% of the KRAS-regulated proteins, including biomarker proteins, CA2 and CTSD. The most significant discovery of the integrated validation is the down-regulation of FABP5 and PDCD4. A few validated proteins, including tumor suppressor PDCD4, were further confirmed as KRAS targets by shRNA-based knockdown experiments. Finally, the studies on KRAS-regulated N-glycoproteins revealed structural alterations in the core N-glycans of SEMA4B in KRAS-activated human bronchial epithelial cells and functional role of N-glycosylation of TIMP-1 in the regulation of lung adenocarcinoma A549 cell invasion. 
    Together, our study represents the largest proteome and N-glycoproteome data sets for HBECs, which we used to identify several novel potential targets of activated KRAS that may provide insights into KRAS-induced adenocarcinoma and have implications for both lung cancer therapy and diagnosis.
    Molecular & Cellular Proteomics 07/2012; 11(10):901-15. · 7.40 Impact Factor
  • Article: Computational comparative study of tuberculosis proteomes using a model learned from signal peptide structures.
    ABSTRACT: Secretome analysis is important in pathogen studies. A fundamental and convenient way to identify secreted proteins is to first predict signal peptides, which are essential for protein secretion. However, signal peptides are highly complex functional sequences that are easily confused with transmembrane domains. Such confusion would obviously affect the discovery of secreted proteins. Transmembrane proteins are important drug targets, but very few transmembrane protein structures have been determined experimentally; hence, prediction of the structures is essential. In the field of structure prediction, researchers do not make assumptions about organisms, so there is a need for a general signal peptide predictor. To improve signal peptide prediction without prior knowledge of the associated organisms, we present a machine-learning method, called SVMSignal, which uses biochemical properties as features, as well as features acquired from a novel encoding, to capture biochemical profile patterns for learning the structures of signal peptides directly. We tested SVMSignal and five popular methods on two benchmark datasets from the SPdb and UniProt/Swiss-Prot databases, respectively. Although SVMSignal was trained on an old dataset, it performed well, and the results demonstrate that learning the structures of signal peptides directly is a promising approach. We also utilized SVMSignal to analyze proteomes in the entire HAMAP microbial database. Finally, we conducted a comparative study of secretome analysis on seven tuberculosis-related strains selected from the HAMAP database. We identified ten potential secreted proteins, two of which are drug resistant and four are potential transmembrane proteins. SVMSignal is publicly available at http://bio-cluster.iis.sinica.edu.tw/SVMSignal. It provides user-friendly interfaces and visualizations, and the prediction results are available for download.
    PLoS ONE 01/2012; 7(4):e35018. · 4.09 Impact Factor
  • Article: Phosphoproteomic analysis of human mesenchymal stromal cells during osteogenic differentiation.
    ABSTRACT: Human mesenchymal stromal cells (hMSCs) are promising candidates for cell therapy and tissue regeneration. Knowledge of the molecular mechanisms governing hMSC commitment into osteoblasts is critical to the development of therapeutic applications for human bone diseases. Because protein phosphorylation plays a critical role in signal transduction networks, the purpose of this study is to elucidate the phosphoproteomic changes in hMSCs during early osteogenic lineage commitment. hMSCs cultured in osteogenic induction medium for 0, 1, 3, and 7 days were analyzed by liquid chromatography tandem mass spectrometry (LC-MS/MS). Surprisingly, we observed a dramatic loss of protein phosphorylation level after 1 day of osteogenic induction. Pathway analysis of these reduced phosphoproteins exhibited a high correlation with cell proliferation and protein synthesis pathways. During osteogenic differentiation, differentially expressed phosphoproteins demonstrated the dynamic alterations in cytoskeleton at the early stages of differentiation. The fidelity of our quantitative phosphoproteomic analyses was further confirmed by Western blot analyses, and the changes from protein expression or its phosphorylation level were distinguished. In addition, several ion channels and transcription factors with differentially expressed phosphorylation sites during osteogenic differentiation were identified and may serve as potentially unexplored transcriptional regulators of the osteogenic phenotype of hMSCs. Taken together, our results have demonstrated the dynamic changes in phosphoproteomic profiles of hMSCs during osteogenic differentiation and unraveled potential candidates mediating the osteogenic commitment of hMSCs. The findings in this study may also shed light on the development of new therapeutic targets for metabolic bone diseases such as osteoporosis and osteomalacia.
    Journal of Proteome Research 11/2011; 11(2):586-98. · 5.11 Impact Factor
  • Article: TMPad: an integrated structural database for helix-packing folds in transmembrane proteins.
    ABSTRACT: α-helical transmembrane (TM) proteins play an important role in many critical and diverse biological processes, and specific associations between TM helices are important determinants for membrane protein folding, dynamics and function. In order to gain insights into the above phenomena, it is necessary to investigate different types of helix-packing modes and interactions. However, such information is difficult to obtain because of the experimental impediment and a lack of a well-annotated source of helix-packing folds in TM proteins. We have developed the TMPad (TransMembrane Protein Helix-Packing Database) which addresses the above issues by integrating experimentally observed helix-helix interactions and related structural information of membrane proteins. Specifically, the TMPad offers pre-calculated geometric descriptors at the helix-packing interface including residue backbone/side-chain contacts, interhelical distances and crossing angles, helical translational shifts and rotational angles. The TMPad also includes the corresponding sequence, topology, lipid accessibility, ligand-binding information and supports structural classification, schematic diagrams and visualization of the above structural features of TM helix-packing. Through detailed annotations and visualizations of helix-packing, this online resource can serve as an information gateway for deciphering the relationship between helix-helix interactions and higher levels of organization in TM protein structure and function. The website of the TMPad is freely accessible to the public at http://bio-cluster.iis.sinica.edu.tw/TMPad.
    Nucleic Acids Research 01/2011; 39(Database issue):D347-55. · 8.03 Impact Factor
  • Article: Phosphoproteomics identifies oncogenic Ras signaling targets and their involvement in lung adenocarcinomas.
    ABSTRACT: Ras is frequently mutated in a variety of human cancers, including lung cancer, leading to constitutive activation of MAPK signaling. Despite decades of research focused on the Ras oncogene, Ras-targeted phosphorylation events and signaling pathways have not been described on a proteome-wide scale. By functional phosphoproteomics, we studied the molecular mechanics of oncogenic Ras signaling using a pathway-based approach. We identified Ras-regulated phosphorylation events (n = 77) using label-free comparative proteomics analysis of immortalized human bronchial epithelial cells with and without the expression of oncogenic Ras. Many were newly identified as potential targets of the Ras signaling pathway. A majority (∼60%) of the Ras-targeted events consisted of a [pSer/Thr]-Pro motif, indicating the involvement of proline-directed kinases. By integrating the phosphorylated signatures into the Pathway Interaction Database, we further inferred Ras-regulated pathways, including MAPK signaling and other novel cascades, in governing diverse functions such as gene expression, apoptosis, cell growth, and RNA processing. Comparisons of Ras-regulated phosphorylation events, pathways, and related kinases in lung cancer-derived cells supported a role of oncogenic Ras signaling in lung adenocarcinoma A549 and H322 cells, but not in large cell carcinoma H1299 cells. This study reveals phosphorylation events, signaling networks, and molecular functions that are regulated by oncogenic Ras. The results observed in this study may help extend our knowledge of Ras signaling in lung cancer.
    PLoS ONE 01/2011; 6(5):e20199. · 4.09 Impact Factor
  • Article: Improving the alignment quality of consistency based aligners with an evaluation function using synonymous protein words.
    ABSTRACT: Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches. In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset, and supported by evolutionary analysis. These words are then used to define a position specific substitution matrix that better reflects the biological significance of local similarity. The experiment results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
    SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.
    PLoS ONE 01/2011; 6(12):e27872. · 4.09 Impact Factor
  • Article: An informatics-assisted label-free quantitation strategy that depicts phosphoproteomic profiles in lung cancer cell invasion.
    ABSTRACT: Aberrant protein phosphorylation plays important roles in cancer-related cell signaling. With the goal of achieving multiplexed, comprehensive, and fully automated relative quantitation of site-specific phosphorylation, we present a simple label-free strategy combining an automated pH/acid-controlled IMAC procedure and informatics-assisted SEMI (sequence, elution time, mass-to-charge, and internal standard) algorithm. The SEMI strategy effectively increased the number of quantifiable peptides more than 4-fold in replicate experiments (from 262 to 1171, p < 0.05, false discovery rate = 0.46%) by using a fragmental regression algorithm for elution time alignment followed by peptide cross-assignment in all LC-MS/MS runs. In addition, the strategy demonstrated good quantitation accuracy (10-12%) for standard phosphoprotein and variation less than 1.9 fold (within 99% confidence range) in proteome scale and reliable linear quantitation correlation (R(2) = 0.99) with 4000-fold dynamic concentrations, which was attributed to our reproducible experimental procedure and informatics-assisted peptide alignment tool to minimize system variations. In an attempt to explore metastasis-associated phosphoproteomic alterations in lung cancer, this approach was used to delineate differential phosphoproteomic profiles of a lung cancer metastasis model. Without sample fractionation, the SEMI algorithm enabled quantification of 1796 unique phosphopeptides (false discovery rate = 0.56%) corresponding to 854 phosphoproteins from a series of non-small cell lung cancer lines with varying degrees of in vivo invasiveness. Nearly 40% of the phosphopeptides showed >2-fold change in highly invasive cells; validation of phosphoprotein subsets by Western blotting not only demonstrated the consistency of data obtained by our SEMI strategy but also revealed that such dramatic changes in the phosphoproteome result mostly from translational or post-translational regulation. 
    Mapping of these differentially expressed phosphoproteins in multiple cellular pathways related to cancer invasion and metastasis suggests that the site and degree of phosphorylation might have distinct patterns or functions in the complex process of cancer progression.
    Journal of Proteome Research 11/2010; 9(11):5582-97. · 5.11 Impact Factor
  • Article: GAPM–A Robust Algorithm for the Physical Mapping Problem
    ABSTRACT: A major challenge for next generation sequencing technology is genome assembly. A physical map could be used as a preliminary step towards genome sequencing in a hybrid approach. In this paper, we illustrate a robust physical mapping algorithm, GAPM, which could well complement the assembly of short fragments. The physical mapping problem (PMP) is to determine the relative positions of genetic markers (called probes) along the DNA sequences. The presence and absence of probes in clones can be represented by a 0-1 matrix with rows corresponding to clones and columns corresponding to probes. A 0-1 matrix satisfies the consecutive ones property (COP) for the rows if there exists a column permutation such that the ones in each row of the resulting matrix are consecutive. In the error-free case, the PMP can be reduced to testing the COP of a 0-1 matrix. Lu and Hsu proposed an iterative clustering algorithm to deal with the following four types of errors: false positives, false negatives, chimerical clones, and non-unique probes. In this paper, we present a novel genetic algorithm, called GAPM, with a much better performance. GAPM can be run in parallel and generate approximately optimal physical maps regardless of the error rates and matrix sizes. Moreover, GAPM is very flexible in dealing with unknown data. We test 9,000 different cases and compare GAPM with L&H's method. The results indicate that GAPM is more robust and reliable for most data.
    01/2010;
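The consecutive ones property defined in the abstract above can be illustrated with a minimal sketch. This is not GAPM (a genetic algorithm) nor the Lu and Hsu method; it is a brute-force check over all column permutations, suitable only for tiny error-free matrices. The function names `rows_are_consecutive` and `find_cop_ordering` are hypothetical and chosen here for illustration.

```python
from itertools import permutations


def rows_are_consecutive(matrix, column_order):
    """True if, with columns arranged in column_order, the ones in every row are contiguous."""
    for row in matrix:
        # Positions (in the permuted layout) where this clone contains a probe.
        positions = [i for i, col in enumerate(column_order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True


def find_cop_ordering(matrix):
    """Brute-force search for a column (probe) order satisfying the COP.

    Exponential in the number of columns; an illustration of the definition,
    not a practical physical-mapping algorithm.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_are_consecutive(matrix, order):
            return list(order)
    return None


# Rows are clones, columns are probes.
clones = [
    [1, 0, 1],
    [0, 1, 1],
]
print(find_cop_ordering(clones))
```

A matrix in which every pair of three probes shares a clone has no valid ordering, so the search returns `None` for it.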
  • Article: Improving protein secondary structure prediction based on short subsequences with local structure similarity.
    ABSTRACT: When characterizing the structural topology of proteins, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures because it represents the local conformation of amino acids into regular structures. Although PSS prediction has been studied for decades, the prediction accuracy reaches a bottleneck at around 80%, and further improvement is very difficult. In this paper, we present an improved dictionary-based PSS prediction method called SymPred, and a meta-predictor called SymPsiPred. We adopt the concept behind natural language processing techniques and propose synonymous words to capture local sequence similarities in a group of similar proteins. A synonymous word is an n-gram pattern of amino acids that reflects the sequence variation in a protein's evolution. We generate a protein-dependent synonymous dictionary from a set of protein sequences for PSS prediction. On a large non-redundant dataset of 8,297 protein chains (DsspNr-25), the average Q3 of SymPred and SymPsiPred are 81.0% and 83.9% respectively. On the two latest independent test sets (EVA Set_1 and EVA_Set2), the average Q3 of SymPred is 78.8% and 79.2% respectively. SymPred outperforms other existing methods by 1.4% to 5.4%. We study two factors that may affect the performance of SymPred and find that it is very sensitive to the number of proteins of both known and unknown structures. This finding implies that SymPred and SymPsiPred have the potential to achieve higher accuracy as the number of protein sequences in the NCBInr and PDB databases increases. Our experiment results show that local similarities in protein sequences typically exhibit conserved structures, which can be used to improve the accuracy of secondary structure prediction. For the application of synonymous words, we demonstrate an example of a sequence alignment which is generated by the distribution of shared synonymous words of a pair of protein sequences.
    We can nearly perfectly align the two sequences, which are very dissimilar at the sequence level but very similar at the structural level. The SymPred and SymPsiPred prediction servers are available at http://bio-cluster.iis.sinica.edu.tw/SymPred/.
    BMC Genomics 01/2010; 11 Suppl 4:S4. · 4.07 Impact Factor
  • Article: Automated generic analysis tools for protein quantitation using stable isotope labeling.
    Wen-Lian Hsu, Ting-Yi Sung
    ABSTRACT: Isotope labeling combined with LC-MS/MS provides a robust platform for quantitative proteomics. Protein quantitation based on mass spectral data falls into two categories: one determined by MS/MS scans, e.g., iTRAQ-labeling quantitation, and the other by MS scans, e.g., quantitation using SILAC, ICAT, or (18)O labeling. In large-scale LC-MS proteomic experiments, tens of thousands of MS and MS/MS spectra are generated and need to be analyzed. Data noise further complicates the data analysis. In this chapter, we present two automated tools, called Multi-Q and MaXIC-Q, for MS/MS- and MS-based quantitation analysis. They are designed as generic platforms that can accommodate search results from SEQUEST and Mascot, as well as mzXML files converted from raw files produced by various mass spectrometers. Toward accurate quantitation analysis, Multi-Q determines detection limits of the user's instrument to filter out outliers and MaXIC-Q adopts stringent validation on our constructed projected ion mass spectra to ensure correct data for quantitation.
    Methods in molecular biology (Clifton, N.J.) 01/2010; 604:257-72.
  • Article: IDEAL-Q, an automated tool for label-free quantitation analysis using an efficient peptide alignment approach and spectral data validation.
    ABSTRACT: In this study, we present a fully automated tool, called IDEAL-Q, for label-free quantitation analysis. It accepts raw data in the standard mzXML format as well as search results from major search engines, including Mascot, SEQUEST, and X!Tandem, as input data. To quantify as many identified peptides as possible, IDEAL-Q uses an efficient algorithm to predict the elution time of a peptide unidentified in a specific LC-MS/MS run but identified in other runs. Then, the predicted elution time is used to detect peak clusters of the assigned peptide. Detected peptide peaks are processed by statistical and computational methods and further validated by signal-to-noise ratio, charge state, and isotopic distribution criteria (SCI validation) to filter out noisy data. The performance of IDEAL-Q has been evaluated by several experiments. First, a serially diluted protein mixed with Escherichia coli lysate showed a high correlation with expected ratios and demonstrated good linearity (R(2) = 0.996). Second, in a biological replicate experiment on the THP-1 cell lysate, IDEAL-Q quantified 87% (1,672 peptides) of all identified peptides, surpassing the 45.7% (909 peptides) achieved by the conventional identity-based approach, which only quantifies peptides identified in all LC-MS/MS runs. Manual validation on all 11,940 peptide ions in six replicate LC-MS/MS runs revealed that 97.8% of the peptide ions were correctly aligned, and 93.3% were correctly validated by SCI. Thus, the mean of the protein ratio, 1.00 +/- 0.05, demonstrates the high accuracy of IDEAL-Q without human intervention. Finally, IDEAL-Q was applied again to the biological replicate experiment but with an additional SDS-PAGE step to show its compatibility for label-free experiments with fractionation. For flexible workflow design, IDEAL-Q supports different fractionation strategies and various normalization schemes, including multiple spiked internal standards. 
    User-friendly interfaces are provided to facilitate convenient inspection, validation, and modification of quantitation results. In summary, IDEAL-Q is an efficient, user-friendly, and robust quantitation tool. It is available for download.
    Molecular & Cellular Proteomics 09/2009; 9(1):131-44. · 7.40 Impact Factor
  • Article: MaXIC-Q Web: a fully automated web service using statistical and computational methods for protein quantitation based on stable isotope labeling and LC-MS.
    ABSTRACT: Isotope labeling combined with liquid chromatography-mass spectrometry (LC-MS) provides a robust platform for analyzing differential protein expression in proteomics research. We present a web service, called MaXIC-Q Web (http://ms.iis.sinica.edu.tw/MaXIC-Q_Web/), for quantitation analysis of large-scale datasets generated from proteomics experiments using various stable isotope-labeling techniques, e.g. SILAC, ICAT and user-developed labeling methods. It accepts spectral files in the standard mzXML format and search results from SEQUEST, Mascot and ProteinProphet as input. Furthermore, MaXIC-Q Web uses statistical and computational methods to construct two kinds of elution profiles for each ion, namely, PIMS (projected ion mass spectrum) and XIC (extracted ion chromatogram) from MS data. Toward accurate quantitation, a stringent validation procedure is performed on PIMSs to filter out peptide ions that suffer interference from co-eluting peptides or noise. The areas of XICs determine ion abundances, which are used to calculate peptide and protein ratios. Since MaXIC-Q Web adopts stringent validation on spectral data, it achieves high accuracy so that manual validation effort can be substantially reduced. Furthermore, it provides various visualization diagrams and comprehensive quantitation reports so that users can conveniently inspect quantitation results. In summary, MaXIC-Q Web is a user-friendly, interactive, robust, generic web service for quantitation based on ICAT and SILAC labeling techniques.
    Nucleic Acids Research 07/2009; 37(Web Server issue):W661-9. · 8.03 Impact Factor
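The abstract above states that XIC areas determine ion abundances, which are then used to compute ratios. A minimal sketch of that idea follows, assuming an XIC is given as paired lists of elution times and intensities and that the area is taken by the trapezoidal rule. The integration rule and the function names `xic_area` and `abundance_ratio` are assumptions made for illustration; the actual MaXIC-Q Web procedure (including its PIMS validation) is not reproduced here.

```python
def xic_area(times, intensities):
    """Area under an extracted ion chromatogram using the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def abundance_ratio(xic_a, xic_b):
    """Ratio of two ion abundances, each XIC given as a (times, intensities) pair."""
    area_b = xic_area(*xic_b)
    if area_b == 0:
        raise ValueError("reference XIC has zero area")
    return xic_area(*xic_a) / area_b


# Toy elution profiles for a light/heavy labeled ion pair (made-up numbers).
light = ([0.0, 1.0, 2.0], [0.0, 2.0, 0.0])
heavy = ([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(abundance_ratio(light, heavy))
```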
  • Article: Predicting helix-helix interactions from residue contacts in membrane proteins.
    ABSTRACT: MOTIVATION: Helix-helix interactions play a critical role in the structure assembly, stability and function of membrane proteins. On the molecular level, the interactions are mediated by one or more residue contacts. Although previous studies focused on helix-packing patterns and sequence motifs, few of them developed methods specifically for contact prediction. RESULTS: We present a new hierarchical framework for contact prediction, with an application in membrane proteins. The hierarchical scheme consists of two levels: in the first level, contact residues are predicted from the sequence and their pairing relationships are further predicted in the second level. Statistical analyses on contact propensities are combined with other sequence and structural information for training the support vector machine classifiers. Evaluated on 52 protein chains using leave-one-out cross validation (LOOCV) and an independent test set of 14 protein chains, the two-level approach consistently improves the conventional direct approach in prediction accuracy, with 80% reduction of input for prediction. Furthermore, the predicted contacts are then used to infer interactions between pairs of helices. When at least three predicted contacts are required for an inferred interaction, the accuracy, sensitivity and specificity are 56%, 40% and 89%, respectively. Our results demonstrate that a hierarchical framework can be applied to eliminate false positives (FP) while reducing computational complexity in predicting contacts. Together with the estimated contact propensities, this method can be used to gain insights into helix-packing in membrane proteins.
    Bioinformatics 03/2009; 25(8):996-1003. · 5.47 Impact Factor
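The last inference step described in the abstract above, calling a helix pair interacting when at least three predicted residue contacts link the two helices, can be sketched as follows. The contact predictions themselves come from the two-level SVM framework, which is not reproduced here; the input layout (residue-index pairs plus a residue-to-helix map) and the function name `infer_helix_interactions` are assumptions for illustration.

```python
from collections import Counter


def infer_helix_interactions(predicted_contacts, residue_to_helix, min_contacts=3):
    """Infer interacting helix pairs from predicted residue-residue contacts.

    predicted_contacts: iterable of (residue_i, residue_j) index pairs.
    residue_to_helix:   dict mapping residue index -> helix label.
    min_contacts:       contacts required before a helix pair is reported.
    """
    counts = Counter()
    for res_a, res_b in predicted_contacts:
        helix_a = residue_to_helix.get(res_a)
        helix_b = residue_to_helix.get(res_b)
        # Skip residues outside any helix and contacts within a single helix.
        if helix_a is None or helix_b is None or helix_a == helix_b:
            continue
        counts[tuple(sorted((helix_a, helix_b)))] += 1
    return {pair for pair, n in counts.items() if n >= min_contacts}


# Toy example: three contacts between H1 and H2, one between H1 and H3.
helix_of = {10: "H1", 11: "H1", 12: "H1", 40: "H2", 41: "H2", 44: "H2", 70: "H3"}
contacts = [(10, 40), (11, 41), (12, 44), (10, 70)]
print(infer_helix_interactions(contacts, helix_of))
```

Raising `min_contacts` trades sensitivity for specificity, which is consistent with the 40% sensitivity and 89% specificity the abstract reports at a threshold of three.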
  • Article: Protein subcellular localization prediction of eukaryotes using a knowledge-based approach.
    BMC Bioinformatics. 01/2009; 10:8.
  • Article: Protein subcellular localization prediction of eukaryotes using a knowledge-based approach.
    ABSTRACT: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most of the PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized into multiple subcellular organelles. Many studies have shown that proteins may simultaneously locate or move between different cellular compartments and be involved in different biological processes with different roles. In this study, we propose a knowledge based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on the local similarity, we can identify the "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consisted of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross validation experiments to evaluate KnowPredsite's performance. The experiment results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite.
    KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experiment results show that even though the sequence similarity is low, the local similarity is effective for prediction. Experiment results show that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, and that it shows a set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.
    BMC Bioinformatics 01/2009; 10 Suppl 15:S8. · 2.75 Impact Factor
  • Source
    Article: PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis.
    ABSTRACT: Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is considered as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weighting scheme of gapped-dipeptides is calculated according to a position-specific scoring matrix, which includes sequence evolutionary information. Then, PLSA is applied for feature reduction, and reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. It has been reported that there is a strong correlation between sequence homology and subcellular localization (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can be classified into low- or high-homology data sets. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, and it compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve a high precision at specified levels of recall rates. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% in precision, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that the specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Besides, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
    Proteins: Structure, Function, and Bioinformatics 09/2008; 72(2):693-710. · 3.39 Impact Factor
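    The gapped-dipeptide representation described in the PSLDoc abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain occurrence counts where the abstract describes PSSM-based weighting, it reads "separated by one or more positions" as one or more intervening residues, and the function name `gapped_dipeptides` and the `max_gap` parameter are hypothetical.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap=3):
    """Count residue pairs that have 1..max_gap residues between them.

    Each key is (first_residue, gap, second_residue). Plain counts are
    an assumption made for illustration; the abstract describes weights
    derived from a position-specific scoring matrix instead.
    """
    counts = Counter()
    n = len(sequence)
    for gap in range(1, max_gap + 1):
        # The second residue sits gap + 1 positions after the first.
        for i in range(n - gap - 1):
            counts[(sequence[i], gap, sequence[i + gap + 1])] += 1
    return counts


if __name__ == "__main__":
    for term, count in sorted(gapped_dipeptides("MKVLA").items()):
        print(term, count)
```

    In a pipeline like the one the abstract outlines, such term vectors would then go through feature reduction and classification; those steps are left out here.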
  • Source
    Article: Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function.
    ABSTRACT: The prediction of transmembrane (TM) helix and topology provides important information about the structure and function of a membrane protein. Due to the experimental difficulties in obtaining a high-resolution model, computational methods are highly desirable. In this paper, we present a hierarchical classification method using support vector machines (SVMs) that integrates selected features by capturing the sequence-to-structure relationship and developing a new scoring function based on membrane protein folding. The proposed approach is evaluated on low- and high-resolution data sets with cross-validation, and the topology (sidedness) prediction accuracy reaches as high as 90%. Our method is also found to correctly predict both the location of TM helices and the topology for 69% of the low-resolution benchmark set. We also test our method for discrimination between soluble and membrane proteins and achieve very low overall false positive (0.5%) and false negative rates (0 to approximately 1.2%). Lastly, the analysis of the scoring function suggests that the topogeneses of single-spanning and multispanning TM proteins have different levels of complexity, and the consideration of interloop topogenic interactions for the latter is the key to achieving better predictions. This method can facilitate the annotation of membrane proteomes to extract useful structural and functional information. It is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/SVMtop.
    Journal of Proteome Research 03/2008; 7(2):487-96. · 5.11 Impact Factor
  • Source
    Article: Predicting RNA-binding sites of proteins using support vector machines and evolutionary information.
    ABSTRACT: RNA-protein interaction plays an essential role in several biological processes, such as protein synthesis, gene expression, posttranscriptional regulation and viral infectivity. Identification of RNA-binding sites in proteins provides valuable insights for biologists. However, experimental determination of RNA-protein interaction remains time-consuming and labor-intensive. Thus, computational approaches for prediction of RNA-binding sites in proteins have become highly desirable. Extensive studies of RNA-binding site prediction have led to the development of several methods. However, these methods can yield low sensitivity as a trade-off for high specificity. We propose a method, RNAProB, which incorporates a new smoothed position-specific scoring matrix (PSSM) encoding scheme with a support vector machine model to predict RNA-binding sites in proteins. Besides the incorporation of evolutionary information from standard PSSM profiles, the proposed smoothed PSSM encoding scheme also considers the correlation and dependency from the neighboring residues for each amino acid in a protein. Experimental results show that smoothed PSSM encoding significantly enhances the prediction performance, especially for sensitivity. Using five-fold cross-validation, our method performs better than the state-of-the-art systems by 4.90%-6.83%, 0.88%-5.33%, and 0.10-0.23 in terms of overall accuracy, specificity, and Matthews correlation coefficient, respectively. Most notably, compared to other approaches, RNAProB significantly improves sensitivity by 7.0%-26.9% over the benchmark data sets. To prevent data overfitting, a three-way data split procedure is incorporated to estimate the prediction performance. Moreover, physicochemical properties and amino acid preferences of RNA-binding proteins are examined and analyzed. Our results demonstrate that the smoothed PSSM encoding scheme significantly enhances the performance of RNA-binding site prediction in proteins. This also supports our assumption that smoothed PSSM encoding can better resolve the ambiguity of discriminating between interacting and non-interacting residues by modelling the dependency from surrounding residues. The proposed method can be used in other research areas, such as DNA-binding site prediction, protein-protein interaction, and prediction of posttranslational modification sites.
    BMC Bioinformatics 02/2008; 9 Suppl 12:S6. · 2.75 Impact Factor
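    The RNAProB abstract above does not give the exact smoothing formula, so the following is only one plausible reading of "smoothed PSSM": each residue's score row is replaced by the sum of the rows inside a window centred on that residue, so neighbouring residues contribute to its encoding. The function name `smooth_pssm`, the `window` parameter, and the truncation of the window at the sequence ends are all assumptions made for illustration.

```python
def smooth_pssm(pssm, window=3):
    """Sum PSSM rows over a sliding window centred on each residue.

    pssm is a list of rows, one row of scores per residue. The window
    is truncated at the sequence ends (an assumption; zero-padding the
    ends would give the same sums).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(pssm)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # zip(*rows) walks the window column by column.
        smoothed.append([sum(col) for col in zip(*pssm[lo:hi])])
    return smoothed


if __name__ == "__main__":
    toy = [[1, 0], [2, 1], [3, -1]]  # 3 residues, 2 score columns (toy data)
    print(smooth_pssm(toy, window=3))
```

    With `window=1` the matrix is returned unchanged, which corresponds to a standard, unsmoothed PSSM encoding.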

Institutions

  • 2012
    • Taipei Medical University
      Taipei, Taipei, Taiwan
  • 1997–2012
    • Academia Sinica
      • Institute of Information Science
      • Institute of Earth Sciences
      Taipei, Taipei, Taiwan
  • 2007–2008
    • National Tsing Hua University
      Hsinchu, Taiwan, Taiwan
  • 2000
    • Chung Shan Institute of Science and Technology
      Taoyuan, Taiwan, Taiwan
  • 1998–2000
    • National Chiao Tung University
      • Department of Computer Science
      Hsinchu, Taiwan, Taiwan