Article

Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on

Abstract

Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses support vector machines trained on multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89%, which, to the best of our knowledge, is the highest prediction rate ever reported. Our prediction accuracy is 14% higher than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data.
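
As an illustration of the feature vectors mentioned in the abstract, the sketch below computes n-peptide compositions: the normalized frequency of every possible run of n consecutive residues in a sequence. The function names, the choice of n = 1 and n = 2, and the handling of non-standard residues are assumptions made for this example. The abstract does not specify which orders of n were used or how the vectors were passed to the support vector machines.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def n_peptide_composition(sequence, n=1):
    """Return the normalized frequency of every possible n-peptide.

    The vector has 20**n entries, ordered as itertools.product yields them
    (for n=2: AA, AC, AD, ...).
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:  # windows with non-standard residues are skipped
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


def composition_features(sequence, orders=(1, 2)):
    """Build one feature vector per n-peptide order (hypothetical helper)."""
    return {n: n_peptide_composition(sequence, n) for n in orders}


# Arbitrary example sequence, used only to show the vector sizes.
features = composition_features("MKKTAIAIAVALAGFATVAQA")
print({n: len(v) for n, v in features.items()})
```

For n = 1 the vector has 20 entries and for n = 2 it has 400, so the dimensionality grows as 20^n. Each vector could then be used to train a separate classifier or be combined with the others. Which of these the authors did is not stated here.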

... Optimizing w is equivalent to minimizing Eq. (2). (18) ...
... The basic SVM is widely used in binary classification problems, and it divides one side into a positive class and the other side into a negative class around a hyperplane. (18) The most basic idea of SVM is that data consist of two classes (positive and negative classes). The goal is to find the hyperplane that best separates them. ...
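
For reference, the separating-hyperplane idea in this excerpt is usually written as the textbook hard-margin SVM problem below. This is the standard formulation, given here as an assumption. It is not necessarily the Eq. (2) that the citing paper refers to, which is not reproduced in the excerpt. The symbols are the usual ones: training vectors x_i with labels y_i in {+1, -1}, weight vector w, and offset b.

```latex
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,N

\hat{y}(\mathbf{x}) = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
```

The first expression finds the hyperplane with the largest margin between the two classes. The second assigns a new point to the positive or negative class according to which side of the hyperplane it falls on.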
... The exact position of these proteins in the cell was predicted using CELLO version 2.5 (subCELlular LOcalization predictor version 2.5) [28]. CELLO applies the amino acid composition and di-peptide composition based on physicochemical parameters of amino acids to predict the subcellular position of the proteins [28]. The gram-positive bacterial proteins have the following localization sites: the cell membrane, the cytoplasm, the extracellular space and the cell wall. ...
Article
Full-text available
We discover essential enzymes catalyzing critical metabolic reactions as potential drug targets, which may help to fight Listeria infections and their associated secondary infections extensively and effectively. A comparative metabolic pathway approach has been applied to identify and determine putative drug targets against Listeria monocytogenes. For this, enzymes unique to pathogenic pathways of L. monocytogenes EGD-e were determined using the KEGG database. They were further refined by selecting enzymes with sequences non-homologous to the host Homo sapiens and analysing their essentiality to the pathogen’s survival. We report 15 essential pathogen-host non-homologous proteins as putative drug targets that can be exploited for development of specific drug targets or vaccines against multidrug resistant strains of L. monocytogenes. Finally, four essential enzymes from the pathogen: UDP-N-acetylglucosamine 1-carboxyvinyltransferase, Acetate kinase, Phosphate acetyltransferase, and Aspartate kinase were reported as novel putative targets for vaccine and drug development against L. monocytogenes infections. Unravelling novel target proteins and their associated pathways by comparing metabolic pathway analysis between L. monocytogenes EGD-e and host H. sapiens, develops the novelty of the work towards broad spectrum putative drug targets. This research design yields putative drug target critical enzymes that turn out to be fatal to the pathogen without interacting with the host machinery.
... The FASTA sequence is analysed with CELLO version 2.5 [27] to predict the subcellular localization, which provides an overview of the biological role of a specific protein [26]. ...
... The enzyme is anticipated to be located in the cytoplasm as per the analysis by CELLO version 2.5 [27]. The tool computes an evaluation score of 2.193, which represents the reliability of the obtained results. ...
Article
A vital step in drug discovery is structural and interaction assessment. L-asparaginase is a widely used compound in various treatment regimens, including combination therapies. Its application is not limited to treating acute lymphoblastic leukemia; it is also effective against myeloid leukemia and canine lymphoma. This study describes the L-asparaginase of Streptomyces koyangensis (WP_203215894). Structural insights were obtained with GalaxyTBM. The constructed 3D model is based on L-asparaginase models from Escherichia coli, Wolinella succinogenes and Helicobacter pylori. The Ramachandran assessment places 91.4% of residues in the favourable region, while ProSA gives a Z-score of -8.9. Further, a QMEAN4 score of -0.46, an ERRAT score of 92.903, and 100% of residues in agreement with Verify3D indicate that the quality of the model is good and acceptable. The protein is predicted to be located in the cytoplasm by CELLO version 2.5. Docking of L-asparaginase (WP_203215894) with L-asparagine using PyRx gave a lowest energy of -4.5 kcal/mol and identified prominent active sites. CB-Dock produced similar predictions, with energies of -4.6 and -4.8 kcal/mol. The binding conformation with an affinity of -4.5 kcal/mol is also supported by GalaxySite, which identified the same residues as involved in ligand interaction. SwissDock reported a higher affinity of -9.42 kcal/mol. The RMSF plot generated by CABS flex 2.0 as part of the protein flexibility analysis showed varied fluctuations; however, the crucial residues interacting with the ligand have RMSF values < 2, which indicates rigidity. The therapeutic effectiveness of the enzyme has created interest in tracing prospective sources and conducting computational analysis. The predicted structure of WP_203215894 and the derived inferences contribute valuable information for future in vitro studies of L-asparaginase.
... Prediction of protein localization was done through Cello server. It uses a hybrid approach i.e., support vector machines model and a structural homology approach for localization prediction [10]. SignalP 5.0 was used for predicting signal peptide and cleavage site in a protein's sequence. ...
... Cello server does not rely solely on the homology of the sequences but on the combination of two-level support vector machine classifiers to determine the subcellular location, and thus reduces the bias while increasing the accuracy [10]. Among 398 uncharacterized proteins, most of the proteins (74%) were predicted to be localized in the cytoplasm, whereas 12% and 7% of proteins were localized in the inner and outer membrane of the cell, respectively. ...
Article
Full-text available
Fusobacterium nucleatum is a gram-negative bacterium associated with diverse infections like appendicitis and colorectal cancer. It mainly attacks the epithelial cells in the oral cavity and throat of the infected individual. It has a single circular genome of 2.7 Mb. Many proteins in the F. nucleatum genome are listed as “Uncharacterized.” Annotation of these proteins is crucial for obtaining new facts about the pathogen and deciphering the gene regulation, functions, and pathways along with discovery of novel target proteins. In the light of new genomic information, an armoury of bioinformatic tools was used for predicting the physicochemical parameters, domain and motif search, pattern search, and localization of the uncharacterized proteins. Receiver operating characteristic analysis estimated the efficacy of the databases employed for predicting the different parameters at 83.6%. Functions were successfully assigned to 46 uncharacterized proteins, which included enzymes, transporter proteins, membrane proteins, binding proteins, etc. Apart from the function prediction, the proteins were also subjected to STRING analysis to reveal the interacting partners. The annotated proteins were also put through homology-based structure prediction and modeling using Swiss PDB and Phyre2 servers. Two probable virulence factors were also identified which could be investigated further for potential drug-related studies. The assigning of functions to uncharacterized proteins has shown that some of these proteins are important for cell survival inside the host and can act as effective drug targets.
... Subsequently, the proteins identified through CELLO were subjected to further filtration based on their virulence properties, utilizing Island Viewer 4. (https://www.pathogenomics.sfu.ca/islandviewer/) [15]. ...
Article
Full-text available
Shigella dysenteriae, is a Gram-negative bacterium that emerged as the second most significant cause of bacillary dysentery. Antibiotic treatment is vital in lowering Shigella infection rates, yet the growing global resistance to broad-spectrum antibiotics poses a significant challenge. The persistent multidrug resistance of S. dysenteriae complicates its management and control. Hence, there is an urgent requirement to discover novel therapeutic targets and potent medications to prevent and treat this disease. Therefore, the integration of bioinformatics methods such as subtractive and comparative analysis provides a pathway to compute the pan-genome of S. dysenteriae. In our study, we analysed a dataset comprising 27 whole genomes. The S. dysenteriae strain SD197 was used as the reference for determining the core genome. Initially, our focus was directed towards the identification of the proteome of the core genome. Moreover, several filters were applied to the core genome, including assessments for non-host homology, protein essentiality, and virulence, in order to prioritize potential drug targets. Among these targets were Integration host factor subunit alpha and Tyrosine recombinase XerC. Furthermore, four drug-like compounds showing potential inhibitory effects against both target proteins were identified. Subsequently, molecular docking analysis was conducted involving these targets and the compounds. This initial study provides the list of novel targets against S. dysenteriae. Conclusively, future in vitro investigations could validate our in-silico findings and uncover potential therapeutic drugs for combating bacillary dysentery infection.
... Assembled genomes were annotated by Bakta v1.8.1 [26] and the pan-genome was reconstructed using ROARY v3.13 [27]. Protein cellular localization and topology were predicted by CELLO2GO v.2.5 [28] and DeepTMHMM v1.0.24 [29], both accessed on 30 September 2023. Cogclassifier v1.0.5 [30] was used for functional classification of the pan-genome. ...
Article
Full-text available
Aeromonas spp. are commonly found in the aquatic environment and have been responsible for motile Aeromonas septicemia (MAS) in striped catfish, resulting in significant economic loss. These organisms also cause a range of opportunistic infections in humans with compromised immune systems. Here, we conducted a genomic investigation of 87 Aeromonas isolates derived from diseased catfish, healthy catfish and environmental water in catfish farms affected by MAS outbreaks in eight provinces in the Mekong Delta (years: 2012–2022), together with 25 isolates from humans with bloodstream infections (years: 2010–2020). A genomics-based typing method precisely delineated Aeromonas species, while traditional methods such as aerA PCR and MALDI-TOF were unable to identify A. dhakensis. A. dhakensis was found to be more prevalent than A. hydrophila in both diseased catfish and human infections. A. dhakensis sequence type (ST) 656 followed by A. hydrophila ST251 were the predominant virulent species-lineages in diseased catfish (43.7 and 20.7%, respectively), while diverse STs were found in humans with bloodstream infections. There was evidence of widespread transmission of ST656 and ST251 on striped catfish in the Mekong Delta region. ST656 and ST251 isolates carried a significantly higher number of acquired antimicrobial resistance (AMR) genes and virulence factors in comparison to other STs. They, however, exhibited several distinctions in key virulence factors (i.e. lack of type IV pili and enterotoxin ast in A. dhakensis), AMR genes (i.e. presence of imiH carbapenemase in A. dhakensis), and accessory gene content. To uncover potential conserved proteins of Aeromonas spp. for vaccine development, pangenome analysis unveiled 2202 core genes shared between ST656 and ST251, of which 78 encoded either outer membrane or extracellular proteins. Our study represents one of the first genomic investigations of the species distribution, genetic landscape, and epidemiology of Aeromonas in diseased catfish and human infections in Vietnam. The emergence of antimicrobial resistant and virulent A. dhakensis strains underscores the need for enhanced genomic surveillance and strengthened vaccine research and development to prevent Aeromonas diseases in catfish and humans, and the search for potential vaccine candidates could focus on Aeromonas core genes encoding membrane and secreted proteins.
... edu. tw/)47,48 . Phylogenetic trees of each gene between referenced species ...
Article
Full-text available
Carotenoids play essential roles in plant growth and development and provide plants with a tolerance to a series of abiotic stresses. In this study, the function and biological significance of lycopene β-cyclase, lycopene ε-cyclase, and β-carotene hydroxylase, which are responsible for the modification of the tetraterpene skeleton procedure, were isolated from Lycium chinense and analyzed. The overexpression of lycopene β-cyclase, lycopene ε-cyclase, and β-carotene hydroxylase promoted the accumulation of total carotenoids and photosynthesis enhancement, reactive oxygen species scavenging activity, and proline content of tobacco seedlings after exposure to the salt stress. Furthermore, the expression of the carotenoid biosynthesis genes and stress-related genes (ascorbate peroxidase, catalase, peroxidase, superoxide dismutase, and pyrroline-5-carboxylate reductase) were detected and showed increased gene expression level, which were strongly associated with the carotenoid content and reactive oxygen species scavenging activity. After exposure to salt stress, the endogenous abscisic acid content was significantly increased and much higher than those in control plants. This research contributes to the development of new breeding aimed at obtaining stronger salt tolerance plants with increased total carotenoids and vitamin A content.
... Subcellular localization of cucumber TFGs were estimated through CELLO (http://cello.life.nctu.edu.tw/) (Yu et al., 2004). Furthermore, the chromosomal location of cucumber TFGs was visualized using TBtools. ...
... Sub-cellular localization of essential and virulent proteins is important for predicting successful vaccine candidates. For this purpose, PSORTb [28], CELLO v2.5 [29], and BUSCA [30] subcellular localization servers were used, and the extracellular and membrane proteins were subjected to downstream analysis [31]. ...
Article
Yersinia pestis, the causative agent of plague, is a gram-negative bacterium that can be fatal if not treated properly. Three types of plague are currently known: bubonic, septicemic, and pneumonic plague, among which the fatality rate of septicemic and pneumonic plague is very high. Bubonic plague can be treated, but only if antibiotics are used at the initial stage of the infection. Unfortunately, Y. pestis has also shown resistance to certain antibiotics such as kanamycin, minocycline, tetracycline, streptomycin, sulfonamides, spectinomycin, and chloramphenicol. Despite tremendous progress in vaccine development against Y. pestis, there is no proper FDA-approved vaccine available to protect people from its infections. Therefore, effective broad-spectrum vaccine development against Y. pestis is indispensable. In this study, vaccinomics-assisted immunoinformatics techniques were used to find possible vaccine candidates by utilizing the core proteome prepared from 58 complete genomes of Y. pestis. Human non-homologous, pathogen-essential, virulent, and extracellular and membrane proteins are potential vaccine targets. Two antigenic proteins were prioritized for the prediction of lead epitopes by utilizing reverse vaccinology approaches. Four vaccine designs were formulated using the selected B- and T-cell epitopes coupled with appropriate linkers and adjuvant sequences capable of inducing potent immune responses. The HLA allele population coverage of the T-cell epitopes selected for vaccine construction was also analyzed. The V2 construct was top-ranked and selected for further analysis on the basis of immunological, physicochemical, and immune-receptor docking interactions and scores. Docking and molecular dynamic simulations confirmed the stability of construct V2 interactions with the host immune receptors. Immune simulation analysis anticipated the strong immune profile of the prioritized construct. In silico restriction cloning ensured the feasible cloning ability of the V2 construct in the expression system of E. coli strain K12. It is anticipated that the designed vaccine construct may be safe, effective, and able to elicit strong immune responses against Y. pestis infections and may, therefore, merit investigation using in vitro and in vivo assays.
... The subcellular localization of the proteins with differential expression was assessed using the CELLO [20] program. Based on the investigation, the bulk of proteins were found in the mitochondria, cytoplasm, cell membrane, and nucleus. ...
Article
Full-text available
Necrotizing enterocolitis (NEC) is a common gastrointestinal complication in premature infants, resulting in high morbidity and mortality, and its early detection is crucial for accurate treatment and outcome prediction. Extensive research has demonstrated a clear correlation between NEC and extremely low birth weight, degree of prematurity, formula feeding, infection, hypoxic/ischemic damage, and intestinal dysbiosis. The development of noninvasive biomarkers of NEC from stool, urine, and serum has attracted a great deal of interest because of these clinical connections and the quest for a deeper knowledge of disease pathophysiology. Therefore, this study aims to identify protein expression patterns in NEC and discover innovative diagnostic biomarkers. In this study, we recruited five patients diagnosed with NEC and paired necrotic segments of intestinal tissue with adjacent normal segments of intestine to form experimental and control groups. A quantitative proteomics tandem mass tagging (TMT) labeling technique was used to detect and quantify the proteins, and the expression levels of the candidate biomarkers in the intestinal tissues were further determined by quantitative polymerase chain reaction (RT-qPCR), Western blot analysis, immunofluorescence methods and enzyme-linked immunosorbent assay (ELISA). A total of 6880 proteins were identified and quantified in patients with NEC. A significant disparity in protein expression was observed between necrotic and normal segments of intestinal tissue in NEC patients. A total of 55 proteins were found to be upregulated, and 40 proteins were found to be downregulated in NEC patients when using a p-value of < 0.05 and an absolute fold change of > 1.2 for analysis. GO function enrichment analysis showed the positive regulation of significant biological processes such as mitochondrial organization, vasoconstriction, rRNA catabolism, fluid shear stress response, and glycerol ether biosynthesis processes. Enrichment analysis also revealed essential molecular functions such as ligand-gated ion channel activity, potassium channel activity, ligand-gated cation channel activity, and ligand-gated channel activity, as well as mitotic events, in this comparative group. Significant changes were found in endomembrane protein complex, membrane fraction, mitochondrial membrane fraction, membrane components, membrane intrinsic components, and other localized proteins. Additional validation of intestinal tissue and serum revealed a substantial increase in TRAF6 (tumor necrosis factor receptor-associated factor 6) and IL-8 (Interleukin-8, CXCL8). The quantitative proteomic TMT method can effectively detect proteins with differential expression in the intestinal tissues of NEC patients. Proteins TRAF6 and CXCL8/IL-8 are significantly upregulated in the intestinal tissues and serum samples of patients and may serve as valuable predictive factors for the early diagnosis of NEC.
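
The cut-offs quoted in this abstract (p-value < 0.05 and absolute fold change > 1.2) can be read as a simple filter, sketched below. The sketch assumes that fold change is expressed as a necrotic/normal abundance ratio, so that "absolute fold change > 1.2" means a ratio above 1.2 (upregulated) or below 1/1.2 (downregulated). The abstract does not confirm this interpretation, and the function and parameter names are hypothetical.

```python
def classify_protein(fold_change, p_value, fc_cutoff=1.2, p_cutoff=0.05):
    """Label a protein as 'up', 'down' or 'unchanged'.

    fold_change is assumed to be a positive ratio (necrotic / normal).
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    if p_value >= p_cutoff:
        return "unchanged"  # not statistically significant
    if fold_change > fc_cutoff:
        return "up"
    if fold_change < 1.0 / fc_cutoff:
        return "down"
    return "unchanged"  # significant, but the change is too small


print(classify_protein(1.5, 0.01))  # ratio above 1.2 and significant
print(classify_protein(0.5, 0.01))  # ratio below 1/1.2 and significant
```

If the study reported log2 fold changes instead of ratios, the same logic would apply with the cut-offs moved to plus and minus log2(1.2).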
... Domain identification, subcellular localization and gene synteny were performed with the information from NCBI genomic database, respectively. 46,47 ...
Article
Full-text available
The profound impacts of global changes on biodiversity necessitate a more comprehensive documentation, particularly at the microscale level. To achieve precise and rapid insights into this unique diversity, the choice of an ideal species candidate is crucial. Neurospora crassa, a well-established organism in the field of biology, emerges as a promising candidate for this purpose. In our study, we explore the potential of the Carboxypeptidase A1 (CPA1) enzyme as a valuable tool for profiling global diversity. Our investigation has revealed that CPA1 possesses distinctive characteristics, notably its conserved solvent accessibility. This unique feature makes CPA1 an invaluable asset for microscale studies of global changes. The insights presented in our study serve as a practical blueprint, showcasing the application of structural biology in understanding diversity and global changes within microscale environments.
... The subcellular localization of the proteins was predicted using CELLO (http://cello.life.nctu.edu.tw/) (Yu et al., 2004). The structural domains of the proteins were analyzed using Pfam data and compared with the InterProScan software package (Jones et al., 2014;El-Gebali et al., 2019). ...
Article
Full-text available
This study aimed to enhance the use of male sterility in pepper to select superior hybrid generations. Transcriptomic and proteomic analyses of fertile line 1933A and nucleic male sterility line 1933B of Capsicum annuum L. were performed to identify male sterility-related proteins and genes. The phylogenetic tree, physical and chemical characteristics, gene structure characteristics, collinearity and expression characteristics of candidate genes were analyzed. The study identified 2,357 differentially expressed genes, of which 1,145 and 229 were enriched in the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes databases, respectively. A total of 7,628 quantifiable proteins were identified and 29 important proteins and genes were identified. It is worth noting that the existence of CaPRX genes has been found in both proteomics and transcriptomics, and 3 CaPRX genes have been identified through association analysis. A total of 66 CaPRX genes have been identified at the genome level, which are divided into 13 subfamilies, all containing typical CaPRX gene conformal domains. It is unevenly distributed across 12 chromosomes (including the virtual chromosome Chr00). Salt stress and co-expression analysis show that male sterility genes are expressed to varying degrees, and multiple transcription factors are co-expressed with CaPRXs, suggesting that they are involved in the induction of pepper salt stress. The study findings provide a theoretical foundation for genetic breeding by identifying genes, metabolic pathways, and molecular mechanisms involved in male sterility in pepper.
... While numerous machine learning (ML) methods have been developed for predicting the subcellular location of eukaryotic proteins (Almagro Armenteros et al., 2017;Blum et al., 2009;Briesemeister et al., 2009), the availability of subcellular location predictors for prokaryotic proteins is limited and not as recent (Goldberg et al., 2012;C.-S. Yu et al., 2004;N. Y. Yu et al., 2010). In recent years, Deep learning (DL) algorithms have become the method of choice for location prediction methods (Stärk et al., 2021). To this date, no DL based method for prokaryotic location prediction has been proposed. ...
Preprint
Full-text available
Protein subcellular location prediction is a widely explored task in bioinformatics because of its importance in proteomics research. We propose DeepLocPro, an extension to the popular method DeepLoc, tailored specifically to archaeal and bacterial organisms. DeepLocPro is a multiclass subcellular location prediction tool for prokaryotic proteins, trained on experimentally verified data curated from UniProt and PSORTdb. DeepLocPro compares favorably to the PSORTb 3.0 ensemble method, surpassing its performance across multiple metrics on our benchmark experiment. The DeepLocPro prediction tool is available online at https://ku.biolib.com/deeplocpro and https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/.
... The significantly differentially expressed proteins were used for bioinformatics analysis. The Subcellular Localization Predictive System (CELLO) [32] was utilized to analyze their subcellular localization. Additionally, Interproscan [33] was utilized for structural domain prediction. ...
Article
Full-text available
Hexahydro-1,3,5-trinitro-1,3,5-triazine (RDX) is an energetic and persistent explosive with long-lasting properties. Rhodococcus sp. strain DN22 has been discovered to be a microbe capable of degrading RDX. Herein, the complete genome of Rhodococcus sp. strain DN22 was sequenced and analyzed. The entire sequences of genes that encoded the two proteins participating in RDX degradation in Rhodococcus sp. strain DN22 were obtained, and were validated through proteomic data. In addition, few studies have investigated the physiological changes and metabolic pathways occurring within Rhodococcus sp. cells when treated with RDX, particularly through mass spectrometry-based omics. Hence, proteomic and metabolomic analyses were carried out on Rhodococcus sp. strain DN22 with the existence or lack of RDX in the medium. A total of 3186 proteins were identified between the two groups, with 115 proteins being significantly differentially expressed proteins. There were 1056 metabolites identified in total, among which 130 metabolites were significantly different. Through the combined analysis of differential proteomics and metabolomics, KEGG pathways including two-component system, ABC transporters, alanine, aspartate and glutamate metabolism, arginine biosynthesis, purine metabolism, nitrogen metabolism, and phosphotransferase system (PTS), were observed to be significantly enriched. These findings provided ponderable perspectives on the physiological alterations and metabolic pathways in Rhodococcus sp. strain DN22, responding to the existence or lack of RDX. This study is anticipated to expand the knowledge of Rhodococcus sp. strain DN22, as well as advancing understanding of microbial degradation.
... [23] Subcellular localization prediction was performed with the CELLO v2.5 server [24,25]. Signal peptide sequence was established using SignalP 5.0 server (www.cbs.dtu.dk/services/SignalP) [26]. Subsequently, the amino acid sequence obtained was compared with those reported in the NCBI using the BLAST algorithm (https://www.ncbi.nlm.nih.gov/), and phylogenetic tree was constructed with Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/). ...
Article
Full-text available
The DSR-IBUN dextransucrase produced by Leuconostoc mesenteroides strain IBUN 91.2.98 has a short production time (4.5 hours), an enzymatic activity of 24.8 U/mL, and a specific activity of purified enzyme 2 times higher (331.6 U/mg) than that reported for similar enzymes. The aim of this study was to generate a structural model that, from an in silico approach, allows a better understanding, from the structural point of view, of the activity obtained by the enzyme of interest, which is key to continue with its study and industry application. For this, we translated the nucleotide sequence of the dsr_IBUN gene. With the primary structure of DSR-IBUN, the in silico prediction of physicochemical parameters, the possible subcellular localization, the presence of signal peptide, and the location of domains and functional and structural motifs of the protein were established. Subsequently, its secondary and tertiary structure were predicted and a homology model of the dextransucrase under study was constructed using Swiss-Model, performing careful template selection. The values obtained for the model, Global Model Quality Estimation (0.63), Quality Mean (−1.49), and root-mean-square deviation (0.09), allow us to affirm that the model for the enzyme dextransucrase DSR-IBUN is of adequate quality and can be used as a source of information for this protein.
... The host immune system may quickly recognize surface-exposed outer membrane proteins, which may be linked to pathogenesis (Haake et al., 1999). To recover the outer membrane protein, the Q8F8B3 protein sequence was sent to the CELLO v.2.5 server (Yu et al., 2004). ...
... The normalized ratios of heavy/light (H/L) isotopes and medium/light (M/L) isotopes (Supplementary Data S1) were used for relative quantitative analysis. Subcellular localization was based on previously published experimental data (Supplementary Data S2) and bioinformatic prediction using PSORTb v3.0.3 [61], CELLO v.2.5 [62], and SOSUI-GramN [63] with a majority voting strategy. Proteins that were predicted to be at different locations by each tool were assigned to unknown. ...
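
The majority voting strategy described in this excerpt can be sketched as follows: a location is accepted when more than half of the predictors agree on it, and the protein is labelled unknown otherwise. The function name and the label strings are hypothetical. The sketch also assumes that the tools' outputs have already been mapped to a shared set of location names, which the excerpt does not describe.

```python
from collections import Counter


def majority_vote(predictions, unknown="unknown"):
    """Return the location predicted by a strict majority of tools.

    predictions: list of location labels, one per tool.
    """
    if not predictions:
        return unknown
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes > len(predictions) / 2 else unknown


# Two of three tools agree, so the shared label is returned.
print(majority_vote(["cytoplasm", "cytoplasm", "outer membrane"]))
# All three tools disagree, so the protein is assigned to unknown.
print(majority_vote(["cytoplasm", "periplasm", "outer membrane"]))
```

With three predictors, a strict majority means at least two matching labels, which is consistent with assigning proteins to unknown when every tool predicts a different location.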
Article
Full-text available
Bacterial extracellular vesicles (EVs) are generally formed by pinching off outer membrane leaflets while simultaneously releasing multiple active molecules into the external environment. In this study, we aimed to identify the protein cargo of leptospiral EVs released from intact leptospires grown under three different conditions: EMJH medium at 30 °C, temperature shifted to 37 °C, and physiologic osmolarity (EMJH medium with 120 mM NaCl). The naturally released EVs observed under transmission electron microscopy were spherical in shape with an approximate diameter of 80–100 nm. Quantitative proteomics and bioinformatic analysis indicated that the EVs were formed primarily from the outer membrane and the cytoplasm. The main functional COG categories of proteins carried in leptospiral EVs might be involved in cell growth, survival and adaptation, and pathogenicity. Relative to their abundance in EVs grown in EMJH medium at 30 °C, 39 and 69 proteins exhibited significant changes in response to the temperature shift and the osmotic change, respectively. During exposure to both stresses, Leptospira secreted several multifunctional proteins via EVs, while preserving certain virulence proteins within whole cells. Therefore, leptospiral EVs may serve as a decoy structure for host responses, whereas some virulence factors necessary for direct interaction with the host environment are reserved in leptospiral cells. This knowledge will be useful for understanding the pathogenesis of leptospirosis and developing as one of vaccine platforms against leptospirosis in the future.
... edu. tw/) (Yu et al. 2004 (Tamura et al. 2021) with the neighbor-joining method and default parameters like 1000 replicates of bootstrap, Poisson correction distance, and pairwise deletion. The circular tree was further refined using iTOL (https:// itol. ...
Article
Full-text available
The present investigation profoundly asserted the catalytic potential of plant-based aldo-ketoreductase, postulating its role in polyketide biosynthesis and providing new insights for tailored biosynthesis of vital plant polyketides for therapeutics. Plants hold great potential as a future source of innovative biocatalysts, expanding the possibilities within chemical reactions and generating a variety of benefits. The aldo–keto reductase (AKR) superfamily includes a huge collection of NAD(P)H-dependent oxidoreductases that carry out a variety of redox reactions essential for biosynthesis, detoxification, and intermediary metabolism. The present study involved the isolation, cloning, and purification of a novel aldo-ketoreductase (AvAKR) from the leaves of Aloe vera (Aloe barbadensis Miller) by heterologous gene expression in Escherichia coli based on the unigene sequences of putative ketoreductase and cDNA library screening by oligonucleotide hybridization. The in-silico structural analysis, phylogenetic relationship, and molecular modeling were outranged to approach the novelty of the sequence. Additionally, agroinfiltration of the candidate gene tagged with a green fluorescent protein (GFP) was employed for transient expression in the Nicotiana benthamiana to evaluate the sub-cellular localization of the candidate gene. The AvAKR preferred cytoplasmic localization and shared similarities with the known plant AKRs, keeping the majority of the conserved active-site residues in the AKR superfamily enzymes. The enzyme facilitated the NADPH-dependent reduction of various carbonyl substrates, including benzaldehyde and sugars, proclaiming a broad spectrum range. Our study successfully isolated and characterized a novel aldo-ketoreductase (AvAKR) from Aloe vera, highlighting its versatile NADPH-dependent carbonyl reduction proficiency therewith showcasing its potential as a versatile biocatalyst in diverse redox reactions.
... Cello server was used to predict protein localization in cell (Yu et al, 2004). Signal peptide (SP) were predicted using SignalP 5.0 (Almagro Armenteros et al, 2019). ...
Article
Full-text available
The ATPase family AAA domain containing protein 3 (ATAD3) is found in mitochondria of nematodes, plants and mammals including humans. These proteins are engaged in a variety of processes, which are crucial to the survival of the organism. They have been associated with several conditions including osteoporosis, mitochondrial proliferation, cancer etc. Loa loa is a parasitic nematode which causes a disease loiasis in humans which is considered a neglected tropical disease. Here we have annotated a hypothetical protein XP_003137384.1 mined from Loa loa genome using several bioinformatic tools. Domain identification and structural homology suggest it to be a ATAD3 protein. Homology based modelling was also performed to visualize the complete structure of the protein. Two distinct structural domains were seen with a well conserved C-terminal ATPase domain. Interaction mapping using the string database showed that this protein makes extended interactions with mitochondrial ribosomal proteins as well as many V-type protein pump subunits signifying its function in mitochondrial protein translation and ATPase activity. Since, this protein is known to make crucial interaction with endoplasmic reticulum, our study should help in further understanding of an important mitochondrial component.
... Specific identification of domain/s and secondary structure analysis were performed with subcellular prediction of microbial server beside the Chou & Fasman Secondary prediction (Yu et al. 2004, 2006; Kumar 2013; Yang et al. 2020). Furthermore, the phylogenetic tree of esterases with retrieved sequences from NCBI presented with MEGA Version 11 (Tamura et al. 2021). ...
Article
Full-text available
Over recent years, Alicyclobacillus acidocaldarius, a Gram-positive nonpathogenic rod-shaped thermo-acid-tolerant bacterium, has posed numerous challenges for the fruit juice industry. However, the bacterium’s unique characteristics, particularly its nonpathogenic and thermophilic capabilities, offer significant opportunities for genetic exploration by biotechnologists. This study presents a computational proteogenomics report on the carboxylesterase (CE) enzyme in A. acidocaldarius, shedding light on the structure and evolution of CEs from this bacterium. Our analysis revealed that the average molecular weight of CEs in A. acidocaldarius was 41 kDa, with an isoelectric point around 5. The amino acid composition favored negatively charged amino acids over positively charged ones. The aliphatic index and hydropathicity were approximately 88 and −0.15, respectively. While the protein sequences showed no disulfide bonds in the CEs’ structure, Cys residues were present in the structure of CEs. Phylogenetic analysis presented more than 99% similarity between CEs, indicating their close evolutionary relationship. By applying homology modeling, 3-dimensional structural models of the carboxylesterases were constructed, which, with the help of structural conservation and solvent accessibility analysis, highlighted key residues and regions responsible for enzyme stability and conformation. Residues with a total solvent accessibility of less than 25 Ų occupied a considerable number of positions, and Gly residues had noticeably high solvent accessibility in all structures. Ala was the most frequent amino acid in the conserved-SASA of carboxylesterases. Furthermore, unsupervised agglomerative hierarchical clustering based on the solvent accessibility feature successfully clustered this enzyme and even distinguished it from proteases from the same genome. These findings contribute to a deeper understanding of the nonpathogenic A. acidocaldarius carboxylesterase and its potential applications in biotechnology. Additionally, structural analysis of CEs would help to address potential solutions in the fruit juice industry through the use of computational structural biology.
... (Yu et al., 2010), CELLO (http://cello.life.nctu.edu.tw/) (Yu et al., 2004), and SOSUI-GramN (https://harrier.nagahama-i-bio.ac.jp/sosui/sosuigramn/sosuigramn_submit.html) (Imai et al., 2008) programs. ...
Article
Full-text available
Background Neisseria gonorrhoeae (gonococcus) is the causative agent of the sexually transmitted disease gonorrhea, for which no vaccines exist. Efforts are being made to identify potential vaccine protein antigens, and in this study, an immunoproteomics approach was used to identify protein signatures in gonococci that were recognized by sera from patients with gonorrhea. Methods Sera from patients with uncomplicated gonorrhea and from controls were reacted on Western blot with gonococcal whole-cell lysate separated by 2D electrophoresis. Reactive bands were excised and digested, and peptides were analyzed by mass spectrometry to identify protein hits. Proteins were analyzed with in-silico bioinformatics tools (PSORTb v3.0, CELLO, SOSUI-GramN, LipoP 1.0, SignalP 5.0, TMHMM 2.0, eggNOG-mapper 5.0) to select for surface-exposed/outer membrane proteins (OMPs) and exclude cytoplasmic proteins and most periplasmic proteins. Sera were tested for bactericidal activity against homologous and heterologous gonococcal strains. Results Patient sera reacted with 180 proteome bands, and 18 of these bands showed ≥2-fold increased reactivity compared with sera from individuals ( n = 5) with no history of gonococcal infection. Mass spectrometry produced peptide signatures for 1,107 proteins, and after bioinformatics analyses, a final collection of 33 proteins was produced that contained 24 OMPs/extracellular proteins never previously studied to our knowledge, 6 proteins with homologs in Neisseria meningitidis that can generate functional immune responses, and 3 unknown proteins. The sera showed little or no significant bactericidal activity, which may be related to the immunoproteomic identification of contraindicated proteins Rmp and H.8 that can generate blocking antibodies. Conclusion Studies on the vaccine potential of these newly identified proteins deserve consideration.
... TMHMM-2.0/). Using the online website CELLO v.2.5, subcellular localization was predicted (http://cello.life.nctu.edu.tw/) (Yu et al., 2010). ...
Article
Full-text available
The YUCCAs (YUC) are functionally identified flavin-containing monooxidases (FMOs) in plants that act as an important rate-limiting enzyme functioning in the auxin synthesis IPA (indole-3-pyruvic acid) pathway. In this study, 12 MsYUCs and 15 MtYUCs containing characteristic conserved motifs were identified in M. sativa ( Medicago sativa L.) and M. truncatula ( Medicago truncatula Gaertn.), respectively. Phylogenetic analysis revealed that YUC proteins underwent an evolutionary divergence. Both tandem and segmental duplication events were presented in MsYUC and MtYUC genes. Comparative syntenic maps of M. sativa with M. truncatula , Arabidopsis ( Arabidopsis thaliana ), or rice ( Oryza sativa L.) were constructed to illustrate the evolution relationship of the YUC gene family. A large number of cis-acting elements related to stress response and hormone regulation were revealed in the promoter sequences of MsYUCs . Expression analysis showed that MsYUCs had a tissue-specific, genotype-differential expression and a differential abiotic stress response pattern based on transcriptome data analysis of M. sativa online. In addition, RT-qPCR confirmed that salt stress significantly induced the expression of MsYUC1/MsYUC10 but significantly inhibited MsYUC2/MsYUC3 expression and the expression of MsYUC10/MsYUC11/MsYUC12 was significantly induced by cold treatment. These results could provide valuable information for functional analysis of YUC genes via gene engineering of the auxin synthetic IPA pathway in Medicago .
... Subcellular localization of any protein is important in understanding protein function. Prediction of subcellular localization of protein was carried out by CELLO v.2.5 (Yu et al., 2006; Yu et al., 2004). ...
... Also, the TMHMM Server v2.0 (https://services.healthtech.dtu.dk/service.php?TMHMM-2.0) was used to predict the putative trans-membrane (TM) regions in each CaAATs protein [37], while the WoLF PSORT/Cello life tool was used to predict the sub-cellular location of the proteins [38,39]. ...
Article
Amino acid transporters (AATs), besides, being a crucial component for nutrient partitioning system are also vital for growth and development of the plants and stress resilience. In order to understand the role of AAT genes in seed quality proteins, a comprehensive analysis of AAT gene family was carried out in chickpea leading to identification of 109 AAT genes, representing 10 subfamilies with random distribution across the chickpea genome. Several important stress responsive cis-regulatory elements like Myb, ABRE, ERE were detected in the promoter region of these CaAAT genes. Most of the genes belonging to the same sub-families shared the intron-exon distribution pattern owing to their conserved nature. Random distribution of these CaAAT genes was observed on plasma membrane, vacuolar membrane, Endoplasmic reticulum and Golgi membranes, which may be associated to distinct biochemical pathways. In total 92 out 109 CaAAT genes arise as result of duplication, among which segmental duplication was more prominent over tandem duplication. As expected, the phylogenetic tree was divided into 2 major clades, and further sub-divided into different sub-families. Among the 109 CaAAT genes, 25 were found to be interacting with 25 miRNAs, many miRNAs like miR156, miR159 and miR164 were interacting only with single AAT genes. Tissues specific expression pattern of many CaAAT genes was observed like CaAAP7 and CaAVT18 in nodules, CaAAP17, CaAVT5 and CaCAT9 in vegetative tissues while CaCAT10 and CaAAP23 in seed related tissues as per the expression analysis. Mature seed transcriptome data revealed that genotypes having high protein content (ICC 8397, ICC 13461) showed low CaAATs expression as compared to the genotypes having low protein content (FG 212, BG 3054). Amino acid profiling of these genotypes revealed a significant difference in amount of essential and non-essential amino acids, probably due to differential expression of CaAATs. 
Thus, the present study provides insights into the biological role of AAT genes in chickpea, which will facilitate their functional characterization and role in various developmental stages, stress responses and involvement in nutritional quality enhancement.
... dtu.dk/service.php?TargetP-2.0) and Cello 2.56 (http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/) [36]. ...
Article
Full-text available
The BABY BOOM (BBM) subfamily of the AP2/ERF transcription factor family is the main regulator of totipotency in plant cells and plays an important role in regulating cell proliferation and plant growth and development. Previously, we identified GmBBM7, a key gene of the soybean BBM gene subfamily (GmBBM family). In this study, we aimed to analyze its molecular characteristics and role in somatic embryogenesis in soybean. First, we identified 17 members of GmBBM family. Phylogenetic and collinear analyses revealed that the GmBBM family genes mainly exhibited a strong genetic relationship with the those of PvBBM family, the BBM family of common bean (Phaseolus vulgaris). Analysis of the promoters revealed that GmBBM family genes were mainly regulated by plant hormones. The expression of GmBBM7 varied in various tissues of soybean. Particularly, it was higher in the roots, immature embryos, grains and callus. Subcellular localization analysis indicated that GmBBM7 encodes a nuclear protein. Furthermore, overexpressed GmBBM7 could significantly enhance callus formation by regulating the levels of gibberellin A1, A3 and A7; abscisic acid; and salicylic acid and could promote root growth and development. In summary, GmBBM7 is an important regulatory gene for somatic embryogenesis and root elongation in soybean. GmBBM7 overexpression in soybean could increase the callus formation rate and density through hormonal pathways and could increase root growth. This study provided theoretical basis for efficient breeding of soybean via gene editing, transgenic and other techniques.
... Once the potential targets were selected, identification of the location of these proteins was attempted to know their functional assignment. For the identification of sub-cellular locations, the Cell Ontology-based Classification (Cello) tool [31], PSORTb (https://www.psort.org/psortb/) [32], and a program for identification of sub-cellular localization of bacterial proteins (ProtCompB) were used. ...
Article
Full-text available
Urinary tract infections (UTIs) are one of the most frequent bacterial infections in the world, both in the hospital and community settings. Uropathogenic Escherichia coli (UPEC) are the predominant etiological agents causing UTIs. Extended-spectrum beta-lactamase (ESBL) production is a prominent mechanism of resistance that hinders the antimicrobial treatment of UTIs caused by UPEC and poses a substantial danger to the arsenal of antibiotics now in use. As bacteria have several methods to counteract the effects of antibiotics, identifying new potential drug targets may help in the design of new antimicrobial agents, and in the control of the rising trend of antimicrobial resistance (AMR). The public availability of the entire genome sequences of humans and many disease-causing organisms has accelerated the hunt for viable therapeutic targets. Using a unique, hierarchical, in silico technique using computational tools, we discovered and described potential therapeutic drug targets against the ESBL-producing UPEC strain NA114. Three different sets of proteins (chokepoint, virulence, and resistance genes) were explored in phase 1. In phase 2, proteins shortlisted from phase 1 were analyzed for their essentiality, non-homology to the human genome, and gut flora. In phase 3, the further shortlisted putative drug targets were qualitatively characterized, including their subcellular location, broad-spectrum potential, and druggability evaluations. We found seven distinct targets for the pathogen that showed no similarity to the human proteome. Thus, possibilities for cross-reactivity between a target-specific antibacterial and human proteins were minimized. The subcellular locations of two targets, ECNA114_0085 and ECNA114_1060, were predicted as cytoplasmic and periplasmic, respectively. These proteins play an important role in bacterial peptidoglycan biosynthesis and inositol phosphate metabolism, and can be used in the design of drugs against these bacteria. 
Inhibition of these proteins will be helpful to combat infections caused by MDR UPEC.
... Then, proteins retrieved in the manuscript were used to make subcellular localization predictions. 19,20 The protein structural domain was analyzed in the Pfam database v 35.0 (http://pfam.xfam.org/), in which all protein families were represented in the form of multiple sequence comparisons and hidden Markov models. 21 The InterProScan software v 5.62-94.0 ...
Article
To gain a comprehensive understanding of non-histone methylation during berry ripening in grape (Vitis vinifera L.), the methylation of non-histone lysine residues was studied using a 4D label-free quantitative proteomics approach. In total, 822 methylation sites in 416 methylated proteins were identified, with xxExxx_K_xxxxxx as the conserved motif. Functional annotation of non-histone proteins with methylated lysine residues indicated that these proteins were mostly associated with "ripening and senescence", "energy metabolism", "oxidation-reduction process", and "stimulus response". Most of the genes encoding proteins subjected to methylation during grape berry ripening showed a significant increase in expression during maturation at least at one developmental stage. The correlation of methylated proteins with QTLs, SNPs, and selective regions associated with fruit quality and development was also investigated. This study reports the first proteomic analysis of non-histone lysine methylation in grape berry and indicates that non-histone methylation plays an important role in grape berry ripening.
... The Subcellular Localization Predictive System (CELLO) was used to predict subcellular localization for each protein in the dataset [44]. We included this prediction, in addition to tissue specificity data obtained from the Genotype-Tissue Expression (GTEx) and the Human Protein Atlas (HPA) [18,38]. ...
Article
Full-text available
The identification of human proteins that are amenable to pharmacologic modulation without significant off-target effects remains an important unsolved challenge. Computational methods have been devised to identify features which distinguish between “druggable” and “undruggable” proteins, finding that protein sequence, tissue and cellular localization, biological role, and position in the protein–protein interaction network are all important discriminant factors. However, many prior efforts to automate the assessment of protein druggability suffer from low performance or poor interpretability. We developed a neural network-based machine learning model capable of generating druggability sub-scores based on each of four distinct categories, combining them to form an overall druggability score. The model achieves an excellent performance in separating drugged and undrugged proteins in the human proteome, with an area under the receiver operating characteristic (AUC) of 0.95. Our use of multiple sub-scores allows the assessment of potential protein targets of interest based on distinct contributors to druggability, leading to a more interpretable and holistic model to identify novel targets.
... 20 Cello version v.2.5 was used to perform subcellular localization of the protein structure, which provides the correct placement of the protein within the cell. 27 Using PROCHECK allowed for an investigation of the quality as well as the stereo-chemical aspects of the model structure. 28 The Ramachandran Plot gave the required information regarding the overall amino acid residues in the most preferred areas, the further permitted sections, the generously allowed regions, and the banned regions. ...
Article
Full-text available
produces a variety of bioactive compounds that prevent fungal growth, including aflatoxins. Aflatoxigenic fungi ( and ) are being researched concerning spp. and can prevent the spread of aflatoxins-producing fungi. Aflatoxin-degrading enzymes, which can convert poisonous aflatoxins into less dangerous compounds, are also produced by spp. The processes through which these microorganisms can be used to reduce aflatoxins in food and agricultural systems are still the subject of active research. To evaluate the novelty of tetracycline against the biosynthesis of aflatoxin in aflatoxigenic fungi via computational approach. In this study, we performed molecular docking of polyketide synthase (Pks-A), an enzyme that initiates aflatoxin biosynthesis using tetracycline, using the online SeamDock server. Our results showed that tetracycline had a strong affinity for Pks-A in the binding pocket. The binding energy of tetracycline was -12.7 kcal/mol, indicating a strong binding affinity between the two molecules. Furthermore, the binding site was located in the active site, which is a conserved region in Pks-A and is essential for catalysing the formation of aflatoxin. The results of our docking study suggest that tetracycline may be an effective inhibitor of aflatoxin biosynthesis.
... Additionally, the extracellular protein locations were predicted by the subCELlular LOcalization predictor (CELLO v2.5, http://cello.life.nctu.edu.tw) (Yu et al., 2004, 2006). All the assays were performed in biological triplicates and analytical duplicates. ...
Article
To understand the aspects of how organisms cope with copper and chromium stress, Saccharomyces cerevisiae was used as a model. To achieve this purpose, Scanning Electron Microscopy coupled to X-Ray Dispersive Energy Spectrometry (SEM-EDS) was applied to analyze the microelemental composition and the surface mapping of microbial biomass, in the presence and absence of 30 μg mL⁻¹ Cu(II) and Cr(VI) after 72 h of incubation. Additionally, a shotgun proteomic analysis was carried out using nanoUHPLC-ESI-MS/MS on cytosolic proteins and the cell-free supernatants to analyze the differential protein expression at the intracellular and extracellular level in the presence of the metals. Bioinformatic analysis was performed using the Swiss-Prot database specific for S. cerevisiae and MASCOT v2.7.1. The comparative analysis of protein expression of the samples was performed using ProteoIQ v2.8. The microorganism responds by adjusting intracellular and extracellular protein expression, and also by adjusting microelemental composition variation. The results show that cells exposed to Cu(II) obtained the advantage of enduring unfavorable conditions, while cells exposed to Cr(VI) decreased the expression of proteins important for repair and cell function.
... Prediction of subcellular localization of the Rafs proteins was carried out using CELLO v.2.5 (http://cello.life.nctu.edu.tw) (accessed on 15 January 2021) [46]. Furthermore, the molecular properties of the 73 Rafs sequences, including their molecular weight, theoretical isoelectric point, instability coefficient, and hydrophilicity, were predicted using ExPASY (http://www.expasy.org/) ...
Article
Full-text available
Raffinose synthase (Rafs) is an important enzyme in the synthesis pathway of raffinose from sucrose and galactinol in higher plants and is involved in the regulation of seed development and plant responses to abiotic stresses. In this study, we analyzed the Rafs families and profiled their alternative splicing patterns at the genome-wide scale from 10 grass species representing crops and grasses. A total of 73 Rafs genes were identified from grass species such as rice, maize, foxtail millet, and switchgrass. These Rafs genes were assigned to six groups based the phylogenetic analysis. We compared the gene structures, protein domains, and expression patterns of Rafs genes, and also unraveled the alternative transcripts of them. In addition, different conserved sequences were observed at these putative splice sites among grass species. The subcellular localization of PvRafs5 suggested that the Rafs gene was expressed in the cytoplasm or cell membrane. Our findings provide comprehensive knowledge of the Rafs families in terms of genes and proteins, which will facilitate further functional characterization in grass species in response to abiotic stress.
... org/) was employed to analyze related physical and chemical properties, such as molecular weight (MW), isoelectric point (pI), grand average of hydropathicity (GRAVY), and instability index. Subcellular localization of HcCUC was predicted by CELLO v.2.5 (Yu et al. 2004) (http://cello.life.nctu.edu.tw/). ...
Article
Full-text available
CUP-SHAPED COTYLEDON (CUC) transcription factors have a central regulatory function in plant growth and development. However, their involvement in kenaf (Hibiscus cannabinus L.) remains largely unexplored. In this study, we conducted a comprehensive analysis to identify six HcCUC genes in the kenaf genome. Through bioinformatic analysis, we found that the kenaf HcCUC genes share similar motifs and highly conserved gene structures. Phylogenetic analysis categorized the six HcCUC genes into two groups that shared similarities with CUC2 or CUC3 genes from other species. Collinearity analysis revealed the formation of 6 syntenic gene pairs among the HcCUC genes, and 8 homologous gene pairs with three AtCUC genes from Arabidopsis. To investigate tissue-specific expression, we analyzed transcriptome data that showed differential expression of HcCUC genes, particularly in leaves during the seedling stage, buds during the maturation stage, and anthers at the dual-core period. Functional characterization of HcCUC1 was achieved through its overexpression in Arabidopsis, resulting in elongated cotyledons, absence of petioles, and increased numbers of rosette leaves and lateral branches. qRT-PCR analysis revealed that HcCUC1 potentially influences leaf and lateral branch development by up-regulating the expression of auxin-related genes (AtYUC2, AtYUC4, AtPIN1, AtPIN3, AtPIN4) and leaf shape-related genes (AtKNAT2, AtKNAT6). Notably, overexpression of HcCUC1 down-regulated the expression of flowering-related genes (AtFT, AtAP1, AtLFY, AtFUL), causing delayed flowering. Overall, our findings emphasize the pivotal role of HcCUC1 in regulating leaf and lateral branch growth, development, and flowering time, and provide valuable insights into the function and genetic regulation of HcCUC genes.
... The CELLO (v.2.5) [21,22], PSORTb (v3.0) [23], HMMTOP (v.2.0) [24,25], and TMHMM (v.2.0) [26,27] programs are used to detect the subcellular localization and protein topology analysis. ...
Article
Full-text available
Adaptation of infections and hosts has resulted in several metabolic mechanisms adopted by intracellular pathogens to combat the defense responses and the lack of fuel during infection. Human tuberculosis caused by Mycobacterium tuberculosis (MTB) is the world’s first cause of mortality tied to a single disease. This study aims to characterize and anticipate potential antigen characteristics for promising vaccine candidates for the hypothetical protein of MTB through computational strategies. The protein is associated with the catalyzation of dithiol oxidation and/or disulfide reduction because of the protein’s anticipated disulfide oxidoreductase properties. This investigation analyzed the protein's physicochemical characteristics, protein-protein interactions, subcellular locations, anticipated active sites, secondary and tertiary structures, allergenicity, antigenicity, and toxicity properties. The protein has significant active amino acid residues with no allergenicity, elevated antigenicity, and no toxicity.
Article
A higher prevalence of Acinetobacter baumannii infections and mortality rate has been reported recently in hospital-acquired infections (HAI). The biofilm-forming capability of A. baumannii makes it an extremely dangerous pathogen, especially in device-associated hospital-acquired infections (DA-HAI), thereby it resists the penetration of antibiotics. Further, the transmission of the SARS-CoV-2 virus was exacerbated in DA-HAI during the epidemic. This review specifically examines the complex interconnections between several components and genes that play a role in the biofilm formation and the development of infections. The current review provides insights into innovative treatments and therapeutic approaches to combat A. baumannii biofilm-related infections, thereby ultimately improving patient outcomes and reducing the burden of HAI.
Preprint
Full-text available
Background The Domain of unknown function 679 membrane proteins (DMPs) family, as a green plant-specific membrane protein, plays an important role in plant reproductive development, stress response and aging. To identify the DMP gene members of oat (AsDMP) and to investigate their family structural features and tissue expression profile characteristics, a study was conducted. Based on the whole genome and transcriptome data, in this investigation, we have scrutinized the physicochemical properties, gene structure, cisacting elements, phylogenetic relationships, conserved structural (CS) domains, CS motifs and expression patterns of the AsDMP family of oat. Results The DMP family genes of oat were found distributed across 17 chromosomal scaffolds with 33 members. We could divide the AsDMP genes into five subfamilies based on phylogenetic relationships. The gene structure suggests that oats may have also undergone an intron loss event during evolution. Covariance analysis suggests that genome-wide duplication/segmental duplication may be the major contributor to the expansion of the AsDMP gene family. Ka/Ks selective pressure analysis of oat DMP gene family, suggests that DMP gene pairs tend to be conserved over evolutionary time. The upstream promoter of these genes containing several cis-acting elements indicates a plausible role in abiotic stress and hormone induction. Gene expression pattern according to transcriptome data revealed participation of the DMP genes in tissue and organ development. In this study, AsDMP genes (AsDMP1, AsDMP19, and AsDMP22) were identified as potentially regulating oat seed senescence, and can be used as candidate genes for seed longevity and anti-aging germplasm breeding studies in oat. 
The study provides valuable information on the regulatory mechanism of the AsDMP gene family in the aging process of oat germplasm, and also provides theoretical support for further function investigation in the oat DMP gene and the molecular mechanism of seed anti-aging. Conclusions In this study, we found that the AsDMP gene is involved in the aging process of oat seeds, which is the first report on the potential role of DMP genes in oat seeds.
Article
Full-text available
Carbapenem-resistant Klebsiella pneumoniae (CRKP) is an important multidrug resistance (MDR) pathogen that threatens human health and is the main source of hospital-acquired infection. Outer membrane vesicles (OMVs) are extracellular vesicles derived from Gram-negative bacteria and contain materials involved in bacterial survival and pathogenesis. They also contribute to cellular communication to nearby or distant recipient cells and influence their functions and phenotypes. In this study, we sought to understand the mechanism of bacterial response to meropenem pressure and explore the relationship between pathogenic proteins and the high pathogenicity of bacteria. We performed whole-genome PacBio sequencing on a clinical CRKP strain, and its OMVs were characterized using nanoparticle tracking analysis, transmission electron microscopy, and proteomic analysis. Thousands of vesicle proteins have been identified in mass spectrometry-based high-throughput proteomics analyses of K. pneumoniae OMVs. Protein functionality analysis showed that the OMVs were predominantly involved in metabolic, intracellular compartments, nucleic acid binding, survival, defense, and antibiotic resistance, such as Chromosome partition protein MukB, 3-methyl-2-oxobutanoate hydroxymethyltransferase, methionine–tRNA ligase, Heat shock protein 60 family chaperone GroEL, and Gamma-glutamyl phosphate reductase. Additionally, a protein-protein interaction network demonstrated that OMVs from meropenem-treated K. pneumoniae showed the highest connectivity in DNA polymerase I, phenylalanine–tRNA ligase beta subunit, DNA-directed RNA polymerase subunit beta, methionine–tRNA ligase, DNA-directed RNA polymerase subunit beta, and DNA-directed RNA polymerase subunit alpha. The OMV proteome expression profile indicates increased secretion of stress proteins released from meropenem-treated K. pneumoniae, which provides clues for revealing the biogenesis and pathophysiological functions of Gram-negative bacteria OMVs. The significant differentially expressed proteins identified in this study are of great significance for exploring effective control strategies for CRKP infection. IMPORTANCE Meropenem is one of the main antibiotics used in the clinical treatment of carbapenem-resistant Klebsiella pneumoniae (CRKP). This study demonstrated that some important metabolic changes occurred in meropenem-induced CRKP outer membrane vesicles (OMVs). The OMV proteome expression profile indicates increased secretion of stress proteins released from meropenem-induced Klebsiella pneumoniae. Furthermore, this is the first study to discuss the protein-protein interaction network of the OMVs released by CRKP, especially under antibiotic stress.
Article
Full-text available
Jinhua lean ham (LH), a dry-cured ham made from the defatted hind legs of pigs, has become increasingly popular among consumers with health concerns. However, the influence of fat removal on the quality of Jinhua ham is still not fully understood. Therefore, a label-free proteomics strategy was used to explore the protein differential profile between Jinhua fatty ham (FH) and lean ham (LH). Results showed that 179 differential proteins (DPs) were detected, including 82 up-regulated and 97 down-regulated DPs in LH vs. FH, among which actin, myosin, tropomyosin, aspartate aminotransferase, pyruvate carboxylase, and glucose-6-phosphate isomerase were considered the key DPs. GO analysis suggested that DPs were mainly involved in binding, catalytic activity, cellular process, and metabolic process, among which catalytic activity was significantly up-regulated in LH. Moreover, the main KEGG-enriched pathways of FH focused on glycogen metabolism, mainly including the TCA cycle, pyruvate metabolism, and glycolysis/gluconeogenesis. However, amino acid metabolism and oxidative phosphorylation were the main metabolic pathways in LH. From the protein differentiation perspective, fat removal significantly promoted protein degradation, amino acid metabolism, and the oxidative phosphorylation process. These findings could help us to understand the effects of fat removal on the nutritional metabolism of Jinhua hams and provide theoretical support for developing healthier low-fat meat products.
Article
Mycobacterium tuberculosis, the etiological agent of tuberculosis, is one of the trickiest pathogens. We have only a few protective shields against the pathogen, such as the BCG vaccine, which itself has poor efficacy in preventing adult tuberculosis. Although several trials of alternative vaccines have been conducted, those studies have not shown very promising results. In the current study, advanced computational technology was used to study the potential of a novel hypothetical mycobacterial protein, identified by subtractive hybridization, to be a vaccine candidate. NHP2 (Novel Hypothetical Protein 2), housed in the RD7 region of the clinical strains of M. tuberculosis, was studied for its physical, chemical, immunological and structural properties using different computational tools. PFAM and Gene Ontology studies indicated that the NHP2 protein is functionally active, with a possible antibiotic-binding domain. Different computational tools used to assess the toxicity, allergenicity and antigenicity of the protein indicated its antigenic nature. Immune Epitope Database (IEDB) tools were used to study the T and B cell determinants of the protein. The 3D structure of the protein was designed, refined and authenticated using bioinformatics tools. The validated tertiary structure of the protein was docked against the TLR3 immune receptor to study the binding affinity and docking scores. A molecular dynamics simulation of the resulting protein-protein complex was also carried out. NHP2 was found to activate the host immune response against the tubercle bacillus and could be explored as a potential vaccine in the fight against tuberculosis.
Article
As an important member of the two‐component system (TCS), histidine kinases (HKs) play important roles in various plant developmental processes and signal transduction in response to a wide range of biotic and abiotic stresses. So far, the HK gene family has not been investigated in Gossypium. In this study, a total of 177 HK gene family members were identified in cotton. They were further divided into seven groups, and the protein characteristics, genetic relationship, gene structure, chromosome location, collinearity, and cis‐elements identification were comprehensively analyzed. Whole genome duplication (WGD) / segmental duplication may be the reason why the number of HK genes doubled in tetraploid Gossypium species. Expression analysis revealed that most cotton HK genes were mainly expressed in the reproductive organs and the fiber at initial stage. Gene expression analysis revealed that HK family genes are involved in cotton abiotic stress, especially drought stress and salt stress. In addition, gene interaction networks showed that HKs were involved in the regulation of cotton abiotic stress, especially drought stress. VIGS experiments have shown that GhHK8 is a negative regulatory factor in response to drought stress. Our systematic analysis provided insights into the characteristics of the HK genes in cotton and laid a foundation for further exploring their potential in drought stress resistance in cotton.
Preprint
Full-text available
Background Microorganisms play important ecological roles during interactions with plants. Although some microorganisms promote plant performance and are applied as biofertilizers, the molecular cross-talk of bacteria and plants is not fully understood. We aim to reveal which bacterial genes are tightly associated with the adaptation to the plant host by merging the outcomes of RNA-Seq data of a bacterium colonizing roots and comparative genomics analyses. Results Here, we show the expression of genes in a plant growth-promoting Pseudomonas strain interacting with plant roots. Our findings highlight that many of the upregulated genes have not been previously associated with this interaction. The occurrence of 184 of the upregulated bacterial genes in the interaction was higher in Pseudomonas isolates from plants compared to bacteria from other habitats, such as soils, animals or water. We argue that these genes may play relevant biological roles in this host, but only a few have been previously shown to be associated with plant-bacteria interactions. One of these genes is the yafL gene, encoding a cysteine peptidase with an NlpC/P60 hydrolytic domain, for which we demonstrate its involvement in bacterial plant growth promotion through the comparison of a wild-type bacterium with a yafL knockout strain. Conclusions Microbial plant growth promotion is a complex process with potential involvement of numerous bacterial genes, many of which still await characterization. By integrating the outcomes from in vivo RNA-Seq experiments and comparative genomic analyses, we have revealed several plant-associated genes and functions. Moreover, we have experimentally demonstrated the role of one of these genes in plant growth promotion.
Article
Purpose Brucella canis is pathogenic for dogs and humans. Serological diagnosis is a cost‐effective approach for disease surveillance, but a major drawback of current serological tests is the cross‐reactivity with other bacteria that results in false positive reactions. Development of indirect tests with improved sensitivity and specificity that use selected B. canis proteins instead of the whole antigen remains a priority. Experimental Design A western blotting assay was developed to define the serum antibody patterns associated with infection using a panel of positive and negative dog sera. B. canis positive sera recognized immunogenic bands ranging from 7 to 30 kDa that were then submitted to ESI–LC‐MS/MS and analyzed by bioinformatics tools. Results A total of 398 B. canis proteins were identified. Bioinformatics tools identified 16 non‐cytoplasmic immunogenic proteins predicted as non‐homologous with the most important Brucella cross‐reactive bacteria and nine B. canis proteins non‐homologous to B. ovis; among the latter, one was also non‐homologous to B. melitensis. Data are available via ProteomeXchange with identifier PXD042682. Conclusions and Clinical Relevance The western blotting test developed was able to distinguish between infected and non‐infected animals and may serve as a confirmatory test for the serological diagnosis of B. canis. The mass spectrometry and in silico results led to the identification of specific candidate antigens that pave the way for the development of more accurate indirect diagnostic tests.
Article
Human milk is an ideal natural food for infants, and the infant's gender may have an impact on the protein composition of breast milk. In this study, we used 4D label-free quantitative proteomics techniques to identify and quantitatively analyze the casein fraction in breast milk secreted for male and female infants. The results showed that a total of 2064 proteins were identified in human milk, and 95 of them were differentially abundant proteins. Compared to breast milk secreted by mothers of female infants, 21 proteins were up-regulated, and 59 proteins were down-regulated in breast milk secreted by mothers of male infants. The most abundant domain among the differentially abundant proteins was the immunoglobulin V-set domain, which may be involved in immune regulation. Gene Ontology functional analysis revealed that the main biological processes, molecular functions, and cellular components corresponded to cellular process, binding, and cell part, respectively. The Kyoto Encyclopedia of Genes and Genomes pathways were mainly associated with human diseases and metabolism, with biosynthesis of cofactors being the most involved pathway. The results contribute to our understanding of the composition of casein in breast milk, and may provide information about the nutritional differences in breast milk from mothers of newborns of different genders.
Article
Full-text available
Automated prediction of bacterial protein subcellular localization is an important tool for genome annotation and drug discovery. PSORT has been one of the most widely used computational methods for such bacterial protein analysis; however, it has not been updated since it was introduced in 1991. In addition, neither PSORT nor any of the other computational methods available make predictions for all five of the localization sites characteristic of Gram-negative bacteria. Here we present PSORT-B, an updated version of PSORT for Gram-negative bacteria, which is available as a web-based application at http://www.psort.org. PSORT-B examines a given protein sequence for amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane alpha-helices and motifs corresponding to specific localizations. A probabilistic method integrates these analyses, returning a list of five possible localization sites with associated probability scores. PSORT-B, designed to favor high precision (specificity) over high recall (sensitivity), attained an overall precision of 97% and recall of 75% in 5-fold cross-validation tests, using a dataset we developed of 1443 proteins of experimentally known localization. This dataset, the largest of its kind, is freely available, along with the PSORT-B source code (under GNU General Public License).
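The distinction between precision and recall drawn in the abstract above can be illustrated with a minimal sketch for a single-label predictor that is allowed to abstain. The function name, the use of `None` for "no prediction", and the toy labels are illustrative assumptions; this is not PSORT-B code or data.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    truth: Sequence[str],
    predicted: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Overall precision and recall when the predictor may abstain (None)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    # Only proteins that received a prediction count toward precision.
    made = [(t, p) for t, p in zip(truth, predicted) if p is not None]
    correct = sum(1 for t, p in made if t == p)
    precision = correct / len(made) if made else 0.0
    # Every protein, including abstentions, counts toward recall.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy example (hypothetical labels): one abstention and one wrong call.
truth = ["cytoplasm", "periplasm", "outer membrane", "extracellular"]
pred = ["cytoplasm", None, "outer membrane", "periplasm"]
p, r = precision_recall(truth, pred)
print(f"precision={p:.2f} recall={r:.2f}")
```

A predictor that abstains on uncertain sequences raises precision at the cost of recall, which is the trade-off the abstract describes.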
Article
Full-text available
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to seven additional databases; a variety of new documentation files; and the creation of TREMBL, an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT.
Article
Full-text available
A new method, the motif identification neural design (MOTIFIND), has been developed for rapid and sensitive protein family identification. The method is an extension of our previous gene classification artificial neural system and employs new designs to enhance the detection of distant relationships. The new designs include an n-gram term weighting algorithm for extracting local motif patterns, an enhanced n-gram method for extracting residues of long-range correlation, and integrated neural networks for combining global and motif sequence information. The system has been tested and compared with several existing methods using three protein families, the cytochrome c, cytochrome b and flavodoxin. Overall it achieves 100% sensitivity and > 99.6% specificity, an accuracy comparable to BLAST, but at a speed of approximately 20 times faster. The system is much more robust than the PROSITE search which is based on simple signature patterns. MOTIFIND also compares favorably with BLIMPS, the Hidden Markov Model and PROFILESEARCH in detecting fragmentary sequences lacking complete motif regions and in detecting distant relationships, especially for members of under-represented subgroups within a family. MOTIFIND may be generally applicable to other proteins and has the potential to become a full-scale database search and sequence analysis tool.
Article
Full-text available
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.
Article
Full-text available
Molecular models of the trans-membrane domains of delta, kappa and mu opioid receptors, members of the G-protein coupled receptor (GPCR) superfamily, were developed using techniques of homology modeling and molecular dynamics simulations. Structural elements were predicted from sequence alignments of opioid and related receptors based on (i) the consensus, periodicities and biophysical interpretations of alignment-derived properties, and (ii) tertiary structure homology to rhodopsin. Initial model structures of the three receptors were refined computationally with energy minimization and the result of the first 210 ps of a 2 ns molecular dynamics trajectory at 300 K. Average structures from the trajectory obtained for each receptor subtype after release of the initial backbone constraints show small backbone deviations, indicating stability. During the molecular dynamics phase, subtype-differentiated residues of the receptors developed divergent structures within the models, including changes in regions common to the three subtypes and presumed to belong to ligand binding regions. The divergent features developed by the model structures appear to be consistent with the observed ligand binding selectivities of the opioid receptors. The results thus implicate identifiable receptor microenvironments as primary determinants of some of the observed subtype specificities in opiate ligand binding and in functional effects of mutagenesis. Networks of interacting residues observed in the models are common to the opiate receptors and other GPCRs, indicating core interfaces that are potentially responsible for structural integrity and signal transduction. Analysis of extended molecular dynamics trajectories reveals concerted motions of distant parts of ligand-binding regions, suggesting motion-sensitive components of ligand binding. The comparative modeling results from this study help clarify experimental observations of subtype differences and suggest both structural and dynamic rationales for differences in receptor properties.
Article
Full-text available
Neural networks have been trained to predict the subcellular location of proteins in prokaryotic or eukaryotic cells from their amino acid composition. For three possible subcellular locations in prokaryotic organisms a prediction accuracy of 81% can be achieved. Assigning a reliability index, 33% of the predictions can be made with an accuracy of 91%. For eukaryotic proteins (excluding plant sequences) an overall prediction accuracy of 66% for four locations was achieved, with 33% of the sequences being predicted with an accuracy of 82% or better. With the subcellular location restricting a protein's possible function, this method should be a useful tool for the systematic analysis of genome data and is available via a server on the world wide web.
Article
Full-text available
The function of a protein is closely correlated with its subcellular location. With the rapid increase in new protein sequences entering into data banks, we are confronted with a challenge: is it possible to utilize a bioinformatic approach to help expedite the determination of protein subcellular locations? To explore this problem, proteins were classified, according to their subcellular locations, into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole. Based on the classification scheme that has covered almost all the organelles and subcellular compartments in an animal or plant cell, a covariant discriminant algorithm was proposed to predict the subcellular location of a query protein according to its amino acid composition. Results obtained through self-consistency, jackknife and independent dataset tests indicated that the rates of correct prediction by the current algorithm are significantly higher than those by the existing methods. It is anticipated that the classification scheme and concept and also the prediction algorithm can expedite the functionality determination of new proteins, which can also be of use in the prioritization of genes and proteins identified by genomic efforts as potential molecular targets for drug design.
Article
Full-text available
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDSs) in the EMBL Nucleotide Sequence Database, except the CDSs already included in SWISS-PROT. We also describe the Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. SWISS-PROT is available at: http://www.expasy.ch/sprot/ and http://www.ebi.ac.uk/swissprot/
Article
Full-text available
Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
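As a rough illustration of the input representation mentioned above, the following sketch turns a protein sequence into a 20-dimensional amino acid composition vector of the kind that could be passed to a classifier such as an SVM. The function name, the residue ordering, and the choice to skip non-standard residue codes are assumptions made for this example and are not taken from the SubLoc implementation.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    # Assumption: ambiguous or non-standard codes (X, B, Z, ...) are ignored.
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


vector = amino_acid_composition("MKKLLA")  # made-up peptide
print(len(vector), round(sum(vector), 6))
```

Because the vector depends only on residue frequencies, it is insensitive to residue order, which is consistent with the abstract's remark that such predictions are robust to errors in the N-terminal sequence.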
Article
Full-text available
The HMMTOP transmembrane topology prediction server predicts both the localization of helical transmembrane segments and the topology of transmembrane proteins. Recently, several improvements have been introduced to the original method. Now, the user is allowed to submit additional information about segment localization to enhance the prediction power. This option improves the prediction accuracy as well as helps the interpretation of experimental results, i.e. in epitope insertion experiments. Availability: HMMTOP 2.0 is freely available to non-commercial users at http://www.enzim.hu/hmmtop. Source code is also available upon request to academic users. Contact: tusi@enzim.hu
Article
Full-text available
Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, with the rapid increase in new protein sequences entering into databanks, it is vitally important for both basic research and pharmaceutical industry to establish a high throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition" is introduced. Based on the novel concept, the representation for a protein can be defined as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine (SVM) algorithm is introduced for predicting protein subcellular location. High success rates are obtained by the self-consistency test, jackknife test, and independent dataset test, respectively. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area.
Book
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft-margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims’ Xi-Alpha bound and the GACV of Wahba et al. come next and perform reasonably well. For SVMs with L2 soft-margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.
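The k-fold cross-validation baseline referred to above can be sketched as follows. The helper names and the `fold_error` callback are hypothetical; the callback stands in for training a model on the training indices and measuring its error on the held-out indices, which the abstract does not spell out.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train, test) index pairs after shuffling."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # fold sizes differ by at most 1
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]


def cross_validated_error(
    n: int,
    k: int,
    fold_error: Callable[[Sequence[int], Sequence[int]], float],
) -> float:
    """Unweighted mean of a user-supplied per-fold error over the k splits."""
    splits = k_fold_indices(n, k)
    return sum(fold_error(train, test) for train, test in splits) / k
```

To tune a hyperparameter with this sketch, one would evaluate `cross_validated_error` once per candidate value and keep the value with the lowest estimate. Each evaluation retrains the model k times, which is the cost that the cheaper measures studied in the paper aim to avoid.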
Article
Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure, however within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns are generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structure tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.
Article
In vivo, proteins occur in widely different physio-chemical environments, and, from in vitro studies, we know that protein structure can be very sensitive to environment. However, theoretical studies of protein structure have tended to ignore this complexity. In this paper, we have approached this problem by grouping proteins by their subcellular location and looking at structural properties that are characteristic to each location. We hypothesize that, throughout evolution, each subcellular location has maintained a characteristic physio-chemical environment, and that proteins in each location have adapted to these environments. If so, we would expect that protein structures from different locations will show characteristic differences, particularly at the surface, which is directly exposed to the environment. To test this hypothesis, we have examined all eukaryotic proteins with known three-dimensional structure and for which the subcellular location is known to be either nuclear, cytoplasmic, or extracellular. In agreement with previous studies, we find that the total amino acid composition carries a signal that identifies the subcellular location. This signal was due almost entirely to the surface residues. The surface residue signal was often strong enough to accurately predict subcellular location, given only a knowledge of which residues are at the protein surface. The results suggest how the accuracy of prediction of location from sequence can be improved. We concluded that protein surfaces show adaptation to their subcellular location. The nature of these adaptations suggests several principles that proteins may have used in adapting to particular physio-chemical environments; these principles may be useful for protein design.
Article
A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large amount of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10-100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to the gene recognition in large sequencing projects.
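The idea of encoding a sequence through hashed n-gram counts can be sketched as below. The abstract does not describe the actual ProCANS hashing function, so the base-20 encoding followed by a modulo fold, the function name, and the default parameters are all assumptions chosen for illustration.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def ngram_hash_vector(sequence: str, n: int = 2, bins: int = 64) -> List[int]:
    """Count overlapping n-grams, folding them into `bins` slots."""
    if n < 1 or bins < 1:
        raise ValueError("n and bins must be positive")
    vector = [0] * bins
    seq = sequence.upper()
    for start in range(len(seq) - n + 1):
        gram = seq[start:start + n]
        if any(ch not in INDEX for ch in gram):
            continue  # skip n-grams containing non-standard residues
        code = 0
        for ch in gram:
            code = code * len(AMINO_ACIDS) + INDEX[ch]  # base-20 encoding
        vector[code % bins] += 1  # hypothetical hash: fold code into a bin
    return vector


# With bins = 20 ** n there are no collisions, so the vector holds exact
# n-peptide counts; smaller values of `bins` trade collisions for compactness.
dipeptide_counts = ngram_hash_vector("ACAC", n=2, bins=400)
print(sum(dipeptide_counts))
```

A fixed-length vector of this kind gives every sequence the same input dimension regardless of its length, which is what a feed-forward network or an SVM requires.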
Article
To automate examination of massive amounts of sequence data for biological function, it is important to computerize interpretation based on empirical knowledge of sequence-function relationships. For this purpose, we have been constructing a knowledge base by organizing various experimental and computational observations as a collection of if-then rules. Here we report an expert system, which utilizes this knowledge base, for predicting localization sites of proteins only from the information on the amino acid sequence and the source origin. We collected data for 401 eukaryotic proteins with known localization sites (subcellular and extracellular) and divided them into training data and testing data. Fourteen localization sites were distinguished for animal cells and 17 for plant cells. When sorting signals were not well characterized experimentally, various sequence features were computationally derived from the training data. It was found that 66% of the training data and 59% of the testing data were correctly predicted by our expert system. This artificial intelligence approach is powerful and flexible enough to be used in genome analyses.
Article
We have developed an expert system that makes use of various kinds of knowledge organized as "if-then" rules for predicting protein localization sites in Gram-negative bacteria, given the amino acid sequence information alone. We considered four localization sites: the cytoplasm, the inner (cytoplasmic) membrane, the periplasm, and the outer membrane. Most rules were derived from experimental observations. For example, the rule to recognize an inner membrane protein is the presence of either a hydrophobic stretch in the predicted mature protein or an uncleavable N-terminal signal sequence. Lipoproteins are first recognized by a consensus pattern and then assumed present at either the inner or outer membrane. These two possibilities are further discriminated by examining an acidic residue in the mature N-terminal portion. Furthermore, we found an empirical rule that periplasmic and outer membrane proteins were successfully discriminated by their different amino acid composition. Overall, our system could predict 83% of the localization sites of proteins in our database.
Article
A correlation analysis of the amino acid composition and the cellular location of a protein is presented. The statistical analysis discriminates among the following five protein classes: integral membrane proteins, anchored membrane proteins, extracellular proteins, intracellular proteins and nuclear proteins. This segregation into protein classes related to their location can help researchers to design experimental work for testing hypotheses in order to find out the functionality of a reading frame in search of function. A program (ProtLock) to predict the cellular location of a protein has been designed.
Article
A new method is suggested here for topology prediction of helical transmembrane proteins. The method is based on the hypothesis that the localizations of the transmembrane segments and the topology are determined by the difference in the amino acid distributions in various structural parts of these proteins rather than by specific amino acid compositions of these parts. A hidden Markov model with special architecture was developed to search transmembrane topology corresponding to the maximum likelihood among all the possible topologies of a given protein. The prediction accuracy was tested on 158 proteins and was found to be higher than that found using prediction methods already available. The method successfully predicted all the transmembrane segments in 143 proteins out of the 158, and for 135 of these proteins both the membrane spanning regions and the topologies were predicted correctly. The observed level of accuracy is a strong argument in favor of our hypothesis.
Article
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server: http://www.cbs.dtu.dk/services/SignalP/.
Article
We present a neural network based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. Approximately 60% of the known cleavage sites in our sequence collection were predicted to within ±2 residues from the cleavage sites given in SWISS-PROT. An analysis of 715 Arabidopsis thaliana sequences from SWISS-PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome-wide sequence data. The ChloroP predictor is available as a web-server at http://www.cbs.dtu.dk/services/ChloroP/.
Article
A novel method was introduced to predict protein subcellular locations from sequences. Using sequence data, this method achieved a prediction accuracy higher than previous methods based on the amino acid composition. For three subcellular locations in a prokaryotic organism, the overall prediction accuracy reached 89.1%. For eukaryotic proteins, prediction accuracies of 73.0% and 78.7% were attained within four and three location categories, respectively. These results demonstrate the applicability of this relatively simple method and possible improvement of prediction for the protein subcellular location.
Article
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.
Article
The cellular attributes of a protein, such as which compartment of a cell it belongs to and how it is associated with the lipid bilayer of an organelle, are closely correlated with its biological functions. The success of the Human Genome Project and the rapid increase in the number of protein sequences entering data banks have stimulated a challenging frontier: how to develop a fast and accurate method to predict the cellular attributes of a protein based on its amino acid sequence. The existing algorithms for predicting these attributes were all based on the amino acid composition, in which no sequence-order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns for protein sequences is extremely large, which has posed a formidable difficulty for realizing this goal. To deal with this difficulty, the pseudo-amino acid composition is introduced. It is a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. A remarkable improvement in prediction quality has been observed by using the pseudo-amino acid composition. The success rates of prediction thus obtained are so far the highest for the same classification schemes and same data sets. It has not escaped our notice that the concept of pseudo-amino acid composition, as well as its mathematical framework and biochemical implication, may also have a notable impact on improving the prediction quality of other protein features.
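The abstract describes the pseudo-amino acid composition only as the 20 conventional composition components combined with a set of discrete sequence correlation factors, without giving formulas. As a rough sketch of how such a vector might be assembled, the code below appends `lam` correlation factors to the 20 residue frequencies, where the k-th factor is the mean squared difference of a residue property between positions k apart. Everything specific here is an assumption for illustration: the use of a single property, the Kyte-Doolittle hydropathy scale as that property, the default `lam=3`, the weight of 0.05, and the function name `pseudo_aac`.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used as a stand-in residue property
# (an assumption; the abstract does not name the properties used).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

# Standardize the property to zero mean and unit spread over the 20 residues.
_values = list(HYDROPATHY.values())
_mu, _sd = mean(_values), pstdev(_values)
SCALED = {aa: (v - _mu) / _sd for aa, v in HYDROPATHY.items()}


def pseudo_aac(sequence, lam=3, weight=0.05):
    """Return a (20 + lam)-component vector whose entries sum to 1."""
    seq = [aa for aa in sequence if aa in SCALED]  # drop unknown letters
    if len(seq) <= lam:
        raise ValueError("sequence must contain more than `lam` residues")
    # Conventional amino acid composition (order-independent part).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order correlation factors for separations k = 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(SCALED[seq[i]] - SCALED[seq[i + k]]) ** 2
                 for i in range(len(seq) - k)]
        thetas.append(mean(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The point of the extra components is visible on a toy case: "AAALLL" and "ALALAL" have identical amino acid compositions, yet their correlation factors differ, so the two sequences are no longer mapped to the same feature vector.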
Article
Support Vector Machine (SVM), which is one class of learning machines, was applied to predict the subcellular location of proteins by incorporating the quasi-sequence-order effect (Chou [2000] Biochem. Biophys. Res. Commun. 278:477-483). In this study, the proteins are classified into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane, and (12) vacuole, which account for most organelles and subcellular compartments in an animal or plant cell. Examinations for self-consistency and jackknife testing of the SVM method were conducted for three sets consisting of 1,911, 2,044, and 2,191 proteins. The correct rates for self-consistency and the jackknife test values achieved with these protein sets were 94 and 83% for 1,911 proteins, 92 and 78% for 2,044 proteins, and 89 and 75% for 2,191 proteins, respectively. Furthermore, tests for correct prediction rates were undertaken with three independent testing datasets containing 2,148 proteins, 2,417 proteins, and 2,494 proteins, producing values of 84, 77, and 74%, respectively.
Article
We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes, and enzyme categories for enzymes. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more directly related to the linear sequence of amino acids, and hence easier to predict, than protein structure. These attributes include features associated with post-translational modifications and protein sorting, but also much simpler aspects such as the length, isoelectric point and composition of the polypeptide chain.
Article
In the coarse-grained fold assignment of major protein classes, such as all-alpha, all-beta, alpha + beta, alpha/beta proteins, one can easily achieve high prediction accuracy from primary amino acid sequences. However, the fine-grained assignment of folds, such as those defined in the Structural Classification of Proteins (SCOP) database, presents a challenge due to the larger number of folds available. A recent study yielded reasonable prediction accuracy of 56.0% on an independent set of 27 most populated folds. In this communication, we apply the support vector machine (SVM) method, using a combination of protein descriptors based on the properties derived from the composition of n-peptide and jury voting, to the fine-grained fold prediction, and are able to achieve an overall prediction accuracy of 69.6% on the same independent set, significantly higher than the previous results. On 10-fold cross-validation, we obtained a prediction accuracy of 65.3%. Our results show that SVM coupled with suitable global sequence-coding schemes can significantly improve the fine-grained fold prediction. Our approach should be useful in structure prediction and modeling.
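The abstract mentions jury voting over classifiers built on different protein descriptors but does not say how votes are counted or how ties are resolved. A minimal sketch of one plausible scheme, simple plurality voting, is given below. It assumes each classifier has already produced one predicted label for the protein; the function name `jury_vote` and the tie-breaking rule (the tied label that appears first in the input wins) are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter


def jury_vote(predictions):
    """Return the label chosen by the largest number of classifiers.

    `predictions` is a list with one predicted label per classifier.
    If several labels share the top vote count, the one that appears
    first in `predictions` is returned (an assumed tie-breaking rule).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in input order so ties resolve deterministically.
    for label in predictions:
        if counts[label] == best:
            return label
```

With three hypothetical classifiers voting `["alpha/beta", "all-alpha", "alpha/beta"]`, the call returns `"alpha/beta"`. Listing the classifiers in order of expected reliability would make the assumed tie-breaking rule favor the more trusted descriptor.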
Using neural networks for prediction of the subcellular location of proteins
  • Reinhardt