MOTIVATION: Interpreting and quantifying labeled mass-spectrometry data is complex and requires automated algorithms, particularly for large-scale proteomic profiling. Here, we propose the use of bi-linear regression to quantify relative abundance across the elution profile in a unified model. The bi-linear regression model takes advantage of the fact that, while the overall abundance of a peptide varies multiplicatively across the elution profile, the relative abundance between the mixed samples remains constant across that profile. We describe how to apply bi-linear regression models to (18)O stable-isotope labeled data, which allows for the direct comparison of two samples simultaneously. Interpretation of model parameters is also discussed. The incorporation rate of the labeling isotope is estimated as part of the modeling process and can be used as a measure of data quality. Application is demonstrated in a controlled experiment as well as in a complex mixture. RESULTS: Bi-linear regression models allow for more precise and accurate estimates of abundance, in comparison to methods that treat each spectrum independently, by taking into account the abundance of the molecule throughout the entire elution profile, with precision increased by one-to-two orders of magnitude.
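The core idea of the bi-linear model can be illustrated with a minimal sketch. It assumes a simplified setting that is not taken from the abstract: two intensity channels per spectrum (one per sample), no overlap between isotopic envelopes, and no incorporation-rate term. Under those assumptions each intensity is modeled as the product of an elution factor `a[t]` and a channel factor `b[k]`, fitted here by alternating least squares. The function name `fit_rank_one` and the data are hypothetical.

```python
def fit_rank_one(y, n_iter=50):
    """Fit y[t][k] ~ a[t] * b[k] by alternating least squares.

    y is a list of spectra (rows), each holding one intensity per
    channel (columns). Returns the elution factors a and the channel
    factors b.
    """
    n_t, n_k = len(y), len(y[0])
    b = [1.0] * n_k
    a = [0.0] * n_t
    for _ in range(n_iter):
        # Update elution factors with channel factors held fixed.
        bb = sum(v * v for v in b)
        a = [sum(y[t][k] * b[k] for k in range(n_k)) / bb for t in range(n_t)]
        # Update channel factors with elution factors held fixed.
        aa = sum(v * v for v in a)
        b = [sum(y[t][k] * a[t] for t in range(n_t)) / aa for k in range(n_k)]
    return a, b


# Hypothetical elution profile: abundance rises and falls across five
# spectra, while sample 2 is always twice as abundant as sample 1.
elution = [0.5, 2.0, 5.0, 3.0, 1.0]
intensities = [[e, 2.0 * e] for e in elution]

a, b = fit_rank_one(intensities)
ratio = b[1] / b[0]  # single relative-abundance estimate for the whole profile
print(ratio)
```

Because every spectrum contributes to the same pair of channel factors, the ratio is estimated once from the whole elution profile rather than separately per spectrum.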
In the current era of large-scale biology, systems biology has evolved as a powerful approach to identify complex interactions within biological systems. In addition to high throughput identification and quantification techniques, methods based on high-quality mono-specific antibodies remain an essential element of the approach. To assist the large-scale design and production of peptide-directed antibodies for systems biology studies, we developed a fully integrated online application, AbDesigner (http://helixweb.nih.gov/AbDesigner/), to help researchers select optimal peptide immunogens for antibody generation against relatively disordered regions of target proteins. Here we describe AbDesigner in terms of its features, comparing it to other software tools, and use it to design three antibodies against kidney disease-related proteins in human, viz. nephrin, podocin, and apolipoprotein L1.
In a bottom-up shotgun approach, the proteins of a mixture are enzymatically digested, separated, and analyzed via tandem mass spectrometry. The mass spectra relating fragment ion intensities (abundance) to the mass-to-charge ratio are used to deduce the amino acid sequence and identify the peptides and proteins. The variables that influence intensity were characterized using a multi-factorial mixed-effects model, a ten-fold cross-validation, and stepwise feature selection on 6,352,528 fragment ions from 61,543 peptide ions. Intensity was higher in fragment ions that did not have neutral mass loss relative to any mass loss or that had a +1 charge state. Peptide ions classified for proton mobility as non-mobile had the lowest intensity of all mobility levels. Higher basic residue (arginine, lysine or histidine) counts in the peptide ion and low counts in the fragment ion were associated with lower fragment ion intensities. Higher counts of proline in peptide and fragment ions were associated with lower intensities. These results are consistent with the mobile proton theory. Opposite trends between peptide and fragment ion counts and intensity may be due to the different impact of the factor under consideration at different stages of the MS/MS experiment or to the different distribution of observations across peptide and fragment ion levels. Presence of basic residues at all three positions next to the fragmentation site was associated with lower fragment ion intensity. The presence of proline proximal to the fragmentation site enhanced fragmentation and had the opposite trend when located distant from the site. A positive association between fragment ion intensity and presence of sulfur residues (cysteine and methionine) in the vicinity of the fragmentation site was identified. These results highlight the multi-factorial nature of fragment ion intensity and could improve the algorithms for peptide identification and the simulation in tandem mass spectrometry experiments.
Arsenic is a widely distributed environmental toxicant that can cause multi-tissue pathologies. Proteomic assays allow for the identification of biological processes modulated by arsenic in diverse tissue types.
The altered abundance of proteins from the HaCaT human keratinocyte cell line exposed to arsenic was quantified using a label-free LC-MS/MS workflow. Selected proteomics results were validated using western blot and RT-PCR. A functional annotation analytics strategy that included visual analytical integration of heterogeneous data sets was developed to elucidate functional categories. The annotations integrated were mainly tissue localization, biological process and gene family.
The abundance of 173 proteins was altered in keratinocytes exposed to arsenic: 96 proteins had increased abundance while 77 proteins had decreased abundance. These proteins were also classified into 69 Gene Ontology biological process terms. The increased abundance of transferrin receptor protein (TFRC) was validated, and TFRC was also annotated as participating in the response to hypoxia. A total of 33 proteins (11 with increased abundance and 22 with decreased abundance) were associated with 18 metabolic process terms. The Glutamate--cysteine ligase catalytic subunit (GCLC), the only protein annotated with the term sulfur amino acid metabolism process, had increased abundance while succinate dehydrogenase [ubiquinone] iron-sulfur subunit, mitochondrial precursor (SDHB), a tumor suppressor, had decreased abundance.
A list of 173 differentially abundant proteins in response to arsenic trioxide was grouped using three major functional annotations covering tissue localization, biological process and protein families. A possible explanation for hyperpigmentation pathologies observed in arsenic toxicity is that arsenic exposure leads to increased iron uptake in the normally hypoxic human skin. The proteins mapped to metabolic process terms and differentially abundant are candidates for evaluating metabolic pathways perturbed by arsenicals.
OBJECTIVES: The abuse of alcohol is a major public health problem, and the diagnosis and care of patients with alcohol abuse and dependence is hindered by the lack of tests that can detect dangerous levels of drinking or relapse during therapy. Gastroenterologists and other healthcare providers find it very challenging to obtain an accurate alcohol drinking history. We hypothesized that the effects of ethanol on numerous systems may well be reflected in changes in quantity or qualities of constituent or novel plasma proteins or protein fragments. Organ/tissue-specific proteins may be released into the blood stream when cells are injured by alcohol, or when systemic changes are induced by alcohol, and such proteins would be detected using a proteomic approach. The objective of this pilot study was to determine if there are plasma proteome profiles that correlate with heavy alcohol use. METHODS: Paired serum samples, before and after intensive alcohol treatment, were obtained from subjects who attended an outpatient alcohol treatment program. Serum proteomic profiles obtained by MALDI-OTOF mass spectrometry were compared between pre- and post-treatment samples. RESULTS: Of 16 subjects who enrolled in the study, 8 were females. The mean age of the study subjects was 49 yrs. The baseline laboratory data showed elevated AST (54 ± 37 IU/L), ALT (37 ± 19 IU/L), and MCV (99 ± 5 fl). Self-reported pre-treatment drinking levels for these subjects averaged 17 ± 7 drinks/day and 103 ± 37 drinks/week. Mass spectrometry analyses showed a novel 5.9 kDa protein, a fragment of alpha fibrinogen, isoform 1, that might be a new marker for abusive alcohol drinking. CONCLUSIONS: We have shown in this pilot study that several potential protein markers have appeared in mass spectral profiles and that they may be useful clinically to determine the status of alcohol drinking by MALDI-OTOF mass spectrometry, especially a fragment of alpha fibrinogen, isoform 1.
However, a large-scale study is needed to confirm and validate our current results.
Long chain acyl-CoA synthetase 1 (ACSL1) contributes 50 to 90% of total ACSL activity in liver, adipose tissue, and heart and appears to direct the use of long chain fatty acids for energy. Although the functional importance of ACSL1 is becoming clear, little is understood about its post-translational regulation. In order to investigate the post-translational modifications of ACSL1 under different physiological conditions, we overexpressed ACSL1 in hepatocytes, brown adipocytes, and 3T3-L1 differentiated adipocytes, treated these cells with different hormones, and analyzed the resulting phosphorylated and acetylated amino acids by mass spectrometry. We then compared these results to the post-translational modifications observed in vivo in liver and brown adipose tissue after mice were fasted or exposed to a cold environment. We identified universal N-terminal acetylation, 15 acetylated lysines, and 25 phosphorylation sites on ACSL1. Several unique acetylation and phosphorylation sites occurred under conditions in which fatty acid β-oxidation is normally enhanced. Thirteen of the acetylated lysines had not previously been identified, and none of the phosphorylation sites had been previously identified. Site-directed mutagenesis was used to introduce mutations at three potential acetylation and phosphorylation sites believed to be important for ACSL1 function. At the ATP/AMP binding site and at a highly conserved site near the C terminus, modifications of Ser278 or Lys676, respectively, totally inhibited ACSL1 activity. In contrast, mutations of Lys285 that mimicked acetylation (Lys285Ala and Lys285Gln) reduced ACSL activity, whereas full activity was retained by Lys285Arg, suggesting that acetylation of Lys285 would be likely to decrease ACSL1 activity. These results indicate that ACSL1 is highly modified post-translationally. 
Several of these modifications would be expected to alter enzymatic function, but others may affect protein stability or protein-protein interactions.
Molecular pathways regulating melanoma initiation and progression are potential targets of therapeutic development for this aggressive cancer. Identification and molecular analysis of these pathways in patients has been primarily restricted to targeted studies on individual proteins. Here, we report the most comprehensive analysis of formalin-fixed paraffin-embedded human melanoma tissues using quantitative proteomics. From 61 patient samples, we identified 171 proteins varying in abundance among benign nevi, primary melanoma, and metastatic melanoma. Seventy-three percent of these proteins were validated by immunohistochemistry staining of malignant melanoma tissues from the Human Protein Atlas database. Our results reveal that molecular pathways involved with tumor cell proliferation, motility, and apoptosis are mis-regulated in melanoma. These data provide the most comprehensive proteome resource on patient melanoma and reveal insight into the molecular mechanisms driving melanoma progression.
Protein S-acylation (also called palmitoylation) is a pervasive post-translational modification that plays critical roles in regulating protein trafficking, localization, stability, activity, and complex formation. The past decade has witnessed tremendous advances in the study of protein S-acylation, largely owing to the development of novel S-acylproteomics technologies. In this review, we summarize current S-acylproteomics approaches, critically review published S-acylproteomics studies, and envision future directions for the burgeoning S-acylproteomics field. Emerging S-acylproteomics technologies promise to shed new light on this distinct post-translational modification and facilitate the discovery of new disease mechanisms, biomarkers, and therapeutic targets.
It is important to investigate the reproducibility of raw mass spectrometry (MS) features of abundance, such as spectral count, peptide number and ion intensity values, when conducting replicate mass spectrometry measurements. Reproducibility can be inferred from these replicate data either formally with analysis of variance techniques or informally with graphical procedures, particularly Bland-Altman plots on paired runs. In this note, we suggest range plots to provide a suitable generalization of Bland-Altman plots to experiments with more than two replicate runs. We describe range charts and their interpretation, and illustrate their use with data from a recent proteomic study relating to label-free analysis.
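A minimal sketch of how the points of such a range plot could be computed is given below. It assumes, since the abstract does not spell out the construction, that each feature is plotted with the mean of its replicate values on one axis and the range (maximum minus minimum) on the other; with exactly two replicates the range reduces to the absolute difference used in a Bland-Altman plot. The function name and the spectral counts are hypothetical.

```python
def range_plot_points(replicates):
    """Return (mean, range) for each feature measured in replicate runs.

    replicates maps a feature name to its list of replicate values.
    """
    points = {}
    for name, values in replicates.items():
        mean = sum(values) / len(values)
        spread = max(values) - min(values)  # range across all replicates
        points[name] = (mean, spread)
    return points


# Hypothetical spectral counts from three replicate runs.
spectral_counts = {
    "protein_A": [10, 12, 11],
    "protein_B": [100, 90, 95],
}

points = range_plot_points(spectral_counts)
for name, (mean, spread) in points.items():
    print(name, mean, spread)
```

Plotting the range against the mean for all features then shows whether the spread between replicates grows with abundance, which is the kind of informal check a Bland-Altman plot provides for paired runs.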
The availability of different scoring schemes and filter settings of protein database search algorithms has greatly expanded the number of search methods for identifying candidate peptides from MS/MS spectra. We have previously shown that consensus-based methods that combine three search algorithms yield higher sensitivity and specificity compared to the use of a single search engine (individual method). We hypothesized that the union of four search engines (Sequest, Mascot, X!Tandem and Phenyx) can further enhance sensitivity and specificity. ROC plots were generated to measure the sensitivity and specificity of 5460 consensus methods derived from the same dataset. We found that Mascot outperformed the other individual methods for sensitivity and specificity, while Phenyx performed the worst. The union consensus methods generally produced much higher sensitivity, while the intersection consensus methods gave much higher specificity. The union methods from four search algorithms modestly improved sensitivity, but not specificity, compared to union methods that used three search engines. This suggests that a strategy based on a specific combination of search algorithms, instead of merely 'as many search engines as possible', may be a key strategy for success with peptide identification. Lastly, we provide strategies for optimizing sensitivity or specificity of peptide identification in MS/MS spectra for different user-specific conditions.
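The trade-off between union and intersection consensus can be illustrated with a small sketch. The engine names, peptide identifiers and results below are entirely hypothetical and are not data from the study; the sketch only assumes that each engine returns a set of accepted identifications and that the true content of the sample is known, as it would be for a standard protein mixture.

```python
def sensitivity_specificity(accepted, true_ids, candidates):
    """Compute sensitivity and specificity of a set of accepted IDs.

    candidates is the full set of identifications considered,
    true_ids the subset that is actually present in the sample.
    """
    negatives = candidates - true_ids
    tp = len(accepted & true_ids)
    fn = len(true_ids - accepted)
    fp = len(accepted & negatives)
    tn = len(negatives - accepted)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical candidate peptides, ground truth, and engine outputs.
candidates = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}
true_ids = {"p1", "p2", "p3", "p4"}
engine_a = {"p1", "p2", "p5"}
engine_b = {"p2", "p3", "p6"}
engine_c = {"p2", "p4", "p5"}

union = engine_a | engine_b | engine_c
intersection = engine_a & engine_b & engine_c

union_result = sensitivity_specificity(union, true_ids, candidates)
intersection_result = sensitivity_specificity(intersection, true_ids, candidates)
print("union:", union_result)
print("intersection:", intersection_result)
```

In this toy case the union recovers every true peptide at the cost of accepting more false ones, while the intersection accepts no false peptides but misses most true ones, which mirrors the direction of the trade-off reported above.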
Assignment of glycosylation sites and site microheterogeneity is of both biological and clinical significance. Herein, the detailed N-glycosylation pattern of human serum alpha-2-macroglobulin was studied using an integrative approach, including permethylation of N-glycans, collision induced dissociation (CID) and electron transfer dissociation (ETD) of chymotryptic N-glycopeptides, and partial deglycosylation of chymotryptic N-glycopeptides with endo-β-N-acetylglucosaminidase F3 (Endo F3). Three N-glycosylation sites were found to be occupied by four biantennary complex type N-glycans using N-glycan analysis and the ETD/CID method. Endo F3 assisted mass spectrometric analysis yielded five N-glycosylation sites with and without core fucosylation. In total, six out of eight potential N-glycosylation sites were identified using this approach. This integrative approach was performed using only 10 μL of human serum for both N-glycosylation site assignment and site microheterogeneity determination.
Nickel (Ni) compounds are widely used in industrial and commercial products including household and cooking utensils, jewelry, dental appliances and implants. Occupational exposure to nickel is associated with an increased risk for lung and nasal cancers, is the most common cause of contact dermatitis and has an extensive effect on the immune system. The purpose of this study was two-fold: (i) to evaluate the immune response to occupational exposure to nickel, measured by the presence of anti-glycan antibodies (AGA), using a new biomarker-discovery platform based on printed glycan arrays (PGA), and (ii) to evaluate and compile a sequence of bioinformatics and statistical methods which are specifically relevant to PGA-derived information and to identification of a putative “Ni toxicity signature”. The PGAs are similar to DNA microarrays, but contain deposits of various carbohydrates (glycans) instead of spotted DNAs.
The study uses data derived from a set of 89 plasma specimens and their corresponding demographic information. The study population includes three subgroups: subjects directly exposed to nickel who work in a refinery, subjects environmentally exposed to nickel who live in the city where the refinery is located, and subjects who live in a remote location. The paper describes the following sequence of nine data processing and analysis steps: (1) Analysis of inter-array reproducibility based on benchmark sera; (2) Analysis of intra-array reproducibility; (3) Screening of data - rejecting glycans which result in low intra-class correlation coefficient (ICC), high coefficient of variation and low fluorescent intensity; (4) Analysis of inter-slide bias and choice of data normalization technique; (5) Determination of discriminatory subsamples based on multiple bootstrap tests; (6) Determination of the optimal signature size (cardinality of selected feature set) based on multiple cross-validation tests; (7) Identification of the top discriminatory glycans and their individual performance based on nonparametric univariate feature selection; (8) Determination of multivariate performance of combined glycans; (9) Establishing the statistical significance of multivariate performance of combined glycan signature.
The above analysis steps have delivered the following results: inter-array reproducibility ρ=0.920 ± 0.030; intra-array reproducibility ρ=0.929 ± 0.025; 249 out of 380 glycans passed the screening at ICC>80%, glycans in selected signature have ICC ≥ 88.7%; optimal signature size (after quantile normalization)=3; individual significance for the signature glycans p=0.00015 to 0.00164, individual AUC values 0.870 to 0.815; observed combined performance for three glycans AUC=0.966, p=0.005, CI=[0.757, 0.947]; specificity=94.4%, sensitivity=88.9%; predictive (cross-validated) AUC value 0.836.
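Quantile normalization, which the results above name as the normalization applied before selecting the signature, can be sketched as follows. This is a generic textbook version that ignores ties and missing values, not necessarily the exact procedure used in the study; the function name and the intensities are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize several arrays of equal length.

    Each input list holds the intensities of one array (slide). Every
    value is replaced by the mean, across arrays, of the values that
    share its rank, so all arrays end up with the same distribution.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    rank_means = [
        sum(values[i] for values in sorted_arrays) / len(arrays)
        for i in range(n)
    ]
    normalized = []
    for values in arrays:
        # Indices of this array ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three glycans on two slides.
slides = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(slides)
print(normalized)
```

After normalization each slide contains the same set of values, so remaining differences between slides reflect only the ordering of glycans within each slide.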
Advanced prostate cancer (PCa) often spreads to distant organs, leading to increased morbidity and mortality. It is now well established that chemokines and their cognate receptors play a crucial role in the multi-step process of metastasis. We have previously shown that CXCR5 is highly expressed by PCa tissues and cell lines and that its specific ligand, CXCL13, is significantly elevated in the serum of patients with PCa, differentiating PCa cases from benign prostatic diseases. CXCR5:CXCL13 interactions promote PCa cell invasion, migration, and differential matrix metalloproteinase (MMP) expression. Thus, it is important to understand the molecular and cellular processes that mediate these events. In this study, we quantified changes in apoptosis, cell cycle, and cytoskeleton rearrangement biological pathways in a CXCL13-treated hormone-refractory PCa cell line (PC3) to better elucidate the signaling pathways activated by CXCL13:CXCR5 interaction. Using antibody arrays that displayed 343 different protein- and phosphorylation-specific antibodies, regulatory networks that control cancer progression signaling cascades were identified. Three regulatory networks were dramatically induced by CXCL13: Akt1/2-cyclin-dependent kinases (Cdk1/2)-Cdk inhibitor 1B (CDKN1B), Integrin β3-focal adhesion kinase (Fak)/Src-Paxillin (PXN), and Akt-Jun-cAMP response-element binding protein (CREB1). In general, phosphoinositide-3 kinase (PI3K)/Akt and stress-activated protein kinase (SAPK)/c-jun kinase (JNK) were the major signaling pathways modulated by CXCL13 in PCa cells. This cluster analysis revealed proteins whose activation patterns can be attributed to CXCL13:CXCR5 interaction in the androgen-independent PC3 cell line. Taken together, these results suggest that CXCL13 contributes to cell-signaling cascades that regulate advanced PCa cell invasion, growth, and/or survival.
Neurodegeneration is an important component of diabetic retinopathy, as demonstrated by increased neural apoptosis in the retina during experimental and human diabetes. Accumulation of sorbitol and fructose and the generation or enhancement of oxidative stress have been reported in the whole retina of diabetic animals. Aldose reductase (AR), the first and rate-limiting enzyme in the pathway, reduces glucose to sorbitol, and the diabetic complications are prevented by drugs that inhibit AR. In this study we examined the phosphorylation state of various retinal proteins in response to sorbitol treatment using a phospho-site-specific antibody microarray. Our results suggest that various retinal protein kinases and cytoskeletal proteins are either activated or downregulated in response to sorbitol treatment. Further, our study also indicates the activation of the retinal insulin and insulin-like growth factor 1 receptors and their downstream signaling proteins such as phosphoinositide 3-kinase and protein kinase B (Akt). Understanding the regulation of retinal proteins involved in the polyol (sorbitol) pathway would help to design therapeutic agents for the treatment of diabetic retinopathy.
Determining the functional role(s) of enzymes is very important to build the metabolic blueprint of an organism and to identify the potential roles enzymes may play in metabolic and disease pathways. With exponential growth in gene and protein sequence data, it is not feasible to experimentally characterize the function(s) of all enzymes. Alternatively, computational methods can be used to annotate the enormous number of unannotated enzyme sequences. For function prediction and classification of enzymes, features based on amino acid composition, sequence and structural properties, domain composition and specific peptide information have been widely used by different computational approaches. Each feature space has its own merits and limitations with respect to the overall prediction accuracy. Prediction accuracy improves when machine-learning methods are used to classify enzymes. Given the incomplete and unbalanced nature of annotations in biological databases, ensemble methods or methods that rely on a combination of orthogonal features are more desirable for achieving higher accuracy and coverage in enzyme classification. In this review article, we systematically describe all the features and methods used thus far for enzyme class prediction. To the authors' knowledge, this review represents the most exhaustive description of methods used for computational prediction of enzyme classes.
Using a new pEnLox vector employed to generate gain-of-function mutants in Arabidopsis thaliana, the AtFAS4 mutant has been obtained and analyzed. The mutant is characterized by overexpression of the At1g33390 gene, which leads to a mutant phenotype, stem fasciation. The level of expression of the AtFAS4 gene in normally developing A. thaliana plants is extremely low, which accounts for the almost complete absence of ESTs for this gene. The generated AtFAS4 mutant has permitted full-length cDNA of the At1g33390 gene to be obtained and analyzed for the first time.
One of the major challenges in the genomic era is annotating structure/function to the vast quantities of sequence information now available. Indeed, most of the protein sequence database lacks comprehensive annotation, even when experimental evidence exists. Further, within structurally resolved and functionally annotated protein domains, additional functionalities contained in these domains are not apparent. To add further complication, small changes in the amino-acid sequence can lead to profound changes in both structure and function, underscoring the need for rapid and reliable methods to analyze these types of data. Phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. Using all of the structurally resolved Src-Homology-2 (SH2) domains, we demonstrate that knowledge-bases can be used to create single-amino acid phylogenetic profiles which reliably annotate lipid-binding. Indeed, these measures isolate the known phosphotyrosine and hydrophobic pockets as integral to lipid-binding function. In addition, we determined that the SH2 domains of Tec family kinases bind to lipids with varying affinity and specificity. Simulations of mutations in Bruton's tyrosine kinase (BTK) that cause X-Linked Agammaglobulinemia (XLA) predict that these mutations alter lipid-binding, which we confirm experimentally. In light of these results, we propose that XLA-causing mutations in the SH3-SH2 domain of BTK alter lipid-binding, which could play a causative role in the XLA-phenotype. Overall, our study suggests that the number of lipid-binding proteins is drastically underestimated and, with further development, phylogenetic profiles can provide a method for rapidly increasing the functional annotation of protein sequences.
Functional analysis and interpretation of large-scale proteomics and gene expression data require effective use of bioinformatics tools and public knowledge resources coupled with expert-guided examination. An integrated bioinformatics approach was used to analyze cellular pathways in response to ionizing radiation. ATM, or ataxia-telangiectasia mutated, a serine-threonine protein kinase, plays critical roles in radiation responses, including cell cycle arrest and DNA repair. We analyzed radiation-responsive pathways based on 2D-gel/MS proteomics and microarray gene expression data from fibroblasts expressing the wild-type or mutant ATM gene. The analysis showed that metabolism was significantly affected by radiation in an ATM-dependent manner. In particular, purine metabolic pathways were differentially changed in the two cell lines. The expression of ribonucleoside-diphosphate reductase subunit M2 (RRM2) was increased in ATM-wild type cells at both mRNA and protein levels, but no changes were detected in ATM-mutated cells. Increased expression of p53 was observed 30 min after irradiation of the ATM-wild type cells. These results suggest that RRM2 is a downstream target of the ATM-p53 pathway that mediates radiation-induced DNA repair. We demonstrated that the integrated bioinformatics approach facilitated pathway analysis, hypothesis generation and target gene/protein identification.
It is widely believed that discovery of specific, sensitive and reliable tumor biomarkers can improve the treatment of cancer. The goal of this study was to develop a novel fractionation protocol targeting hydrophobic proteins as possible cancer cell membrane biomarkers. Hydrophobic proteins of breast cancer tissues and cell lines were enriched by polymeric reverse phase columns. The retained proteins were eluted and digested for peptide identification by nano-liquid chromatography with tandem mass spectrometry using a hybrid linear ion-trap Orbitrap. Hundreds of proteins were identified from each of these three specimens: tumors, normal breast tissue, and breast cancer cell lines. Many of the identified proteins defined key cellular functions. Protein profiles of cancer and normal tissues from the same patient were systematically examined and compared. Stem cell markers were overexpressed in triple negative breast cancer (TNBC) compared with non-TNBC samples. Because breast cancer stem cells are known to be resistant to radiation and chemotherapy, and can be the source of metastasis frequently seen in patients with TNBC, our study may provide evidence of molecules promoting the aggressiveness of TNBC. The initial results obtained using a combination of hydrophobic fractionation and nano-LC mass spectrometry analysis of these proteins appear promising in the discovery of potential cancer biomarkers. When sufficiently refined, this approach may prove useful for early detection and better treatment of breast cancer.
The development of novel targeted cancer therapies and/or diagnostic tools is dependent upon an understanding of the differential expression of molecular targets between normal tissues and tumors. Many of these potential targets are cell-surface receptors; however, our knowledge of the cell-surface proteins upregulated in pancreatic tumors is limited, thus impeding the development of targeted therapies for pancreatic cancer. To develop new diagnostic and therapeutic tools to specifically target pancreatic tumors, we sought to identify cell-surface proteins that may serve as potential tumor-specific targets.
Membrane glycoproteins on the pancreatic cancer cell line BxPC-3 were labeled with the bifunctional linker biocytin hydrazide. Following proteolytic digestion, biotinylated glycopeptides were captured with streptavidin-coupled beads, then released by PNGaseF-mediated endoglycosidase cleavage and identified by liquid chromatography-tandem mass spectrometry (MS). A protein identified by the cell-surface glycoprotein capture procedure, CD109, was evaluated by western analysis of lysates of pancreatic cancer cell lines and by immunohistochemistry in sections of pancreatic ductal adenocarcinoma and non-neoplastic pancreatic tissues.
MS/MS analysis of glycopeptides captured from BxPC-3 cells revealed 18 proteins predicted or known to be associated with the plasma membrane, including CD109, which has not been reported in pancreatic cancer. Western analysis of CD109 in lysates prepared from pancreatic cancer cell lines revealed it was expressed in 6 of 8 cell lines, with a high level of expression in BxPC-3, MIAPaCa-2, and Panc-1 cells. Immunohistochemical analyses of human pancreatic tissues indicate CD109 is significantly overexpressed in pancreatic tumors compared to normal pancreas.
The selective capture of glycopeptides from the surface of pancreatic cancer cell lines can reveal novel cell-surface glycoproteins expressed in pancreatic ductal adenocarcinomas.
The goal of this study is to use principal component analysis (PCA) for multivariate analysis of proteome dynamics based on both protein abundance and turnover information generated by high-resolution mass spectrometry. We previously reported assessing protein dynamics in iron-starved Mycobacterium tuberculosis, revealing interesting interconnections among the cellular processes involving protein synthesis, degradation, and secretion (Anal. Chem. 80, 6860-9). In this study, we use a target-decoy database search approach to select peptides for quantitation at a false discovery rate of 4.2%. We further use PCA to reduce the data dimensions for simpler interpretation. The PCA results indicate that the protein turnover and relative abundance properties are approximately orthogonal in the data space defined by the first three principal components. We show the potential of the Hotelling's T2 (T2) value as a quantifiable index for comparing changes between protein functional categories. The T2 value represents the gross change of a protein in both abundance and turnover. Close examination of the antigen 85 complex demonstrates that T2 correctly predicts the coordinated changes of the antigen 85 complex proteins. The multi-dimensional protein dynamics data further reveal the secretion of the antigen 85 complex. Overall, this study demonstrates PCA as an effective means to facilitate interpretation of the multivariate proteome dynamics dataset, which otherwise would remain a significant challenge using traditional methods.
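As an illustration of how a T2 value summarizes change in two properties at once, the sketch below computes Hotelling's T2 for each protein from two variables. It makes simplifying assumptions that are not taken from the study: only two variables per protein (a relative abundance change and a turnover change), and all principal components retained, in which case T2 equals the squared Mahalanobis distance from the mean and can be computed directly from the 2x2 covariance matrix without an explicit PCA step. The function name and the data are hypothetical.

```python
def hotelling_t2(rows):
    """Return Hotelling's T2 for each (abundance, turnover) pair.

    With both components retained, T2 is the squared Mahalanobis
    distance of each observation from the sample mean.
    """
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Sample covariance matrix entries (denominator n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy * sxy
    values = []
    for x, y in rows:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form d^T S^-1 d written out for the 2x2 case.
        values.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return values


# Hypothetical (abundance change, turnover change) values for five
# proteins; the last one changes strongly in both properties.
proteins = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.3), (0.3, -0.1), (2.0, 1.5)]
t2 = hotelling_t2(proteins)
most_changed = t2.index(max(t2))
print(t2, most_changed)
```

A large T2 flags a protein whose combined abundance and turnover profile lies far from the bulk of the data, which is the sense in which the value acts as a single index of gross change.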
Protein phosphorylation occurs in certain sequence/structural contexts that are still incompletely understood. The amino acids surrounding the phosphorylated residues are important in determining the binding of the kinase to the protein sequence. Upon phosphorylation these sequences also determine the binding of certain domains that specifically bind to phosphorylated sequences. Thus far, such 'motifs' have been identified through alignment of a limited number of well identified kinase substrates. RESULTS: Experimentally determined phosphorylation sites from the Human Protein Reference Database were used to identify 1,167 novel serine/threonine or tyrosine phosphorylation motifs using a computational approach. We were able to statistically validate a number of these novel motifs based on their enrichment in known phosphopeptide datasets over phosphoserine/threonine/tyrosine peptides in the human proteome. There were 299 novel serine/threonine or tyrosine phosphorylation motifs that were found to be statistically significant. Several of the novel motifs that we identified computationally have subsequently appeared in large datasets of experimentally determined phosphorylation sites since we initiated our analysis. Using a peptide microarray platform, we have experimentally evaluated the ability of casein kinase I to phosphorylate a subset of the novel motifs discovered in this study. Our results demonstrate that it is feasible to identify novel phosphorylation motifs through large phosphorylation datasets. Our study also establishes peptide microarrays as a novel platform for high throughput kinase assays and for the validation of consensus motifs. Finally, this extended catalog of phosphorylation motifs should assist in a systematic study of phosphorylation networks in signal transduction pathways.
Correct identification of peptides and proteins in complex biological samples from proteomic mass spectra is a challenging problem in bioinformatics. The sensitivity and specificity of identification algorithms depend on underlying scoring methods, some being more sensitive, and others more specific. For high-throughput, automated peptide identification, control over the algorithms' performance in terms of the trade-off between sensitivity and specificity is desirable. Combinations of algorithms, called 'consensus methods', have been shown to provide more accurate results than individual algorithms. However, due to the proliferation of algorithms and their varied internal settings, a systematic understanding of the relative performance of individual and consensus methods is lacking. We performed an in-depth analysis of various approaches to consensus scoring using known protein mixtures, and evaluated the performance of 2310 settings generated from consensus of three different search algorithms: Mascot, Sequest, and X!Tandem. Our findings indicate that the union of Mascot, Sequest, and X!Tandem performed well (considering overall accuracy), and methods using 80-99.9% protein probability and/or a minimum of 2 peptides and/or 0-50% minimum peptide probability for protein identification performed better (on average) among all consensus methods tested in terms of overall accuracy. The results also suggest method selection strategies to provide direct control over sensitivity and specificity.
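To make the sensitivity/specificity trade-off between consensus rules concrete, here is a minimal sketch that combines per-engine protein identifications by vote count and scores the result against a known mixture. The protein identifiers, the vote-threshold formulation, and the function names are hypothetical; the study's actual 2310 settings also involve probability and peptide-count thresholds that are not modeled here.

```python
def consensus(ids_by_engine, min_votes=1):
    """Proteins reported by at least min_votes engines.

    min_votes=1 gives the union; min_votes=len(ids_by_engine) gives the intersection.
    """
    votes = {}
    for ids in ids_by_engine.values():
        for protein in set(ids):
            votes[protein] = votes.get(protein, 0) + 1
    return {protein for protein, v in votes.items() if v >= min_votes}

def sensitivity_specificity(called, present, candidates):
    """Sensitivity and specificity of a call set against a known mixture."""
    tp = len(called & present)
    fn = len(present - called)
    fp = len(called - present)
    tn = len(candidates - present - called)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Hypothetical identifications: P* are truly in the mixture, D* are not.
hits = {
    "Mascot": {"P1", "P2", "P3"},
    "Sequest": {"P2", "P3", "D1"},
    "X!Tandem": {"P3", "P4"},
}
present = {"P1", "P2", "P3", "P4"}
candidates = present | {"D1", "D2", "D3", "D4"}

union_result = sensitivity_specificity(consensus(hits, 1), present, candidates)
majority_result = sensitivity_specificity(consensus(hits, 2), present, candidates)
intersection_result = sensitivity_specificity(consensus(hits, 3), present, candidates)
```

In this toy case the union is the most sensitive rule and the intersection the most specific, which is the kind of trade-off a method selection strategy would tune.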
Cross-species comparison of gene expression profiles allows deciphering fundamental and species-specific transcriptional programs of cells and offers insight into the organization and evolution of the genome and genetic network. Here, we propose an algorithm for comparing microarray data from different species to unravel transcriptional modules that are conserved or divergent through evolution. The proposed algorithm is based on cross-species matrix decomposition that includes a nonlinear independent component analysis followed by a generalized probabilistic sparse matrix factorization on microarray data from different species. The proposed algorithm captures transcriptional modularity that might result from highly nonlinear interactions among genes, and partitions genes into mutually non-exclusive transcriptional modules. The conserved transcriptional modules are identified by the latent variables that are associated with predominant biological prototypes shared across species. We illustrate the application of the proposed algorithm with an analysis of human and mouse embryonic stem cell (ESC) data. The analysis uncovered conserved and divergent transcriptional modules in the ESC transcriptomes, shedding light on fundamental and species-specific regulatory mechanisms controlling ESC development.