ABSTRACT: High resolution, system-wide characterizations have demonstrated the capacity to identify genomic regions that undergo genomic aberrations. Such research efforts often aim at associating these regions with disease etiology and outcome. Identifying the corresponding biologic processes that are responsible for disease and its outcome remains challenging. Using novel analytic methods that utilize the structure of biologic networks, we are able to identify the specific networks that are highly significantly, nonrandomly altered by regions of copy number amplification observed in a systems-wide analysis. We demonstrate this method in breast cancer, where the state of a subset of the pathways identified through these regions is shown to be highly associated with disease survival and recurrence.
PLoS ONE 01/2011; 6(1):e14437. · 3.73 Impact Factor
ABSTRACT: The PathOlogist is a new tool designed to transform large sets of gene expression data into quantitative descriptors of pathway-level behavior. The tool aims to provide a robust alternative to the search for single-gene-to-phenotype associations by accounting for the complexity of molecular interactions.
Molecular abundance data is used to calculate two metrics--'activity' and 'consistency'--for each pathway in a set of more than 500 canonical molecular pathways (source: Pathway Interaction Database, http://pid.nci.nih.gov). The tool then allows a detailed exploration of these metrics through integrated visualization of pathway components and structure, hierarchical clustering of pathways and samples, and statistical analyses designed to detect associations between pathway behavior and clinical features.
The PathOlogist provides a straightforward means to identify the functional processes, rather than individual molecules, that are altered in disease. The statistical power and biologic significance of this approach are made easily accessible to laboratory researchers and informatics analysts alike. Here we show, as an example, how the PathOlogist can be used to establish pathway signatures that robustly differentiate breast cancer cell lines based on response to treatment.
ABSTRACT: Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.
Genome Research 09/2009; 19(12):2324-33. · 14.40 Impact Factor
ABSTRACT: Microorganisms have been associated with many types of human diseases; however, a significant number of clinically important microbial pathogens remain to be discovered.
We have developed a genome-wide approach, called Digital Karyotyping Microbe Identification (DK-MICROBE), to identify genomic DNA of bacteria and viruses in human disease tissues. This method involves the generation of an experimental DNA tag library through Digital Karyotyping (DK) followed by analysis of the tag sequences for the presence of microbial DNA content using a compiled microbial DNA virtual tag library.
To validate this technology and to identify pathogens that may be associated with human cancer pathogenesis, we used DK-MICROBE to determine the presence of microbial DNA in 58 human tumor samples, including brain, ovarian, and colorectal cancers. We detected DNA from Human herpesvirus 6 (HHV-6) in a DK library of a colorectal cancer liver metastasis and in normal tissue from the same patient.
DK-MICROBE can identify previously unknown infectious agents in human tumors, and is now available for further applications for the identification of pathogen DNA in human cancer and other diseases.
BMC Medical Genomics 02/2009; 2:22. · 3.47 Impact Factor
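The tag-screening step described in the DK-MICROBE abstract above can be pictured with a minimal Python sketch. It is not the published pipeline: the function name `screen_tags`, the toy tag sequences, the organism labels, and the decision to skip tags that also occur in a human reference set are all assumptions made for illustration.

```python
from collections import Counter


def screen_tags(experimental_tags, microbial_library, human_tags):
    """Count experimental tags that match a microbial virtual tag library.

    experimental_tags: iterable of tag sequences from the DK library.
    microbial_library: dict mapping tag sequence -> organism name.
    human_tags: set of tags also found in the human reference (assumed
        filter; such tags cannot be attributed to a microbe unambiguously).
    """
    hits = Counter()
    for tag in experimental_tags:
        if tag in human_tags:
            continue  # ambiguous tag, skip it
        organism = microbial_library.get(tag)
        if organism is not None:
            hits[organism] += 1
    return hits


# Hypothetical toy data; real tags are longer and libraries far larger.
microbial_library = {"AAAC": "virus X", "CCCA": "bacterium Y", "TTTT": "virus X"}
human_tags = {"TTTT", "GGGT"}
experimental_tags = ["AAAC", "AAAC", "GGGT", "CCCA", "TTTT"]

hits = screen_tags(experimental_tags, microbial_library, human_tags)
for organism, count in hits.most_common():
    print(organism, count)
```

A dictionary lookup keeps the screen linear in the number of experimental tags, which matters when a library holds a very large number of tags.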
ABSTRACT: The Pathway Interaction Database (PID, http://pid.nci.nih.gov) is a freely available collection of curated and peer-reviewed pathways composed of human molecular signaling and regulatory events and key cellular processes. Created in a collaboration between the US National Cancer Institute and Nature Publishing Group, the database serves as a research tool for the cancer research community and others interested in cellular pathways, such as neuroscientists, developmental biologists and immunologists. PID offers a range of search features to facilitate pathway exploration. Users can browse the predefined set of pathways or create interaction network maps centered on a single molecule or cellular process of interest. In addition, the batch query tool allows users to upload long list(s) of molecules, such as those derived from microarray experiments, and either overlay these molecules onto predefined pathways or visualize the complete molecular connectivity map. Users can also download molecule lists, citation lists and complete database content in extensible markup language (XML) and Biological Pathways Exchange (BioPAX) Level 2 format. The database is updated with new pathway content every month and supplemented by specially commissioned articles on the practical uses of other relevant online tools.
Nucleic Acids Research 11/2008; 37(Database issue):D674-9. · 8.28 Impact Factor
ABSTRACT: Cancer is recognized to be a family of gene-based diseases whose causes are to be found in disruptions of basic biologic processes. An increasingly deep catalogue of canonical networks details the specific molecular interaction of genes and their products. However, mapping of disease phenotypes to alterations of these networks of interactions is accomplished indirectly and non-systematically. Here we objectively identify pathways associated with malignancy, staging, and outcome in cancer through application of an analytic approach that systematically evaluates differences in the activity and consistency of interactions within canonical biologic processes. Using large collections of publicly accessible genome-wide gene expression, we identify small, common sets of pathways - TrkA Receptor, Apoptosis response to DNA Damage, Ceramide, Telomerase, CD40L and Calcineurin - whose differences robustly distinguish diverse tumor types from corresponding normal samples, predict tumor grade, and distinguish phenotypes such as estrogen receptor status and p53 mutation state. Pathways identified through this analysis perform as well as or better than phenotypes used in the original studies in predicting cancer outcome. This approach provides a means to use genome-wide characterizations to map key biological processes to important clinical features in disease.
ABSTRACT: The Pathway Interaction Database (PID, http://pid.nci.nih.gov) is a freely available collection of curated and peer-reviewed signaling pathways composed of human biomolecular interactions and cellular processes. Created in a collaboration between the U.S. National Cancer Institute and Nature Publishing Group, the database is a research tool for cell biologists, biochemists, computational biologists and bioinformaticians. The PID offers a range of tools to facilitate pathway exploration. Users can browse the pre-defined set of pathways and also create interaction network maps centered on a single molecule of interest or an extensive list of molecules. In addition, users can download complete data sets in extensible markup language (XML) and Biological Pathway Exchange (BioPAX) Level 2 formats. The database is updated every month and supplemented by a concise editorial section that provides synopses of recent noteworthy papers in cell signaling and specially commissioned articles on the practical uses of other relevant online tools. Users can sign up for free email alerts or RSS feeds to receive
ABSTRACT: Cancers have been described as wounds that do not heal, suggesting that the two share common features. By comparing microarray data from a model of renal regeneration and repair (RRR) with reported gene expression in renal cell carcinoma (RCC), we asked whether those two processes do, in fact, share molecular features and regulatory mechanisms. The majority (77%) of the genes expressed in RRR and RCC were concordantly regulated, whereas only 23% were discordant (i.e., changed in opposite directions). The orchestrated processes of regeneration, involving cell proliferation and immune response, were reflected in the concordant genes. The discordant gene signature revealed processes (e.g., morphogenesis and glycolysis) and pathways (e.g., hypoxia-inducible factor and insulin-like growth factor-I) that reflect the intrinsic pathologic nature of RCC. This is the first study that compares gene expression patterns in RCC and RRR. It does so, in particular, with relation to the hypothesis that RCC resembles the wound healing processes seen in RRR. However, careful attention to the genes that are regulated in the discordant direction provides new insights into the critical differences between renal carcinogenesis and wound healing. The observations reported here provide a conceptual framework for further efforts to understand the biology and to develop more effective diagnostic biomarkers and therapeutic strategies for renal tumors and renal ischemia.
Cancer Research 08/2006; 66(14):7216-24. · 8.65 Impact Factor
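The concordant/discordant split reported in the RRR versus RCC abstract above (same direction of change versus opposite directions) can be illustrated with a short Python sketch. The function name `split_by_direction`, the gene labels, and the use of signed log fold changes as input are assumptions for illustration, not the study's actual data or method.

```python
def split_by_direction(rrr_changes, rcc_changes):
    """Split shared genes into concordant and discordant sets.

    Both arguments map gene name -> signed expression change (assumed to
    be log fold changes). A gene is concordant when both changes have the
    same sign and discordant when the signs differ; genes with a zero
    change in either data set are left unclassified.
    """
    shared = rrr_changes.keys() & rcc_changes.keys()
    concordant = sorted(g for g in shared if rrr_changes[g] * rcc_changes[g] > 0)
    discordant = sorted(g for g in shared if rrr_changes[g] * rcc_changes[g] < 0)
    classified = len(concordant) + len(discordant)
    fraction = len(concordant) / classified if classified else 0.0
    return concordant, discordant, fraction


# Hypothetical values; gene names are placeholders, not genes from the study.
rrr = {"geneA": 1.2, "geneB": -0.8, "geneC": 2.0, "geneD": -1.5}
rcc = {"geneA": 0.9, "geneB": -1.1, "geneC": -0.7, "geneD": -0.3, "geneE": 1.0}

concordant, discordant, fraction = split_by_direction(rrr, rcc)
print(concordant, discordant, fraction)
```

Only genes present in both data sets are compared, so a gene measured in one experiment alone (here the placeholder `geneE`) does not affect the concordant fraction.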
ABSTRACT: Membrane proteins are responsible for many critical cellular functions and identifying cell surface proteins on different keratinocyte populations by proteomic approaches would improve our understanding of their biological function. The ability to characterize membrane proteins, however, has lagged behind that of soluble proteins both in terms of throughput and protein coverage. In this study, a membrane proteomic investigation of keratinocytes using a two-dimensional liquid chromatography (LC) tandem-mass spectrometry (MS/MS) approach that relies on a buffered methanol-based solubilization, and tryptic digestion of purified plasma membrane is described. A highly enriched plasma membrane fraction was prepared from newborn foreskins using sucrose gradient centrifugation, followed by a single-tube solubilization and tryptic digestion of membrane proteins. This digestate was fractionated by strong cation-exchange chromatography and analyzed using microcapillary reversed-phase LC-MS/MS. In a set of 1306 identified proteins, 866 had a gene ontology (GO) annotation for cellular component, and 496 of these annotated proteins (57.3%) were assigned as known integral membrane proteins or membrane-associated proteins. Included in the identification of a large number of aqueous insoluble integral membrane proteins were many known intercellular adhesion proteins and gap junction proteins. Furthermore, 121 proteins from cholesterol-rich plasma membrane domains (caveolar and lipid rafts) were identified.
Journal of Investigative Dermatology 11/2004; 123(4):691-9. · 6.19 Impact Factor
ABSTRACT: The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
Genome Research 11/2004; 14(10B):2121-7. · 14.40 Impact Factor
ABSTRACT: A combined, detergent- and organic solvent-based proteomic method for the analysis of detergent-resistant membrane rafts (DRMR) is described. These specialized domains of the plasma membrane contain a distinctive and dynamic protein and/or lipid complement, which can be isolated from most mammalian cells. Lipid rafts are predominantly involved in signal transduction and adapted to mediate and produce different cellular responses. To facilitate a better understanding of their biology and role, DRMR were isolated from Vero cells as a Triton X-100 insoluble fraction. After detergent removal, sonication in 60% buffered methanol was used to extract, solubilize and tryptically digest the resulting protein complement. The peptide digestate was analyzed by microcapillary reversed-phase liquid chromatography-tandem mass spectrometry. Gas-phase fractionation in the mass-to-charge range was employed to broaden the selection of precursor ions and increase the number of identifications in an effort to detect less abundant proteins. A total of 380 proteins were identified including all known lipid raft markers. A total of 91 (24%) proteins were classified as integral alpha-helical membrane proteins, of which 51 (56%) were predicted to have multiple transmembrane domains.
ABSTRACT: Cleavable isotope-coded affinity tag (cICAT) reagents were utilized to identify and quantitate protein expression differences in control and inorganic phosphate-treated murine MC3T3-E1 osteoblast cells. Proteins extracted from control and treated cells were labeled with the light and heavy isotopic versions of cICAT reagents, respectively. The cICAT-labeled samples were combined, proteolytically digested, and the cICAT-derivatized peptides isolated using immobilized avidin chromatography. The cICAT-labeled peptides were resolved into 96 fractions by strong cation-exchange (SCX) liquid chromatography (LC). Analysis of the SCX-LC cICAT peptide fractions by microcapillary reversed-phase LC-tandem mass spectrometry resulted in the identification and quantitation of 7227 unique peptides corresponding to 2501 proteins, or roughly 9% of the proteins currently predicted to be encoded by the mouse genome. A false positive analysis indicated a 98% confidence in the peptide identifications. To corroborate changes in abundance measured by cICAT with those detectable in traditionally prepared cell lysate, we chose to analyze cyclin D1. Cyclin D1 has been previously identified as a phosphate-responsive gene and was likewise identified as a phosphate-responsive protein in the current analysis. The 1.76-fold increase in abundance in cyclin D1 determined from cICAT corresponds well with the 2.41-fold increase as determined by Western blotting. These results demonstrate that quantitative proteomics is capable of providing a quantitative view of thousands of proteins in mammalian cells within a defined set of experiments.
ABSTRACT: In this study, we utilized a multidimensional peptide separation strategy combined with tandem mass spectrometry (MS/MS) for the identification of proteins in human serum. After enzymatically digesting serum with trypsin, the peptides were fractionated using liquid-phase isoelectric focusing (IEF) in a novel ampholyte-free format. Twenty IEF fractions were collected and analyzed by reversed-phase microcapillary liquid chromatography (microLC)-MS/MS. Bioinformatic analysis of the raw MS/MS spectra resulted in the identification of 844 unique peptides, corresponding to 437 proteins. This study demonstrates the efficacy of ampholyte-free peptide autofocusing, which alleviates peptide losses in ampholyte removal strategies. The results show that the separation strategy is effective for high-throughput characterization of proteins from complex proteomic mixtures.
ABSTRACT: Sites with substantive bioinformatics operations are challenged to build data processing and delivery infrastructure that provides reliable access and enables data integration. Locally generated data must be processed and stored such that relationships to external data sources can be presented. Consistency and comparability across data sets requires annotation with controlled vocabularies and, further, metadata standards for data representation. Programmatic access to the processed data should be supported to ensure the maximum possible value is extracted. Confronted with these challenges at the National Cancer Institute Center for Bioinformatics, we decided to develop a robust infrastructure for data management and integration that supports advanced biomedical applications.
We have developed an interconnected set of software and services called caCORE. Enterprise Vocabulary Services (EVS) provide controlled vocabulary, dictionary and thesaurus services. The Cancer Data Standards Repository (caDSR) provides a metadata registry for common data elements. Cancer Bioinformatics Infrastructure Objects (caBIO) implements an object-oriented model of the biomedical domain and provides Java, Simple Object Access Protocol and HTTP-XML application programming interfaces. caCORE has been used to develop scientific applications that bring together data from distinct genomic and clinical science sources.
caCORE downloads and web interfaces can be accessed from links on the caCORE web site (http://ncicb.nci.nih.gov/core). caBIO software is distributed under an open source license that permits unrestricted academic and commercial use. Vocabulary and metadata content in the EVS and caDSR, respectively, is similarly unrestricted, and is available through web applications and FTP downloads.
http://ncicb.nci.nih.gov/core/publications contains links to the caBIO 1.0 class diagram and the caCORE 1.0 Technical Guide, which provide detailed information on the present caCORE architecture, data sources and APIs. Updated information appears on a regular basis on the caCORE web site (http://ncicb.nci.nih.gov/core).
ABSTRACT: The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multi-institutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).
Proceedings of the National Academy of Sciences 12/2002; 99(26):16899-903. · 9.74 Impact Factor
ABSTRACT: A gene's expression pattern provides clues to its role in normal physiology and disease. To provide quantitative expression levels on a genome-wide scale, the Cancer Genome Anatomy Project (CGAP) uses serial analysis of gene expression (SAGE). Over 5 million transcript tags from more than 100 human cell types have been assembled. To enhance the utility of this data, the CGAP SAGE project created SAGE Genie, a web site for the analysis and presentation of SAGE data (http://cgap.nci.nih.gov/SAGE). SAGE Genie provides an automatic link between gene names and SAGE transcript levels, accounting for alternative transcription and many potential errors. These informatics advances provide a rapid and intuitive view of transcript expression in the human body or brain, displayed on the SAGE Anatomic Viewer. We report here an easily accessible view of nearly any gene's expression in a wide variety of malignant and normal tissues.
Proceedings of the National Academy of Sciences 09/2002; 99(17):11287-92. · 9.74 Impact Factor
ABSTRACT: Researchers working collaboratively in Brazil and the United States have assembled an International Database of Cancer Gene Expression. Several strategies have been employed to generate gene expression data including expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and open reading-frame expressed sequence tags (ORESTES). The database contains six million gene tags that reflect the gene expression profiles in a wide variety of cancerous tissues and their normal counterparts. All sequences are deposited in the public databases, GenBank and SAGEmap. A suite of informatics tools was designed to facilitate in silico analysis of the gene expression datasets and is available through the NCI Cancer Genome Anatomy Project web site (http://cgap.nci.nih.gov).
The Pharmacogenomics Journal 02/2002; 2(3):156-64. · 5.13 Impact Factor
ABSTRACT: The Cancer Genome Anatomy Project (CGAP) was designed and implemented to provide public datasets, material resources and informatics tools to serve as a platform to support the elucidation of the molecular signatures of cancer. This overview of CGAP describes the status of this effort to develop resources based on gene expression, polymorphism identification and chromosome aberrations, and we describe a variety of analytical tools designed to facilitate in silico analysis of these datasets.
Trends in Cell Biology 12/2001; 11(11):S66-71. · 11.72 Impact Factor