About
Publications: 267
Reads: 90,823
Citations: 127,699
Publications (267)
Curation of literature in life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have resources to scale to the whole relevant literature and all have to pri...
The evolutionary classification of protein domains (ECOD) classifies protein domains using a combination of sequence and structural data (http://prodata.swmed.edu/ecod). Here we present the culmination of our previous efforts at classifying domains from predicted structures, principally from the AlphaFold Database (AFDB), by integrating these domai...
InterPro (https://www.ebi.ac.uk/interpro) is a freely accessible resource for the classification of protein sequences into families. It integrates predictive models, known as signatures, from multiple member databases to classify sequences into families and predict the presence of domains and significant sites. The InterPro database provides annota...
The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-qual...
The Pfam protein families database is a comprehensive collection of protein domains and families used for genome annotation and protein structure and function analysis (https://www.ebi.ac.uk/interpro/). This update describes major developments in Pfam since 2020, including decommissioning the Pfam website and integration with InterPro, harmonizatio...
The Rfam database, a widely used repository of non-coding RNA families, has undergone significant updates in release 15.0. This paper introduces major improvements, including the expansion of Rfamseq to 26 106 genomes, a 76% increase, incorporating the latest UniProt reference proteomes and additional viral genomes. Sixty-five RNA families were enh...
The Rfam database, a widely used repository of non-coding RNA (ncRNA) families, has undergone significant updates in release 15.0. This paper introduces major improvements, including the expansion of Rfamseq to 26 106 genomes, a 76% increase, incorporating the latest UniProt reference proteomes and additional viral genomes. Sixty-five RNA families...
Motivation
Data reuse is a common and vital practice in molecular biology and enables the knowledge gathered over recent decades to drive discovery and innovation in the life sciences. Much of this knowledge has been collated into molecular biology databases, such as UniProtKB, and these resources derive enormous value from sharing data among thems...
Tandem Repeat Proteins (TRPs) are a class of proteins with repetitive amino acid sequences that have been studied extensively for over two decades. Different features at the level of sequence, structure, function and evolution have been attributed to them by various authors. And yet many of their salient features appear only when looking at specific...
The carbon footprint of scientific computing is substantial, but environmentally sustainable computational science (ESCS) is a nascent field with many opportunities to thrive. To realize the immense green opportunities and continued, yet sustainable, growth of computer science, we must take a coordinated approach to our current challenges, includin...
Motivation
The visualization of biological data is a fundamental technique that enables researchers to understand and explain biology. Some of these visualizations have become iconic, for instance: tree views for taxonomy, cartoon rendering of 3D protein structures, or tracks to represent features in a gene or protein, for instance in a genome brow...
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation....
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of seque...
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. Thes...
Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional...
The European Bioinformatics Institute (EMBL-EBI) maintains a comprehensive range of freely available and up-to-date molecular data resources, which includes over 40 resources covering every major data type in the life sciences. This year's service update for EMBL-EBI includes new resources, PGS Catalog and AlphaFold DB, and updates on existing reso...
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately...
Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microR...
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need...
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the S...
RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that provides a single access point to 44 RNA resources and >18 million ncRNA sequences from a wide range of organisms and RNA types. RNAcentral now also includes secondary (2D) structure information for >13 million sequences, making RNAcentral the world’s largest RNA 2D str...
Non‐coding RNAs are essential for all life and carry out a wide range of functions. Information about these molecules is distributed across dozens of specialized resources. RNAcentral is a database of non‐coding RNA sequences that provides a unified access point to non‐coding RNA annotations from >40 member databases and helps provide insight into...
RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences, collating information on ncRNA sequences of all types from a broad range of organisms. We have recently added a new genome mapping pipeline that identifies genomic locations for ncRNA sequences in 296 species. We have also added several new types of functional annotations,...
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their funct...
Rfam is a database of non‐coding RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. Using a combination of manual and literature‐based curation and a custom software pipeline, Rfam converts descriptions of RNA families found in the scientific literature into co...
The MEROPS database (http://www.ebi.ac.uk/merops/) is an integrated source of information about peptidases, their substrates and inhibitors. The hierarchical classification is: protein-species, family, clan, with an identifier at each level. The MEROPS website moved to the EMBL-EBI in 2017, requiring refactoring of the code-base and services provid...
The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We d...
Motivation:
Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep...
Motivation:
Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized, and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, the question of their s...
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by i...
Motivation:
Similarity based methods have been widely used in order to infer the properties of genes and gene products containing little or no experimental annotation. New approaches that overcome the limitations of methods that rely solely upon sequence similarity are attracting increased attention. One of these novel approaches is to use the org...
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched se...
During 11–12 August 2014, a Protein Bioinformatics and Community Resources Retreat was held at the Wellcome Trust Genome Campus in Hinxton, UK. This meeting brought together the principal investigators of several specialized protein resources (such as CAZy, TCDB and MEROPS) as well as those from protein databases from the large Bioinformatics centr...
The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence data...
As the volume of data relating to proteins increases, researchers rely more and more on the analysis of published data, thus increasing the importance of good access to these data that vary from the supplemental material of individual papers, all the way to major reference databases with professional staff and long-term funding. Specialist protein...
The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website f...
MEROPS is a database of proteolytic enzymes as well as their inhibitors and substrates. Proteolytic enzymes and protein inhibitors are organized into protein domain families. In turn, families are organized into clans. Each peptidase, inhibitor, family, and clan has associated annotation, a multiple sequence alignment, a phylogenetic tree, literatu...
UniProt is an important collection of protein sequences and their annotations, which has doubled in size to 80 million sequences during the past year. This growth in sequences has prompted an extension of UniProt accession number space from 6 to 10 characters. An increasing fraction of new sequences are identical to a sequence that already exists i...
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have diffe...
The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release con...
CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for cryst...
The database iPfam, available at http://ipfam.org, catalogues Pfam domain interactions based on known 3D structures that are found in the Protein Data Bank, providing interaction data at the molecular level. Previously, the iPfam domain–domain interaction data was integrated within the Pfam database and website, but it has now been migrated to a se...
A novel highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by a highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence, structure, and recent experimental evidence reported by other groups, we he...
The NTF2-like superfamily is a versatile group of protein domains sharing a common fold. The sequences of these domains are very diverse and they share no common sequence motif. These domains serve a range of different functions within the proteins in which they are found, including both catalytic and non-catalytic versions. Clues to the function o...
TreeFam (http://www.treefam.org) is a database of phylogenetic trees inferred from animal genomes. For every TreeFam family we provide homology predictions together with the evolutionary history of the genes. Here we describe an update of the TreeFam database. The TreeFam project was resurrected in 2012 and has seen two releases since. The latest r...
The mission of the Universal Protein Resource (UniProt) (http://www.uniprot.org) is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequences and functional annotation. It integrates, interprets and standardizes data from literature and numerous resources to achieve the most comprehen...
Every genome contains a large number of uncharacterized proteins that may encode entirely novel biological systems. Many of these uncharacterized proteins fall into related sequence families. By applying sequence and structural analysis we hope to provide insight into novel biology.
We analyze a previously uncharacterized Pfam protein family called...
Pie charts showing relative sequence similarity of uncharacterized proteins in COMBREX to experimentally characterized (green) proteins. (A) Blue proteins. (B) Black proteins. Within each pie, proteins are divided into those that exhibit “strong” similarity, “weak” similarity, or “no” similarity to characterized proteins. Strong similarity requires...
Flowchart of GSDB construction. Source information includes external databases such as UniProtKB and other databases (“Source DBs”), and genes nominated by users via the COMBREX website. All entries originating outside of UniProtKB must be assigned a unique UniProtKB accession number before entry into the process. All candidates with a UniProtKB ac...
Domain composition of proteins in COMBREX. All COMBREX proteins were clustered into groups based on identical domain composition. Along the x-axis, groups are separated based on the number of annotated Pfam domains per protein (as defined by Pfam). (A) Histogram, where the green portion of each bar indicates the number of proteins that have identic...
Format of functional descriptions in COMBREX. (DOC)
More detailed description of the following topics: selected COMBREX-funded experimental results; functional inference from existing experimental information; use of structured vocabulary; and prioritization of genes for experimental characterization. Materials and Methods, including the following topics: the COMBREX website; functional status of ge...
Summary of proteins examined by COMBREX-funded projects. (XLSX)
Association of structural data with uncharacterized proteins. (DOC)
Function predictions submitted to COMBREX by external groups. (DOC)
Number of clusters as a function of cluster size. Clusters are broken down into three types based on the functional status of their component proteins: clusters containing ≥1 experimentally characterized (green) gene are represented by the green line; clusters containing no experimentally characterized proteins but ≥1 protein with a predicted funct...
Free-text strings analyzed by GOCat. (DOC)
Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under c...
Salmonella Typhi and Typhimurium diverged only ∼50 000 years ago, yet have very different host ranges and pathogenicity. Despite the availability of multiple whole-genome sequences, the genetic differences that have driven these changes in phenotype are only beginning to be understood. In this study, we use transposon-directed insertion-site sequen...
We have identified a new protein domain, which we have named the SHOCT domain (Short C-terminal domain). This domain is widespread in bacteria with over a thousand examples. But we found it is missing from the most commonly studied model organisms, despite being present in closely related species. Its predominantly C-terminal location, co-occurrence w...
Primer sequences. The underlined nucleotides represent the AscI and NotI restriction sites. (DOCX)
Sequences of expressed peptides. (DOCX)
Domain architectures of SHOCT domain-containing proteins. (DOCX)
Background:
The Amoebozoa constitute one of the primary divisions of eukaryotes, encompassing taxa of both biomedical and evolutionary importance, yet its genomic diversity remains largely unsampled. Here we present an analysis of a whole genome assembly of Acanthamoeba castellanii (Ac) the first representative from a solitary free-living amoebozo...
It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was...
The International Society for Biocuration (ISB) was created in 2009 specifically to promote biocuration, the product of multidisciplinary teams of database curators, software developers and bioinformaticians. Biocurators, whose work facilitates research and education across the life sciences, create and maintain a wide variety of online tools and d...
The Rfam database (available via the website at http://rfam.sanger.ac.uk and through our mirror at http://rfam.janelia.org) is a collection of non-coding RNA families, primarily RNAs with a conserved RNA secondary structure, including both RNA genes and mRNA cis-regulatory elements. Each family is represented by a multiple sequence alignment, predi...
Alternative inclusion of exons increases the functional diversity of proteins. Among alternatively spliced exons, tissue-specific exons play a critical role in maintaining tissue identity. This raises the question of how tissue-specific protein-coding exons influence protein function. Here we investigate the structural, functional, interaction, and...
We have identified a new bacterial protein domain that we hypothesise binds to peptidoglycan. This domain is called the YARHG domain after the most highly conserved sequence-segment. The domain is found in the extracellular space and is likely to be composed of four alpha-helices. The domain is found associated with protein kinase domains, suggesti...
Motivation: microRNAs are short non-coding RNAs that regulate gene expression by inhibiting target mRNA genes. Next-generation sequencing combined with bioinformatics analyses provide an opportunity to predict numerous novel miRNAs. The efficiency ...
Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. It...
As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lackin...
The 5th International Biocuration Conference brought together over 300 scientists to exchange on their work, as well as discuss issues relevant to the International Society for Biocuration’s (ISB) mission. Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions wit...
Wikipedia, the online encyclopedia, is the most famous wiki in use today. It contains over 3.7 million pages of content; with many pages written on scientific subject matters that include peer-reviewed citations, yet are written in an accessible manner and generally reflect the consensus opinion of the community. In this, the 19th Annual Database I...
Pfam is a widely used database of protein families, currently containing more than 13 000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR...
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to de...