
Abstract

Motivation: The Life Sciences have emerged as a key domain in the Linked Data community because of the diversity of data semantics and formats available through a great variety of databases and web technologies. The domain has therefore served as an ideal testbed for applications in the web of data. Unfortunately, bioinformaticians are not exploiting the full potential of this already available technology, and experts in the Life Sciences have real difficulties discovering, understanding and devising how to take advantage of these interlinked (integrated) data. Results: In this article, we present Bioqueries, a wiki-based portal aimed at community building around biological Linked Data. This tool has been designed to aid bioinformaticians in developing SPARQL queries to access biological databases exposed as Linked Data, and also to help biologists gain a deeper insight into the potential use of this technology. This public space offers several services and a collaborative infrastructure to stimulate the consumption of biological Linked Data and, therefore, contribute to realizing the benefits of the web of data in this domain. Bioqueries currently contains 215 query entries grouped by database and theme, 230 registered users and 44 endpoints that serve biological Resource Description Framework (RDF) data. Availability: The Bioqueries portal is freely accessible at http://bioqueries.uma.es. Supplementary information: Supplementary data are available at Bioinformatics online.
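As an illustration of the kind of query entry documented and shared on a portal such as Bioqueries, the sketch below runs a simple SPARQL query against a public biological endpoint from Python. The UniProt endpoint URL and the up:Protein/up:mnemonic terms are illustrative choices, not taken from the article; execution uses the SPARQLWrapper library.

```python
# Minimal sketch: execute a biological SPARQL query of the kind shared on a
# query portal. Endpoint and vocabulary are assumptions chosen for illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?protein ?mnemonic
    WHERE {
        ?protein a up:Protein ;
                 up:mnemonic ?mnemonic .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["protein"]["value"], binding["mnemonic"]["value"])
```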
... There are several works on query generation and processing for ontologies, such as [59-61]. Queries can often be difficult to formulate across these datasets [62]. ...
... Unlike this work, we can provide not only valid but also meaningful query suggestions in a dynamic manner, according to users' topics of interest. Godoy et al. presented a collaborative environment that allows users to register queries manually through wiki pages and to share and execute those queries over Linked Data [61]. A series of desired queries might be generated from large ontologies such as the NCI Thesaurus by extracting relevant information [8]. ...
Article
Full-text available
Biomedical ontology refers to a shared conceptualization for a biomedical domain of interest that has vastly improved data management and data sharing through the open data movement. The rapid growth and availability of biomedical data make it impractical and computationally expensive to perform manual analysis and query processing over large-scale ontologies. The inability to analyze ontologies from such a variety of sources, and to support knowledge discovery for clinical practice and biomedical research, should be overcome with new technologies. In this study, we developed a Medical Topic discovery and Query generation framework (MedTQ), composed of a series of approaches and algorithms. A predicate neighborhood pattern-based approach is introduced to compute the similarity of predicates (relations) in ontologies. Given a predicate similarity metric, machine learning algorithms were developed for automatic topic discovery and query generation. The topic discovery algorithm, called hierarchical K-Means, extends the (unsupervised) K-means clustering algorithm to construct a topic hierarchy. In hierarchical K-Means, a level-by-level optimization strategy keeps elements within a topic strongly associated. Automatic query generation is provided for each discovered topic and can guide users in interactive query design and processing. As a case study, the framework was evaluated by generating a topic hierarchy for the DrugBank ontology. Results demonstrated that the MedTQ framework can enhance knowledge discovery by capturing underlying structures from domain-specific data and ontologies.
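The level-by-level construction described above can be pictured with a short recursive sketch: each level clusters predicate feature vectors and then recurses into every cluster. This is an illustrative reconstruction under assumed inputs (NumPy feature vectors, scikit-learn's KMeans), not the MedTQ authors' implementation.

```python
# Illustrative hierarchical K-Means over predicate feature vectors; the feature
# representation and stopping rules are assumptions, not the MedTQ code.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(vectors, predicates, k=3, depth=0, max_depth=2, min_size=4):
    """Recursively split predicates into a topic hierarchy."""
    if depth >= max_depth or len(predicates) < min_size:
        return {"topic": predicates}                      # leaf topic
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
    children = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        children.append(hierarchical_kmeans(vectors[idx],
                                            [predicates[i] for i in idx],
                                            k, depth + 1, max_depth, min_size))
    return {"children": children}
```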
... Despite these approaches, constructing executable SPARQL code, even for a simple query, still remains a time-consuming task; thus, a mechanism that saves the time of preparing SPARQL code is necessary to maximize the use of available RDF datasets. As an alternative approach to this issue, a wiki-based portal for sharing SPARQL queries was constructed [13], which can bypass the burdensome coding task. Although the queries registered on this service can be executed on the portal site, a mechanism for reusing these queries in other environments would maximize the usefulness of the accumulated queries. ...
... The potential use of SPANG can be further extended by database users or database providers through development of SPARQL template libraries. Although a service for sharing SPARQL queries exists [13], it is difficult to execute them directly for instant reuse by users. In SPANG, users can directly call SPARQL templates across the Web. ...
Article
Full-text available
Background: Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users, including bioinformaticians. Thus, an easy-to-use interface is necessary. Results: We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. Conclusions: SPANG helps users to exploit RDF datasets by generating and reusing SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang.
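The template idea can be sketched in a few lines: a parameterized SPARQL query is filled in from arguments and sent to a chosen endpoint. The template text and argument names below are illustrative assumptions, not SPANG's actual template library or command-line interface.

```python
# Hedged sketch of argument-driven query generation (not SPANG's actual code).
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATE = """
SELECT ?s
WHERE {{ ?s a <{class_uri}> }}
LIMIT {limit}
"""

def run_template(endpoint_url, class_uri, limit=10):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(TEMPLATE.format(class_uri=class_uri, limit=limit))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]
```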
... The Bioqueries application [16] has been designed for two user profiles, biologists and bioinformaticians, who share a virtual space in a wiki-based portal for the design and execution of (federated and non-federated) SPARQL queries, which can be added, edited, executed and documented with natural language descriptions. ...
... With the increasing adoption of Semantic Web technologies (22)(23)(24)(25) and formalisms in biomedical and biomolecular areas, many popular database applications (such as UniProt (36), Ensembl (9), BioModels (19), etc.) provide accessible data represented in the Resource Description Framework (RDF) format (10,13,27). As the World Wide Web Consortium (W3C) recommended standard, the graph-based RDF model is well suited for explicitly publishing life science data and linking the diverse data resources (5,7,11,28). ...
Article
Full-text available
Resource Description Framework (RDF) is widely used for representing biomedical data in practical applications. With the increase in RDF-based applications, there is an emerging requirement for novel architectures that provide effective support for the future explosion of RDF data. Inspired by the success of the new designs in the National Center for Biotechnology Information dbSNP (The Single Nucleotide Polymorphism Database), which manages increasing data volumes using JSON (JavaScript Object Notation), in this paper we present an effective mapping tool that allows data migration from RDF to JSON to support future massive data explosions and releases. We first introduce a set of mapping rules that transform RDF into the JSON format, and then present the corresponding transformation algorithm. On this basis, we develop an effective and user-friendly tool called RDF2JSON, which automates the process of RDF data extraction and the corresponding JSON data generation.
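The core of such a mapping can be sketched with rdflib: triples are grouped by subject, and each predicate becomes a JSON key holding a list of object values. These rules are a deliberate simplification for illustration, not RDF2JSON's actual rule set.

```python
# Minimal RDF-to-JSON sketch (simplified rules, not the RDF2JSON tool itself).
import json
from collections import defaultdict
from rdflib import Graph

def rdf_to_json(path, rdf_format="turtle"):
    g = Graph()
    g.parse(path, format=rdf_format)       # adjust format to the input file
    doc = defaultdict(lambda: defaultdict(list))
    for s, p, o in g:                      # iterate over all triples
        doc[str(s)][str(p)].append(str(o))
    return json.dumps(doc, indent=2)
```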
... In recent years, many linked open biomedical knowledge graphs have been published in the Resource Description Framework (RDF) [3] format. Godoy et al. [6] provide the largest network of Linked Data for the Life Sciences. We have released a Chinese biomedical knowledge graph (CBioMedKG) in our prior work [17]. ...
Preprint
Full-text available
Medical activities, such as diagnoses, medicine treatments and laboratory tests, as well as the temporal relations between these activities, are basic concepts in clinical research. However, the existing relational data model for electronic medical records (EMRs) lacks explicit and accurate semantic definitions of these concepts. This leads to inconvenient query construction and inefficient query execution, since multi-table join queries are frequently required. In this paper, we propose a patient event graph (PatientEG) model to capture the characteristics of EMRs. We define five types of medical entities, five types of medical events and five types of temporal relations. Based on the proposed model, we also construct a PatientEG dataset with 191,294 events, 3,429 distinct entities and 545,993 temporal relations, using EMRs from Shanghai Shuguang hospital. To normalize entity values that contain synonyms, hyponyms and abbreviations, we link them to the Chinese biomedical knowledge graph. With the PatientEG dataset, we can conveniently perform complex queries for clinical research such as auxiliary diagnosis and therapeutic effectiveness analysis. In addition, we provide a SPARQL endpoint for accessing the PatientEG dataset, which is also publicly available online, and we list several illustrative SPARQL queries on our website.
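A query of the kind such an event graph is meant to support might look like the sketch below: find patients whose diagnosis event precedes a treatment event. Every prefix, class and property name here is a hypothetical placeholder, not the PatientEG dataset's actual vocabulary; the string could be sent with SPARQLWrapper as in the earlier sketches.

```python
# Hedged sketch of a temporal query over a patient event graph.
# All terms under the peg: prefix are hypothetical placeholders.
QUERY = """
PREFIX peg: <http://example.org/patienteg/>
SELECT ?patient ?diagnosis ?treatment
WHERE {
    ?d a peg:DiagnosisEvent ;
       peg:patient ?patient ;
       peg:value   ?diagnosis .
    ?t a peg:TreatmentEvent ;
       peg:patient ?patient ;
       peg:value   ?treatment .
    ?d peg:before ?t .          # temporal relation: diagnosis precedes treatment
}
LIMIT 20
"""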
... This interface and the application itself have been tested in a number of SPARQL Endpoints, showing that the results are useful for the design of SPARQL queries. This application is being used to help Bioqueries [3] (http://bioqueries.uma.es) in designing queries on SPARQL Endpoints accessing biological data. ...
Conference Paper
The Linked Open Data community is constantly producing new repositories that store information from different domains. The data included in these repositories follow the rules proposed by the W3C community, based on standards such as the Resource Description Framework (RDF) and the SPARQL query language. The main advantage of this approach is that external developers can access the data from their applications. This advantage is also one of the main challenges of this new technology, owing to the cost of exploring how the data are structured in a given repository in order to construct SPARQL queries that retrieve useful information. According to the reviewed literature, there are no applications that reconstruct the underlying semantic data models from a SPARQL endpoint. In this paper, we present an application for the reconstruction of the data model as an OWL (Web Ontology Language) ontology. This application, available as Open Source at http://github.com/estebanpua/ontology-endpoint-extraction, uses a set of SPARQL queries to discover the classes and the (object and data) properties of a given RDF database. A web application interface has also been implemented for users to browse the classes and properties of the ontology generated from the data structure (http://khaos.uma.es/oee). The ontologies generated by this application can help users to understand how the information is semantically organized, making the design of SPARQL queries easier.
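The underlying idea can be sketched with two generic queries that list the classes and properties actually used in the data. The real tool issues a richer set of queries; this sketch only captures the schema-discovery principle.

```python
# Simplified schema discovery over a SPARQL endpoint (illustrative only).
from SPARQLWrapper import SPARQLWrapper, JSON

CLASS_QUERY = "SELECT DISTINCT ?class WHERE { ?s a ?class } LIMIT 500"
PROPERTY_QUERY = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 500"

def discover_schema(endpoint_url):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setReturnFormat(JSON)
    schema = {}
    for name, query, var in [("classes", CLASS_QUERY, "class"),
                             ("properties", PROPERTY_QUERY, "p")]:
        sparql.setQuery(query)
        rows = sparql.query().convert()["results"]["bindings"]
        schema[name] = sorted({row[var]["value"] for row in rows})
    return schema
```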
... Bioqueries aims to start the process towards a greater understanding of Life Sciences LD sources through a collaborative environment based on social networks [1]. Bioqueries opens up a way to build up communities around a shared interest in certain biological domains using public LD. ...
... Therefore, the task of querying the data remains an unresolved problem for many researchers. As a consequence, several efforts have been made to make the data more accessible and hide the complexities of the querying language from the end-user [4][5][6][7][8]. ...
Article
Full-text available
Background: The Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. Results: SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. Conclusions: This new graphical way of creating queries for biological Semantic Web databases considerably improves usability, as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
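The core of any such visual builder is the translation of a drawn graph into SPARQL triple patterns. The sketch below assumes the graph is held as a list of (subject, predicate, object) edges; the data structure and naming are illustrative assumptions, not SPARQLGraph's implementation.

```python
# Sketch: serialize a visually built graph into a SPARQL SELECT query.
def graph_to_sparql(edges, select_vars, limit=25):
    """edges: list of (s, p, o) strings; items starting with '?' are variables."""
    def term(t):
        return t if t.startswith("?") else f"<{t}>"
    patterns = "\n    ".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in edges)
    return (f"SELECT {' '.join(select_vars)}\n"
            f"WHERE {{\n    {patterns}\n}}\nLIMIT {limit}")

# Example with a hypothetical vocabulary: proteins and the pathways they join.
print(graph_to_sparql(
    [("?protein", "http://example.org/participatesIn", "?pathway")],
    ["?protein", "?pathway"]))
```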
Article
Full-text available
Biomedical data are growing at an incredible pace and require substantial expertise to organize in a manner that makes them easily findable, accessible, interoperable and reusable. Massive effort has been devoted to using Semantic Web standards and technologies to create a network of Linked Data for the life sciences, among others. However, while these data are accessible through programmatic means, effective user interfaces to SPARQL endpoints for non-experts are few and far between. Contributing to user frustration is the fact that data are not necessarily described using common vocabularies, thereby making it difficult to aggregate results, especially when they are distributed across multiple SPARQL endpoints. We propose BioSearch, a semantic search engine that uses ontologies to enhance federated query construction and organize search results. BioSearch also features a simplified query interface that allows users to optionally filter their keywords according to classes, properties and datasets. User evaluation demonstrated that BioSearch is more effective and usable than two state-of-the-art search and browsing solutions. Database URL: http://ws.nju.edu.cn/biosearch/
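Federated construction of this kind ultimately rests on SPARQL 1.1's SERVICE keyword, which pulls bindings from a second endpoint inside a single query. The sketch below shows that mechanism only; the endpoint URL and vocabulary are placeholders, not BioSearch's actual configuration.

```python
# Hedged sketch of a federated SPARQL query using SERVICE (placeholder terms).
FEDERATED_QUERY = """
PREFIX ex: <http://example.org/vocab/>
SELECT ?gene ?disease
WHERE {
    ?gene a ex:Gene ;
          ex:associatedWith ?disease .
    SERVICE <http://example.org/other/sparql> {
        ?disease ex:label ?label .
        FILTER (CONTAINS(LCASE(?label), "diabetes"))
    }
}
LIMIT 20
"""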
Article
Full-text available
In the last few years, the Life Sciences domain has experienced rapid growth in the number of available biological databases. The heterogeneity of these databases makes data integration a challenging issue. Integration challenges include locating resources, identifying relationships, handling diverse data formats, and resolving synonyms and ambiguity. The Linked Data approach partially solves these heterogeneity problems by introducing a uniform data representation model. Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. This article introduces kpath, a database that integrates information related to metabolic pathways. kpath also provides a navigational interface that enables not only browsing, but also deep use of the integrated data to build metabolic networks based on existing, dispersed knowledge. This user interface has been used to showcase relationships that can be inferred from the information available in several public databases. Database URL: The public Linked Data repository can be queried at http://sparql.kpath.khaos.uma.es using the graph URI "www.khaos.uma.es/metabolic-pathways-app". The GUI providing navigational access to the kpath database is available at http://browser.kpath.khaos.uma.es.
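A generic probe of such a repository restricts the query to the named graph quoted above. The endpoint and graph URI are taken from the abstract; no kpath-specific classes or properties are assumed (and the graph URI may need an http:// scheme prefix in practice).

```python
# Sketch: list a few triples from the named graph quoted in the kpath abstract.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://sparql.kpath.khaos.uma.es")
sparql.setQuery("""
    SELECT ?s ?p ?o
    FROM <www.khaos.uma.es/metabolic-pathways-app>
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["results"]["bindings"][:3])
```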
Article
Full-text available
Linked Data has already gained popularity as a platform for data integration and analysis in the life science and health care domain. This paper is a progress report on recent developments in the Linked Life Data platform and the Pathway and Interaction Knowledge Base (PIKB) dataset. Together they semantically integrate molecular information and link it to the public data cloud. The dataset interconnects more than 20 complete data sources and helps users to understand the "bigger picture" of a research problem by linking otherwise unrelated data from heterogeneous knowledge domains. To make efficient use of the public Linked Data cloud, we have created instance alignment patterns that restore missing information relationships. As a final step, a massive number of semantic annotations (optimized for high recall or high precision) are generated between the Linked Data instances and the unstructured information. The LDD prototype is available at http://linkedlifedata.com.
Article
Full-text available
Five questionnaires for assessing the usability of a website were compared in a study with 123 participants. The questionnaires studied were SUS, QUIS, CSUQ, a variant of Microsoft's Product Reaction Cards, and one that we have used in our Usability Lab for several years. Each participant performed two tasks on each of two websites: finance.yahoo.com and kiplinger.com. All five questionnaires revealed that one site was significantly preferred over the other. The data were analyzed to determine what the results would have been at different sample sizes from 6 to 14. At a sample size of 6, only 30-40% of the samples would have identified that one of the sites was significantly preferred. Most of the data reach an apparent asymptote at a sample size of 12, where two of the questionnaires (SUS and CSUQ) yielded the same conclusion as the full dataset at least 90% of the time.
Article
Full-text available
Usability does not exist in any absolute sense; it can only be defined with reference to particular contexts. This, in turn, means that there are no absolute measures of usability, since, if the usability of an artefact is defined by the context in which that artefact is used, measures of usability must of necessity be defined by that context too. Despite this, there is a need for broad general measures which can be used to compare usability across a range of contexts. In addition, there is a need for "quick and dirty" methods to allow low-cost assessments of usability in industrial systems evaluation. This chapter describes the System Usability Scale (SUS), a reliable, low-cost usability scale that can be used for global assessments of systems usability.
Article
Full-text available
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
Conference Paper
Full-text available
The Semantic Web has recently seen a rise in large knowledge bases (such as DBpedia) that are freely accessible via SPARQL endpoints. The structured representation of the contained information opens up new possibilities in the way it can be accessed and queried. In this paper, we present an approach that extracts a graph covering relationships between two objects of interest. We show an interactive visualization of this graph that supports the systematic analysis of the found relationships by providing highlighting, previewing, and filtering features.
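In its simplest form, relationship extraction between two resources amounts to enumerating the direct and one-hop paths connecting them on a public endpoint. The two DBpedia resource URIs below are arbitrary examples, not taken from the paper, and the query shows only the two-hop case.

```python
# Hedged sketch: two-hop relationships between two example DBpedia resources.
RELATIONSHIP_QUERY = """
SELECT ?p1 ?mid ?p2
WHERE {
    <http://dbpedia.org/resource/Aspirin> ?p1 ?mid .
    ?mid ?p2 <http://dbpedia.org/resource/Inflammation> .
}
LIMIT 50
"""
```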
Article
Full-text available
Here, we describe the development of WikiPathways (http://www.wikipathways.org), a public wiki for pathway curation, since it was first published in 2008. New features are discussed, as well as developments in the community of contributors. New features include a zoomable pathway viewer, support for pathway ontology annotations, the ability to mark pathways as private for a limited time and the availability of stable hyperlinks to pathways and the elements therein. WikiPathways content is freely available in a variety of formats such as the BioPAX standard, and the content is increasingly adopted by external databases and tools, including Wikipedia. A recent development is the use of WikiPathways as a staging ground for centrally curated databases such as Reactome. WikiPathways is seeing steady growth in the number of users, page views and edits for each pathway. To assess whether the community curation experiment can be considered successful, here we analyze the relation between use and contribution, which gives results in line with other wiki projects. The novel use of pathway pages as supplementary material to publications, as well as the addition of tailored content for research domains, is expected to stimulate growth further.
Article
Full-text available
EcoliWiki is the community annotation component of the PortEco (http://porteco.org; formerly EcoliHub) project, an online data resource that integrates information on laboratory strains of Escherichia coli, its phages, plasmids and mobile genetic elements. As one of the early adopters of the wiki approach to model organism databases, EcoliWiki was designed to not only facilitate community-driven sharing of biological knowledge about E. coli as a model organism, but also to be interoperable with other data resources. EcoliWiki content currently covers genes from five laboratory E. coli strains, 21 bacteriophage genomes, F plasmid and eight transposons. EcoliWiki integrates the Mediawiki wiki platform with other open-source software tools and in-house software development to extend how wikis can be used for model organism databases. EcoliWiki can be accessed online at http://ecoliwiki.net.