ABSTRACT: Research in the life sciences faces increasing amounts of cross-domain data, also known as “big data”. This has notable effects on IT departments and the dry-lab desk alike. In this paper, we report on experiences from a decade of data management at a plant research institute. We describe the switch from personally managed files and heterogeneous information systems towards centrally organised storage management. In particular, we discuss lessons learned within the last decade of productive research, data generation and software development from the perspective of a modern plant research institute, and present the results of a strategic realignment of the data management infrastructure. Finally, we summarise the challenges that have been solved and the questions that remain open.
ABSTRACT: Background
The life-science community faces a major challenge in handling “big data”, highlighting the need for high-quality infrastructures capable of sharing and publishing research data. Data preservation, analysis, and publication are the three pillars of the “big data life cycle”. The infrastructures currently available for managing and publishing data are often designed to meet domain-specific or project-specific requirements, resulting in the repeated development of proprietary solutions and lower-quality data publication and preservation overall.
e!DAL is a lightweight software framework for publishing and sharing research data. Its main features are version tracking, metadata management, information retrieval, registration of persistent identifiers (DOIs), an embedded HTTP(S) server for public data access, access as a network file system, and a scalable storage backend. e!DAL is available as an API for local non-shared storage and as a remote API for distributed applications. It can be deployed “out of the box” as an on-site repository.
e!DAL was developed based on experience from decades of research data management at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Initially developed as a data publication and documentation infrastructure for the IPK’s role as a data center in the DataCite consortium, e!DAL has grown into a general data archiving and publication infrastructure. The e!DAL software has been deployed to the Maven Central Repository. Documentation and software are also available at: http://edal.ipk-gatersleben.de.
ABSTRACT: Information retrieval (IR) plays a central role in the exploration and interpretation of integrated biological datasets that represent the heterogeneous ecosystem of the life sciences. Keyword-based query systems are popular user interfaces here; in turn, to a large extent, the query phrases used determine the quality of the search result and the effort a scientist has to invest in query refinement. In this context, computer-aided query expansion and suggestion is one of the most challenging tasks for life-science information systems. Existing query front-ends support aspects such as spelling correction, query refinement or query expansion. However, the majority of front-ends make only limited use of enhanced IR algorithms to implement comprehensive, computer-aided query refinement workflows. In this work, we present the design of a multi-stage query suggestion workflow and its implementation in the life-science IR system LAILAPS. The presented workflow includes enhanced tokenisation, word breaking, spelling correction, query expansion and query suggestion ranking. A spelling-correction benchmark with 5,401 queries and manually selected use cases for query expansion demonstrate the performance of the implemented workflow and its advantages over state-of-the-art systems.
Full-text · Article · Jun 2014 · Journal of integrative bioinformatics
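The abstract above describes a multi-stage query suggestion workflow. As a minimal sketch of the general idea behind one such stage, the following illustrates dictionary-based spelling correction of query tokens; the vocabulary, thresholds and function names are invented for illustration and this is not the actual LAILAPS implementation.

```python
# Sketch of a spelling-correction stage in a multi-stage query suggestion
# workflow: tokenise the keyword query, then map each token to the closest
# dictionary term. NOT the LAILAPS implementation; vocabulary is invented.

from difflib import get_close_matches

VOCABULARY = ["arabidopsis", "barley", "genome", "metabolite", "phenotype"]

def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term, or the token itself if none is close."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def suggest_query(query):
    """Tokenise a keyword query and spell-correct each token independently."""
    return " ".join(correct_token(t) for t in query.split())

print(suggest_query("barely genom"))  # -> "barley genome"
```

In a full workflow of the kind the abstract describes, a stage like this would be preceded by tokenisation and word breaking and followed by query expansion and suggestion ranking.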
ABSTRACT: Knowledge found in biomedical databases is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats, and data structure. To support the functional annotation of lab data, such as protein sequences, metabolites, or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. A prerequisite for supporting the concept of an integrated data view is to acquire insights into cross-references among database entities.
ABSTRACT: Data-intensive sciences such as astrophysics, the social sciences, the life sciences and in particular bioinformatics are the driving forces towards the 'e-science' age. High-throughput technologies produce huge amounts of primary data. In the classic scientific publication process, primary data is usually aggregated into a number of paragraphs in a journal article and supported by figures or tables. In addition to such condensed knowledge, authors add links to externally managed supplemental material to their articles. However, the older an article is, the lower the chance that these links are still accessible. In general, many researchers are willing to share primary data, but in fact only few of them really provide their data sets, and access is often restricted to project-associated users. Thus, the primary data on which scientific results are based is receiving increased attention, both in public and in the research community. This leads to novel strategies for primary data citation, which must be substantively underpinned by enhancements to classic data management systems; such systems should provide the six general features shown in Figure 1. These challenges in primary data management towards a consistent data publication process lead to the conclusion that a universal platform for primary data management is needed.
ABSTRACT: Knowledge found in biomedical databases, in particular in web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. The data is spread among thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats and data structure. To support the functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. Search engines have the potential to assist in data retrieval from these structured sources, but fall short of providing comprehensive knowledge beyond the interlinked databases. A prerequisite for supporting the concept of an integrated data view is to acquire insights into cross-references among database entities. This is hampered by the fact that only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life-science databases and possible referenced data targets. We study the retrieval quality of our method and report first, promising results. The method is implemented as the tool IDPredictor, which is published under the DOI 10.5447/IPK/2012/4 and is freely available at the URL: http://dx.doi.org/10.5447/IPK/2012/4.
Preview · Article · Jun 2012 · Journal of integrative bioinformatics
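The abstract above concerns predicting cross-references that are not explicitly tagged in database entries. A minimal sketch of the general approach is to scan entry text for identifier patterns of other databases; the patterns below are deliberately simplified approximations for illustration only, not the actual IDPredictor rule set.

```python
# Sketch of cross-reference prediction by pattern matching: scan free text of
# a database entry for accession-like identifiers of other databases.
# The regexes are illustrative simplifications, not the IDPredictor rules.

import re

# Map a target database name to a (simplified) accession pattern.
PATTERNS = {
    "UniProt": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
    "PDB":     re.compile(r"\b[0-9][A-Za-z0-9]{3}\b"),
}

def predict_crossrefs(text):
    """Return candidate (database, accession) cross-references found in text."""
    hits = []
    for db, pattern in PATTERNS.items():
        for accession in pattern.findall(text):
            hits.append((db, accession))
    return hits

entry = "Similar to P12345; structure resolved as 1ABC."
print(predict_crossrefs(entry))  # finds a UniProt and a PDB candidate
```

A real predictor would additionally verify candidates against the referenced data targets, since pattern matches alone produce false positives.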
ABSTRACT:
In modern life-science research, efficient management of high-throughput primary lab data is very important. To realise such management, four main aspects have to be handled: (I) long-term storage, (II) security, (III) upload and (IV) retrieval.
In this paper we define central requirements for primary lab data management and discuss best practices for realising these requirements. As a proof of concept, we introduce a pipeline that has been implemented to manage primary lab data at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). It comprises: (I) a data storage implementation including a Hierarchical Storage Management system, a relational Oracle Database Management System and a BFiler package to store primary lab data and their meta-information, (II) a Virtual Private Database (VPD) implementation for the realisation of data security, and the LIMS Light application to (III) upload and (IV) retrieve stored primary lab data.
With the LIMS Light system we have developed a primary data management system that provides efficient storage with a Hierarchical Storage Management system and an Oracle relational database. With our VPD access control method we can guarantee the security of the stored primary data. Furthermore, the system provides high-performance upload and download and efficient retrieval of data.
Full-text · Article · Oct 2011 · BMC Research Notes
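The VPD approach mentioned above restricts every query with a mandatory per-user security predicate at the database level. As a rough analogue of that idea (not Oracle VPD itself; records, users and group names are invented), the following sketch filters every query result by the caller's group before any user-supplied predicate is applied.

```python
# Illustrative analogue of Virtual-Private-Database-style row-level security:
# every query is transparently restricted by a mandatory security predicate,
# so a user only ever sees rows of their own group. Data here is invented.

RECORDS = [
    {"id": 1, "group": "barley-lab", "file": "run_001.raw"},
    {"id": 2, "group": "wheat-lab",  "file": "run_002.raw"},
    {"id": 3, "group": "barley-lab", "file": "run_003.raw"},
]

USER_GROUPS = {"alice": "barley-lab", "bob": "wheat-lab"}

def query(user, predicate=lambda r: True):
    """Combine the caller's predicate AND the mandatory security predicate."""
    group = USER_GROUPS[user]
    return [r for r in RECORDS if r["group"] == group and predicate(r)]

print([r["id"] for r in query("alice")])  # -> [1, 3]
print([r["id"] for r in query("bob")])    # -> [2]
```

In Oracle's actual VPD, the equivalent of the security predicate is appended to the SQL statement by a server-side policy function, so it cannot be bypassed by client code.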
ABSTRACT: Efficient and effective information retrieval in the life sciences is one of the most pressing challenges in bioinformatics. The incredible growth of life-science databases into a vast network of interconnected information systems is to the same extent a big challenge and a great chance for life-science research. The knowledge found on the Web, in particular in life-science databases, is a valuable major resource. In order to bring it to the scientist's desktop, well-performing search engines are essential. Here, neither the response time nor the number of results is the most important factor; for millions of query results, the crucial factor is relevance ranking. In this paper, we present a feature model for relevance ranking in life-science databases and its implementation in the LAILAPS search engine. Motivated by observing how users inspect search engine results, we condensed a set of 9 relevance-discriminating features. These features are intuitively used by scientists who briefly screen database entries for potential relevance. The features are both sufficient to estimate potential relevance and efficiently quantifiable. The derivation of a relevance prediction function that computes relevance from these features constitutes a regression problem. To solve this problem, we used artificial neural networks trained with a reference set of relevant database entries for 19 protein queries. Supporting a flexible text index and a simple data import format, these concepts are implemented in the LAILAPS search engine. It can easily be used both as a search engine for comprehensive integrated life-science databases and for small in-house project databases. LAILAPS is publicly available for SWISSPROT data at http://lailaps.ipk-gatersleben.de.
ABSTRACT: Search engines and retrieval systems are popular tools on the life-science desktop. The manual inspection of hundreds of database entries that reflect a life-science concept or fact is time-intensive daily work. Here, it is not the number of query results that matters, but their relevance. In this paper, we present the LAILAPS search engine for life-science databases. The concept is to combine a novel feature model for relevance ranking, a machine-learning approach to model user relevance profiles, ranking improvement through user feedback tracking, and an intuitive, slim web user interface that estimates relevance rank by tracking user interactions. Queries are formulated as simple keyword lists and are expanded with synonyms. Supporting a flexible text index and a simple data import format, LAILAPS can easily be used both as a search engine for comprehensive integrated life-science databases and for small in-house project databases. From a set of features extracted from each database hit, in combination with user relevance preferences, a neural network predicts user-specific relevance scores. Using expert knowledge as training data for a predefined neural network, or using users' own relevance training sets, a reliable relevance ranking of database hits has been implemented. In this paper, we present the LAILAPS system, its concepts, benchmarks and use cases. LAILAPS is publicly available for SWISSPROT data at http://lailaps.ipk-gatersleben.de.
Full-text · Article · Jan 2010 · Journal of integrative bioinformatics
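The two abstracts above describe mapping per-hit feature vectors to relevance scores with a trained network. As a toy sketch of that idea only, the following reduces each hit to a tiny invented feature vector and scores it with a logistic unit standing in for the neural network; the features, weights and entry names are all fabricated for illustration and do not reflect the actual LAILAPS feature model.

```python
# Toy sketch of feature-based relevance ranking: each database hit becomes a
# numeric feature vector, and a learned function maps features to a relevance
# score used for sorting. A single logistic unit stands in for the neural
# network; features, weights and training provenance are invented.

import math

def score(features, weights, bias):
    """Relevance score in (0, 1) from a linear combination of features."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Invented features per hit: (keyword matches in title, matches in
# description, fraction of query terms covered).
hits = {
    "entry_A": (3, 5, 1.0),
    "entry_B": (0, 1, 0.3),
    "entry_C": (2, 0, 0.7),
}

weights, bias = (1.2, 0.4, 2.0), -3.0  # stand-in for learned parameters

ranked = sorted(hits, key=lambda h: score(hits[h], weights, bias), reverse=True)
print(ranked)  # -> ['entry_A', 'entry_C', 'entry_B']
```

The regression framing in the abstract replaces these fixed weights with parameters fitted to reference sets of relevant entries, and a multi-layer network with the per-user profiles described above replaces the single logistic unit.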
ABSTRACT: To advance the comprehension of complex biological processes occurring in crop plants (e.g. for improvement of growth or yield), it is of high interest to reconstruct and analyse detailed metabolic models. Therefore, we established a pipeline combining software tools for (1) storage of metabolic pathway data and reconstruction of crop plant metabolic models, (2) simulation and analysis of stoichiometric and kinetic models and (3) visualisation of data generated with these models. The applicability of the approach is demonstrated by a case study of cereal seed metabolism.