Matthias Lange

Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Saxony-Anhalt, Germany

Publications (32) · 17.21 Total Impact

  • ABSTRACT: Recent methodological developments in plant phenotyping, as well as the growing importance of its applications in plant science and breeding, are resulting in a fast accumulation of multidimensional data. There is great potential for expediting both discovery and application if these data are made publicly available for analysis. However, collection and storage of phenotypic observations is not yet sufficiently governed by standards that would ensure interoperability among data providers and precisely link specific phenotypes to associated genomic sequence information. This lack of standards is mainly a result of the large variability of phenotyping protocols, the multitude of phenotypic traits that are measured, and the dependence of these traits on the environment. This paper discusses the current state of standardization in the area of phenomics, points out problems and shortcomings, and identifies the areas that would benefit most from improvement. In addition, the foundations of work that could remedy the situation are proposed, and practical solutions developed by the authors are introduced.
    Journal of Experimental Botany 06/2015; 66(18):5417-5427. DOI:10.1093/jxb/erv271 · 5.53 Impact Factor

  • Datenbanksysteme für Business, Technologie und Web (BTW 2015) - Workshopband, 2.-3. März 2015, Hamburg, Germany; 01/2015
  • ABSTRACT: With the number of sequenced plant genomes growing, the number of predicted genes and functional annotations is also increasing. The association between genes and phenotypic traits is currently of great interest. Unfortunately, the information available today is widely scattered over a number of different databases. Information retrieval (IR) has become an all-encompassing bioinformatics methodology for extracting knowledge from complex, heterogeneous, and distributed databases, and can therefore be a useful tool for obtaining a comprehensive view of plant genomics, from genes to traits. Here we describe LAILAPS, an IR system designed to link plant genomic data in the context of phenotypic attributes for detailed forward genetic research. LAILAPS comprises around 65 million indexed documents, encompassing over 13 major life science databases with around 80 million links to plant genomic resources. The LAILAPS search engine allows fuzzy querying for candidate genes linked to specific traits over a loosely integrated system of indexed and interlinked genome databases. Query assistance and an evidence-based annotation system enable time-efficient and comprehensive information retrieval. An artificial neural network incorporating user feedback and behaviour tracking allows relevance sorting of results. We fully describe LAILAPS's functionality and capabilities by comparing this system's performance to other widely used systems and by reporting both a validation in maize and a knowledge-discovery use case focusing on candidate genes in barley.
    Plant and Cell Physiology 12/2014; 56(1). DOI:10.1093/pcp/pcu185 · 4.93 Impact Factor
  • ABSTRACT: Research in the life sciences faces increasing amounts of cross-domain data, also known as “big data”. This has notable effects on IT departments and the dry-lab desk alike. In this paper, we report on experiences from a decade of data management at a plant research institute. We explain the switch from personally managed files and heterogeneous information systems towards centrally organised storage management. In particular, we discuss lessons learned within the last decade of productive research, data generation and software development from the perspective of a modern plant research institute, and present the results of a strategic realignment of the data management infrastructure. Finally, we summarise the challenges that were solved and the questions that remain open.
    Data Integration in the Life Sciences; 10th International Workshop, DILS 2014, Lisbon, Portugal, 17-18 July 2014; 07/2014
  • ABSTRACT: Background: The life-science community faces a major challenge in handling “big data”, highlighting the need for high-quality infrastructures capable of sharing and publishing research data. Data preservation, analysis, and publication are the three pillars of the “big data life cycle”. The infrastructures currently available for managing and publishing data are often designed to meet domain-specific or project-specific requirements, resulting in the repeated development of proprietary solutions and lower-quality data publication and preservation overall. Results: e!DAL is a lightweight software framework for publishing and sharing research data. Its main features are version tracking, metadata management, information retrieval, registration of persistent identifiers (DOI), an embedded HTTP(S) server for public data access, access as a network file system, and a scalable storage backend. e!DAL is available as an API for local non-shared storage and as a remote API for distributed applications. It can be deployed “out of the box” as an on-site repository. Conclusions: e!DAL was developed based on experience gained from decades of research data management at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Initially developed as a data publication and documentation infrastructure for the IPK’s role as a data center in the DataCite consortium, e!DAL has grown into a general data archiving and publication infrastructure. The e!DAL software has been published to the Maven Central Repository; documentation and software are also available online.
    BMC Bioinformatics 06/2014; 15(1):214. DOI:10.1186/1471-2105-15-214 · 2.58 Impact Factor
  • ABSTRACT: Information retrieval (IR) plays a central role in the exploration and interpretation of integrated biological datasets that represent the heterogeneous ecosystem of the life sciences. Keyword-based query systems are popular user interfaces here; in turn, the query phrases used determine, to a large extent, the quality of the search results and the effort a scientist has to invest in query refinement. In this context, computer-aided query expansion and suggestion is one of the most challenging tasks for life science information systems. Existing query front-ends support aspects like spelling correction, query refinement or query expansion; however, the majority of them make only limited use of enhanced IR algorithms to implement comprehensive, computer-aided query refinement workflows. In this work, we present the design of a multi-stage query suggestion workflow and its implementation in the life science IR system LAILAPS. The presented workflow includes enhanced tokenisation, word breaking, spelling correction, query expansion and query suggestion ranking. A spelling-correction benchmark with 5,401 queries and manually selected use cases for query expansion demonstrate the performance of the implemented workflow and its advantages over state-of-the-art systems.
    Journal of Integrative Bioinformatics 06/2014; 11(2):237. DOI:10.2390/biecoll-jib-2014-237
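The multi-stage workflow described in the abstract above (tokenisation, word breaking, spelling correction, query expansion, suggestion ranking) can be sketched as a small pipeline. This is a minimal illustration only, not the LAILAPS implementation: the vocabulary, synonym table and ranking rule below are invented, and the real system uses far richer dictionaries and indices.

```python
from difflib import get_close_matches

# Toy vocabulary and synonym table -- invented for illustration only.
VOCAB = {"drought", "tolerance", "barley", "gene", "expression"}
SYNONYMS = {"drought": ["water deficit"], "gene": ["locus"]}

def tokenise(query):
    """Stage 1: split the raw query into lowercase tokens."""
    return query.lower().split()

def word_break(token):
    """Stage 2: split run-together words into two known vocabulary words."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in VOCAB and right in VOCAB:
            return [left, right]
    return [token]

def spell_correct(token):
    """Stage 3: map a misspelled token to its closest vocabulary entry."""
    if token in VOCAB:
        return token
    match = get_close_matches(token, sorted(VOCAB), n=1, cutoff=0.75)
    return match[0] if match else token

def expand(tokens):
    """Stage 4: append synonyms for each recognised token."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def suggest(query):
    """Run the full pipeline; stage 5 ranks known terms before unknown ones."""
    tokens = []
    for raw in tokenise(query):
        for part in word_break(raw):
            tokens.append(spell_correct(part))
    return sorted(expand(tokens), key=lambda t: t not in VOCAB)

print(suggest("drouhgt tolerence barley"))
```

A misspelled query such as "drouhgt tolerence barley" is corrected, expanded with the synonym "water deficit", and ranked, all in one pass.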

  • Approaches in Integrative Bioinformatics, edited by Ming Chen, Ralf Hofestädt, 01/2014: pages 73-109; Springer Berlin Heidelberg, ISBN: 978-3-642-41281-3
  • Article: Editorial
    Journal of Integrative Bioinformatics 01/2013; 10(1):226. DOI:10.2390/biecoll-jib-2013-226
  • INFORMATIK 2013 – Informatik angepasst an Mensch, Organisation und Umwelt, Lecture Notes in Informatics, edited by Matthias Horbach, 01/2013: pages 1834-1840; Gesellschaft für Informatik e.V. (GI), ISBN: 978-3-88579-614-5
  • ABSTRACT: Data-intensive sciences such as astrophysics, the social sciences, the life sciences and in particular bioinformatics are the driving forces towards the 'e-science' age. High-throughput technologies produce huge amounts of primary data. In the classic scientific publication process, primary data are usually aggregated into a number of paragraphs in a journal article and supported by figures or tables. In addition to such condensed knowledge, authors add links to externally managed supplemental material. However, the older an article is, the lower the chance that these links are still accessible [1]. In general, many researchers are willing to share primary data, but in fact only a few of them actually provide their data sets [2], and access is often restricted to project-associated users. Thus, the primary data on which scientific results are based are receiving increased attention, both publicly and in the research community. This leads to novel strategies for primary data citation, which must be substantively underpinned by enhancements to classic data management systems providing six general features, which are shown in Figure 1. These challenges in primary data management towards a consistent data publication process lead to the conclusion that a universal platform for primary data management is needed.
    Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on; 10/2012
  • Hendrik Mehlhorn · Matthias Lange · Uwe Scholz · Falk Schreiber
    ABSTRACT: Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats and data structure. To support the functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of the databases is essential. Search engines have the potential to assist in data retrieval from these structured sources, but fall short of providing comprehensive knowledge beyond the explicitly interlinked databases. A prerequisite for supporting the concept of an integrated data view is to acquire insight into the cross-references among database entities. This is hampered by the fact that only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and possible referenced data targets. We study the retrieval quality of our method and report first, promising results. The method is implemented as the tool IDPredictor, which is published under the DOI 10.5447/IPK/2012/4 and is freely available online.
    Journal of Integrative Bioinformatics 06/2012; 9(2):190. DOI:10.2390/biecoll-jib-2012-190
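The idea of predicting untagged cross-references can be illustrated by scanning an entry's free text for strings that match the accession grammar of other databases. This is a simplified sketch under invented patterns, not IDPredictor's actual method; real identifier grammars (for example the full UniProt accession format) are considerably more involved.

```python
import re

# Hypothetical, heavily simplified accession patterns for a few databases.
PATTERNS = {
    "uniprot": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
    "pdb": re.compile(r"\b[0-9][A-Za-z0-9]{3}\b"),
    "ec": re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"),
}

def predict_crossrefs(entry_text):
    """Scan an entry's text and predict links to other databases as
    (database, accession) pairs."""
    hits = []
    for db, pattern in PATTERNS.items():
        for accession in pattern.findall(entry_text):
            hits.append((db, accession))
    return hits

text = "Catalysed by EC 2.7.1.1; see also UniProt P12345."
print(predict_crossrefs(text))
```

A production system would additionally verify each candidate against the target database, which is where the retrieval-quality evaluation mentioned in the abstract comes in.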
  • ABSTRACT: In modern life science research, the efficient management of high-throughput primary lab data is very important. To realise such management, four main aspects have to be handled: (I) long-term storage, (II) security, (III) upload and (IV) retrieval. In this paper we define central requirements for primary lab data management and discuss aspects of best practice for realising them. As a proof of concept, we introduce a pipeline that has been implemented to manage primary lab data at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). It comprises: (I) a data storage implementation including a Hierarchical Storage Management system, a relational Oracle Database Management System and a BFiler package to store primary lab data and their meta-information, (II) a Virtual Private Database (VPD) implementation for data security, and the LIMS Light application to (III) upload and (IV) retrieve stored primary lab data. With the LIMS Light system we have developed a primary data management system that provides efficient storage backed by a Hierarchical Storage Management system and an Oracle relational database. With our VPD access control method we can guarantee the security of the stored primary data. Furthermore, the system provides high-performance upload and download and efficient retrieval of data.
    BMC Research Notes 10/2011; 4:413. DOI:10.1186/1756-0500-4-413

  • it - Information Technology 09/2011; 53:234-240. DOI:10.1524/itit.2011.0648
  • ABSTRACT: Efficient and effective information retrieval in the life sciences is one of the most pressing challenges in bioinformatics. The growth of life science databases into a vast network of interconnected information systems is, in equal measure, a major challenge and a great opportunity for life science research. The knowledge found on the Web, in particular in life-science databases, is a valuable resource, and bringing it to the scientist's desktop requires well-performing search engines. Here, neither the response time nor the number of results is most important: for millions of query results, the crucial factor is relevance ranking. In this paper, we present a feature model for relevance ranking in life science databases and its implementation in the LAILAPS search engine. Motivated by observing how users inspect search engine results, we condensed a set of nine relevance-discriminating features. These features are intuitively used by scientists who briefly screen database entries for potential relevance; they are both sufficient to estimate potential relevance and efficiently quantifiable. Deriving a relevance prediction function that computes relevance from these features constitutes a regression problem. To solve it, we used artificial neural networks trained with a reference set of relevant database entries for 19 protein queries. Supporting a flexible text index and a simple data import format, these concepts are implemented in the LAILAPS search engine, which can easily be used both as a search engine for comprehensive integrated life science databases and for small in-house project databases. LAILAPS is publicly available for SWISSPROT data.
    Journal of Integrative Bioinformatics 04/2010; 7(3). DOI:10.2390/biecoll-jib-2010-118
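The approach described above, quantifiable relevance features combined by a trained network, can be sketched as follows. The three features and the hand-picked weights here are illustrative assumptions: the paper condenses nine features, and in LAILAPS the weights are learned from expert-annotated reference queries and user feedback rather than fixed by hand.

```python
import math

# Invented example features; the paper condenses nine such
# relevance-discriminating features -- this sketch uses three.
def extract_features(query_terms, entry):
    title = entry["title"].lower()
    desc = entry["description"].lower()
    in_title = sum(t in title for t in query_terms) / len(query_terms)
    in_desc = sum(t in desc for t in query_terms) / len(query_terms)
    brevity = 1.0 / (1.0 + len(desc.split()) / 50.0)  # shorter entries score higher
    return [in_title, in_desc, brevity]

def relevance(features, weights, bias):
    """A single sigmoid neuron standing in for the trained network."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked weights for illustration; in the real system they are trained.
WEIGHTS, BIAS = [2.5, 1.5, 0.5], -1.0

def rank(query, entries):
    """Score every database hit and return titles ordered by relevance."""
    terms = query.lower().split()
    scored = [(relevance(extract_features(terms, e), WEIGHTS, BIAS), e["title"])
              for e in entries]
    return [title for score, title in sorted(scored, reverse=True)]

entries = [
    {"title": "Heat shock protein", "description": "stress response chaperone"},
    {"title": "Drought tolerance gene", "description": "drought stress in barley"},
]
print(rank("drought barley", entries))
```

Replacing the single neuron with a small multi-layer network turns the weight selection into exactly the regression problem the abstract describes.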
  • Matthias Lange · Stephan Weise · Uwe Scholz · Paul Verrier · Chris Rawlings
    ABSTRACT: Search engines and retrieval systems are popular tools at the life science desktop. The manual inspection of hundreds of database entries that reflect a life science concept or fact is time-intensive daily work, in which it is not the number of query results that matters but their relevance. In this paper, we present the LAILAPS search engine for life science databases. The concept is to combine a novel feature model for relevance ranking, a machine learning approach to model user relevance profiles, ranking improvement through user feedback tracking, and an intuitive, slim web user interface that estimates relevance rank by tracking user interactions. Queries are formulated as simple keyword lists and are expanded with synonyms. Supporting a flexible text index and a simple data import format, LAILAPS can easily be used both as a search engine for comprehensive integrated life science databases and for small in-house project databases. From a set of features extracted from each database hit, in combination with user relevance preferences, a neural network predicts user-specific relevance scores. Using expert knowledge as training data for a predefined neural network, or using users' own relevance training sets, a reliable relevance ranking of database hits has been implemented. In this paper, we present the LAILAPS system, its concepts, benchmarks and use cases. LAILAPS is publicly available for SWISSPROT data.
    Journal of Integrative Bioinformatics 01/2010; 7(2):110. DOI:10.2390/biecoll-jib-2010-110
  • ABSTRACT: To advance the comprehension of complex biological processes occurring in crop plants (e.g. for the improvement of growth or yield), it is of high interest to reconstruct and analyse detailed metabolic models. Therefore, we established a pipeline combining software tools for (1) storage of metabolic pathway data and reconstruction of crop plant metabolic models, (2) simulation and analysis of stoichiometric and kinetic models and (3) visualisation of data generated with these models. The applicability of the approach is demonstrated by a case study of cereal seed metabolism.
    Data Integration in the Life Sciences, 6th International Workshop, DILS 2009, Manchester, UK, July 20-22, 2009. Proceedings; 01/2009

  • Informatik 2009: Im Focus das Leben, Beiträge der 39. Jahrestagung der Gesellschaft für Informatik e.V. (GI), 28.9.-2.10.2009, Lübeck, Proceedings; 01/2009
  • Matthias Lange · Axel Himmelbach · Patrick Schweizer · Uwe Scholz
    ABSTRACT: To support the interpretation of measured molecular facts, such as gene expression experiments or EST sequencing, the functional or systems-biological context has to be considered, and the relationship to existing biological knowledge has to be discovered. In general, biological knowledge is represented worldwide in a network of databases. In this paper we present a method for knowledge extraction from life science databases that spares scientists screen-scraping and web-clicking approaches. We developed a method for extracting knowledge networks from distributed, heterogeneous life science databases. To cope with the very large data volume, the method is based on the concept of data linkage graphs (DLG). We present efficient software that enables the joining of millions of data points over hundreds of databases. To motivate possible applications, we computed networks of protein knowledge that interconnect metabolic, disease, enzyme and gene function data. The computed networks enable a holistic view of the relationships between measured experimental facts and the combined biological knowledge. This was successfully applied to a high-throughput functional classification of barley ESTs and gene expression experiments, with the perspective of an automated pipeline for the provisioning of controlled annotation of plant gene arrays and chips. Availability: The data linkage graphs (XML or TGF format), the integrated database schema (GML or GraphML) and the graph computation software may be downloaded online.
    Journal of Integrative Bioinformatics 01/2007; 4. DOI:10.2390/biecoll-jib-2007-68
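A data linkage graph join of this kind can be illustrated with a toy adjacency structure and a breadth-first traversal. All identifiers below are fabricated for illustration; the software described in the abstract scales the same idea to millions of data points over hundreds of databases.

```python
from collections import deque

# A toy data linkage graph (DLG): nodes are database entries, edges are
# cross-references. Every identifier here is fabricated.
DLG = {
    "est:HB001": ["uniprot:P100"],
    "uniprot:P100": ["ec:2.7.1.1", "gene:GLK1"],
    "ec:2.7.1.1": ["pathway:glycolysis"],
    "gene:GLK1": [],
    "pathway:glycolysis": [],
}

def linked_annotations(start, dlg):
    """Breadth-first walk over the DLG, collecting every entry that is
    transitively linked to the start entry."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in dlg.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return sorted(seen)

print(linked_annotations("est:HB001", DLG))
```

Starting from a single EST entry, the walk reaches enzyme, gene function and pathway annotations through the intermediate protein record, which is the holistic view of interconnected knowledge the abstract describes.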

Publication Stats

107 Citations
17.21 Total Impact Points


  • 2005-2014
    • Leibniz Institute of Plant Genetics and Crop Plant Research
      • Department of Molecular Genetics
      • Research Group Bioinformatics and Information Technology
      Gatersleben, Saxony-Anhalt, Germany
  • 2002
    • Bielefeld University
      • Bioinformatics and Medical Informatics
      Bielefeld, North Rhine-Westphalia, Germany