ABSTRACT: Bioinformatics data is growing at a phenomenal rate. Besides the exponential growth of individual databases, the number of data depositories is increasing too. Because of the complexity of biological concepts, bioinformatics data usually has complex data structures and cannot be easily captured with the relational model. As a result, various flat-file formats have been used. Although easy for humans to interpret, flat-file formats lack standards and are hard to recognize automatically, so manually written parsers are widely used to extract data from them. This has limited the readiness of the data for data-consuming programs, such as integration systems. This paper presents a data mining based approach for automatically assigning schema labels to the attributes in a flat-file biological dataset. In conjunction with our prior work on semi-automatically identifying the delimiters and automatically generating parsers, automatic schema labeling offers a novel and practical solution for integrating biological datasets on the fly. Our approach for schema labeling is based on unsupervised learning, and represents each attribute by the data values that occur most frequently in it. We combine the use of a biological ontology with heuristics. We are able to deal with noise in the datasets by using cutoff functions. Detailed experimental results from three datasets demonstrate the effectiveness of the use of data mining for biological applications.
Sixth IEEE International Symposium on BioInformatics and BioEngineering (BIBE 2006), 16-18 October 2006, Arlington, Virginia, USA; 01/2006
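The labeling idea in the abstract above (represent an attribute by its most frequent values, then apply a cutoff to reject noise) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the Jaccard similarity score, the `reference` vocabulary standing in for a biological ontology, and the cutoff value are all assumptions made for illustration.

```python
from collections import Counter

def top_values(values, k=3):
    """Feature representation: the k most frequent data values of an attribute."""
    return {value for value, _ in Counter(values).most_common(k)}

def label_attribute(values, reference, k=3, cutoff=0.3):
    """Return the best-matching label, or None if no score reaches the cutoff.

    `reference` maps a candidate label to a set of known values; it is a
    hypothetical stand-in for terms drawn from an ontology.
    """
    features = top_values(values, k)
    best_label, best_score = None, 0.0
    for label, known in reference.items():
        known = set(known)
        union = features | known
        # Jaccard similarity is an assumed scoring choice, not the paper's.
        score = len(features & known) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    # The cutoff discards weak matches caused by noisy attributes.
    return best_label if best_score >= cutoff else None

# Hypothetical reference vocabulary and attribute column.
reference = {
    "organism": {"Homo sapiens", "Mus musculus", "Rattus norvegicus"},
    "keyword": {"kinase", "membrane"},
}
organism_column = ["Homo sapiens"] * 3 + ["Mus musculus"] * 2 + ["unknown"]

print(label_attribute(organism_column, reference))
print(label_attribute(["foo", "bar"], reference))
```

In this sketch an attribute whose frequent values overlap no reference set stays unlabeled rather than receiving a forced guess, which is one simple way a cutoff can absorb noise.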
ABSTRACT: With a growing trend towards grid-based data repositories and data analysis services, scientific data analysis often involves accessing multiple data sources and analyzing the data using a variety of analysis programs. One critical challenge, however, is that data sources often hold the same type of data in a number of different formats, and the formats expected and generated by various data analysis services are often distinct as well. We believe that the traditional approach for dealing with this problem, which is using hand-written wrappers, is not an effective and scalable solution for a grid environment. This paper presents a new approach, which involves generating wrappers automatically to enable grid-based information integration and workflows. In this approach, a layout descriptor is used to describe the data format of each data source, as well as the input and output format of each tool or service. Efficient wrappers are then generated automatically for translation between any two data formats. Our design separates the wrapper generation service from wrapper execution. The wrapper generation service analyzes the layout descriptors and generates a WRAPINFO data structure. The wrapper comprises a set of application-independent modules which take the WRAPINFO data structure as input. We demonstrate our wrapper generation tool with two real case studies. Besides showing the effectiveness of our system, the experimental results from these two case studies show that the wrapper generation overhead is very small, that automatically generated wrappers scale well to large datasets, and that, for the one case where this comparison was possible, the execution time of our wrapper was within 30% of that of a hand-written one.
Grid Computing, 2005. The 6th IEEE/ACM International Workshop on; 12/2005
ABSTRACT: One of the major problems in biological data integration is that many data sources are stored as flat files, with a variety of different layouts. Integrating data from such sources can be an extremely time-consuming task. We have been developing data mining techniques to help learn the layout of a dataset in a semi-automatic way. In this paper, we focus on the problem of identifying delimiters for optional fields. Since these fields do not occur in every record, frequency-based methods are not able to identify the corresponding delimiters. We present a method which uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three flat-file biological datasets.
Bioinformatics and Bioengineering, 2005. BIBE 2005. Fifth IEEE Symposium on; 11/2005
ABSTRACT: A key challenge associated with the existing approaches for data integration and workflow creation for bioinformatics is the
effort required to integrate a new data source. As new data sources emerge, and data formats and contents of existing data
sources evolve, wrapper programs need to be written or modified. This can be extremely time consuming, tedious, and error-prone.
This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach
involves three key steps. The first step is to use a number of heuristics to infer the delimiters used in the program. Specifically,
we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric,
we are able to find a superset of delimiters, and then we can seek user input to eliminate the incorrect ones. Our second
step involves generating a layout descriptor based on the relative order in which the delimiters occur. Our final step is
to generate a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three
popular flat-file biological datasets.
Data Integration in the Life Sciences, Second International Workshop, DILS 2005, San Diego, CA, USA, July 20-22, 2005, Proceedings; 01/2005
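The first step described in the abstract above, inferring a superset of delimiters from the frequency and starting position of sequences, can be sketched as follows. The abstract does not specify the metric, so the scoring used here (the fraction of lines whose first column-0 token is a given sequence), the function name `candidate_delimiters`, and the `min_fraction` threshold are assumptions for illustration only. The sample records are made up, loosely in the style of a line-coded flat file.

```python
from collections import Counter

def candidate_delimiters(lines, min_fraction=0.05):
    """Return tokens that begin a sufficiently large share of lines.

    Only tokens starting in column 0 are counted, so indented continuation
    lines (for example, sequence data) do not contribute candidates.
    """
    starts = [
        line.split(None, 1)[0]
        for line in lines
        if line and not line[0].isspace()
    ]
    total = len(starts)
    counts = Counter(starts)
    # Assumed metric: relative frequency of the token at the line start.
    return sorted(
        token for token, n in counts.items()
        if total and n / total >= min_fraction
    )

# Hypothetical two-record flat file with one noisy line ("XX").
sample = [
    "ID   A1",
    "DE   protein one",
    "SQ   SEQUENCE",
    "     MKV",
    "//",
    "XX   rare noise line",
    "ID   A2",
    "DE   protein two",
    "SQ   SEQUENCE",
    "     MAA",
    "//",
]

print(candidate_delimiters(sample, min_fraction=0.2))
```

A deliberately low threshold returns a superset of candidates, matching the abstract's description that a user is then asked to eliminate the incorrect ones.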