ABSTRACT: As an essential part of in silico analysis, biological database querying has become increasingly important in the research process. Several challenges are specific to bioinformatics applications: data heterogeneity, large data volumes with exponential growth, and the constant appearance of new data types and formats. We have developed an integration system that processes data directly in its flat-file format, which reduces overhead and programming effort. In this paper, we discuss the use of indexing techniques on top of this flat-file query system. Besides the advantage of processing flat files directly, the system improves its performance and functionality through indexes. Experiments based on real-life queries are used to test the integration system.
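The indexing idea described above can be sketched as follows. This is a minimal, hypothetical example, not the paper's actual system: the `//` record delimiter and `ID ` key prefix are assumptions, loosely modeled on EMBL-style flat files.

```python
import io

def build_index(f, record_delim="//", key_prefix="ID "):
    """Map each record's key to the byte offset where its record starts,
    so queries can seek directly to a record instead of scanning the file."""
    index = {}
    offset = 0
    record_start = 0
    current_key = None
    for line in f:
        if line.startswith(key_prefix):       # key line opens a record
            current_key = line[len(key_prefix):].strip()
            record_start = offset
        if line.rstrip("\n") == record_delim and current_key is not None:
            index[current_key] = record_start
            current_key = None
        offset += len(line)
    return index

def fetch_record(f, index, key, record_delim="//"):
    """Seek to the indexed offset and read a single record."""
    f.seek(index[key])
    lines = []
    for line in f:
        if line.rstrip("\n") == record_delim:
            break
        lines.append(line)
    return "".join(lines)

data = "ID A1\nSQ MKTV\n//\nID B2\nSQ GGHH\n//\n"
f = io.StringIO(data)
index = build_index(f)
print(fetch_record(f, index, "B2"))  # reads only the B2 record
```

The index is built in a single sequential pass, so lookups afterwards cost one seek plus one record read rather than a full scan.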
ABSTRACT: Traditionally, biologists focused on a single research subject. New high-throughput experimental and analytical technologies, such as microarrays and the BLAST programs, have changed this. An important functionality required now is the ability to process queries over many data entries with little user intervention. This paper presents the design, implementation, and evaluation of a data integration tool that supports database-like query operations across flat-file biological datasets. Compared with existing solutions, our system has several advantages: no database management system is required, users can still use a declarative language to communicate with the system, and no data parsing, loading, or indexing utility programs need to be written. We have used the system on three biological queries, each inspired by an actual study from the bioinformatics research literature. These case studies demonstrate the functionality and scalability of our tool. Overall, our approach provides a lightweight and scalable solution for data integration over flat-file datasets.
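One kind of database-like operation over flat-file data can be illustrated with a hash join evaluated directly on parsed records, with no DBMS involved. This is an illustrative sketch under invented field names, not the tool's actual query engine:

```python
def hash_join(records_a, records_b, key_a, key_b):
    """Join two collections of parsed flat-file records on a key field:
    build a hash index on the right-hand side, then probe it with each
    left-hand record, yielding merged records."""
    index = {}
    for r in records_b:
        index.setdefault(r[key_b], []).append(r)
    for r in records_a:
        for match in index.get(r[key_a], []):
            merged = dict(r)
            merged.update(match)
            yield merged

# Hypothetical records parsed from two flat-file datasets.
genes = [{"gene": "BRCA1", "chrom": "17"}, {"gene": "TP53", "chrom": "17"}]
hits = [{"query": "BRCA1", "evalue": 1e-30}]
print(list(hash_join(genes, hits, "gene", "query")))
```

A declarative front end would translate a user's query into a plan built from operators like this one, so the user never writes parsing or joining code by hand.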
ABSTRACT: Bioinformatics data is growing at a phenomenal rate. Besides the exponential growth of individual databases, the number of data depositories is increasing too. Because of the complexity of biological concepts, bioinformatics data usually has complex structures that cannot be easily captured with the relational model. As a result, various flat-file formats have been used. Although easy for human interpretation, flat-file formats lack standards and are hard to recognize automatically, so manually written parsers are widely used to extract data from them. This has limited the readiness of the data for data-consuming programs, such as integration systems. This paper presents a data-mining-based approach for automatically assigning schema labels to the attributes in a flat-file biological dataset. In conjunction with our prior work on semi-automatically identifying delimiters and automatically generating parsers, automatic schema labeling offers a novel and practical solution for integrating biological datasets on-the-fly. Our approach for schema labeling is based on unsupervised learning and uses a feature representation of an attribute built from its most frequently occurring data values. We combine the use of a biological ontology with heuristics, and we deal with noise in the datasets by using cutoff functions. Detailed experimental results from three datasets demonstrate the effectiveness of using data mining for biological applications.
ABSTRACT: With a growing trend towards grid-based data repositories and data analysis services, scientific data analysis often involves accessing multiple data sources and analyzing the data with a variety of analysis programs. One critical challenge is that data sources often hold the same type of data in a number of different formats, and the formats expected and generated by various data analysis services are often distinct as well. We believe that the traditional approach to this problem, hand-written wrappers, is not an effective and scalable solution for a grid environment. This paper presents a new approach that generates wrappers automatically to enable grid-based information integration and workflows. In this approach, a layout descriptor describes the data format of each data source, as well as the input and output format of each tool or service. Efficient wrappers are then generated automatically for translation between any two data formats. Our design separates the wrapper generation service from wrapper execution: the generation service analyzes the layout descriptors and produces a WRAPINFO data structure, and the wrapper itself comprises a set of application-independent modules that take this data structure as input. We demonstrate our wrapper generation tool with two real case studies. Besides showing the effectiveness of our system, the experimental results show that the wrapper generation overhead is very small, automatically generated wrappers scale well to large datasets, and, for the one case where the comparison was possible, the execution time of our wrapper was within 30% of that of a hand-written one.
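The translation idea can be sketched with a deliberately simplified layout descriptor: a mapping from a line-prefix delimiter to a logical field name. A "generated wrapper" then parses a record with the source descriptor and re-emits it with the target descriptor's delimiters and order. The paper's WRAPINFO structure is much richer; everything here (descriptor shape, delimiters, field names) is an assumption for illustration.

```python
def make_wrapper(src_layout, dst_layout):
    """Generate a wrapper that translates a record between two flat-file
    formats, each described by a {delimiter: field_name} layout descriptor.
    Fields are matched by logical name, so delimiters and order may differ."""
    def wrap(record):
        fields = {}
        for line in record.splitlines():
            for delim, name in src_layout.items():
                if line.startswith(delim):
                    fields[name] = line[len(delim):].strip()
        return "\n".join(
            f"{delim} {fields[name]}"
            for delim, name in dst_layout.items()
            if name in fields
        )
    return wrap

src = {"ID": "accession", "SQ": "sequence"}   # hypothetical source format
dst = {"AC": "accession", "SEQ": "sequence"}  # hypothetical target format
wrap = make_wrapper(src, dst)
print(wrap("ID P12345\nSQ MKTAYIAK"))
```

Separating descriptor analysis (`make_wrapper`) from execution (`wrap`) mirrors, in miniature, the paper's split between the wrapper generation service and the application-independent execution modules.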
ABSTRACT: One of the major problems in biological data integration is that many data sources are stored as flat files, with a variety of different layouts. Integrating data from such sources can be an extremely time-consuming task. We have been developing data mining techniques to help learn the layout of a dataset in a semi-automatic way. In this paper, we focus on the problem of identifying delimiters for optional fields. Since these fields do not occur in every record, frequency-based methods are not able to identify the corresponding delimiters. We present a method that uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three flat-file biological datasets.
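The contrast idea can be sketched as follows: a token is flagged as an optional-field delimiter when it almost always occurs at the start of a line (delimiter-like positional behavior) but is absent from some records, which is exactly the case a pure per-record frequency threshold would miss. The thresholds and whitespace tokenization are assumptions for illustration, not the paper's actual metric.

```python
from collections import Counter

def optional_delimiters(records, min_start_ratio=0.95, max_coverage=0.99):
    """Contrast analysis sketch: contrast a token's positional consistency
    (fraction of its occurrences at line start) against its record
    coverage (fraction of records containing it)."""
    start_counts, total_counts = Counter(), Counter()
    record_presence = Counter()
    for rec in records:
        seen = set()
        for line in rec.splitlines():
            for i, tok in enumerate(line.split()):
                total_counts[tok] += 1
                if i == 0:                    # token opens the line
                    start_counts[tok] += 1
                    seen.add(tok)
        record_presence.update(seen)
    n = len(records)
    return {
        tok for tok in start_counts
        if start_counts[tok] / total_counts[tok] >= min_start_ratio
        and record_presence[tok] / n < max_coverage   # not in every record
    }

records = [
    "ID a1\nKW kinase\nSQ MKT",   # hypothetical records; KW is optional
    "ID b2\nSQ GHH",
    "ID c3\nKW ligase\nSQ PQR",
]
print(optional_delimiters(records))
```

Mandatory delimiters like `ID` and `SQ` are filtered out by the coverage test (they appear in every record), while ordinary data values fail the positional test, leaving only the optional-field delimiter.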
ABSTRACT: A key challenge associated with the existing approaches for data integration and workflow creation for bioinformatics is the
effort required to integrate a new data source. As new data sources emerge, and data formats and contents of existing data
sources evolve, wrapper programs need to be written or modified. This can be extremely time consuming, tedious, and error-prone.
This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach
involves three key steps. The first step is to use a number of heuristics to infer the delimiters used in the dataset. Specifically,
we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric,
we are able to find a superset of delimiters, and then we can seek user input to eliminate the incorrect ones. Our second
step involves generating a layout descriptor based on the relative order in which the delimiters occur. Our final step is
to generate a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three
popular flat-file biological datasets.
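The third step, generating a parser from a layout descriptor, can be sketched as follows. Here the descriptor is reduced to just the delimiters in the relative order they occur within one record; the delimiters and record shown are invented, and the real generated parsers also handle optional and repeated fields, which this regex does not.

```python
import re

def make_parser(layout):
    """Generate a parser from a layout descriptor: a list of delimiters in
    the relative order they occur within a record. Each field is the text
    between one delimiter and the next."""
    pattern = "".join(
        re.escape(d) + f"(?P<f{i}>.*?)" for i, d in enumerate(layout)
    ) + "$"
    rx = re.compile(pattern, re.S)
    def parse(record):
        m = rx.match(record)
        if m is None:
            raise ValueError("record does not match layout")
        return {d: m.group(f"f{i}").strip() for i, d in enumerate(layout)}
    return parse

parse = make_parser(["ID", "DE", "SQ"])  # hypothetical delimiter order
print(parse("ID P12345\nDE Example protein\nSQ MKT"))
```

Because the parser is derived entirely from the descriptor, a corrected or updated layout immediately yields a corrected parser, with no hand-written wrapper code to modify.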
ABSTRACT: As scientific simulations generate large amounts of data, analyzing this data to gain insights into scientific phenomena is increasingly becoming a challenge. We present a case study on the use of a cluster middleware for rapidly creating a scalable and parallel implementation of a scientific data analysis application. Using FREERIDE (framework for rapid implementation of data mining engines), we parallelize a feature extraction algorithm and scale it to disk-resident datasets. We have developed a parallel algorithm for this problem that matches the communication and computation structure supported by the FREERIDE system. The main observations from our experimental results are as follows: 1) the overhead of using the middleware is quite small in most cases, 2) there is an overhead associated with breaking the datasets into more partitions or chunks, and 3) if the dataset is partitioned into the same number of chunks, the execution time stays proportional to the size of the dataset and inversely proportional to the number of nodes, i.e., the overhead of communication and of reading disk-resident datasets is very small.
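The processing structure this abstract relies on can be sketched as a generalized reduction over chunks: each chunk of a disk-resident dataset is reduced to a local result independently, and the local results are folded into a global one. On a cluster the per-chunk steps run in parallel across nodes; the sequential Python below only illustrates the structure, and the toy "feature" is a stand-in, not the paper's actual feature extraction algorithm or the real FREERIDE (C++) API.

```python
def generalized_reduction(chunks, local_reduce, combine, initial):
    """Simplified, sequential sketch of FREERIDE-style processing:
    reduce each chunk to a local result, then fold local results
    into a global one. Correctness requires `combine` to be
    order-insensitive, which is what allows parallel execution."""
    global_result = initial
    for chunk in chunks:
        global_result = combine(global_result, local_reduce(chunk))
    return global_result

# Toy "feature extraction": the global maximum across chunked data.
chunks = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
result = generalized_reduction(chunks, max, max, float("-inf"))
print(result)  # 9
```

Observation 3) in the abstract follows from this structure: with a fixed number of chunks, per-chunk work scales with dataset size and divides across nodes, while the combine step stays cheap.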