Xuan Zhang

The Ohio State University, Columbus, OH, USA

Publications (6)

  • Conference Proceeding: Supporting high performance bioinformatics flat-file data processing using indices
    Xuan Zhang, G. Agrawal
    ABSTRACT: As an essential part of in vitro analysis, biological database query has become increasingly important in the research process. Challenges specific to bioinformatics applications include data heterogeneity, large data volume and exponential data growth, and the constant appearance of new data types and data formats. We have developed an integration system that processes data in their flat-file formats. Its advantages include reduced overhead and programming effort. In this paper, we discuss the use of indexing techniques on top of this flat-file query system. Besides the advantage of processing flat files directly, the system also improves its performance and functionality by using indexes. Experiments based on real-life queries are used to test the integration system.
    Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on; 05/2008
  • Conference Proceeding: A Tool for Supporting Integration Across Multiple Flat-File Datasets
    Xuan Zhang, G. Agrawal
    ABSTRACT: Traditionally, biologists focused on a single research subject. New high-throughput experimental and analytical technologies, such as microarray and BLAST programs, have changed this. An important functionality required now is the ability to process queries about multiple data entries with little user intervention. This paper presents the design, implementation, and evaluation of a data integration tool that supports database-like query operations across flat-file biological datasets. Compared with existing solutions, our system has several advantages: no database management system is required, users can still use declarative languages to communicate with the system, and no data parsing, loading, or indexing utility programs need to be written. We have used the system on three biological queries, each of which was inspired by an actual study from the bioinformatics research literature. These case studies have demonstrated the functionality and scalability of our tool. Overall, our approach provides a lightweight and scalable solution for data integration over flat-file datasets.
    BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on; 11/2006
  • Conference Proceeding: Assigning Schema Labels Using Ontology And Heuristics.
    Xuan Zhang, Ruoming Jin, Gagan Agrawal
    Sixth IEEE International Symposium on BioInformatics and BioEngineering (BIBE 2006), 16-18 October 2006, Arlington, Virginia, USA; 01/2006
  • Conference Proceeding: Enabling information integration and workflows in a grid environment with automatic wrapper generation
    Xuan Zhang, G. Agrawal
    ABSTRACT: With a growing trend towards grid-based data repositories and data analysis services, scientific data analysis often involves accessing multiple data sources and analyzing the data using a variety of analysis programs. One critical challenge, however, is that data sources often hold the same type of data in a number of different formats, and the formats expected and generated by various data analysis services are often distinct. We believe that the traditional approach to this problem, which is to use hand-written wrappers, is not an effective and scalable solution for a grid environment. This paper presents a new approach, which involves generating wrappers automatically to enable grid-based information integration and workflows. In this approach, a layout descriptor is used to describe the data format of each data source, as well as the input and output format of each tool or service. Efficient wrappers are then generated automatically for translation between any two data formats. Our design separates the wrapper generation service from wrapper execution. The wrapper generation service analyzes the layout descriptors and generates a WRAPINFO data structure. The wrapper comprises a set of application-independent modules that take the WRAPINFO data structure as input. We demonstrate our wrapper generation tool with two real case studies. Besides showing the effectiveness of our system, the experimental results from these two case studies show that the wrapper generation overhead is very small, that automatically generated wrappers scale well to large datasets, and that, for the one case where this comparison was possible, the execution time of our wrapper was within 30% of that of a hand-written one.
    Grid Computing, 2005. The 6th IEEE/ACM International Workshop on; 12/2005
  • Conference Proceeding: Using data mining techniques to learn layouts of flat-file biological datasets
    ABSTRACT: One of the major problems in biological data integration is that many data sources are stored as flat files, with a variety of different layouts. Integrating data from such sources can be an extremely time-consuming task. We have been developing data mining techniques to help learn the layout of a dataset in a semi-automatic way. In this paper, we focus on the problem of identifying delimiters for optional fields. Since these fields do not occur in every record, frequency-based methods are not able to identify the corresponding delimiters. We present a method that uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three flat-file biological datasets.
    Bioinformatics and Bioengineering, 2005. BIBE 2005. Fifth IEEE Symposium on; 11/2005
  • Conference Proceeding: Learning Layouts of Biological Datasets Semi-automatically.
    Data Integration in the Life Sciences, Second International Workshop, DILS 2005, San Diego, CA, USA, July 20-22, 2005, Proceedings; 01/2005

Institutions

  • 2005–2008
    • The Ohio State University
      • Department of Computer Science and Engineering
      Columbus, OH, USA