Yi Zhang

Rutgers, The State University of New Jersey, New Brunswick, New Jersey, United States

Publications (8) · 2.3 Total impact

  • Source
    ABSTRACT: In this paper we investigate the inverse protein folding (IPF) problem under the Canonical model on 3D and 2D lattices (W.E. Hart, On the computational complexity of sequence design problems, Proceedings of the First Annual International Conference on Computational Molecular Biology 1997, pp. 128-136; E.I. Shakhnovich, A.M. Gutin, Engineering of stable and fast-folding sequences of model proteins, Proc. Natl. Acad. Sci. 90 (1993) 7195-7199). In this problem, we are given a contact graph G = (V, E) of a protein sequence that is embeddable in a 3D (respectively, 2D) lattice and an integer 1 ≤ K ≤ |V|. The goal is to find an induced subgraph of G of at most K vertices with the maximum number of edges. In this paper, we prove the following results:
    Discrete Applied Mathematics 01/2007; 155:719-732. · 0.72 Impact Factor
  • Source
    ABSTRACT: A useful approach to the mathematical analysis of large-scale biological networks is based upon their decompositions into monotone dynamical systems. This paper deals with two computational problems associated to finding decompositions which are optimal in an appropriate sense. In graph-theoretic language, the problems can be recast in terms of maximal sign-consistent subgraphs. The theoretical results include polynomial-time approximation algorithms as well as constant-ratio inapproximability results. One of the algorithms, which has a worst-case guarantee of 87.9% from optimality, is based on the semidefinite programming relaxation approach of Goemans-Williamson (23). The algorithm was implemented and tested on a Drosophila segmentation network and an Epidermal Growth Factor Receptor pathway model, and it was found to perform close to optimally.
    Biosystems. 01/2007; 90:161-178.
  • Source
    Yi Zhang, Bing Liu
    ABSTRACT: Traditional text classification studied in the information retrieval and machine learning literature is mainly based on topics. That is, each class represents a particular topic, e.g., sports and politics. However, many real-world problems require more refined classification based on some semantic perspectives. For example, in a set of sentences about a disease, some may report outbreaks of the disease, some may describe how to cure the disease, and yet some may discuss how to prevent the disease. To classify sentences at this semantic level, the traditional bag-of-words model is no longer sufficient. In this paper, we study semantic sentence classification of disease reporting. We show that both keywords and sentence semantic features are useful. Our results demonstrated that this integrated approach is highly effective.
    Knowledge Discovery in Databases: PKDD 2007, 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 17-21, 2007, Proceedings; 01/2007
  • Yi Zhang, Bing Liu
    ABSTRACT: Traditional text classification studied in the IR literature is mainly based on topics. That is, each class or category represents a particular topic, e.g., sports, politics or sciences. However, many real-world text classification problems require more refined classification based on some semantic aspects. For example, in a set of documents about a particular disease, some documents may report the outbreak of the disease, some may describe how to cure the disease, some may discuss how to prevent the disease, and yet some others may include all the above information. To classify text at this semantic level, the traditional "bag of words" model is no longer sufficient. In this paper, we report a text classification study at the semantic level and show that sentence semantic and structure features are very useful for such kind of classification. Our experimental results based on a disease outbreak dataset demonstrated the effectiveness of the proposed approach.
    SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007; 01/2007
  • Source
    Experimental Algorithms, 5th International Workshop, WEA 2006, Cala Galdana, Menorca, Spain, May 24-27, 2006, Proceedings; 01/2006
  • Source
    ABSTRACT: A useful approach to the mathematical analysis of large-scale biological networks is based upon their decompositions into monotone dynamical systems. This paper deals with two computational problems associated to finding decompositions which are optimal in an appropriate sense. In graph-theoretic language, the problems can be recast in terms of maximal sign-consistent subgraphs. The theoretical results include polynomial-time approximation algorithms as well as constant-ratio inapproximability results. One of the algorithms, which has a worst-case guarantee of 87.9% from optimality, is based on the semidefinite programming relaxation approach of Goemans-Williamson [Goemans, M., Williamson, D., 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42 (6), 1115-1145]. The algorithm was implemented and tested on a Drosophila segmentation network and an Epidermal Growth Factor Receptor pathway model, and it was found to perform close to optimally.
    Biosystems 09/2005; 90(1):161-78. · 1.58 Impact Factor
  • Source
    ABSTRACT: In this paper we investigate the protein sequence design (PSD) problem (also known as the inverse protein folding problem) under the Canonical model [footnote 4] on 2D and 3D lattices (12, 25). The Canonical model is specified by (i) a geometric representation of a target protein structure with amino acid residues via its contact graph, (ii) a binary folding code in which the amino acids are classified as hydrophobic (H) or polar (P), (iii) an energy function Φ defined in terms of the target structure that should favor sequences with a dense hydrophobic core and penalize those with many solvent-exposed hydrophobic residues (in the Canonical model, the energy function Φ gives an H-H residue contact in the contact graph a value of -1 and all other contacts a value of 0), and (iv) to prevent the solution from being a biologically meaningless all-H sequence, the number of H residues in the sequence S is limited by fixing an upper bound λ on the ratio between H and P amino acids. The sequence S is designed by specifying which residues are H and which ones are P in a way that realizes the global minima of the energy function Φ. In this paper, we prove the following results: (1) An earlier proof of NP-completeness of finding the global energy minima for the PSD problem on 3D lattices in (12) was based on the NP-completeness of the same problem on 2D lattices. However, the reduction was not correct and we show that the problem of finding the global energy minima for the PSD problem for 2D lattices can be solved efficiently in polynomial time. But, we show that the problem of finding the global energy minima for the PSD problem on 3D lattices is indeed NP-complete by providing a different reduction from the problem of finding the largest clique on graphs. (2) Even though the problem of finding the global energy minima on 3D lattices is NP-complete, we show that an arbitrarily close approximation to the global energy minima can indeed be found efficiently by taking
    Footnote 4: The Canonical model is neither the same nor a subset of the Grand Canonical (GC)
    Combinatorial Pattern Matching, 15th Annual Symposium, CPM 2004, Istanbul, Turkey, July 5-7, 2004, Proceedings; 01/2004
  • Source
    Yi Zhang, Bing Liu
    ABSTRACT: This paper studies the problem of extracting locations of disease outbreaks from news articles. A novel technique is proposed based on two types of supervised sequential rule mining, i.e., label sequential rules and class sequential rules. The two types of rules are also combined for extraction. In learning, instead of using sentences as the training sequences, paths from dependency trees are employed for the purpose. Our experimental results based on a large number of health news from Google News and reports from ProMED-mail show that the proposed technique works effectively. It outperforms the state-of-the-art information extraction technique conditional random field dramatically.

Publication Stats

60 Citations
2.30 Total Impact Points

Institutions

  • 2007
    • Rutgers, The State University of New Jersey
      • Department of Mathematics
      New Brunswick, New Jersey, United States
  • 2005
    • University of Illinois at Chicago
      • Department of Computer Science
      Chicago, IL, United States