Publications (8), 2.35 Total impact
ABSTRACT: In this paper we investigate the inverse protein folding (IPF) problem under the Canonical model on 3D and 2D lattices (W.E. Hart, On the computational complexity of sequence design problems, Proceedings of the First Annual International Conference on Computational Molecular Biology 1997, pp. 128-136; E.I. Shakhnovich, A.M. Gutin, Engineering of stable and fast-folding sequences of model proteins, Proc. Natl. Acad. Sci. 90 (1993) 7195-7199). In this problem, we are given a contact graph G = (V, E) of a protein sequence that is embeddable in a 3D (respectively, 2D) lattice and an integer 1 ≤ K ≤ |V|. The goal is to find an induced subgraph of G of at most K vertices with the maximum number of edges. In this paper, we prove the following results:
Conference Paper: Semantic text classification of disease reporting
ABSTRACT: Traditional text classification studied in the IR literature is mainly based on topics. That is, each class or category represents a particular topic, e.g., sports, politics or sciences. However, many real-world text classification problems require more refined classification based on some semantic aspects. For example, in a set of documents about a particular disease, some documents may report the outbreak of the disease, some may describe how to cure the disease, some may discuss how to prevent the disease, and yet others may include all of the above information. To classify text at this semantic level, the traditional "bag of words" model is no longer sufficient. In this paper, we report a text classification study at the semantic level and show that sentence-level semantic and structural features are very useful for this kind of classification. Our experimental results on a disease outbreak dataset demonstrate the effectiveness of the proposed approach.
Conference Paper: Semantic Text Classification of Emergent Disease Reports.
ABSTRACT: Traditional text classification studied in the information retrieval and machine learning literature is mainly based on topics. That is, each class represents a particular topic, e.g., sports and politics. However, many real-world problems require more refined classification based on some semantic perspectives. For example, in a set of sentences about a disease, some may report outbreaks of the disease, some may describe how to cure the disease, and yet some may discuss how to prevent the disease. To classify sentences at this semantic level, the traditional bag-of-words model is no longer sufficient. In this paper, we study semantic sentence classification of disease reporting. We show that both keywords and sentence semantic features are useful. Our results demonstrated that this integrated approach is highly effective.
ABSTRACT: A useful approach to the mathematical analysis of large-scale biological networks is based upon their decompositions into monotone dynamical systems. This paper deals with two computational problems associated to finding decompositions which are optimal in an appropriate sense. In graph-theoretic language, the problems can be recast in terms of maximal sign-consistent subgraphs. The theoretical results include polynomial-time approximation algorithms as well as constant-ratio inapproximability results. One of the algorithms, which has a worst-case guarantee of 87.9% from optimality, is based on the semidefinite programming relaxation approach of Goemans and Williamson (23). The algorithm was implemented and tested on a Drosophila segmentation network and an Epidermal Growth Factor Receptor pathway model, and it was found to perform close to optimally.
ABSTRACT: A useful approach to the mathematical analysis of large-scale biological networks is based upon their decompositions into monotone dynamical systems. This paper deals with two computational problems associated to finding decompositions which are optimal in an appropriate sense. In graph-theoretic language, the problems can be recast in terms of maximal sign-consistent subgraphs. The theoretical results include polynomial-time approximation algorithms as well as constant-ratio inapproximability results. One of the algorithms, which has a worst-case guarantee of 87.9% from optimality, is based on the semidefinite programming relaxation approach of Goemans-Williamson [Goemans, M., Williamson, D., 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42 (6), 1115-1145]. The algorithm was implemented and tested on a Drosophila segmentation network and an Epidermal Growth Factor Receptor pathway model, and it was found to perform close to optimally.
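As an illustration of the graph-theoretic notion mentioned in the abstract above: a signed graph is commonly called sign-consistent when every vertex can be given a label of +1 or -1 such that the sign of each edge equals the product of its endpoints' labels. The following is a minimal Python sketch of that consistency check under this assumed definition. The function name `is_sign_consistent` and the edge-list input format are hypothetical, and the sketch only tests consistency; it is not the approximation algorithm described in the paper.

```python
from collections import deque


def is_sign_consistent(num_vertices, signed_edges):
    """Return True if labels in {+1, -1} exist with sign == label[u] * label[v].

    signed_edges: iterable of (u, v, sign) tuples, sign in {+1, -1}.
    Edges are treated as undirected (an assumption of this sketch).
    """
    adjacency = [[] for _ in range(num_vertices)]
    for u, v, sign in signed_edges:
        adjacency[u].append((v, sign))
        adjacency[v].append((u, sign))

    label = [0] * num_vertices  # 0 means "not yet labeled"
    for start in range(num_vertices):
        if label[start] != 0:
            continue
        # Label each connected component by breadth-first search.
        label[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                expected = label[u] * sign
                if label[v] == 0:
                    label[v] = expected
                    queue.append(v)
                elif label[v] != expected:
                    return False  # conflicting constraints on v
    return True


# A triangle with exactly one negative edge cannot be labeled consistently;
# a triangle with two negative edges can.
one_negative = [(0, 1, 1), (1, 2, 1), (0, 2, -1)]
two_negative = [(0, 1, 1), (1, 2, -1), (0, 2, -1)]
print(is_sign_consistent(3, one_negative))
print(is_sign_consistent(3, two_negative))
```

Finding a largest consistent subgraph is the hard optimization problem the abstract refers to; the check above only verifies a given graph in time linear in its size.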
ABSTRACT: In this paper we investigate the protein sequence design (PSD) problem (also known as the inverse protein folding problem) under the Canonical model (see footnote 4) on 2D and 3D lattices (12, 25). The Canonical model is specified by (i) a geometric representation of a target protein structure with amino acid residues via its contact graph, (ii) a binary folding code in which the amino acids are classified as hydrophobic (H) or polar (P), (iii) an energy function Φ defined in terms of the target structure that should favor sequences with a dense hydrophobic core and penalize those with many solvent-exposed hydrophobic residues (in the Canonical model, the energy function Φ gives an H-H residue contact in the contact graph a value of -1 and all other contacts a value of 0), and (iv) to prevent the solution from being a biologically meaningless all-H sequence, the number of H residues in the sequence S is limited by fixing an upper bound λ on the ratio between H and P amino acids. The sequence S is designed by specifying which residues are H and which ones are P in a way that realizes the global minima of the energy function Φ. In this paper, we prove the following results: (1) An earlier proof of NP-completeness of finding the global energy minima for the PSD problem on 3D lattices in (12) was based on the NP-completeness of the same problem on 2D lattices. However, the reduction was not correct, and we show that the problem of finding the global energy minima for the PSD problem on 2D lattices can be solved efficiently in polynomial time. But we show that the problem of finding the global energy minima for the PSD problem on 3D lattices is indeed NP-complete by providing a different reduction from the problem of finding the largest clique on graphs.
(2) Even though the problem of finding the global energy minima on 3D lattices is NP-complete, we show that an arbitrarily close approximation to the global energy minima can indeed be found efficiently by taking
Footnote 4: The Canonical model is neither the same nor a subset of the Grand Canonical (GC)
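The energy function described in the abstract above (each H-H contact contributes -1, every other contact 0) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `canonical_energy` and `brute_force_design` are hypothetical, the bound on hydrophobic residues is expressed as a plain count `max_h` standing in for the ratio bound λ, and the exhaustive search is exponential and unrelated to the polynomial-time or approximation methods the paper proves.

```python
from itertools import combinations


def canonical_energy(contact_edges, hydrophobic):
    """Energy of an H/P assignment: -1 per contact whose two residues are both H."""
    return -sum(1 for u, v in contact_edges if u in hydrophobic and v in hydrophobic)


def brute_force_design(num_residues, contact_edges, max_h):
    """Try every set of at most max_h hydrophobic residues (tiny inputs only).

    max_h is an assumed stand-in for the bound on H residues in the abstract.
    Returns (lowest energy found, set of residues marked H).
    """
    best_energy, best_set = 0, frozenset()
    for size in range(1, max_h + 1):
        for subset in combinations(range(num_residues), size):
            chosen = frozenset(subset)
            energy = canonical_energy(contact_edges, chosen)
            if energy < best_energy:
                best_energy, best_set = energy, chosen
    return best_energy, best_set


# Hypothetical contact graph: four residues whose contacts form a 4-cycle.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(brute_force_design(4, square, max_h=3))
```

Minimizing this energy is the same as choosing at most `max_h` vertices of the contact graph that induce as many edges as possible, which is how the related abstract earlier in this list states the problem.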
ABSTRACT: This paper studies the problem of extracting locations of disease outbreaks from news articles. A novel technique is proposed based on two types of supervised sequential rule mining, i.e., label sequential rules and class sequential rules. The two types of rules are also combined for extraction. In learning, instead of using sentences as the training sequences, paths from dependency trees are employed. Our experimental results, based on a large number of health news articles from Google News and reports from ProMED-mail, show that the proposed technique works effectively. It dramatically outperforms conditional random fields, the state-of-the-art information extraction technique.
Publication Stats
91  Citations  
2.35  Total Impact Points  
Top Journals
Institutions

2007

Rutgers, The State University of New Jersey
 Department of Mathematics
New Brunswick, New Jersey, United States


2005-2007

University of Illinois at Chicago
 Department of Computer Science
Chicago, Illinois, United States
