ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper represents the collapsed LDA as a factor graph, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although the two commonly used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have achieved great success in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document datasets. Furthermore, the BP algorithm has the potential to become a generic scheme for learning variants of LDA-based topic models in the collapsed space. To this end, we show how to learn two typical variants of LDA-based topic models, namely author-topic models (ATM) and relational topic models (RTM), using BP based on their factor graph representations.
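The collapsed-space message updates at the heart of such a BP learner can be sketched in a few lines. The sketch below is a synchronous, simplified rendering, not the paper's exact algorithm; the toy corpus and hyperparameter values are illustrative. Each token carries a message (topic responsibility), updated from document-topic and word-topic message counts that exclude its own contribution.

```python
import numpy as np

def bp_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Toy synchronous message passing for collapsed LDA.

    docs: list of word-index lists; K topics; V vocabulary size.
    mu[i, k] is the message (topic responsibility) of token i.
    """
    rng = np.random.default_rng(seed)
    tokens = [(d, w) for d, doc in enumerate(docs) for w in doc]
    mu = rng.random((len(tokens), K))
    mu /= mu.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # Aggregate current messages into document-topic and word-topic counts.
        nd = np.zeros((len(docs), K))
        nw = np.zeros((V, K))
        for i, (d, w) in enumerate(tokens):
            nd[d] += mu[i]
            nw[w] += mu[i]
        nk = nw.sum(axis=0)
        # Update each token's message, excluding its own contribution.
        for i, (d, w) in enumerate(tokens):
            new = (nd[d] - mu[i] + alpha) * (nw[w] - mu[i] + beta) \
                  / (nk - mu[i] + V * beta)
            mu[i] = new / new.sum()
    return mu

docs = [[0, 0, 1], [2, 3, 3], [0, 1, 1]]  # toy corpus, vocabulary of 4 words
mu = bp_lda(docs, K=2, V=4)
assert np.allclose(mu.sum(axis=1), 1.0)   # each message stays a distribution
```

Topic-word and document-topic estimates follow by normalizing `nw` and `nd` after convergence, in the same way GS uses its count tables.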
ABSTRACT: For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents have used only one or two of these types of information and are not necessarily effective at integrating them. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, a simple linear combination can be ineffective because the representation space is limited when combining different types of information (similarities) of different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, for clustering over the LC similarities with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering, using 100 datasets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistical significance. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, whereas this figure was only 1% for ML constraints.
IEEE Transactions on Cybernetics. 01/2013; 43(4):1265-1276.
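As a rough illustration of steering a normalized cut with ML/CL constraints, the sketch below imposes the constraints by editing the similarity matrix directly before a standard two-way spectral bipartition. The actual SSNCut formulation differs, and the toy matrix and constraint pairs are invented for illustration.

```python
import numpy as np

def constrained_ncut(S, must_link, cannot_link):
    """Two-way normalized cut over a similarity matrix S, with must-link /
    cannot-link pairs imposed by editing S (a simplified stand-in for the
    SSNCut formulation, which handles constraints differently)."""
    S = S.astype(float).copy()
    for i, j in must_link:
        S[i, j] = S[j, i] = 1.0          # force high affinity
    for i, j in cannot_link:
        S[i, j] = S[j, i] = 0.0          # force low affinity
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} S D^{-1/2}
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second-smallest eigenvector
    return (fiedler > 0).astype(int)     # sign pattern gives the bipartition

# Two noisy blocks plus one ambiguous document (index 2), pulled into the
# first cluster by a must-link constraint and away from the second by a CL one.
S = np.array([[1.0, 0.9, 0.4, 0.1, 0.1],
              [0.9, 1.0, 0.4, 0.1, 0.1],
              [0.4, 0.4, 1.0, 0.5, 0.5],
              [0.1, 0.1, 0.5, 1.0, 0.9],
              [0.1, 0.1, 0.5, 0.9, 1.0]])
labels = constrained_ncut(S, must_link=[(0, 2)], cannot_link=[(2, 4)])
assert labels[0] == labels[2] != labels[4]
```

Extending this to k clusters would use the k smallest eigenvectors followed by a k-means step, as in standard spectral clustering.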
ABSTRACT: This paper presents the coauthor network topic (CNT) model, constructed from Markov random fields (MRFs) with higher-order cliques. Regularized by complex coauthor network structures, the CNT can simultaneously learn topic distributions and the expertise of authors from large document collections. Besides modeling pairwise relations, we also model higher-order coauthor relations and investigate their effects on topic and expertise modeling. We derive efficient inference and learning algorithms from the Gibbs sampling procedure. To confirm its effectiveness, we apply the CNT to the expert finding problem on a DBLP corpus of titles from six different computer science conferences. Experiments show that the higher-order relations among coauthors improve topic and expertise modeling over pairwise relations alone, and thus can find more relevant experts given a query topic or document.
2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT); 10/2010
ABSTRACT: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem.
We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure, referred to as Gene Log Likelihood (GLL), which accounts for genes having more than one function and hence belonging to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods.
The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. The proposed GLL is also an effective performance measure for soft clustering solutions.
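The generalized-projection loop described above can be illustrated on two simple convex sets. The constraint sets here (a box and a hyperplane) are hypothetical stand-ins for the paper's gene-clustering constraints; the point is only the cyclic-projection mechanics of finding a point in the intersection.

```python
import numpy as np

def pocs(x, projections, iters=200):
    """Projection onto convex sets: cyclically project a starting point
    onto each constraint set. When the intersection is non-empty, the
    iterates converge to a point satisfying all constraints."""
    for _ in range(iters):
        for proj in projections:
            x = proj(x)
    return x

# Toy convex constraints (illustrative, not the paper's): the box [0, 1]^n
# and the hyperplane sum(x) == 1, whose intersection is non-empty.
n = 4
proj_box = lambda x: np.clip(x, 0.0, 1.0)            # projection onto the box
proj_plane = lambda x: x + (1.0 - x.sum()) / n       # orthogonal projection onto sum(x) = 1

x = pocs(np.array([2.0, -1.0, 0.5, 0.3]), [proj_box, proj_plane])
assert abs(x.sum() - 1.0) < 1e-6                     # hyperplane constraint met
assert np.all(x >= -1e-6) and np.all(x <= 1 + 1e-6)  # box constraint met
```

Each gene-clustering constraint (expression, GO, network) would contribute its own projector in place of the two toy ones.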
ABSTRACT: This paper integrates signal, context, and structure features for genome-wide human promoter recognition, which is important for improving genome annotation and analyzing transcriptional regulation without experimental support from ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters may have biological significance, which is based on n-mers (sequences n bases long) and their statistics estimated from training samples. Third, sequence-dependent DNA flexibility originates from DNA 3D structure and plays an important role in guiding transcription factors to their target sites in promoters. Employing decision trees, we combine the above signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled datasets and the entire human genome demonstrate that SCS is significantly superior in sensitivity and specificity to other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental material for academic use and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.
IEEE/ACM Transactions on Computational Biology and Bioinformatics. 01/2010; 7(3):550-562. · 2.25 Impact Factor
ABSTRACT: Extracting perceptually meaningful strokes plays an essential role in modeling the structures of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation process, we use a BU MRF model with a smoothness prior to segment the character skeleton into directional substrokes based on self-organization of pixel-based directional features. In the high-level stroke extraction process, the segmented substrokes are sent to a TD MRF-based character model that, in turn, feeds back to guide the merging of corresponding substrokes to produce reliable candidate strokes for character recognition. The merit of the cascade MRF model lies in its ability to encode the local statistical dependencies of neighboring stroke components as well as prior knowledge of Chinese character structures. Encouraging stroke extraction and character recognition results confirm the effectiveness of our method, which integrates both the BU and TD vision processing streams within a unified MRF framework.
ABSTRACT: DNA rigidity is an important physical property originating from the DNA three-dimensional structure. Although the general DNA rigidity patterns in human promoters have been investigated, their distinct roles in transcription are largely unknown. In this paper, we discover four highly distinct human promoter groups based on the similarity of their rigidity profiles. First, we find that all promoter groups conserve relatively rigid DNAs at the canonical TATA box [a consensus TATA(A/T)A(A/T) sequence] position, which are important physical signals for binding transcription factors. Second, we find that the genes activated by each group of promoters share significant biological functions based on their Gene Ontology annotations. Finally, we find that these human promoter groups correlate with tissue-specific gene expression.
ABSTRACT: Clustering MEDLINE documents is usually conducted with the vector space model, which computes the content similarity between two documents essentially as the inner product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus has been applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors to be clustered. However, current approaches to using the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text is discarded.
Our new strategy includes three key points. First, we develop a sound method for measuring the semantic similarity between two documents over the MeSH thesaurus. Second, we combine both the semantic and content similarities to generate the integrated similarity matrix between documents. Third, we apply a spectral approach to clustering documents over the integrated similarity matrix.
Using 100 diverse datasets of MEDLINE records, we conduct extensive experiments, varying the similarity measures and parameters. Experimental results show that integrating the semantic and content similarities outperforms using only one of the two, with statistical significance. We further find a best parameter setting that is consistent across all experimental conditions. We finally show a typical example of the resultant clusters, confirming the effectiveness of our strategy in improving MEDLINE document clustering.
Supplementary data are available at Bioinformatics online.
ABSTRACT: This review describes important advances made during the past decade in genome-wide human promoter recognition. Interest in genome-wide promoter recognition algorithms is worldwide, and they touch on a number of practical systems that are important in analyzing gene regulation and in genome annotation without experimental support from ESTs, cDNAs or mRNAs. The main focus of this review is on feature extraction and model selection for accurate human promoter recognition, with descriptions of what they are, what has been accomplished, and what remains to be done.
Briefings in Bioinformatics 07/2009; 10(5):498-508. · 5.30 Impact Factor
ABSTRACT: This paper discovers consensus physical signals around eukaryotic splice sites, transcription start sites, and replication origin start and end sites on a genome-wide scale based on their DNA flexibility profiles calculated by three different flexibility models. These salient physical signals are localized highly rigid and highly flexible DNAs, which may play important roles in protein-DNA recognition through the sliding search mechanism. The discovered physical signals lead us to a detailed hypothetical view of the search process in which a DNA-binding protein first finds a genomic region close to the target site from an arbitrary starting location by three-dimensional (3D) hopping and intersegment transfer mechanisms over long distances, and subsequently uses the one-dimensional (1D) sliding mechanism, facilitated by the localized highly rigid DNAs, to accurately locate the target flexible binding site within short distances of 30 bp (base pairs). Guided by these physical signals, DNA-binding proteins rapidly search the entire genome to recognize a specific target site via the 3D-to-1D pathway. Our findings also show that current promoter prediction programs (PPPs) based on DNA physical properties may suffer from many false positives because other functional sites, such as splice sites and replication origins, have physical signals similar to those of promoters.
ABSTRACT: This paper presents a multimodal system for reliable human identity recognition under varying conditions. Our system fuses face and speech recognition within a general probabilistic framework. For face recognition, we propose a new spectral learning algorithm that considers not only the discriminative relations among the training data but also the generative models for each class. Because of the tedious cost of face labeling in practice, our spectral face learning uses a semi-supervised strategy: only a small number of labeled faces are used in the training step, and the labels are optimally propagated to the other unlabeled training faces. Besides requiring much less labeled data, our algorithm also enables a natural way to explicitly train an outlier model that approximately represents unauthorized faces. To boost the robustness of our system for human recognition in various environments, our face recognition is further complemented by a speaker identification agent. Specifically, this agent models the statistical variations of fixed-phrase speech using speaker-dependent word hidden Markov models. Experiments on benchmark databases validate the effectiveness of our face recognition and speaker identification agents, and demonstrate that recognition accuracy can be appreciably improved by integrating these two independent biometric sources.
ABSTRACT: We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with the distinct fields title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), incorporates the distinct word distribution of each field to integrate the discriminative abilities of each field, as well as to select the most suitable component probabilistic model for each field. We evaluated the performance of FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model at a statistically significant level for all three collections. These results indicate that FICM outperformed widely used probabilistic models for document clustering by considering the characteristics of each field. We further showed that a component model consistent with the nature of the corresponding field achieved better performance, and that considering the diversity of model settings gave a further performance improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)].
ABSTRACT: In this paper we propose the multirelational topic model (MRTM) for modeling multiple types of links, such as citation and coauthor links, in document networks. In the citation network, the MRTM models the citation link between each pair of documents as a binary variable conditioned on their topic distributions. In the coauthor network, the MRTM models the coauthor link between each pair of authors as a binary variable conditioned on their expertise distributions. The topic discovery is collectively regularized by multiple relations in both the citation and coauthor networks. This model can summarize topics from the document network, and predict citation links between documents and coauthor links between authors. Efficient inference and learning algorithms are derived based on Gibbs sampling. Experiments demonstrate that the MRTM significantly outperforms other state-of-the-art single-relational link modeling methods on large scientific document networks.
ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009; 01/2009
ABSTRACT: This paper presents a POCS-based (projection onto convex sets) method that estimates the unobserved time points in microarray time-series data to make such data useful for clustering and alignment. Unobserved values are caused either by missing values or by uneven sampling rates, and cannot be estimated accurately by straightforward interpolation because the data are very noisy and have few replicates. Based on the prior knowledge that each gene time series is constrained in both the time and frequency domains, POCS formulates these constraints as multiple convex sets and uses an iteratively convergent procedure to find the optimal value that satisfies all constraints. To estimate the unobserved values, we use the cubic spline method for the initial estimate and use POCS to refine it iteratively. We show that POCS improves the estimation of unobserved time points, with a lower normalized root mean squared error than statistical spline estimation for the continuous representation of microarray time-series data. Theoretically, the POCS-based method may improve estimation performance further if more prior knowledge is available.
2008 International Conference on Machine Learning and Cybernetics; 08/2008
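A minimal sketch of this time/frequency-domain POCS idea, in the style of Papoulis-Gerchberg restoration rather than the paper's exact formulation: alternate between a band-limiting (smoothness) projection in the frequency domain and a keep-observed-samples projection in the time domain. The toy signal, bandwidth, and iteration count are illustrative.

```python
import numpy as np

def pocs_restore(signal, observed_mask, keep_freqs, iters=300):
    """Estimate unobserved time points by alternating two convex constraints
    (a Papoulis-Gerchberg-style sketch, not the paper's exact method):
    (1) frequency domain: the series is smooth, i.e. band-limited to the
        lowest `keep_freqs` frequencies;
    (2) time domain: observed samples keep their measured values."""
    x = np.where(observed_mask, signal, 0.0)      # crude initialization
    for _ in range(iters):
        X = np.fft.rfft(x)
        X[keep_freqs:] = 0.0                      # project onto band-limited set
        x = np.fft.irfft(X, n=len(x))
        x[observed_mask] = signal[observed_mask]  # project onto observed-data set
    return x

t = np.linspace(0, 1, 64, endpoint=False)
truth = np.sin(2 * np.pi * 2 * t)                 # smooth (band-limited) series
mask = np.ones(64, dtype=bool)
mask[10:14] = False                               # four unobserved time points
restored = pocs_restore(truth, mask, keep_freqs=5)
assert np.max(np.abs(restored - truth)) < 0.05
```

In the paper's setting, the crude zero initialization would be replaced by the cubic-spline estimate, which the alternating projections then refine.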
ABSTRACT: In this paper, we integrate type-2 (T2) fuzzy sets with Markov random fields (MRFs), referred to as T2 FMRFs, which can handle both fuzziness and randomness in structural pattern representation. On the one hand, the T2 membership function (MF) has a 3-D structure in which the primary MF describes randomness and the secondary MF evaluates the fuzziness of the primary MF. On the other hand, MRFs can represent patterns statistically and structurally in terms of a neighborhood system and clique potentials, and have thus been widely applied to image analysis and computer vision. In the proposed T2 FMRFs, we define the same neighborhood system as in classical MRFs. To describe uncertain structural information in patterns, we derive the fuzzy likelihood clique potentials from T2 fuzzy Gaussian mixture models. The fuzzy prior clique potentials are penalties for mismatched structures based on prior knowledge. Because Chinese characters have hierarchical structures, we use T2 FMRFs to model character structures in a handwritten Chinese character recognition system. The overall recognition rate is 99.07%, which confirms the effectiveness of the proposed method.
IEEE Transactions on Fuzzy Systems 07/2008; · 5.48 Impact Factor
ABSTRACT: This paper proposes a statistical-structural character modeling method based on Markov random fields (MRFs) for handwritten Chinese character recognition (HCCR). The stroke relationships of a Chinese character reflect its structure, which can be statistically represented by the neighborhood system and clique potentials within the MRF framework. Based on prior knowledge of character structures, we design a neighborhood system that accounts for the most important stroke relationships. We penalize structurally mismatched stroke relationships with MRFs using the prior clique potentials, and derive the likelihood clique potentials from Gaussian mixture models, which statistically encode the large variations in stroke relationships. In the proposed HCCR system, we use the single-site likelihood clique potentials to extract many candidate strokes from character images, and use the pair-site clique potentials to determine the best structural match between the input candidate strokes and the MRF-based character models by relaxation labeling. Experiments on the KAIST character database demonstrate that MRFs can statistically model character structures and work well in the HCCR system.
IEEE Transactions on Pattern Analysis and Machine Intelligence 06/2008; 30(5):767-80. · 4.80 Impact Factor
ABSTRACT: The capacity of transcription factors to activate gene expression is encoded in the promoter sequences, which are composed of short regulatory motifs that function as transcription factor binding sites (TFBSs) for specific proteins. To the best of our knowledge, the structural property of TFBSs that controls transcription is still poorly understood. Rigidity is one of the important structural properties of DNA, and plays an important role in guiding DNA-binding proteins to the target sites efficiently. After analyzing the rigidity of 2897 TFBSs in 1871 human promoters, we show that TFBSs are generally more flexible than other genomic regions such as exons, introns, 3' untranslated regions, and TFBS-poor promoter regions. Furthermore, we find that the density of TFBSs is consistent with the average rigidity profile of human promoters upstream of the transcription start site, which implies that TFBSs directly influence the promoter structure. We also examine the local rigid regions probably caused by specific TFBSs such as the DNA sequence TATA(A/T)A(A/T) box, which may inhibit nucleosomes and thereby facilitate the access of transcription factors bound nearby. Our results suggest that the structural property of TFBSs accounts for the promoter structure as well as promoter activity.
ABSTRACT: Sequence-dependent DNA flexibility is an important structural property originating from the DNA 3D structure. In this paper, we investigate the DNA flexibility of budding yeast (S. cerevisiae) replication origins on a genome-wide scale using flexibility parameters from two different models, the trinucleotide and tetranucleotide models. Based on analysis of the average flexibility profiles of 270 replication origins, we find that yeast replication origins are significantly rigid compared with their surrounding genomic regions. To further understand this highly distinctive property of replication origins, we compare the flexibility patterns of yeast replication origins and promoters, and find that both contain significantly rigid DNAs. Our results suggest that DNA flexibility is an important factor that helps proteins recognize and bind their target sites in order to initiate DNA replication. Inspired by the role of the rigid region in promoters, we speculate that the rigid replication origins may facilitate the binding of proteins, including the origin recognition complex (ORC), Cdc6, Cdt1 and the MCM2-7 complex.
ABSTRACT: We present a subword lexical chaining approach to automatic story segmentation of Chinese broadcast news (BN). Conventional lexical chains link related words by cohesion (e.g., repetition of words), and high concentration points of starting and ending chains are indicative of story boundaries. However, inevitable speech recognition errors in BN transcripts may destroy the cohesiveness of words, resulting in word match failures. We show the robustness of Chinese subwords (characters and syllables) for lexical matching in errorful ASR transcripts. This motivates us to discover story boundaries on chains formed by character and syllable n-gram units. Experimental results on the TDT2 Mandarin corpus show that chaining by character unigrams achieves the best story segmentation performance, with a relative F-measure improvement of 6.06% over conventional word chaining. Integrating multiple scales (words and subwords) yields further improvement: for example, fusion by voting across scales achieves an F-measure gain of 9.04% over words.
Advances in Multimedia Information Processing - PCM 2008, 9th Pacific Rim Conference on Multimedia, Tainan, Taiwan, December 9-13, 2008. Proceedings; 01/2008
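The chaining-and-scoring idea above can be sketched as follows. This is a simplified stand-in for the paper's system (no ASR transcripts, no syllable units, no multi-scale fusion): chains link repeated character units across nearby sentences, and a candidate boundary scores high when many chains end just before it and many start just after it. The toy "sentences" and gap threshold are illustrative.

```python
from collections import defaultdict

def chain_boundary_scores(sentences, max_gap=3):
    """Score candidate story boundaries from subword lexical chains.
    A chain links repeated units (here: characters) whose successive
    occurrences are at most `max_gap` sentences apart."""
    # Collect, per unit, the sentence indices where it occurs.
    occurrences = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for ch in set(sent):
            occurrences[ch].append(idx)
    # Split each occurrence list into chains wherever the gap is too large.
    chains = []
    for positions in occurrences.values():
        start = prev = positions[0]
        for p in positions[1:]:
            if p - prev > max_gap:
                chains.append((start, prev))
                start = p
            prev = p
        chains.append((start, prev))
    # Boundary between sentences i and i+1 scores one point for each chain
    # ending exactly at i and each chain starting exactly at i+1.
    scores = []
    for i in range(len(sentences) - 1):
        ends = sum(1 for s, e in chains if e == i)
        starts = sum(1 for s, e in chains if s == i + 1)
        scores.append(ends + starts)
    return scores

# Two toy "stories" of three sentences each; the true boundary is after
# sentence index 2, where the character vocabulary changes.
sents = ["abc", "abd", "abe", "xyz", "xyw", "xyv"]
scores = chain_boundary_scores(sents, max_gap=2)
assert scores.index(max(scores)) == 2
```

In the full system, peaks in such a score profile would be thresholded to place boundaries, and the same scoring could be run over syllable n-grams before fusing scales by voting.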