ABSTRACT: Subgroup discovery methods find interesting subsets of objects of a given class. Motivated by an application in bioinformatics, we first define a generalized subgroup discovery problem. In this setting, a subgroup is interesting if its members are characteristic for their class, even if the classes are not identical. We then refine this setting for the case where subsets of objects, for example those that represent different time points or different phenotypes, are contrasted. We show that this allows finding subgroups of objects that could not be found with classical subgroup discovery. To find such subgroups, we propose an approach that consists of two subgroup discovery steps and an intermediate contrast set definition step. This approach is applicable in various application areas. An example is biology, where interesting subgroups of genes are sought using gene expression data. We address the problem of finding enriched gene sets that are specific for virus-infected samples for a specific time point or a specific phenotype. We report on experimental results on a time series dataset for virus-infected Solanum tuberosum (potato) plants. The results on S. tuberosum's response to virus infection revealed new research hypotheses for plant biologists.
The Computer Journal 03/2013; 56(3):289-303. DOI:10.1093/comjnl/bxs132 · 0.79 Impact Factor
ABSTRACT: We introduce the problem of identifying representative nodes in probabilistic graphs, motivated by the need to produce different simple views of large BisoNets. We define a probabilistic similarity measure for nodes, and then apply clustering methods to find groups of nodes. Finally, a representative is output from each cluster. We report on experiments with real biomedical data, using both the k-medoids and hierarchical clustering methods in the clustering step. The results suggest that the clustering-based approaches are capable of finding a representative set of nodes.
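The last step described in this abstract, outputting one representative per cluster, can be illustrated with a minimal sketch. It assumes the clustering has already been computed and that pairwise node similarities are available; the node names, the similarity values, and the choice of "highest total similarity to the rest of the cluster" as the selection rule are illustrative assumptions, not the measure defined in the paper.

```python
# Hypothetical pairwise similarities between nodes (values are made up).
pair_similarity = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.3,
    frozenset(("d", "e")): 0.7,
    frozenset(("e", "f")): 0.6,
    frozenset(("d", "f")): 0.2,
}


def sim(u, v):
    """Symmetric similarity lookup; unknown pairs default to 0.0."""
    return pair_similarity.get(frozenset((u, v)), 0.0)


def medoid_representatives(clusters):
    """Pick, for each cluster, the member most similar to the other members."""
    representatives = []
    for members in clusters:
        best = max(
            members,
            key=lambda u: sum(sim(u, v) for v in members if v != u),
        )
        representatives.append(best)
    return representatives


# Assumed output of a prior clustering step (e.g. k-medoids or hierarchical).
clusters = [["a", "b", "c"], ["d", "e", "f"]]
representatives = medoid_representatives(clusters)
```

With these made-up values, node "a" has the highest total similarity in the first cluster and node "e" in the second, so they are the two representatives.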
ABSTRACT: Subgroup discovery methods find interesting subsets of objects of a given class. We propose to extend subgroup discovery by a second subgroup discovery step to find interesting subgroups of objects specific for a class in one or more contrast classes. First, a subgroup discovery method is applied. Then, contrast classes of objects are defined by using set-theoretic functions on the discovered subgroups of objects. Finally, subgroup discovery is performed to find interesting subgroups within the two contrast classes, pointing out differences between the characteristics of the two. This has various application areas, one being biology, where finding interesting subgroups has been addressed widely for gene-expression data. There, our method finds enriched gene sets which are common to samples in a class (e.g., differential expression in virus-infected versus non-infected) and at the same time specific for one or more class attributes (e.g., time points or genotypes). We report on experimental results on a time-series data set for virus-infected potato plants. The results present a comprehensive overview of potato's response to virus infection and reveal new research hypotheses for plant biologists.
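The intermediate step, defining contrast classes with set-theoretic functions on the subgroups found in the first step, can be sketched as follows. The gene identifiers, the time-point labels, and the particular choice of set difference and intersection are assumptions made for illustration; the abstract does not state which set functions the authors use.

```python
# Hypothetical subgroups from a first subgroup discovery step,
# keyed by time point (all identifiers are made up).
subgroups = {
    "day1": {"g1", "g2", "g3", "g4"},
    "day3": {"g3", "g4", "g5"},
    "day6": {"g4", "g5", "g6"},
}


def contrast_classes(subgroups, target):
    """Split the target subgroup into two contrast classes.

    Returns (specific, shared): genes found only at `target`, and genes
    that `target` has in common with at least one other subgroup.
    """
    others = set().union(
        *(genes for name, genes in subgroups.items() if name != target)
    )
    specific = subgroups[target] - others  # set difference
    shared = subgroups[target] & others    # set intersection
    return specific, shared


specific, shared = contrast_classes(subgroups, "day1")
```

The two resulting sets would then serve as the class labels for the second subgroup discovery step.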
ABSTRACT: We propose a relatively simple yet powerful model for choosing relevant and non-redundant pieces of information. The model addresses data mining or information retrieval settings where relevance is measured with respect to a set of key or query objects, either specified by the user or obtained by a data mining step. The problem addressed is not only to identify other relevant objects, but also to ensure that they are not related to possible negative query objects, and that they are not redundant with respect to each other. The model proposed here only assumes a similarity or distance function for the objects. It has simple parameterization to allow for different behaviors with respect to query objects. We analyze the model and give two efficient, approximate methods. We illustrate and evaluate the proposed model on different applications: linguistics and social networks. The results indicate that the model and methods are useful in finding a relevant and non-redundant set of results. While this area has been a popular topic of research, our contribution is to provide a simple, generic model that covers several related approaches while providing a systematic way of taking into account positive and negative query objects as well as non-redundancy of the output.
ABSTRACT: In experimental data analysis, bioinformatics researchers increasingly rely on tools that enable the composition and reuse of scientific workflows. The utility of current bioinformatics workflow environments can be significantly increased by offering advanced data mining services as workflow components. Such services can support, for instance, knowledge discovery from diverse distributed data and knowledge sources (such as GO, KEGG, PubMed, and experimental databases). Specifically, cutting-edge data analysis approaches, such as semantic data mining, link discovery, and visualization, have not yet been made available to researchers investigating complex biological datasets.
We present a new methodology, SegMine, for semantic analysis of microarray data by exploiting general biological knowledge, and a new workflow environment, Orange4WS, with integrated support for web services in which the SegMine methodology is implemented. The SegMine methodology consists of two main steps. First, the semantic subgroup discovery algorithm is used to construct elaborate rules that identify enriched gene sets. Then, a link discovery service is used for the creation and visualization of new biological hypotheses. The utility of SegMine, implemented as a set of workflows in Orange4WS, is demonstrated in two microarray data analysis applications. In the analysis of senescence in human stem cells, the use of SegMine resulted in three novel research hypotheses that could improve understanding of the underlying mechanisms of senescence and identification of candidate marker genes.
Compared to the available data analysis systems, SegMine offers improved hypothesis generation and data interpretation for bioinformatics in an easy-to-use integrated workflow environment.
ABSTRACT: We introduce the problem of identifying a diverse set of nodes representing relations of query nodes. This is motivated by biological graphs, where not only the obvious relations, but especially non-obvious relations are of interest. We introduce a method to find a set of diverse nodes given query nodes on a probabilistic graph. The method is based on a probabilistic similarity measure, a representative measure for query nodes, and an extended representative measure for already selected nodes or negative query nodes. Experimental results on real data sets show that our method is able to find such a diverse set of nodes. This is a seminar report about my own ongoing research and hence others' work is cited almost only in Section 4.
ABSTRACT: In this seminar report we describe two story generation systems: TALE-SPIN, one of the earliest approaches, and the Virtual Storyteller, one of the latest approaches. Though the former produces compelling stories, we show why the latter produces stories that are closer to human-produced stories. We discuss similarities and differences of the two systems. In particular, we show that the main idea (character- and goal-orientation) is the same in both, and the main difference is that the Virtual Storyteller distinguishes between the fabula and story text generation, while TALE-SPIN does not. Finally, we present some open questions and ideas on how to improve virtual story generation.