Publications (6) · 6.84 Total impact
Article: Signalling network construction for modelling plant defence response.
ABSTRACT: Plant defence signalling response against various pathogens, including viruses, is a complex phenomenon. In resistant interaction a plant cell perceives the pathogen signal, transduces it within the cell and performs a reprogramming of the cell metabolism leading to the pathogen replication arrest. This work focuses on signalling pathways crucial for the plant defence response, i.e., the salicylic acid, jasmonic acid and ethylene signal transduction pathways, in the Arabidopsis thaliana model plant. The initial signalling network topology was constructed manually by defining the representation formalism, encoding the information from public databases and literature, and composing a pathway diagram. The manually constructed network structure consists of 175 components and 387 reactions. In order to complement the network topology with possibly missing relations, a new approach to automated information extraction from biological literature was developed. This approach, named Bio3graph, allows for automated extraction of biological relations from the literature, resulting in a set of (component1, reaction, component2) triplets and composing a graph structure which can be visualised, compared to the manually constructed topology and examined by the experts. Using a plant defence response vocabulary of components and reaction types, Bio3graph was applied to a set of 9,586 relevant full text articles, resulting in 137 newly detected reactions between the components. Finally, the manually constructed topology and the new reactions were merged to form a network structure consisting of 175 components and 524 reactions. The resulting pathway diagram of plant defence signalling represents a valuable source for further computational modelling and interpretation of omics data. 
The developed Bio3graph approach, implemented as an executable language processing and graph visualisation workflow, is publicly available at http://ropot.ijs.si/bio3graph/ and can be utilised for modelling other biological systems, given that an adequate vocabulary is provided.
PLoS ONE 01/2012; 7(12):e51822. · 4.09 Impact Factor
Article: SegMine workflows for semantic microarray data analysis in Orange4WS.
ABSTRACT: In experimental data analysis, bioinformatics researchers increasingly rely on tools that enable the composition and reuse of scientific workflows. The utility of current bioinformatics workflow environments can be significantly increased by offering advanced data mining services as workflow components. Such services can support, for instance, knowledge discovery from diverse distributed data and knowledge sources (such as GO, KEGG, PubMed, and experimental databases). Specifically, cutting-edge data analysis approaches, such as semantic data mining, link discovery, and visualization, have not yet been made available to researchers investigating complex biological datasets. We present a new methodology, SegMine, for semantic analysis of microarray data by exploiting general biological knowledge, and a new workflow environment, Orange4WS, with integrated support for web services in which the SegMine methodology is implemented. The SegMine methodology consists of two main steps. First, the semantic subgroup discovery algorithm is used to construct elaborate rules that identify enriched gene sets. Then, a link discovery service is used for the creation and visualization of new biological hypotheses. The utility of SegMine, implemented as a set of workflows in Orange4WS, is demonstrated in two microarray data analysis applications. In the analysis of senescence in human stem cells, the use of SegMine resulted in three novel research hypotheses that could improve understanding of the underlying mechanisms of senescence and identification of candidate marker genes. Compared to the available data analysis systems, SegMine offers improved hypothesis generation and data interpretation for bioinformatics in an easy-to-use integrated workflow environment.
BMC Bioinformatics 01/2011; 12:416. · 2.75 Impact Factor
Chapter: Workflow Construction for Service-Oriented Knowledge Discovery
ABSTRACT: The paper proposes a Service-oriented Knowledge Discovery (SoKD) framework and a prototype implementation named Orange4WS. To provide the proposed framework with semantics, we use the Knowledge Discovery Ontology, which defines relationships among the ingredients of knowledge discovery scenarios. It enables reasoning about which algorithms can be used to produce the results required by a specified knowledge discovery task, and querying the results of knowledge discovery tasks. In addition, the ontology can also be used for automatic annotation of manually created workflows, facilitating their reuse. Thus, the proposed framework provides an approach to third generation data mining: integration of distributed, heterogeneous data and knowledge resources and software into a coherent and effective knowledge discovery process. The abilities of the prototype implementation have been demonstrated on a text mining use case featuring publicly available data repositories, specialized algorithms, and third-party data analysis tools.
11/2010: pages 313-327
Chapter: Efficient Visualization of Document Streams
ABSTRACT: In machine learning and data mining, multidimensional scaling (MDS) and MDS-like methods are extensively used for dimensionality reduction and for gaining insights into overwhelming amounts of data through visualization. With the growth of the Web and activities of Web users, the amount of data not only grows exponentially but is also becoming available in the form of streams, where new data instances constantly flow into the system, requiring the algorithm to update the model in near-real time. This paper presents an algorithm for document stream visualization through an MDS-like distance-preserving projection onto a 2D canvas. The visualization algorithm is essentially a pipeline employing several methods from machine learning. Experimental verification shows that each stage of the pipeline is able to process a batch of documents in constant time. It is shown that in the experimental setting with a limited buffer capacity and a constant document batch size, it is possible to process roughly 2.5 documents per second, which corresponds to approximately 25% of the entire blogosphere rate and should be sufficient for most real-life applications.
11/2010: pages 174-188
Chapter: Semi-supervised Constrained Clustering: An Expert-Guided Data Analysis Methodology
ABSTRACT: This paper presents a methodology for expert-guided analysis of large data sets, including large text corpora. Its main ingredient is an algorithm for semi-supervised data clustering using cluster size constraints, which implements several improvements over existing k-means constrained clustering algorithms. First, it allows for a larger set of user-defined cluster size constraints of different types (lower- and upper-bound constraints). Second, it allows for dynamic re-assignment of predefined constraints to clusters in iterative cluster computation/optimization, thus improving the results of constrained clustering. Third, it allows for expert-guided cluster optimization achieved by combining constrained clustering and data visualization, which enables finer-grained expert control over the clustering process, leading to further improvements in the quality of the obtained clustering solutions. Incorporating data visualization into the clustering process allows the user to select referential points which act as constraint anchors in the course of iterative cluster computation. The proposed semi-supervised constrained clustering methodology has been implemented using the service-oriented data mining environment Orange4WS and evaluated on different document corpora.
08/2010: pages 219-230
Article: A fast algorithm for mining utility-frequent itemsets
ABSTRACT: Utility-based data mining is a new research area interested in all types of utility factors in data mining processes and targeted at incorporating utility considerations in both predictive and descriptive data mining tasks. High utility itemset mining is a research area of utility-based descriptive data mining, aimed at finding itemsets that contribute most to the total utility. A specialized form of high utility itemset mining is utility-frequent itemset mining, which – in addition to subjectively defined utility – also takes into account itemset frequencies. This paper presents a novel efficient algorithm FUFM (Fast Utility-Frequent Mining) which finds all utility-frequent itemsets within the given utility and support constraints threshold. It is faster and simpler than the original 2P-UF algorithm (2 Phase Utility-Frequent), as it is based on efficient methods for frequent itemset mining. Experimental evaluation on artificial datasets shows that, in contrast with 2P-UF, our algorithm can also be applied to mine large databases.
Top Journals
- BMC Bioinformatics (1)
- PLoS ONE (1)
Institutions
- 2010–2011: Jožef Stefan Institute, Ljubljana, Slovenia