About
74 Publications
18,007 Reads
569 Citations
Publications (74)
Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the...
While the interpretability of machine learning models is often equated with their mere syntactic comprehensibility, we think that interpretability goes beyond that, and that human interpretability should also be investigated from the point of view of cognitive science. The goal of this paper is to discuss to what extent cognitive biases may affect...
AMIE+ is a state-of-the-art algorithm for learning rules from RDF knowledge graphs (KGs). Based on association rule learning, AMIE+ constituted a breakthrough in terms of speed on large data compared to the previous generation of ILP-based systems. In this paper we present several algorithmic extensions to AMIE+, which make it faster, and the suppo...
So far, most user studies dealing with comprehensibility of machine learning models have used questionnaires or surveys to acquire input from participants. In this article, we argue that compared to questionnaires, the use of an adapted version of a real machine learning interface can yield a new level of insight into what attributes make a machine...
This book constitutes the proceedings of the International Joint Conference on Rules and Reasoning, RuleML+RR 2020, held in Oslo, Norway, during June-July 2020. This is the 4th conference of a new series, joining the efforts of two existing conference series, namely “RuleML” (International Web Rule Symposium) and “RR” (Web Reasoning and Rule Syste...
Several methods for creating classifiers based on rules discovered via association rule mining have been proposed in the literature. These classifiers are called associative classifiers and the best-known algorithm is Classification Based on Associations (CBA). Interestingly, only very few implementations are available and, until recently, no imple...
It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpre...
The areas of machine learning and knowledge discovery in databases have considerably matured in recent years. In this article, we briefly review recent developments as well as classical algorithms that stood the test of time. Our goal is to provide a general introduction into different tasks such as learning from tabular data, behavioral data, or t...
SimLex-999 is a widely used lexical resource for tracking progress in word similarity computation. It anchors similarity in synonymy, while other researchers such as Agirre et al. (2009) adopt a broader definition of similarity that also involves hyponymy and antonymy relations. Paradigmatic association covers synonymy, antonymy and co-hyponymy relations (...
It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this...
EasyMiner (http://www.easyminer.eu) is a web-based system for interpretable machine learning based on frequent itemsets. It currently offers association rule learning (apriori, FP-Growth) and classification (CBA). EasyMiner offers a visual interface designed for interactivity, allowing the user to define a constraining pattern for the mining task....
Quantitative CBA (QCBA) is a postprocessing algorithm for the association rule classification algorithm CBA (Liu et al., 1998). QCBA uses the original, undiscretized numerical attributes to optimize the discovered association rules, refining the boundaries of literals in the antecedent of the rules produced by CBA. Some rules as well as literals from the rules can...
PMML is an industry-standard XML-based open format for representing statistical and data mining models. Since PMML does not yet support outlier (anomaly) detection, in this paper we propose a new outlier detection model to foster interoperability in this emerging field. Our proposal is included in the PMML RoadMap for PMML 4.4. We demonstrate the p...
This paper presents a use case for the data mining system EasyMiner in the European project OpenBudgets.eu, which is concerned with the publication and analysis of financial data of municipalities. EasyMiner is a web-based data mining system. This paper focuses on its new outlier detection functionality, which relies on frequent pattern mining. In additi...
Interest Beat (inbeat.eu) is an open source recommender framework that fulfills some of the demands raised by emerging applications that infer ratings from sensor input or use the linked open data cloud for feature expansion. As a recommender algorithm, InBeat uses association rules, which make it possible to explain why a specific recommendation was made. Due to...
The type of the entity being described is one of the key pieces of information in linked data knowledge graphs. In this article, we introduce a novel technique for type inference that extracts types from the free text description of the entity combining lexico-syntactic pattern analysis with supervised classification. For lexico-syntactic (Hearst)...
In this paper, we present a crowdsourced dataset which adds entity salience (importance) annotations to the Reuters-128 dataset, which is a subset of Reuters-21578. The dataset is distributed under a free license and published in the NLP Interchange Format, which fosters interoperability and re-use. We show the potential of the dataset on the task of l...
In this paper, we present experiments evaluating Association Rule Classification algorithms on on-line and off-line recommender tasks of the CLEF NewsReel 2014 Challenge. The second focus of the experimental evaluation is to investigate possible performance optimizations of the Classification Based on Associations algorithm. Our findings indicate t...
EasyMiner is a web-based visual interface for association rule learning. This paper presents a preview of the next release, which uses the R environment as the data processing backend. EasyMiner/R uses the arules package to learn rules. It uses the Classification Based on Associations (CBA) algorithm as a classifier and to perform rule pruning. Ex...
In this paper, we provide a brief summary of elementary research in rule learning. The two main research directions are descriptive rule learning, with the goal of discovering regularities that hold in parts of the given dataset, and predictive rule learning, which aims at generalizing the given dataset so that predictions on new data can be made....
Visual concept detection is one of the most active research areas in multimedia analysis. The goal of visual concept detection is to assign to each elementary temporal segment of a video, a confidence score for each target concept (e.g. forest, ocean, sky, etc.). The establishment of such associations between the video content and the concept label...
Vol-1417: http://ceur-ws.org/Vol-1417/ (preface: http://ceur-ws.org/Vol-1417/preface.pdf)
The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English and German Wikipedia articles with types in the DBpedia namespace. The types are extracted from the first sentences of Wikipedia articles using Hearst pattern matching over part-of-speech annotated text and disambiguated to DBpedia concepts. The dataset covers 1.3 mill...
Client-side execution of a recommender system requires enrichment of the content delivered to the user with a list of potentially related content. A possible bottleneck for client-side recommendation is the data volume entailed by transferring the feature set describing each content item to the client, and the computational resources needed to pr...
This paper presents the implementation of a classification system based on learning of association rules in conjunction with the Drools rule engine. The rules are interactively discovered with the web-based data mining system EasyMiner.eu. The rules are approved and edited by the domain expert before they are deployed for classification.
The main obstacles for a straightforward use of association rules as candidate business rules are the excessive number of rules discovered even on small datasets, and the fact that contradicting rules are generated. This paper shows that Association Rule Classification algorithms, such as CBA, solve both these problems, and provides a practical g...
This paper demonstrates Interest Beat (InBeat.eu) as a recommender system for online videos, which determines user interest in the content based on gaze tracking with Microsoft Kinect in addition to explicit user feedback. Content of the videos is represented using a semantic wikifier. User profile is constructed from preference rules, which are di...
The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English and German Wikipedia articles with types taken from the DBpedia namespace. LHD contains 2.8 million entity-type assignments. Accuracy evaluation is provided for all languages. These types are generated based on a one-word hypernym extracted from the free text of Wikiped...
This paper presents a statistical type inference algorithm for ontology alignment, which assigns DBpedia entities with a new type (class). To infer types for a specific entity, the algorithm first identifies types that co-occur with the type the entity already has, and subsequently prunes the set of candidates for the most confident one. The algori...
This paper reports on the participation of the LKD team in the English entity linking task at the TAC KBP 2013. We evaluated various modifications and combinations of the Most-Frequent-Sense (MFS) based linking, the Entity Co-occurrence based linking (ECC), and the Explicit Semantic Analysis (ESA) based linking. We employed two of our Wikipedia-based...
GAIN (inbeat.eu) is a web application and service for capturing and preprocessing user interactions with semantically described content. GAIN outputs a set of instances in tabular form suitable for further processing with generic machine-learning algorithms. GAIN is demoed as a component of a "SMART-TV" recommender system. Content is automatically...
We present a wikifier evaluation framework consisting of software support and two datasets (News and Tweets), which were derived from datasets previously published at WEKEX 2011 and MSM Challenge 2013. Entities recognized in the original datasets were enriched with new annotations -- a link to Wikipedia and the most specific type from the DBpedia O...
Targeted Hypernym Discovery (THD) performs unsupervised classification of entities appearing in text. A hypernym mined from the free-text of the Wikipedia article describing the entity is used as a class. The type as well as the entity are cross-linked with their representation in DBpedia, and enriched with additional types from DBpedia and YAGO kn...
EasyMiner is a web-based association rule mining software based on the LISp-Miner system. This paper presents a proof-of-concept workflow for learning business rules with EasyMiner from transactional data. The approved rules are exported to the Drools business rules engine in the DRL format. The main focus is the transformation of GUHA association...
Is it possible to determine a user's interest in media content only by observing the user's behavior? The aim of this project is to develop an application that can detect whether or not a user is viewing content on the TV and use this information to build the user profile and make it evolve dynamically. Our approach is based on the use of a...
I:ZI Miner (sewebar.vse.cz/izi-miner) is an association rule mining system with a user interface resembling a search engine. It brings to the web the notion of interactive pattern mining introduced by the MIME framework at ECML’11 and KDD’11. In comparison with MIME, I:ZI Miner discovers multi-valued attributes, supports the full range of logical...
UTA is a method for learning user preferences originally developed for multi-criteria decision making. UTA expects that the input attributes are monotone with respect to preferences, which limits the applicability of the method and requires manual input for each attribute. In this paper, we propose a heuristic attribute preprocessing algorithm that...
TV and Web convergence is becoming more and more a reality. This paper provides an overview of the opportunities and challenges that arise in future TV environments regarding unobtrusive, context-aware personalisation of digital media content. Subsequently, it describes the vision and first conceptual personalisation approach within the LinkedTV E...
SEWEBAR-CMS is a set of extensions for the Joomla! Content Management System (CMS) that extends it with functionality required to serve as a communication platform between the data analyst, domain expert and the report user. SEWEBAR-CMS integrates with existing data mining software through PMML. Background knowledge is entered via a web-based elici...
The long-pending research challenge of the association rule mining task is to identify, out of the multitude of discovered rules, the ones that are interesting for the domain expert. We will demonstrate a new feature of the SEWEBAR-CMS system that allows using any of the rules as a query-by-example, and in one click, discover whether this rule is i...
This paper introduces an attempt for a holistic approach to integrating background knowledge with PMML in the area of association rule mining. An XML format for capturing background knowledge inspired by PMML that scopes preprocessing and background knowledge association rules is presented. We argue that background knowledge should be kept separate...
Background knowledge (sometimes referred to as domain knowledge) is extensively used in data mining for data pre-processing and for nugget-oriented data mining tasks: it is essential for constraining the search space and pruning the results. Despite the costs of eliciting background knowledge from domain experts, there has so far been little effort to d...
This paper proposes the GUHA AR Model, an XML Schema-based formalism for representing the setting and results of association rule (AR) mining tasks. In contrast to the item-based representation of the PMML 4.0 AssociationModel, the proposed formalism expresses the association rule as a couple of general boolean attributes related by a condition on one or more...
The input for a Bag-of-Articles (BOA) classifier is a set of unlabeled entities (noun chunks) and a set of target labeled entities (Wikipedia articles). The classifier locates Wikipedia articles that might define the unlabeled entity and performs disambiguation by selecting one. Both the unlabeled and the labeled entities are represented with the proposed BOA te...
Competitive intelligence supports the decision makers in understanding the competitive environment by means of textual reports prepared based on public resources. CI is particularly demanding in the context of larger business clusters. We report on a long-term project featuring large-scale manual semantic annotation of CI reports wrt. business clu...
The principal problem of the association rule (AR) mining task is the selection of rules that might be interesting for the domain expert from the many rules typically generated by the software. SEWEBAR-CMS is a Joomla!-based Content Management System for post-processing AR models that supports the data analyst in this effort. The input for the sy...
Intelligent post-processing of data mining results can provide valuable knowledge. In this paper we present the first systematic solution to post-processing that is based on semantic web technologies. The framework input is constituted by PMML and description of background knowledge. Using the Topic Maps formalism, a generic Data Mining ontology an...
Tags are an efficient and effective way of organizing resources, but they are not always available. A technique called SCM/THD investigated in this paper extracts entities from free-text annotations and, using the Lin similarity measure over the WordNet thesaurus, classifies them into a controlled vocabulary of tags. Hypernyms extracted from W...
Competitive intelligence (CI) is a sub-discipline of business intelligence that supports the decision makers in understanding the competitive environment by means of textual reports prepared based on public resources. CI is particularly demanding in the context of larger business clusters. We report on a long-term project featuring large-scale...
The motivation of this paper is to enhance the user-perceived precision of results of Content Based Information Retrieval (CBIR) systems with Query Refinement (QR), Visual Analysis (VA) and Relevance Feedback (RF) algorithms. The proposed algorithms were implemented as modules in the K-Space CBIR system. The QR module discovers hypernyms for the give...
We present a framework for efficiently exploiting free-text annotations as a complementary resource to image classification. A novel approach called Semantic Concept Mapping (SCM) is used to classify entities occurring in the text to a custom-defined set of concepts. SCM performs unsupervised classification by exploiting the relations between comm...
Targeted Hypernym Discovery (THD) applies lexico-syntactic (Hearst) patterns on a suitable corpus with the intent to extract one hypernym at a time. Using Wikipedia as the corpus in THD has recently yielded promising results in a number of tasks. We investigate the reasons that make Wikipedia articles such an easy target for lexico-syntactic patte...
Semantically enriched web usage data have high dimensionality when represented as fixed-length vectors, which impairs the performance of many data mining algorithms as well as the comprehensibility for a human analyst. The work presented here introduces the visitor profile as a set of low-dimensional fixed-length vectors extracted from the clickstream of an ind...
The task of classifying entities appearing in textual annotations to an arbitrary set of classes has not been extensively researched, yet it is useful in multimedia retrieval. We proposed an unsupervised algorithm, which expresses entities and classes as WordNet synsets and uses the Lin measure to classify them. Real-time hypernym discovery from Wik...
UTA methods use linear programming techniques for finding additive utility functions that best explain stated preferences. However, most UTA methods, including the popular UTA-Star, are limited to monotonic preferences. UTA-NM (Non Monotonic) is inspired by UTA-Star but allows non-monotonic partial utility functions if they decrease total model error...
Projects
Projects (6)
Analyze and propose a Semantics-based Researcher Network.
The broader goal of the one-year project is to contribute to the worldwide research effort in knowledge engineering for research data. The three threads to be followed by the student members of the team (who are the main workforce of the project) are integrated research knowledge graph construction from disparate portals, relying on a larger number of ontological types of entities (people, research artifacts, research areas, projects, publications, events, etc.) in a balanced manner; modeling of the researcher's (specifically, a PhD student's) research career as it progresses in time; and modeling aspects of the scholarly review process (focusing on review criteria and their polarity values). The connecting glue of the whole project will be a modular system of (both newly developed and reused) ontologies, populated with instance data so as to create a reusable knowledge graph. The publication outcomes of the project will be two published conference papers and one manuscript submitted to a top-tier journal.
Extensions and research related to the UTA method. UTA uses linear programming to infer additive value/utility functions representing preferences. The research in this project focuses on relaxing the monotonicity assumption (in the original method the utility curves have to be monotonic).
Web-based framework for sensor-based recommendation and preference learning with linked data support.