An ontology database is a basic relational database management system that models an ontology plus its instances. To reason over the transitive closure of instances in the subsumption hierarchy, for example, an ontology database can either unfold views at query time or propagate assertions using triggers at load time. In this paper, we use existing benchmarks to evaluate our method, which uses triggers, and we demonstrate that by forward computing inferences, we not only improve query time, but the improvement appears to cost only more space (not time). However, we go on to show that the true penalties were simply opaque to the benchmark, i.e., the benchmark inadequately captures load-time costs. We have applied our methods to two case studies in biomedicine, using ontologies and data from genetics and neuroscience to illustrate two important applications: first, ontology databases answer ontology-based queries effectively; second, using triggers, ontology databases detect instance-based inconsistencies, something not possible using views. Finally, we demonstrate how to extend our methods to perform data integration across multiple, distributed ontology databases.
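The load-time propagation described above can be illustrated with a minimal sketch. It assumes a table-per-class layout in SQLite and a hypothetical three-class hierarchy (Mouse, Mammal, Animal); the schema, trigger names, and helper functions are illustrative assumptions, not the system evaluated in the paper.

```python
import sqlite3


def build_ontology_db():
    """Create an in-memory database with one table per class (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Animal (id TEXT PRIMARY KEY);
        CREATE TABLE Mammal (id TEXT PRIMARY KEY);
        CREATE TABLE Mouse  (id TEXT PRIMARY KEY);

        -- Mouse is subsumed by Mammal: propagate every new Mouse upward.
        CREATE TRIGGER mouse_is_mammal AFTER INSERT ON Mouse
        BEGIN
            INSERT OR IGNORE INTO Mammal (id) VALUES (NEW.id);
        END;

        -- Mammal is subsumed by Animal: the cascade yields the transitive closure.
        CREATE TRIGGER mammal_is_animal AFTER INSERT ON Mammal
        BEGIN
            INSERT OR IGNORE INTO Animal (id) VALUES (NEW.id);
        END;
    """)
    return conn


def instances_of(conn, class_name):
    """Return the sorted instance ids stored in one class table."""
    if class_name not in {"Animal", "Mammal", "Mouse"}:
        raise ValueError("unknown class")
    rows = conn.execute(f"SELECT id FROM {class_name} ORDER BY id").fetchall()
    return [row[0] for row in rows]


conn = build_ontology_db()
conn.execute("INSERT INTO Mouse (id) VALUES ('m1')")   # asserted as a Mouse
conn.execute("INSERT INTO Mammal (id) VALUES ('d1')")  # asserted as a Mammal
```

After loading, a query such as `instances_of(conn, "Animal")` is a plain table scan with no view unfolding. The extra rows in the superclass tables are the space cost the abstract refers to.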
The evaluation of the process of mining associations is an important and challenging problem in database systems and especially those that store critical data and are used for making critical decisions. Within the context of spatial databases we present an evaluation framework in which we use probability distributions to model spatial regions, and Bayesian networks to model the joint probability distribution and the structural relationships among spatial and non-spatial predicates. We demonstrate the applicability of the proposed framework by evaluating representatives from two well-known approaches that are used for learning associations, i.e., dependency analysis (using statistical tests of independence) and Bayesian methods. By controlling the parameters of the framework we provide extensive comparative results of the performance of the two approaches. We obtain measures of recovery of known associations as a function of the number of samples used, the strength, number and type of associations in the model, the number of spatial predicates associated with a particular non-spatial predicate, the prior probabilities of spatial predicates, the conditional probabilities of the non-spatial predicates, the image registration error, and the parameters that control the sensitivity of the methods. In addition to performance we investigate the processing efficiency of the two approaches.
Individuals spend a majority of their time in their home or workplace, and for many, these places are sanctuaries. As society and technology advance, there is a growing interest in improving the intelligence of the environments in which we live and work. By filling home environments with sensors and collecting data during daily routines, researchers can gain insights into human daily behavior and the impact of that behavior on the residents and their environments. In this article we provide an overview of the data mining opportunities and challenges that smart environments provide for researchers and offer some suggestions for future work in this area.
We consider the community detection problem from a partially observable network structure where some edges are not observable. Previous community detection methods are often based solely on the observed connectivity relation, and the above situation is not explicitly considered. Even when the connectivity relation is partially observable, if some profile data about the vertices in the network is available, it can be exploited as auxiliary or additional information. We propose to utilize a graph structure (called a profile graph) which is constructed via the profile data, and propose a simple model to utilize both the observable connectivity relation and the profile graph. Furthermore, instead of a hierarchical approach, based on the modularity matrix of the network structure, we propose an embedding approach which utilizes the regularization via the profile graph. Various experiments are conducted over social network data and a comparison with several state-of-the-art methods is reported. The results are encouraging and indicate that it is promising to pursue this line of research.
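For readers unfamiliar with the modularity matrix mentioned above, the following sketch computes it using the standard definition B_ij = A_ij - k_i k_j / (2m), where A is the adjacency matrix, k_i the degree of vertex i, and m the number of edges. It covers only this building block. The embedding step and the profile-graph regularization are not specified in the abstract and are not reproduced here. The example graph is an assumption.

```python
def modularity_matrix(adj):
    """Return B with B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected graph."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # each undirected edge is counted twice
    if two_m == 0:
        raise ValueError("graph has no edges")
    return [
        [adj[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
example_adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
example_b = modularity_matrix(example_adj)
```

Every row of the resulting matrix sums to zero, which is a quick sanity check on the implementation.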
In this paper, we show that the classical A.I. planning problem can be modelled using simple database constructs with logic-based semantics. The approach is similar to that used to model updates and nondeterminism in active database rules. We begin by showing that planning problems can be automatically converted to Datalog1S programs with nondeterministic choice constructs, for which we provide a formal semantics using the concept of stable models. The resulting programs are characterized by a syntactic structure (XY-stratification) that makes them amenable to efficient implementation using compilation and fixpoint computation techniques developed for deductive database systems. We first develop the approach for sequential plans, and then we illustrate its flexibility and expressiveness by formalizing a model for parallel plans, where several actions can be executed simultaneously. The characterization of parallel plans as partially ordered plans allows us to develop (parallel) versions of partially ordered plans that can often be executed faster than the original partially ordered plans.
Workflow Management Systems (WfMSs) are used to support the modeling and coordinated execution of business processes within an organization or across organizational boundaries. Although some research efforts have addressed requirements for authorization and access control for workflow systems, little attention has been paid to the requirements as they apply to application data accessed or managed by WfMSs. In this paper, we discuss key access control requirements for application data in workflow applications using examples from the healthcare domain, introduce a classification of application data used in workflow systems by analyzing their sources, and then propose a comprehensive data authorization and access control mechanism for WfMSs. This involves four aspects: role, task, process instance-based user group, and data content. For implementation, a predicate-based access control method is used. We believe that the proposed model is applicable to workflow applications and WfMSs with diverse access control requirements.
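The four aspects named above (role, task, process instance-based user group, and data content) can be combined in a single predicate check. The sketch below is only an illustration under assumed names: the `Rule` and `Request` classes, their fields, and the healthcare example values are hypothetical and are not taken from the proposed mechanism.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    user: str
    role: str
    task: str
    instance_users: frozenset  # users who took part in this process instance
    record: dict               # the application data being accessed


@dataclass
class Rule:
    role: str
    task: str
    require_instance_member: bool
    content_predicate: Callable[[dict], bool]


def is_allowed(rule, req):
    """Grant access only if all four aspects of the rule are satisfied."""
    return (
        req.role == rule.role
        and req.task == rule.task
        and (not rule.require_instance_member or req.user in req.instance_users)
        and rule.content_predicate(req.record)
    )


# Hypothetical rule: a physician reviewing lab results may read cardiology
# records, but only within a process instance they participated in.
rule = Rule("physician", "review_lab_result", True,
            lambda rec: rec.get("department") == "cardiology")
record = {"patient": "p-17", "department": "cardiology"}
ok_request = Request("alice", "physician", "review_lab_result",
                     frozenset({"alice", "bob"}), record)
bad_request = Request("carol", "physician", "review_lab_result",
                      frozenset({"alice", "bob"}), record)
```

Here `is_allowed(rule, ok_request)` holds, while `bad_request` is rejected because the user is not in the instance-based group.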
The amount of information available to information workers has recently become overwhelming. This confronts information workers with two major problems: finding the information needed, and accessing it; they are called the search problem and the access problem, respectively. As the main result of our research, an architecture is specified of an automated tool that provides integrated support for searching and accessing multimedia documents that may be located at arbitrary places. The architecture contains a database with information about the documents and with thesaurus-like information. The architecture also contains a browse mechanism and a query mechanism for inspecting the database. In the design process of the architecture, several fundamental questions arose, like "What is a document?" and "What is a medium kind?". The developed answers to some of these questions are considered to have a general character and thus to be useful also outside the scope of the research at hand. The paper concludes with an overview of the current status of the project and a discussion of future work.
In this paper, we describe a general approach to scaling data mining applications that we have come to call meta-learning. Meta-learning refers to a general strategy that seeks to learn how to combine a number of separate learning processes in an intelligent fashion. We desire a meta-learning architecture that exhibits two key behaviors. First, the meta-learning strategy must produce an accurate final classification system. This means that a meta-learning architecture must produce a final outcome that is at least as accurate as a conventional learning algorithm applied to all available data. Second, it must be fast relative to an individual sequential learning algorithm when applied to massive databases of examples, and operate in a reasonable amount of time. This paper focuses primarily on issues related to the accuracy and efficacy of meta-learning as a general strategy. A number of empirical results are presented demonstrating that meta-learning is technically feasible in wide-area, network computing environments.
Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research (Buckles and Petry, 1987; Bosc et al., 1988; Bosc and Pivert, 1991; Kacprzyk et al., 1989; Prade and Testemale, 1987; Tahani, 1977; Umano, 1983; Zemankova and Kandel, 1985). Such queries place severe stress on the indexing and I/O subsystems of conventional database systems since they frequently involve the search of large numbers of records. The Datacycle (Datacycle is a trademark of Bellcore.) architecture and research prototype is a database processing system that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides high-performance query throughput, permits the use of ad hoc fuzzy membership functions, and provides deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
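To make the idea of an index-free, exhaustive scan with an ad hoc membership function concrete, here is a small software sketch. It is not the Datacycle filtering hardware or its query interface. The membership function `tall`, the record layout, and the threshold are all assumptions chosen for illustration.

```python
def tall(height_cm):
    """Hypothetical fuzzy membership: 0 at or below 160 cm, 1 at or above 190 cm."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30


def fuzzy_scan(records, membership, field, threshold):
    """Scan every record (no index) and keep those whose degree meets the threshold."""
    hits = []
    for rec in records:
        degree = membership(rec[field])
        if degree >= threshold:
            hits.append((rec["name"], degree))
    # Highest membership degree first.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


people = [
    {"name": "ann", "height": 150},
    {"name": "bo", "height": 175},
    {"name": "cy", "height": 195},
]
matches = fuzzy_scan(people, tall, "height", 0.5)
```

Because every record is visited once regardless of the predicate, the cost of the scan depends on database size rather than on how complicated the membership function is.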
Adaptation in open, multi-agent information gathering systems is important for several reasons. These reasons include the inability to accurately predict future problem-solving workloads, future changes in existing information requests, future failures and additions of agents and data supply resources, and other future task environment characteristic changes that require system reorganization. We have developed a multi-agent distributed system infrastructure, RETSINA (REusable Task Structure-based Intelligent Network Agents), that handles adaptation in an open Internet environment. Adaptation occurs both at the individual agent level and at the overall agent organization level. The RETSINA system has three types of agents. Interface agents interact with the user, receiving user specifications and delivering results. They acquire, model, and utilize user preferences to guide system coordination in support of the user's tasks. Task agents help users perform tasks by formulating problem solving plans and carrying out these plans through querying and exchanging information with other software agents. Information agents provide intelligent access to a heterogeneous collection of information sources. In this paper, we concentrate on the adaptive architecture of the information agents. We use as the domain of application WARREN, a multi-agent financial portfolio management system that we have implemented within the RETSINA framework.
The paper presents a network model that can be used to produce conceptual and logical schemas for Information Retrieval applications. The model has interesting adaptability characteristics and can be instantiated in various effective ways. The paper also reports the results of an experimental investigation into the effectiveness of implementing associative and adaptive retrieval on the proposed model by means of Neural Networks. The implementation makes use of the learning and generalisation capabilities of the Backpropagation learning algorithm to build up and use application domain knowledge in a sub-symbolic form. The knowledge is acquired from examples of queries and relevant documents. Three different learning strategies are introduced, and their performance is analysed and compared with the performance of a traditional Information Retrieval system.
Today's workflow management systems (WFMSs) are only applicable in a secure and safe manner if the business process (BP) to be supported is well-structured and there is no need for ad hoc deviations at run-time. As only few BPs are static in this sense, this significantly limits the applicability of current workflow (WF) technology. On the other hand, to support dynamic deviations from premodeled task sequences must not mean that the responsibility for the avoidance of consistency problems and run-time errors is now completely shifted to the (naive) end user. In this paper we present a formal foundation for the support of dynamic structural changes of running WF instances. Based upon a formal WF model (ADEPT), we define a complete and minimal set of change operations (ADEPTflex) that support users in modifying the structure of a running WF, while maintaining its (structural) correctness and consistency. The correctness properties defined by ADEPT are used to determine whether a specific change can be applied to a given WF instance or not. If these properties are violated, the change is either rejected or the correctness must be restored by handling the exceptions resulting from the change. We discuss basic issues with respect to the management of changes and the undoing of temporary changes at the instance level. Recently we have started the design and implementation of ADEPTworkflow, the ADEPT workflow engine, which will make use of the change facilities presented in this paper.
In concept learning and data mining tasks, the learner is typically faced with a choice of many possible hypotheses or patterns characterizing the input data. If one can assume that training data contain no noise, then the primary conditions a hypothesis must satisfy are consistency and completeness with regard to the data. In real-world applications, however, data are often noisy, and the insistence on the full completeness and consistency of the hypothesis is no longer valid. In such situations, the problem is to determine a hypothesis that represents the best trade-off between completeness and consistency. This paper presents an approach to this problem in which a learner seeks rules optimizing a rule quality criterion that combines the rule coverage (a measure of completeness) and training accuracy (a measure of consistency). These factors are combined into a single rule quality measure through a lexicographical evaluation functional (LEF). The method has been implemented in the AQ18 learning system for natural induction and pattern discovery, and compared with several other methods. Experiments have shown that the proposed method can be easily tailored to different problems and can simulate different rule learners by modifying the parameter of the rule quality criterion.
We present a flexible retrieval system of face photographs based on their linguistic descriptions in terms of fuzzy predicates. Such expressions are a natural way of describing a (facial) image. However, due to their subjectivity they may lead to poor performance of the retrieval operation. Regardless of the initial design of a retrieval system, its capability of adjustment to different users becomes very important. This paper explores the use of fuzzy logic techniques for (i) describing image data, (ii) inference for retrieval, and (iii) inference for adjustment to a new user. The work presented in this paper builds on an earlier image modeling and retrieval system, and we demonstrate the feasibility of adjustment to individual users and the improvement resulting from it.
Traditionally the probabilistic ranking principle is used to rank the search
results while the ranking based on expected profits is used for paid placement
of ads. These rankings try to maximize the expected utilities based on the user
click models. Recent empirical analysis on search engine logs suggests a
unified click model for both ranked ads and search results. The segregated
view of document and ad rankings does not consider this commonality. Further,
the models in use include parameters for (i) the probability of the user abandoning
browsing results and (ii) the perceived relevance of result snippets. How to
exploit them for improved ranking is currently unknown. In this paper, we
propose a generalized ranking function, namely "Click Efficiency (CE)", for
documents and ads based on empirically proven user click models. The ranking
considers parameters (i) and (ii) above, is optimal, and has the same time
complexity as sorting. To exploit its generality, we examine the reduced forms
of CE ranking under different assumptions enumerating a hierarchy of ranking
functions. Some of the rankings in the hierarchy are currently used ad and
document ranking functions, while others suggest new rankings. While optimality
of ranking is sufficient for document ranking, applying CE ranking to ad
auctions requires an appropriate pricing mechanism. We incorporate a
second-price-based pricing mechanism with the proposed ranking. Our analysis proves
several desirable properties including revenue dominance over VCG for the same
bid vector and existence of a Nash Equilibrium in pure strategies. The
equilibrium is socially optimal, and revenue equivalent to the truthful VCG
equilibrium. Further, we relax the independence assumption in CE ranking and
analyze the diversity ranking problem. We show that optimal diversity ranking
is NP-Hard in general, and that a constant time approximation is unlikely.
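The two traditional baselines named at the start of this abstract can be written as plain sorts, which also shows why a ranking function with the time complexity of sorting is attractive. This sketch covers only those baselines: documents ordered by relevance probability (the probabilistic ranking principle) and ads ordered by expected profit, taken here as bid times click probability. It does not implement the CE function, whose formula is not given in the abstract. Field names and numbers are assumptions.

```python
def rank_documents(docs):
    """Probabilistic ranking principle: sort by decreasing relevance probability."""
    return sorted(docs, key=lambda d: d["relevance"], reverse=True)


def rank_ads(ads):
    """Expected-profit ranking: sort by decreasing bid * click probability."""
    return sorted(ads, key=lambda a: a["bid"] * a["click_prob"], reverse=True)


# Hypothetical inputs.
docs = [
    {"id": "d1", "relevance": 0.2},
    {"id": "d2", "relevance": 0.9},
    {"id": "d3", "relevance": 0.5},
]
ads = [
    {"id": "a", "bid": 2.0, "click_prob": 0.1},   # expected profit 0.2
    {"id": "b", "bid": 1.0, "click_prob": 0.3},   # expected profit 0.3
    {"id": "c", "bid": 5.0, "click_prob": 0.01},  # expected profit 0.05
]
doc_order = [d["id"] for d in rank_documents(docs)]
ad_order = [a["id"] for a in rank_ads(ads)]
```

Both functions score each item independently of the others, which is the independence assumption the abstract later relaxes for diversity ranking.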
The situation calculus is a versatile logic for reasoning about actions and formalizing dynamic domains. Using the non-Markovian
action theories formulated in the situation calculus, one can specify and reason about the effects of database actions under
the constraints of the classical, flat database transactions, which constitute the state of the art in database systems. Classical
transactions are characterized by the so-called ACID properties. With non-Markovian action theories, one can also specify,
reason about, and even synthesize various extensions of the flat transaction model, generally called advanced transaction models (ATMs). In this paper, we show how to use non-Markovian theories of the situation calculus to specify and reason about the
properties of ATMs. In these theories, one may refer to past states other than the previous one. ATMs are expressed as such
non-Markovian theories using the situation calculus. We illustrate our method by specifying (and sometimes reasoning about
the properties of) several classical models and known ATMs.
Keywords: Knowledge representation; Situation calculus; Transaction models; Reasoning about actions; Non-Markovian control; Logical foundations
IMPACT (Interactive Maryland Platform for Agents Collaborating Together) provides a platform and environment for agent and software interoperability being developed as a joint, multinational effort with participants from the University of Maryland, the Technische Universität Wien, Bar-Ilan University, the University of Koblenz, and the Università di Torino. Here, we describe the overall architecture of the IMPACT system, and outline how this architecture (i) allows agents to be developed either from scratch, or by extending legacy code-bases, (ii) allows agents to interact with one another, (iii) allows agents to have a variety of capabilities (reactive, autonomous, intelligent, mobile, replicating) and behaviors, and (iv) provides a variety of infrastructural services that may be used by agents to interact with one another.
This paper presents a multi-agent model of a distributed information system, using what is described as an engineering approach to real-world application environments. The objective is to define, using proven ideas in the industrial context, the agent-based behaviour of the distributed system, which must operate correctly and effectively in an error-prone environment. Issues such as stability, robustness and scalability have also been addressed, along with some new ideas on high-level communication strategies, as distinct from protocol-based communications. The work is being carried out under the DREAM theme at Keele, an earlier version of the approach having been successfully applied to agent-based manufacturing in an international project called HMS, in which some of the world’s major manufacturing industries participated.
Transportable agents are autonomous programs. They can move through a heterogeneous network of computers, migrating from host to host under their own control. They can sense the state of the network, monitor software conditions, and interact with other agents or resources. The network-sensing tools allow our agents to adapt to the network configuration and to navigate under the control of reactive plans. In this paper we describe the design and implementation of a transportable-agent system and focus on navigation tools that give our agents autonomy. We also discuss the intelligent and adaptive behavior of autonomous agents in distributed information-access tasks.
Nowadays a vast amount of textual information is collected and stored in various databases around the world, including the
Internet as the largest database of all. This rapidly increasing growth of published text means that even the most avid reader
cannot hope to keep up with all the reading in a field and consequently the nuggets of insight or new knowledge are at risk
of languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing
the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents
to discover previously unknown information. Text clustering is one of the most important areas in text mining, which includes
text preprocessing, dimension reduction by selecting some terms (features) and finally clustering using selected terms. Feature
selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define
a measure of the discriminating power of terms to select proper terms from the corpus. However, up to now the evaluation of terms
in groups has not been investigated in reported works. In this paper, a new and robust unsupervised feature selection approach
is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measuring method is proposed for evaluating
groups of terms. Furthermore, a genetic-based algorithm is designed and implemented for finding the most valuable groups of
terms based on the new measure. These terms are then utilized to generate the final feature vector for the clustering
process. In order to evaluate and justify our approach, the proposed method and a conventional term variance method are
implemented and tested using the Reuters-21578 corpus collection. For a more accurate comparison, the methods have been tested on
three corpora; for each corpus, the clustering task has been run ten times and the results averaged. Results of comparing
these two methods are very promising and show that our method produces better average accuracy and F1-measure than the conventional
term variance method.
Keywords: Text clustering; Unsupervised feature selection; Genetic algorithm
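As background for the conventional baseline mentioned above, the sketch below scores each term individually by term variance, taken here in one common form: the sum over documents of the squared deviation of the term's frequency from its mean frequency. This is an assumed formulation of the baseline only. The Modified Term Variance measure and the genetic search over groups of terms are not specified in the abstract and are not reproduced. The toy documents are hypothetical.

```python
def term_variance(docs, term):
    """Sum of squared deviations of a term's frequency from its mean over all documents."""
    freqs = [doc.get(term, 0) for doc in docs]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def select_terms(docs, k):
    """Return the k terms with the highest term variance (ties broken alphabetically)."""
    vocabulary = sorted({term for doc in docs for term in doc})
    ranked = sorted(vocabulary, key=lambda t: (-term_variance(docs, t), t))
    return ranked[:k]


# Hypothetical corpus: each document is a mapping from term to frequency.
toy_docs = [
    {"gene": 2, "the": 1},
    {"the": 1},
    {"gene": 4, "the": 1, "market": 1},
]
top_terms = select_terms(toy_docs, 2)
```

A term such as "the", which occurs equally often everywhere, gets variance zero and is dropped first, which is the discriminating-power intuition behind the measure.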
Proxy servers are common solutions to relieve organizational networks from heavy traffic by storing the most frequently referenced web objects in their local cache. These proxies are commonly known as cooperative proxy systems and are usually organized in such a way as to optimize the utilization of their storage capacity. However, the design of the organizational structure of such a proxy system depends heavily on the designer's knowledge of the network's performance. This article describes three methods to tackle this load balancing problem. They allow the self-organization of proxy servers by modeling each server as an autonomous entity that can make local decisions based on the traffic pattern it has served.
The improvements in disk speeds have not kept up with improvements in processor and memory speeds. Many techniques have been proposed and utilized to maximize the bandwidths of storage devices. These techniques have proven useful for conventional data, but when applied to multimedia data, they tend to be insufficient or inefficient due to the diversified data types, bandwidth requirements, file sizes and structures of complex objects of multimedia data. In this paper, we discuss the design of an efficient multimedia object allocation strategy that strives to achieve the expected retrieval rates and I/O computational requirements of objects, and also effectively balances the loads on the storage devices. We define a multimedia object model and describe the multimedia object and storage device characteristics, the classification of the multimedia objects according to their I/O requirements, and the fragmentation strategies. We use a bipartite graph model for mapping of fragments to storage devices. A cost function based on disk utilization per allocated space, the amount of free space, and the bandwidth of a storage device is used to determine the optimal allocation for an object's data.
SEWEBAR-CMS is a set of extensions for the Joomla! Content Management System (CMS) that extends it with functionality required
to serve as a communication platform between the data analyst, domain expert and the report user. SEWEBAR-CMS integrates with
existing data mining software through PMML. Background knowledge is entered via a web-based elicitation interface and is preserved
in documents conforming to the proposed Background Knowledge Exchange Format (BKEF) specification. SEWEBAR-CMS offers web
service integration with semantic knowledge bases, into which PMML and BKEF data are stored. Combining domain knowledge and
mining model visualizations with results of queries against the knowledge base, the data analyst conveys the results of the
mining through a semi-automatically generated textual analytical report to the end user. The paper demonstrates the use of
SEWEBAR-CMS on a real-world task from the cardiological domain and presents a user study showing that the proposed report
authoring support leads to a statistically significant decrease in the time needed to author the analytical report.
Keywords: Data mining; Association rules; Background knowledge; Semantic web; Content management systems; Topic maps
An overview of the principal feature subset selection methods is given. We investigate a number of measures of feature subset quality, using large commercial databases. We develop an entropic measure, based upon the information gain approach used within ID3 and C4.5 to build trees, which is shown to give the best performance over our databases. This measure is used within a simple feature subset selection algorithm and the technique is used to generate subsets of high quality features from the databases. A simulated annealing based data mining technique is presented and applied to the databases. The performance using all features is compared to that achieved using the subset selected by our algorithm. We show that a substantial reduction in the number of features may be achieved together with an improvement in the performance of our data mining system. We also present a modification of the data mining algorithm, which allows it to simultaneously search for promising feature subsets and high quality rules. The effect of varying the generality level of the desired pattern is also investigated.
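The information gain criterion that the entropic measure builds on can be sketched as follows. This is the textbook single-feature information gain used in ID3, shown for orientation only. The paper's own measure of feature subset quality is not defined in the abstract and may differ. The toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: one feature predicts the class perfectly, the other not at all.
classes = [1, 1, 0, 0]
predictive = ["x", "x", "y", "y"]
irrelevant = ["x", "y", "x", "y"]
gain_predictive = information_gain(predictive, classes)
gain_irrelevant = information_gain(irrelevant, classes)
```

Ranking features by such a score and keeping the top few is one simple way to realise a feature subset selection step of the kind described above.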
This paper presents and evaluates a simple but very effective method to implement large data warehouses on an arbitrary number of computers, achieving very high query execution performance and scalability. The data is distributed and processed in a potentially large number of autonomous computers using our technique called data warehouse striping (DWS). The major problem of the DWS technique is that it would require a very expensive cluster of computers with fault tolerant capabilities to prevent a fault in a single computer from stopping the whole system. In this paper, we propose a radically different approach to deal with the problem of the unavailability of one or more computers in the cluster, allowing the use of DWS with a very large number of inexpensive computers. The proposed approach is based on approximate query answering techniques that make it possible to deliver an approximate answer to the user even when one or more computers in the cluster are not available. The evaluation presented in the paper shows both analytically and experimentally that the approximate results obtained this way have a very small error that can be negligible in most cases.
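One way to picture approximate answering over a striped warehouse is sketched below. It assumes that fact rows are striped round-robin across nodes and that a SUM over the available nodes is scaled up by the ratio of total to available nodes. Both the striping rule and the scaling estimator are assumptions made for illustration and are not taken from the paper.

```python
def stripe(rows, n_nodes):
    """Distribute rows round-robin so every node holds a similar share of the facts."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def approximate_sum(nodes, available):
    """Estimate the global SUM from the available nodes only, scaled to the full cluster."""
    if not available:
        raise ValueError("no node is available")
    partial = sum(sum(nodes[i]) for i in available)
    return partial * len(nodes) / len(available)


# Hypothetical fact values 1..100 striped over five nodes.
cluster = stripe(list(range(1, 101)), 5)
exact = approximate_sum(cluster, [0, 1, 2, 3, 4])   # all nodes up
estimate = approximate_sum(cluster, [1, 2, 3, 4])   # node 0 unavailable
```

In this toy case the exact total is 5050 and the estimate with one node down is 5100, an error of about one percent. The point of the sketch is that even striping makes each node's share resemble a sample of the whole.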