Data & Knowledge Engineering

Published by Elsevier
Print ISSN: 0169-023X
Publications
Distribution design for databases usually addresses the problems of fragmentation, allocation and replication. However, the major purposes of distribution are to improve performance and to increase system reliability. The former aspect is particularly relevant in cases where the desire to distribute originates from the distributed nature of an organisation with many data needs arising only locally, i.e., some data is retrieved and processed at only one or at most very few locations. Therefore, query optimisation should be treated as an intrinsic part of distribution design. In this paper the effects of fragmentation in databases on query processing are investigated using a query cost model. The considered databases are defined on higher-order data models, i.e., they capture complex value, object oriented and XML-based databases. The emphasis on higher-order data models enables a large variety of schema fragmentations, while at the same time it imposes restrictions on the way schemata can be fragmented. It is shown that the allocation of locations to the nodes of an optimised query tree is only marginally affected by the allocation of fragments. This implies that optimisation of query processing and optimisation of fragment allocation are largely orthogonal to each other, leading to several scenarios for fragment allocation. If elementary fragmentation operations are ordered according to their likelihood of affecting the query costs, a binary search procedure can be adopted to find an “optimal” fragmentation and allocation. We underline these findings with experimental results.
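One reading of the binary search idea, as a minimal sketch: assume the elementary fragmentation operations are sorted by their expected impact on query cost and that the estimated cost grows monotonically with the number of operations applied. The helpers apply_ops and query_cost below are hypothetical stand-ins for the paper's fragmentation operations and cost model, not its actual interfaces.

def best_prefix(ops, schema, cost_budget, apply_ops, query_cost):
    # Binary-search the longest prefix of impact-ordered fragmentation
    # operations whose application keeps the estimated query cost within
    # the budget; assumes the cost is monotone in the prefix length.
    lo, hi = 0, len(ops)              # candidate prefix lengths
    while lo < hi:
        mid = (lo + hi + 1) // 2      # try applying the first `mid` operations
        fragmented = apply_ops(schema, ops[:mid])
        if query_cost(fragmented) <= cost_budget:
            lo = mid                  # still within budget: keep at least mid ops
        else:
            hi = mid - 1              # too costly: shrink the prefix
    return ops[:lo]

Under the monotonicity assumption the loop needs only O(log n) cost evaluations instead of one per candidate prefix.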
 
Figure titles from the following article: The Problem-Solving Method Heuristic Classification; Sample Task Structure for Diagnosis; Expertise Model for medical diagnosis (simplified CML notation); Steps and documents in the MIKE development process; Ontologies in PROTÉGÉ-II.
This paper gives an overview of the development of the field of Knowledge Engineering over the last 15 years. We discuss the paradigm shift from a transfer view to a modeling view and describe two approaches which considerably shaped research in Knowledge Engineering: Role-limiting Methods and Generic Tasks. To illustrate various concepts and methods which evolved in recent years we describe three modeling frameworks: CommonKADS, MIKE and PROTÉGÉ-II. This description is supplemented by discussing some important methodological developments in more detail: specification languages for knowledge-based systems, problem-solving methods and ontologies. We conclude by outlining the relationship of Knowledge Engineering to Software Engineering, Information Integration and Knowledge Management.
 
The thematic and citation structures of Data and Knowledge Engineering (DKE) (1985–2007) are identified based on text analysis and citation analysis of the bibliographic records of full papers published in the journal. Temporal patterns are identified by detecting abrupt increases of frequencies of noun phrases extracted from titles and abstracts of DKE papers over time. Conceptual structures of the subject domain are identified by clustering analysis. Concept maps and network visualizations are presented to illustrate salient patterns and emerging thematic trends. A variety of statistics are reported to highlight key contributors and DKE papers that have made profound impacts.
 
By using data mining techniques, the data stored in a Data Warehouse (DW) can be analyzed for the purpose of uncovering and predicting hidden patterns within the data. So far, different approaches have been proposed to accomplish the conceptual design of DWs by following the multidimensional (MD) modeling paradigm. In previous work, we have proposed a UML profile for DWs enabling the specification of the main MD properties at the conceptual level. This paper presents a novel approach to integrating data mining models into multidimensional models in order to accomplish the conceptual design of DWs with Association Rules (AR). To this end, we extend our previous work by providing another UML profile that allows us to specify Association Rules mining models for DWs at the conceptual level in a clear and expressive way. The main advantage of our proposal is that the Association Rules rely on the goals and user requirements of the Data Warehouse, instead of the traditional approach of specifying Association Rules by considering only the final database implementation structures such as tables, rows or columns. In this way, ARs are specified in the early stages of a DW project, thus reducing the development time and cost. Finally, in order to show the benefits of our approach, we have implemented the specified Association Rules on a commercial database management server.
 
Although ontologies have been proposed as an important and natural means of representing real-world knowledge for the development of database designs, most ontology creation is not carried out systematically. To be truly useful, a repository of ontologies organized by application domain is needed, along with procedures for creating and integrating ontologies into database design methodologies. This research proposes a methodology for creating and managing domain ontologies. An architecture for an ontology management system is presented and implemented in a prototype. Empirical validation of the prototype demonstrates the effectiveness of the research.
 
In this study, four different 2D dual-tree complex wavelet transform (DT-CWT) based texture feature extraction methods are developed and their applications are demonstrated in segmenting and classifying tissues. Two of the methods use rotation variant texture features and the other two use rotation invariant features. This paper also proposes a novel approach to estimating the 3D orientations of tissues based on rotation variant DT-CWT features. The method updates the strongest structural anisotropy direction with an iterative approach and converges to a volume orientation in a few steps. Although classification and segmentation results show that there is no significant difference in performance between rotation variant and invariant features, the latter are more robust to changes in texture rotation, which is essential for classification and segmentation of objects from 3D datasets such as medical tomography images.
 
Information overload is today a serious concern that may hinder the potential of modern web-based information systems. A promising approach to dealing with this problem is represented by knowledge extraction methods able to produce artifacts (also called patterns) that concisely represent data. Patterns are usually quite heterogeneous and voluminous. So far, little emphasis has been placed on developing an overall integrated environment for uniformly representing and querying different types of patterns. In this paper we consider the larger problem of modeling, storing, and querying patterns in a database-like setting, and use a Pattern-Base Management System (PBMS) for this purpose. Specifically, (a) we formally define the logical foundations for the global setting of pattern management through a model that covers data, patterns, and their intermediate mappings; (b) we present a formalism for pattern specification along with safety restrictions; and (c) we introduce predicates for comparing patterns and query operators.
 
Distributed data mining applications, such as those dealing with health care, finance, counter-terrorism and homeland defence, use sensitive data from distributed databases held by different parties. This comes into direct conflict with an individual’s need and right to privacy. In this paper, we propose a privacy-preserving distributed association rule mining protocol based on a new semi-trusted mixer model. Our protocol can protect the privacy of each distributed database against a coalition of up to n − 2 other data sites, or even against the mixer if the mixer does not collude with any data site. Furthermore, our protocol needs only two communications between each data site and the mixer in one round of data collection.
 
It is widely accepted that the conceptual schema of a data warehouse must be structured according to the multidimensional model. Moreover, it has been suggested that the ideal scenario for deriving the multidimensional conceptual schema of the data warehouse would consist of a hybrid approach (i.e., a combination of data-driven and requirement-driven paradigms). Thus, the resulting multidimensional schema would satisfy the end-user requirements and would be conciliated with the data sources. Most current methods follow either a data-driven or requirement-driven paradigm and only a few use a hybrid approach. Furthermore, hybrid methods are unbalanced and do not benefit from all of the advantages brought by each paradigm. In this paper we present our approach for multidimensional design. The most relevant step in our framework is Multidimensional Design by Examples (MDBE), which is a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements. MDBE introduces several advantages over previous approaches, which can be summarized as three main contributions. (i) The MDBE method is a fully automatic approach that handles and analyzes the end-user requirements automatically. (ii) Unlike data-driven methods, we focus on data of interest to the end-user. However, the user may not be aware of all the potential analyses of the data sources and, in contrast to requirement-driven approaches, MDBE can propose new multidimensional knowledge related to concepts already queried by the user. (iii) Finally, MDBE proposes meaningful multidimensional schemas derived from a validation process. Therefore, the proposed schemas are sound and meaningful.
 
Concept-based query languages allow users to specify queries directly against conceptual schemas. The primary goal of their development is ease-of-use and user-friendliness. However, existing concept-based query languages require the end-user to explicitly specify query paths in totality, thereby rendering such systems not as easy to use and user-friendly as they could be. The conceptual query language (CQL) discussed in this paper also allows end-users to specify queries directly against the conceptual schemas of database applications, using concepts and constructs that are native to and exist on the schemas. Unlike other existing concept-based query languages, however, CQL queries are abbreviated, i.e., the entire path of a query does not have to be specified. CQL is, therefore, an abbreviated concept-based query language. CQL is developed with the aim of combining the ease-of-use and user-friendliness of concept-based languages with the power of formal languages. It does not require end-users to be familiar with the structure and organization of the application database, but only with the content. Therefore, it makes minimal demands on end-users' cognitive knowledge of database technology without sacrificing expressive power. In this paper, the formal semantics and the theoretical basis of CQL are presented. It is shown that, while CQL is easy to use and user-friendly, it is nonetheless more than first-order complete. A contribution of this study is the use of the semantic roles played by entities in their associations with other entities to support abbreviated conceptual queries. Although only mentioned here in passing, a prototype of CQL has been implemented as a front-end to a relational database manager.
 
In the early phases of requirements engineering, inspecting previously constructed conceptual models is important both for reusing specifications and for gaining general knowledge about how similar problems have been solved in other projects. However, as the users at this stage have only vague ideas about the system to be implemented, they are seldom able to specify a query to the CASE repository that is precise enough to capture the relevant conceptual models. In this paper, we present an information retrieval approach that frees the user from knowing the details of the modeling languages used in the repository and helps him retrieve models from other domains that are structurally similar to the one he intends to build. The system exploits the linguistic and semantic properties of the query to suggest what kind of representation the user should look for, and it afterwards retrieves all models that are consistent with the query structure and the semantic representations of the words included in the query.
 
A methodology for the in situ migration of a handcrafted Directed Acyclic Graph (DAG) to a formal and expressive OWL version is presented. Well-known untangling methodologies recommend wholesale re-coding. Unable to do this, we have tackled portions of the DAG, lexically dissecting term names into property-based descriptions in OWL. The different levels of expressivity are presented in a model called the “feature escalator”, where the user can choose the level needed for the application and the expressivity required to deliver it. The results of applying the methodology to some areas of the Gene Ontology (GO) demonstrate the validity of the methodology.
 
We introduce a mathematical framework in which a formal semantics for object identity can be built irrespective of computer-related notions such as object identifiers, memory allocation, etc. On this basis, we then build formal semantics for a few major constructs of conceptual modeling (CM) such as association, aggregation, generalization, and isA- and isPartOf-relationships. We also give a formal meaning to the two fundamental dichotomies of CM: objects vs. values and entities vs. relationships. On the syntactical side, the language we use for specifying our formal semantic constructs is graph-based and brief: specifications are directed graphs consisting of only three kinds of items: nodes, arrows and marked diagrams. The latter are configurations of nodes and arrows closed in some technical sense and marked with predicate labels taken from a predefined signature. We show that this format provides a universal abstract syntax for the entire CM field. Any particular CM notation then appears as a particular visualization superstructure (concrete syntax) over the same basic specification format.
 
This paper deals with the issue of abstracting a data source characterized by one among several possible representation formats. First we show that data source abstraction plays a central role in several important application problems in the area of information system design. Then we propose a new approach which is capable of semi-automatically carrying out the abstraction of a data source possibly encoded according to one among a variety of formats such as structured databases, OEM graphs and XML documents. The capability to handle heterogeneous formats is obtained via the usage of a particular conceptual model, called SDR-Network, which is able to uniformly represent and handle data sources with different formats. As a significant application of the presented data source abstraction algorithm, the construction of an Intensional Repository is also illustrated.
 
This paper presents two techniques to integrate and abstract database schemes. The techniques assume the existence of a collection of interscheme properties describing semantic relationships holding among input database scheme objects. The former technique uses interscheme properties to produce an integrated scheme encoding a global, unified view of the whole semantics represented within the input schemes. The latter takes an (integrated) scheme as input and yields an abstracted scheme encoding the same semantics as the input scheme, but represented at a higher, application-dependent abstraction level. In addition, the paper illustrates a possible application of these algorithms to the construction of a data repository. Finally, the paper presents the application of the proposed techniques to some database schemes of Italian Central Governmental Offices.
 
We present an object algebra for manipulating complex objects in object-oriented database systems. All operators are recursively defined. Unlike most existing query languages, the design of this object algebra is based on aggregation abstraction. It allows complex objects to be taken collectively as a unit in high-level queries and enables complex objects to be accessed at all levels of aggregation hierarchies without resorting to any kind of path expressions. Features of aggregation abstraction, such as the acyclicity of aggregation hierarchies and aggregation inheritance, have played important roles in this development. We also formally describe the output type of each operator in order to support dynamic classification of query results in the IsA type/class semi-lattice. Algebraic-equivalence rewriting rules for query optimization of this algebra are also developed.
 
The relational data model has become the standard for mainstream database processing despite its well-known weakness in the area of representing application semantics. The research community's response to this situation has been the development of a collection of semantic data models that allow more of the meaning of information to be represented in a database. The primary tool for accomplishing this has been the use of various data abstractions, most commonly inclusion, aggregation and association. This paper develops a general model for analyzing data abstractions, and then applies it to these three best-known abstractions.
 
Flat graphical, conceptual modeling techniques are widely accepted as visually effective ways in which to specify and communicate the conceptual data requirements of an information system. Conceptual schema diagrams provide modelers with a picture of the salient structures underlying the modeled universe of discourse, in a form that can readily be understood by and communicated to users, programmers and managers. When complexity and size of applications increase, however, the success of these techniques in terms of comprehensibility and communicability deteriorates rapidly. This paper proposes a method to offset this deterioration, by adding abstraction layers to flat conceptual schemas. We present an algorithm to recursively derive higher levels of abstraction from a given (flat) conceptual schema. The driving force of this algorithm is a hierarchy of conceptual importance among the elements of the universe of discourse.
 
Software now accounts for most of the cost of computer-based systems. Over the past thirty years, abstraction techniques such as high level programming languages and abstract data types have improved our ability to develop software. However, the increasing size and complexity of software systems have introduced new problems that are not solved by the current techniques. These new problems involve the system-level design of software, in which the important decisions are concerned with the kinds of modules and subsystems to use and the way these modules and subsystems are organized. This level of organization, the software architecture level, requires new kinds of abstractions. These new abstractions will capture essential properties of major subsystems and the ways they interact.
 
The rapid growth of the biological text data repository makes it difficult for human beings to access required information in a convenient and effective manner. The problem arises because most of the information is embedded within unstructured or semi-structured text that computers cannot interpret very easily. In this paper we present an ontology-based Biological Information Extraction and Query Answering (BIEQA) System, which initiates text mining with a set of concepts stored in a biological ontology, and thereafter mines possible biological relations among those concepts using NLP techniques and co-occurrence-based analysis. The system extracts all frequently occurring biological relations between a pair of biological concepts through text mining. A mined relation is associated with a fuzzy membership value, which is proportional to its frequency of occurrence in the corpus, and is termed a fuzzy biological relation. The fuzzy biological relations extracted from a text corpus, along with other relevant information components such as the biological entities occurring within a relation, are stored in a database. The database is integrated with a query-processing module. The query-processing module has an interface which guides users in formulating biological queries at different levels of specificity.
 
Some XML query processors operate on an internal representation of XML documents and can leverage neither the XML storage structure nor the possible access methods dedicated to this storage structure. Such query processors are often used in organizations that usually process transient XML documents received from other organizations. In this paper, we propose a different approach to accelerating query execution on XML source documents in such environments. The approach is based on the notion of query equivalence of XML documents with respect to a query. Under this equivalence, we propose two different document transformation strategies which prune parts of the documents irrelevant to the query, just before executing the query itself. The proposed transformations are implemented and evaluated using a two-level index structure: a structural directory capturing document paths and an inverted index of tag offsets.
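A minimal sketch of the pruning idea, under a deliberately coarse assumption: keep only subtrees that contain at least one element tag referenced by the query, and drop everything else before the query runs. The paper's transformations are based on query equivalence over document paths; the tag-set test below is a simplification for illustration, and the sample document and tag set are invented.

import xml.etree.ElementTree as ET

def prune(elem, query_tags):
    # Recursively drop child subtrees that contain no tag referenced by the
    # query; return True if `elem` or one of its descendants is still relevant.
    relevant = elem.tag in query_tags
    for child in list(elem):
        if prune(child, query_tags):
            relevant = True
        else:
            elem.remove(child)
    return relevant

doc = ET.fromstring("<lib><book><title>DKE</title><price>9</price></book>"
                    "<cd><title>X</title></cd></lib>")
prune(doc, {"title"})                          # drops <price>, which no query path touches
print(ET.tostring(doc, encoding="unicode"))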
 
Fine-grained access controls for XML define access privileges at the granularity of individual XML nodes. In this paper, we present a fine-grained access control mechanism for XML data. This mechanism exploits the structural locality of access rights as well as correlations among the access rights of different users to produce a compact physical encoding of the access control data. This encoding can be constructed using a single pass over a labeled XML database. It is block-oriented and suitable for use in secondary storage. We show how this access control mechanism can be integrated with a next-of-kin (NoK) XML query processor to provide efficient, secure query evaluation. The key idea is that the structural information of the nodes and their encoded access controls are stored together, allowing the access privileges to be checked efficiently. Our evaluation shows that the access control mechanism introduces little overhead into the query evaluation process.
 
The detection of the nearest neighbor object to a given point in the reference space (NN query) is a common problem in geographical information systems (GISs). Data structures supporting range queries are not always adequate to support NN queries. For this reason, additional data structures, mainly relying on the use of some kind of tree, have been proposed. The main drawback of such solutions is that at least one tree path has to be analyzed in order to determine the NN object. In this paper, we overcome this problem by considering information on the reference space to improve the search. In particular, we propose a data structure that is obtained by integrating the R+-tree with a regular grid, indexed by using a hashing technique. The resulting data structure combines the advantages of a rectangular decomposition of the space, typical of R+-trees, with a direct access to each portion of the space, typical of hashing. The proposed technique is then compared both theoretically and experimentally with the R+-tree.
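To make the direct-access-via-hashing part concrete, here is a minimal sketch of a regular grid whose cells are addressed by hashing their integer coordinates and which answers NN queries by scanning rings of cells of growing radius. It deliberately omits the R+-tree component of the proposed structure, and the class and parameter names are illustrative only.

import math
from collections import defaultdict

class GridIndex:
    # Regular grid over 2D points; each cell is addressed by hashing its
    # integer coordinates, giving direct access to any portion of the space.
    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, obj):
        self.buckets[self._key(x, y)].append((x, y, obj))

    def nearest(self, qx, qy):
        # Scan rings of cells of increasing radius around the query cell and
        # stop once no unscanned cell can hold anything closer than the best.
        if not self.buckets:
            return None
        cx, cy = self._key(qx, qy)
        best, best_d = None, math.inf
        r = 0
        while best is None or (r - 1) * self.cell <= best_d:
            for i in range(cx - r, cx + r + 1):
                for j in range(cy - r, cy + r + 1):
                    if max(abs(i - cx), abs(j - cy)) != r:
                        continue                  # visit only the ring at radius r
                    for x, y, obj in self.buckets.get((i, j), ()):
                        d = math.hypot(x - qx, y - qy)
                        if d < best_d:
                            best, best_d = obj, d
            r += 1
        return best

g = GridIndex(cell_size=10.0)
g.insert(3, 4, "a"); g.insert(40, 7, "b")
print(g.nearest(0, 0))                            # -> "a"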
 
The current hype around the Extensible Markup Language (XML) has produced hundreds of XML-based applications. Many of them offer document type definitions (DTDs) to structure actual XML documents. Access to these documents relies on special-purpose applications or on query languages that are closely tied to the document structure. Our approach uses ontologies to derive canonical structures, i.e., DTDs, to access sets of distributed XML documents on a conceptual level. We will show how the combination of conceptual modeling, inheritance, and inference mechanisms with the popularity, simplicity, and flexibility of XML leads to applications providing a broad range of high-quality information.
 
Data broadcasting is an efficient data dissemination method in a wireless client–server system. A data server broadcasts data items periodically, and mobile clients cache data items to save communication bandwidth, resource usage, and data access time. The server also broadcasts invalidation reports (IRs) to maintain the consistency between server data and the clients’ cached data. Most existing cache invalidation policies in a wireless environment based on IRs simply purge the entire cache after a client has been disconnected long enough to miss a certain number (window size) of IRs. We present a cache invalidation scheme that provides better cache reusability and better data access time after a long disconnection. Our scheme attempts to increase cache reusability by respecting update rates at a server, broadcast intervals, the communication bandwidth, and data sizes, as well as disconnection time. Simulation results show that the increased cache reusability of our scheme can improve the data access time after a long disconnection.
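For context, the conventional window-based policy described above can be sketched as follows; the report structure (a sequence number plus the ids updated over the last window_size intervals) is an assumed, illustrative format rather than the paper's.

def apply_invalidation_report(cache, ir, last_seq, window_size):
    # Conventional IR-based invalidation: if the client missed more reports
    # than the window covers, the whole cache is purged; otherwise only the
    # items listed as updated are dropped.
    missed = ir["seq"] - last_seq - 1
    if missed > window_size:
        cache.clear()                      # disconnected too long: purge everything
    else:
        for item_id in ir["updated_ids"]:
            cache.pop(item_id, None)       # drop only the invalidated entries
    return ir["seq"]                       # remember the latest report seen

cache = {1: "x", 2: "y"}
last_seq = apply_invalidation_report(cache, {"seq": 7, "updated_ids": [2]},
                                     last_seq=5, window_size=3)
print(cache)                               # -> {1: "x"}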
 
In this work we propose models for access cost estimation that are suitable in the physical design of a relational database when a set of secondary indexes has to be built on some attributes of the relations. The models are tailored to deal with distinct kinds of queries (partial-match, interval, join, etc.) and are based on a measure of association, the clustering factor, which applies between an attribute and the physical location of records in a file as well as between two (sets of) attributes. The use of clustering factors and the value selectivity of a query (i.e. how many distinct values satisfy a query) allow design time models to be derived without previously needing to estimate the record selectivities (i.e. how many records satisfy a query) or the corresponding access costs of all the query instances that can occur at run time. In practice, unlike previous approaches to the problem, run time models are derived by specializing design time models, rather than vice versa. Estimation of access costs with alternative ordering criteria is also considered, and a model is proposed that allows the primary attribute to be chosen without the need to sort the tuples. The proposed models achieve a good tradeoff between accuracy and simplicity, without being based on restrictive assumptions as to data, and easily allow the design process to take advantage of semantic information about the application domain even if the data are not yet loaded in the database.
 
In this document, we show how linguistic tools and resources such as organized classes of verbs and their semantic representation contribute to a better database conceptual modelling of a domain. Organized verb semantic classes introduce a higher linguistic and cognitive adequacy and a greater flexibility in the definition of relations and in the overall organization of the conceptual model. They also permit the definition of advanced natural-language front-ends.
 
In this paper we present a performance comparison of access methods for time-evolving regional data. Initially, we briefly review four temporal extensions of the Linear Region Quadtree: the Time-Split Linear Quadtree, the Multiversion Linear Quadtree, the Multiversion Access Structure for Evolving Raster Images and Overlapping Linear Quadtrees. These methods comprise a family of specialized access methods that can efficiently store and manipulate consecutive raster images. A new, simpler implementation that provides efficient support through these methods for spatio-temporal queries referring to the past is suggested. An extensive experimental space and time performance comparison of all the above access methods follows. The comparison is made under a common and flexible benchmarking environment in order to choose the best technique depending on the application and on the image characteristics. The experimental results show that in most cases the Overlapping Linear Quadtrees method is the best choice.
 
Access control policies are security policies that govern access to resources. The need to update such policies in real time while they are in effect, and to enforce the changes immediately, arises in many scenarios. Consider, for example, a military environment responding to an international crisis, such as a war. In such situations, countries change strategies, necessitating a change of policies. Moreover, the changes to policies must take place in real time while the policies are in effect. In this paper we address the problem of real-time update of access control policies in the context of a database system. Access control policies, governing access to the data objects, are specified in the form of policy objects. The data objects and policy objects are accessed and modified through transactions. We consider an environment in which different kinds of transactions execute concurrently, some of which may be policy update transactions. We propose algorithms for the concurrent and real-time update of security policies. The algorithms differ on the basis of the concurrency provided and the semantic knowledge used.
 
A new variation of Overlapping B+-trees is presented, which provides efficient indexing of transaction time and keys in a two-dimensional key-time space. Modification operations (i.e. insertions, deletions and updates) are allowed at the current version, whereas queries may address any temporal version, i.e. either the current or a past version. Using this structure, snapshot and range-timeslice queries can be answered optimally. However, the fundamental objective of the proposed method is to deliver efficient performance in the case of a general pure-key query (i.e. the ‘history of a key’). The trade-off is a small increase in time cost for version operations and in storage requirements.
 
In the past a number of file organizations have been proposed for processing different types of queries efficiently. To our knowledge, none of the existing file organizations is capable of supporting all types of accesses equally efficiently. In this paper we take a different approach and design an integrated data structure which offers multiple access paths for processing different types of queries efficiently. The data structure reported here can be implemented on disk-based as well as main-memory database systems; however, in this paper we report its behavior mainly in a main-memory database environment. Our approach is to fuse data structures which each offer an efficient access path for a particular query type. To show the feasibility of our scheme we fused the B+-tree, the grid file and extendible hashing structures, using a proper interface. We implemented the scheme and measured its performance through simulation modeling. Our results show that the data structure does improve concurrency and offers a higher throughput for a variety of transaction processing workloads. We argue that our scheme is different from creating secondary indexes for improving concurrency. In the absence of a data structure which can provide all types of access equally efficiently, an integrated data structure is an acceptable solution which offers an efficient way of increasing the performance of database management systems.
 
The secure multi-party computation (SMC) model provides means for balancing the use and confidentiality of distributed data. This is especially important in the field of privacy-preserving data mining (PPDM). Increasing security concerns have led to a surge in work on practical secure multi-party computation protocols. However, most are only proven secure under the semi-honest model, and security under this adversary model is insufficient for many PPDM applications. SMC protocols under the malicious adversary model generally have impractically high complexities for PPDM. We propose an accountable computing (AC) framework that enables liability for privacy compromise to be assigned to the responsible party without the complexity and cost of an SMC-protocol under the malicious model. We show how to transform a circuit-based semi-honest two-party protocol into a protocol satisfying the AC-framework. The transformations are simple and efficient. At the same time, the verification phase of the transformed protocol is capable of detecting any malicious behaviors that can be prevented under the malicious model.
 
Verification of rule-based systems has largely concentrated on checking the consistency, conciseness and completeness of the rulebase. However, the accuracy of rules vis-à-vis the knowledge that they represent is not addressed, with the result that a large amount of testing has to be done to validate the system. For any reasonably sized rulebase it becomes difficult to know the adequacy and completeness of the test cases. If a particular test case is omitted, the chances of an inaccurate rule remaining undetected increase. We discuss this issue and define a notion of accuracy of rules. We take the view that a rule represents a concept of the domain and, in the setting of Formal Concept Analysis, operates on an object and attribute-value space. We then present a mechanism to measure the level of accuracy using Rough Set Theory. In this framework, accuracy can be computed as the ratio of the objects definitely selected by the rule (the lower approximation) to the objects possibly selected by the rule (the upper approximation) with respect to the concept that it encodes. Our algorithm and its implementation for PROLOG clauses are discussed.
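In standard rough-set notation this ratio is Pawlak's accuracy of approximation: for a rule encoding concept X,

\alpha(X) = \frac{\lvert \underline{R}(X) \rvert}{\lvert \overline{R}(X) \rvert}, \qquad 0 \le \alpha(X) \le 1,

where \underline{R}(X) is the lower approximation (objects definitely selected by the rule) and \overline{R}(X) is the upper approximation (objects possibly selected); \alpha(X) = 1 means the rule captures its concept exactly.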
 
Classification Association Rule Mining (CARM) systems operate by applying an Association Rule Mining (ARM) method to obtain classification rules from a training set of previously classified data. The rules thus generated will be influenced by the choice of ARM parameters employed by the algorithm (typically support and confidence threshold values). In this paper we examine the effect that this choice has on the predictive accuracy of CARM methods. We show that the accuracy can almost always be improved by a suitable choice of parameters, and describe a hill-climbing method for finding the best parameter settings. We also demonstrate that the proposed hill-climbing method is most effective when coupled with a fast CARM algorithm such as the TFPC algorithm which is also described.
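As an illustrative sketch of the hill-climbing idea (not the paper's exact procedure): start from initial support and confidence thresholds, move to a neighbouring threshold pair whenever it improves predictive accuracy, and halve the step size when no neighbour improves. The evaluate_accuracy callback, which would train and cross-validate a TFPC-style classifier at the given thresholds, is a hypothetical stand-in.

def hill_climb(evaluate_accuracy, support=0.2, confidence=0.8,
               step_s=0.05, step_c=0.05, min_step=0.005):
    # Greedy hill-climbing over (support, confidence) threshold values.
    best = evaluate_accuracy(support, confidence)
    while step_s > min_step or step_c > min_step:
        improved = False
        for ds, dc in ((step_s, 0), (-step_s, 0), (0, step_c), (0, -step_c)):
            s = min(max(support + ds, 0.0), 1.0)
            c = min(max(confidence + dc, 0.0), 1.0)
            acc = evaluate_accuracy(s, c)
            if acc > best:
                support, confidence, best = s, c, acc
                improved = True
        if not improved:
            step_s /= 2.0      # refine the search around the current optimum
            step_c /= 2.0
    return support, confidence, best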
 
Hierarchical agglomerative clustering (HAC) is very useful, but its high CPU time and memory complexity limit its practical use. Earlier, we proposed an efficient partitioning – partially overlapping partitioning (POP) – based on the fact that in HAC small and closely placed clusters are agglomerated initially, and only towards the end are larger and distant clusters agglomerated. Here, we present the parallel version of POP, pPOP. Theoretical analysis shows that, compared to the existing algorithms, pPOP achieves a CPU time speed-up and memory scale-down of O(c) without compromising accuracy, where c is the number of cells in the partition. A shared-memory implementation shows that pPOP outperforms existing algorithms significantly.
 
With the proliferation of the Web and ICT technologies there have been concerns about the handling and use of sensitive information by data mining systems. Recent research has focused on distributed environments where the participants in the system may also be mutually mistrustful. In this paper we discuss the design and security requirements for large-scale privacy-preserving data mining (PPDM) systems in a fully distributed setting, where each client possesses its own records of private data. To this end we argue in favor of using some well-known cryptographic primitives, borrowed from the literature on Internet elections. More specifically, our framework is based on the classical homomorphic election model, and particularly on an extension for supporting multi-candidate elections. We also review a recent scheme [Z. Yang, S. Zhong, R.N. Wright, Privacy-preserving classification of customer data without loss of accuracy, in: SDM’ 2005 SIAM International Conference on Data Mining, 2005] which was the first scheme that used the homomorphic encryption primitive for PPDM in the fully distributed setting. Finally, we show how our approach can be used as a building block to obtain Random Forests classification with enhanced prediction performance.
 
The accurate measurement of the functional size of applications that are automatically generated in MDA environments is a challenge for the software development industry. This paper introduces the OO-Method COSMIC Function Points (OOmCFP) procedure, which has been systematically designed to measure the functional size of object-oriented applications generated from their conceptual models by means of model transformations. The OOmCFP procedure is structured in three phases: a strategy phase, a mapping phase, and a measurement phase. Finally, a case study is presented to illustrate the use of OOmCFP, as well as an analysis of the results obtained.
 
One of the major limitations of current NLP systems is a poor encoding of lexical knowledge (morphologic lexicon, grammar, and semantic dictionary). This paper describes a high-coverage system, DANTE, for natural language processing and query answering. At the current state of implementation, the morphological analyzer provides 100% coverage over the corpus (5000 press agency releases with about 100,000 different words) and the parser can analyze 80% of the sentences correctly. A semantic lexicon provides a detailed case-based representation of word senses. The morphologic lexicon (10,000 elementary lemmata plus affixes and suffixes) and the grammar (100 rules) were manually entered; during the first phase of the DANTE project, the semantic knowledge was also manually encoded. More recently, a methodology for semi-automatic acquisition of a case-based semantic lexicon has been devised.
 
This paper introduces a well defined co-operation between domain expert, knowledge engineer, and knowledge acquisition and transformation tools. First, the domain expert supported by a hypertext tool generates an intermediate representation from parts of authentic texts of a domain. As a side effect, this representation serves as human readable documentation. In subsequent steps, this representation is semi-automatically transformed into a formal representation by knowledge acquisition tools. These tools are fully adapted to the expert's domain both in terminology and model structure which are developed by the knowledge engineer from a library of generic models and with preparation tools.
 
We develop an unsupervised learning framework which can jointly extract information and conduct feature mining from a set of Web pages across different sites. One characteristic of our model is that it allows tight interactions between the tasks of information extraction and feature mining. Decisions for both tasks can be made in a coherent manner leading to solutions which satisfy both tasks and eliminate potential conflicts at the same time. Our approach is based on an undirected graphical model which can model the interdependence between the text fragments within the same Web page, as well as text fragments in different Web pages. Web pages across different sites are considered simultaneously and hence information from different sources can be effectively leveraged. An approximate learning algorithm is developed to conduct inference over the graphical model to tackle the information extraction and feature mining tasks. We demonstrate the efficacy of our framework by applying it to two applications, namely, important product feature mining from vendor sites, and hot item feature mining from auction sites. Extensive experiments on real-world data have been conducted to demonstrate the effectiveness of our framework.
 
In this paper, given a set of sequence databases across multiple domains, we aim at mining multi-domain sequential patterns, where a multi-domain sequential pattern is a sequence of events whose occurrence time is within a pre-defined time window. We first propose algorithm Naive in which multiple sequence databases are joined as one sequence database for utilizing traditional sequential pattern mining algorithms (e.g., PrefixSpan). Due to the nature of join operations, algorithm Naive is costly and is developed for comparison purposes. Thus, we propose two algorithms without any join operations for mining multi-domain sequential patterns. Explicitly, algorithm IndividualMine derives sequential patterns in each domain and then iteratively combines sequential patterns among sequence databases of multiple domains to derive candidate multi-domain sequential patterns. However, not all sequential patterns mined in the sequence database of each domain are able to form multi-domain sequential patterns. To avoid the mining cost incurred in algorithm IndividualMine, algorithm PropagatedMine is developed. Algorithm PropagatedMine first performs one sequential pattern mining from one sequence database. In light of sequential patterns mined, algorithm PropagatedMine propagates sequential patterns mined to other sequence databases. Furthermore, sequential patterns mined are represented as a lattice structure for further reducing the number of sequential patterns to be propagated. In addition, we develop some mechanisms to allow some empty sets in multi-domain sequential patterns. Performance of the proposed algorithms is comparatively analyzed and sensitivity analysis is conducted. Experimental results show that by exploring propagation and lattice structures, algorithm PropagatedMine outperforms algorithm IndividualMine in terms of efficiency (i.e., the execution time).
 
We describe a family of heuristics-based clustering strategies to support the merging of XML data from multiple sources. As part of this research, we have developed a comprehensive classification for schematic and semantic conflicts that can occur when reconciling related XML data from multiple sources. Given the fact that element clustering is compute-intensive, especially when comparing large numbers of data elements that exhibit great representational diversity, performance is a critical, yet so far neglected aspect of the merging process. We have developed five heuristics for clustering data in the multi-dimensional metric space. Equivalence of data elements within the individual clusters is determined using several distance functions that calculate the semantic distances among the elements. The research described in this article is conducted within the context of the Integration Wizard (IWIZ) project at the University of Florida. IWIZ enables users to access and retrieve information from multiple XML-based sources through a consistent, integrated view. The results of our qualitative analysis of the clustering heuristics have validated the feasibility of our approach as well as its superior performance when compared to other similarity search techniques.
 
To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records across the sources that represent semantically corresponding entities in the real world. While decision tree techniques have been used to learn entity matching rules, most decision tree learners have an inherent representational bias: they generate univariate trees and restrict the decision boundaries to be axis-orthogonal hyper-planes in the feature space. Cascading other classification methods with decision tree learners can alleviate this bias and potentially increase classification accuracy. In this paper, the authors apply a recently developed constrained cascade generalization method to entity matching and report on an empirical evaluation using real-world data. The evaluation results show that this method outperforms the base classification methods in terms of classification accuracy, especially in the dirtiest case.
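For readers unfamiliar with the technique, plain (unconstrained) cascade generalization appends a base learner's class probabilities to the feature vector before training the decision tree, so the tree can split on a learned, non-axis-orthogonal combination of features. The sketch below shows that basic pattern with scikit-learn; the paper's constrained variant adds restrictions not modelled here, and the choice of logistic regression as the base learner is an assumption.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def cascade_fit(X, y):
    # Fit the base learner, extend the feature space with its class
    # probabilities, then fit a decision tree on the extended features.
    base = LogisticRegression(max_iter=1000).fit(X, y)
    X_ext = np.hstack([X, base.predict_proba(X)])
    tree = DecisionTreeClassifier(max_depth=5).fit(X_ext, y)
    return base, tree

def cascade_predict(base, tree, X):
    # Apply the same feature extension at prediction time.
    X_ext = np.hstack([X, base.predict_proba(X)])
    return tree.predict(X_ext)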
 
There has been a considerable amount of research into the provision of explicit representation of control regimes for resolution-based theorem provers. However, most of the existing systems are either not adequate or too inefficient to be of practical use. In this paper a theorem prover, ACT-P, which is adequate but retains satisfactory efficiency is presented. It does so by providing a number of user-changeable heuristics which are called at specific points during the search for a proof. The set of user-changeable heuristics was determined on the basis of a classification of the heuristics used by existing resolution-based theorem provers.
 
Inconsistencies frequently occur in news about the real world. Some of these inconsistencies may be more significant than others, and some news may contain more inconsistencies than other news. This creates the problem of deciding whether to act on these inconsistencies, and if so how. Possible actions on an inconsistency in a news report include ignoring the inconsistency, resolving the inconsistency, and rejecting the report. To support this, we extend and apply a general characterization of inconsistency based on Belnap’s four-valued logic. For conflicts arising between the news and background knowledge, we analyse the coherence and significance of the corresponding four-valued models for that knowledge and show how this analysis can indicate an appropriate course of action.
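For reference, Belnap’s logic evaluates statements over the four truth values

\mathbf{4} = \{\, \mathsf{T},\ \mathsf{F},\ \mathsf{Both},\ \mathsf{None} \,\},

where Both marks a statement reported as both true and false (an inconsistency) and None marks a statement about which nothing has been reported; the coherence and significance analysis mentioned above is carried out over models built from these four values.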
 
Ontologies have been recognized as a fundamental infrastructure for advanced approaches to Knowledge Management (KM) automation, and the conceptual foundations for them have been discussed in some previous reports. Nonetheless, such conceptual structures should be properly integrated into existing ontological bases, for the practical purpose of providing the required support for the development of intelligent applications. Such applications should ideally integrate KM concepts into a framework of commonsense knowledge with clear computational semantics. In this paper, such an integration work is illustrated through a concrete case study, using the large OpenCyc knowledge base. Concretely, the main elements of the Holsapple and Joshi KM ontology and some existing work on e-learning ontologies are explicitly linked to OpenCyc definitions, providing a framework for the development of functionalities that use the built-in reasoning services of OpenCyc in KM activities. The integration can be used as the point of departure for the engineering of KM-oriented systems that account for a shared understanding of the discipline and rely on public semantics provided by one of the largest open knowledge bases available.
 
The design of workflows is a complicated task. In those cases where the control flow between activities cannot be modeled in advance but simply occurs at enactment time (run time), we speak of ad-hoc processes. Ad-hoc processes allow for the flexibility needed in real-life business processes. Since ad-hoc processes are highly dynamic, they represent one of the most difficult challenges, both technically and conceptually. Caramba is one of the few process-aware collaboration systems allowing for ad-hoc processes. Unlike in classical workflow systems, the users are no longer restricted by the system. Therefore, it is interesting to study the actual way people and organizations work. In this paper, we propose process mining techniques and tools to analyze ad-hoc processes. We introduce process mining, discuss the concept of mining in the context of ad-hoc processes, and demonstrate a concrete application of the concept using Caramba, process mining tools such as EMiT and MinSoN, and a newly developed extraction tool named Teamlog.
 
Traditionally, distributed query optimization techniques generate static query plans at compile time. However, the optimality of these plans depends on many parameters (such as the selectivities of operations and the transmission speeds and workloads of servers) that are not only difficult to estimate but are also often unpredictable and fluctuating at runtime. As the query processor cannot dynamically adjust the plans at runtime, the system performance is often less than satisfactory. In this paper, we introduce a new, highly adaptive distributed query processing architecture. Our architecture can quickly detect fluctuations in the selectivities of operations, as well as in the transmission speeds and workloads of servers, and accordingly change the operation order of a distributed query plan during execution. We have implemented a prototype based on the Telegraph system [Telegraph project. Available from <http://telegraph.cs.berkeley.edu/>]. Our experimental study shows that our mechanism can adapt itself to changes in the environment and hence approach an optimal plan during execution.
 
This paper describes a rule-based query optimizer. The originality of the approach lies in a uniform high-level rule language used to model both query rewriting and planning, as well as search strategies. Rules are given to specify operation permutation, recursive query optimization and integrity constraint addition, and to model join ordering and access path selection. Moreover, meta-rules are presented to model multiple search strategies, including enumerative and randomized search. To illustrate these ideas, we describe a query optimizer for an extensible database server that supports abstract data types, complex objects, deductive capabilities and integrity constraints. A prototype of the query optimizer proposed in this paper is operational and was demonstrated at the 1991 ESPRIT week in the EDS project.
 
Current workflow management systems still lack support for dynamic and automatic workflow adaptations. However, this functionality is a major requirement for next-generation workflow systems to provide sufficient flexibility to cope with unexpected failure events. We present the concepts and implementation of AgentWork, a workflow management system supporting automated workflow adaptations in a comprehensive way. A rule-based approach is followed to specify exceptions and necessary workflow adaptations. AgentWork uses temporal estimates to determine which remaining parts of running workflows are affected by an exception and is able to predictively perform suitable adaptations. This helps to ensure that necessary adaptations are performed in time with minimal user interaction, which is especially valuable in complex applications such as medical treatments.
 
Top-cited authors
Rudi Studer
  • FZI Forschungszentrum Informatik
V. Richard Benjamins
Wil Van der Aalst
  • RWTH Aachen University
Manfred Reichert
  • Ulm University
Veda C. Storey