IEEE Transactions on Knowledge and Data Engineering (IEEE T KNOWL DATA EN)

Publisher: IEEE Computer Society, Institute of Electrical and Electronics Engineers (IEEE)

Journal description

Research, design, and development of knowledge and data engineering methodologies, strategies, and systems. Topics include acquiring and managing data, learning and storage of new knowledge, prolonging the life of useful data, system modeling and design, data access, security, and integrity control.

RG Journal Impact: 2.92 *

*This value is calculated using ResearchGate data and is based on average citation counts from work published in this journal. The data used in the calculation may not be exhaustive.

RG Journal impact history

2019: Available summer 2020
2018: 2.92
2017: 3.37
2016: 3.89
2015: 4.23
2014: 4.12
2013: 4.30
2012: 4.07
2011: 4.75
2010: 4.55
2009: 4.92
2008: 4.70
2007: 4.95
2006: 5.62
2005: 5.12
2004: 3.59
2003: 3.01
2002: 4.03
2001: 3.84
2000: 3.52

RG Journal impact over time

[Graph: RG Journal Impact plotted by year; values as listed in the table above.]

Additional details

Cited half-life: 7.40
Immediacy index: 0.36
Eigenfactor: 0.01
Article influence: 1.16
Website: http://ieeexplore.ieee.org/servlet/opac?punumber=69
Website description: IEEE Transactions on Knowledge and Data Engineering website
Other titles: IEEE transactions on knowledge and data engineering; Institute of Electrical and Electronics Engineers transactions on knowledge and data engineering; Transactions on knowledge and data engineering; Knowledge and data engineering
ISSN: 1041-4347
OCLC: 18766852
Material type: Periodical, Internet resource
Document type: Journal / Magazine / Newspaper, Internet Resource

Publications in this journal

The problem of making decisions among propositions based on both uncertain data items and uncertain arguments is addressed. The primary knowledge discovery issue addressed is a classification problem: which classification does the available evidence support? The method investigated seeks to exploit information available from conventional database systems, namely, the integrity assertions or data dependency information contained in the database. This information allows arguments to be ranked in terms of their strengths. As a step in the process of discovering classification knowledge, the database is used in a secondary knowledge discovery exercise to explicate latent knowledge pertinent to arguments relevant to the purpose at hand; this is called evidence. An evidential reasoner requests this information via user prompts and takes it as evidence. An object-oriented structure for managing evidence is used to model the conclusion space and to reflect the evidence structure. The implementation of the evidence structure and an example of its use are outlined.
We introduce an instance-weighting method to induce cost-sensitive trees. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree to be induced: minimum error trees or minimum high cost error trees. We demonstrate that it can be easily adapted to an existing tree learning algorithm. Previous research provides insufficient evidence to support the idea that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. On two-class data sets, the algorithm incorporating the instance-weighting method is found to be better than the original algorithm in terms of total misclassification costs, the number of high cost errors, and tree size. The instance-weighting method is simpler and more effective in implementation than a previous method based on altered priors.
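The instance-weighting idea can be sketched concretely: give each training instance a weight proportional to the cost of misclassifying its class, normalized so the weights sum to the number of instances, and hand those weights to an ordinary greedy tree learner. The sketch below uses scikit-learn's DecisionTreeClassifier and a hypothetical two-class cost setting; it illustrates the general idea rather than the paper's exact algorithm.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def cost_weights(y, class_cost):
        """Per-instance weights proportional to the misclassification cost of
        each instance's class, normalized so the weights sum to len(y)."""
        y = np.asarray(y)
        costs = np.array([class_cost[c] for c in y], dtype=float)
        return costs * len(y) / costs.sum()

    # Hypothetical two-class example: class-1 errors are 10x as costly as class-0 errors.
    X = np.array([[0.1], [0.2], [0.8], [0.9], [0.85], [0.15]])
    y = np.array([0, 0, 1, 1, 1, 0])
    w = cost_weights(y, class_cost={0: 1.0, 1: 10.0})

    # Standard greedy tree induction, steered toward low-cost errors by the weights.
    tree = DecisionTreeClassifier().fit(X, y, sample_weight=w)
    print(tree.predict([[0.3], [0.7]]))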
Introduces a variation of the 2D string representation for symbolic pictures, the non-redundant 2D string, and analyzes it with respect to compactness and non-ambiguity. The results show that the non-redundant 2D string is a more compact representation than the 2D string, and that the class of unambiguous pictures under the non-redundant 2D string is almost equal to the class of unambiguous pictures under the reduced 2D string, differing only in a special case. Moreover, we show that the compactness of the new index does not affect the time complexity of picture retrieval.
An algorithm for the induction of rules from examples is introduced. The algorithm is novel in the sense that it not only learns rules for a given concept (classification), but simultaneously learns rules relating multiple concepts. This type of learning, known as generalized rule induction, is considerably more general than existing algorithms, which tend to be classification oriented. The paper first focuses on the problem of determining a quantitative, well-defined rule preference measure. In particular, a quantity called the J-measure is proposed as an information-theoretic alternative to existing approaches. The J-measure quantifies the information content of a rule or a hypothesis. The information-theoretic origins of this measure are outlined, and its plausibility as a hypothesis preference measure is examined. The ITRULE algorithm, which uses the measure to learn a set of optimal rules from a set of data samples, is defined. Experimental results on real-world data are analyzed.
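For reference, the J-measure of a rule "if y then x" is commonly written as J(X; Y=y) = p(y) [ p(x|y) log2(p(x|y)/p(x)) + (1-p(x|y)) log2((1-p(x|y))/(1-p(x))) ], i.e., the rule probability times the cross-entropy between the posterior and prior distributions of x. A small sketch of the measure only, not of the ITRULE implementation:

    from math import log2

    def j_measure(p_y, p_x, p_x_given_y):
        """J-measure of the rule 'if y then x': the rule probability p(y)
        times the cross-entropy between the posterior and prior of x."""
        def term(post, prior):
            return 0.0 if post == 0 else post * log2(post / prior)
        cross_entropy = term(p_x_given_y, p_x) + term(1 - p_x_given_y, 1 - p_x)
        return p_y * cross_entropy

    # Example: y occurs 30% of the time; x has prior 0.4 but posterior 0.9 given y.
    print(round(j_measure(p_y=0.3, p_x=0.4, p_x_given_y=0.9), 4))  # about 0.238 bits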
Similarity search over 3D structure data sets is fundamental to many database applications such as molecular biology, image registration, and computer-aided design. Identifying the common 3D substructures between two objects is an important research problem. However, it is well known that computing structural similarity is very expensive due to the high exponential time complexity of structure similarity measures. As structure databases keep growing rapidly, real-time search over large structure databases becomes problematic. In this paper, we present a novel statistical model, the multiresolution Localized Co-Occurrence Model (LCM), to approximately measure the similarity between two point-based 3D structures in linear time for fast retrieval. LCM captures both the distribution characteristics and the spatial structure of 3D data by localizing the point co-occurrence relationship within a predefined neighborhood system. As a step further, a novel structure query processing method called incremental and Bounded search (iBound) is also proposed to speed up the search process. iBound avoids a large amount of expensive computation at higher-resolution LCMs. By superposing two LCMs, their largest common substructure can also be found quickly. Finally, our experimental results demonstrate the effectiveness and efficiency of our methods.
This paper discusses 3D visualization and interactive exploration of large relational data sets through the integration of several well-chosen multidimensional data visualization techniques, for the purpose of visual data mining and exploratory data analysis. The basic idea is to combine the techniques of grand tour, direct volume rendering, and data aggregation in databases to deal with both the high dimensionality of data and a large number of relational records. Each technique has been enhanced or modified for this application. Specifically, positions of data clusters are used to decide the path of a grand tour. This cluster-guided tour makes intercluster-distance-preserving projections in which data clusters are displayed as separate as possible. A tetrahedral mapping method applied to cluster centroids helps in choosing interesting cluster-guided projections. Multidimensional footprint splatting is used to directly render large relational data sets. This approach abandons the rendering techniques that enhance 3D realism and focuses on how to efficiently produce real-time explanatory images that give comprehensive insights into global features such as data clusters and holes. Examples are given where the techniques are applied to large (more than a million records) relational data sets.
A video query model based on the content of video and iconic indexing is proposed. We extend the notion of two-dimensional strings to three-dimensional strings (3D-Strings) for representing the spatial and temporal relationships among the symbols in both a video and a video query. The problem of video query processing is then transformed into a problem of three-dimensional pattern matching. To efficiently match the 3D-Strings, a data structure, called 3D-List, and its related algorithms are proposed. In this approach, the symbols of a video in the video database are retrieved from the video index and organized as a 3D-List according to the 3D-String of the video query. The related algorithms are then applied on the 3D-List to determine whether this video is an answer to the video query. Based on this approach, we have started a project called Vega. In this project, we have implemented a user friendly interface for specifying video queries, a video index tool for constructing the video index, and a video query processor based on the notion of 3D-List. Some experiments are also performed to show the efficiency and effectiveness of the proposed algorithms
Spatial relationships between objects are important features for designing a content-based image retrieval system. We propose a new scheme, called the 9D-SPA representation, for encoding the spatial relations in an image. With this representation, important functions of intelligent image database systems such as visualization, browsing, spatial reasoning, iconic indexing, and similarity retrieval can be easily achieved. The discriminating capability of the 9D-SPA representation is much greater than that of any spatial representation method based on minimum bounding rectangles or centroids of objects. The similarity measures using the 9D-SPA representation provide a wide range of fuzzy matching capability in similarity retrieval to meet different users' requirements. Experimental results showed that our system is very effective in terms of recall and precision. In addition, the 9D-SPA representation can be incorporated into a two-level index structure to help reduce the search space of query processing. The experimental results also demonstrated that, on average, only 0.1254 percent to 1.6829 percent of symbolic pictures (depending on the degree of similarity) were accessed per query in an image database containing 50,000 symbolic pictures.
Generating abductive explanations is the basis of several problem solving activities such as diagnosis, planning, and interpretation. Temporal abduction means generating explanations that account not only for the presence of observations, but also for temporal information on them, based on temporal knowledge in the domain theory. We focus on the case where such a theory contains temporal constraints that are required to be consistent with temporal information on observations. Our aim is to propose efficient algorithms for computing temporal abductive explanations. Temporal constraints in the theory and in the observations can be used actively by an abductive reasoner in order to prune inconsistent candidate explanations at an early stage during their generation. However, checking temporal constraint satisfaction frequently introduces overhead. We analyze two incremental ways of making this process efficient. First, we show how, using a specific class of temporal constraints (which is expressive enough for many applications), such overhead can be reduced significantly while preserving full pruning power. In general, the approach does not affect the asymptotic complexity of the problem, but it provides significant advantages in practical cases. We also show that, for some special classes of theories, the asymptotic complexity is reduced as well. We then show how compiled knowledge based on temporal information can be used to further improve the computation, thus extending previous results for atemporal abduction to the temporal framework. The paper provides both analytic and experimental evaluations of the computational advantages provided by our approaches.
Several artificial intelligence architectures and systems based on “deep” models of a domain have been proposed, in particular for the diagnostic task. These systems have several advantages over traditional knowledge based systems, but their main limitation is computational complexity. One way to face this problem is to rely on a knowledge compilation phase, which produces knowledge that can be used more effectively than the original model. We show how a specific knowledge compilation approach can focus reasoning in abductive diagnosis and, in particular, can improve the performance of AID, an abductive diagnosis system. The approach aims at focusing the overall diagnostic cycle in two interdependent ways: avoiding the generation of candidate solutions to be discarded a posteriori and integrating the generation of candidate solutions with discrimination among different candidates. Knowledge compilation is used off-line to produce operational (i.e., easily evaluated) conditions that embed the abductive reasoning strategy and are used in addition to the original model, with the goal of ruling out parts of the search space or focusing on parts of it. The conditions make it possible to solve most cases in less time while computing the same solutions, preserving all the power of the model-based system for dealing with multiple faults and explaining the solutions. Experimental results showing the advantages of the approach are presented.
With the availability of affordable sensors and sensor networks, sensor-based human activity recognition has attracted much attention in artificial intelligence and ubiquitous computing. In this paper, we present a novel two-phase approach for detecting abnormal activities based on wireless sensors attached to a human body. Detecting abnormal activities is a particularly important task in security monitoring and healthcare applications of sensor networks, among many others. Traditional approaches to this problem suffer from a high false positive rate, particularly when the collected sensor data are biased towards normal data while abnormal events are rare; consequently, there is a lack of training data for many traditional data mining methods to be applied. To solve this problem, our approach first employs a one-class support vector machine (SVM) that is trained on commonly available normal activities, which filters out the activities that have a very high probability of being normal. We then derive abnormal activity models from a general normal model via kernel nonlinear regression (KNLR) to reduce the false positive rate in an unsupervised manner. We show that our approach provides a good tradeoff between abnormality detection rate and false alarm rate, and allows abnormal activity models to be automatically derived without the need to explicitly label the abnormal training data, which are scarce. We demonstrate the effectiveness of our approach using real data collected from a sensor network deployed in a realistic setting.
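A minimal sketch of the first (filtering) phase, using scikit-learn's OneClassSVM on hypothetical sensor feature windows; the kernel nonlinear regression of the paper's second phase is not reproduced here:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Hypothetical sensor features: rows are windows of accelerometer statistics.
    rng = np.random.default_rng(0)
    normal_windows = rng.normal(loc=0.0, scale=1.0, size=(500, 6))   # training data: normal only
    incoming_windows = rng.normal(loc=0.0, scale=1.0, size=(20, 6))
    incoming_windows[:3] += 6.0                                      # a few abnormal windows

    # Phase 1: a one-class SVM trained on normal activities filters out windows
    # that are very likely normal; the rest are passed on for finer modeling.
    detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_windows)
    suspicious = incoming_windows[detector.predict(incoming_windows) == -1]
    print(len(suspicious), "windows flagged for the second phase")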
The author analyzes and compares the performance of two timestamp ordering concurrency control algorithms, namely, the basic and multiversion, for database systems. The multiversion algorithm improves the performance of the basic timestamp ordering algorithm by keeping multiple versions of data objects. The author discusses the performance enhancement in the multiversion algorithm over the basic algorithm. The author also discusses the storage overhead due to multiple versions of data objects in the multiversion algorithm. The exact performance model of these algorithms is so complex that it is impossible to find a closed-form solution. The author reduces the complexity of the analysis by analyzing a single transaction in isolation and reflects the presence of other transactions on the isolated transaction by the probability of conflict/abort. The analysis provides useful insight into the performance of these algorithms
Increasing the parallelism in transaction processing and maintaining data consistency appear to be two conflicting goals in designing distributed database systems (DDBSs). This problem is especially difficult if the DDBS is serving long-lived transactions (LLTs). A special case of LLTs, called sagas, has been introduced to address this problem. A DDBS with sagas provides high parallelism to transactions by allowing sagas to release their locks as early as possible. However, it is also subject to an overhead, due to the effort needed to restore data consistency in the case of failure. We conduct a series of simulation studies to compare the performance of LLT systems with and without saga implementation in a faulty environment. The studies show that saga systems outperform their nonsaga counterparts under most conditions, including heavy failure cases. We then propose an analytical queuing model to investigate the performance behavior of saga systems. This analytical model enables us to quantitatively study the performance penalty of a saga implementation due to the failure recovery overhead. Furthermore, the analytical solution can be used by system administrators to fine-tune the performance of a saga system. The model captures the primary aspects of a saga system, namely data locking, resource contention, and failure recovery. Due to the complicated nature of the analytical modeling, we solve the model approximately for various performance metrics using decomposition methods, and validate the accuracy of the analytical results via simulations.
The design of communication protocols to support guaranteed real-time communication for distributed multimedia systems is examined. A network level abstraction called φ-channel that supports the requirements of real-time applications is proposed. A φ-channel represents a fractional, simplex, end-to-end communication channel between a source and a destination. The channel is characterized by a set of specific performance parameters associated with its traffic. The required performance characteristics of a φ-channel are specified in terms of the packet maximum end-to-end delay and the maximum number of packets that can be sent over that delay. The primary attribute supported by the φ-channel is the on-time reliability. Based on the specified parameters, the underlying delivery system verifies the feasibility of supporting such a channel. The performance of an accepted φ-channel is guaranteed under any conditions, barring hardware failures. The basic scheme that the model uses to verify the feasibility of accepting a φ-channel and the run-time support used to guarantee its performance are described. The results of a simulation experiment implementing the basic functionalities of the proposed scheme are also presented
Two important features of modern database models are support for complex data structures and support for high-level data retrieval and update. The first issue has been studied by the development of various semantic data models; the second issue has been studied through universal relation data models. How the advantages of these two approaches can be combined is presently examined. A new data model that incorporates standard concepts from semantic data models such as entities, aggregations, and ISA hierarchies is introduced. It is then shown how nonnavigational queries and updates can be interpreted in this model. The main contribution is to demonstrate how universal relation techniques can be extended to a more powerful data model. Moreover, the semantic constructs of the model allow one to eliminate many of the limitations of previous universal relation models
Delta abstractions are introduced as a mechanism for managing database states during the execution of active database rules. Delta abstractions build upon the use of object deltas, capturing changes to individual objects through a system-supported, collapsible type structure. The object delta structure is implemented using object-oriented concepts such as encapsulation and inheritance so that all database objects inherit the ability to transparently create and manage delta values. Delta abstractions provide an additional layer to the database programmer for organizing object deltas according to different language components that induce database changes, such as methods and active rules. As with object deltas, delta abstractions are transparently created and maintained by the active database system. We define different types of delta abstractions as views of object deltas and illustrate how the services of delta abstractions can be used to inspect the state of active rule execution. An active rule analysis and debugging tool has been implemented to demonstrate the use of object deltas and delta abstractions for dynamic analysis of active rules at runtime.
The caching of accessed disk pages has been used successfully for decades in database technology, resulting in effective amortization of the I/O operations needed within a stream of query or update requests. However, in modern complex databases, like multimedia databases, the I/O cost becomes a minor performance factor. In particular, metric access methods (MAMs), used for similarity search in complex unstructured data, have been designed to minimize the number of distance computations rather than the I/O cost (when indexing or querying). Inspired by I/O caching in traditional databases, in this paper we introduce the idea of distance caching for use with MAMs, a novel approach to streamlining similarity search. As a result, we present the D-cache, a main-memory data structure which can be easily implemented into any MAM in order to spare the distance computations spent by queries/updates. In particular, we have modified two state-of-the-art MAMs, the M-tree and Pivot tables, to make use of the D-cache. Moreover, we present the D-file, an index-free MAM based on simple sequential search augmented by the D-cache. The experimental evaluation shows that the performance gain achieved due to the D-cache is significant for all the MAMs, especially for the D-file.
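The core idea of distance caching can be illustrated independently of any particular metric access method: memoize distances between object pairs so that repeated computations during indexing and querying are spared. A generic memoization sketch, not the D-cache structure itself:

    from math import dist  # Euclidean distance; stands in for any expensive metric

    class DistanceCache:
        """Memoizes pairwise distances keyed by unordered object-id pairs."""
        def __init__(self, distance_fn):
            self.distance_fn = distance_fn
            self.table = {}
            self.computed = 0

        def distance(self, id_a, obj_a, id_b, obj_b):
            key = (id_a, id_b) if id_a <= id_b else (id_b, id_a)
            if key not in self.table:
                self.table[key] = self.distance_fn(obj_a, obj_b)
                self.computed += 1
            return self.table[key]

    cache = DistanceCache(dist)
    objects = {1: (0.0, 0.0), 2: (3.0, 4.0)}
    print(cache.distance(1, objects[1], 2, objects[2]))  # computed: 5.0
    print(cache.distance(2, objects[2], 1, objects[1]))  # served from the cache
    print(cache.computed, "distance computation(s) performed")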
The paper considers an access control model and proposes extensions to it to deal with authentication and revocation. The model is then applied to represent access control policy in a mental health system. In the first part of the paper, extensions to the schematic protection model (SPM) are presented. The authentication and revocation extensions are independent of one another in the sense that each one affects a different part of the decision algorithm. The extensions comprise a modification of the syntax to be able to represent the new concepts and, more importantly, a modification of the decision algorithm for the safety problem to take these changes into account. We introduce the concept of conditional tickets and use it to provide authentication. Apart from this, we have found this concept to be useful in modeling systems. Hence we have separated this (syntactical) issue from the definition of the new algorithm. The second part considers the access policy for a mental health application. We have used the extensions of SPM to model part of this access policy. Even with our extensions, SPM still remains a monotonic model, where rights can be removed only in very special cases, and this makes it impossible to represent all the aspects of the problem. Other than to serve as an example for the extensions we propose, the paper also helps to separate aspects of this access control policy which are inherently monotonic from parts which are defined in a non-monotonic way, but can still be represented in a monotonic model
Examines the effect of skewed database access on the transaction response time in a multisystem data sharing environment, where each computing node has access to shared data on disks, and has a local buffer of recently accessed granules. Skewness in data access can increase data contention since most accesses go to few data items. For the same reason, it can also increase the buffer hit probability. We quantify the resultant effect on the transaction response time, which depends not only on the various system parameters but also on the concurrency control (CC) protocol. Furthermore, the CC protocol can give rise to rerun transactions that have different buffer hit probabilities. In a multisystem environment, when a data block gets updated by a system, any copies of that block in other systems' local buffers are invalidated. Combining these effects, we find that higher skew does not necessarily lead to worse performance, and that with skewed access, optimistic CC is more robust than pessimistic CC. Examining the buffer hit probability as a function of the buffer size, we find that the effectiveness of additional buffer allocation can be broken down into multiple regions that depend on the access frequency distribution
Two graph models are developed to determine the minimum required buffer size for achieving the theoretical lower bound on the number of disk accesses for performing relational joins. Here, the lower bound implies only one disk access per joining block or page. The first graph model is based on the block connectivity of the joining relations. Using this model, the problem of determining an ordered list of joining blocks that requires the smallest buffer is considered. It is shown that this problem, as well as the problem of computing the least upper bound on the buffer size, is NP-hard. The second graph model represents the page connectivity of the joining relations. It is shown that the problem of computing the least upper bound on the buffer size for the page connectivity model is also NP-hard. Heuristic procedures are presented for the page connectivity model, and it is shown that the sequence obtained using the heuristics requires a near-optimal buffer size. The authors also show the performance improvement of the proposed heuristics over the hybrid-hash join algorithm for a wide range of join factors.
Assume a database storing N objects with d numerical attributes or feature values. All objects in the database can be assigned an overall score that is derived from their single feature values (and the feature values of a user-defined query). The problem considered here is then to efficiently retrieve the k objects with minimum (or maximum) overall score. The well-known threshold algorithm (TA) was proposed as a solution to this problem. TA views the database as a set of d sorted lists storing the feature values. Even though TA is optimal with regard to the number of accesses, its overall access cost can be high since, in practice, some list accesses may be more expensive than others. We therefore propose to make TA access cost aware by choosing the next list to access such that the overall cost is minimized. Our experimental results show that this overall cost is close to the optimal cost and significantly lower than the cost of prior approaches.
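For context, a plain (cost-unaware) version of the threshold algorithm over d sorted lists can be sketched as follows; the paper's contribution lies in choosing which list to access next so that the overall access cost, rather than just the number of accesses, is minimized. A minimal sketch, assuming descending lists and summation as the scoring function:

    import heapq

    def threshold_algorithm(lists, scores, k):
        """Top-k objects by summed score. `lists` are the d feature lists, each
        sorted by descending value as (value, object_id) pairs; `scores` maps
        object_id -> full score vector and stands in for random access."""
        seen, top = set(), []            # top is a min-heap of (total_score, obj)
        for depth in range(max(len(l) for l in lists)):
            last_seen = []
            for lst in lists:            # one round of sorted accesses
                if depth >= len(lst):
                    continue
                value, obj = lst[depth]
                last_seen.append(value)
                if obj not in seen:
                    seen.add(obj)
                    total = sum(scores[obj])          # random accesses
                    heapq.heappush(top, (total, obj))
                    if len(top) > k:
                        heapq.heappop(top)
            threshold = sum(last_seen)   # best total any unseen object could reach
            if len(top) == k and top[0][0] >= threshold:
                break                    # early termination: the result cannot change
        return sorted(top, reverse=True)

    scores = {"a": (0.9, 0.8), "b": (0.6, 0.9), "c": (0.2, 0.1)}
    lists = [sorted(((v[i], o) for o, v in scores.items()), reverse=True) for i in range(2)]
    print(threshold_algorithm(lists, scores, k=2))   # 'a' then 'b'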
The cryptographic key assignment problem is to assign cryptographic keys to a set of partially ordered classes so that the cryptographic key of a higher class can be used to derive the cryptographic key of a lower class. In this paper, we propose a time-bound cryptographic key assignment scheme in which the cryptographic keys of a class are different for each time period; that is, the cryptographic key of class C_i at time t is K_{i,t}. Key derivation is constrained not only by the class relation, but also by the time period. In our scheme, each user holds some secret parameters whose number is independent of the number of classes in the hierarchy and the total number of time periods. We present two novel applications of our scheme. One is to broadcast data to authorized users in a multilevel-security way and the other is to construct a flexible cryptographic key backup system.
Organizing massive amounts of data on wireless communication networks in order to provide fast and low-power access to users equipped with palmtops is a new challenge to the data management and telecommunication communities. Solutions must take into consideration the physical restrictions of low network bandwidth and the limited battery life of palmtops. This paper proposes algorithms for multiplexing clustering and nonclustering indexes along with data on wireless networks. The power consumption and the latency for obtaining the required data are considered as the two basic performance criteria for all algorithms. First, this paper describes two algorithms, namely (1, m) indexing and Distributed Indexing, for multiplexing data and its clustering index. Second, an algorithm called Nonclustered Indexing is described for allocating static data and its corresponding nonclustered index. Then, the Nonclustered Indexing algorithm is generalized to the case of multiple indexes. Finally, the proposed algorithms are analytically demonstrated to lead to significant improvement of battery life while retaining a low latency.
We present a new access method, called the path dictionary index (PDI) method, for supporting nested queries on object-oriented databases. PDI supports object traversal and associative search, respectively, with a path dictionary and a set of attribute indexes built on top of the path dictionary. We discuss issues on indexing and query processing in object-oriented databases; describe the operations of the new mechanism; develop cost models for its storage overhead and query and update costs; and compare the new mechanism to the path index method. The result shows that the path dictionary index method is significantly better than the path index method over a wide range of parameters in terms of retrieval and update costs and that the storage overhead grows slowly with the number of indexed attributes
A class of order-preserving dynamic hashing structures is introduced and analyzed. The access method is referred to as the dynamic random-sequential access method (DRSAM) and is derived from linear hashing. A new logical to physical mapping that is based on sequential bucket allocations in hash order is proposed. With respect to previous methods, this allocation technique has the following characteristics: (1) the structure captures the hashed order in consecutive storage areas so that order preserving (OPH) schemes should result in performance improvements for range queries and sequential processing; and (2) it adapts elastic buckets for the control of file growth. Under specific conditions, this approach outperforms the partial expansion method previously proposed by P.-A. Larson (1982)
An efficient means of accessing indexed hierarchical databases using a relational query language is presented. The purpose is to achieve an effective sharing of heterogeneous distributed databases. Translation of hierarchical data to an equivalent relational data definition, translation of a relational query language statement to an equivalent program that can be processed by a hierarchical database management system, and automatic selection of secondary indexes of hierarchical databases are investigated. A major portion of the result has been implemented, and the performance of the implemented system is analyzed. The performance of the system is satisfactory for a wide range of test data and test queries. It is shown that the utilization of the secondary index significantly enhances the efficiency in accessing hierarchical databases
Prefetching is an effective method for minimizing the number of fetches between the client and the server in a database management system. In this paper, we formally define the notion of prefetching. We also formally propose new notions of the type-level access locality and type-level access pattern. The type-level access locality is a phenomenon in which repetitive patterns exist in the attributes referenced. The type-level access pattern is a pattern of attributes that are referenced in accessing the objects. We then develop an efficient capturing and prefetching policy based on this formal framework. Existing prefetching methods are based on object-level or page-level access patterns, which consist of the object-ids or page-ids of the objects accessed. However, the drawback of these methods is that they work only when exactly the same objects or pages are accessed repeatedly. In contrast, even though the same objects are not accessed repeatedly, our technique effectively prefetches objects if the same attributes are referenced repeatedly, i.e., if there is type-level access locality. Many navigational applications in object-relational database management systems (ORDBMSs) have type-level access locality. Therefore, our technique can be employed in ORDBMSs to effectively reduce the number of fetches, thereby significantly enhancing the performance. We also address issues in implementing the proposed algorithm. We have conducted extensive experiments in a prototype ORDBMS to show the effectiveness of our algorithm. Experimental results using the OO7 benchmark, a real GIS application, and an XML application show that our technique reduces the number of fetches by orders of magnitude and improves the elapsed time by several factors over on-demand fetching and context-based prefetching, which is a state-of-the-art prefetching method. These results indicate that our approach provides a new paradigm in prefetching that improves the performance of navigational applications significantly and is a practical method that can be implemented in commercial ORDBMSs.
Describes an approach for multiparadigmatic visual access, based on the integration of different interaction paradigms. The user is provided with an adaptive interface augmented by a user model, supporting different visual representations of both data and queries. The visual representations are characterized on the basis of the chosen visual formalisms, namely forms, diagrams and icons. To access different databases, a unified data model called the “graph model” is used as a common underlying formalism to which databases, expressed in the most popular data models, can be mapped. Graph model databases are queried through the adaptive interface. The semantics of the query operations is formally defined in terms of graphical primitives. Such a formal approach permits us to define the concept of an “atomic query”, which is the minimal portion of a query that can be transferred from one interaction paradigm to another and processed by the system. Since certain interaction modalities and visual representations are more suitable for certain user classes, the system can suggest to the user the most appropriate interaction modality as well as the visual representation, according to the user model. Some results on user model construction are presented.
A method is proposed for dealing with nonuniform data distributions in database organizations in order to estimate the expected number of blocks containing the tuples requested by a query. When tuples with equal attribute value are not uniformly distributed over the blocks of secondary memory that store the relation, a clustering effect is observed. This can be detected by means of a single parameter, the clustering factor, which can be stored in the system catalog. The method can be applied to uniform data distributions as well, since it is shown that a uniform distribution can be viewed as a particular instance of a class of clustered distributions. In this case the proposed method allows considerable reduction of the number of computational steps needed to compute the estimated result
Experiences with the implementation of the cell tree dynamic access method for spatial databases are reported, and the results of an experimental performance comparison with the R-tree of A. Guttman (1984) and with the R+-tree of T. Sellis et al. (1987) are given. Cell tree design and implementation are discussed. Although the cell tree often requires more storage space and more CPU time to answer a search query, it usually obtains the results with a lower number of disk accesses than the two rival structures.
Tzeng (2002) proposed a time-bound cryptographic key assignment scheme for access control in a partial-order hierarchy. In this paper, we show that Tzeng's scheme is insecure against the collusion attack whereby three users conspire to access some secret class keys that they should not know according to Tzeng's scheme.
Real-time update of access control policies, that is, updating policies while they are in effect and enforcing the changes immediately, is necessary for many security-critical applications. In this paper, we consider real-time update of access control policies in a database system. Updating policies while they are in effect can lead to potential security problems, such as access to database objects by unauthorized users. In this paper, we propose several algorithms that not only prevent such security breaches but also ensure the correctness of execution. The algorithms differ from each other in the degree of concurrency provided and the semantic knowledge used. Of the algorithms presented, the most concurrency is achieved when transactions are decomposed into atomic steps. Once transactions are decomposed, the atomicity, consistency, and isolation properties no longer hold. Since the traditional transaction processing model can no longer be used to ensure the correctness of the execution, we use an alternate semantic-based transaction processing model. To ensure correct behavior, our model requires an application to satisfy a set of necessary properties, namely, semantic atomicity, consistent execution, sensitive transaction isolation, and policy compliance. We show how one can verify an application statically to check for the existence of these properties.
ADMS is an advanced database management system developed to experiment with incremental access methods for large and distributed databases. It has been developed over the past eight years at the University of Maryland. The paper provides an overview of ADMS, and describes its capabilities and the performance attained by its incremental access methods. This paper also describes an enhanced client-server architecture that allows incremental gateway access to multiple heterogeneous commercial database management systems.
The paper examines the issue of scheduling page accesses in join processing, and proposes new heuristics for the following scheduling problems: 1) an optimal page access sequence for a join such that there are no page reaccesses using the minimum number of buffer pages, and 2) an optimal page access sequence for a join such that the number of page reaccesses for a given number of buffer pages is minimum. The experimental performance results show that the new heuristics perform better than existing heuristics for the first problem and also perform better for the second problem, provided that the number of available buffer pages is not much less than the optimal buffer size
XML (extensible markup language) is fast becoming the de facto standard for information exchange over the Internet. As more and more sensitive information gets stored in the form of XML, proper access control to the XML documents becomes increasingly important. However, traditional access control methodologies that have been adapted for XML documents do not address the performance issue of access control. This paper proposes a bitmap-indexing scheme in which access control decisions can be sped up. Authorization policies of the form (subject, object, and action) are encoded as bitmaps in the same manner as XML document indexes are constructed. These two are then efficiently pipelined and manipulated for "fast" access control and "secure" retrieval of XML documents.
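The bitmap intuition can be conveyed with Python integers used as bitsets: document nodes are numbered, each (subject, action) policy contributes a bitmap of accessible nodes, and secure retrieval intersects the policy bitmap with the bitmap produced by the document index. The encoding below is a generic illustration, not the paper's exact scheme:

    def bitmap(node_ids, size):
        """Bitset over `size` document nodes, represented as a Python int."""
        bits = 0
        for n in node_ids:
            bits |= 1 << n
        return bits

    NUM_NODES = 8                      # hypothetical, tiny XML document
    # Policy: (subject, action) -> nodes the subject may read.
    policies = {("alice", "read"): bitmap({0, 1, 2, 5}, NUM_NODES),
                ("bob", "read"): bitmap({0, 6}, NUM_NODES)}
    # Index: nodes matching some query path, as an XML index would supply them.
    query_result = bitmap({1, 2, 6}, NUM_NODES)

    def secure_answer(subject, action, result_bits):
        """Nodes in the query result that the subject is authorized to access."""
        allowed = policies.get((subject, action), 0) & result_bits
        return [n for n in range(NUM_NODES) if allowed >> n & 1]

    print(secure_answer("alice", "read", query_result))   # [1, 2]
    print(secure_answer("bob", "read", query_result))     # [6]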
An efficient multiversion access structure for a transaction-time database is presented. Our method requires optimal storage and query times for several important queries and logarithmic update times. Three version operations (inserts, updates, and deletes) are allowed on the current database, while queries are allowed on any version, present or past. The following query operations are performed in optimal query time: key range search, key history search, and time range view. The key-range query retrieves all records having keys in a specified key range at a specified time; the key history query retrieves all records with a given key in a specified time range; and the time range view query retrieves all records that were current during a specified time interval. Special cases of these queries include the key search query, which retrieves a particular version of a record, and the snapshot query, which reconstructs the database at some past time. To the best of our knowledge, no previous multiversion access structure simultaneously supports all these query and version operations within these time and space bounds. The bounds on query operations are worst case per operation, while those for storage space and version operations are (worst-case) amortized over a sequence of version operations. Simulation results show that good storage utilization and query performance are obtained.
By supporting the valid and transaction time dimensions, bitemporal databases represent reality more accurately than conventional databases. The authors examine the issues involved in designing efficient access methods for bitemporal databases, and propose the partial-persistence and the double-tree methodologies. The partial-persistence methodology reduces bitemporal queries to partial persistence problems for which an efficient access method is then designed. The double-tree methodology “sees” each bitemporal data object as consisting of two intervals (a valid-time and a transaction-time interval) and divides objects into two categories according to whether the right endpoint of the transaction time interval is already known. A common characteristic of both methodologies is that they take into account the properties of each time dimension. Their performance is compared with a straightforward approach that “sees” the intervals associated with a bitemporal object as composing one rectangle, which is stored in a single multidimensional access method. Given that some limited additional space is available, the experimental results show that the partial-persistence methodology provides the best overall performance, especially for transaction timeslice queries. For those applications that require ready, off-the-shelf, access methods, the double-tree methodology is a good alternative
A fundamental problem for peer-to-peer (P2P) applications in a mobile-pervasive computing environment is to efficiently identify the node that stores particular data items and download them while preserving battery power. In this paper, we propose the P2P Minimum Boundary Rectangle (PMBR, for short), a new spatial index specifically designed for mobile P2P environments. A node that contains the desired data item(s) can be easily identified by reading the PMBR index. We then propose a selective tuning algorithm, called the Distributed exponential Sequence Scheme (DSS, for short), that provides clients with the ability to selectively tune to data items, thus preserving the scarce power resource. The proposed algorithm is simple but efficient in supporting linear transmission of spatial data and processing of location-aware queries. The results from theoretical analysis and experiments show that the proposed algorithm with the PMBR index is scalable and energy efficient for both range queries and nearest neighbor queries.
In this paper, we address the problem of cache invalidation in mobile and wireless client/server environments. We present cache invalidation techniques that can scale not only to a large number of mobile clients, but also to a large number of data items that can be cached in the mobile clients. We propose two scalable algorithms: the Multidimensional Bit-Sequence (MD-BS) algorithm and the Multilevel Bit-Sequence (ML-BS) algorithm. Both algorithms are based on our prior work on the Basic Bit-Sequences (BS) algorithm. Our study shows that the proposed algorithms are effective for a large number of cached data items with low update rates. The study also illustrates that the algorithms can be used with other complementary techniques to address the problem of cache invalidation for data items with varied update and access rates.
The exact expression for the expected number of disk accesses required to retrieve a given number of records, called the Yao function, requires iterative computations. Several authors have developed approximations to the Yao function, all of which have substantial errors in some situations. We derive and evaluate simple upper and lower bounds that never differ by more than a small fraction of a disk access
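For reference, the exact Yao function for k records drawn at random without replacement from n records stored in m blocks of n/m records each can be computed directly; the bounds derived in the paper replace this iterative product with closed-form expressions. Only the exact formula is sketched here:

    def yao(n, m, k):
        """Expected number of block accesses to fetch k records chosen at random
        (without replacement) from n records stored in m blocks of n/m records
        each: m * (1 - C(n - n/m, k) / C(n, k))."""
        p = n // m                      # records per block (assumes m divides n)
        prob_block_untouched = 1.0
        for i in range(k):
            prob_block_untouched *= (n - p - i) / (n - i)
        return m * (1.0 - prob_block_untouched)

    # Example: retrieving 100 random records out of 10,000 stored in 500 blocks.
    print(round(yao(10_000, 500, 100), 2))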
XML is emerging as a useful platform-independent data representation language. As more and more XML data is shared across data sources, it becomes important to consider the issue of XML access control. One promising approach to store the accessibility information is based on the CAM (compressed accessibility map). We make two advancements in this direction: 1) Previous work suggests that for each user group and each operation type, a different CAM is built. We observe that the performance and storage requirements can be further improved by combining multiple CAMs into an ICAM (integrated CAM). We explore this possibility and propose an integration mechanism. 2) If the change in structure of the XML data is not frequent, we suggest an efficient lookup method, which can be applied to CAMs or ICAMs, with a much lower time complexity compared to the previous approach. We show by experiments the effectiveness of our approach.
Spatial databases, addressing the growing data management and analysis needs of spatial applications such as geographic information systems, have been an active area of research for more than two decades. This research has produced a taxonomy of models for space, spatial data types and operators, spatial query languages and processing strategies, as well as spatial indexes and clustering techniques. However, more research is needed to improve support for network and field data, as well as query processing (e.g., cost models, bulk load). Another important need is to apply spatial data management accomplishments to newer applications, such as data warehouses and multimedia information systems. The objective of this paper is to identify recent accomplishments and associated research needs of the near term
The Internet has emerged as an ever-increasing environment of multiple heterogeneous and autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists might therefore benefit seriously from the design of intelligent software agents that assist in the navigation through this information-rich environment, together with the development of data mining tools that can aid in the discovery of new information. These applications heavily depend upon well-conditioned data samples that are correlated with multiple information sources; hence, accurate database merging operations are desirable. Information systems designed for joining the related knowledge provided by different microbial data sources are hampered by the labeling mechanism for referencing microbial strains and cultures, which suffers from syntactical variation in the practical usage of the labels; additionally, synonymy and homonymy are known to exist amongst the labels. This situation is further complicated by the observation that the label equivalence knowledge is itself fragmentarily recorded over several data sources, which can be suspected of providing information that might be both incomplete and incorrect. This paper presents how the extraction and integration of label equivalence information from several distributed data sources has led to the construction of a so-called integrated strain database, which helps to resolve most of the above problems. Given the fact that information retrieved from autonomous resources might be overlapping, incomplete, and incorrect, much effort was spent on the completion of missing information, the discovery of new associations between information objects, and the development and application of tools for error detection and correction. Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we finally demonstrate the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
A new absorbing multiaction learning automaton that is epsilon-optimal is introduced. It is a hierarchical discretized pursuit nonlinear learning automaton that uses a new algorithm for positioning the actions on the leaves of the hierarchical tree. The proposed automaton achieves the highest performance (speed of convergence, central processing unit (CPU) time, and accuracy) among all the absorbing learning automata reported in the literature up to now. Extensive simulation results indicate the superiority of the proposed scheme. Furthermore, it is proved that the proposed automaton is epsilon-optimal in every stationary stochastic environment
The area under the ROC (receiver operating characteristics) curve, or simply AUC, has been traditionally used in medical diagnosis since the 1970s. It has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. We establish formal criteria for comparing two different measures for learning algorithms and we show theoretically and empirically that AUC is a better measure (defined precisely) than accuracy. We then reevaluate well-established claims in machine learning based on accuracy using AUC and obtain interesting and surprising new results. For example, it has been well-established and accepted that Naive Bayes and decision trees are very similar in predictive accuracy. We show, however, that Naive Bayes is significantly better than decision trees in AUC. The conclusions drawn in this paper may make a significant impact on machine learning and data mining applications.
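AUC can be computed directly from how the scores of positive and negative examples rank against each other (the Mann-Whitney statistic), which is what makes the comparison with accuracy straightforward to reproduce. A minimal sketch:

    def auc(scores, labels):
        """AUC via the rank-sum (Mann-Whitney) formulation: the probability that
        a randomly chosen positive is ranked above a randomly chosen negative.
        Ties between a positive and a negative count as half."""
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        wins = sum(1 for p in pos for n in neg if p > n)
        ties = sum(1 for p in pos for n in neg if p == n)
        return (wins + 0.5 * ties) / (len(pos) * len(neg))

    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = [1,   1,   0,   1,   0,   0]
    print(auc(scores, labels))   # 0.888...: 8 of the 9 positive-negative pairs are ranked correctly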
Analysis of range queries on spatial (multidimensional) data is both important and challenging. Most previous analysis attempts have made certain simplifying assumptions about the data sets and/or queries to keep the analysis tractable. As a result, they may not be universally applicable. This paper proposes a set of five analysis techniques to estimate the selectivity and number of index nodes accessed in serving a range query. The underlying philosophy behind these techniques is to maintain an auxiliary data structure, called a density file, whose creation is a one-time cost, which can be quickly consulted when the query is given. The schemes differ in what information is kept in the density file, how it is maintained, and how this information is looked up. It is shown that one of the proposed schemes, called cumulative density (CD), gives very accurate results (usually less than 5 percent error) using a diverse suite of point and rectangular data sets, that are uniform or skewed, and a wide range of query window parameters. The estimation takes a constant amount of time, which is typically lower than 1 percent of the time that it would take to execute the query, regardless of data set or query window parameters.
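The flavor of the cumulative density idea can be conveyed with a 2D prefix-sum (summed-area) grid over point data: building the grid is a one-time cost, after which the number of points in any axis-aligned window is estimated by inclusion-exclusion in constant time. This is a generic sketch under those assumptions, not the paper's exact density-file layout:

    import numpy as np

    def build_cumulative_density(points, bins, bounds):
        """One-time cost: bucket the points into a grid, then take 2D prefix sums."""
        (xmin, xmax), (ymin, ymax) = bounds
        hist, xedges, yedges = np.histogram2d(
            points[:, 0], points[:, 1], bins=bins, range=[(xmin, xmax), (ymin, ymax)])
        return hist.cumsum(axis=0).cumsum(axis=1), xedges, yedges

    def count_below(cd, xedges, yedges, x, y):
        """Approximate count of points with coordinates below (x, y); the query
        point is snapped to the boundary of its grid cell."""
        i = np.searchsorted(xedges, x) - 1
        j = np.searchsorted(yedges, y) - 1
        if i < 0 or j < 0:
            return 0.0
        return cd[min(i, cd.shape[0] - 1), min(j, cd.shape[1] - 1)]

    def estimate_count(cd, xedges, yedges, window):
        """Constant-time selectivity estimate for an axis-aligned range query."""
        (x1, x2), (y1, y2) = window
        return (count_below(cd, xedges, yedges, x2, y2)
                - count_below(cd, xedges, yedges, x1, y2)
                - count_below(cd, xedges, yedges, x2, y1)
                + count_below(cd, xedges, yedges, x1, y1))

    rng = np.random.default_rng(1)
    pts = rng.random((10_000, 2))
    cd, xe, ye = build_cumulative_density(pts, bins=64, bounds=((0, 1), (0, 1)))
    print(estimate_count(cd, xe, ye, ((0.25, 0.75), (0.25, 0.75))))  # close to 2,500 for uniform data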
Classification of large data sets is an important data mining problem that has wide applications. Jumping emerging patterns (JEPs) are those itemsets whose supports increase abruptly from zero in one data set to nonzero in another data set. In this paper, we propose a fast, accurate, and less complex classifier based on a subset of JEPs, called strong jumping emerging patterns (SJEPs). The support constraint of SJEP removes potentially less useful JEPs while retaining those with high discriminating power. Previous algorithms based on the manipulation of border as well as consEPMiner cannot directly mine SJEPs. In this paper, we present a new tree-based algorithm for their efficient discovery. Experimental results show that: 1) the training of our classifier is typically 10 times faster than earlier approaches, 2) our classifier uses much fewer patterns than the JEP-classifier to achieve a similar (and, often, improved) accuracy, and 3) in many cases, it is superior to other state-of-the-art classification systems such as naive Bayes, CBA, C4.5, and bagged and boosted versions of C4.5. We argue that SJEPs are high-quality patterns which possess the most differentiating power. As a consequence, they represent sufficient information for the construction of accurate classifiers. In addition, we generalize these patterns by introducing noise-tolerant emerging patterns (NEPs) and generalized noise-tolerant emerging patterns (GNEPs). Our tree-based algorithms can be adopted to easily discover these variations. We experimentally demonstrate that SJEPs, NEPs, and GNEPs are extremely useful for building effective classifiers that can deal well with noise.
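The defining property is easy to state concretely: a JEP is an itemset with zero support in one data set and nonzero support in the other, and an SJEP additionally requires its support in the target data set to reach a minimum threshold. The brute-force enumeration below, which ignores the minimality and border machinery of the paper and is nothing like its tree-based miner, just makes the definition explicit:

    from itertools import combinations

    def support(itemset, transactions):
        """Fraction of transactions containing every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def naive_sjeps(neg, pos, min_support, max_len=3):
        """Itemsets with zero support in `neg` and support >= min_support in `pos`.
        Brute force over small itemsets; only for illustrating the definition."""
        items = sorted(set().union(*pos))
        found = []
        for size in range(1, max_len + 1):
            for combo in combinations(items, size):
                itemset = frozenset(combo)
                if support(itemset, neg) == 0 and support(itemset, pos) >= min_support:
                    found.append(itemset)
        return found

    # Hypothetical two-class transaction data.
    neg = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
    pos = [frozenset("abd"), frozenset("ad"), frozenset("bd"), frozenset("abc")]
    print(naive_sjeps(neg, pos, min_support=0.5))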
Spatial data appear in numerous applications, such as GIS, multimedia and even traditional databases. Most of the analysis on spatial data has focused on point data, typically using the uniformity assumption, or, more accurately, a fractal distribution. However, no results exist for nonpoint spatial data, like 2D regions (e.g., islands), 3D volumes (e.g., physical objects in the real world), etc. This is exactly the problem we solve in this paper. Based on experimental evidence that real areas and volumes follow a "power law," that we named REGAL (REGion Area Law), we show 1) the theoretical implications of our model and its connection with the ubiquitous fractals and 2) the first of its practical uses, namely, the selectivity estimation for range queries. Experiments on a variety of real data sets (islands, lakes, and human-inhabited areas) show that our method is extremely accurate, enjoying a maximum relative error ranging from 1 to 5 percent, versus 30-70 percent of a naive model that uses the uniformity assumption
