Article

Cardinality Constraint Automata: A Core Technology for Efficient XML Schema-aware Parsers


Abstract

This article presents a novel class of Finite State Machines called Cardinality Constraint Automata (CCAs). CCAs are especially suited for the construction of XML Schema-aware, validating XML parsers. Parsers built on top of CCAs accept richer semantics for XML Schema's all, derivation-by-extension, and minimal/maximal occurrence concepts, and are nevertheless extremely efficient. The paper explains the CCA concept and shows how CCA-based parsers are generated from XML Schema definitions. An illustrative example is given to enhance the readability of the paper.
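For illustration, the following is a minimal sketch of such an automaton in Python. It assumes a representation in which every transition carries a guard on an occurrence counter; the class layout, the guard discipline, and the example content model a{2,3} b{1,1} are illustrative assumptions, not the construction given in the paper.

# Minimal sketch of a cardinality constraint automaton: a deterministic
# finite automaton whose transitions carry min/max occurrence guards on a
# counter.  The layout below is an assumption for illustration only.

class CCA:
    def __init__(self, start, accepting):
        self.start = start
        self.accepting = accepting   # state -> minimum count required at the end
        self.delta = {}              # (state, symbol) -> (target, guard_min, guard_max)

    def add(self, state, symbol, target, guard_min=0, guard_max=None):
        self.delta[(state, symbol)] = (target, guard_min, guard_max)

    def accepts(self, symbols):
        state, counter = self.start, 0
        for sym in symbols:
            entry = self.delta.get((state, sym))
            if entry is None:
                return False
            target, guard_min, guard_max = entry
            if target == state:                      # repetition of the current particle
                counter += 1
                if guard_max is not None and counter > guard_max:
                    return False                     # maxOccurs exceeded
            else:                                    # leaving the current particle
                if counter < guard_min:
                    return False                     # minOccurs not reached
                state, counter = target, 1
        return state in self.accepting and counter >= self.accepting[state]

# Content model  a{2,3} b{1,1}  (minOccurs/maxOccurs on two elements):
cca = CCA(start="q0", accepting={"qb": 1})
cca.add("q0", "a", "qa")                             # enter the a-particle
cca.add("qa", "a", "qa", guard_max=3)                # at most three a's
cca.add("qa", "b", "qb", guard_min=2)                # at least two a's before b
cca.add("qb", "b", "qb", guard_max=1)                # exactly one b

assert cca.accepts(["a", "a", "b"])
assert not cca.accepts(["a", "b"])                   # too few a's
assert not cca.accepts(["a", "a", "a", "a", "b"])    # too many a's

The self-loop counts repetitions of the current particle against its upper bound, while leaving a particle checks its lower bound; this conveys the basic idea of attaching cardinality constraints to transitions instead of unrolling occurrences into extra states.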


... Grammar-specific parser generation techniques have been shown by several authors [3, 7, 9, 10, 15] to significantly improve XML parsing performance. It is our belief, however, that for most real-world use-cases, from individual ad-hoc scenarios to enterprise-scale business computing environments, the increased tooling and deployment complexity of the compilation model undermines the ease of use and reliability of XML-based technology. ...
... As instruction dispatch overhead is proportional to the number of instructions, we expect that interpreting the actions and predicates directly would lead to a large performance penalty. Several authors [10, 15] avoid limitations of the generalized automata technique through the use of a two-level approach. Cardinality-constraint automata [10] extend deterministic finite automata with cardinality constraints on state transitions. These automata easily handle the encoding of occurrence constraints. ...
Conference Paper
XML delivers key advantages in interoperability due to its flexibility, expressiveness, and platform-neutrality. As XML has become a performance-critical aspect of the next generation of business computing infrastructure, however, it has become increasingly clear that XML parsing often carries a heavy performance penalty, and that current, widely-used parsing technologies are unable to meet the performance demands of an XML-based computing infrastructure. Several efforts have been made to address this performance gap through the use of grammar-based parser generation. While the performance of generated parsers has been significantly improved, adoption of the technology has been hindered by the complexity of compiling and deploying the generated parsers. Through careful analysis of the operations required for parsing and validation, we have devised a set of specialized bytecodes, designed for the task of XML parsing and validation. These bytecodes are designed to engender the benefits of fine-grained composition of parsing and validation that make existing compiled parsers fast, while being coarse-grained enough to minimize interpreter overhead. This technique of using an interpretive, validating parser balances the need for performance against the requirements of simple tooling and robust scalable infrastructure. Our approach is demonstrated with a specialized schema compiler, used to generate bytecodes which in turn drive an interpretive parser. With almost as little tooling and deployment complexity as a traditional interpretive parser, the bytecode-driven parser usually demonstrates performance within 20% of the fastest fully compiled solutions.
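As a rough illustration of the bytecode idea described above, the following sketch interprets a handful of invented validation opcodes over a pre-tokenized input. The opcode set, the token format, and the toy program are assumptions for illustration; they do not reproduce the paper's actual instruction set.

# Minimal sketch of a bytecode-driven validating parser.  The opcodes
# (MATCH_START, MATCH_END, TEXT_INT, HALT) and the flat token stream are
# invented for illustration purposes only.

MATCH_START, MATCH_END, TEXT_INT, HALT = range(4)

def run(program, tokens):
    """Interpret validation bytecodes over a token stream.

    tokens: list of ("start" | "end" | "text", value) pairs, e.g. the
    output of a low-level XML scanner.
    """
    pc, tp = 0, 0
    while True:
        op, arg = program[pc]
        if op == HALT:
            return tp == len(tokens)          # accept only if all input consumed
        if tp >= len(tokens):
            return False
        kind, value = tokens[tp]
        if op == MATCH_START:
            if kind != "start" or value != arg:
                return False
        elif op == MATCH_END:
            if kind != "end" or value != arg:
                return False
        elif op == TEXT_INT:
            if kind != "text":
                return False
            try:
                int(value)                    # xs:int-style lexical check
            except ValueError:
                return False
        pc += 1
        tp += 1

# Bytecodes for a document shaped like <n><x>INT</x></n>:
program = [
    (MATCH_START, "n"), (MATCH_START, "x"),
    (TEXT_INT, None), (MATCH_END, "x"),
    (MATCH_END, "n"), (HALT, None),
]
tokens = [("start", "n"), ("start", "x"), ("text", "42"),
          ("end", "x"), ("end", "n")]
assert run(program, tokens)

The point of the coarse-grained instructions is that one dispatch covers one parsing-plus-validation step, which keeps interpreter overhead small compared with interpreting fine-grained predicates individually.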
... This approach is a burden for the constrained memory resources and doesn't scale. Hence we developed the so-called Cardinality Constraint Automata (CCAs) [8,9] for validation. They are not only highly efficient, but also allow wider semantics for the all construct, including the specification of maximum occurrence constraints. ...
Article
Full-text available
In this paper we present a novel approach to information centric application development for wireless sensor networks acting as a swarm. We introduce the concept of a distributed virtual shared information space based on XML that makes use of a content-based forwarding mechanism we call XCast for information sharing between devices and cardinality constraint automata for validating XML messages efficiently. We then outline a straightforward application development process and conclude with first experiences with regard to code efficiency.
... The extension of schema languages by cardinality constraints has been proposed in several papers, e.g., [117, 35]. They also consider extended automata models with (limited) counting abilities. ...
Article
Automata play an important role for the theoretical foundations of XML data management, but also in tools for various XML processing tasks. This survey article aims to give an overview of fundamental properties of the different kinds of automata used in this area and to relate them to the four key aspects of XML processing: schemas, navigation, querying and transformation.
... They focus on the structure of the elements. In [11], Reuter and Luttenberger develop cardinality constraint automata, which handle occurrence constraints and <all> more efficiently than tree automata. Lexical analysis is still performed separately, however. ...
Article
The validation of XML instances against a schema is usually performed separately from the parsing of the more basic syntactic aspects of XML. We posit, however, that schema information can be used during parsing to improve performance, using what we call schema-specific parsing. This paper develops a framework for schema-specific parsing centered on an intermediate representation we call generalized automata, which abstracts the computational steps necessary to validate against a schema. The generalized automata can then be used to generate optimized code which might be onerous to write manually. We present results that suggest this is a viable approach to high-performance XML parsing.
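The code-generation step can be illustrated with a small sketch that turns an automaton-style intermediate representation into a hard-coded Python validator. The IR layout and the shape of the emitted function are assumptions for illustration, not the generalized-automata model of the paper.

# Illustrative sketch: generate specialized validation code from a simple
# automaton-style intermediate representation (a transition table).

def generate_validator(name, start, accepting, delta):
    """Emit Python source for a hard-coded validator function.

    delta: dict mapping (state, symbol) -> next state.
    """
    lines = [f"def {name}(symbols):", f"    state = {start!r}",
             "    for sym in symbols:"]
    first = True
    for (state, symbol), target in sorted(delta.items()):
        kw = "if" if first else "elif"
        first = False
        lines.append(f"        {kw} state == {state!r} and sym == {symbol!r}:")
        lines.append(f"            state = {target!r}")
    lines.append("        else:")
    lines.append("            return False")
    lines.append(f"    return state in {set(accepting)!r}")
    return "\n".join(lines)

# Content model (ab)*: two states, the start state is also accepting.
src = generate_validator("validate_ab", "s0", {"s0"},
                         {("s0", "a"): "s1", ("s1", "b"): "s0"})
namespace = {}
exec(src, namespace)
assert namespace["validate_ab"](["a", "b", "a", "b"])
assert not namespace["validate_ab"](["a", "a"])

The generated function hard-codes the dispatch that a generic table-driven validator would look up at runtime, which is the kind of repetitive code one would not want to write by hand.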
... A novel kind of finite state machine called "Cardinality Constraint Automata" (CCA) has been developed for this purpose. CCAs are very efficient and at the same time extend the semantics of the W3C XML Schema [1, 2]. Preparatory work was done by the authors and may be reviewed in [3]. ...
Conference Paper
In this paper, we assume that sensor nodes in a wireless sensor network cooperate by posting their information to a distributed virtual shared information space that is built on the basis of advanced XML technology. Using a flooding protocol to maintain the shared information space is an obvious solution, but flooding must be tightly controlled, because sensor nodes suffer from severe resource constraints. For this purpose we propose a content-based flooding control approach ("XCast") whose performance is analyzed in the paper both analytically and by ns2 simulations. Results show that even a generic XCast instance effectively controls the number of messages generated in a sensor network. It is argued that application-specific XCast instances may exhibit even better behavior.
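A content-based flooding control decision of this kind can be sketched as follows. The interest-filter representation (simple key/value predicates) and the duplicate cache are assumptions made for illustration and do not describe the XCast protocol as defined by the authors.

# Minimal sketch of a content-based flooding control decision in the
# spirit of XCast: rebroadcast only unseen messages whose content
# matches the node's local interests.

class Node:
    def __init__(self, interests):
        self.interests = interests        # e.g. {"sensor": {"temperature"}}
        self.seen = set()                 # message ids already forwarded

    def matches(self, content):
        return all(content.get(key) in allowed
                   for key, allowed in self.interests.items())

    def should_forward(self, msg_id, content):
        """Forward a message at most once, and only if it is of interest."""
        if msg_id in self.seen or not self.matches(content):
            return False
        self.seen.add(msg_id)
        return True

node = Node(interests={"sensor": {"temperature"}})
assert node.should_forward(1, {"sensor": "temperature", "value": "21.5"})
assert not node.should_forward(1, {"sensor": "temperature", "value": "21.5"})  # duplicate
assert not node.should_forward(2, {"sensor": "humidity", "value": "40"})       # not of interest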
... Cardinality-constraint automata (CCA) [16] offer an efficient schema-aware XML parsing technique by extending deterministic finite automata with cardinality constraints on state transitions. These automata can easily take care of the occurrence constraints imposed by the schema. ...
Conference Paper
Full-text available
This paper presents an adaptive XML parser that is based on table-driven XML (TDX) parsing technology. This technique can be used for developing extensible high-performance Web services for large complex systems that typically require extensible schemas. The parser integrates scanning, parsing, and validation into a single pass without backtracking by utilizing compact tabular representations of schemas and a push-down automaton (PDA) at runtime. The tabular forms are constructed from a set of schemas or WSDL descriptions through the use of a permutation grammar. The engine is implemented as a PDA-based, table-driven driver and is therefore independent of the XML schemas. When XML schemas are updated or extended, the tabular forms can be regenerated and loaded into the generic engine without redeploying the parser. This adaptive approach balances the need for performance against the requirements of reconstructing and redeploying Web services. Our experiments show that the adaptive parser is usually about 5 times faster than traditional validating parsers, and its performance is within 20% of the fastest fully compiled validating parsers.
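The table-driven, single-pass idea can be illustrated with a small sketch: a generic push-down driver consuming a parse table that stands in for the schema-derived tabular forms. The table layout and the toy grammar are assumptions, not the TDX representation described in the paper.

# Minimal sketch of a generic table-driven push-down check: the engine is
# fixed, the TABLE is the part that would be regenerated from the schema.

# Grammar:  Note -> <note> Body </note> ;  Body -> <to> TEXT </to>
TABLE = {
    ("Note", "<note>"): ["<note>", "Body", "</note>"],
    ("Body", "<to>"):   ["<to>", "TEXT", "</to>"],
}
NONTERMINALS = {"Note", "Body"}

def validate(tokens, start="Note"):
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:
                return False
            stack.extend(reversed(rule))   # expand using the table entry
        else:
            if top != look:                # terminal must match the input
                return False
            pos += 1
    return pos == len(tokens)

assert validate(["<note>", "<to>", "TEXT", "</to>", "</note>"])
assert not validate(["<note>", "</note>"])

Because the driver never changes, a schema update only requires regenerating TABLE, which mirrors the redeployment-free adaptivity claimed above.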
... To construct them, we need a formal model for XML documents. Several formal models have been created [2, 6, 12, 14], each to satisfy a different purpose. So far these models are primarily syntactic in nature and capture information about structure and types. ...
Article
Web services have the potential to dramatically reduce the complexities and costs of software integration projects. The most obvious and perhaps most significant difference between Web services and traditional applications is that Web services use a common communication infrastructure, XML and SOAP, to communicate through the Internet. The method of communication introduces complexities to the problems of verifying and validating Web services that do not exist in traditional software. This paper presents a new approach to testing Web services based on data perturbation. Existing XML messages are modified based on rules defined on the message grammars, and then used as tests. Data perturbation uses two methods to test Web services: data value perturbation and interaction perturbation. Data value perturbation modifies values according to the data type. Interaction perturbation classifies the communication messages into two categories: RPC communication and data communication. At present, this method is restricted to peer-to-peer interactions. The paper presents preliminary empirical evidence of its usefulness.
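Data value perturbation can be sketched as follows, assuming simple type-directed mutation rules. The rules shown are generic boundary-value examples and the helper names are illustrative; they are not the operator set defined in the paper.

# Illustrative sketch of data value perturbation: take an existing XML
# message and mutate leaf values according to their (assumed) data types.

import copy
import xml.etree.ElementTree as ET

def perturb_int(value):
    # Boundary-style mutants for an integer-valued field.
    n = int(value)
    return [str(n + 1), str(n - 1), str(-n), "0", str(2**31 - 1)]

def perturb_string(value):
    # Length/content mutants for a string-valued field.
    return ["", value * 2, value.upper()]

RULES = {"int": perturb_int, "string": perturb_string}

def perturb_message(xml_text, field_types):
    """Yield mutated copies of xml_text, one perturbed field value each.

    field_types: dict mapping tag name -> "int" | "string".
    (For brevity, only the first element with a given tag is perturbed.)
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        rule = RULES.get(field_types.get(elem.tag))
        if rule is None or elem.text is None:
            continue
        for mutant in rule(elem.text):
            clone = copy.deepcopy(root)
            clone.find(".//" + elem.tag).text = mutant
            yield ET.tostring(clone, encoding="unicode")

msg = "<order><qty>3</qty><item>book</item></order>"
tests = list(perturb_message(msg, {"qty": "int", "item": "string"}))  # eight mutants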
... To carry this idea over to information filtering of XML-encoded data, it was necessary to extend the semantics of XML Schema [16, 5]. Preparatory work by the authors can be found in [15]. The function Ψ determines the sending behavior. ...
... Representing the structure of an XML schema is relatively straightforward, because the XML language is inherently structured, but it is not as easy to represent the constraints. Several other researchers have created formal models of XML [4,13,22], but none represent representation constraints, which are important to our research. Our model is based heavily on these previous models and a preliminary version was discussed in a workshop paper [20]. ...
Conference Paper
The eXtensible Markup Language (XML) is widely used to transmit data across the Internet. XML schemas are used to define the syntax of XML messages. XML-based applications can receive messages from arbitrary applications, as long as they follow the protocol defined by the schema. A receiving application must either validate XML messages, process the data in the XML message without validation, or modify the XML message to ensure that it conforms to the XML schema. A problem for developers is how well the application performs the validation, data processing, and, when necessary, transformation. This paper describes and gives examples of a method to generate tests for XML-based communication by modifying and then instantiating XML schemas. The modified schemas are based on precisely defined schema primitive perturbation operators.
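One plausible schema perturbation primitive, relaxing or tightening occurrence constraints, can be sketched as follows. The operator and its details are illustrative assumptions rather than the paper's catalogue of perturbation operators.

# Minimal sketch of a schema perturbation operator: vary the
# minOccurs/maxOccurs values of element declarations in an XML Schema
# document to produce test-schema variants.

import copy
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def perturb_occurrences(schema_text):
    """Yield schema variants, each with one perturbed occurrence constraint.

    For brevity the sketch does not distinguish global from local element
    declarations.
    """
    root = ET.fromstring(schema_text)
    for i, elem in enumerate(root.iter(XS + "element")):
        min_occ = int(elem.get("minOccurs", "1"))
        if min_occ > 0:                               # loosen the lower bound
            clone = copy.deepcopy(root)
            list(clone.iter(XS + "element"))[i].set("minOccurs", str(min_occ - 1))
            yield ET.tostring(clone, encoding="unicode")
        max_occ = elem.get("maxOccurs", "1")
        if max_occ != "unbounded":                    # loosen the upper bound
            clone = copy.deepcopy(root)
            list(clone.iter(XS + "element"))[i].set("maxOccurs", str(int(max_occ) + 1))
            yield ET.tostring(clone, encoding="unicode")

schema = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType><xs:sequence>
      <xs:element name="item" type="xs:string" maxOccurs="3"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""
variants = list(perturb_occurrences(schema))          # four perturbed schemas

Instantiating messages against each perturbed schema then yields test inputs that probe how a receiving application handles near-conforming data.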
Article
With the widespread adoption of SOAP and Web services, XML-based processing, and parsing of XML documents in particular, is becoming a performance-critical aspect of business computing. In such scenarios, XML is often constrained by an XML Schema grammar, which can be used during parsing to improve performance. Although traditional grammar-based parser generation techniques could be applied to the XML Schema grammar, the expressiveness of XML Schema does not lend itself well to the generic intermediate representations associated with these approaches. In this paper we present a method for generating efficient parsers by using the schema component model itself as the representation of the grammar. We show that the model supports the full expressive power of the XML Schema, and we present results demonstrating significant performance improvements over existing parsers.
Conference Paper
Full-text available
In sensor networks [ASSC02] - and also in other environments with small devices - the classical client/server co-operation paradigm no longer seems adequate, for a number of reasons: (1) Sensor nodes communicate via unreliable wireless media
Conference Paper
XML fills a critical role in many software infrastructures such as SOA (Service-Oriented Architecture), Web Services, and Grid Computing. In this paper, we propose a high performance XML parser used as a fundamental component to increase the viability of such infrastructures even for mission-critical business applications. We previously proposed an XML parser based on the notion of differential processing under the hypothesis that XML documents are similar to each other, and in this paper we enhance this approach to achieve higher performance by leveraging static information as well as dynamic information. XML schema languages can represent the static information that is used for optimizing the inside state transitions. Meanwhile, statistics for a set of instance documents are used as dynamic information. These two approaches can be used in complementary ways. Our experimental results show that each of the proposed optimization techniques is effective and the combination of multiple optimizations is especially effective, resulting in a 73.2% performance improvement compared to our earlier work.
Conference Paper
This paper describes an experimental system in which customized high performance XML parsers are prepared using parser generation and compilation techniques. Parsing is integrated with Schema-based validation and deserialization, and the resulting validating processors are shown to be as fast as or in many cases significantly faster than traditional nonvalidating parsers. High performance is achieved by integration across layers of software that are traditionally separate, by avoiding unnecessary data copying and transformation, and by careful attention to detail in the generated code. The effect of API design on XML performance is also briefly discussed.