ACM Transactions on Internet Technology

Published by Association for Computing Machinery
Print ISSN: 1533-5399
XML is emerging as a major standard for representing data on the World Wide Web. Recently, many XML storage models have been proposed to manage XML data, and several benchmarks, including XMark and XMach, have been proposed to assess an XML database's ability to deal with XML queries. We report our first set of results on benchmarking a set of XML database implementations using two XML benchmarks. In general, XML data can be managed as text files, by existing DBMSs, or by so-called native XML engines. We implemented three XML database systems: VXMLR and XParent were built on top of an RDBMS, and XBase was implemented as a native XML engine. For each approach, variations on schema mapping and storage methods were also implemented for comparison.
The goal of this paper is to develop a principled understanding of when it is beneficial to bundle technologies or services whose value is heavily dependent on the size of their user base, i.e., exhibits positive externalities. Of interest is how the joint distribution, and in particular the correlation, of the values users assign to components of a bundle affects its odds of success. The results offer insight and guidelines for deciding when bundling new Internet technologies or services can help improve their overall adoption. In particular, successful outcomes appear to require a minimum level of value correlation.
Most previous analysis of Twitter user behavior has focused on individual information cascades and the social followers graph. We instead study aggregate user behavior and the retweet graph, with a focus on quantitative descriptions. We find that the lifetime tweet distribution is a type-II discrete Weibull stemming from a power-law hazard function; the tweet rate distribution, although asymptotically power law, exhibits a lognormal cutoff over finite sample intervals; and the inter-tweet interval distribution is power law with exponential cutoff. The retweet graph is small-world and scale-free, like the social graph, but is less disassortative and has much stronger clustering. These differences are consistent with it better capturing the real-world social relationships of and trust between users. Beyond just understanding and modeling human communication patterns and social networks, applications to alternative, decentralized microblogging systems, both predicting real-world performance and detecting spam, are discussed.
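As a rough illustration of how a power-law hazard induces such a lifetime distribution, the sketch below constructs the discrete lifetime PMF from a per-step hazard h(t) = c·t^(-alpha). The constants alpha and c are illustrative assumptions, not values fitted in the paper.

```python
def lifetime_pmf(alpha, c, t_max):
    """P(lifetime = t) when the per-step hazard is h(t) = min(1, c * t**-alpha).

    The PMF follows from the survival recursion: die exactly at step t with
    probability S(t-1) * h(t), survive past t with probability S(t-1) * (1 - h(t)).
    """
    pmf, survival = [], 1.0
    for t in range(1, t_max + 1):
        h = min(1.0, c * t ** (-alpha))
        pmf.append(survival * h)   # probability of lifetime exactly t
        survival *= (1.0 - h)      # probability of surviving past t
    return pmf

# Illustrative parameters: a slowly decaying hazard yields a heavy-tailed lifetime.
pmf = lifetime_pmf(alpha=0.5, c=0.1, t_max=10000)
print(f"total mass up to t_max: {sum(pmf):.6f}")
```

Because the hazard decays as a power law but its partial sums diverge, essentially all probability mass is captured within the horizon, and the resulting PMF is monotonically decreasing with a heavy tail.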
We study combinatorial auctions for the secondary spectrum market. In this market, short-term licenses shall be given to wireless nodes for communication in their local neighborhood. In contrast to the primary market, channels can be assigned to multiple bidders, provided that the corresponding devices are well separated such that the interference is sufficiently low. Interference conflicts are described in terms of a conflict graph in which the nodes represent the bidders and the edges represent conflicts such that the feasible allocations for a channel correspond to the independent sets in the conflict graph. In this paper, we suggest a novel LP formulation for combinatorial auctions with conflict graph using a non-standard graph parameter, the so-called inductive independence number. Taking into account this parameter enables us to bypass the well-known lower bound of \Omega(n^{1-\epsilon}) on the approximability of independent set in general graphs with n nodes (bidders). We achieve significantly better approximation results by showing that interference constraints for wireless networks yield conflict graphs with bounded inductive independence number. Our framework covers various established models of wireless communication, e.g., the protocol or the physical model. For the protocol model, we achieve an O(\sqrt{k})-approximation, where k is the number of available channels. For the more realistic physical model, we achieve an O(\sqrt{k} \log^2 n) approximation based on edge-weighted conflict graphs. Combining our approach with the LP-based framework of Lavi and Swamy, we obtain incentive compatible mechanisms for general bidders with arbitrary valuations on bundles of channels specified in terms of demand oracles.
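To make the conflict-graph view concrete, here is a minimal sketch of allocating one channel greedily so that the admitted bidders form an independent set in the conflict graph. Ordered greedy rules of this kind underlie approximation guarantees on graphs with bounded inductive independence number; the concrete bid-descending ordering below is an illustrative choice, not the paper's mechanism.

```python
def greedy_allocation(bids, conflicts):
    """Admit bidders for a single channel in decreasing bid order, skipping any
    bidder that conflicts with an already-admitted one. The admitted set is
    therefore independent in the conflict graph (conflicts is a set of
    frozenset edges)."""
    admitted = []
    for bidder in sorted(bids, key=bids.get, reverse=True):
        if all(frozenset((bidder, other)) not in conflicts for other in admitted):
            admitted.append(bidder)
    return admitted

# Path conflict graph a -- b -- c: the high bidder b blocks both neighbors.
bids = {"a": 5, "b": 10, "c": 4}
conflicts = {frozenset(("a", "b")), frozenset(("b", "c"))}
print(greedy_allocation(bids, conflicts))   # -> ['b']
```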
Recently several authors have proposed stochastic evolutionary models for the growth of the web graph and other networks that give rise to power-law distributions. These models are based on the notion of preferential attachment leading to the ``rich get richer'' phenomenon. We present a generalisation of the basic model by allowing deletion of individual links and show that it also gives rise to a power-law distribution. We derive the mean-field equations for this stochastic model and show that by examining a snapshot of the distribution at the steady state of the model, we are able to tell whether any link deletion has taken place and estimate the link deletion probability. Our model enables us to gain some insight into the distribution of inlinks in the web graph, in particular it suggests a power-law exponent of approximately 2.15 rather than the widely published exponent of 2.1.
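A toy simulation of the generalized model can be sketched as follows: each step either deletes a uniformly random existing link or adds a new node whose outlink attaches preferentially. The attachment kernel (in-degree + 1) and all parameters are illustrative assumptions, not the paper's mean-field calibration.

```python
import random

def simulate(steps, p_del, seed=0):
    """Preferential attachment with link deletion. With probability p_del a
    step removes a uniformly random link; otherwise a new node arrives and
    links to an existing node chosen with probability proportional to
    (in-degree + 1), so nodes with no inlinks can still be chosen."""
    rng = random.Random(seed)
    indeg = [0]          # node id -> in-degree
    edges = []           # list of (src, dst) links
    for _ in range(steps):
        if edges and rng.random() < p_del:
            _, dst = edges.pop(rng.randrange(len(edges)))
            indeg[dst] -= 1
        else:
            indeg.append(0)
            src = len(indeg) - 1
            dst = rng.choices(range(len(indeg) - 1),
                              weights=[d + 1 for d in indeg[:-1]])[0]
            edges.append((src, dst))
            indeg[dst] += 1
    return indeg, edges

indeg, edges = simulate(steps=3000, p_del=0.2)
print(f"nodes={len(indeg)} links={len(edges)} max in-degree={max(indeg)}")
```

Comparing the empirical in-degree distribution for different values of p_del against the no-deletion baseline is the experiment suggested by the snapshot argument in the abstract.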
Example of a situation in which options are desirable for both buyer and seller. 
The seller should fix the exercise prices at certain thresholds (in this case 2.5), where the synergy buyer participates in more auctions. 
Examples of auction sequences used in simulation: A(left): An auction sequence from Sect. 5.1. B(right): An auction sequence from Sect. 5.2. 
Percentage increase in profits with multiple synergy buyers. A. Two synergy buyers interact indirectly via the same exercise prices (see Fig. 3A). B. The buyers compete directly (Fig. 3B). C. Larger setting, random market entry of synergy buyers 
This paper studies the benefits of using priced options for solving the exposure problem that bidders with valuation synergies face when participating in multiple, sequential auctions. We consider a model in which complementary-valued items are auctioned sequentially by different sellers, who have the choice of either selling their good directly or through a priced option, after fixing its exercise price. We analyze this model from a decision-theoretic perspective and we show, for a setting where the competition is formed by local bidders, that using options can increase the expected profit for both buyers and sellers. Furthermore, we derive the equations that provide minimum and maximum bounds between which a synergy buyer’s bids should fall in order for both sides to have an incentive to use the options mechanism. Next, we perform an experimental analysis of a market in which multiple synergy bidders are active simultaneously.
Benchmarking the performance of public cloud providers is a common research topic. Previous research has already extensively evaluated the performance of different cloud platforms for different use cases, and under different constraints and experiment setups. In this paper, we present a principled, large-scale literature review to collect and codify existing research regarding the predictability of performance in public Infrastructure-as-a-Service (IaaS) clouds. We formulate 15 hypotheses relating to the nature of performance variations in IaaS systems, to the factors of influence of performance variations, and how to compare different instance types. In a second step, we conduct extensive real-life experimentation on Amazon EC2 and Google Compute Engine to empirically validate those hypotheses. At the time of our research, performance in EC2 was substantially less predictable than in GCE. Further, we show that hardware heterogeneity is in practice less prevalent than anticipated by earlier research, while multi-tenancy has a dramatic impact on performance and predictability.
The Infrastructure-as-a-Service (IaaS) model of cloud computing is a promising approach towards building elastically scaling systems. Unfortunately, building such applications today is a complex, repetitive and error-prone endeavor, as IaaS does not provide any abstraction on top of naked virtual machines. Hence, all functionality related to elasticity needs to be implemented anew for each application. In this paper, we present JCloudScale, a Java-based middleware that supports building elastic applications on top of a public or private IaaS cloud. JCloudScale makes it easy to bring applications to the cloud, with minimal changes to the application code. We discuss the general architecture of the middleware as well as its technical features, and evaluate our system with regard to both user acceptance (based on a user study) and performance overhead. Our results indicate that JCloudScale indeed allowed many participants to build IaaS applications more efficiently, comparable to the convenience features provided by industrial Platform-as-a-Service (PaaS) solutions. However, unlike PaaS, using JCloudScale does not lead to a loss of control or vendor lock-in for the developer.
How did we get from a world where cookies were something you ate and where "non-techies" were unaware of "Netscape cookies" to a world where cookies are a hot-button privacy issue for many computer users? This paper will describe how HTTP "cookies" work, and how Netscape's original specification evolved into an IETF Proposed Standard. I will also offer a personal perspective on how what began as a straightforward technical specification turned into a political flashpoint when it tried to address non-technical issues such as privacy.
Understanding the factors that impact the popularity dynamics of social media can drive the design of effective information services, besides providing valuable insights to content generators and online advertisers. Taking YouTube as case study, we analyze how video popularity evolves since upload, extracting popularity trends that characterize groups of videos. We also analyze the referrers that lead users to videos, correlating them, along with features of the video and early popularity measures, with the popularity trend and the total observed popularity the video will experience. Our findings provide fundamental knowledge about popularity dynamics and its implications for services such as advertising and search.
Representatives of several Internet service providers (ISPs) have expressed their wish to see a substantial change in the pricing policies of the Internet. In particular, they would like to see content providers (CPs) pay for use of the network, given the large amount of resources they use. This would be in clear violation of the "network neutrality" principle that had characterized the development of the wireline Internet. Our first goal in this paper is to propose and study possible ways of implementing such payments and of regulating their amount. We introduce a model that includes the users' behavior, the utilities of the ISP and of the CPs, and the monetary flow that involves the content users, the ISP and CP, and in particular, the CP's revenues from advertisements. We consider various game models and study the resulting equilibria; they are all combinations of a noncooperative game (in which the ISPs and CPs determine how much they will charge the users) with a "cooperative" one on how the CP and the ISP share the payments. We include in our model a possible asymmetric weighting parameter (that varies between zero and one). We also study equilibria that arise when one of the CPs colludes with the ISP. Finally, we study two dynamic game models and examine the convergence of prices to the equilibrium values.
The Kerberos authentication protocol.
Different combinations of policy specification and decision making with respect to (de)centralization.
Delegation as part of the low-level policy system.
Maximizing local autonomy by delegating functionality to end nodes when possible (the end-to-end design principle) has led to a scalable Internet. Scalability and the capacity for distributed control have unfortunately not extended well to resource access-control policies and mechanisms. Yet management of security is becoming an increasingly challenging problem in no small part due to scaling up of measures such as number of users, protocols, applications, network elements, topological constraints, and functionality expectations. In this article, we discuss scalability challenges for traditional access-control mechanisms at the architectural level and present a set of fundamental requirements for authorization services in large-scale networks. We show why existing mechanisms fail to meet these requirements and investigate the current design options for a scalable access-control architecture. We argue that the key design options to achieve scalability are the choice of the representation of access control policy, the distribution mechanism for policy, and the choice of the access-rights revocation scheme. Although these ideas have been considered in the past, current access-control systems in use continue to use simpler but restrictive architectural models. With this article, we hope to influence the design of future access-control systems towards more decentralized and scalable mechanisms.
We propose the network early warning system (NEWS) to protect servers and networks from flash crowds, which usually happen when too many requests are sent to a web site simultaneously. NEWS is a self-tuning admission control mechanism, which imposes application-level congestion control (AppCC) between requests and responses. NEWS detects flash crowds from changes in the web response rate. Based on these application-level observations, NEWS adjusts the admitted request rate automatically and adaptively. Simulation results show that NEWS detects flash crowds within 10 minutes (about 2--3 detection intervals). By delaying 56% of requests, NEWS is able to reduce the packet drop rate for responses from 17% to 1%. The aggregate response rate for admitted requests is twice as high with NEWS as without. This performance is similar to that of the best possible rate limiter.
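The abstract does not give NEWS's exact update rule, so the following is only a hedged sketch of a generic adaptive admission controller in the same spirit: cut the admitted rate when the observed response rate drops sharply, and probe upward otherwise. All thresholds and step sizes are hypothetical.

```python
def adjust_rate(admit_rate, response_rate, prev_response_rate,
                decrease=0.5, increase=5.0, floor=1.0):
    """Illustrative admission-rate update (not the paper's exact rule).

    A sharp drop in the web response rate (here, below 70% of the previous
    measurement) is treated as a flash-crowd signal and triggers a
    multiplicative decrease of the admitted request rate; otherwise the
    controller recovers with a small additive increase.
    """
    if prev_response_rate > 0 and response_rate < 0.7 * prev_response_rate:
        return max(floor, admit_rate * decrease)   # congestion detected
    return admit_rate + increase                    # cautious recovery

rate = adjust_rate(100.0, response_rate=30.0, prev_response_rate=90.0)
print(rate)   # -> 50.0 (rate halved after the detected drop)
```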
Web submission form
Sample linkbase in HTML
Architecture overview
Syllabus study timings
xlinkit is a lightweight application service that provides rule-based link generation and checks the consistency of distributed Web content. It leverages standard Internet technologies, notably XML, XPath, and XLink. xlinkit can be used as part of a consistency management scheme or in applications that require smart link generation, including portal construction and management of large document repositories. In this article we show how consistency constraints can be expressed and checked. We describe a novel semantics for first-order logic that produces links instead of truth values and give an account of our content management strategy. We present the architecture of our service and the results of two substantial case studies that use xlinkit for checking course syllabus information and for validating UML models supplied by industrial partners.
WASH/CGI is an embedded domain-specific language for server-side Web scripting. Due to its reliance on the strongly typed, purely functional programming language Haskell as a host language, it is highly flexible and at the same time provides extensive guarantees due to its pervasive use of type information. WASH/CGI can be structured into a number of sublanguages addressing different aspects of the application. The document sublanguage provides tools for the generation of parameterized XHTML documents and forms. Its typing guarantees that almost all generated documents are valid XHTML documents. The session sublanguage provides a session abstraction with a transparent notion of session state and allows the composition of documents and Web forms to entire interactive scripts. Both are integrated with the widget sublanguage which describes the communication (parameter passing) between client and server. It imposes a simple type discipline on the parameters that guarantees that forms posted by the client are always understood by the server. That is, the server never asks for data not submitted by the client and the data submitted by the client has the type requested by the server. In addition, parameters are received in their typed internal representation, not as strings. Finally, the persistence sublanguage deals with managing shared state on the server side as well as individual state on the client side. It presents shared state as an abstract data type, where the script can control whether it wants to observe mutations due to concurrently executing scripts. It guarantees that states from different interaction threads cannot be confused.
Many online data sources are updated autonomously and independently. In this paper, we make the case for estimating the change frequency of the data, to improve web crawlers, web caches and to help data mining. We first identify various scenarios, where different applications have different requirements on the accuracy of the estimated frequency. Then we develop several "frequency estimators" for the identified scenarios. In developing the estimators, we analytically show how precise/effective the estimators are, and we show that the estimators that we propose can improve precision significantly.
1 Introduction
With the explosive growth of the Internet, many data sources are available online. Most of the data sources are autonomous and are updated independently of the clients that access the sources. For instance, popular news web sites, such as CNN and NY Times, update their contents periodically, whenever there are new developments. Also, many online stores update the price/availab...
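To illustrate why a naive estimator is biased, consider checking a page at fixed intervals: at most one change can be detected per interval, so frequent changes are undercounted. The sketch below contrasts the naive ratio with a Poisson-based estimator inferred from the fraction of no-change intervals; the 0.5 correction terms are a bias-corrected variant in the spirit of this line of work, not necessarily the paper's exact formula.

```python
import math

def naive_freq(changes_seen, accesses, interval):
    """Naive estimate: detected changes per unit time. Biased low, because at
    most one change can be detected per access interval."""
    return changes_seen / (accesses * interval)

def poisson_freq(changes_seen, accesses, interval):
    """Assume changes arrive as a Poisson process: the chance of a quiet
    interval is exp(-rate * interval), so the rate can be recovered from the
    fraction of accesses that detected no change (with small-sample
    correction terms that are illustrative here)."""
    no_change = accesses - changes_seen
    return -math.log((no_change + 0.5) / (accesses + 0.5)) / interval

# A page changing ~2 times/day, checked daily for 30 days, shows a change on
# roughly 26 of the 30 checks. The naive ratio badly underestimates the rate.
print(naive_freq(26, 30, 1.0), poisson_freq(26, 30, 1.0))
```

Here the naive estimate is about 0.87 changes/day while the Poisson-based estimate recovers roughly 1.9, much closer to the true rate of 2.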
This article provides a detailed implementation study on the behavior of web servers that serve static requests where the load fluctuates over time (transient overload). Various external factors are considered, including WAN delays and losses and different client behavior models. We find that performance can be dramatically improved via a kernel-level modification to the web server to change the scheduling policy at the server from the standard FAIR (processor-sharing) scheduling to SRPT (shortest-remaining-processing-time) scheduling. We find that SRPT scheduling induces no penalties. In particular, throughput is not sacrificed and requests for long files experience only negligibly higher response times under SRPT than they did under the original FAIR scheduling.
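SRPT dispatch itself is simple to state: always serve the request with the least remaining work. A minimal sketch (request sizes in bytes; the study's kernel-level socket scheduling is not modeled here):

```python
import heapq

class SRPTQueue:
    """Shortest-remaining-processing-time dispatch for static file requests:
    always hand out the request with the fewest remaining bytes, breaking
    ties by arrival order via a monotonically increasing sequence number."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def add(self, request_id, remaining_bytes):
        heapq.heappush(self._heap, (remaining_bytes, self._seq, request_id))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = SRPTQueue()
q.add("big", 1_000_000)
q.add("small", 2_048)
print(q.next_request())   # -> small (the short request is served first)
```

The intuition behind the article's result is visible even in this toy: short requests jump ahead of long ones, while a long request loses only a little, since the short work it waits for is small by definition.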
XDuce is a statically typed programming language for XML processing. Its basic data values are XML documents, and its types (so-called regular expression types) directly correspond to document schemas. XDuce also provides a flexible form of regular expression pattern matching, integrating conditional branching, tag checking, and subtree extraction, as well as dynamic typechecking. We survey the principles of XDuce's design, develop examples illustrating its key features, describe its foundations in the theory of regular tree automata, and present a complete formal definition of its core, along with a proof of type safety.
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. We draw for this presentation from the literature, and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts, and the results of several performance analyses we conducted to compare different designs.
Algorithmic tools for searching and mining the Web are becoming increasingly sophisticated and vital. In this context, algorithms that use and exploit structural information about the Web perform better than generic methods in both efficiency and reliability. We present an extensive characterization of the graph structure of the Web, with a view to enabling high-performance applications that make use of this structure. In particular, we show that the Web emerges as the outcome of a number of essentially independent stochastic processes that evolve at various scales. A striking consequence of this scale invariance is that the structure of the Web is "fractal": cohesive subregions display the same characteristics as the Web at large. An understanding of this underlying fractal nature is therefore applicable to designing data services across multiple domains and scales. We describe potential applications of this line of research to optimized algorithm design for Web-scale data analysis.
The Globe Distribution Network (GDN) is an application for the efficient, worldwide distribution of freely redistributable software packages. Distribution is made efficient by encapsulating the software into special distributed objects which efficiently replicate themselves near to the downloading clients. The Globe Distribution Network takes a novel, optimistic approach to stop the illegal distribution of copyrighted and illicit material via the network. Instead of having moderators check the packages at upload time, illegal content is removed and its uploader's access to the network permanently revoked only when the violation is discovered. Other protective measures defend the GDN against internal and external attacks to its availability. By exploiting the replication of the software and using fault-tolerant server software, the Globe Distribution Network achieves high availability. A prototype implementation of the GDN is available from
Emerging cloud services, including mobile offices, Web-based storage services, and content delivery services, run diverse workloads under various device platforms, networks, and cloud service providers. They have been realized on top of SSL/TLS, which is the de facto protocol for end-to-end secure communication over the Internet. In an attempt to achieve a cognitive SSL/TLS with heterogeneous environments (device, network, and cloud) and workload awareness, we thoroughly analyze SSL/TLS-based data communication and identify three critical mismatches in a conventional SSL/TLS-based data transmission. The first mismatch is the performance of loosely coupled encryption-compression and communication routines that lead to underutilized computation and communication resources. The second mismatch is that the conventional SSL/TLS only provides a static compression mode, irrespective of the dynamically changing status of each SSL/TLS connection and the computing power gap between the cloud service provider and diverse device platforms. The third is the memory allocation overhead due to frequent compression switching in the SSL/TLS. As a remedy to these rudimentary operations, we present a system called an Adaptive Cryptography Plugged Compression Network (ACCENT) for SSL/TLS-based cloud services. It comprises the following three novel mechanisms, each of which aims to provide optimal SSL/TLS communication and maximize the network transfer performance of an SSL/TLS protocol stack: tightly-coupled threaded SSL/TLS coding, floating scale-based adaptive compression negotiation, and unified memory allocation for seamless compression switching. We implemented and tested the mechanisms in OpenSSL-1.0.0. ACCENT is integrated into the Web-interface layer and SSL/TLS-based secure storage service within a real cloud computing service, called iCubeCloud, as the key primitive for SSL/TLS-based data delivery over the Internet.
Plot comparing accuracy, coverage and model size with the order of Markov model.
The problem of predicting a user's behavior on a Web site has gained importance due to the rapid growth of the World Wide Web and the need to personalize and influence a user's browsing experience. Markov models and their variations have been found to be well suited for addressing this problem. Of the different variations of Markov models, it is generally found that higher-order Markov models display high predictive accuracies on Web sessions that they can predict. However, higher-order models are also extremely complex due to their large number of states, which increases their space and run-time requirements. In this article, we present different techniques for intelligently selecting parts of different order Markov models so that the resulting model has a reduced state complexity, while maintaining a high predictive accuracy.
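A k-th order Markov predictor of the kind discussed above can be sketched in a few lines: count, for every length-k context of pages, which page followed it, and predict the most frequent successor. The pruning techniques that are the article's contribution are not reproduced here; the page names are hypothetical.

```python
from collections import Counter, defaultdict

def train(sessions, order):
    """Fit a k-th order Markov model over page sequences: for each length-k
    context, count the page that followed it."""
    model = defaultdict(Counter)
    for s in sessions:
        for i in range(len(s) - order):
            model[tuple(s[i:i + order])][s[i + order]] += 1
    return model

def predict(model, context):
    """Return the most likely next page for a context, or None if the context
    was never observed (the coverage problem of higher-order models)."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

sessions = [["home", "news", "sports"], ["home", "news", "sports"],
            ["home", "shop", "cart"]]
m = train(sessions, order=2)
print(predict(m, ["home", "news"]))   # -> sports
```

The space/coverage trade-off is visible here: each extra order multiplies the number of contexts (states) the model must store, which is exactly the state complexity the article's selective techniques reduce.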
The large diffusion of e-learning technologies represents a great opportunity for underserved segments of the population. This is particularly true for people with disabilities, for whom digital barriers must be overcome in order to reengage them in society and education. In essence, before a mass of learners can be engaged in a collective educational process, each single member should be put in the position to enjoy accessible and customized educational experiences, regardless of the wide diversity of their personal characteristics and technological equipment. To respond to this demand, we developed LOT (Learning Object Transcoder), a distributed PHP-based service-oriented system designed to deliver flexible and customized educational services for a multitude of learners, each with his/her own diverse preferences and needs. The main novelty of LOT is a brokering service able to manage the transcoding activities needed to convert multimedia digital material into the form that best fits a given student profile. Transcoding activities are performed based on the use of Web service technologies. Experimental results gathered from several field trials with LOT (available online at∼lot/) have confirmed the viability of our approach.
Recent years have witnessed the emergence and rapid development of collaborative Web-based applications exemplified by Web-based office productivity applications. One major challenge in building these applications is maintaining data consistency while meeting the requirements of fast local response, total work preservation, unconstrained interaction, and customizable collaboration mode. These requirements are important in determining users’ experiences in interaction and collaboration, and in meeting users’ diverse needs under complex and dynamic collaboration and networking environments; but none of the existing solutions meets all of them. In this article, we present a data consistency maintenance solution capable of meeting these requirements for collaborative Web-based applications. Major technical contributions include an efficient sequence-based operation transformation control algorithm based on the concept of contextualization, an operation broadcast protocol for supporting a variety of collaboration modes, an operation replaying algorithm for ensuring fast local response and efficient remote operation replay, and a set of communication protocols for managing the integrity of collaborative Web-based sessions. The proposed solution has been implemented in a prototype collaborative Web-based editor WRACE and the correctness of the solution is formally verified in the article.
This article presents a Web content adaptation and delivery mechanism based on application-level quality of service (QoS) policies. To realize effective Web content delivery for users, two kinds of application-level QoS policies, transmission time and transmission order of inline objects, are introduced. Next, we define a language to specify these policies. We show that transmission order control can be implemented using HTTP/1.1 pipelined requests in which a client recognizes the transmission order description in a Web page and simulates parallel transmission of inline objects by HTTP/1.1 range requests. Experimental results show that our proposed mechanism realizes effective content delivery to a diverse group of Internet users. Finally, we introduce two methods to specify application-level QoS policies, one by content authors, and the other by end users.
We develop a novel framework that aims at automatically adapting previously learned information extraction knowledge from a source Web site to a new unseen target site in the same domain. Two kinds of features related to the text fragments from the Web documents are investigated. The first type of feature is called a site-invariant feature. These features likely remain unchanged in Web pages from different sites in the same domain. The second type of feature is called a site-dependent feature. These features are different in the Web pages collected from different Web sites, while they are similar in the Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and the items previously collected or extracted from the source Web site. The derived site-invariant features will be exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent features and the site-invariant features of these automatically discovered training examples will be considered in the learning of new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
User-adaptive applications cater to the needs of each individual computer user, taking for example users' interests, level of expertise, preferences, perceptual and motoric abilities, and the usage environment into account. Central user modeling servers collect and process the information about users that different user-adaptive systems require to personalize their user interaction. Adaptive systems are generally better able to cater to users the more data their user modeling systems collect and process about them. They therefore gather as much data as possible and "lay them in stock" for possible future usage. Moreover, data collection usually takes place without users' initiative and sometimes even without their awareness, in order not to cause distraction. Both practices are in conflict with users' privacy concerns that became manifest in numerous recent consumer polls, and with data protection laws and guidelines that call for parsimony, purpose-orientation, and user notification or user consent when personal data are collected and processed. This article discusses security requirements to guarantee privacy in user-adaptive systems and explores ways to keep users anonymous while fully preserving personalized interaction with them. User anonymization in personalized systems goes beyond current models in that not only users must remain anonymous, but also the user modeling system that maintains their personal data. Moreover, users' trust in anonymity can be expected to lead to more extensive and frank interaction, hence to more and better data about the user, and thus to better personalization. A reference model for pseudonymous and secure user modeling is presented that meets many of the proposed requirements.
Network Address Translation (NAT) alleviates the shortage of IPv4 addresses but causes problems with peer-to-peer communication, application functionality, and packet integrity. To date, no approach has been proposed that solves all three problems. By exploiting mobile agent and active networking technologies, we propose a Programmable Network Address Translation (PNAT) implementation that enables peer-to-peer communication while maintaining application functionality and packet integrity. For peer-to-peer communication, our proposed PNAT approach works for various NAT types (including the Symmetric NAT) with simple APIs supported by our proposed NAT design. For application functionality, the PNAT uses the mobile code to update protocol information in packet payloads according to different application needs. For packet integrity, the PNAT allows applications to delay their data encryption until NAT begins to translate addresses and ports in packet headers. To validate our proposed PNAT approach, we implemented the PNAT design on Windows 2000, and we present an empirical performance evaluation of the implemented design.
Despite the slowdown in the economy, advertisement revenue remains a significant source of income for many Internet-based organizations. Banner advertisements form a critical component of this income, accounting for 40 to 50 percent of total revenue. There are considerable gains to be realized through the efficient scheduling of banner advertisements. This problem has proved intractable for traditional optimization techniques and has received only limited attention in the literature. This paper presents a procedure for generating advertisement schedules under the most commonly used advertisement pricing scheme, the CPM model. The solution approach is based on Lagrangean decomposition and provides extremely good advertisement schedules in a relatively short period of time, taking only a few hundred seconds of elapsed time on a 450 MHz PC compared to the few thousand seconds of CPU time on a workstation that other approaches require. Additionally, this approach can be incorporated into an actual implementation with minimal alterations and hence is of particular practical interest.
To participate in sponsored search online advertising, an advertiser bids on a set of keywords relevant to his/her product or service. When one of these keywords matches a user search string, the ad is considered for display among the sponsored search results. Advertisers compete for the positions in which their ads appear, as higher slots typically result in more user clicks. All existing position-allocation mechanisms charge more per click for a higher slot, so an advertiser must decide whether to bid high and receive more, but more expensive, clicks, or to bid low and receive fewer, but cheaper, ones. In this work, we propose a novel methodology for building forecasting landscapes that relate an individual advertiser's bid to the expected click-through rate and/or the expected daily click volume. Displaying such landscapes is currently offered as a service to advertisers by all major search engine providers, and such landscapes are expected to be instrumental in helping advertisers devise their bidding strategies. We propose a triply monotone regression methodology. We start by applying the current state-of-the-art monotone regression solution. We then propose to condition on the ad position and to estimate the bid-position and position-click effects separately. While the latter translates into a standard monotone regression problem, we devise a novel solution to the former based on approximate maximum likelihood. We show that our proposal significantly outperforms the standard monotone regression solution, while the latter similarly improves upon routinely used ad-hoc methods. Lastly, we discuss other e-commerce applications of the proposed intermediate-variable regression methodology.
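The monotone regression baseline that this work builds on, fitting (for example) click-through rate as a non-decreasing function of bid, is conventionally solved with the pool-adjacent-violators algorithm. The following is a minimal illustrative sketch of that standard algorithm, not the article's triply monotone estimator; all names are ours:

```python
def pava(y, w=None):
    """Pool-Adjacent-Violators: least-squares fit to y that is non-decreasing."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []   # each block: [weighted mean, total weight]
    counts = []   # number of original points merged into each block
    for yi, wi in zip(y, w):
        blocks.append([yi, wi])
        counts.append(1)
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            c2 = counts.pop()
            c1 = counts.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
            counts.append(c1 + c2)
    fit = []
    for (m, _), c in zip(blocks, counts):
        fit.extend([m] * c)
    return fit
```

Given observed click-through rates ordered by increasing bid, `pava` returns the closest non-decreasing sequence, e.g. `pava([1, 3, 2, 4])` pools the violating pair into `[1, 2.5, 2.5, 4]`.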
Network analysis has proved to be very useful in many social and natural sciences, and Small World topologies in particular have been exploited in many application fields. In this article, we focus on P2P file-sharing applications, where spontaneous communities of users are studied and analyzed. We define a family of structures that we call “Affinity Networks” (or Affinity Graphs) that show self-organized, interest-based clusters. Empirical evidence shows that affinity networks are small worlds and exhibit scale-free features. The relevance of this finding is augmented by the introduction of a proactive recommendation scheme, namely DeHinter, that exploits this natural feature. The intuition behind this scheme is that a user would trust her network of “elective affinities” more than anonymous and generic suggestions made by impersonal entities. The accuracy of the recommendation is evaluated by way of 10-fold cross-validation, and a prototype has been implemented to gather further feedback from users.
We study management strategies for main memory database clusters that are interposed between Internet applications and back-end databases as content caches. The task of management is to allocate data across individual cache databases and to route queries to the appropriate databases for execution. The goal is to maximize effective cache capacity and to minimize synchronization cost. We propose an affinity-based management system for main memory database cLUsters (ALBUM). ALBUM executes each query in two stages in order to take advantage of the query affinity observed in a wide range of applications. We evaluate the data/query distribution strategy in ALBUM with a set of trace-based simulations. The results show that ALBUM reduces the cache miss ratio by a factor of 1.7 to 9 over alternative strategies. We have implemented a prototype of ALBUM and compared its performance to that of an existing infrastructure: a fully replicated database with a large buffer cache. The results show that ALBUM outperforms the existing infrastructure with the same number of server machines by a factor of 2 to 7, and that ALBUM with only 1/3 to 1/2 of the server machines achieves the same throughput as the existing infrastructure.
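The affinity idea behind ALBUM, routing queries with the same template to the same cache node so that the data they touch clusters there, can be illustrated with a simple hash-based router. This is our own minimal sketch with a crude normalizer and assumed names, not ALBUM's actual two-stage execution:

```python
import hashlib
import re

def query_template(sql):
    """Strip literal values so queries differing only in constants share a
    template (a crude illustrative normalizer, not a real SQL parser)."""
    return re.sub(r"\b\d+\b|'[^']*'", "?", sql.lower()).strip()

def route(sql, nodes):
    """Deterministically map a query's template to one cache node, so all
    queries of that template (and hence their data) land on the same node."""
    t = query_template(sql)
    h = int(hashlib.sha1(t.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]
```

Because routing depends only on the template, `SELECT * FROM items WHERE id = 7` and `SELECT * FROM items WHERE id = 42` are sent to the same cache database.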
Although personalization and ubiquity are key properties for on-line services, they challenge the development of these systems due to the complexity of the required architectures. In particular, the current infrastructures for the development of personalized, ubiquitous services are not flexible enough to accommodate the configuration requirements of the various application domains. To address such issues, highly configurable infrastructures are needed. In this article, we describe Seta2000, an infrastructure for the development of recommender systems that support personalized interactions with their users and are accessible from different types of devices (e.g., desktop computers and mobile phones). The Seta2000 infrastructure offers a built-in recommendation engine, based on a multi-agent architecture. Moreover, the infrastructure supports the integration of heterogeneous software and the development of agents that can be configured to offer specialized facilities within a recommender system, but also to dynamically enable and disable such facilities, depending on the requirements of the application domain. The Seta2000 infrastructure has been exploited to develop two prototypes: SeTA, an adaptive Web store that personalizes the recommendation and presentation of products on the Web, and INTRIGUE, a personalized, ubiquitous information system that suggests attractions to possibly heterogeneous tourist groups.
This paper reports on the development of a heuristic decision-making framework that an autonomous agent can exploit to tackle the problem of bidding across multiple auctions with varying start and end times and with varying protocols (including English, Dutch, and Vickrey). The framework is flexible, configurable, and enables the agent to adopt varying tactics and strategies that attempt to ensure that the desired item is delivered in a manner consistent with the user's preferences. Given this large space of possibilities, we employ a genetic algorithm to search (offline) for effective strategies in common classes of environment. The strategies that emerge from this evolution are then codified into the agent's reasoning behaviour so that it can select the most appropriate strategy to employ in its prevailing circumstances. The proposed framework has been implemented in a simulated marketplace environment and its effectiveness has been empirically demonstrated.
New services based on the best-effort paradigm could complement the current deterministic services of an electronic financial exchange. Four crucial aspects of such systems would benefit from a hybrid stance: proper use of processing resources, bandwidth management, fault tolerance, and exception handling. We argue that a more refined view on Quality-of-Service control for exchange systems, in which the principal ambition of upholding a fair and orderly marketplace is left uncompromised, would benefit all interested parties.
The amount of attention space available for recommending suppliers to consumers on e-commerce sites is typically limited. We present a competitive distributed recommendation mechanism, based on adaptive software agents, for efficiently allocating the "consumer attention space," or banners. In the example of an electronic shopping mall, the task is delegated to the individual shops, each of which evaluates the information that is available about the consumer and his or her interests (e.g., keywords, product queries, and available parts of a profile). Shops make monetary bids in an auction in which a limited amount of "consumer attention space" for the arriving consumer is sold. Each shop is represented by a software agent that bids on each consumer, which allows shops to rapidly adapt their bidding strategies to focus on consumers interested in their offerings. For various simple models of on-line consumers, shops, and profiles, we demonstrate the feasibility of our system through evolutionary simulations, as in the field of agent-based computational economics (ACE). We also develop adaptive software agents that learn bidding strategies based on neural networks and strategy exploration heuristics. Furthermore, we address the commercial and technological advantages of this distributed, market-based approach. The mechanism we describe is not limited to the example of the electronic shopping mall, but can easily be extended to other domains.
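The auction at the heart of this mechanism, shops bidding for a limited number of banner slots per arriving consumer, can be sketched as a sealed-bid allocation. The generalized second-price rule below is one plausible choice for illustration only; the article does not mandate this exact pricing, and all names are ours:

```python
def allocate_slots(bids, num_slots):
    """Sell num_slots banner positions to the highest-bidding shops.
    Generalized second-price pricing: each winner pays the bid of the
    next shop in the ranking (an illustrative choice, not the paper's rule)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = []
    for i, (shop, _) in enumerate(ranked[:num_slots]):
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        winners.append((shop, price))
    return winners
```

For example, with bids `{"a": 5.0, "b": 3.0, "c": 1.0}` and two slots, shops `a` and `b` win and pay 3.0 and 1.0 respectively; shop agents can then adjust their next bids based on whether the consumer clicked.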
This article investigates whether and how mobile agents can execute secure electronic transactions on untrusted hosts. An overview of the security issues of mobile agents is first given. The problem of untrusted (i.e., potentially malicious) hosts is one of these issues, and it appears to be the most difficult to solve. The current approaches to countering this problem are evaluated, and their relevance for secure electronic transactions is discussed. In particular, a state-of-the-art survey of mobile agent-based secure electronic transactions is presented.
Advances in distributed service-oriented computing and Internet technology have created a strong technology push for outsourcing and information sharing. There is an increasing need for organizations to share their data across organizational boundaries, both within a country and with countries that may have weaker privacy and security standards. Ideally, we wish to share certain statistical data and extract knowledge from the private databases without revealing any additional information about each individual database beyond the aggregate result that is permitted. In this article, we describe two scenarios for outsourcing data aggregation services and present a set of decentralized peer-to-peer protocols for supporting data sharing across multiple private databases while minimizing the data disclosure among individual parties. Our basic protocols include a set of novel probabilistic computation mechanisms for important primitive data aggregation operations across multiple private databases, such as max, min, and top-k selection. We provide an analytical study of our basic protocols in terms of precision, efficiency, and privacy characteristics. Our advanced protocols implement an efficient algorithm for performing kNN classification across multiple private databases. We provide a set of experiments to evaluate the proposed protocols in terms of their correctness, efficiency, and privacy characteristics.
This article describes the CLEVER search system developed at the IBM Almaden Research Center. We present a detailed and unified exposition of the various algorithmic components that make up the system, and then present results from two user studies.
Negotiators often rely on learning an opponent's behavior and then using the knowledge gained to arrive at a better deal. However, in an electronic negotiation setting in which the parties involved are often unknown to (and therefore lack information about) each other, this learning has to be accomplished using only the bid offers submitted during an ongoing negotiation. In this article, we consider such a scenario and develop learning algorithms for electronic agents that use a common negotiation tactic, namely, the time-dependent tactic (TDT), in which the values of the negotiating issues depend on the time elapsed in the negotiation. Learning algorithms for this tactic have not previously been proposed in the literature. Our approach is based on using the derivatives of the Taylor series approximation of the TDT function in a three-phase algorithm that enumerates over a partially discretized version of the solution space. Computational results with our algorithms are encouraging.
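The TDT itself follows the well-known time-dependent formulation in which an agent's concession level is a function of elapsed negotiation time. A minimal sketch of that standard form is below (parameter names are ours; the article's learning algorithm over the Taylor approximation is not reproduced here):

```python
def tdt_offer(t, t_max, p_min, p_max, kappa, beta, buyer=True):
    """Time-dependent tactic: the concession level alpha grows with elapsed
    time t. beta > 1 concedes early (conceder); 0 < beta < 1 concedes late
    (boulware); kappa sets the initial concession."""
    alpha = kappa + (1 - kappa) * (t / t_max) ** (1 / beta)
    if buyer:
        # buyer starts near p_min and rises toward p_max as the deadline nears
        return p_min + alpha * (p_max - p_min)
    else:
        # seller starts near p_max and falls toward p_min
        return p_max - alpha * (p_max - p_min)
```

An opponent-modeling agent observing a sequence of such offers must infer the hidden parameters (here `kappa` and `beta`) from the bids alone, which is the learning problem the article addresses.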
We present an effective method of eliminating unsolicited electronic mail (so-called spam) and discuss its publicly accessible prototype implementation. A subscriber to our system is able to obtain an unlimited number of aliases of his/her permanent (protected) E-Mail address to be handed out to parties willing to communicate with the subscriber. It is also possible to set up publishable aliases, which can be used by human correspondents to contact the subscriber, while being useless to harvesting robots and spammers. The validity of an alias can be easily restricted to a specific duration in time, a specific number of received messages, a specific population of senders, and/or in other ways. The system is fully compatible with the existing E-Mail infrastructure and can be immediately accessed via any standard E-Mail client software (MUA). It can be easily deployed at any institution or organization running its private E-Mail server (MTA) with a trivial modification to that server. Our system offers a simple method to salvage the existing population of E-Mail addresses while eliminating all spam aimed at them.
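The alias-validity rules described above (expiry time, message quota, restricted sender population) can be sketched as a check performed before forwarding a message to the protected address. This is an illustrative model under assumed names, not the prototype's implementation:

```python
import time
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Alias:
    """A disposable alias for a protected address; None means 'unrestricted'."""
    address: str
    expires_at: Optional[float] = None       # absolute deadline (epoch seconds)
    messages_left: Optional[int] = None      # remaining deliveries
    allowed_senders: Optional[Set[str]] = None

    def accept(self, sender):
        """Decide whether a message from `sender` may be forwarded,
        consuming one unit of the quota if a quota is set."""
        if self.expires_at is not None and time.time() > self.expires_at:
            return False
        if self.messages_left is not None and self.messages_left <= 0:
            return False
        if self.allowed_senders is not None and sender not in self.allowed_senders:
            return False
        if self.messages_left is not None:
            self.messages_left -= 1
        return True
```

An alias created with `messages_left=1` delivers exactly one message and then silently drops everything else, which is what makes harvested aliases useless to spammers.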
Increasingly, application developers seek the ability to search for existing Web services within large Internet-based repositories. The goal is to retrieve services that match the user's requirements. With the growing number of services in the repositories and the challenge of quickly finding the right ones, the need for clustering related services becomes evident, so that search engine results can be enhanced with a list of similar services for each hit. In this article, a statistical clustering approach is presented that extends an existing distributed vector space search engine for Web services with the ability to dynamically calculate clusters of similar services for each hit in the list found by the search engine. The focus is on a highly efficient and scalable clustering implementation that can handle very large service repositories. An evaluation with a large service repository demonstrates the feasibility and performance of the approach.
Anomaly detection involves identifying observations that deviate from the normal behavior of a system. One of the ways to achieve this is by identifying the phenomena that characterize “normal” observations. Subsequently, based on the characteristics of data learned from the “normal” observations, new observations are classified as being either “normal” or not. Most state-of-the-art approaches, especially those which belong to the family of parameterized statistical schemes, work under the assumption that the underlying distributions of the observations are stationary. That is, they assume that the distributions that are learned during the training (or learning) phase, though unknown, are not time-varying. They further assume that the same distributions are relevant even as new observations are encountered. Although such a “stationarity” assumption is relevant for many applications, there are some anomaly detection problems where stationarity cannot be assumed. For example, in network monitoring, the patterns which are learned to represent normal behavior may change over time due to several factors such as network infrastructure expansion, new services, growth of user population, and so on. Similarly, in meteorology, identifying anomalous temperature patterns involves taking into account seasonal changes of normal observations. Detecting anomalies or outliers under these circumstances introduces several challenges. Indeed, the ability to adapt to changes in nonstationary environments is necessary so that anomalous observations can be identified even with changes in what would otherwise be classified as “normal” behavior. In this article we propose to apply a family of weak estimators for anomaly detection in dynamic environments. In particular, we apply this theory to spam email detection. Our experimental results demonstrate that our proposal is both feasible and effective for the detection of such anomalous emails.
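The weak-estimator idea can be illustrated with the stochastic-learning weak-estimator update for a binary stream (for example, 1 = spam, 0 = legitimate): the estimate is a convex combination of its previous value and the newest observation, so it never fully converges and can therefore track a drifting distribution. A minimal sketch with illustrative parameter values and names of our own choosing:

```python
def slwe_estimate(observations, lam=0.9, p0=0.5):
    """Stochastic-learning weak estimator for the probability that x = 1.
    lam < 1 keeps the estimate 'weak': old evidence decays geometrically,
    so the estimate follows the distribution even when it shifts over time."""
    p = p0
    trace = []
    for x in observations:          # each x is 0 or 1
        p = lam * p + (1 - lam) * x
        trace.append(p)
    return trace
```

Fed 50 spam observations followed by 50 legitimate ones, the estimate first climbs toward 1 and then decays back toward 0, whereas a stationary maximum-likelihood estimate (the running mean) would linger near 0.5 after the shift.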
Many projects have tried to analyze the structure and dynamics of application overlay networks on the Internet using packet analysis and network flow data. While such analysis is essential for a variety of network management and security tasks, it is infeasible on many networks: either the volume of data is so large as to make packet inspection intractable, or privacy concerns forbid packet capture and require the dissociation of network flows from users’ actual IP addresses. Our analytical framework permits useful analysis of network usage patterns even under circumstances where the only available source of data is anonymized flow records. Using this data, we are able to uncover distributions and scaling relations in host-to-host networks that bear implications for capacity planning and network application design. We also show how to classify network applications based entirely on topological properties of their overlay networks, yielding a taxonomy that allows us to accurately identify the functions of unknown applications. We repeat this analysis on a more recent dataset, allowing us to demonstrate that the aggregate behavior of users is remarkably stable even as the population changes.
In this article, we take a first step towards the long-term vision of the Semantic Web by studying the problem of answering queries posed through a mediated ontology to multiple information sources whose content is described as views over the ontology relations. The contributions of this paper are twofold. We first offer a uniform logical setting that allows us to encompass and relate the existing work on answering and rewriting queries using views. In particular, we clarify the connection between the problem of rewriting queries using views and the problem of answering queries using extensions of views. We then focus on an instance of the problem of rewriting conjunctive queries using views through an ontology expressed in a description logic, for which we exhibit a complete algorithm.
We introduce a method for learning to find documents on the Web that contain answers to a given natural language question. In our approach, questions are transformed into new queries aimed at maximizing the probability of retrieving answers from existing information retrieval systems. The method involves automatically learning phrase features for classifying questions into different types, automatically generating candidate query transformations from a training set of question/answer pairs, and automatically evaluating the candidate transformations on target information retrieval systems such as real-world general purpose search engines. At run-time, questions are transformed into a set of queries, and reranking is performed on the documents retrieved. We present a prototype search engine, Tritus, that applies the method to Web search engines. Blind evaluation on a set of real queries from a Web search engine log shows that the method significantly outperforms the underlying search engines, and outperforms a commercial search engine specializing in question answering. Our methodology cleanly supports combining documents retrieved from different search engines, resulting in additional improvement with a system that combines search results from multiple Web search engines.
Web service discovery is one of the main applications of semantic Web services, which extend standard Web services with semantic annotations. Current discovery solutions were developed in the context of automatic service composition. Thus, the “client” of the discovery procedure is an automated computer program rather than a human, with little, if any, tolerance for inexact results. In the real world, however, services that may be semantically distant from each other are glued together using manual coding. In this article, we propose a new retrieval model for semantic Web services, with the objective of simplifying service discovery for human users. The model relies on a simple and extensible keyword-based query language and enables efficient retrieval of approximate results, including approximate service compositions. Since representing all possible compositions and all approximate concept references can result in an exponentially sized index, we investigate clustering methods to provide a scalable mechanism for service indexing. Results of experiments designed to evaluate our indexing and query methods show that satisfactory approximate search is feasible with efficient processing time.
Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper makes two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual implementation of the site; (b) a tool called Homer, conceived to support the site design and implementation process by allowing the designer to move through the various steps of the methodology and by automating the generation of the code needed to implement the actual site. Our approach to site design is based on a clear separation between several design activities, namely database design, hypertext design, and presentation design. All these activities are carried out using high-level models, all subsumed by an extension of the nested relational model; the mappings between the models can be nicely expressed using an extended relational algebra for nested structures. Based on the design artifacts produced during the design process, and on their representation in the algebraic framework, Homer is able to generate all the code needed for the actual generation of the site in a completely automatic way.
Net neutrality represents the idea that Internet users are entitled to service that does not discriminate on the basis of the source, destination, or ownership of Internet traffic. The United States Congress is considering legislation on net neutrality, and debate over the issue has generated intense lobbying. Congressional action will substantially affect the evolution of the Internet and of future Internet research. In this article, we argue that neither the pro- nor the anti-net-neutrality position is consistent with the philosophy of Internet architecture. We develop a net neutrality policy founded on a segmentation of Internet services into infrastructure services and application services, based on the Internet's layered architecture. Our net neutrality policy restricts an Internet Service Provider's ability to engage in anticompetitive behavior while simultaneously ensuring that it can use desirable forms of network management. We illustrate the effect of this policy by discussing acceptable and unacceptable uses of network management.
Top-cited authors
Magdalini Eirinaki
  • San Jose State University
Michalis Vazirgiannis
  • École Polytechnique
Junghoo Cho
  • University of California, Los Angeles
Sriram Raghavan
Toshiyuki Amagasa
  • University of Tsukuba