ACM Transactions on the Web

Published by Association for Computing Machinery
Print ISSN: 1559-1131
Publications
Figures: Fragment of the domain ontology; Annotation system (GUI).
Semantic annotations of web services can facilitate the discovery of services, as well as their composition into workflows. At present, however, the practical utility of such annotations is limited by the small number of service annotations available for general use. Resources for manual annotation are scarce, and therefore some means is required by which services can be automatically (or semi-automatically) annotated. In this paper, we show how information can be inferred about the semantics of operation parameters based on their connections to other (annotated) operation parameters within tried-and-tested workflows. In an open-world context, we can infer only constraints on the semantics of parameters, but these so-called loose annotations are still of value in detecting errors within workflows, annotations and ontologies, as well as in simplifying the manual annotation task.
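To illustrate the idea of inferring loose annotations from workflow connections, here is a minimal sketch under assumed data structures (the operation names, parameter names, and constraint labels are hypothetical and not the paper's implementation): when a dataflow link joins an annotated parameter to an unannotated one, we can record only a compatibility constraint on the unannotated end.

```python
# Sketch: inferring loose semantic constraints on unannotated workflow parameters.
# Hypothetical data model, not the paper's actual system.

# Known annotations: (operation, parameter) -> ontology concept
annotations = {
    ("fetch_sequence", "output"): "DNA_Sequence",
    ("align", "result"): "Alignment",
}

# Dataflow links observed in tried-and-tested workflows:
# (source operation, source parameter) -> (target operation, target parameter)
links = [
    (("fetch_sequence", "output"), ("align", "query")),
    (("align", "result"), ("render_report", "input")),
]

def infer_constraints(annotations, links):
    """For every link touching exactly one annotated parameter, emit a loose
    constraint on the other end: it must be compatible with the annotated
    concept (accept it as input, or produce something compatible with it)."""
    constraints = {}
    for src, dst in links:
        if src in annotations and dst not in annotations:
            constraints.setdefault(dst, set()).add(("accepts", annotations[src]))
        elif dst in annotations and src not in annotations:
            constraints.setdefault(src, set()).add(("produces_compatible_with", annotations[dst]))
    return constraints

if __name__ == "__main__":
    for param, cs in infer_constraints(annotations, links).items():
        print(param, "->", sorted(cs))
```

Such constraints do not pin down a single concept (the open-world caveat in the abstract), but they can flag contradictions against existing annotations or ontology axioms.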
 
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
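The k-ORE notion itself is easy to make concrete. The sketch below (an illustration of the definition only, not the learning algorithm; the tokenization of element names is an assumption) computes the smallest k for which a given content-model expression is a k-ORE, i.e., the maximum number of times any alphabet symbol occurs in it.

```python
import re
from collections import Counter

def occurrence_k(expression: str) -> int:
    """Smallest k for which `expression` is a k-ORE: the maximum number of
    occurrences of any single alphabet symbol in the expression. Alphabet
    symbols are assumed to be alphanumeric tokens; regex operators
    (|, *, +, ?, parentheses) are ignored."""
    symbols = re.findall(r"[A-Za-z0-9_]+", expression)
    counts = Counter(symbols)
    return max(counts.values()) if counts else 0

# A 1-ORE: every element name occurs once.
print(occurrence_k("title (author)+ (year | date)?"))   # -> 1
# A 2-ORE: 'author' occurs twice.
print(occurrence_k("title author+ (editor author)?"))   # -> 2
```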
 
(a) The size of the largest connected component of customers over time. The inset shows the linear growth in the number of customers n over time.
The number of recommendations sent by a user, with each curve representing a different depth of the user in the recommendation chain. A power law exponent γ is fitted to all but the tail, which shows an exponential drop-off at around 100 recommendations sent. This drop-off is consistent across all depth levels, and may reflect either a natural disinclination to send recommendations to over a hundred people, or a technical issue that might have made it more inconvenient to do so. The fitted lines follow the order of the level number (i.e., the top line corresponds to level 0 and the bottom to level 4).
A Bayesian network showing the dependencies between the variables. s: recommendation success rate, n: number of nodes, ns: number of senders of recommendations, nr: log number of receivers, r: number of recommendations, e: number of edges, p: price, v: number of reviews, t: average rating.
We present an analysis of a person-to-person recommendation network, consisting of 4 million people who made 16 million recommendations on half a million products. We observe the propagation of recommendations and the cascade sizes, which we explain by a simple stochastic model. We analyze how user behavior varies within user communities defined by a recommendation network. Product purchases follow a 'long tail' where a significant share of purchases belongs to rarely sold items. We establish how the recommendation network grows over time and how effective it is from the viewpoint of the sender and receiver of the recommendations. While on average recommendations are not very effective at inducing purchases and do not spread very far, we present a model that successfully identifies communities, product and pricing categories for which viral marketing seems to be very effective.
 
The extensive adoption of Web service-based applications in dynamic business scenarios, such as on-demand computing or highly reconfigurable virtual enterprises, calls for methods and tools for the management of Web service nonfunctional aspects, such as Quality of Service (QoS). Concerning contracts on Web service QoS, the literature has mostly focused on contract definition and on mechanisms for contract enactment, such as the monitoring of the satisfaction of negotiated QoS guarantees. In this context, this article proposes a framework for the automation of Web service contract specification and establishment. An extensible model for defining both domain-dependent and domain-independent Web service QoS dimensions and a method for the automation of the contract establishment phase are proposed. We describe a matchmaking algorithm for the ranking of functionally equivalent services, which orders services on the basis of their ability to fulfill the service requestor's requirements while keeping the price below a specified budget. We also provide an algorithm for the configuration of the negotiable part of the QoS Service-Level Agreement (SLA), which is used to configure the agreement with the top-ranked service identified in the matchmaking phase. Experimental results show that, from a utility-theory perspective, the contract establishment phase leads to efficient outcomes. We envision two advanced application scenarios for the Web service contracting framework proposed in this article. First, it can be used to enhance Web service self-healing properties in reaction to QoS-related service failures; second, it can be exploited in process optimization for the online reconfiguration of candidate Web services' QoS SLAs.
 
With organizations increasingly depending on Web services to build complex applications, security and privacy concerns including the protection of access control policies are becoming a serious issue. Ideally, service providers would like to make sure that clients have knowledge of only portions of the access control policy relevant to their interactions to the extent to which they are entrusted by the Web service and without restricting the client’s choices in terms of which operations to execute. We propose ACConv, a novel model for access control in Web services that is suitable when interactions between the client and the Web service are conversational and long-running. The conversation-based access control model proposed in this article allows service providers to limit how much knowledge clients have about the credentials specified in their access policies. This is achieved while reducing the number of times credentials are asked from clients and minimizing the risk that clients drop out of a conversation with the Web service before reaching a final state due to the lack of necessary credentials. Clients are requested to provide credentials, and hence are entrusted with part of the Web service access control policies, only for some specific granted conversations which are decided based on: (1) a level of trust that the Web service provider has vis-à-vis the client, (2) the operation that the client is about to invoke, and (3) meaningful conversations which represent conversations that lead to a final state from the current one. We have implemented the proposed approach in a software prototype and conducted extensive experiments to show its effectiveness.
 
It is likely that mobile phones will soon come to rival more traditional devices as the primary platform for information access. Consequently, it is important to understand the emerging information access behavior of mobile Internet (MI) users especially in relation to their use of mobile handsets for information browsing and query-based search. In this article, we describe the results of a recent analysis of the MI habits of more than 600,000 European MI users, with a particular emphasis on the emerging interest in mobile search. We consider a range of factors including whether there are key differences between browsing and search behavior on the MI compared to the Web. We highlight how browsing continues to dominate mobile information access, but go on to show how search is becoming an increasingly popular information access alternative especially in relation to certain types of mobile handsets and information needs. Moreover, we show that sessions involving search tend to be longer and more data-rich than those that do not involve search. We also look at the type of queries used during mobile search and the way that these queries tend to be modified during the course of a mobile search session. Finally we examine the overlap among mobile search queries and the different topics mobile users are interested in.
 
Numerous studies over the past ten years have shown that concern for personal privacy is a major impediment to the growth of e-commerce. These concerns are so serious that most if not all consumer watchdog groups have called for some form of privacy protection for Internet users. In response, many nations around the world, including all European Union nations, Canada, Japan, and Australia, have enacted national legislation establishing mandatory safeguards for personal privacy. However, recent evidence indicates that Web sites might not be adhering to the requirements of this legislation. The goal of this study is to examine the posted privacy policies of Web sites, and compare these statements to the legal mandates under which the Web sites operate. We harvested all available P3P (Platform for Privacy Preferences Protocol) documents from the 100,000 most popular Web sites (over 3,000 full policies, and another 3,000 compact policies). This allows us to undertake an automated analysis of adherence to legal mandates on Web sites that most impact the average Internet user. Our findings show that Web sites generally do not even claim to follow all the privacy-protection mandates in their legal jurisdiction (we do not examine actual practice, only posted policies). Furthermore, this general statement appears to be true for every jurisdiction with privacy laws and any significant number of P3P policies, including European Union nations, Canada, Australia, and Web sites in the USA Safe Harbor program.
 
In the past, enterprise resource planning systems were designed as monolithic software systems running on centralized mainframes. Today, these systems are (re-)designed as a repository of enterprise services that are distributed throughout the available computing infrastructure. These service-oriented architectures (SOAs) require advanced automatic and adaptive management concepts in order to achieve a high service level in terms of, for example, availability, responsiveness, and throughput. The adaptive management has to allocate service instances to computing resources, adapt the resource allocation to unforeseen load fluctuations, and intelligently schedule individual requests to guarantee negotiated service level agreements (SLAs). Our AutoGlobe platform provides such comprehensive adaptive service management, comprising: (1) static service-to-server allocation based on automatically detected service utilization patterns; (2) adaptive service management based on a fuzzy controller that remedies exceptional situations by automatically initiating, for example, service migration or service replication (scale-out); and (3) adaptive scheduling of individual service requests that prioritizes requests depending on the current degree of service level conformance. All three complementary control components are described in detail, and their effectiveness is analyzed by means of realistic business application scenarios.
 
Overall Design of the Entity Disambiguation System.  
In this article, we demonstrate the applicability of semantic techniques for the detection of Conflict of Interest (COI). We explain the common challenges involved in building scalable Semantic Web applications, in particular those addressing connecting-the-dots problems. We describe in detail the challenges involved in two important aspects of building Semantic Web applications, namely, data acquisition and entity disambiguation (or reference reconciliation). We extend our previous work, in which we integrated the collaborative network of a subset of DBLP researchers with persons in a Friend-of-a-Friend (FOAF) social network. Our method finds the connections between people, measures collaboration strength, and includes heuristics that use friendship/affiliation information to provide an estimate of potential COI in a peer-review scenario. Evaluations are presented by measuring what could have been the COI between accepted papers in various conference tracks and their respective program committee members. The experimental results demonstrate that scalability can be achieved by using a dataset of over 3 million entities (all bibliographic data from DBLP and a large collection of FOAF documents).
 
An ads-portal domain refers to a Web domain that shows only advertisements, served by a third-party advertisement syndication service, in the form of an ads listing. We develop a machine-learning-based classifier to identify ads-portal domains, which achieves 96% accuracy. We use this classifier to measure the prevalence of ads-portal domains on the Internet. Surprisingly, 28.3% of two-level *.com domains and 25% of two-level *.net domains are ads-portal domains. Moreover, 41% of these *.com and 39.8% of these *.net ads-portal domains are typos of well-known domains, also known as typo-squatting domains. In addition, we use the classifier along with DNS trace files to estimate how often Internet users visit ads-portal domains. It turns out that ∼5% of the two-level *.com, *.net, *.org, *.biz, and *.info web domains on the traces are ads-portal domains, and ∼50% of these accessed ads-portal domains are typos. These numbers show that ads-portal domains and typo-squatting ads-portal domains are prevalent on the Internet and successful in attracting many visits. Our classifier represents a step towards better categorization of web documents. It can also be useful to search engines' ranking algorithms, helpful in identifying web spam that redirects to ads-portal domains, and applicable in discouraging access to typo-squatting ads-portal domains.
 
… compensations at runtime. We introduce the abstract service and adapter components, which allow us to separate the compensation logic from the coordination logic. In this way, we can easily plug in or plug out different compensation strategies based on a specification language defined on top of basic compensation activities and complex compensation types. Experiments with our approach and environment show that such an approach to compensation is feasible and beneficial. Additionally, we introduce a cost-benefit model to evaluate the proposed environment based on net value analysis. The evaluation shows under which circumstances the environment is economical. A preliminary version of this paper appeared in the proceedings of ICWE 2007 [Schäfer et al. 2007].
 
We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due ...
 
This article describes an approach for incorporating externally specified aggregate ratings information into certain types of recommender systems, including two types of collaborative filtering and a hierarchical linear regression model. First, we present a framework for incorporating aggregate rating information and apply this framework to the aforementioned individual rating models. Then we formally show that this additional aggregate rating information provides more accurate recommendations of individual items to individual users. Further, we experimentally confirm this theoretical finding by demonstrating on several datasets that the aggregate rating information indeed leads to better predictions of unknown ratings. We also propose scalable methods for incorporating this aggregate information and test our approaches on large datasets. Finally, we demonstrate that the aggregate rating information can also be used as a solution to the cold start problem of recommender systems.
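As a rough intuition for how aggregate ratings can help (a generic shrinkage sketch, not the article's models; the smoothing constant tau and the toy numbers are assumptions), an individual-level prediction can be pulled toward an externally supplied aggregate rating, with the pull weakening as more individual evidence accumulates.

```python
def blend_with_aggregate(individual_pred, aggregate_rating, n_support, tau=10.0):
    """Shrink an individual-level predicted rating toward an external aggregate
    rating. The weight on the individual prediction grows with n_support, the
    number of individual ratings backing it; tau is a hypothetical smoothing
    constant."""
    w = n_support / (n_support + tau)
    return w * individual_pred + (1.0 - w) * aggregate_rating

# A sparse user/item pair (2 observed ratings) leans on the aggregate;
# a well-supported one (200 ratings) mostly keeps the individual prediction.
print(blend_with_aggregate(4.5, 3.2, n_support=2))    # ~3.42
print(blend_with_aggregate(4.5, 3.2, n_support=200))  # ~4.44
```

This also hints at the cold-start point in the abstract: with no individual evidence at all, the prediction falls back entirely on the aggregate.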
 
Figures: The Service Schema for the Car Brokerage Scenario; An Example of an Operation Graph.
We present a query algebra that supports optimized access to Web services through service-oriented queries. The service query algebra is defined based on a formal service model that provides a high-level abstraction of Web services across an application domain. The algebra defines a set of algebraic operators. Algebraic service queries can be formulated using these operators. This allows users to query their desired services based on both functionality and quality. We provide the implementation of each algebraic operator. This enables the generation of Service Execution Plans (SEPs) that can be used by users to directly access services. We present an optimization algorithm by extending the Dynamic Programming (DP) approach to efficiently select the SEPs with the best user-desired quality. The experimental study validates the proposed algorithm by demonstrating significant performance improvement compared with the traditional DP approach.
 
The emergence of Web 2.0 and the consequent success of social network Web sites such as Del.icio.us and Flickr introduce us to a new concept called social bookmarking, or tagging. Tagging is the action of connecting a relevant user-defined keyword to a document, image, or video, which helps users better organize and share their collections of interesting content. With the rapid growth of Web 2.0, tagged data is becoming more and more abundant on social network Web sites. An interesting problem is how to automate the process of making tag recommendations to users when a new resource becomes available. In this article, we address the issue of tag recommendation from a machine learning perspective. From our empirical observation of two large-scale datasets, we first argue that the user-centered approach to tag recommendation is not very effective in practice. Consequently, we propose two novel document-centered approaches that are capable of making effective and efficient tag recommendations in real scenarios. The first, graph-based, method represents the tagged data in two bipartite graphs, (document, tag) and (document, word), then finds document topics by leveraging graph partitioning algorithms. The second, prototype-based, method aims at finding the most representative documents within the data collections and advocates a sparse multiclass Gaussian process classifier for efficient document classification. For both methods, tags are ranked within each topic cluster/class by a novel ranking method. Recommendations are performed by first classifying a new document into one or more topic clusters/classes, and then selecting the most relevant tags from those clusters/classes as machine-recommended tags. Experiments on real-world data from Del.icio.us, CiteULike, and BibSonomy examine the quality of tag recommendation as well as the efficiency of our recommendation algorithms. The results suggest that our document-centered models can substantially improve the performance of tag recommendation when compared to user-centered methods, as well as to LDA topic models and SVM classifiers.
 
Service-Oriented Architecture (SOA) provides a flexible framework for service composition. Using standards-based protocols (such as SOAP and WSDL), composite services can be constructed by integrating atomic services developed independently. Algorithms are needed to select service components with various QoS levels according to some application-dependent performance requirements. We design a broker-based architecture to facilitate the selection of QoS-based services. The objective of service selection is to maximize an application-specific utility function under end-to-end QoS constraints. The problem is modeled in two ways: the combinatorial model and the graph model. The combinatorial model defines the problem as a multidimensional multichoice 0-1 knapsack problem (MMKP). The graph model defines the problem as a multiconstraint optimal path (MCOP) problem. Efficient heuristic algorithms for service processes with different composition structures are presented in this article, and their performance is studied through simulations. We also compare the pros and cons of the two models.
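To give a feel for the combinatorial (MMKP-style) view, here is a minimal greedy sketch: pick one candidate per abstract task under a global price budget, starting from the cheapest feasible assignment and spending the remaining budget on upgrades with the best marginal utility per unit of extra price. The task names, utilities, and prices are assumptions, and this is not one of the article's heuristics.

```python
def greedy_select(tasks, budget):
    """tasks: list of candidate lists; each candidate is (name, utility, price).
    Greedy MMKP-style sketch: start from the cheapest candidate per task, then
    repeatedly apply the upgrade with the best utility gain per extra cost."""
    choice = [min(cands, key=lambda c: c[2]) for cands in tasks]  # cheapest baseline
    spent = sum(c[2] for c in choice)
    if spent > budget:
        return None  # no feasible assignment under this sketch
    improved = True
    while improved:
        improved = False
        best = None
        for i, cands in enumerate(tasks):
            for cand in cands:
                du = cand[1] - choice[i][1]          # utility gain
                dp = cand[2] - choice[i][2]          # extra price
                if du > 0 and spent + dp <= budget:
                    ratio = du / dp if dp > 0 else float("inf")
                    if best is None or ratio > best[0]:
                        best = (ratio, i, cand)
        if best:
            _, i, cand = best
            spent += cand[2] - choice[i][2]
            choice[i] = cand
            improved = True
    return choice, spent

tasks = [
    [("pay_basic", 0.4, 1.0), ("pay_premium", 0.9, 3.0)],
    [("ship_slow", 0.3, 0.5), ("ship_fast", 0.8, 2.5)],
]
print(greedy_select(tasks, budget=4.0))
```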
 
F-measure values with box error bars (which extend above and below each point by one standard deviation) for ccTLD and IP algorithms. 
A pruned version (only top nodes) of the decision tree trained on ODP + SER for English. The "English IP" refers to the feature indicating whether the URL is hosted at a Web server which is located in an English-speaking country. "Dict." refers to dictionary. At a leaf node, f(+) is the fraction of positive test URLs and f(−) is the fraction of negative test URLs (in ODP + SER) that ended up at the leaf.
Macro-averaged (averaged over languages) F1-measure values for various classifiers: ccTLD, IP, SVM with allgrams, DT with custom-made features, RE with allgrams. 
A plot showing how the performance of different approaches changes on ODP + SER test set as the amount of training data is increased from a total of 450 URLs (for each language classifier) to 450k URLs. 
A plot showing how the performance of different approaches changes on the Flash test set as the amount of training data is increased from a total of 450 URLs (for each language classifier) to 450k URLs. 
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.
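For concreteness, here is a toy sketch of URL-only classification with character n-gram features and a linear model, in the spirit of the "allgrams" features mentioned above. It uses scikit-learn; the tiny training set, the n-gram range, and the two topic labels are illustrative assumptions, not the paper's setup.

```python
# Toy sketch of URL-only topic classification with character n-gram features.
# Requires scikit-learn; the miniature training set is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

urls = [
    "http://www.espn.com/nba/scores",
    "http://sports.example.org/football/league-table",
    "http://www.webmd.com/diabetes/symptoms",
    "http://health.example.net/articles/flu-vaccine",
]
labels = ["sports", "sports", "health", "health"]

# All character n-grams of length 3 to 6, extracted from the raw URL string.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 6), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)
print(model.predict(["http://www.example.com/nba/playoffs"]))
```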
 
Since many Internet applications employ a multitier architecture, in this article, we focus on the problem of analytically modeling the behavior of such applications. We present a model based on a network of queues where the queues represent different tiers of the application. Our model is sufficiently general to capture (i) the behavior of tiers with significantly different performance characteristics and (ii) application idiosyncrasies such as session-based workloads, tier replication, load imbalances across replicas, and caching at intermediate tiers. We validate our model using real multitier applications running on a Linux server cluster. Our experiments indicate that our model faithfully captures the performance of these applications for a number of workloads and configurations. Furthermore, our model successfully handles a comprehensive range of resource utilization, from 0 to near saturation of the CPU, for two separate tiers. For a variety of scenarios, including those with caching at one of the application tiers, the average response times predicted by our model were within the 95% confidence intervals of the observed average response times. Our experiments also demonstrate the utility of the model for dynamic capacity provisioning, performance prediction, bottleneck identification, and session policing. In one scenario, where the request arrival rate increased from less than 1500 to nearly 4200 requests/minute, a dynamic provisioning technique employing our model was able to maintain response time targets by increasing the capacity of two of the tiers by factors of 2 and 3.5, respectively.
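To show the flavor of a queueing-network model of a multitier application, here is the classic exact mean-value analysis (MVA) for a closed network of single-server queues with per-tier service demands and client think time. This is a textbook baseline, not the authors' model, which additionally handles sessions, replication, concurrency limits, and caching; the three tier demands below are assumptions.

```python
def mva(service_demands, think_time, num_users):
    """Exact MVA for a closed queueing network of single-server queues.
    service_demands[i] = average demand (seconds) a request places on tier i;
    think_time = client think time (seconds); returns (throughput, response_time)."""
    n_queues = len(service_demands)
    queue_len = [0.0] * n_queues
    throughput = 0.0
    response = 0.0
    for n in range(1, num_users + 1):
        # Residence time at each tier given the queue length seen by an arrival.
        resid = [service_demands[i] * (1.0 + queue_len[i]) for i in range(n_queues)]
        response = sum(resid)
        throughput = n / (response + think_time)
        queue_len = [throughput * resid[i] for i in range(n_queues)]
    return throughput, response

# Three tiers (web, application, database) with hypothetical demands in seconds.
X, R = mva(service_demands=[0.005, 0.020, 0.010], think_time=1.0, num_users=100)
print(f"throughput ~ {X:.1f} req/s, response time ~ {R * 1000:.1f} ms")
```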
 
Figures: Load-aware Anycast CDN architecture; Application-level redirection for long-lived sessions.
IP Anycast has many attractive features for any service that involves the replication of multiple instances across the Internet. IP Anycast allows multiple instances of the same service to be “naturally” discovered, and requests for this service to be delivered to the closest instance. However, while briefly considered as an enabler for content delivery networks (CDNs) when they first emerged, IP Anycast was deemed infeasible in that environment. The main reasons for this decision were the lack of load awareness of IP Anycast and unwanted side effects of Internet routing changes on the IP Anycast mechanism. In this article we re-evaluate IP Anycast for CDNs by proposing a load-aware IP Anycast CDN architecture. Our architecture is prompted by recent developments in route control technology, as well as a better understanding of the behavior of IP Anycast in operational settings. Our architecture makes use of route control mechanisms to take server and network load into account to realize load-aware Anycast. We show that the resulting redirection requirements can be formulated as a Generalized Assignment Problem and present practical algorithms that address these requirements while at the same time limiting the connection disruptions that plague regular IP Anycast. We evaluate our algorithms through trace-based simulation using traces obtained from a production CDN network.
 
Service matching approaches trade precision for recall, creating the need for users to choose the correct services, which obviously is a major obstacle to automating the service discovery and aggregation processes. Our approach to overcoming this problem is to eliminate the appearance of false positives by returning only the correct services. As different users have different semantics for what is correct, we argue that the correctness of the matching results must be determined according to the achievement of users' goals: that is, only services achieving users' goals are considered correct. To determine such correctness, we argue that the matching process should be based primarily on high-level functional specifications (namely goals, achievement contexts, and external behaviors). In this article, we propose the models, data structures, algorithms, and theorems required to correctly match such specifications. We propose a model, called G+, to capture such specifications, for both services and users, in a machine-understandable format. We propose a data structure, called a Concepts Substitutability Graph (CSG), to capture the substitution semantics of application domain concepts in a context-based manner, in order to determine the semantic-preserving mapping transformations required to match different G+ models. We also propose a behavior matching approach that is able to match states in an m-to-n manner, such that behavior models with different numbers of state transitions can be matched. Finally, we show how services are matched and aggregated according to their G+ models. Results of supporting experiments demonstrate the advantages of the proposed service matching approaches.
 
On-demand streaming from a remote server through best-effort Internet poses several challenges because of network losses and variable delays. The primary technique used to improve the quality of distributed content service is replication. In the context of the Internet, Web caching is the traditional mechanism that is used. In this article we develop a new staged delivery model for a distributed architecture in which video is streamed from remote servers to edge caches where the video is buffered and then streamed to the client through a last-mile connection. The model uses a novel revolving indexed cache buffer management mechanism at the edge cache and employs selective retransmissions of lost packets between the remote and edge cache for a best-effort recovery of the losses. The new Web cache buffer management scheme includes a dynamic adjustment of cache buffer parameters based on network conditions. In addition, performance of buffer management and retransmission policies at the edge cache is modeled and assessed using a probabilistic analysis of the streaming process as well as system simulations. The influence of different endogenous control parameters on the quality of stream received by the client is studied. Calibration curves on the QoS metrics for different network conditions have been obtained using simulations. Edge cache management can be done using these calibration curves. ISPs can make use of calibration curves to set the values of the endogenous control parameters for specific QoS in real-time streaming operations based on network conditions. A methodology to benchmark transmission characteristics using real-time traffic data is developed to enable effective decision making on edge cache buffer allocation and management strategies.
 
The Business Process Execution Language (BPEL) standardizes the development of composite enterprise applications that make use of software components exposed as Web services. BPEL processes are currently executed by a centralized orchestration engine, in which issues such as scalability and platform heterogeneity can be difficult to manage. This paper proposes a distributed agent-based orchestration engine in which several lightweight agents execute a portion of the original business process and collaborate in order to execute the complete process. The complete set of standard BPEL activities is supported, and the transformations of several BPEL activities to the agent-based architecture are described. Evaluations on an implementation of this architecture demonstrate that agent-based execution scales better than a non-distributed approach, with at least 70% and 120% improvements in process execution time and throughput, respectively, even with a large number of concurrent process instances. In addition, the distributed architecture successfully executes large processes that are shown to be infeasible to execute with a non-distributed engine.
 
Service-oriented architectures are increasingly used in the context of business processes. However, the proven practices for process-oriented integration of services are not well documented yet. In addition, modeling approaches for the integration of processes and services are neither mature nor do they exactly reflect the proven practices. In this paper, we propose a pattern language for process-oriented integration of services to describe the proven practices. Our main contribution is a modeling concept based on pattern primitives for these patterns. A pattern primitive is a fundamental, precisely specified modeling element that represents a pattern. We present a catalog of pattern primitives that are precisely modeled using OCL constraints and map these primitives to the patterns in the pattern language of process-oriented integration of services. We also present a model validation tool that we have developed to support modeling the process-oriented integration of services, and an industrial case study in which we have applied our results.
 
In service-oriented architectures, everything is a service and everyone is a service provider. Web services (or simply services) are loosely coupled software components that are published, discovered, and invoked across the Web. As the use of Web services grows, in order to correctly interact with them, it is important to understand the business protocols that provide clients with the information on how to interact with services. In dynamic Web service environments, service providers need to constantly adapt their business protocols to reflect the restrictions and requirements posed by new applications, new business strategies, and new laws, or to fix problems found in the protocol definition. However, the effective management of such protocol evolution raises critical problems: one of the most critical issues is how to handle instances running under the old protocol when it has been changed. Simple solutions, such as aborting them or allowing them to continue to run according to the old protocol, can be considered, but they are inapplicable for many reasons (for example, the loss of work already done and the critical nature of the work). In this article, we present a framework that supports service managers in managing business protocol evolution by providing several features, such as a variety of protocol change impact analyses that automatically determine which ongoing instances can be migrated to the new version of the protocol, and data mining techniques that infer interaction patterns used for classifying ongoing instances migratable to the new protocol. To support the protocol evolution process, we have also developed database-backed GUI tools on top of our existing system. The proposed approach and tools can help service managers in managing the evolution of ongoing instances when the business protocols of the services with which they are interacting have changed.
 
The Web holds a great quantity of material that can be used to enhance classroom instruction. However, it is not easy to retrieve this material with the search engines currently available. This study produced a specialized search assistant based on Google that significantly increases the number of instances in which teachers find the desired learning objects as compared to using this popular public search engine directly. Success in finding learning objects by study participants went from 80% using Google alone to 96% when using our search assistant in one scenario and, in another scenario, from a 40% success rate with Google alone to 66% with our assistant. This specialized search assistant implements features such as bilingual search and term suggestion which were requested by teacher participants to help improve their searches. Study participants evaluated the specialized search assistant and found it significantly easier to use and more useful than the popular search engine for the purpose of finding learning objects.
 
DNS rebinding attacks subvert the same-origin policy of browsers and convert them into open network proxies. We survey new DNS rebinding attacks that exploit the interaction between browsers and their plug-ins, such as Flash Player and Java. These attacks can be used to circumvent firewalls and are highly cost-effective for sending spam e-mail and defrauding pay-per-click advertisers, requiring less than $100 to temporarily hijack 100,000 IP addresses. We show that the classic defense against these attacks, called “DNS pinning,” is ineffective in modern browsers. The primary focus of this work, however, is the design of strong defenses against DNS rebinding attacks that protect modern browsers: we suggest easy-to-deploy patches for plug-ins that prevent large-scale exploitation, provide a defense tool, dnswall, that prevents firewall circumvention, and detail two defense options, policy-based pinning and host name authorization.
 
Recently, we have seen increasing numbers of denial of service (DoS) attacks against online services and Web applications either for extortion reasons or for impairing and even disabling the competition. These DoS attacks have increasingly targeted the application level. Application-level DoS attacks emulate the same request syntax and network-level traffic characteristics as those of legitimate clients, thereby making the attacks much harder to detect and counter. Moreover, such attacks often target bottleneck resources such as disk bandwidth, database bandwidth, and CPU resources. In this article, we propose handling DoS attacks by using a twofold mechanism. First, we perform admission control to limit the number of concurrent clients served by the online service. Admission control is based on port hiding that renders the online service invisible to unauthorized clients by hiding the port number on which the service accepts incoming requests. Second, we perform congestion control on admitted clients to allocate more resources to good clients. Congestion control is achieved by adaptively setting a client's priority level in response to the client's requests in a way that can incorporate application-level semantics. We present a detailed evaluation of the proposed solution using two sample applications: Apache HTTPD and the TPCW benchmark (running on Apache Tomcat and IBM DB2). Our experiments show that the proposed solution incurs low performance overhead and is resilient to DoS attacks.
 
We present a mathematical model of the eBay auction protocol and perform a detailed analysis of the effects that the eBay proxy bidding system and the minimum bid increment have on the auction properties. We first consider the revenue of the auction, and we show analytically that when two bidders with independent private valuations use the eBay proxy bidding system there exists an optimal value for the minimum bid increment at which the auctioneer's revenue is maximized. We then consider the sequential way in which bids are placed within the auction, and we show analytically that independent of assumptions regarding the bidders' valuation distribution or bidding strategy the number of visible bids placed is related to the logarithm of the number of potential bidders. Thus, in many cases, it is only a minority of the potential bidders that are able to submit bids and are visible in the auction bid history (despite the fact that the other hidden bidders are still effectively competing for the item). Furthermore, we show through simulation that the minimum bid increment also introduces an inefficiency to the auction, whereby a bidder who enters the auction late may find that its valuation is insufficient to allow them to advance the current bid by the minimum bid increment despite them actually having the highest valuation for the item. Finally, we use these results to consider appropriate strategies for bidders within real world eBay auctions. We show that while last-minute bidding (sniping) is an effective strategy against bidders engaging in incremental bidding (and against those with common values), in general, delaying bidding is disadvantageous even if delayed bids are sure to be received before the auction closes. Thus, when several bidders submit last-minute bids, we show that rather than seeking to bid as late as possible, a bidder should try to be the first sniper to bid (i.e., it should “snipe before the snipers”).
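As a rough companion to the analytical result on the minimum bid increment, here is a small Monte Carlo sketch under a deliberately simplified proxy-bidding rule (sale price equals the second-highest maximum bid plus one increment, capped at the highest maximum bid, with uniform independent private values). The rule, value distribution, and numbers are assumptions, not the paper's model or results.

```python
import random

def ebay_price(valuations, increment, start_price=0.0):
    """Simplified proxy-bidding outcome: the item sells at the second-highest
    maximum bid plus one increment, capped at the highest maximum bid.
    Ignores bid timing and eBay's exact tie-breaking rules."""
    bids = sorted(valuations, reverse=True)
    if not bids or bids[0] < start_price:
        return None  # no sale
    if len(bids) == 1 or bids[1] < start_price:
        return start_price
    return min(bids[0], bids[1] + increment)

def average_revenue(increment, trials=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        vals = [rng.uniform(0, 100), rng.uniform(0, 100)]  # independent private values
        total += ebay_price(vals, increment)
    return total / trials

for inc in [0.5, 2.5, 5.0, 10.0, 20.0]:
    print(f"increment {inc:5.1f}: average revenue {average_revenue(inc):6.2f}")
```

Running the sweep makes the trade-off visible: a larger increment raises the price when the second-highest value wins the cap race, but increasingly often truncates at the highest value instead.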
 
Search engines and large-scale IR systems need to cache query results for efficiency and scalability purposes. Static and dynamic caching techniques (as well as their combinations) are employed to effectively cache query results. In this study, we propose cost-aware strategies for static and dynamic caching setups. Our research is motivated by two key observations: (i) query processing costs may significantly vary among different queries, and (ii) the processing cost of a query is not proportional to its popularity (i.e., its frequency in previous logs). The first observation implies that cache misses have different, that is, nonuniform, costs in this context. The latter observation implies that typical caching policies, based solely on query popularity, cannot always minimize the total cost. Therefore, we propose to explicitly incorporate query costs into the caching policies. Simulation results using two large Web crawl datasets and a real query log reveal that the proposed approach improves overall system performance in terms of average query execution time.
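A minimal sketch of the cost-aware idea for a static result cache follows: rank queries by the total processing cost they would save (frequency times cost) rather than by frequency alone. The query strings, frequencies, and costs are hypothetical, and this is only an illustration of the principle, not the paper's specific policies.

```python
def select_static_cache(stats, capacity):
    """stats: dict query -> (frequency, processing_cost_ms). Fill a static
    result cache of `capacity` entries, ranking queries by the total cost they
    would save (frequency * cost) rather than by frequency alone."""
    ranked = sorted(stats.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    return [q for q, _ in ranked[:capacity]]

stats = {
    "weather today": (5000, 2.0),      # popular but cheap
    "rare legal phrase": (40, 900.0),  # rare but very expensive
    "news": (3000, 5.0),
    "obscure typo": (10, 1.0),
}
# Frequency-only caching would prefer "weather today"; a cost-aware policy
# ranks "rare legal phrase" higher because each miss is far more expensive.
print(select_static_cache(stats, capacity=2))
```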
 
The ability to compute the differences that exist between two RDF/S Knowledge Bases (KB) is an important step to cope with the evolving nature of the Semantic Web (SW). In particular, RDF/S deltas can be employed to reduce the amount of data that need to be exchanged and managed over the network in order to build SW synchronization and versioning services. By considering deltas as sets of change operations, in this article we introduce various RDF/S differential functions which take into account inferred knowledge from an RDF/S knowledge base. We first study their correctness in transforming a source to a target RDF/S knowledge base in conjunction with the semantics of the employed change operations (i.e., with or without side-effects on inferred knowledge). Then we formally analyze desired properties of RDF/S deltas such as size minimality, semantic identity, redundancy elimination, reversibility, and composability, as well as identify those RDF/S differential functions that satisfy them. Subsequently, we experimentally evaluate the computing time and size of the produced deltas over real and synthetic RDF/S knowledge bases.
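The simplest member of this family of differential functions, a purely explicit delta that ignores inference, can be written as plain set operations over triples; the sketch below uses hypothetical example triples and only hints (in the closing comment) at how taking inferred knowledge into account can shrink the delta, which is where the article's analysis lives.

```python
def explicit_delta(source, target):
    """Delta as sets of change operations transforming `source` into `target`.
    Triples are plain (subject, predicate, object) tuples; no inference is applied."""
    return {
        "Add": target - source,
        "Del": source - target,
    }

source = {
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:alice", "rdf:type", "ex:Student"),
}
target = {
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:alice", "rdf:type", "ex:Person"),   # now stated at the superclass
}
delta = explicit_delta(source, target)
print("Add:", delta["Add"])
print("Del:", delta["Del"])
# With inference taken into account, ("ex:alice", "rdf:type", "ex:Person") is
# already derivable from the source, so a closure-aware delta could omit the Add.
```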
 
The blogosphere has grown to be a mainstream forum of social interaction as well as a commercially attractive source of information and influence. Tools are needed to better understand how the communities that adhere to individual blogs are constituted, in order to facilitate new personal, socially-focused browsing paradigms and to understand how blog content is consumed, which is of interest to blog authors, big media, and search. We present a novel approach to blog subcommunity characterization: we model individual blog readers using mixtures of Ngram Topic over Time (NTOT), an extension to the LDA family that jointly models phrases and time, and cluster them with a number of similarity measures using Affinity Propagation. We experiment with two datasets: a small set of blogs whose authors provide feedback, and a set of popular, highly commented blogs, which provide indicators of algorithm scalability and interpretability without prior knowledge of a given blog. The results offer useful insight to the blog authors about their commenting community, and are observed to offer an integrated perspective on the topics of discussion and the members engaged in those discussions for unfamiliar blogs. Our approach also holds promise as a component of solutions to related problems, such as online entity resolution and role discovery.
 
Consider a database of time-series, where each datapoint in a series records the total number of users who issued a specific query at an Internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it). Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is a pressing necessity for fast data processing and compact storage requirements. Here, we explicate how to compute lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10--30% improvement on distance estimations, which in turn leads to a 25--80% improvement on the search performance.
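To make "distance bounds on compressed form" concrete, here is a standard baseline construction (not the article's optimal bounds): store a fixed set of Fourier coefficients per series plus the energy of the discarded coefficients, then bound the Euclidean distance via Parseval's theorem and the triangle inequality on the discarded subspace. The coefficient count and the synthetic series are assumptions.

```python
import numpy as np

def compress(x, c):
    """Keep the first c DFT coefficients plus the energy of the discarded ones."""
    X = np.fft.fft(x)
    return X[:c], np.sum(np.abs(X[c:]) ** 2)

def distance_bounds(cx, cy, n):
    """Lower/upper bounds on the Euclidean distance between the original series,
    computed only from the compressed representations. The discarded part is
    bounded by the triangle inequality on its energies."""
    (Xk, ex), (Yk, ey) = cx, cy
    kept_part = np.sum(np.abs(Xk - Yk) ** 2)
    lower = np.sqrt((kept_part + (np.sqrt(ex) - np.sqrt(ey)) ** 2) / n)
    upper = np.sqrt((kept_part + (np.sqrt(ex) + np.sqrt(ey)) ** 2) / n)
    return lower, upper

rng = np.random.default_rng(0)
n = 256
x = np.cumsum(rng.normal(size=n))   # two synthetic "query volume" series
y = np.cumsum(rng.normal(size=n))
true_d = np.linalg.norm(x - y)
lo, hi = distance_bounds(compress(x, 16), compress(y, 16), n)
print(f"lower {lo:.2f} <= true {true_d:.2f} <= upper {hi:.2f}")
```

Tightening this gap, as the abstract notes, is exactly what improves candidate selection and elimination during search.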
 
Given that my friends on Flickr use cameras of brand X, am I more likely to also use a camera of brand X? Given that one of these friends changes her brand, am I likely to do the same? Do new camera models pop up uniformly in the friendship graph? Or do early adopters then “convert” their friends? Which factors influence the conversion probability of a user? These are the kind of questions addressed in this work. Direct applications involve personalized advertising in social networks. For our study, we crawled a complete connected component of the Flickr friendship graph with a total of 67M edges and 3.9M users. 1.2M of these users had at least one public photograph with valid model metadata, which allowed us to assign camera brands and models to users and time slots. Similarly, we used, where provided in a user’s profile, information about a user’s geographic location and the groups joined on Flickr. Concerning brand congruence, our main findings are the following. First, a pair of friends on Flickr has a higher probability of being congruent, that is, using the same brand, compared to two random users (27% vs. 19%). Second, the degree of congruence goes up for pairs of friends (i) in the same country (29%), (ii) who both only have very few friends (30%), and (iii) with a very high cliqueness (38%). Third, given that a user changes her camera model between March-May 2007 and March-May 2008, high cliqueness friends are more likely than random users to do the same (54% vs. 48%). Fourth, users using high-end cameras are far more loyal to their brand than users using point-and-shoot cameras, with a probability of staying with the same brand of 60% vs 33%, given that a new camera is bought. Fifth, these “expert” users’ brand congruence reaches 66% for high cliqueness friends. All these differences are statistically significant at 1%. As for the propagation of new models in the friendship graph, we observe the following. First, the growth of connected components of users converted to a particular, new camera model differs distinctly from random growth. Second, the decline of dissemination of a particular model is close to random decline. This illustrates that users influence their friends to change to a particular new model, rather than from a particular old model. Third, having many converted friends increases the probability of the user to convert herself. Here differences between friends from the same or from different countries are more pronounced for point-and-shoot than for digital single-lens reflex users. Fourth, there was again a distinct difference between arbitrary friends and high cliqueness friends in terms of prediction quality for conversion.
 
Current web browsers are plagued with vulnerabilities, providing hackers with easy access to computer systems via browser-based attacks. Browser security efforts that retrofit existing browsers have had limited success because the design of modern browsers is fundamentally flawed. To enable more secure web browsing, we design and implement a new browser, called the OP web browser, that attempts to improve the state of the art in browser security. We combine operating system design principles with formal methods to design a more secure web browser by drawing on the expertise of both communities. Our design philosophy is to partition the browser into smaller subsystems and make all communication between subsystems simple and explicit. At the core of our design is a small browser kernel that manages the browser subsystems and interposes on all communications between them to enforce our new browser security features. To show the utility of our browser architecture, we design and implement three novel security features. First, we develop flexible security policies that allow us to include browser plugins within our security framework. Second, we use formal methods to prove useful security properties, including user interface invariants and browser security policy. Third, we design and implement a browser-level information-flow tracking system to enable post-mortem analysis of browser-based attacks. In addition to presenting the OP browser architecture, we discuss the design and implementation of a second version of OP, OP2, that includes features from other secure web browser designs to improve on the overall security and performance of OP. To evaluate our design, we implemented OP2 and measured its performance, memory, and filesystem impact while browsing popular pages. We show that the additional security features in OP and OP2 introduce minimal overhead.
 
We propose a new web page transformation method to facilitate web browsing on handheld devices such as Personal Digital Assistants (PDAs). In our approach, an original web page that does not fit on the screen is transformed into a set of sub-pages, each of which fits on the screen. This transformation is done by slicing the original page into page blocks iteratively, with several factors considered. These factors include the size of the screen, the size of each page block, the number of blocks in each transformed page, the depth of the tree hierarchy that the transformed pages form, as well as the semantic coherence between blocks. We call the tree hierarchy of the transformed pages an SP-tree. In an SP-tree, an internal node consists of a textually enhanced thumbnail image with hyperlinks, and a leaf node is a block extracted from a sub-page of the original web page. We adaptively adjust the fanout and the height of the SP-tree so that each thumbnail image is clear enough for users to read, while at the same time, the number of clicks needed to reach a leaf page remains small. Through this transformation algorithm, we preserve the contextual information in the original web page and reduce scrolling. We have implemented this transformation module on a proxy server and have conducted usability studies on its performance. Our system achieved a shorter task completion time compared with that of transformations from the Opera browser in nine of ten tasks. The average improvement on familiar pages was 44%. The average improvement on unfamiliar pages was 37%. Subjective responses were positive.
 
The latest generation of WWW tools and services enables Web users to generate applications that combine content from multiple sources. This type of Web application is referred to as a mashup. Many of the tools for constructing mashups rely on a widget paradigm, where users must select, customize, and connect widgets to build the desired application. While this approach does not require programming, the users must still understand programming concepts to successfully create a mashup. As a result, they are put off by the time, effort, and expertise needed to build a mashup. In this article, we describe our programming-by-demonstration approach to building mashups by example. Instead of requiring a user to select and customize a set of widgets, the user simply demonstrates the integration task by example. Our approach addresses the problems of extracting data from Web sources, cleaning and modeling the extracted data, and integrating the data across sources. We implemented these ideas in a system called Karma, and evaluated Karma on a set of 23 users. The results show that, compared to other mashup construction tools, Karma allows more of the users to successfully build mashups and makes it possible to build these mashups significantly faster compared to using a widget-based approach.
 
We introduce the concern of confidentiality protection of business information for the publication of search engine query logs and derived data. We study business confidentiality as the protection of nonpublic data from institutions, such as companies and people in the public eye. In particular, we relate this concern to the involuntary exposure of confidential Web site information, and we transfer this problem into the field of privacy-preserving data mining. We characterize the possible adversaries interested in disclosing Web site confidential data and the attack strategies that they could use. These attacks are based on different vulnerabilities found in query logs, for which we present several anonymization heuristics to prevent them. We perform an experimental evaluation to estimate the remaining utility of the log after the application of our anonymization techniques. Our experimental results show that a query log can be anonymized against these specific attacks while retaining a significant volume of useful data.
 
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
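A small sketch of static caching of posting lists follows: fill a fixed memory budget with the posting lists that give the most benefit per byte, ranking terms by query frequency divided by list size, which is a common heuristic in this line of work and not necessarily the exact algorithm proposed in the article. The terms, frequencies, and sizes are hypothetical.

```python
def static_posting_cache(terms, budget_bytes):
    """terms: dict term -> (query_frequency, posting_list_bytes).
    Greedy knapsack sketch: cache posting lists in decreasing order of
    query_frequency / size until the memory budget is exhausted."""
    ranked = sorted(terms.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    cached, used = [], 0
    for term, (freq, size) in ranked:
        if used + size <= budget_bytes:
            cached.append(term)
            used += size
    return cached, used

terms = {
    "the":     (90000, 8_000_000),   # huge list, modest benefit per byte
    "weather": (40000,   300_000),
    "espn":    (25000,   120_000),
    "rdf":     (  800,    40_000),
}
print(static_posting_cache(terms, budget_bytes=500_000))
```

The example also hints at why posting lists can beat cached query answers: many distinct queries reuse the same few cached lists, so hit rates add up across queries.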
 
One of the central problems in studying and understanding complex networks, such as online social networks or the World Wide Web, is to discover hidden, either physically (e.g., interactions or hyperlinks) or logically (e.g., profiles or semantics) well-defined topological structures. From a practical point of view, a good example of such structures would be so-called network communities. Earlier studies have introduced various formulations as well as methods for the problem of identifying or extracting communities. While each of them has pros and cons as far as effectiveness and efficiency are concerned, almost none of them has explicitly dealt with the potential relationship between the global topological property of a network and the local property of individual nodes. In order to study this problem, this paper presents a new algorithm, called ICS, which aims to discover natural network communities by inferring from the local information of nodes inherently hidden in networks, based on a new centrality measure, clustering centrality, which is a generalization of eigenvector centrality. As compared with existing methods, our method runs efficiently with good clustering performance. Additionally, it is insensitive to its built-in parameters and prior knowledge.
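For readers unfamiliar with the base quantity being generalized, here is plain eigenvector centrality computed by power iteration on a small adjacency matrix; this shows only the starting point, not the clustering centrality or the ICS algorithm defined in the paper, and the toy graph is an assumption.

```python
import numpy as np

def eigenvector_centrality(adj, iters=200, tol=1e-9):
    """Power iteration for eigenvector centrality on an undirected graph given
    as a dense adjacency matrix (numpy array)."""
    n = adj.shape[0]
    x = np.ones(n) / n
    for _ in range(iters):
        x_new = adj @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new

# A small graph: two triangles bridged by one edge (nodes 2 and 3).
adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1
print(np.round(eigenvector_centrality(adj), 3))
```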
 
Video sharing services that allow ordinary Web users to upload video clips of their choice and watch video clips uploaded by others have recently become very popular. This article identifies invariants in video sharing workloads, through comparison of the workload characteristics of four popular video sharing services. Our traces contain metadata on approximately 1.8 million videos which together have been viewed approximately 6 billion times. Using these traces, we study the similarities and differences in use of several Web 2.0 features such as ratings, comments, favorites, and propensity of uploading content. In general, we find that active contribution, such as video uploading and rating of videos, is much less prevalent than passive use. While uploaders in general are skewed with respect to the number of videos they upload, the fraction of multi-time uploaders is found to differ by a factor of two between two of the sites. The distributions of lifetime measures of video popularity are found to have heavy-tailed forms that are similar across the four sites. Finally, we consider implications for system design of the identified invariants. To gain further insight into caching in video sharing systems, and the relevance to caching of lifetime popularity measures, we gathered an additional dataset tracking views to a set of approximately 1.3 million videos from one of the services, over a twelve-week period. We find that lifetime popularity measures have some relevance for large cache (hot set) sizes (i.e., a hot set defined according to one of these measures is indeed relatively “hot”), but that this relevance substantially decreases as cache size decreases, owing to churn in video popularity.
 
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is a growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching for and classifying communities on the Web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to Web graphs built on three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of them are found blindly. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density), and still about 80% for communities of twenty nodes with over 50% of the arcs present. At the lower extreme, the algorithm catches 35% of dense communities of ten nodes. We also develop sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group with representative keywords.
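
The abstract does not give the algorithm itself, but a standard baseline for this task is greedy degree peeling (Charikar's approximation for average-degree density): repeatedly remove the minimum-degree node and keep the densest intermediate subgraph. The sketch below implements that baseline on a toy graph; it is not the Community Watch algorithm.

def densest_subgraph(adj):
    # Greedy peeling: repeatedly remove the minimum-degree node and keep
    # the intermediate subgraph with the best average-degree density.
    adj = {v: set(ns) for v, ns in adj.items()}
    edges = sum(len(ns) for ns in adj.values()) // 2
    best, best_density = set(adj), edges / max(len(adj), 1)
    remaining = dict(adj)
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        for u in remaining[v]:
            remaining[u].discard(v)
        edges -= len(remaining[v])
        del remaining[v]
        if remaining:
            density = edges / len(remaining)
            if density > best_density:
                best, best_density = set(remaining), density
    return best, best_density

# Toy graph: a 4-clique (nodes 1-4) plus a sparsely attached tail (5, 6).
g = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4, 6}, 6: {5}}
print(densest_subgraph(g))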
 
LDAP directories have proliferated as an appropriate storage framework for various heterogeneous data sources, operating under a wide range of applications and services. Due to the increased amount and heterogeneity of LDAP data, appropriate data organization schemes are required. The LPAIR & LMERGE (LP-LM) algorithm, presented in this article, is a hierarchical agglomerative structure-based clustering algorithm which can be used to define the LDAP directory information tree. A thorough study of the algorithm’s performance is provided, which demonstrates its efficiency. Moreover, the Relative Link is proposed as an alternative merging criterion, since, as indicated by the experiments, it can result in more balanced clusters. Finally, the LP and LM Query Engine is presented, which, by exploiting the clustering-based LDAP data organization, enhances the LDAP server’s performance.
 
As Web applications mature and evolve, the nature of the semistructured data that drives these applications also changes. An important trend is the need for increased flexibility in the structure of Web documents. Hence, applications cannot rely solely on schemas to provide the complex knowledge needed to visualize, use, query and manage documents. Even when XML Web documents are valid with regard to a schema, the actual structure of such documents may exhibit significant variations across collections for several reasons: the schema may be very lax (e.g., RSS feeds), the schema may be large and different subsets of it may be used in different documents (e.g., industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). For these reasons, many applications that incorporate XPath queries to process a large Web document collection require an understanding of the actual structure present in the collection, and not just the schema. To support modern Web applications, we introduce DescribeX, a powerful framework that is capable of describing complex XML summaries of Web collections. DescribeX supports the construction of heterogeneous summaries that can be declaratively defined and refined by means of axis path regular expressions (AxPREs). AxPREs provide the flexibility necessary for declaratively defining complex mappings between instance nodes (in the documents) and summary nodes. These mappings are capable of expressing order and cardinality, among other properties, which can significantly help in the understanding of the structure of large collections of XML documents and enhance the performance of Web applications over these collections. DescribeX captures most summary proposals in the literature by providing (for the first time) a common declarative definition for them. Experimental results demonstrate the scalability of DescribeX summary operations (summary creation, as well as refinement and stabilization, two key enablers for tailoring summaries) on multi-gigabyte Web collections.
 
Impact of targeted attacks  
Tagging systems allow users to interactively annotate a pool of shared resources using descriptive strings called tags. Tags are used to guide users to interesting resources and help them build communities that share their expertise and resources. As tagging systems are gaining in popularity, they become more susceptible to tag spam: misleading tags that are generated in order to increase the visibility of some resources or simply to confuse users. Our goal is to understand this problem better. In particular, we are interested in answers to questions such as: How many malicious users can a tagging system tolerate before results significantly degrade? What types of tagging systems are more vulnerable to malicious attacks? What would be the effort and the impact of employing a trusted moderator to find bad postings? Can a system automatically protect itself from spam, for instance, by exploiting user tag patterns? In a quest for answers to these questions, we introduce a framework for modeling tagging systems and user tagging behavior. We also describe a method for ranking documents matching a tag based on taggers' reliability. Using our framework, we study the behavior of existing approaches under malicious attacks and the impact of a moderator and our ranking method.
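
As a minimal, hypothetical reading of reliability-based ranking (not the authors' exact method), the sketch below estimates a tagger's reliability as the fraction of their postings that at least one other user agrees with, and scores each document matching a tag by the summed reliability of its taggers.

from collections import defaultdict

def tagger_reliability(postings):
    # postings: list of (user, doc, tag). A user's reliability is crudely
    # estimated as the fraction of their (doc, tag) postings that at least
    # one other user also made.
    support = defaultdict(set)
    for user, doc, tag in postings:
        support[(doc, tag)].add(user)
    rel = defaultdict(list)
    for user, doc, tag in postings:
        rel[user].append(1.0 if len(support[(doc, tag)]) > 1 else 0.0)
    return {user: sum(v) / len(v) for user, v in rel.items()}

def rank_docs(postings, query_tag):
    # Score each document matching the tag by the summed reliability of its taggers.
    rel = tagger_reliability(postings)
    scores = defaultdict(float)
    for user, doc, tag in postings:
        if tag == query_tag:
            scores[doc] += rel[user]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical postings: a spammer applies a popular tag to an unrelated page.
posts = [("alice", "d1", "python"), ("bob", "d1", "python"),
         ("carol", "d2", "python"), ("alice", "d2", "python"),
         ("spammer", "d3", "python"), ("spammer", "d3", "cars")]
print(rank_docs(posts, "python"))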
 
Semi-automatic anti-spam algorithms propagate either trust through links from a good seed set (e.g., TrustRank) or distrust through inverse links from a bad seed set (e.g., Anti-TrustRank) to the entire Web. These kinds of algorithms have shown their power in combating link-based Web spam since they integrate both human judgement and machine intelligence. Nevertheless, there is still much room for improvement. One issue of most existing trust/distrust propagation algorithms is that only trust or distrust is propagated and only a good seed set or a bad seed set is used. According to Wu et al. [2006a], a combined usage of both trust and distrust propagation can lead to better results, and an effective framework is needed to realize this insight. Another more serious issue of existing algorithms is that trust or distrust is propagated in nondifferential ways, that is, a page propagates its trust or distrust score uniformly to its neighbors, without considering whether each neighbor should be trusted or distrusted. Such blind propagation schemes are inconsistent with the original intention of trust/distrust propagation. However, it seems impossible to implement differential propagation if only trust or distrust is propagated. In this article, we take the view that each Web page has both a trustworthy side and an untrustworthy side, and we thus assign two scores to each Web page: T-Rank, scoring the trustworthiness of the page, and D-Rank, scoring the untrustworthiness of the page. We then propose an integrated framework that propagates both trust and distrust. In the framework, the propagation of T-Rank/D-Rank is penalized by the target's current D-Rank/T-Rank. In other words, the propagation of T-Rank/D-Rank is decided by the target's current (generalized) probability of being trustworthy/untrustworthy; thus a page propagates more trust/distrust to a trustworthy/untrustworthy neighbor than to an untrustworthy/trustworthy neighbor. In this way, propagating both trust and distrust with target differentiation is implemented. We use T-Rank scores to realize spam demotion and D-Rank scores to accomplish spam detection. The proposed Trust-DistrustRank (TDR) algorithm regresses to TrustRank and Anti-TrustRank when the penalty factor is set to 1 and 0, respectively. Thus TDR could be seen as a combinatorial generalization of both TrustRank and Anti-TrustRank. TDR not only makes full use of both trust and distrust propagation, but also overcomes the disadvantages of both TrustRank and Anti-TrustRank. Experimental results on benchmark datasets show that TDR outperforms other semi-automatic anti-spam algorithms for both spam demotion and spam detection tasks under various criteria.
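
For background, the sketch below shows plain TrustRank-style propagation, i.e., PageRank with the teleport distribution biased toward a good seed set. It is not the proposed TDR algorithm (whose coupled T-Rank/D-Rank update with the penalty factor is defined in the paper); the toy graph and damping factor are illustrative assumptions.

def trustrank(out_links, seeds, alpha=0.85, iters=50):
    # Biased PageRank: the teleport distribution is concentrated on the good seed set.
    nodes = set(out_links) | {v for ns in out_links.values() for v in ns}
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    t = dict(teleport)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * teleport[v] for v in nodes}
        for u, ns in out_links.items():
            if not ns:
                continue
            share = alpha * t[u] / len(ns)
            for v in ns:
                nxt[v] += share
        t = nxt
    return t

# Toy Web graph: the spam pages only link to each other and receive no trust.
web = {"seed": ["a", "b"], "a": ["b"], "b": ["seed"], "spam": ["spam2"], "spam2": ["spam"]}
print(trustrank(web, seeds={"seed"}))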
 
The predominant business model for Web search engines is sponsored search, which generates billions in yearly revenue. But are sponsored links providing online consumers with relevant choices for products and services? We address this and related issues by investigating the relevance of sponsored and nonsponsored links for e-commerce queries on the major search engines. The results show that average relevance ratings for sponsored and nonsponsored links are practically the same, although the relevance ratings for sponsored links are statistically higher. We used 108 e-commerce queries and 8,256 retrieved links for these queries from three major Web search engines: Yahoo!, Google, and MSN. In addition to relevance measures, we qualitatively analyzed the e-commerce queries, deriving five categorizations of underlying information needs. Product-specific queries are the most prevalent (48%). Title (62%) and summary (33%) are the primary basis for evaluating sponsored links, with the URL a distant third (2%). To gauge the effectiveness of sponsored search campaigns, we analyzed the sponsored links from various viewpoints. It appears that links from organizations with large sponsored search campaigns are more relevant than the average sponsored link. We discuss the implications for Web search engines and sponsored search as a long-term business model and as a mechanism for finding relevant information for searchers.
 
Compressed graph representations, in particular for Web graphs, have become an attractive research topic because of their applications in the manipulation of huge graphs in main memory. The state of the art is well represented by the WebGraph project, where advantage is taken of several particular properties of Web graphs to offer a trade-off between space and access time. In this paper we show that the same properties can be exploited with a different and elegant technique that builds on grammar-based compression. In particular, we focus on Re-Pair and on Ziv-Lempel compression, which, although they cannot reach the best compression ratios of WebGraph, achieve much faster navigation of the graph when both are tuned to use the same space. Moreover, the technique adapts well to run on secondary memory and in distributed scenarios. As a byproduct, we introduce an approximate Re-Pair version that works efficiently with severely limited main memory.
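
Re-Pair itself is easy to illustrate: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal and record the grammar rule. The naive (quadratic) sketch below shows one such run on a small symbol sequence; it is not the approximate, memory-limited variant discussed above.

from collections import Counter

def repair(seq):
    # Naive Re-Pair: repeatedly replace the most frequent adjacent pair
    # with a new nonterminal until no pair occurs at least twice.
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        sym = f"R{next_id}"
        next_id += 1
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

# Illustrative input: a small symbol sequence standing in for flattened adjacency lists.
compressed, grammar = repair(list("abababcabab"))
print(compressed, grammar)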
 
The understanding of the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been recently tackled by characterizing the properties of its representative graphs, in which vertices and directed edges are identified with Web pages and hyperlinks, respectively. Data gathered in large-scale crawls have been analyzed by several groups resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks. In this article, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data. This raises the issue of sampling biases and structural differences among Web crawls that might induce properties not representative of the actual global underlying graph. In short, the stability of the widely accepted statistical description of the Web is called into question. In order to provide a more accurate characterization of the Web graph, we study statistical measures beyond the degree distribution, such as degree-degree correlation functions or the statistics of reciprocal connections. The latter appears to capture the relevant correlations of the WWW graph and to carry most of the topological information of the Web. The analysis of this quantity is also of major interest in relation to the navigability and searchability of the Web.
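
For instance, the reciprocity statistic mentioned above is simply the fraction of directed edges whose reverse edge is also present; the sketch below computes it on a toy directed graph (illustrative data only).

def reciprocity(edges):
    # Fraction of directed edges (u, v) for which (v, u) also exists.
    edge_set = set(edges)
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set and u != v)
    return mutual / len(edge_set) if edge_set else 0.0

# Toy Web graph: two pages link to each other; one link is unreciprocated.
links = [("a", "b"), ("b", "a"), ("a", "c")]
print(reciprocity(links))  # 2/3: both directions of the a-b pair count as reciprocal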
 
Searchers on the Web often aim to find key resources about a topic. Finding such results is called topic distillation. Previous research has shown that the use of sources of evidence such as page indegree and URL structure can significantly improve search performance on interconnected collections such as the Web, beyond the use of simple term distribution statistics. This article presents a new approach to improve topic distillation by exploring the use of external sources of evidence: link structure, including query-dependent indegree and outdegree; and web page characteristics, such as the density of anchor links. Our experiments with the TREC .GOV collection, an 18GB crawl of the US .gov domain from 2002, show that using such evidence can significantly improve search effectiveness, with combinations of evidence leading to significant performance gains over both full-text and anchor-text baselines. Moreover, we demonstrate that, at a different scope level, both local query-dependent outdegree and query-dependent indegree outperformed their global query-independent counterparts; and at the same scope level, outdegree outperformed indegree. Adding query-dependent indegree or page characteristics to query-dependent outdegree yielded a small, but not significant, improvement.
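
A minimal, hypothetical illustration of combining such evidence: each candidate document receives a weighted sum of its normalized full-text score, query-dependent indegree, and query-dependent outdegree. The weights and the linear combination below are assumptions for the example, not the paper's model.

def combine_evidence(docs, weights=(0.6, 0.25, 0.15)):
    # docs: dict doc_id -> (text_score, qdep_indegree, qdep_outdegree).
    # Each feature column is min-max normalized, then linearly combined.
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    ids = list(docs)
    features = list(zip(*(docs[d] for d in ids)))  # columns: text, indeg, outdeg
    norm = [normalize(col) for col in features]
    scores = {d: sum(w * norm[f][i] for f, w in enumerate(weights))
              for i, d in enumerate(ids)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical feature values for three candidate key resources.
candidates = {"gov/home": (12.1, 40, 55), "gov/faq": (14.0, 5, 3), "gov/sitemap": (9.5, 8, 70)}
print(combine_evidence(candidates))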
 
Top-cited authors
Kwei-Jay Lin
  • University of California, Irvine
Yue Zhang
  • University of California, Irvine
Bernardo Huberman
Jure Leskovec
  • Stanford University
Wei-Ying Ma