Article

Understanding Website Complexity: Measurements, Metrics, and Implications


Abstract

Over the years, the web has evolved from simple text content from one server to a complex ecosystem with different types of content from servers spread across several administrative domains. There is anecdotal evidence of users being frustrated with high page load times or when obscure scripts cause their browser windows to freeze. Because page load times are known to directly impact user satisfaction, providers would like to understand if and how the complexity of their websites affects the user experience. While there is an extensive literature on measuring web graphs, website popularity, and the nature of web traffic, there has been little work in understanding how complex individual websites are, and how this complexity impacts the clients' experience. This paper is a first step to address this gap. To this end, we identify a set of metrics to characterize the complexity of websites both at a content-level (e.g., number and size of images) and service-level (e.g., number of servers/origins). We find that the distributions of these metrics are largely independent of a website's popularity rank. However, some categories (e.g., News) are more complex than others. More than 60% of websites have content from at least 5 non-origin sources and these contribute more than 35% of the bytes downloaded. In addition, we analyze which metrics are most critical for predicting page render and load times and find that the number of objects requested is the most important factor. With respect to variability in load times, however, we find that the number of servers is the best indicator.


... A key performance metric, page load time, predicts user experience during development by encompassing events like downloading and rendering HTML, JavaScript, CSS, and images [6,13]. This metric is influenced by extrinsic factors (e.g., network latency, bandwidth, server capacity) and intrinsic factors (e.g., page size, resource usage, third-party content). ...
... Existing research explores various ML approaches for this purpose. Butkiewicz et al. classify web page performance into different tiers (excellent, good, fair, unacceptable) [13,23]. Additionally, Calvano investigates the correlation between performance metrics and page characteristics [24]. ...
... Early models by Menasce et al. [16] and Zhi [28] considered factors like page size and bandwidth. More complex models included server and client processing times and payload size (Peter Sevcik et al. [26], Nagarajan et al. [29], Butkiewicz et al. [13], Krzysztof et al. [30]). Machine learning techniques have also been explored for performance prediction (Zhou et al. [23]). ...
Article
Full-text available
This study introduces a novel evaluation framework for predicting web page performance, utilizing state-of-the-art machine learning algorithms to enhance the accuracy and efficiency of web quality assessment. We systematically identify and analyze 59 key attributes that influence website performance, derived from an extensive literature review spanning from 2010 to 2024. By integrating a comprehensive set of performance metrics—encompassing usability, accessibility, content relevance, visual appeal, and technical performance—our framework transcends traditional methods that often rely on limited indicators. Employing various classification algorithms, including Support Vector Machines (SVMs), Logistic Regression, and Random Forest, we compare their effectiveness on both original and feature-selected datasets. Our findings reveal that SVMs achieved the highest predictive accuracy of 89% with feature selection, compared to 87% without feature selection. Similarly, Random Forest models showed a slight improvement, reaching 81% with feature selection versus 80% without. The application of feature selection techniques significantly enhances model performance, demonstrating the importance of focusing on impactful predictors. This research addresses critical gaps in the existing literature by proposing a methodology that utilizes newly extracted features, making it adaptable for evaluating the performance of various website types. The integration of automated tools for evaluation and predictive capabilities allows for proactive identification of potential performance issues, facilitating informed decision-making during the design and development phases. By bridging the gap between predictive modeling and optimization, this study contributes valuable insights to practitioners and researchers alike, establishing new benchmarks for future investigations in web page performance evaluation.
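The abstract above compares SVM and Random Forest classifiers with and without feature selection. A minimal sketch of that kind of comparison is shown below; the CSV file, the "label" column, and the choice of k=20 features are assumptions for illustration, not the study's actual dataset or configuration.

```python
# Hypothetical sketch: comparing SVM and Random Forest accuracy with and
# without feature selection. File name, column names, and k are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("web_performance_attributes.csv")   # assumed: attribute columns + "label"
X, y = df.drop(columns=["label"]), df["label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "svm":          make_pipeline(StandardScaler(), SVC()),
    "svm_selected": make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20), SVC()),
    "rf":           RandomForestClassifier(n_estimators=300, random_state=0),
    "rf_selected":  make_pipeline(SelectKBest(f_classif, k=20),
                                  RandomForestClassifier(n_estimators=300, random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, model.predict(X_te)), 3))
```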
... Gaining insight into the above questions and understanding how much ads contribute to the breakdown of different activities in modern browsers can inform the design of efficient ads and optimizations targeting those specific activities. Unfortunately, only a handful of studies [30,36,38,50] have been devoted to the performance analysis of ads, yet many such important open questions remain to be answered. ...
... While user privacy and security are crucial, even ads that are safe and not tracking users can have a significant performance impact that has cascading effects on user satisfaction and Internet costs. Some notable studies [30,36,50,52,56] lean on ad blockers to measure the performance cost of web ads. The key distinction between our approach and prior efforts is that we do not rely on ad blockers and content-blocking for performance analysis of ads for three main reasons: ...
... Butkiewicz et al. [30] break down the content of non-origin requests by MIME type and report that images and HTML/XML contribute 42% and 9%, respectively, which is slightly higher than our measurements, whereas the JavaScript contribution (25%) is far less than in our measurements. Given the fact that 70% of these non-origin requests belong to advertising and analytics, this comparison signifies the rise of responsive and interactive ads within the past few years. ...
Article
Full-text available
Monetizing websites and web apps through online advertising is widespread in the web ecosystem, creating a billion-dollar market. This has led to the emergence of a vast network of tertiary ad providers and ad syndication to facilitate this growing market. Nowadays, the online advertising ecosystem forces publishers to integrate ads from these third-party domains. On the one hand, this raises several privacy and security concerns that have been actively studied in recent years. On the other hand, the ability of today's browsers to load dynamic web pages with complex animations and JavaScript has also transformed online advertising. This can have a significant impact on webpage performance. The latter is a critical metric for optimization since it ultimately impacts user satisfaction. Unfortunately, the literature on understanding the performance impact of online advertising, which we argue is as important as privacy and security, is limited. In this paper, we present an in-depth, first-of-its-kind performance evaluation of web ads. Unlike prior efforts that rely primarily on adblockers, we perform a fine-grained analysis of the web browser's page loading process to demystify the performance cost of web ads. We aim to characterize the cost of every component of an ad, so the publisher, ad syndicate, and advertiser can improve the ad's performance with detailed guidance. For this purpose, we develop a tool, adPerf, for the Chrome browser that classifies page loading workloads into ad-related and main-content work at the granularity of browser activities. Our evaluations show that online advertising entails more than 15% of the browser page loading workload and approximately 88% of that is spent on JavaScript. On smartphones, this additional cost of ads is 7% lower since mobile pages include fewer and better-optimized ads. We also track the sources and delivery chain of web ads and analyze performance considering the origin of the ad contents. We observe that 2 well-known third-party ad domains contribute 35% of the ads' performance cost and, surprisingly, top news websites implicitly include unknown third-party ads which in some cases account for more than 37% of the ads' performance cost.
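adPerf attributes cost at the granularity of browser activities; a much coarser but easy-to-reproduce approximation is to attribute page-load bytes by matching request hosts against an ad-domain list, as sketched below. The tiny domain list and the HAR file path are assumptions, and this is not the adPerf methodology, only an illustration of request-level attribution.

```python
# Coarse, request-level sketch of splitting page-load bytes into ad-related
# and main-content traffic using a (tiny, illustrative) ad-domain list.
import json
from urllib.parse import urlparse

AD_DOMAINS = {"doubleclick.net", "googlesyndication.com", "adnxs.com"}  # illustrative only

def is_ad(host: str) -> bool:
    return any(host == d or host.endswith("." + d) for d in AD_DOMAINS)

with open("page.har") as f:                      # assumed HAR export of one page load
    entries = json.load(f)["log"]["entries"]

ad_bytes = main_bytes = 0
for e in entries:
    host = urlparse(e["request"]["url"]).hostname or ""
    size = max(e["response"].get("bodySize", 0), 0)
    if is_ad(host):
        ad_bytes += size
    else:
        main_bytes += size

total = ad_bytes + main_bytes or 1
print(f"ad-related bytes: {ad_bytes} ({100 * ad_bytes / total:.1f}% of page)")
```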
... This web "bloat" has been further exacerbated by the widespread use of third-party scripts from content delivery networks (CDNs), analytics services, and other external resources. Butkiewicz et al. [10] showed that, on average, modern web pages rely on at least 5 non-origin sources, contributing to more than 35% of the total bytes downloaded. ...
... Indeed, modern webpages consist of a large number of web elements hosted across several domains. Butkiewicz et al. [10] have shown that more than 60% of webpages request data from at least 5 different nonorigin sources, contributing to more than 35% of the overall page size. Furthermore, a modern browser must fetch and render several objects, including HTML, JS, CSS, and images, forming a complex object dependency graph [11,42,49]. ...
Preprint
Full-text available
The web experience in developing regions remains subpar, primarily due to the growing complexity of modern webpages and insufficient optimization by content providers. Users in these regions typically rely on low-end devices and limited bandwidth, which results in a poor user experience as they download and parse webpages bloated with excessive third-party CSS and JavaScript (JS). To address these challenges, we introduce the Mobile Application Markup Language (MAML), a flat layout-based web specification language that reduces computational and data transmission demands, while replacing the excessive bloat from JS with a new scripting language centered on essential (and popular) web functionalities. Last but not least, MAML is backward compatible as it can be transpiled to minimal HTML/JavaScript/CSS and thus work with legacy browsers. We benchmark MAML in terms of page load times and sizes, using a translator which can automatically port any webpage to MAML. When compared to the popular Google AMP, across 100 testing webpages, MAML offers webpage speedups by tens of seconds under challenging network conditions thanks to its significant size reductions. Next, we run a competition involving 25 university students porting 50 of the above webpages to MAML using a web-based editor we developed. This experiment verifies that, with little developer effort, MAML is quite effective in maintaining the visual and functional correctness of the originating webpages.
... Therefore, understanding how page content and infrastructure influence PLT remains crucial. Previous works exploring the relationship between page complexity metrics, namely page content and infrastructure metrics, and PLT have focused either on a few selected pages, with individual analyses being performed for each [Asrese et al. 2019], [Vogel and Springer 2022], or on a diverse group of pages, with an overall analysis conducted for all of the pages simultaneously [Saverimoutou et al. 2019], [Butkiewicz et al. 2011]. ...
... In [Saverimoutou et al. 2019], an analysis of time to first visual rendering showed lower RTTs (Round Trip Times) and fewer requests in "good response" navigations under various conditions, underscoring the significance of the number of requests in page load time prediction. Similarly, [Butkiewicz et al. 2011] finds a strong correlation between the number of bytes and PLT, while also identifying the number of requests as the best predictor of PLT. ...
Conference Paper
We study the metric Page Load Time (PLT), which has a significant impact on user experience, search engine optimization, and conversion rates. We explore how page complexity metrics, specifically content and infrastructure metrics, affect PLT. We employ both supervised and unsupervised machine learning models to analyze the influence of these metrics at multiple levels: single page, page category, cluster, and general. Our study shows that the number of bytes, requests, and distinct images are key features in PLT prediction, with the page category model generally outperforming the others. The results contribute to a better understanding of the factors influencing PLT and offer insights into how to optimize web pages for better user experiences and business outcomes.
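A minimal sketch of the supervised part of such an analysis is shown below: predict PLT from page complexity metrics and rank feature importance. The CSV file and column names are assumptions for illustration, not the paper's dataset.

```python
# Sketch: predicting PLT from page complexity metrics and ranking the
# features by importance. File and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("page_complexity.csv")   # assumed columns listed below
features = ["num_bytes", "num_requests", "num_images", "num_scripts", "num_servers"]
X, y = df[features], df["plt_ms"]

model = RandomForestRegressor(n_estimators=300, random_state=0)
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())

model.fit(X, y)
for name, imp in sorted(zip(features, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:12s} {imp:.3f}")
```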
... The various CMSs were run to observe how they perform with the use of CSS. Butkiewicz (2011) noted in a study that CSS files give websites a good, well-structured layout. A website with a good layout can contain many features and much content on a page, because the layout helps manage the space on the webpage. ...
... These records are shown in Table 7 below (Butkiewicz, 2011). A web page with more cascading style sheets will have a better layout than a page with fewer or no CSS files. ...
Article
Full-text available
The most recent trend in website development is the use of Content Management Systems (CMSs). These systems provide a user-friendly, interactive interface that lets people without prior coding knowledge create websites. Many kinds of CMS technologies are now available, so it becomes challenging for users and developers to decide which one is best to work with. This study examined three popular CMSs, namely Joomla, Drupal, and WordPress, to explore which of the three is ideal for developing an efficient website in terms of performance. The study investigated in detail the features common to all three CMSs and evaluated them against four performance criteria. The method used was quantitative: the researcher developed three websites using the three CMSs and tested them against the performance criteria of page load time, page size, number of cascading style sheet (CSS) files, and number of JavaScript files. The data collected through this page-performance exercise was then tabulated, compared, and analyzed. The study found that Joomla created the most JavaScript files and, as a result, was very interactive, while WordPress created the most CSS files and, as a result, gave the best web layout. The researcher concluded that each of these CMSs performed well on different criteria, so the best CMS depends on the interests of the developer and the goals of the site. This research will help readers and academia in general to acquire knowledge of the characteristics of the three CMSs in terms of strengths and shortfalls.
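The four criteria used in the study (page load time, page size, number of CSS files, number of JavaScript files) can be approximated for a single URL with a simple client-side check, sketched below. This is only a rough approximation under stated assumptions: it measures HTML fetch time rather than full render time, and the example URL is a placeholder.

```python
# Hedged sketch of measuring the study's four criteria for one URL.
import time
import requests
from bs4 import BeautifulSoup

def page_metrics(url: str) -> dict:
    start = time.time()
    resp = requests.get(url, timeout=30)
    load_time = time.time() - start          # HTML fetch time only, not full render
    soup = BeautifulSoup(resp.text, "html.parser")
    css = sum(1 for l in soup.find_all("link") if "stylesheet" in (l.get("rel") or []))
    return {
        "load_time_s": round(load_time, 2),
        "html_bytes": len(resp.content),
        "css_files": css,
        "js_files": len(soup.find_all("script", src=True)),
    }

print(page_metrics("https://example.com"))
```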
... Moreover, there are approximately 1.72 billion websites accessible through desktops and smartphone devices [2]. However, research studies have reported that, on average, users abandon websites within the first 3 seconds due to diverse reasons, including the complexity and slow loading times of sites [3]. Therefore, developers must produce efficient yet compelling web designs that accommodate the taste of their users and keep up with the technological changes. ...
... Moreover, the studies, e.g. [3], [20], [39], which were conducted concerning web design produced a wealth of suggestions and principles that will help maximize efficiency, accessibility, safety, and usability of intelligent web designs. However, further research is required to organize, clarify, and validate these design specific recommendations [45]. ...
... JavaScript and HTTP re-directions [17]. Together, these factors lead to poor web performance in constrained network conditions in emerging markets [23,52]. ...
Preprint
Full-text available
Despite increasing mobile Internet penetration in developing regions, mobile users continue to experience a poor web experience due to two key factors: (i) lack of locally relevant content; (ii) poor web performance due to complex web pages and poor network conditions. In this paper, we describe our design, implementation and deployment experiences of GAIUS, a mobile content ecosystem that enables efficient creation and dissemination of locally relevant web content into hyperlocal communities in emerging markets. The basic building blocks of GAIUS are a lightweight content edge platform combined with a mobile application that collectively provide a Hyperlocal Web abstraction for mobile users to create and consume locally relevant content and interact with other users via a community abstraction. The GAIUS platform uses MAML, a web specification language that dramatically simplifies web pages to reduce the complexity of Web content within the GAIUS ecosystem, improve page load times and reduce network costs. In this paper, we describe our experiences deploying GAIUS across a large user base in India, Bangladesh and Kenya.
... Many studies consider website characteristics at the network level [8], [9] and the client level [7]. Previous work concentrates more on webpage performance in terms of latency than on data usage based on downloaded page sizes. ...
Article
Full-text available
Nowadays, images represent approximately half of a website's size, which affects website performance; optimization therefore becomes necessary to avoid increasing image transfer size as well as page load time. Many studies have shown that image optimization is lacking on most websites. This study introduces image optimization approaches used with Joomla websites and finds that all of the optimization approaches reduce total page size and improve overall performance, at different levels of optimization. INTRODUCTION: Images affect page weight, which impacts data usage significantly; more than half of the total page weight of desktop and mobile sites is made up of images. For instance, entertainment websites use a large number of images on both mobile and desktop sites, and finance websites have average image sizes. In this study, compression and optimization were applied to images in order to decrease the total overhead while maintaining quality. Our analysis shows that most websites do not apply image optimization. The most used image formats (GIF, PNG, JPG) each use different compression techniques in terms of size and quality optimization. We used the OptiPNG image compression technique [1] to convert GIF images to PNG; the analysis shows an average file size reduction of 15%. SUMMARY: Our image overhead analysis showed that image optimization is lacking on most websites. From our average size measurements, we observe that a total saving of about 263 kilobytes can be realized on desktop sites, and 121 kilobytes on mobile websites. These are significant reductions that alone can shrink mobile website sizes by an average of 13.1% and desktop website sizes by an average of 15.1%. The size reduction can be further enhanced if the image quality is appropriately optimized for the target screen quality.
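The study uses OptiPNG; as an illustration only, the sketch below re-encodes a GIF as an optimized PNG with Pillow (a stand-in, not the study's tool) and reports the size change. File names are assumptions, and for some inputs the conversion may not shrink the file at all.

```python
# Illustrative sketch: re-encode a GIF as an optimized PNG and report savings.
# Pillow is used here only as a stand-in for OptiPNG; file names are assumed.
import os
from PIL import Image

def gif_to_optimized_png(src: str, dst: str) -> float:
    Image.open(src).convert("RGBA").save(dst, format="PNG", optimize=True)
    return 1 - os.path.getsize(dst) / os.path.getsize(src)   # fraction saved (may be negative)

reduction = gif_to_optimized_png("banner.gif", "banner.png")
print(f"size reduction: {reduction:.1%}")
```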
... To understand and derive the benefits of web analysis [2], you must first understand metrics and the different kinds of measures available for analyzing user information. Although metrics may seem basic, once collected, you can use these metrics to analyze web traffic and improve a web platform to meet the expectations of the site's traffic [3]. These metrics generally fall into four categories: site usage, referrers (or how visitors arrived at the site), site content analysis, and quality assurance. ...
Chapter
This chapter presents the process of web analytics for web platforms, web systems, and web apps. The process outlines how basic visitor information, such as the number of visitors and visit duration, can be collected using log files and page tagging. This basic visitor information is then combined to create meaningful key performance indicators that are tailored not only to the business goals of the company running the web platform but also to the goals and content of the web platform. Finally, this chapter presents several analytic tools and explains how to choose the right tool for the web platform’s needs. The ultimate goal of this chapter is to provide methods for increasing revenue and customer satisfaction through careful analysis of visitor interaction with a web platform.
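As a small illustration of the log-file approach described in the chapter, the sketch below derives two basic KPIs (unique visitors and mean visit duration) from a simplified access log. The log format and the 30-minute session cutoff are assumptions, not the chapter's prescribed method.

```python
# Hedged sketch: compute unique visitors and mean visit duration from a
# simplified log with assumed lines of the form "<visitor_id> <ISO ts> <path>".
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # assumed session cutoff

def visitor_kpis(path: str) -> dict:
    hits = defaultdict(list)
    with open(path) as f:
        for line in f:
            visitor, ts, _ = line.split(maxsplit=2)
            hits[visitor].append(datetime.fromisoformat(ts))

    durations = []
    for times in hits.values():
        times.sort()
        start = prev = times[0]
        for t in times[1:]:
            if t - prev > SESSION_GAP:           # a long gap ends the visit
                durations.append((prev - start).total_seconds())
                start = t
            prev = t
        durations.append((prev - start).total_seconds())

    return {"unique_visitors": len(hits),
            "mean_visit_duration_s": sum(durations) / len(durations)}

print(visitor_kpis("access.log"))
```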
... Today DNS is a key determinant, directly and indirectly, of users' quality of experience (QoE) and privy to their tastes, preferences, and even the devices they own. It directly determines user performance as, for instance, accessing any website requires tens of DNS resolutions [13,14,16]. Indirectly, a user's specific DNS resolver determines their QoE as many content delivery networks (CDNs) continue to rely on DNS for replica selection. ...
Preprint
Full-text available
The Domain Name System (DNS) is both a key determinant of users' quality of experience (QoE) and privy to their tastes, preferences, and even the devices they own. Growing concern about user privacy and QoE has brought a number of alternative DNS services, from public DNS to encrypted and Oblivious DNS. While offering valuable features, these DNS variants are operated by a handful of providers, reinforcing a trend towards centralization that has raised concerns about privacy, competition, resilience and Web QoE. The goal of this work is to let users take advantage of third-party DNS services, without sacrificing privacy or performance. We follow Wheeler's advice, adding another level of indirection with an end-system DNS resolver, Onoma, that improves privacy, avoiding DNS-based user-reidentification by inserting and sharding requests across resolvers, and improves performance by running resolution races among resolvers and reinstating the client-resolver proximity assumption content delivery networks rely on. As our evaluation shows, while there may not be an ideal service for all clients in all places, Onoma dynamically finds the best service for any given location.
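The "resolution race" idea can be sketched in a few lines: query several resolvers in parallel and keep the first successful answer. The code below assumes dnspython (version 2.0 or later) and uses public resolver IPs purely as examples; it omits Onoma's sharding and privacy mechanisms.

```python
# Minimal sketch of racing DNS resolutions across resolvers (dnspython >= 2.0).
from concurrent.futures import ThreadPoolExecutor, as_completed
import dns.resolver

RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]   # illustrative public resolvers

def resolve_with(server: str, name: str) -> list[str]:
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server]
    r.lifetime = 2.0
    return [a.to_text() for a in r.resolve(name, "A")]

def race(name: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=len(RESOLVERS)) as pool:
        futures = [pool.submit(resolve_with, s, name) for s in RESOLVERS]
        for fut in as_completed(futures):
            try:
                return fut.result()              # first successful answer wins
            except Exception:
                continue
    return []

print(race("example.com"))
```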
... While user privacy and security are crucial, even safe ads that do not track users can have a significant impact on performance, with cascading effects on user satisfaction and Internet costs. Several well-known studies [9,10] rely on ad blockers to measure the performance cost of web ads. The following problems with Internet ad blockers stand out. ...
... However, resources on modern web pages are often served by the direct-connect domain and multiple third-party domains [12,13]. Even with the dependency structure of the web page from the direct-connect domain, the client still needs to initialize connections and send requests to multiple third-party domains. When the network latency between the client and these domains is large, the time overhead of initiating connections and requesting objects degrades the transmission efficiency of client-side solutions. ...
Article
Full-text available
The dependencies between the resources on a web page slow down the page load process, resulting in degradation of user experience and provider revenue. Recent solutions reprioritize requests at the client side to fetch resources according to the page's dependency structure. However, since resources on modern web pages are often served by the direct-connect domain and multiple third-party domains, optimizing web performance only at the client side cannot minimize the long waiting time for resources spread across multiple domains. To address this inefficiency, we present fast page load (FPL), a scheme that restructures the interaction between multiple domains to accelerate web page loads. The key to our solution is that credible servers in third-party domains proactively push resources to the client with the aid of the server in the direct-connect domain. Therefore, FPL eliminates the waiting time of resource requesting and TCP handshaking between the client and servers in third-party domains. Furthermore, we propose FPL+ to schedule resources based on the dependency and size of objects to improve user perception and experience in terms of time-to-first-paint. Experimental results show that FPL and FPL+ effectively reduce the median page load time by up to 44% across popular websites.
... To resolve the above issue, we make use of an additional heuristic based on Subject Alternative Names (the SAN list) [13]. If the website uses HTTPS, we find the site's SAN list via the website's SSL certificate. ...
Preprint
There is a growing concern about consolidation trends in Internet services, with, for instance, a large fraction of popular websites depending on a handful of third-party service providers. In this paper, we report on a large-scale study of third-party dependencies around the world, using vantage points from 50 countries, from all inhabited continents, and regional top-500 popular websites. This broad perspective shows that dependencies vary widely around the world. We find that between 15% and as much as 80% of websites, across all countries, depend on a DNS, CDN or CA third-party provider. Sites' critical dependencies, while lower, are equally spread, ranging from 9% to 61% (CDN and DNS in China, respectively). Despite this high variability, our results suggest a highly concentrated market of third-party providers: three third-party providers across all countries serve an average of 91.2% and Google, by itself, serves an average of 72% of the surveyed websites. We explore various factors that may help explain the differences and similarities in degrees of third-party dependency across countries, including economic conditions, Internet development, language, and economic trading partners.
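The SAN-list heuristic mentioned in the excerpt above can be reproduced with a short script: fetch a site's TLS certificate and list its Subject Alternative Names. The sketch below uses the standard library plus the cryptography package; the host name is an assumption.

```python
# Sketch: extract the Subject Alternative Names from a site's TLS certificate.
import ssl
from cryptography import x509

def san_list(host: str, port: int = 443) -> list[str]:
    pem = ssl.get_server_certificate((host, port))          # fetch certificate as PEM
    cert = x509.load_pem_x509_certificate(pem.encode())
    ext = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    return ext.value.get_values_for_type(x509.DNSName)

print(san_list("www.example.com"))
```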
... Part of the problem is that much of the web has been designed, implicitly or not, for users on good networks and devices or, at least, without consideration of the performance implications [11]. This has resulted in more complex websites [5], with heavy web fonts, external resources, large images, and animation that, while perhaps visually appealing to high-end users, can be frustrating to the rest. ...
Preprint
The quality of experience with the mobile web remains poor, partially as a result of complex websites and design choices that worsen performance, particularly for users on suboptimal networks or devices. Prior proposed solutions have seen limited adoption due in part to the demands they place on developers and content providers, and the infrastructure needed to support them. We argue that Document and Permissions Policies -- an ongoing effort to enforce good practices in web design -- may offer the basis for a readily available and easily adoptable solution. In this paper, we evaluate the potential performance cost of violating well-understood policies and how common such violations are in today's web. Our analysis shows, for example, that controlling for the unsized-media policy, something applicable to 70% of the top-1 million websites, can indeed reduce the Cumulative Layout Shift metric.
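A simple static check for the unsized-media violations discussed above is sketched below: it flags <img> and <video> elements that lack explicit width and height attributes, a common source of layout shift. This is only an HTML-level approximation (it ignores CSS sizing), and the example URL is a placeholder.

```python
# Hedged sketch: list media elements on a page without explicit dimensions.
import requests
from bs4 import BeautifulSoup

def unsized_media(url: str) -> list[str]:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    offenders = []
    for tag in soup.find_all(["img", "video"]):
        if not (tag.get("width") and tag.get("height")):
            offenders.append(tag.get("src", "<inline>"))
    return offenders

print(unsized_media("https://example.com"))
```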
... Large-scale characterization and analysis of web pages has been the subject of previous work [28], [29], which showed early on the high complexity of modern web page contents and the underlying hosting/server infrastructure. Content-based web page categorization has been done in the past mainly for retrieval and information management purposes. ...
Article
Full-text available
The properties of a web page have a strong impact on its overall loading process, including the download of its contents and their progressive rendering at the browser. As a consequence, web page content has a strong impact on the experience of web users. In this paper, we present WebCLUST, a clustering-based classification approach for web pages, which groups pages into quality-meaningful content classes impacting the Quality of Experience (QoE) of the users. Groups are defined based on standard Multipurpose Internet Mail Extensions (MIME) content breakdown and external subdomain connections, obtained through in-browser, application level measurements. Using a large corpus of multi-device, heterogeneous web content and QoE-relevant measurements for the top-500 most popular websites in the Internet, we show how WebCLUST can automatically identify relevant web content classes showing significantly different performance in terms of Web QoE relevant metrics, such as Speed Index. We additionally evaluate the impact of content caching and device type on the identification performance of WebCLUST, showing how content classes might look significantly different, depending on the access device type (desktop vs mobile), as well as when considering browser caching. Our findings suggest that Web QoE assessment should explicitly consider page content and subdomain embedding within the analysis, especially when it comes to recent work on Web QoE inference through machine learning models. To the best of our knowledge, this is the first study showing the impact of web content on Web QoE metrics, opening the door to new Web QoE assessment strategies.
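The clustering idea behind WebCLUST can be illustrated with a small sketch: represent each page by its MIME-type byte breakdown and cluster the pages. The CSV file, feature columns, and choice of k are assumptions, not the paper's configuration.

```python
# Minimal sketch: cluster pages by their MIME-type byte breakdown.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("mime_breakdown.csv")   # assumed: one row per page
features = ["frac_image", "frac_js", "frac_css", "frac_html", "num_subdomains"]
X = StandardScaler().fit_transform(df[features])

df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster")[features].mean())   # per-cluster content profile
```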
... Browsers collect and report fine-grained page performance and usage info using the Telemetry API [6,43], without identifying the reasons for the measured performance. Systems such as WebPageTest [48] and others [61] provide similar information from a small set of dedicated hosts in multiple locations. The Wprof [177] approach helps explain the resource dependencies that affect page load times, but it requires a custom browser and has been used primarily in a lab environment. ...
Thesis
In today's rapidly growing smartphone society, the time users are spending on their smartphones is continuing to grow and mobile applications are becoming the primary medium for providing services and content to users. With such fast paced growth in smartphone usage, cellular carriers and internet service providers continuously upgrade their infrastructure to the latest technologies and expand their capacities to improve the performance and reliability of their network and to satisfy exploding user demand for mobile data. On the other side of the spectrum, content providers and e-commerce companies adopt the latest protocols and techniques to provide smooth and feature-rich user experiences on their applications. To ensure a good quality of experience, monitoring how applications perform on users' devices is necessary. Often, network and content providers lack such visibility into the end-user application performance. In this dissertation, we demonstrate that having visibility into the end-user perceived performance, through system design for efficient and coordinated active and passive measurements of end-user application and network performance, is crucial for detecting, diagnosing, and addressing performance problems on mobile devices. My dissertation consists of three projects to support this statement. First, to provide such continuous monitoring on smartphones with constrained resources that operate in such a highly dynamic mobile environment, we devise efficient, adaptive, and coordinated systems, as a platform, for active and passive measurements of end-user performance. Second, using this platform and other passive data collection techniques, we conduct an in-depth user trial of mobile multipath to understand how Multipath TCP (MPTCP) performs in practice. Our measurement study reveals several limitations of MPTCP. Based on the insights gained from our measurement study, we propose two different schemes to address the identified limitations of MPTCP. Last, we show how to provide visibility into the end-user application performance for internet providers and in particular home WiFi routers by passively monitoring users' traffic and utilizing per-app models mapping various network quality of service (QoS) metrics to the application performance.
... Websites can be analysed by a technological as well as a content-based approach. The systemic approach of the technology-based, measurability-focused examination model (Butkiewicz et al., 2011) is illustrated in Figure 2 ("Examination model of the web ecosystem"). The focal point of the model's examination is the central website, the main part of a system, the "web ecosystem". This approach serves as the basis of the conceptual model of the examination in this paper. ...
Article
Purpose The purpose of this paper is to examine the importance of websites and social media platforms to find out how they contribute to the improvement of business performance. A new automated data collection method is developed to determine the technology maturity level of websites. These website quality indicators are linked to and compared against a small and medium enterprise (SME) competitiveness data set to find competency pillars having significant impacts on online presence, and to identify the most important factors for online digital transformation. In this way, periodic analysis of websites can signal early warnings if an SME's competitiveness data is worth refreshing. Continuous maturity monitoring of competitors' websites provides useful benchmark information for an enterprise as well. Design/methodology/approach A conceptual model was developed for the examination of the online presence and its effect on the competitiveness of small- and medium-sized businesses. An innovative, automatically generated WebIX indicator was developed through technical and content analysis of the websites of 958 SMEs included in the Global Competitiveness Project (GCP) network data set. A series of ANOVA analyses was used for both data sources to determine the relationships between Web quality and competitiveness levels and to define the online presence maturity categories. Findings Both the existence and the quality of the websites proved to have a positive impact on the SME's competitiveness. Different online presence maturity categories contribute to different competitiveness pillars; therefore, key factors of online digital transformation were identified. According to the findings, company websites are more related to marketing functions than to information technology from the point of view of competitiveness. Originality/value Competency relationships were identified between online activity and competitiveness. The foundations of automated competitiveness measures were developed. The traditional survey-based subjective data collection was combined with an objective data collection methodology in a reproducible way.
... The measurement study in [14] discusses mobile traffic composition and investigates the performance and energy efficiency of TCP transfers. Butkiewicz et al. [5] study parameters that affect web page load times across websites, whereas WProf [32] performs in-browser performance profiling. In [25], the usage of bandwidth and energy in mobile web browsing is studied in detail using traffic collection and analysis tools, whereas [3] and [31] focus on analyzing the energy consumption of mobile devices' communication, particularly mobile web browsing. ...
Preprint
Full-text available
Internet of Things (IoT) devices and applications are generating and communicating vast quantities of data, and the rate of data collection is increasing rapidly. These high communication volumes are challenging for energy-constrained, data-capped, wireless mobile devices and networked sensors. Compression is commonly used to reduce web traffic, to save energy, and to make network transfers faster. If not used judiciously, however, compression can hurt performance. This work proposes and evaluates mechanisms that employ selective compression at the network's edge, based on data characteristics and network conditions. This approach (i) improves the performance of network transfers in IoT environments, while (ii) providing significant data savings. We demonstrate that our library speeds up web transfers by an average of 2.18x and 2.03x under fixed and dynamically changing network conditions respectively. Furthermore, it also provides consistent data savings, compacting data down to 19% of the original data size.
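The core idea of selective compression can be sketched in a few lines: compress a payload only when the estimated savings outweigh the cost on the current link. The thresholds and the bandwidth estimate below are assumptions for illustration, not the paper's actual policy or library.

```python
# Illustrative sketch of a selective-compression decision at the edge.
import zlib

def maybe_compress(payload: bytes, bandwidth_kbps: float,
                   min_ratio: float = 0.8, slow_link_kbps: float = 1000.0) -> tuple[bytes, bool]:
    compressed = zlib.compress(payload, level=6)
    ratio = len(compressed) / max(len(payload), 1)
    # Compress when the data shrinks meaningfully, or when the link is slow
    # enough that any byte saved is worth the CPU cost.
    if ratio < min_ratio or (bandwidth_kbps < slow_link_kbps and ratio < 1.0):
        return compressed, True
    return payload, False

data = b'{"sensor": "temp", "values": [21.3, 21.4, 21.4, 21.5]}' * 100
body, was_compressed = maybe_compress(data, bandwidth_kbps=500)
print(was_compressed, len(data), "->", len(body))
```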
... With the increased presence of images and videos, Web pages are increasing in size; however, the majority of Web pages are still relatively small (less than 1000 KB) [16]. As a result, we assume that the probability p of requesting a small page is 0.9, making the probability of requesting a large page 1 − p. ...
Article
Understanding the resource consumption of the mobile web is an important topic that has garnered much attention in recent years. However, existing works mostly focus on the networking or computational aspects of the mobile web and largely ignore memory, which is an important aspect given the mobile web's reliance on resource-heavy JavaScript. In this paper, we propose a framework, called JS Capsule, for characterizing the memory of JavaScript functions and, using this framework, we investigate the key browser mechanics that contribute to the memory overhead. Leveraging our framework on a testbed of Android mobile phones, we conduct measurements of the Alexa top 1K websites. While most existing frameworks focus on V8 --- the JavaScript engine used in most popular browsers --- in the context of memory, our measurements show that the memory implications of JavaScript extend far beyond V8 due to the cascading effects that certain JavaScript calls have on the browser's rendering mechanics. We quantify and highlight the direct impact that website DOMs have on JavaScript memory overhead and present, to our knowledge, the first root-cause analysis to dissect and characterize their impact on JavaScript memory overheads.
Chapter
Third-party dependencies expose websites to shared risks and cascading failures. These dependencies impact African websites as well, e.g., the Afrihost outage in 2022 [15]. While the prevalence of third-party dependencies has been studied for globally popular websites, Africa is largely underrepresented in those studies. Hence, this work analyzes the prevalence of third-party infrastructure dependencies in Africa-centric websites from 4 African vantage points. We consider websites that fall into one of four categories: Africa-visited (popular in Africa), Africa-hosted (sites hosted in Africa), Africa-dominant (sites targeted towards users in Africa), and Africa-operated (websites operated in Africa). Our key findings are: 1) 93% of the Africa-visited websites critically depend on a third-party DNS, CDN, or CA. In perspective, US-visited websites are up to 25% less critically dependent. 2) 97% of Africa-dominant, 96% of Africa-hosted, and 95% of Africa-operated websites are critically dependent on a third-party DNS, CDN, or CA provider. 3) The use of third-party services is concentrated: only 3 providers can affect 60% of the Africa-centric websites. Our findings have key implications for the present usage and recommendations for the future evolution of the Internet in Africa.
Article
We describe the results of a large-scale study of third-party dependencies around the world based on regional top-500 popular websites accessed from vantage points in 50 countries, together covering all inhabited continents. This broad perspective shows that dependencies on a third-party DNS, CDN or CA provider vary widely around the world, ranging from 19% to as much as 76% of websites, across all countries. The critical dependencies of websites -- where the site depends on a single third-party provider -- are equally spread ranging from 5% to 60% (CDN in Costa Rica and DNS in China, respectively). Interestingly, despite this high variability, our results suggest a highly concentrated market of third-party providers: three providers across all countries serve an average of 92% and Google, by itself, serves an average of 70% of the surveyed websites. Even more concerning, these differences persist a year later with increasing dependencies, particularly for DNS and CDNs. We briefly explore various factors that may help explain the differences and similarities in degrees of third-party dependency across countries, including economic conditions, Internet development, economic trading partners, categories, home countries, and traffic skewness of the country's top-500 sites.
Article
The majority of Web content is delivered by only a few companies that provide Content Delivery Infrastructures (CDIs) such as Content Delivery Networks (CDNs) and cloud hosts. Due to increasing concerns about trends of centralization, empirical studies on the extent and implications of the resulting Internet consolidation are necessary. Thus, we present an empirical view of consolidation of the Web by leveraging datasets from two different measurement platforms. We first analyze Web consolidation around CDIs at the level of landing webpages, before narrowing down the analysis to the level of embedded page resources. The datasets cover 1(a) longitudinal measurements of DNS records for 166.5 M Web domains over five years, 1(b) measurements of DNS records for the Alexa Top 1 M over a month and (2) measurements of page loads and renders for 4.3 M webpages, which include data on 392.3 M requested resources. We then define CDI penetration as the ratio of CDI-hosted objects to all measured objects, which we use to quantify consolidation around CDIs. We observe that CDI penetration has close to doubled since 2015, reaching a lower bound of 15% for all .com, .net, and .org Web domains as of January 2020. Overall, we find a set of six CDIs to deliver the majority of content across all datasets, with these six CDIs being responsible for more than 80% of all 221.9 M CDI-delivered resources (56.6% of all resources in total). We find high dependencies of Web content on a small group of CDIs, in particular for fonts, ads, and trackers, as well as JavaScript resources such as jQuery. We further observe CDIs to play important roles in rolling out IPv6 and TLS 1.3 support. Overall, these observations indicate a potential oligopoly, which brings both benefits and risks to the future of the Web.
Article
With the evolution of the online advertisement and tracking ecosystem, content-blockers have become the reference tool for improving the security, privacy and browsing experience when surfing the Internet. It is also commonly believed that using content-blockers to stop unsolicited content decreases the time needed for loading websites. In this work, we perform a large-scale study on the actual improvements of using content-blockers in terms of performance and quality of experience. For measuring it, we analyze the page size and loading times of the 100K most popular websites, as well as the most relevant QoE metrics, such as the Speed Index, Time to Interactive or the Cumulative Layout Shift, for the subset of the top 10K of them. Our experiments show that using content-blockers results in small improvements in terms of performance. However, contrary to popular belief, this has a negligible impact in terms of loading time and quality of experience. Moreover, in the case of small and lightweight websites, the overhead introduced by content-blockers can even result in decreased performance. Finally, we evaluate the improvement in terms of QoE based on the Mean Opinion Score (MOS) and find that two of the three studied content-blockers present an overall decrease between 3% and 5% instead of the expected improvement.
Article
Website features and characteristics have shown the ability to detect various web threats – phishing, drive-by downloads, and command and control (C2). Prior research has thoroughly explored the practice of choosing features ahead of time (a priori) and building detection models. However, there is an opportunity to investigate new techniques and features for detection. We perform a comprehensive evaluation of discovering features for malicious website detection versus selecting features a priori. We gather 46,580 features derived from a response to a web request and, through a series of feature selection techniques, discover features for detection and compare their performance to those used in prior research. We build several detection models using unsupervised and supervised learning algorithms over various sampling and feature transformation scenarios. Our approach is evaluated on a diverse dataset composed of common threats on the internet. Overall, we find that discovered features can achieve more efficient and comparable detection performance to a priori features with 66% fewer features and can achieve a Matthews Correlation Coefficient (MCC) of up to 0.9008.
Chapter
During the last 30 years, the web has evolved from simple informational HTML pages to complex applications supporting business, television, newspapers, entertainment, and more. While there are many articles on website popularity, there has been little work on understanding the complexity of individual web pages. In this article, we present a measurement-driven study of the complexity of web pages today. We measured 426 866 web pages over about 12 weeks. Our study addresses two problems. The first is to describe the complexity of a web page with metrics based on the content pages include and the kind of service they offer. The second focus of our study is to build probabilistic models of the observed distributions. Such models can be used in HTTP request generators modelling the work of modern web systems. Separate models are proposed for each category of web pages and for all pages together.
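The modelling step described above can be illustrated with a short sketch: fit a probability distribution to an observed complexity metric (here, objects per page) and sample from it in a request generator. The lognormal choice and the sample data below are assumptions for illustration, not the chapter's fitted models.

```python
# Sketch: fit a lognormal model to objects-per-page counts and sample from it.
import numpy as np
from scipy import stats

objects_per_page = np.array([12, 35, 48, 60, 75, 90, 110, 140, 200, 310])  # illustrative data

# Fit a lognormal distribution (location fixed at 0) to the observations.
shape, loc, scale = stats.lognorm.fit(objects_per_page, floc=0)
model = stats.lognorm(shape, loc=loc, scale=scale)

# An HTTP request generator can then draw per-page object counts from the model.
rng = np.random.default_rng(0)
synthetic_counts = model.rvs(size=5, random_state=rng)
print(np.round(synthetic_counts).astype(int))
```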
Article
Modern mobile operating systems support displaying Web pages in native mobile applications. When an app user navigates to a specific location containing a Web page, the Web page will be loaded and rendered from within the app. Such kind of Web browsing, as we call embedded Web browsing , is different from traditional Web browsing, which involves typing a URL on a browser and loading the Web page. However, little has been known about the prevalence or performance of such embedded Web pages. In this paper, we conduct, to the best of our knowledge, the first measurement study on embedded Web browsing on Android. Our study on 22,521 popular Android apps shows that 57.9 and 73.8 percent of apps embed Web pages on two popular app markets, that is, Google Play and Wandoujia, respectively. We design and implement EWProfiler, a tool that can automatically search for embedded Web pages inside apps, trigger page loads, and retrieve performance metrics to analyze the embedded Web browsing performance at scale. Based on 445 embedded Web pages obtained by EWProfiler in 99 popular apps from the two app markets, we investigate the characteristics and performance of embedded Web pages, and find that embedded Web pages significantly impede the app user experience. We investigate the effectiveness of three techniques, i.e., separating the browser kernel to a different process, loading pages from the local storage, and pre-rendering, to optimize the performance of embedded Web browsing. We believe that our findings could draw the attention of Web developers, browser vendors, app developers, and mobile OS vendors together toward a better performance of embedded Web browsing.
Chapter
In a large-scale network environment, the traffic composition at the gateway is complex and changeable, and the number of web pages is huge and unevenly distributed, which poses a challenge to malicious web page detection. Among these challenges, the greatest comes from the dynamic correlation of web resources. Due to Internet advertising, CDN acceleration mechanisms, cloud services, and high-compression coding, responses are usually multi-source links and are transmitted in fragmented form. The web page resources received after an access request, such as text, logos, pictures, audio, and video, may come from different servers, and the information is out of order, cluttered, and fragmented. Efficiently obtaining web page resources such as page source code, audio, video, files, and pictures in a high-speed network environment and assembling them into relatively complete web pages for detection is of great significance for malicious web page homology detection, analysis of application network usage, and traffic perception management. This paper focuses on the problem of resource association under large-scale traffic and proposes a multi-domain feature association method that associates multimedia resources as completely as possible to reconstruct relatively complete web pages. The experimental results show that the multi-domain association success rate can reach more than 87%, an improvement of over 60% compared with existing association methods based on a single feature.
Article
Full-text available
Recent studies show that a significant part of Internet traffic is delivered through Web-based applications. To cope with the increasing demand for Web content, large scale content hosting and delivery infrastructures, such as data-centers and content distribution networks, are continuously being deployed. Being able to identify and classify such hosting infrastructures is helpful not only to content producers, content providers, and ISPs, but also to the research community at large. For example, to quantify the degree of hosting infrastructure deployment in the Internet or the replication of Web content. In this paper, we introduce Web Content Cartography, i.e., the identification and classification of content hosting and delivery infrastructures. We propose a lightweight and fully automated approach to discover hosting infrastructures based only on DNS measurements and BGP routing table snapshots. Our experimental results show that our approach is feasible even with a limited number of well-distributed vantage points. We find that some popular content is served exclusively from specific regions and ASes. Furthermore, our classification enables us to derive content-centric AS rankings that complement existing AS rankings and shed light on recent observations about shifts in inter-domain traffic and the AS topology.
Article
Full-text available
Cloud-based Web applications powered by new technologies such as Asynchronous JavaScript and XML (Ajax) place a significant burden on network operators and enterprises to effectively manage traffic. Despite their increasing popularity, we have little understanding of the characteristics of these cloud applications. Part of the problem is that there exists no systematic way to generate their workloads, observe their network behavior, and keep track of the changing trends of these applications. This paper addresses these issues by developing a tool, called AJAXTRACKER, that automatically mimics a human interaction with a cloud application and collects the associated network traces. These traces can be post-processed to understand various characteristics of these applications, and those characteristics can be fed into a classifier to identify new traffic for a particular application in a passive trace. The tool can also be used by service providers to automatically generate relevant workloads to monitor and test specific applications.
Conference Paper
Full-text available
The rise of the software-as-a-service paradigm has led to the development of a new breed of sophisticated, interactive applications often called Web 2.0. While Web applications have become larger and more complex, Web application developers today have little visibility into the end-to-end behavior of their systems. This article presents AjaxScope, a dynamic instrumentation platform that enables cross-user monitoring and just-in-time control of Web application behavior on end-user desktops. AjaxScope is a proxy that performs on-the-fly parsing and instrumentation of JavaScript code as it is sent to users' browsers. AjaxScope provides facilities for distributed and adaptive instrumentation in order to reduce the client-side overhead, while giving fine-grained visibility into the code-level behavior of Web applications. We present a variety of policies demonstrating the power of AjaxScope, ranging from simple error reporting and performance profiling to more complex memory leak detection and optimization analyses. We also apply our prototype to analyze the behavior of over 90 Web 2.0 applications and sites that use significant amounts of JavaScript.
Conference Paper
Full-text available
Online Social Networks (OSNs) have already attracted more than half a billion users. However, our understanding of which OSN features attract and keep the attention of these users is poor. Studies thus far have relied on surveys or interviews of OSN users or focused on static properties, e.g., the friendship graph, gathered via sampled crawls. In this paper, we study how users actually interact with OSNs by extracting clickstreams from passively monitored network traffic. Our characterization of user interactions within the OSN for four different OSNs (Facebook, LinkedIn, Hi5, and StudiVZ) focuses on feature popularity, session characteristics, and the dynamics within OSN sessions. We find, for example, that users commonly spend more than half an hour interacting with the OSNs, while the byte contributions per OSN session are relatively small.
Conference Paper
Full-text available
In recent years, navigability has become the pivot of website designs. Existing works fall into two categories. The first is to evaluate and assess a website's navigability against a set of criteria or check list. The second is to analyse usage data of the website, such as the server log files. This paper investigates a metric approach to website navigability measurement. In comparison with existing assessment and analysis methods, navigability metrics have the advantages of objectiveness and the possibility of using automated tools to evaluate large-scale websites. This paper proposes a number of metrics for website navigability measurement based on measuring website structural complexity. We will validate these metrics against Weyuker's software complexity axioms, and report the results of empirical studies of the metrics.
Conference Paper
Full-text available
The success of the World-Wide Web is largely due to the simplicity, hence ease of implementation, of the Hypertext Transfer Protocol (HTTP). HTTP, however, makes inefficient use of network and server resources, and adds unnecessary latencies, by creating a new TCP connection for each request. Modifications to HTTP have been proposed that would transport multiple requests over each TCP connection. These modifications have led to debate over their actual impact on users, on servers, and on the network. This paper reports the results of log-driven simulations of several variants of the proposed modifications, which demonstrate the value of persistent connections.
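The effect studied above, reusing one TCP connection for multiple requests versus opening a new connection per request, can be observed with a quick client-side sketch. The URL and request count below are arbitrary placeholders, and absolute timings depend heavily on network conditions.

```python
# Quick sketch: new connection per request vs. a persistent (keep-alive) connection.
import time
import requests

URL, N = "https://example.com/", 10

start = time.time()
for _ in range(N):
    requests.get(URL)                       # new TCP (and TLS) setup each time
per_request = time.time() - start

start = time.time()
with requests.Session() as s:               # one connection, reused for all requests
    for _ in range(N):
        s.get(URL)
persistent = time.time() - start

print(f"separate connections: {per_request:.2f}s, persistent: {persistent:.2f}s")
```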
Conference Paper
Full-text available
Web users often face a long waiting time for downloading Web pages. Although various technologies and techniques have been implemented to alleviate the situation and to comfort the impatient users, little research has been done to assess what constitutes an acceptable and tolerable waiting time for Web users. This research reviews the literature on computer response time and users' waiting time for the download of Web pages, and assesses Web users' tolerable waiting time in information retrieval. It addresses the following questions through an experimental study: What is the effect of feedback on users' tolerable waiting time? How long are users willing to wait for a Web page to be downloaded before abandoning it? The results from this study suggest that the presence of feedback prolongs Web users' tolerable waiting time and that the tolerable waiting time for information retrieval is approximately 2 seconds.
Conference Paper
Full-text available
The rapid advent of "Web 2.0" applications has unleashed new HTTP traffic patterns which differ from the conventional HTTP request-response model. In particular, asynchronous pre-fetching of data in order to provide a smooth web browsing experience and richer HTTP payloads (e.g., Javascript libraries) of Web 2.0 applications induce larger, heavier, and more bursty traffic on the underlying networks. We present a traffic study of several Web 2.0 applications including Google Maps, modern web-email, and social networking web sites, and compare their traffic characteristics with the ambient HTTP traffic. We highlight the key differences between Web 2.0 traffic and all HTTP traffic through statistical analysis. As such our work elucidates the changing face of one of the most popular applications on the Internet: the World Wide Web.
Conference Paper
Full-text available
Today, large-scale web services run on complex systems, spanning multiple data centers and content distribution networks, with performance depending on diverse factors in end systems, networks, and infrastructure servers. Web service providers have many options for improving service performance, varying greatly in feasibility, cost and benefit, but have few tools to predict the impact of these options. A key challenge is to precisely capture web object dependencies, as these are essential for predicting performance in an accurate and scalable manner. In this paper, we introduce WebProphet, a system that automates performance prediction for web services. WebProphet employs a novel technique based on timing perturbation to extract web object dependencies, and then uses these dependencies to predict the performance impact of changes to the handling of the objects. We have built, deployed, and evaluated the accuracy and efficiency of WebProphet. Applying WebProphet to the Search and Maps services of Google and Yahoo, we find WebProphet predicts the median and 95th percentiles of the page load time distribution with an error rate smaller than 16% in most cases. Using Yahoo Maps as an example, we find that WebProphet reduces the problem of performance optimization to a small number of web objects whose optimization would reduce the page load time by nearly 40%.
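The core idea of timing perturbation can be sketched with made-up numbers: if artificially delaying object A by delta shifts object B's start time by roughly delta, B likely depends on A. The object names, timings, and tolerance below are hypothetical, not WebProphet's actual inference procedure.

```python
# Sketch of dependency extraction by timing perturbation (hypothetical data):
# if delaying object A by `DELAY` shifts object B's start time by roughly
# `DELAY`, we infer that B depends on A.

baseline_start = {"a.js": 0.10, "b.css": 0.12, "img.png": 0.55}
# Start times observed after artificially delaying a.js by 0.5 s:
perturbed_start = {"a.js": 0.10, "b.css": 0.13, "img.png": 1.04}
DELAY, TOLERANCE = 0.5, 0.1

def infer_dependents(delayed_obj):
    dependents = []
    for obj, t0 in baseline_start.items():
        if obj == delayed_obj:
            continue
        shift = perturbed_start[obj] - t0
        if abs(shift - DELAY) <= TOLERANCE:
            dependents.append(obj)
    return dependents

print("objects that appear to depend on a.js:", infer_dependents("a.js"))
# -> ['img.png']: its start shifted by ~0.5 s, while b.css was unaffected.
```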
Article
Full-text available
Today’s Web provides many different functionalities, including communication, entertainment, social networking, and information retrieval. In this article, we analyze traces of HTTP activity from a large enterprise and from a large university to identify and characterize Web-based service usage. Our work provides an initial methodology for the analysis of Web-based services. While it is nontrivial to identify the classes, instances, and providers for each transaction, our results show that most of the traffic comes from a small subset of providers, which can be classified manually. Furthermore, we assess both qualitatively and quantitatively how the Web has evolved over the past decade, and discuss the implications of these changes.
Article
Full-text available
Web page loading speed continues to vex users, even as broadband adoption increases. Several studies have addressed delays in the context of Web sites as well as interactive corporate systems, and have recommended a wide range of "rules of thumb." Some studies conclude that response times should be no greater than 2 seconds while other studies caution on delays of 12 seconds or more. One of the strongest conclusions was that complex tasks seemed to allow longer response times. This study examined delay times of 0, 2, 4, 6, 8, 10, and 12 seconds using 196 undergraduate students in an experiment. Randomly assigned a constant delay time, subjects were asked to complete 9 search tasks, exploring a familiar and an unfamiliar site. Plots of the dependent variables performance, attitudes, and behavioral intentions, along those delays, suggested the use of non-linear regression, and the explained variance was in the neighborhood of 2%, 5%, and 7%, respectively. Focusing only on the familiar site, explained variance in attitudes and behavioral intentions grew to about 16%. A sensitivity analysis implies that decreases in performance and behavioral intentions begin to flatten when the delays extend to 4 seconds or longer, and attitudes flatten when the delays extend to 8 seconds or longer. Future research should include other factors such as expectations, variability, and feedback, and other outcomes such as actual purchasing behavior, to more fully understand the effects of delays in today's Web environment.
Conference Paper
Full-text available
New source-level models for aggregated HTTP traffic and a design for their integration with the TCP transport layer are built and validated using two large-scale collections of TCP/IP packet header traces. An implementation of the models and the design in the ns network simulator can be used to generate web traffic in network simulations
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
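The constrained form described above is equivalent to penalizing the residual sum of squares with an L1 term. A small scikit-learn sketch, using synthetic data and an arbitrary penalty weight, shows the characteristic behavior: irrelevant coefficients are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty weight (alpha) plays the role of the constraint bound;
# a larger alpha forces more coefficients to be exactly zero.
model = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(model.coef_, 3))  # irrelevant predictors shrink to 0
```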
Article
The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph is a fascinating object of study: it has several hundred million nodes today, over a billion links, and appears to grow exponentially with time. There are many reasons -- mathematical, sociological, and commercial -- for studying the evolution of this graph. In this paper we begin by describing two algorithms that operate on the Web graph, addressing problems from Web search and automatic community discovery. We then report a number of measurements and properties of this graph that manifested themselves as we ran these algorithms on the Web. Finally, we observe that traditional random graph models do not explain these observations, and we propose a new family of random graph models. These models point to a rich new sub-field of the study of random graphs, and raise questions about the analysis of graph algorithms on the Web.
Conference Paper
The study of the Web as a graph is not only fascinating in its own right, but also yields valuable insight into Web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the Web graph using two AltaVista crawls each with over 200 million pages and 1.5 billion links. Our study indicates that the macroscopic structure of the Web is considerably more intricate than suggested by earlier experiments on a smaller scale.
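The macroscopic structure reported in such studies is commonly summarized via strongly connected components of the page graph. The sketch below, assuming networkx is available and using a made-up miniature graph, shows the kind of computation involved; it is not the authors' analysis pipeline.

```python
import networkx as nx

# Miniature stand-in for a crawled page graph (made-up page names).
g = nx.DiGraph([
    ("a", "b"), ("b", "c"), ("c", "a"),   # a small strongly connected core
    ("c", "d"),                           # page reachable from the core ("OUT")
    ("e", "a"),                           # page that links into the core ("IN")
])

# The largest strongly connected component approximates the core of the
# bow-tie picture; other pages either reach it or are reached from it.
core = max(nx.strongly_connected_components(g), key=len)
print("core:", sorted(core))
print("pages reachable from the core:", sorted(set(nx.descendants(g, "a")) - core))
```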
Conference Paper
With over half a billion users, Online Social Networks (OSNs) are the major new applications on the Internet. Little information is available on the network impact of OSNs, although there is every expectation that the volume and diversity of traffic due to OSNs is set to explode. In this paper, we examine the specific role played by a key component of OSNs: the extremely popular and widespread set of third-party applications on some of the most popular OSNs. With over 81,000 third-party applications on Facebook alone, their impact is hard to predict and even harder to study. We have developed and launched a number of Facebook applications, all of which are among the most popular applications on Facebook in active use by several million users monthly. Through our applications, we are able to gather, analyze, correlate, and report their workload characteristics and performance from the perspective of the application servers. Coupled with PlanetLab experiments, where active probes are sent through Facebook to access a set of diverse applications, we are able to study how Facebook forwarding/processing of requests/responses impacts the overall delay performance perceived by end-users. These insights help provide guidelines for OSNs and application developers. We have also made the data studied here publicly available to the research community. This is the first and only known study of popular third-party applications on OSNs at this depth.
Conference Paper
As Web sites move from relatively static displays of simple pages to rich media applications with heavy client-side interaction, the nature of the resulting Web traffic changes as well. Understanding this change is necessary in order to improve response time, evaluate caching effectiveness, and design intermediary systems, such as firewalls, security analyzers, and reporting/management systems. Unfortunately, we have little understanding of the underlying nature of today's Web traffic. In this paper, we analyze five years (2006-2010) of real Web traffic from a globally-distributed proxy system, which captures the browsing behavior of over 70,000 daily users from 187 countries. Using this data set, we examine major changes in Web traffic characteristics that occurred during this period. We also present a new Web page analysis algorithm that is better suited for modern Web page interactions by grouping requests into streams and exploiting the structure of the pages. Using this algorithm, we analyze various aspects of page-level changes, and characterize modern Web pages. Finally, we investigate the redundancy of this traffic, using both traditional object-level caching as well as content-based approaches.
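A simplified version of grouping requests into page-level streams can be sketched by walking Referer chains back to a root HTML request. The log format and grouping rule below are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical request log: (url, referer, content_type). Each embedded
# request is attributed to the HTML page its chain of Referer headers
# traces back to.
log = [
    ("http://site.com/", None, "text/html"),
    ("http://site.com/style.css", "http://site.com/", "text/css"),
    ("http://cdn.com/lib.js", "http://site.com/", "application/javascript"),
    ("http://ads.com/pixel.gif", "http://cdn.com/lib.js", "image/gif"),
]

def group_into_pages(entries):
    referer_of = {url: ref for url, ref, _ in entries}
    roots = {url for url, _, ctype in entries if ctype == "text/html"}
    pages = {root: [] for root in roots}
    for url, ref, _ in entries:
        node = url
        # Walk the referer chain until we hit a root HTML request.
        while node is not None and node not in roots:
            node = referer_of.get(node)
        if node is not None and node != url:
            pages[node].append(url)
    return pages

for page, objects in group_into_pages(log).items():
    print(page, "->", objects)
```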
Conference Paper
Growing usage and diversity of applications on the Internet makes Quality of Service (QoS) increasingly critical. To date, the majority of research on QoS is systems oriented, focusing on traffic analysis, scheduling, and routing. Relatively minor attention has been paid to user-level QoS issues. It is not yet known how objective system quality relates to users' subjective perceptions of quality. This paper presents the results of quantitative experiments that establish a mapping between objective and perceived QoS in the context of Internet commerce. We also conducted focus groups to determine how contextual factors influence users' perceptions of QoS. We show that, while users' perceptions of World Wide Web QoS are influenced by a number of contextual factors, it is possible to correlate objective measures of QoS with subjective judgements made by users, and therefore influence system design. We argue that only by integrating users' requirements for QoS into system design can the utility of the future Internet be maximized.
Conference Paper
Web pages are not purely text, nor are they solely HTML. This paper surveys HTML web pages; not only on textual content, but with an emphasis on higher order visual features and supplementary technology. Using a crawler with an in-house developed rendering engine, data on a pseudo-random sample of web pages is collected. First, several basic attributes are collected to verify the collection process and confirm certain assumptions on web page text. Next, we take a look at the distribution of different types of page content (text, images, plug-in objects, and forms) in terms of rendered visual area. Those different types of content are broken down into a detailed view of the ways in which the content is used. This includes a look at the prevalence and usage of scripts and styles. We conclude that more complex page elements play a significant and underestimated role in the visually attractive, media rich, and highly interactive web pages that are currently being added to the World Wide Web.
Conference Paper
The web browser is a CPU-intensive program. Especially on mobile devices, webpages load too slowly, expending significant time in processing a document's appearance. Due to power constraints, most hardware-driven speedups will come in the form of parallel architectures. This is also true of mobile devices such as phones and e-books. In this paper, we introduce new algorithms for CSS selector matching, layout solving, and font rendering, which represent key components for a fast layout engine. Evaluation on popular sites shows speedups as high as 80x. We also formulate the layout problem with attribute grammars, enabling us to not only parallelize our algorithm but prove that it computes in O(log) time and without reflow.
Conference Paper
For the last few years we have studied the diffusion of private information about users as they visit various Web sites, triggering data gathering and aggregation by third parties. This paper reports on our longitudinal study consisting of multiple snapshots of our examination of such diffusion over four years. We examine the various technical ways by which third-party aggregators acquire data and the depth of user-related information acquired. We study techniques for protecting against this privacy diffusion as well as limitations of such techniques. We introduce the concept of secondary privacy damage. Our results show increasing aggregation of user-related data by a steadily decreasing number of entities. A handful of companies are able to track users' movement across almost all of the popular Web sites. Virtually all the protection techniques have significant limitations, highlighting the seriousness of the problem and the need for alternate solutions.
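The basic bookkeeping behind such a study can be sketched as counting, per first-party site, the distinct third-party domains contacted, and, per aggregator, the sites on which it appears. The browsing records are made up and the registered-domain heuristic is deliberately naive (no public-suffix list).

```python
from urllib.parse import urlparse
from collections import defaultdict

# Made-up browsing data: (first-party site, URL requested while on that site).
requests = [
    ("news.example.com", "http://tracker-one.com/collect?id=1"),
    ("news.example.com", "http://cdn.example.com/logo.png"),
    ("shop.example.org", "http://tracker-one.com/collect?id=2"),
    ("shop.example.org", "http://analytics-two.net/beacon"),
]

def base_domain(host):
    # Naive registered-domain heuristic: last two labels only.
    return ".".join(host.split(".")[-2:])

third_parties = defaultdict(set)   # site -> third-party domains seen there
tracker_reach = defaultdict(set)   # aggregator -> sites where it was contacted
for site, url in requests:
    host = urlparse(url).hostname
    if base_domain(host) != base_domain(site):
        third_parties[site].add(base_domain(host))
        tracker_reach[base_domain(host)].add(site)

print(dict(third_parties))
print({t: len(sites) for t, sites in tracker_reach.items()})
```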
Conference Paper
The systems and networking community treasures "simple" system designs, but our evaluation of system simplicity often relies more on intuition and qualitative discussion than rigorous quantitative metrics. In this paper, we develop a prototype metric that seeks to quantify the notion of algorithmic complexity in networked system design. We evaluate several networked system designs through the lens of our proposed complexity metric and demonstrate that our metric quantitatively assesses solutions in a manner compatible with informally articulated design intuition and anecdotal evidence such as real-world adoption.
Conference Paper
Operator interviews and anecdotal evidence suggest that an operator's ability to manage a network decreases as the network becomes more complex. However, there is currently no way to systematically quantify how complex a network's design is nor how complexity may impact network management activities. In this paper, we develop a suite of complexity models that describe the routing design and configuration of a network in a succinct fashion, abstracting away details of the underlying configuration languages. Our models, and the complexity metrics arising from them, capture the difficulty of configuring control and data plane behaviors on routers. They also measure the inherent complexity of the reachability constraints that a network implements via its routing design. Our models simplify network design and management by facilitating comparison between alternative designs for a network. We tested our models on seven networks, including four university networks and three enterprise networks. We validated the results through interviews with the operators of five of the networks, and we show that the metrics are predictive of the issues operators face when reconfiguring their networks.
Article
The Ninja project seeks to enable the broad innovation of robust, scalable, distributed Internet services, and to permit the emerging class of extremely heterogeneous devices to seamlessly access these services. Our architecture consists of four basic elements: bases, which are powerful workstation cluster environments with a software platform that simplifies scalable service construction; units, which are the devices by which users access the services; active proxies, which are transformational elements that are used for unit- or service-specific adaptation; and paths, which are an abstraction through which units, services, and active proxies are composed.
Conference Paper
How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them. One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
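A hedged sketch of the two sensitivity levels described above: an MD5 checksum answers only whether a page changed at all, while comparing the words of two snapshots gives a graded measure of how much changed. The snapshots and the Jaccard word-set comparison are simplifications for illustration.

```python
import hashlib

def checksum(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def word_jaccard(a, b):
    """Similarity of the word sets of two snapshots (1.0 = identical words)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

week1 = "breaking news about the web measurement study"
week2 = "breaking news about a new web performance study"

changed = checksum(week1) != checksum(week2)   # boolean change, checksum style
similarity = word_jaccard(week1, week2)        # graded change, word-feature style
print(f"changed: {changed}, word-set similarity: {similarity:.2f}")
```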
Article
Manageability directly influences a system's reliability, availability, security, and safety, thus being a key ingredient of system dependability. Alas, we do not have today a good way to measure manageability or reason quantitatively about it, and this is a major hindrance in improving systems' ease of management. In this paper, we propose a manageability metric that aims to be objective, intuitive, and broadly applicable. We hope such a metric will help software developers make the right design tradeoffs and will also foster fair competition on the basis of manageability. We also offer preliminary thoughts on incorporating this metric into a benchmark.
Conference Paper
We analyze the way in which Web browsers use TCP connections based on extensive traffic traces obtained from a busy Web server (the official Web server of the 1996 Atlanta Olympic games). At the time of operation, this Web server was one of the busiest on the Internet. We first describe the techniques used to gather these traces and reconstruct the behavior of the TCP on the server. We then present a detailed analysis of the TCP's loss recovery and congestion control behavior from the recorded transfers. Our two most important results are: (1) short Web transfers lead to poor loss recovery performance for TCPs, and (2) concurrent connections are overly aggressive users of the network. We then discuss techniques designed to solve these problems. To improve the data-driven loss recovery performance of short transfers, we present a new enhancement to the TCP's loss recovery. To improve the congestion control and loss recovery performance of parallel TCP connections, we present a new integrated approach to congestion control and loss recovery that works across the set of concurrent connections. Simulations and trace analysis show that our enhanced loss recovery scheme could have eliminated 25% of all timeout events, and that our integrated approach provides greater fairness and improved startup performance for concurrent connections
Article
Content distribution networks (CDNs) are a mechanism to deliver content to end users on behalf of origin Web sites. Content distribution offloads work from origin servers by serving some or all of the contents of Web pages. We found an order of magnitude increase in the number and percentage of popular origin sites using CDNs between November 1999 and December 2000. In this paper we discuss how CDNs are commonly used on the Web and define a methodology to study how well they perform. A performance study was conducted over a period of months on a set of CDN companies employing the techniques of DNS redirection and URL rewriting to balance load among their servers. Some CDNs generally provide better results than others when we examine results from a set of clients. The performance of one CDN company clearly improved between the two testing periods in our study due to a dramatic increase in the number of distinct servers employed in its network. More generally, the results indicate that use of a DNS lookup in the critical path of a resource retrieval does not generally result in better server choices being made relative to client response time in either average or worst case situations.
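The cost of a DNS lookup in the critical path of a retrieval, which the study above examines, can be observed directly. The sketch below times a name resolution and then a TCP connection to the already-resolved address; the hostname is a placeholder and the comparison is purely illustrative.

```python
import socket
import time

HOST = "www.example.com"  # placeholder hostname

# Cost of the DNS lookup that sits in the critical path of the first retrieval.
start = time.perf_counter()
addr = socket.gethostbyname(HOST)
dns_ms = (time.perf_counter() - start) * 1000

# A connection made with the already-resolved address skips that lookup.
start = time.perf_counter()
with socket.create_connection((addr, 80), timeout=10):
    pass
connect_ms = (time.perf_counter() - start) * 1000

print(f"DNS lookup: {dns_ms:.1f} ms, TCP connect to {addr}: {connect_ms:.1f} ms")
```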
Article
This paper describes httperf, a tool for measuring web server performance. It provides a flexible facility for generating various HTTP workloads and for measuring server performance. The focus of httperf is not on implementing one particular benchmark but on providing a robust, high-performance tool that facilitates the construction of both micro- and macro-level benchmarks. The three distinguishing characteristics of httperf are its robustness, which includes the ability to generate and sustain server overload, support for the HTTP/1.1 protocol, and its extensibility to new workload generators and performance measurements. In addition to reporting on the design and implementation of httperf, this paper also discusses some of the experiences and insights gained while realizing this tool.
D. Galletta, R. Henry, S. McCoy, and P. Polak. Web site delays: How tolerant are users? Journal of the Association for Information Systems.
B. Ager, W. Mühlbauer, G. Smaragdakis, and S. Uhlig. Web content cartography. In Proc. IMC, 2011.
B. Krishnamurthy and C. E. Wills. Privacy diffusion on the web: A longitudinal perspective. In Proc. WWW, 2009.
B. Krishnamurthy, C. E. Wills, and Y. Zhang. On the use and performance of content distribution networks. In Proc. IMW, 2001.
M. Lee, R. R. Kompella, and S. Singh. Active measurement system for high-fidelity characterization of modern cloud applications. In Proc. USENIX Conference on Web Applications, 2010.