Conference Paper

Performance scalability of a multi-core web server

Authors: Bryan Veal, Annie Foong

Abstract

Today's large multi-core Internet servers support thousands of concurrent connections or flows. The computational ability of future server platforms will depend on increasing numbers of cores. The key to ensuring that performance scales with cores is to ensure that systems software and hardware are designed to fully exploit the parallelism that is inherent in independent network flows. However, performance scaling on commercial web servers has proven elusive. This paper identifies the major bottlenecks to scalability for a reference server workload on a commercial server platform. We determined that on a web server running a modified SPECweb2005 Support workload, throughput scales only 4.8x on eight cores. Our results show that the operating system, TCP/IP stack, and application exploited flow-level parallelism well, with few exceptions, and that load imbalance and cache sharing affected performance little. Having eliminated these potential bottlenecks, we determined that performance scaling was limited by the capacity of the address bus, which became saturated with all eight cores in use. If this key obstacle is addressed, commercial web servers and systems software are well positioned to scale to a large number of cores.
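
In standard scaling terms, the abstract's throughput figure corresponds to the following speedup and parallel efficiency (a worked restatement of the reported numbers, not new data):

    \[
      S(n) = \frac{X(n)}{X(1)}, \qquad
      S(8) = 4.8, \qquad
      E(8) = \frac{S(8)}{8} = 0.6,
    \]

where X(n) is the throughput achieved on n cores; ideal exploitation of flow-level parallelism would give E(8) = 1.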


... Despite this inherent parallelism, previous studies show that data servers do not scale ideally on multicore machines [83]. In light of this observation, we wish to study the performance of data servers in a multicore context further, and to analyse how they scale with the number of cores. ...
... Veal and Foong [83] also studied the scalability of the Apache Web server on an 8-core machine very similar to ours. Using the dynamic SPECweb2005 workload, they show that Apache does not scale ideally with the number of cores. ...
... Indeed, throughput is tripled (x3.04) at 4 cores, but only multiplied by 5.26 at 8 cores. This matches the performance observed by Veal et al. [83], who ran a similar test with the same Web server, the same workload, and a processor of the same family. The main difference from that study lies in the network hardware used. ...
Article
Full-text available
This thesis studies the performance of data servers on multicores. More precisely, we focus on scalability with the number of cores. First, we study the internals of an event-driven multicore runtime. We demonstrate that false sharing and inter-core communications badly hurt performance and prevent applications from scaling. We then propose several optimisations to fix these issues. In a second part, we compare the multicore performance of three Web servers, each representative of a programming model. We observe that the differences between the servers' performance vary as the number of cores increases. We are able to pinpoint the cause of the scalability limitation observed. We present one approach and some perspectives to overcome this limit.
... Many benchmarking studies suggest that the individual cores of one multiprocessor perform differently [24,32,21]. Veal et al. [45] and Hashemian et al. [21] observe a single-core CPU bottleneck and suggest methods to distribute the bottleneck to achieve better performance. However, most modelling work treats each core of a multicore processor equally by using M/M/k queues [7,5], where k represents the number of cores. ...
... Multicore & Scalability: To exploit the benefits of a multicore architecture, applications need to be parallelised [31,33,45]. Parallelism is mainly used by operating systems at the process level to provide seamless multitasking [14]. ...
... As a result, modern web servers can efficiently utilise multiple CPU cores. However, in practice the scalability of web servers is not linear, as it is limited by other factors such as cache sharing between cores, communication overhead, call-stack depth, synchronization between threads, or sequential work-flows [29,45,8,24]. ...
Conference Paper
Full-text available
As the computing industry enters the Cloud era, multicore architectures and virtualisation technologies are replacing traditional IT infrastructures. However, the complex relationship between applications and system resources in multicore virtualised environments is not well understood. Workloads such as web services and on-line financial applications have the requirement of high performance but benchmark analysis suggests that these applications do not optimally benefit from a higher number of cores. In this paper, we try to understand the scalability behaviour of network/CPU intensive applications running on multicore architectures. We begin by benchmarking the Petstore web application, noting the systematic imbalance that arises with respect to per-core workload. Having identified the reason for this phenomenon, we propose a queueing model which, when appropriately parametrised, reflects the trend in our benchmark results for up to 8 cores. Key to our approach is providing a fine-grained model which incorporates the idiosyncrasies of the operating system and the multiple CPU cores. Analysis of the model suggests a straightforward way to mitigate the observed bottleneck, which can be practically realised by the deployment of multiple virtual NICs within our VM. Next we make blind predictions to forecast performance with multiple virtual NICs. The validation results show that the model is able to predict the expected performance with relative errors ranging between 8 and 26 per cent.
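
For context, the homogeneous M/M/k baseline that the excerpts above contrast with treats the k cores as identical servers fed by one queue; with arrival rate \(\lambda\) and per-core service rate \(\mu\), the probability that a request must queue is given by the textbook Erlang-C formula:

    \[
      P_{\mathrm{wait}} =
      \frac{\dfrac{a^{k}}{k!}\,\dfrac{1}{1-\rho}}
           {\sum_{i=0}^{k-1}\dfrac{a^{i}}{i!} + \dfrac{a^{k}}{k!}\,\dfrac{1}{1-\rho}},
      \qquad a = \frac{\lambda}{\mu}, \quad \rho = \frac{a}{k} < 1,
    \]

with mean response time \(W = 1/\mu + P_{\mathrm{wait}}/(k\mu - \lambda)\). The paper's fine-grained model departs from this baseline precisely by letting per-core behaviour differ.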
... Veal and Foong conducted a measurement campaign in 2007 to evaluate the scalability of the Apache Web server on an 8-core Intel architecture [95]. The workload used for this evaluation is the SPECweb2005 Support workload, which we describe in Section 6.2. ...
... However, these links carry, in addition to network packets, application data (for example, communication between an Apache process and a PHP process located on two different processors) as well as messages generated by the cache-coherence protocol. Note that in the case of Veal and Foong [95], saturation of the address bus is the main cause of the scalability limitation. In this section we detail each of the metrics that we will later use to confirm or refute each of these hypotheses, explaining how they are obtained and why they are relevant. ...
... Pariag et al. compared the performance of different Web server architectures under a realistic workload without, however, studying their scalability. Veal and Foong [95] carried out the only complete study of the scalability of a Web server on the new multicore architectures. Their study was performed using the 2005 version of the SpecWeb load injector [86]. ...
Article
Full-text available
This thesis focuses on the performance of data servers on multicore architectures. We study two different aspects of this problem. First, we benchmark an event-driven multicore runtime. In particular, we show that the work-stealing mechanism used for load balancing may sometimes degrade the performance of data servers. Consequently, we introduce a new runtime and new heuristics to improve the work-stealing behaviour. Second, we study the performance of the Apache Web server, which uses both threads and processes, on a NUMA architecture. We show that this Web server does not scale perfectly under a realistic workload. Through a detailed cost analysis using both hardware and software profiling, we determine the reasons for this lack of scalability and present several proposals for improving the Web server's performance on NUMA architectures.
... In contrast, we consider a real platform with an in-memory data set and we take dynamic content generation into account. Our work has strong connections with the study by Veal and Foong [Veal 2007], which considers the scalability of the Linux-Apache-PHP stack on an 8-core Intel architecture with a centralized memory controller. They conclude that their address bus is the primary obstacle to performance scaling and masks software bottlenecks. ...
... This choice is in line with current trends in data center design [Ousterhout 2010]. This modification was also employed by Veal and Foong in their study [Veal 2007]. ...
... SPECweb2005 uses a pseudo backend server (BeSim) to simulate the database tier. Like [Veal 2007], we deploy one BeSim instance on each client machine. For each experiment, we systematically check that the BeSim tier is not a bottleneck. ...
Article
Full-text available
Multicore machines with Non-Uniform Memory Accesses (NUMA) are becoming commonplace. It is thus becoming crucial to understand how the resources they provide can be efficiently exploited. Most current research works are tackling the problem at the Operating System (OS) level. They focus on improving existing OS primitives, or on proposing novel OS designs with the aim of reducing OS bottlenecks and improving the scalability of applications running on such machines. In this paper, we adopt a complementary perspective: we examine how to optimize the scalability of a parallel application running on top of an unmodified, currently available operating system. The chosen application is the popular Apache-PHP stack. We highlight three performance issues at different levels of the system due to: (i) excessive remote memory accesses, (ii) inefficient load dispatching among cores, and (iii) contention on kernel data structures. We propose and implement solutions at the application level for each issue. Our optimized Apache-PHP software stack achieves a 33% higher throughput than the base configuration on a 16-core setup. We conclude the paper with lessons learned on optimizing server applications for multicore computers.
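
Of the three issues listed above, the first two (remote memory accesses and load dispatching) are commonly attacked at the application level by pinning each worker process to one core, so that, under Linux's default first-touch policy, a worker's allocations stay on its local NUMA node. A minimal sketch of such pinning, assuming a pre-fork worker model (illustrative only; the paper's actual Apache-PHP changes are not reproduced here):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Pin the calling process to a single core; keeping a worker on one
     * core also keeps its first-touch allocations on that core's node. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set); /* 0 = self */
    }

    int main(void)
    {
        int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);
        for (int core = 0; core < ncores; core++) {
            pid_t pid = fork();
            if (pid == 0) {               /* child: one worker per core */
                if (pin_to_core(core) != 0) {
                    perror("sched_setaffinity");
                    exit(1);
                }
                /* ... worker loop: accept() and serve requests ... */
                exit(0);
            }
        }
        return 0;
    }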
... How a multicore chip enhances the performance of a Web system is currently a major research topic. Veal and Foong studied the performance scalability of a multicore Web server on a physical computer [15]. The throughput of the system scaled up 4.8x from a single core to eight cores. ...
... It indicates that the system capacity gain is limited, even for #VC=12. The system S(1, r) may have similar bus issues to those mentioned in [15]. The MaxTP is 378 sessions/minute with TG6 at #CS=400. ...
... PE < 1 for small or large #CS. The cause of the bottleneck mentioned in [15] may apply to this case. All PG and PE curves are bell-shaped with long small-valued asymptotic tails. ...
Article
Enhancing the performance of computing systems has been an important topic since the invention of computers. The leading-edge technologies of multicore and virtualization dramatically influence the development of current IT systems. We study the performance attributes of response time (RT), throughput, efficiency, and scalability of a virtualized Web system running on a multicore server. We build virtual machines (VMs) for a Web application, and use distributed stress tests to measure RTs and throughputs under varied combinations of virtual cores (VCs) and VM instances. Their gains, efficiencies and scalabilities are also computed and compared. Our experimental and analytic results indicate: 1) A system can perform and scale much better by adopting multiple single-VC VMs than a single multiple-VC VM. 2) The system capacity gain is proportional to the number of VM instances run, but not proportional to the number of VCs allocated in a VM. 3) A system with more VMs or VCs has higher physical CPU utilization, but lower vCPU utilization. 4) The maximum throughput gain is less than the VM or VC gain. 5) Per-core computing efficiency does not correlate with the quality of VCs or VMs employed. The outcomes can provide valuable guidelines for selecting instance types offered by public Cloud providers and for load-balancing planning for Web systems.
... The vast majority of this work is in the context of computation-centric workloads and benchmarks such as TPC-C. Closer to our interest in packet processing are efforts similar to those of Veal et al. [24] that look for the bottlenecks in server-like workloads involving a fair load of TCP termination. Their analysis reveals that such workloads are bottlenecked on the FSB address bus. ...
... A similar conclusion has been arrived at for several, more traditional, workloads. (We refer the reader to [24] for additional references to the literature on such evaluations.) As our results indicate, the bottleneck to packet processing lies elsewhere. ...
... Note that even 51Gbps is fairly low relative to the nominal rating of 100Gbps we used in estimating upper bounds. It turns out this limit is due to saturation of the address bus; recall that the address bus utilization is 74% for the stream test; prior work [24] and discussions with architects reveal that an address bus is regarded as saturated at approximately 75% utilization. This is in keeping with the general perception that, in a shared-bus architecture, the vast majority of applications are bottlenecked on the FSB. ...
Article
Full-text available
Compared to traditional high-end network equipment built on specialized hardware, software routers running on commodity servers offer significant advantages: lower costs due to large-volume manufacturing, a widespread supply/support chain, and, most importantly, programmability and extensibility. The challenge is scaling software-router performance to carrier-level speeds. As a first step, in this paper, we study the packet-processing capability of modern commodity servers; we identify the packet-processing bottlenecks, examine to what extent these can be alleviated through upcoming technology advances, and discuss what further changes are needed to take software routers beyond the small enterprise.
... The drop in bandwidth implies a drop in N_R, the number of requests. According to formulas (5) and (6) in Section III, this in turn will reduce the difference in effectiveness between SAIs and irqbalance scheduling, as shown in Figure 12. When the number of client nodes is greater than 32, 8 I/O nodes are not enough to serve the increased parallel I/O requests. ...
... The impact of the data movement incurred by parallelization strategies for packet processing on a general-purpose monolithic OS has been analyzed by Salehi et al. [18] and Willmann et al. [29]. As for multi-core systems, Foong et al. [1][2][5] and Narayanaswamy et al. [8][9] have presented in-depth analyses of the processor data-locality problem, but their analysis has not considered parallel I/O situations. To enable users to tune application performance so as to keep data locality and reduce data movement on multi-core systems, VTune [16] and autopin [34] have been developed by Intel and T. Klug et al. ...
Conference Paper
Recent technological advances are putting increased pressure on CPU scheduling. On one hand, processors have more cores. On the other hand, I/O systems have become more complex. Intensive research has been conducted on multi/many-core scheduling; however, most of the studies follow the conventional approach and focus on the utilization and load balance of the cores. In this study, we focus on increasing data locality by bringing source information from I/O into the core interrupt-scheduling process. The premise is to group the interrupts associated with the same I/O request together on the same core, and to prove that data locality is more important than core utilization for many applications. Based on this idea, a source-aware affinity interrupt-scheduling scheme is introduced and a prototype system, SAIs, is implemented. Experimental results show that SAIs is feasible and promising: bandwidth shows a 23.57% improvement in a 3-Gigabit NIC environment, and in the optimal case without the NIC bottleneck, the improvement increases to 53.23%.
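
SAIs itself works inside the kernel's interrupt-scheduling path, but the knob it builds on is also exposed to user space: each IRQ's CPU mask can be written through /proc/irq/<n>/smp_affinity. A minimal sketch of that standard Linux mechanism (the IRQ number 64 and mask 0x4 below are illustrative placeholders, not values from the paper):

    #include <stdio.h>

    /* Steer a device interrupt to specific cores by writing a hex CPU
     * bitmask to /proc/irq/<n>/smp_affinity. Requires root. */
    static int set_irq_affinity(int irq, unsigned cpu_mask)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%x\n", cpu_mask);
        return fclose(f);
    }

    int main(void)
    {
        if (set_irq_affinity(64, 0x4) != 0)   /* pin IRQ 64 to core 2 */
            perror("set_irq_affinity");
        return 0;
    }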
... To analyze the origin of the high memory latencies, we conducted profiling experiments similar to the ones found in Veal et al. [14]. We did not notice any irregularity in cache or TLB behaviour, so we analyzed the front-side bus usage, as shown in Figure 5. ...
... Veal et al. [14] have also studied the scalability of the Apache Web server on very similar 8-core machines. Using the SpecWeb2005 workload, they show that Apache does not scale with the number of cores. ...
Article
Full-text available
We study the impact of concurrent programming models on the multicore performance of Web servers. More precisely, we consider three implementations of servers, each being representative of a particular model: Knot (thread-based), Userver (event-driven), and Watpipe (stage-based). Our experiments show that memory access costs increase with the number of cores. We also show that at 8 cores we reach a point where the memory is fully saturated, leading to all Web server implementations having the same performance. Using fine-grained profiling, we are able to pinpoint the cause of this issue as a hardware bottleneck: the saturation of the address bus. Further memory benchmarking on a 24-core machine shows that a memory-related scalability issue is still present beyond this point.
... For this study, we narrowed our focus to problems arising from high TCP/IP connection arrival rates, as observed in our motivating example, as well as contention for the memory hierarchy, which previous studies have shown to be important ([9], [10]). Accordingly, the probe sensor continuously alternates between two phases of processing. ...
... For example, Boyd-Wickizer et al. [18] proposed modifications to the standard Linux kernel to improve the performance of Web servers running on multi-core servers. Veal and Foong [9] characterized the performance of a SPECweb2005 workload on a centralized-memory Intel Clovertown system. Using hardware monitoring, the authors establish that contention among cores for the system's address bus severely impacts how the Web server scales with increasing core count. ...
Article
Full-text available
Public and private cloud computing environments employ virtualization methods to consolidate application workloads onto shared servers. Modern servers typically have one or more sockets each with one or more computing cores, a multi-level caching hierarchy, a memory subsystem, and an interconnect to the memory of other sockets. While resource management methods may manage application performance by controlling the sharing of processing time and input-output rates, there is generally no management of contention for virtualization kernel resources or for the memory hierarchy and subsystems. Yet such contention can have a significant impact on application performance. Hardware platform specific counters have been proposed for detecting such contention. We show that such counters alone are not always sufficient for detecting contention. We propose a software probe based approach for detecting contention for shared platform resources and demonstrate its effectiveness. We show that the probe imposes low overhead and is remarkably effective at detecting performance degradations due to inter-VM interference over a wide variety of workload scenarios and on two different server architectures. The probe successfully detected virtualization-induced software bottleneck and memory contention on both server architectures. Our approach supports the management of workload placement on shared servers and pools of shared servers.
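
A common way to realize such a software probe is a timed dependent-load (pointer-chase) loop over a buffer larger than the last-level cache: per-load latency rises when co-resident VMs contend for the memory subsystem. A minimal sketch along those lines (an assumption about the probe's style, not the authors' code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (64UL * 1024 * 1024 / sizeof(void *)) /* 64 MiB > typical LLC */
    #define STEPS 10000000L

    int main(void)
    {
        /* Build a random cyclic pointer chain: every load depends on the
         * previous one, defeating prefetchers and exposing raw latency. */
        void **buf = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++)
            idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)
            buf[idx[i]] = &buf[idx[(i + 1) % N]];

        struct timespec t0, t1;
        void **p = &buf[idx[0]];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long s = 0; s < STEPS; s++)
            p = (void **)*p;                          /* dependent load */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per load (%p)\n", ns / STEPS, (void *)p);
        return 0;
    }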
... More recently, [5] identified many bottlenecks in the Linux kernel when scaling to many cores. A similar scalability analysis is described for many important application domains by [7,21]. Finally, there has been recent work on reducing overheads due to the addition of a virtualization layer for I/O-intensive workloads. ...
Article
Full-text available
In this work we examine the relative overhead of the operating system and of virtual machines on I/O-intensive applications. We use real applications to calculate the cycles and energy per I/O using simple models based on actual measurements. Our results indicate that the OS can cost up to 60% in terms of energy spent per I/O operation. Further, a single VM instance costs 150% in terms of performance and 180% in terms of energy consumption per I/O operation. For some applications, server consolidation using two VM instances can reduce the cost compared to one VM instance by up to 25%. Finally, we note that current system stacks do not scale with the number of cores: on average, compared with one core, the system component of execution time increases by 90x on a 1000-core processor.
... Cui et al. use micro-benchmarks to stress separate parts of the Linux kernel, and identify scalability problems in memory-mapped file creation and deletion, file-descriptor operations, and System V semaphore operations [1]. Veal and Foong evaluate the performance of Apache on an 8-core machine with a modified SPECweb2005 Support workload [8]. Their experimental results show bottlenecks in the scheduling and directory lookup of Linux 2.6.20.3. ...
Article
The trend of exponentially growing core counts places new requirements on operating systems. Contemporary monolithic OSes protect shared kernel data by locking in multicore environments. However, lock contention in OS functions may lead to overall performance degradation. This paper adopts a microkernel architecture for scalability reasons, since it offers flexibility in the management of computing resources and an explicit data layout that avoids locking. We present a scalable memory management service (MMS) based on a microkernel OS. The physical memory is distributed across servers to remove lock contention over page pools. We then discuss the new problems this raises, including load balance and "distributed memory fragmentation". MMS is divided into one master and multiple slaves. The master is a coordinator that adjusts loads and routes requests with a global memory view, while the slaves are responsible for the management of distributed page zones and virtual memory areas. The experimental results show that MMS achieves better scalability than Linux on a 32-core machine.
... After a formal specification of Redis Cluster was constructed in the TLA+ specification language and model checking was performed, it was found that in certain situations the system can behave incorrectly: the system's hash-slot property (1) is violated. During checking of the initial model, system errors were identified and then reproduced in the real system using the redis-cli text interface. ...
Article
Full-text available
This article analyses the correctness of the Redis Cluster cache system. Formal methods were used in the analysis: a formal specification of the system was written in the TLA+ specification language. During model checking of the specification, it was evaluated whether the system guarantees the property that each hash slot is the responsibility of exactly one master node and its subordinate nodes. Model checking revealed situations in which this property is not guaranteed. The discovered errors were reproduced in the real system, and possible solutions to these errors were proposed.
... This introduces important challenges for operating systems designed for these environments in terms of scalability as the number of cores increases [16,109,22,65,23]. Some studies reveal that the poor scalability of some operating system services can dominate application performance [43,105]. An important source of poor scalability in such services is the use of concurrent kernel data structures, which are accessed by multiple cores at the same time. ...
Article
The constant increase in single-core frequency reached a plateau in recent years, since the heat produced inside the chip can no longer be cooled down by existing technologies. An alternative way to harvest more computational power per die is to fabricate more cores in a single chip. Therefore, manycore chips with more than a thousand cores are expected by the end of the decade. These environments provide a high level of parallel processing power while their energy consumption is considerably lower than that of their multi-chip counterparts. Although shared-memory programming is the classical paradigm for programming these environments, there are numerous claims that, taking into account the full life cycle of software, the message-passing programming model has numerous advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between the processing entities directly in the hardware. Therefore, manycore architectures with hardware support for message passing are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, the performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis (i) we overview existing state-of-the-art techniques to circumvent the mentioned bottlenecks; and (ii) we study a high-performance broadcast communication primitive and the map data structure on modern manycore architectures with message-passing support in hardware, in two different chapters respectively. In one chapter, we study how to make use of the hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform, which offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better performance in terms of latency and throughput compared to state-of-the-art solutions. This performance improvement highlights the benefits of exploiting hardware features of the target platform: our broadcast algorithm takes direct advantage of RMA, unlike the other broadcast solutions, which are based on a higher-level send/receive interface. In the other chapter, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieve high throughput in a message-passing system. This chapter presents and compares different strongly-consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain.
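
The pipelined k-ary tree behind OC-Bcast follows simple rank arithmetic: with the root at rank 0, rank r receives from rank (r-1)/k and forwards to ranks k*r+1 through k*r+k. A small sketch of just this mapping (the SCC's RMA transfers and pipelining are not modelled):

    #include <stdio.h>

    /* In a k-ary broadcast tree rooted at rank 0, rank r receives from
     * rank (r - 1) / k and forwards to ranks k*r + 1 .. k*r + k. */
    static void print_children(int r, int k, int nranks)
    {
        printf("rank %d forwards to:", r);
        for (int c = k * r + 1; c <= k * r + k && c < nranks; c++)
            printf(" %d", c);
        printf("\n");
    }

    int main(void)
    {
        int k = 3, nranks = 13;     /* e.g. a ternary tree over 13 cores */
        for (int r = 0; r < nranks; r++)
            print_children(r, k, nranks);
        return 0;
    }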
... Single-installation server machines follow suit, with a typical x86-based server containing at least eight CPUs [1]. On the other hand, most server software, such as Apache, PostgreSQL, and sendmail, follows a thread-per-request architecture; the industry is continually trying to improve hardware to support these applications [20]. ...
Conference Paper
We present Quarantine, a system that enables data-driven selective isolation within concurrent server applications. Instead of constructing arbitrary isolation boundaries between components, Quarantine collects data to learn where such boundaries should be placed, and then instantiates said barriers to improve reliability. We present the case for data-driven selective isolation, and discuss the challenges in realizing such a system.
... In (1), Ethernet frames arrive at the physical layer in an order defined by the sending system and with a traffic distribution as defined by the routing policy of the network. The packets are then distributed by the NIC to internal queues (2) based either on rules, e.g. in the case of Intel FlowDirector, or on a hash algorithm like the Toeplitz hash [12]. The challenging part is how to fine-tune the hash, or how to set the rules, such that the packet distribution per VNF (3) is optimal with respect to the NIC queue affinity (4), the architecture (5), the core affinity (6) and the VNF affinity (7). ...
Conference Paper
Network Functions Virtualization (NFV) aims to move network functions away from expensive hardware appliances to off-the-shelf server hardware. NFV promises higher flexibility and cost reduction for the network operator. In order to achieve high throughput performance with this commodity hardware, fast packet-processing frameworks like NetMap or the Data Plane Development Kit (DPDK) can be used. It is known that packet-processing performance is very sensitive to the copying of packets. In this paper we take steps towards quantifying the efficiency of NFV with respect to packet-copying overhead at the hardware level. As modern servers are often built up of multiple CPUs with segregated memory, we evaluate the performance penalties resulting from this segregation in conjunction with DPDK. Additionally, we evaluate the effects of cache misses on packet processing in detail. Subsequently, a metric that quantifies the efficiency of a running VNF is introduced, and an optimization scheme that describes the use of the metric is outlined. Our results show how both cache misses and memory segregation reduce network efficiency.
... We used a modified version of the ab package to send sequences of requests to random tiles among the 262,144 that compose the highest image zoom level. Appropriate system settings, as prescribed in Veal and Foong (2007), were applied server-side and client-side to ensure that both ends could sustain the highest possible concurrency levels with minimum latencies and maximum throughput. We conducted preliminary tests through Apache's httpd, Lighty Labs' lighttpd, a combination of Nginx and Lighty Labs' spawn-fcgi, and finally LiteSpeed Technologies' OpenLiteSpeed. ...
Article
Full-text available
Visualizing and navigating through large astronomy images from a remote location with current astronomy display tools can be a frustrating experience in terms of speed and ergonomics, especially on mobile devices. In this paper, we present a high-performance, versatile and robust client-server system for remote visualization and analysis of extremely large scientific images. Applications of this work include survey image quality control, interactive data query and exploration, citizen science, as well as public outreach. The proposed software is entirely open source and is designed to be generic and applicable to a variety of data sets. It provides access to full-precision floating-point data at terabyte scales, with the ability to precisely adjust image settings in real time. The proposed clients are light-weight, platform-independent web applications built on standard HTML5 web technologies and compatible with both touch-based and mouse-based devices. We put the system to the test, assess its performance, and show that a single server can comfortably handle more than a hundred simultaneous users accessing full-precision 32-bit astronomy data.
... It was determined that all applications except one trigger a scalability bottleneck in the Linux kernel, and several modifications to the kernel were introduced to reduce this bottleneck. In [2], the scalability of a multi-core web server was examined, and it was observed that the capacity of the address bus in the eight-core system was the limiting factor in performance scaling. The performance of a SIP server on multi-core systems was studied in [3]. ...
... Veal and Foong [54] also analysed the scalability of a web server on a multiprocessor. Similar to the experiments in this section, they found scalability problems as the number of cores increased. ...
... Current state-of-the-art Web servers do not take the query into consideration while presenting the content to the user. A lot of work has been reported on improving the architecture of Web servers for various applications [3,11,20]. Many models are available to compare the architectures of the servers [10]. ...
Conference Paper
Retrieval and content management are assumed to be mutually exclusive. In this paper we suggest that they need not be so. In the usual information retrieval scenario, some information about queries leading to a website (due to ‘hits’ or ‘visits’) is available to the server administrator of the concerned website. This information can be used to better present the content on the website. Further, we suggest that some more information can be shared by the retrieval system with the content provider. This will enable the content provider (any website) to have a more dynamic presentation of the content that is in tune with the query trends, without violating the privacy of the querying user. The result will be a better synchronization between retrieval systems and content providers, with the purpose of improving the user’s web search experience. This will also give the content provider a say in this process, given that the content provider is the one who knows much more about the content than the retrieval system. It also means that the content presentation may change in response to a query. In the end, the user will be able to find the relevant content more easily and quickly. All this can be made subject to the condition that user’s consent is available.
... There is a lot of work on high-performance packet processing. Veal et al. [86] investigated the performance scalability of a multi-core Web server. They found that flow-level parallelism is well exploited and that performance therefore scales. ...
Article
Today's enterprise, data-center, and internet-service-provider networks deploy different types of network devices, including switches, routers, and middleboxes such as network address translation and firewalls. These devices are vertically integrated monolithic systems. Software-defined networking (SDN) and network function virtualization (NFV) are promising technologies for disaggregating vertically integrated systems into components by using "softwarization". Software-defined networking separates the control plane from the data plane of switches and routers, while NFV decouples high-layer service functions (SFs) or Network Functions (NFs) implemented in the data plane of a middlebox and enables the innovation of policy implementation by using SF chaining. Even though there have been several survey studies in this area, the area is continuing to grow rapidly. In this paper, we present a recent survey of this area. In particular, we survey research activities in the areas of re-architecting middleboxes, state management, high-performance platforms, service chaining, resource management, and troubleshooting. Efforts in these research areas will enable the development of future virtual-network-function platforms and innovation in service management while maintaining acceptable capital and operational expenditure.
... Hence, in order to eliminate an unfavorable default configuration of the network stack as a confounding variable, we modified the configuration on Linux, Rumprun and OSv. Since many best practices guides cover the subject of tuning network performance on Linux, we employed the recommendations from [30], resulting in the configuration denoted in Table 3. Based on this model, we modified the configuration parameters of both Rumprun and OSv to correspond to the Linux-based settings [28]. ...
Conference Paper
Full-text available
The increasing prevalence of the microservice paradigm creates a new demand for low-overhead virtualization techniques. Complementing containerization, unikernels are emerging as alternative approaches. With both techniques undergoing rapid improvements, the current landscape of lightweight virtualization approaches presents a confusing scenery, complicating the task of choosing a suited technology for an intended purpose. This work provides a comprehensive performance comparison covering containers, unikernels, whole-system virtualization, native hardware, and combinations thereof. Representing common workloads in microservice-based applications, we assess application performance using HTTP servers and a key-value store. With the microservice deployment paradigm in mind, we evaluate further characteristics such as startup time, image size, network latency, and memory footprint.
... As an example, TCP flows can stay in the SYN-received state for between 1 and 5 minutes, depending on the configuration. The number of SYN-state entries in regular TCP stacks at servers or other endpoints ranges between 1,024 (the default backlog size in the Linux kernel) and a few thousand for high-performance web servers [3]. In case such a server or endpoint is connected via at least 10Gbps Ethernet, an aggressive DoS attack can ...
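
The backlog numbers quoted above map onto two separate Linux limits: the accept-queue depth requested via listen() (silently capped by net.core.somaxconn) and the half-open SYN-state table bounded by net.ipv4.tcp_max_syn_backlog. A minimal sketch of a server requesting a deeper accept queue (the port and backlog values are illustrative):

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a;
        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_port = htons(8080);
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&a, sizeof(a));

        /* Request a 4096-entry accept queue. The kernel caps this at
         * net.core.somaxconn, so that sysctl must be raised as well;
         * half-open (SYN-state) entries are bounded separately by
         * net.ipv4.tcp_max_syn_backlog. */
        listen(fd, 4096);
        return 0;
    }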
Conference Paper
Full-text available
While the scale, frequency and impact of recent cyber- and DoS-attacks have all increased, traditional security management systems still keep human operators in the decision loop. To cope with the new breed of machine-driven attacks - particularly those designed to overload the humans in the loop - the next-generation anomaly detection and attack mitigation scheme, i.e. network security management, must improve greatly in speed and accuracy: it must become machine-driven, too. As infrastructure, we propose an FPGA-accelerated Network Function Virtualization that potentially enhances current multi-Tbps switching fabrics with SDN-based security capabilities of vastly higher performance and scalability. As key novelties, we contribute (i) sub-ms detection lag (ii) of the top 9 Akamai attacks with (iii) a real-time SDN feedback loop between a distributed programmable data plane and a centralized SDN controller, (iv) coupled via a global N:1 mirror. We validate the concept in an actual datacenter network with a new security application that can detect and mitigate real-world DDoS attacks, with lags from 430 us up to 3 ms - several orders of magnitude faster than before.
... Current state-of-the-art Web servers do not take the query into consideration while presenting the content to the user. A lot of work has been reported on improving the architecture of Web servers for various applications [1,6,4]. Many models are available to compare the architectures of the servers [3]. ...
Article
Retrieval and content management are assumed to be mutually exclusive. In this paper we suggest that they need not be so. In the usual information retrieval scenario, some information about queries leading to a website (due to 'hits' or 'visits') is available to the server administrator of the concerned website. This information can be used to better present the content on the website. Further, we suggest that some more information can be shared by the retrieval system with the content provider. This will enable the content provider (any website) to have a more dynamic presentation of the content that is in tune with the query trends, without violating the privacy of the querying user. The result will be a better synchronization between retrieval systems and content providers, with the purpose of improving the user's web search experience. This will also give the content provider a say in this process, given that the content provider is the one who knows much more about the content than the retrieval system. It also means that the content presentation may change in response to a query. In the end, the user will be able to find the relevant content more easily and quickly.
Article
Full-text available
The conference "Lithuanian MSc Research in Informatics and ICT" is a venue for presenting the research of Lithuanian MSc theses in informatics and ICT. The aim of the event is to develop the skills of MSc and other students, to familiarize them with the research of other students, and to encourage their interest in scientific activities. Students from Kaunas University of Technology, Vilnius University, and Vytautas Magnus University will give their presentations at the conference.
Article
The MapReduce programming model, in which the data nodes perform both data storing and computation, was introduced for big-data processing. Thus, we need to understand the different resource requirements of data-storing and computation tasks and schedule these efficiently over multi-core processors. In particular, providing high-performance data storing has become more critical because of the continuously increasing volume of data uploaded to distributed file systems and database servers. However, analyzing the performance characteristics of the processes that store upstream data is very intricate, because both network and disk input/output (I/O) are heavily involved in their operations. In this paper, we analyze the impact of core affinity on both network and disk I/O performance and propose a novel approach to dynamic core affinity for high-throughput file upload. We consider the dynamic changes in processor load and the intensiveness of the file upload at run-time, and accordingly decide the core affinity for service threads, with the objective of maximizing parallelism, data locality, and resource efficiency. We apply the dynamic core affinity to the Hadoop Distributed File System (HDFS). Measurement results show that our implementation can improve the file upload throughput of end applications by more than 30% compared with the default HDFS, and provides better scalability.
Article
Multi-core processors can improve the parallelism of application processes and thus enhance system throughput. Researchers have recently revealed that processor affinity is an important factor in determining network I/O performance, due to architectural characteristics of multi-core processors; thus, many researchers are trying to devise schemes to decide an optimal processor affinity. Existing schemes that decide the processor affinity dynamically are able to adapt transparently to system changes, such as modifications of applications and upgrades of hardware, but they have limited access to characteristics of application behaviour and run-time information that can be collected heuristically. Thus, they can provide only sub-optimal processor affinity. In this paper, we define meaningful system variables for determining the optimal processor affinity and present a tool to gather such information. We show that the implemented tool can overcome the limitations of existing schemes and can improve network bandwidth.
Conference Paper
There are numerous proprietary appliances in operators' networks. These appliances consume a lot of electricity and plenty of space to deploy, which leads to high operating expense (OPEX) for operators. Network Function Virtualisation (NFV) was introduced to solve this problem. NFV consolidates many network devices into network applications, which can run on industry commodity servers. These appliances differ from routers, because they have to handle protocol processing above the network layer and provide socket APIs to various applications, which need full protocol-stack support instead of packet forwarding only. Unfortunately, despite increasingly high bandwidths of up to 10 Gbps or even 40 Gbps on commodity multi-core servers, network protocol-processing bottlenecks have been identified: throughput does not scale with the number of cores, stack-processing latency is too long for some applications, etc. In this paper, the reasons for poor stack performance (especially performance scalability and stack-processing latency) in software are systematically analyzed. Based on these analysis results, we propose Stack Pool, a novel high-performance scalable network architecture for multi-core servers. Stack Pool is constituted by multiple isolated virtual lanes. Each virtual lane contains an independent protocol-stack instance, several pairs of hardware queues in NICs, as well as socket instances located in the stack instance. Each logical CPU core is responsible for processing packets in one virtual lane. The flow director in the NIC and the lane selector in Stack Pool direct packets of different flows to the virtual lanes based on packet headers. We have implemented a Stack Pool prototype to show that the approach is promising. Stack Pool outperforms the standard Linux protocol stack with approximately 7 times the throughput of UDP, or 3 times that of TCP, in a single virtual lane. Moreover, Stack Pool performance scales nearly linearly across multiple cores, e.g., 10.7 and 17.2 times on 6 cores for UDP transmit and receive respectively, and 6.5 times the TCP throughput on 6 logical cores. At the same time, packet latency on Stack Pool is only about a quarter of that on the native Linux stack.
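
Stack Pool itself is not publicly available, but its per-core virtual-lane idea has a rough mainline-Linux analogue: with SO_REUSEPORT, each pinned core runs its own listener on the same port and the kernel hashes incoming flows across the listeners. A hedged sketch of that analogue (not the paper's implementation):

    #define _GNU_SOURCE
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* One listening socket per pinned worker, all bound to the same
     * port: with SO_REUSEPORT the kernel distributes incoming flows
     * across the sockets by hash, so each core handles its own subset
     * of flows with no shared accept queue -- a crude user-space
     * cousin of Stack Pool's per-core virtual lanes. */
    int make_lane_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in a;
        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_port = htons(port);
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&a, sizeof(a));
        listen(fd, 1024);
        return fd;   /* each pinned worker then accept()s on its own fd */
    }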
Conference Paper
We present MegaPipe, a new API for efficient, scalable network I/O for message-oriented workloads. The design of MegaPipe centers around the abstraction of a channel - a per-core, bidirectional pipe between the kernel and user space, used to exchange both I/O requests and event notifications. On top of the channel abstraction, we introduce three key concepts of MegaPipe: partitioning, lightweight socket (lwsocket), and batching. We implement MegaPipe in Linux and adapt memcached and nginx. Our results show that, by embracing a clean-slate design approach, MegaPipe is able to exploit new opportunities for improved performance and ease of programmability. In microbenchmarks on an 8-core server with 64 B messages, MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.
Conference Paper
Improving the performance and scalability of Web servers enhances user experiences and reduces the costs of providing Web-based services. The advent of multi-core technology motivates new studies to understand how efficiently Web servers utilize such hardware. This paper presents a detailed performance study of a Web server application deployed on a modern 2-socket, 4-cores-per-socket server. Our study shows that default, "out-of-the-box" Web server configurations can cause the system to scale poorly with increasing core counts. We study two different types of workloads, namely a workload that imposes intense TCP/IP-related OS activity and the SPECweb2009 Support workload, which incurs more application-level processing. We observe that the scaling behaviour is markedly different for these two types of workloads, mainly due to the difference in the performance characteristics of static and dynamic requests. The results of our experiments reveal that with workload-specific Web server configuration strategies a modern multi-core server can be utilized up to 80% while still serving requests without significant queuing delays; utilizations beyond 90% are also possible, while still serving requests with acceptable response times.
Conference Paper
The paper describes the implementation of a server that provides an adaptive, scalable and efficient way to auto-configure and control consumer electronics devices. The implementation is based on the TR-069 communication protocol for remote control and monitoring of devices. The solution enables the analysis and diagnosis of devices, and the provision of quality of service for broadcasters. The server provides various services for web-based applications or mobile applications that allow visualization of processes executed on the server.
Conference Paper
The prevalence of multi-core processors has raised the question of whether applications can use the increasing number of cores efficiently in order to provide predictable quality of service (QoS). In this paper, we study the horizontal scalability of n-tier application performance within a multicore processor (MCP). Through extensive measurements of the RUBBoS benchmark, we found one major source of performance variation within an MCP: the mapping of cores to virtual CPUs can significantly lower the on-chip cache hit ratio, causing performance drops of up to 22% without obvious changes in resource utilization. After we eliminated these variations by fixing the MCP core mapping, we measured the impact of three mainstream hypervisors (the dominant Commercial Hypervisor, Xen, and KVM) on intra-MCP horizontal scalability. On a quad-core dual-processor machine (8 cores in total), we found some interesting similarities and dissimilarities among the hypervisors. An example of the similarities is a non-monotonic scalability trend (throughput increasing up to 4 cores and then decreasing beyond 4 cores) when running a browse-only CPU-intensive workload. This problem can be traced to the management of the last-level cache of the CPU packages. An example of the dissimilarities among hypervisors is their handling of write operations in mixed read/write, I/O-intensive workloads. Specifically, the Commercial Hypervisor is able to provide more than twice the throughput of KVM. Our measurements show that both the MCP cache architecture and the choice of hypervisor have an impact on the efficiency and horizontal scalability achievable by applications. However, despite their differences, all three mainstream hypervisors have difficulties with intra-MCP horizontal scalability beyond 4 cores for n-tier applications.
Conference Paper
The MapReduce programming model was introduced for big-data processing, where the data nodes perform both data storing and computation. Thus, we need to understand the different resource requirements of data-storing and computation tasks and schedule these efficiently over multi-core processors. The core affinity defines a mapping between a set of cores and a given task. The core affinity can be decided based on the resource requirements of a task, because this largely affects the efficiency of computation, memory, and I/O resource utilization. In this paper, we analyze the impact of core affinity on the file upload performance of the Hadoop Distributed File System (HDFS). Our study can provide insight into process-scheduling issues on big-data processing systems. We also suggest a framework for dynamic core affinity based on our observations and show that a preliminary implementation can improve throughput by more than 40% compared with a default Linux system.
Article
In this paper, we conduct research on the impact of multi-threading strategies on the performance of end-to-end file transmission. Adopting a multi-threading strategy in file transmission can greatly improve performance; however, the degree of such performance gains is not well documented. We first analyze the merits of multi-threaded file transmission. Then, to demonstrate the performance gains of multi-threaded transmission, we design an architecture that supports multi-threaded transmission on both the client side and the server side. We deploy the architecture in a real network and compare transmission performance under different parameters. Test results show that multi-threaded transmission achieves better performance than single-threaded transmission. However, the performance gains are not proportional to the number of threads. Since file transmission is very common in daily life, it is expected that the results in this paper will provide a guide for software vendors when multi-threading tools are developed.
Chapter
Mutual exclusion protects data structures in parallel environments in order to preserve data integrity. A lock being held effectively blocks the execution of all other threads wanting to access the same shared resource until the lock is released. This blocking behaviour reduces the level of parallelism, causing performance loss. Fine-grained locking reduces contention for the locks, resulting in better throughput; however, the granularity, i.e. how many locks to use, is not straightforward. In large bucket hash tables, the best approach is to divide the table into blocks, each containing one or more buckets, and to lock these blocks independently. The optimal block size depends on the time spent within the critical sections, which in turn depends on the table's internal properties and the arrival intensity of the queries. A queueing model capturing this behaviour is presented, together with an adaptive algorithm that fine-tunes the granularity of locking (the block size) to adapt to the execution environment.
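
A concrete baseline for the block-locking scheme described here is lock striping: each contiguous block of buckets shares one mutex, and the block size is exactly the granularity being tuned. A minimal, static sketch (the chapter's adaptive tuning of the block size is left out):

    #include <pthread.h>

    #define NBUCKETS 4096
    #define NLOCKS     64   /* granularity knob: one lock per 64-bucket block */

    struct node { unsigned key; struct node *next; };

    static struct node    *buckets[NBUCKETS];
    static pthread_mutex_t locks[NLOCKS];

    void table_init(void)
    {
        for (int i = 0; i < NLOCKS; i++)
            pthread_mutex_init(&locks[i], NULL);
    }

    /* Each lock guards a contiguous block of NBUCKETS/NLOCKS buckets
     * (the chapter's "block"); raising NLOCKS lowers contention at the
     * cost of more lock state and cache footprint. */
    void table_insert(struct node *n)
    {
        unsigned b = n->key % NBUCKETS;
        pthread_mutex_t *l = &locks[b / (NBUCKETS / NLOCKS)];
        pthread_mutex_lock(l);
        n->next = buckets[b];
        buckets[b] = n;
        pthread_mutex_unlock(l);
    }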
Article
Asynchronous event-driven server architecture has been considered a superior alternative to its thread-based counterpart due to reduced multithreading overhead. In this paper, we conduct empirical research on the efficiency of asynchronous Internet servers, showing that an asynchronous server may perform significantly worse than a thread-based one due to two design deficiencies. The first is the one-event-one-handler event-processing model widely adopted in current asynchronous Internet servers, which can generate frequent unnecessary context switches between event handlers, leading to significant CPU overhead in the server. The second is a write-spin problem (i.e., repeatedly making unnecessary I/O system calls) in asynchronous servers under specific runtime workload and network conditions (e.g., large response sizes and non-trivial network latency). To address these two design deficiencies, we present a hybrid solution that exploits the merits of different asynchronous architectures so that the server is able to adapt to dynamic runtime workload and network conditions in the cloud. Concretely, our hybrid solution applies lightweight runtime request checking and seeks the most efficient path to process each request from clients. Our results show that the hybrid solution can achieve from 10% to 90% higher throughput than all the other types of servers under various realistic workload and network conditions in the cloud.
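
The write-spin described above arises when an event handler retries write() in a tight loop after a partial send or EAGAIN; the usual remedy is to arm EPOLLOUT and resume when the kernel reports the socket writable. A minimal sketch of that remedy (illustrative only; not the paper's hybrid architecture):

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* Attempt a non-blocking send; on a short or refused write, arm
     * EPOLLOUT and return, instead of spinning on write() until the
     * socket buffer drains (the "write-spin" anti-pattern). */
    ssize_t send_some(int epfd, int fd, const char *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n == (ssize_t)len)
            return n;                                /* fully sent */
        if (n >= 0 || errno == EAGAIN || errno == EWOULDBLOCK) {
            struct epoll_event ev = {
                .events  = EPOLLOUT | EPOLLONESHOT,  /* wake once, when writable */
                .data.fd = fd
            };
            epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
            return n < 0 ? 0 : n;                    /* bytes actually queued */
        }
        return -1;                                   /* genuine error */
    }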
Thesis
Current multicore processors are parallel systems that provide shared memory to the programs executed on them. Both the increasing core counts in so-called many-core processors and the continually increasing performance of the individual cores demand high bandwidths from the processor's memory system. Hardware-based cache coherence is reaching the limits of what is practically feasible in current many-core processors. Accordingly, alternative architectures and suitable programming models must be investigated. This thesis considers the Single-chip Cloud Computer (SCC), a non-cache-coherent many-core processor consisting of 48 cores connected via a mesh network. Although the processor was designed for message-based communication, the results of this work show that one-sided communication based on shared memory can be implemented efficiently on this type of architecture. One-sided communication enables data exchange between processes in which the receiver need not know any details of the communication taking place; in the sense of the MPI standard, this allows access to the memory of remote processes. To realize this concept on non-coherent architectures, this thesis develops and investigates both an efficient process synchronization and a communication scheme based on software-managed cache coherence. The process synchronization implements the concept of general active target synchronization from the MPI standard. An existing classification scheme for its implementations is extended and used to identify a suitable class for the non-coherent SCC platform. Based on this classification, existing implementations are analysed, suitable concepts extracted, and a lightweight synchronization protocol for the SCC is developed that uses both shared memory and uncached memory accesses. The presented scheme is not susceptible to delays between processes and allows direct communication as soon as both communication partners are ready. The experimental results show very good scaling behaviour and a five-fold lower latency for process synchronization compared to a message-based MPI implementation on the SCC. For communication, SCOSCo, a concept based on shared memory and software-managed cache coherence, is presented. Coherence requirements conforming to the MPI standard are formulated, and a lean implementation based on the hardware and software capabilities of the SCC is developed. Despite a malfunction discovered in the SCC's memory subsystem, microbenchmark evaluations show a five-fold improvement in bandwidth and a nearly four-fold reduction in latency. In application experiments, such as a three-dimensional fast Fourier transform, the share of communication in the total runtime can be reduced by a factor of five. In addition, this thesis proposes concepts that will be helpful for implementing software-based coherence for one-sided communication in future architectures that cannot provide cache coherence at the global level of the processor.
Article
Moving high performance computing (HPC) to the cloud not only reduces costs but also gives users the ability to customize their systems. Moreover, compared with traditional HPC environments such as grids and clusters, which run HPC applications on bare metal, a virtualized cloud improves resource utilization and reduces maintenance cost. However, current virtualization-based clouds offer limited performance for HPC. The overhead can be caused by Virtual Machine Monitor (VMM) interceptions, virtualized I/O devices, cross-VM interference, and so on. To guarantee the performance of HPC applications in the cloud, the VMM should interfere with guest VMs as little as possible and allocate dedicated resources such as CPU cores, DRAM and devices to the guest VMs running HPC applications. In this paper, we propose a novel cloud infrastructure that serves HPC applications and ordinary applications concurrently. The infrastructure is based on a lightweight high-performance VMM named nOSV. For HPC applications, nOSV constructs a strongly isolated high-performance guest VM with dedicated resources; at runtime, this VM manages all of its resources itself and is not interfered with by nOSV. By supporting nested virtualization, nOSV can run HPC alongside commodity applications and retain the flexibility of a traditional cloud: it runs other virtualization environments, such as Xen and Docker, as high-performance guest VMs, and all commodity cloud applications are hosted in these environments, sharing hardware resources with each other. Our nOSV prototype delivers bare-metal-like performance for HPC applications, about 23% better than running the same HPC applications on Xen.
Article
Multithreaded programming is becoming increasingly popular in software design as a way to use the computing resources of multicore systems effectively. Tests on an Intel 40-core system show that current commodity operating systems are not well suited to managing such large-scale hardware resources; in particular, application performance suffers significantly when system services are accessed frequently. This paper describes a message-passing mechanism in the Linux kernel, with system-call optimization that dynamically adjusts the execution of system calls to reduce resource-contention overhead. Tests on hackbench and dbench show that this method delivers high performance and scalability for multithreaded, system-service-intensive applications.
Article
We examine the implications of end-to-end web application development in the social web era. The paper describes a distributed architecture suitable for modern web application development, together with the interactivity components associated with it. We also conducted a series of stress tests on popular server-side technologies. The PHP/Apache stack proved inefficient at handling the increasing demands of network traffic. Nginx was found to be more than 2.5 times faster at input/output (I/O) operations than Apache, and Node.js outperformed both. Node.js, although excellent at I/O operations and resource utilization, was found lacking when serving static files with its built-in HTTP server, a task at which Nginx excels. To address efficiency, an Nginx server can therefore be placed in front to proxy static-file requests, allowing the Node.js processes to handle only dynamic content. Such a configuration offers a more efficient and scalable infrastructure, replacing the aged PHP/Apache stack. Furthermore, we found that building cross-platform applications based on web technologies is both feasible and highly productive, especially when addressing stationary and mobile devices and the fragmentation among them. Our study concludes that Node.js offers client-server development integration, aids code reusability in web applications, and is an excellent tool for developing fast, scalable network applications.
Article
Large core counts and hardware resource sharing are two characteristics of multicore processors that pose new challenges for operating system design. Locating and analyzing the factors that limit speedup in the operating system, modeling and avoiding the phenomenon in which speedup decreases with the number of cores because of lock contention (i.e., lock thrashing), and avoiding contention for shared resources such as the last-level cache are key challenges for operating system scalability research on multicore systems.
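The lock-thrashing phenomenon mentioned above is easy to reproduce. The following toy microbenchmark is an illustration of the effect, not code from the paper: it increments a shared counter under a single lock from a growing number of processes, and on most machines the measured speedup stays flat or degrades as processes are added, because every increment serializes on the same lock.

```python
# Toy demonstration of lock thrashing: one contended lock caps scalability.
import multiprocessing as mp
import time

def worker(lock, counter, n_iters):
    for _ in range(n_iters):
        with lock:                       # every increment serializes here
            counter.value += 1

def throughput(n_procs, n_iters=50_000):
    lock = mp.Lock()
    counter = mp.Value("i", 0, lock=False)   # raw value; we lock explicitly
    procs = [mp.Process(target=worker, args=(lock, counter, n_iters))
             for _ in range(n_procs)]
    t0 = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value / (time.perf_counter() - t0)

if __name__ == "__main__":
    base = throughput(1)
    for n in (1, 2, 4, 8):
        print(f"{n} process(es): speedup {throughput(n) / base:.2f}x")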
Conference Paper
Retrieval and content management are usually assumed to be mutually exclusive concerns. In this paper, we suggest that they need not be. In the usual information retrieval scenario, some information about the queries leading to a website (through 'hits' or 'visits') is available to that website's server administrator and can be used to present the site's content better. We further suggest that the retrieval system could share additional information with the content provider. This would enable the content provider (any website) to present its content more dynamically, in tune with query trends, without violating the privacy of the querying user. The result would be better synchronization between retrieval systems and content providers, aimed at improving the user's web search experience. It would also give the content provider a say in the process, given that the provider knows far more about the content than the retrieval system does, and it means that the content presentation may change in response to a query. In the end, the user would be able to find relevant content more easily and quickly. All of this can be made subject to the user's consent.
Article
This paper presents dynamic modeling of a web server hosted on a private cloud using a grey-box identification technique. In contrast to results in the literature, we model the web server as a linear parameter-varying (LPV) state-space system valid around several well-defined operating regions. We programmatically generate synthetic HTTP load with the open-source workload tool httperf and use response time as the performance metric. Finally, we validate the models on test data through experiments in an actual Eucalyptus cloud environment.
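To make the identification idea concrete, the sketch below fits a first-order ARX model by least squares around a single operating point. This is a deliberate simplification of the paper's LPV state-space approach, and the input/output data are synthetic stand-ins for real httperf measurements.

```python
# Fit y[k] = a*y[k-1] + b*u[k-1] by least squares; an LPV model would fit
# several such local models, one per operating region.
import numpy as np

# u: applied request rate (what httperf's --rate would sweep);
# y: measured mean response time. Both are synthetic placeholders here.
rng = np.random.default_rng(0)
u = rng.uniform(50, 150, size=200)
y = np.zeros(200)
for k in range(1, 200):                  # "true" system used to fake data
    y[k] = 0.8 * y[k - 1] + 0.05 * u[k - 1] + rng.normal(0, 0.1)

# Stack regressors and solve for the parameters (a, b).
Phi = np.column_stack([y[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
a, b = theta
print(f"identified: y[k] = {a:.3f}*y[k-1] + {b:.3f}*u[k-1]")
```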
Article
Multicore processor architectures have become ubiquitous in today's computing platforms, especially in parallel computing installations, thanks to their power and cost advantages. As the technology trend continues toward hundreds of cores on a chip in the foreseeable future, an urgent question for system designers and application users is whether applications receive sufficient support from today's operating systems to scale to many cores. To answer it, one needs to understand each system's strengths and weaknesses with respect to scalability and to identify the major bottlenecks limiting scalability, if any. As open-source operating systems are of particular interest to the research and industry communities, in this paper we choose three of them (Linux, Solaris and FreeBSD) and systematically evaluate and compare their scalability using a set of highly focused microbenchmarks on an AMD 32-core system, to gain a broad and detailed understanding of their behavior. We use system profiling tools and analyze kernel source code to find the root cause of each observed scalability bottleneck. Our results reveal that no single operating system among the three stands out on all system aspects, though each can prevail on some. For example, Linux outperforms Solaris and FreeBSD significantly for file-descriptor- and process-intensive operations, while for applications with intensive socket creation and deletion, Solaris leads FreeBSD, which in turn scales better than Linux. With the help of performance tools and source code instrumentation and analysis, we find that the synchronization primitives protecting shared data structures in the kernels are the major bottleneck limiting system scalability.
Article
This paper addresses the resource contention caused by sharing of last-level caches by introducing a novel contention-aware scheduler. To accurately determine a task's resource requirements, i.e., the sole input to our scheduler, we develop a methodology for selecting the best heuristic metric from five candidates to represent those requirements. Based on each task's heuristic value, obtained from the performance monitoring unit, our scheduler co-schedules tasks with complementary resource requirements by combining scheduling-order adjustments with task-to-core reassignments. The proposed scheduler has been implemented in the completely fair scheduler, the rotating staircase deadline scheduler and the O(1) scheduler. Using eight workloads constructed from nine NASA Advanced Supercomputing serial benchmarks on an Intel dual-core platform, the execution time of an individual task is reduced by up to 21%, system scalability and workload performance are improved by up to 13%, and the full potential of contention-aware scheduling is reached when the time-slice length and the period of executing each task once are short enough. In addition, our proposal reduces the fluctuation in the execution times of individual tasks by enforcing reasonable usage of shared resources. Finally, we demonstrate the expected performance improvement on an Intel eight-core platform to suggest the broad applicability of our approach.
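The placement side of such a policy can be sketched compactly. The snippet below is a hypothetical illustration, not the authors' scheduler: given per-task values of some LLC-pressure heuristic (assumed to have been measured elsewhere, e.g. via the performance monitoring unit), it pairs the most and least demanding tasks and pins each pair to cores sharing a last-level cache using Linux's sched_setaffinity. The PIDs, metric values and core topology are placeholders.

```python
# Contention-aware pairing sketch: co-locate one cache-heavy and one
# cache-light task on each pair of cores sharing an LLC (Linux only).
import os

tasks = {1234: 12.0, 1235: 0.4, 1236: 9.5, 1237: 1.1}   # pid -> LLC pressure
llc_domains = [(0, 1), (2, 3)]                          # cores sharing an LLC

ranked = sorted(tasks, key=tasks.get)                   # light ... heavy
pairs = [(ranked[i], ranked[-(i + 1)]) for i in range(len(ranked) // 2)]

for (light, heavy), cores in zip(pairs, llc_domains):
    for pid, core in ((light, cores[0]), (heavy, cores[1])):
        try:
            os.sched_setaffinity(pid, {core})           # pin task to core
        except (ProcessLookupError, PermissionError) as e:
            print(f"could not pin pid {pid} (placeholder in this sketch): {e}")
```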
Article
Spending a little energy to obtain a larger performance improvement is worthwhile, and so is relaxing a small performance requirement to save a large amount of energy. Trading a small amount of energy for a considerable amount of performance, or vice versa, becomes possible if the relationship between the performance and energy consumption of parallel programs is known precisely. This work studies that relationship by recording the speedup and energy consumption of parallel programs as the number of cores on which they run is varied. We demonstrate that the performance improvement and the increased energy consumption have a linear negative correlation. Moreover, these relationships can guide performance-energy adaptation under two assumptions. Our experiments show that the average correlation coefficients between performance and energy are higher than 97%. Furthermore, we find that exchanging less than 6% performance loss for more than 37% energy savings is feasible, and vice versa.
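The correlation analysis itself is simple to reproduce in outline. The toy computation below shows how such a correlation coefficient and a performance-for-energy exchange would be computed; all numbers are invented placeholders, not the paper's measurements, so the signs and magnitudes here carry no meaning beyond illustration.

```python
# Toy performance-energy bookkeeping with invented numbers.
import numpy as np

cores   = np.array([1, 2, 4, 8])
speedup = np.array([1.0, 1.9, 3.4, 5.3])    # hypothetical speedups
energy  = np.array([1.0, 1.25, 1.8, 2.9])   # hypothetical normalized energy

# Correlation coefficient between the two series.
r = np.corrcoef(speedup, energy)[0, 1]
print(f"correlation coefficient: {r:.3f}")

# Example exchange: dropping from 8 to 4 cores in this fake data gives up
# ~36% performance for ~38% less energy, the kind of trade the paper studies.
perf_loss  = 1 - speedup[2] / speedup[3]
energy_cut = 1 - energy[2] / energy[3]
print(f"perf loss {perf_loss:.0%}, energy saved {energy_cut:.0%}")
```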
Article
The advent of multi-core technology motivates new studies to understand how efficiently Web servers utilize such hardware. This paper presents a detailed performance study of a Web server application deployed on a modern eight-core server. Our study shows that default Web server configurations result in poor scalability with increasing core counts. We study two different types of workloads, namely, a workload with intense TCP/IP-related OS activity and the SPECweb2009 Support workload with more application-level processing. We observe that the scaling behaviour is markedly different for these workloads, mainly because of the difference in the performance of static and dynamic requests. While static requests perform poorly when moving from using one socket to both sockets in the system, the converse is true for dynamic requests. We show that, contrary to what was suggested by previous work, Web server scalability improvement policies need to be adapted based on the type of workload experienced by the server. The results of our experiments reveal that with workload-specific Web server configuration strategies, a multi-core server can be utilized up to 80% while still serving requests without significant queuing delays; utilizations beyond 90% are also possible, while still serving requests with 'acceptable' response times. Copyright © 2014 John Wiley & Sons, Ltd.
Conference Paper
Full-text available
As technology trends push future microprocessors toward chip multiprocessor designs, operating system network stacks must be parallelized in order to keep pace with improvements in network bandwidth. There are two competing strategies for stack parallelization. Message-parallel network stacks use concurrent threads to carry out network operations on independent messages (usually packets), whereas connection-parallel stacks map operations to groups of connections and permit concurrent processing on independent connection groups. Connection-parallel stacks can use either locks or threads to serialize access to connection groups. This paper evaluates these parallel stack organizations using a modern operating system and chip multiprocessor hardware. Compared to uniprocessor kernels, all parallel stack organizations incur additional locking overhead, cache inefficiencies, and scheduling overhead. However, the organizations balance these limitations differently, leading to variations in peak performance and connection scalability. Lock-serialized connection-parallel organizations reduce the locking overhead of message-parallel organizations by using many connection groups and eliminate the expensive thread handoff mechanism of thread-serialized connection-parallel organizations. The resultant organization outperforms the others, delivering 5.4 Gb/s of TCP throughput for most connection loads and providing a 126% throughput improvement versus a uniprocessor for the heaviest connection loads.
Conference Paper
Full-text available
Load balancing in packet-switched networks is a task of ever-growing importance. Network traffic properties, such as the Zipf-like flow length distribution and bursty transmission patterns, and requirements on packet ordering or stable flow mapping, make it a particularly difficult and complex task, needing adaptive heuristic solutions. In this paper, we present two main contributions: Firstly, we evaluate and compare two recently proposed algorithmic heuristics that attempt to adaptively balance load among the destination units. The evaluation on real life traces confirms the previously conjectured impact of the Zipf-like flow length distribution and traffic burstiness. Furthermore, we identify the distinction between the goals of preserving either the sequence order of packets, or the flow-to-destination mapping, showing different strengths of each algorithm. Secondly, we demonstrate a novel hybrid scheme that combines best of the flow-based and burst-based load balancing techniques and excels in both of the key metrics of flow remapping and packet reordering.
Conference Paper
Full-text available
Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic operating systems (OS) for SMP. Previous researchers have experimented with affinitizing processes/threads, as well as interrupts from devices, to specific processors in an SMP system. However, general-purpose operating systems give minimal consideration to user-defined affinity in their schedulers. Our goal is to expose the full potential of affinity through an in-depth characterization of the reasons behind its performance gains. We conducted an experimental study of TCP performance under various affinity modes on IA-based servers. Results showed that interrupt affinity alone provided a throughput gain of up to 25%, and that combined thread/process and interrupt affinity achieved gains of 30%. In particular, calling out the impact of affinity on machine clears (in addition to cache misses) is a characterization that has not been done before.
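Both affinity modes studied here map onto standard Linux interfaces, which the hedged sketch below illustrates. The IRQ number is a placeholder (the real one would be looked up in /proc/interrupts), writing the interrupt mask requires root, and the choice of CPU 0 is arbitrary.

```python
# Sketch of the two affinity knobs: interrupt affinity via procfs and
# process affinity via sched_setaffinity (Linux only).
import os

NIC_IRQ = 24   # placeholder IRQ number; see /proc/interrupts for the real one

# 1) Interrupt affinity: steer the NIC's interrupts to CPU 0 (hex mask "1").
try:
    with open(f"/proc/irq/{NIC_IRQ}/smp_affinity", "w") as f:
        f.write("1")
except (PermissionError, FileNotFoundError) as e:
    print("could not set IRQ affinity:", e)

# 2) Process affinity: pin this process (pid 0 = self) to the same CPU so
# interrupt handling and protocol/application processing share a cache.
os.sched_setaffinity(0, {0})
print("running on CPUs:", os.sched_getaffinity(0))
```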
Article
Full-text available
A novel scheme for processing packets in a router is presented that provides load sharing among multiple network processors distributed within the router. It is complemented by a feedback control mechanism designed to prevent processor overload. Incoming traffic is scheduled onto multiple processors based on a deterministic mapping derived from the robust hash routing (also known as highest random weight, HRW) scheme introduced by K. W. Ross (IEEE Network, 11(6), 1997) and D. G. Thaler et al. (IEEE Trans. Networking, 6(1), 1998). No state information on individual flow mappings has to be stored; instead, for each packet a mapping function is computed over an identifier vector, a predefined set of fields in the packet. An adaptive extension to the HRW scheme is provided to cope with biased traffic patterns. We prove that our adaptation possesses the minimal disruption property with respect to the mapping and exploit that property to minimize the probability of flow reordering. Simulation results indicate that the scheme achieves significant improvements in processor utilization. A higher number of router interfaces can thus be supported with the same amount of processing power.
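The HRW mapping is compact enough to sketch in full. The implementation below is a generic illustration (SHA-1 as the hash and string processor names are my choices, not the paper's): each packet's identifier vector is hashed once per processor, the highest weight wins, and removing a processor remaps only the flows it owned, which is the minimal disruption property the abstract refers to.

```python
# Robust hash routing (HRW / highest random weight) sketch: stateless,
# deterministic flow-to-processor mapping with minimal disruption.
import hashlib

def hrw_map(flow_id: bytes, processors: list) -> str:
    def weight(proc: str) -> int:
        digest = hashlib.sha1(flow_id + proc.encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(processors, key=weight)       # highest weight wins

procs = ["p0", "p1", "p2", "p3"]
flow = b"10.0.0.1:4242->10.0.0.2:80"          # identifier vector of one flow
print("flow maps to:", hrw_map(flow, procs))

# Minimal disruption: dropping one processor remaps only the flows that
# previously hashed highest to it; every other flow keeps its mapping.
flows = [f"10.0.0.{i}:1000->10.0.0.2:80".encode() for i in range(100)]
before = {f: hrw_map(f, procs) for f in flows}
after = {f: hrw_map(f, procs[:-1]) for f in flows}
moved = sum(before[f] != after[f] for f in flows)
print(f"{moved} of {len(flows)} flows remapped after removing p3")
```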
Article
Full-text available
Workload distribution is critical to the performance of network processor based parallel forwarding systems. Scheduling schemes that operate at the packet level, e.g., round-robin, cannot preserve packet-ordering within individual TCP connections. Moreover, these schemes create duplicate information in processor caches and therefore are inefficient in resource utilization. Hashing operates at the flow level and is naturally able to maintain per-connection packet ordering; besides, it does not pollute caches. A pure hash-based system, however, cannot balance processor load in the face of highly skewed flow-size distributions in the Internet; usually, adaptive methods are needed. In this paper, based on measurements of Internet traffic, we examine the sources of load imbalance in hash-based scheduling schemes. We prove that under certain Zipf-like flow-size distributions, hashing alone is not able to balance workload. We introduce a new metric to quantify the effects of adaptive load balancing on overall forwarding performance. To achieve both load balancing and efficient system resource utilization, we propose a scheduling scheme that classifies Internet flows into two categories: the aggressive and the normal, and applies different scheduling policies to the two classes of flows. Compared with most state-of-the-art parallel forwarding schemes, our work exploits flow-level Internet traffic characteristics.
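The two-class policy can be outlined in a few lines. The sketch below is an interpretation of the idea rather than the authors' algorithm, and the packet-count threshold separating aggressive from normal flows is an arbitrary placeholder: normal flows are hashed, preserving per-connection packet order and cache locality, while flows that cross the threshold are steered to the least-loaded processor.

```python
# Two-class flow scheduling sketch: hash normal flows, balance aggressive ones.
from collections import defaultdict

N_PROCS = 4
AGGRESSIVE_THRESHOLD = 1000     # packets before a flow counts as aggressive

flow_packets = defaultdict(int) # flow id -> packets observed so far
proc_load = [0] * N_PROCS       # packets assigned per processor

def schedule(flow_id: str) -> int:
    flow_packets[flow_id] += 1
    if flow_packets[flow_id] > AGGRESSIVE_THRESHOLD:
        # Aggressive flow: balance load, accepting possible reordering.
        proc = min(range(N_PROCS), key=lambda p: proc_load[p])
    else:
        # Normal flow: a static hash keeps per-connection packet order.
        proc = hash(flow_id) % N_PROCS
    proc_load[proc] += 1
    return proc

# One heavy (elephant) flow crosses the threshold and spreads out afterwards.
for _ in range(2500):
    schedule("10.0.0.1:5000->10.0.0.2:80")
print("per-processor load:", proc_load)
```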
Article
Full-text available
The delivered TCP performance on high-speed networks is often limited by the sending and receiving hosts rather than by the network hardware or the TCP protocol implementation itself. In such cases, systems can achieve higher bandwidth by reducing host overheads through a variety of optimizations above and below the TCP protocol stack, given support from the network interface. This article surveys the most important of these optimizations and illustrates their effects quantitatively with empirical results from an experimental network delivering up to 2 Gb/s of end-to-end TCP bandwidth.
Article
In 2002, Ireland's National Research and Education Network, HEAnet, decided to overhaul its mirroring service, ftp.heanet.ie. The re-launched service, using Apache 2.0, quickly attracted attention as it became the first official Sourceforge mirror not run by the OSDN group. Because Sourceforge is entirely HTTP-based and ftp.heanet.ie was responsible for approximately 80% of all Sourceforge downloads between April 2003 and April 2004, sustaining over 20,000 concurrent connections (the current record is 27,317) from a single Apache instance has become a regular occurrence. This paper and the accompanying talk share the lessons learned and the techniques used to achieve this scalability.
Article
Traditionally, small SMP nodes have been based on a shared bus interconnect. The frequency and performance of this shared bus depends heavily on the number of agents/devices on the bus. In order to scale the bus frequency proportionally to the increase in CPU frequency, the use of multiple independent busses (and therefore multiple smaller nodes) may be desirable. In multi-bus architectures, the propagation of snoops to maintain coherence can severely limit the application performance. In this paper, we study the benefits of snoop filters to reduce the amount of snoop propagations. We describe a detailed case study of the design trade-offs for snoop filters in a dual-bus web server. We show how the performance impact of a snoop filter depends on the following dimensions: (1) size and organization of the snoop filter, (2) inclusive vs. non-inclusive filters, (3) amount of state maintained in the snoop filter. We also propose optimizations to snoop filter replacement policies and show how various optimizations impact the snoop filter performance. Our case study focuses on a commercial dual-processor web server running a SPECweb99-like workload. Our evaluation is based on detailed trace-driven snoop filter simulations (using CASPER) and detailed hybrid simulation / analytical models for the workload and the platform. Overall, we show that the use of the snoop filter may be essential to improving the performance of multi-bus architectures.
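The benefit of a snoop filter can be approximated with a toy model. The simulation below uses invented sizes and traces, not the paper's CASPER methodology: the filter tracks which line addresses may be cached on the far side of a two-bus system and forwards only the snoops it cannot rule out.

```python
# Toy model of a snoop filter between two buses: a snoop crosses the bridge
# only if the filter says the line may be cached on the far side.
import random

FILTER_LINES = 1024              # tracked line addresses (toy capacity)

class SnoopFilter:
    def __init__(self):
        self.present = set()     # lines possibly cached on the far bus

    def record_fill(self, line):
        # A cache fill was observed on the far bus; remember the line.
        if len(self.present) >= FILTER_LINES:
            self.present.pop()   # crude eviction, stands in for real LRU
        self.present.add(line)

    def must_snoop(self, line):
        return line in self.present

f = SnoopFilter()
rng = random.Random(0)
for _ in range(5000):            # far bus touches a small working set
    f.record_fill(rng.randrange(512))

# Snoops for lines the far side never cached are filtered out entirely.
snoops = sum(f.must_snoop(rng.randrange(8192)) for _ in range(10000))
print(f"{snoops} of 10000 cross-bus snoops actually forwarded")
```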
Article
Techniques for avoiding the high memory overheads found on many modern shared-memory multiprocessors are of increasing importance in the development of high-performance multiprocessor protocol implementations. One such technique is processor-cache affinity scheduling, which can significantly lower packet latency and substantially increase protocol processing throughput [30]. In this paper, we evaluate several aspects of the effectiveness of affinity-based scheduling in multiprocessor network protocol processing, under packet-level and connection-level parallelization approaches. Specifically, we evaluate the performance of the scheduling technique 1) when a large number of streams are concurrently supported, 2) when processing includes copying of uncached packet data, 3) as applied to send-side protocol processing, and 4) in the presence of stream burstiness and source locality, two well-known properties of network traffic. We find that affinity-based scheduling performs well under these conditions, emphasizing its robustness and general effectiveness in multiprocessor network processing. In addition, we explore a technique which improves the caching behavior and available packet-level concurrency under connection-level parallelism, and find that performance improves dramatically.
Article
As Internet usage continues to expand rapidly, careful attention needs to be paid to the design of Internet servers for achieving high performance and end-user satisfaction. Currently, the memory system continues to remain a significant performance bottleneck for Internet servers employing multi-GHz processors. In this paper, our aim is two-fold: (1) to characterize the cache/memory performance of web server workloads and (2) to propose and evaluate cache design alternatives for future web servers. We chose SPECweb99 as the representative web server workload and our entire characterization and evaluation methodology is based on our CASPER simulation framework. We begin by exploring the processor cache design space for single and dual-processor servers. Based on our observations, we then evaluate other cache hierarchy alternatives such as chipset caches, coherence filters and decompressed page stores. We show the sensitivity of these components to basic organization parameters such as cache size, line size and degree of associativity. We also present the performance implications of routing memory requests initiated by I/O devices through these caches. Based on detailed simulation data and its implications on system level performance, this paper shows that chipset caches have significant potential for improving future web server performance.
Article
This document describes a new networking subsystem architecture built around a packet classifier executing in the Network Interface Card (NIC). By classifying packets in the NIC, we believe that performance, scalability, and robustness can be significantly improved on shared-memory multiprocessor Internet servers. To demonstrate the feasibility and benefits of the approach, we developed a software prototype (consisting of extensions to the Linux kernel and modifications to the Myrinet NIC firmware and driver) and ran a series of experiments. The results, presented herein, show the relevance of the approach.
Conference Paper
Destination set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors), reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade off latency and bandwidth by sending requests directly to a predicted destination set. We explore the destination set predictor design space, focusing on a collection of important commercial workloads. First, we analyze the sharing behavior of these workloads. Second, we propose predictors that exploit the observed sharing behavior to target different points in the latency/bandwidth tradeoff. Third, we illustrate the effectiveness of destination set predictors in the context of a multicast snooping protocol. For example, one of our predictors obtains almost 90% of the performance of snooping while using only 15% more bandwidth than a directory protocol (and less than half the bandwidth of snooping).
Conference Paper
The authors report a preliminary analysis of the processing overhead of the transport protocol TCP (Transmission Control Protocol), in which they estimate the protocol's possible performance range. The analysis was performed by compiling a version of TCP and counting the number of instructions on the common path. It suggests that fewer than 200 instructions are required to process a TCP packet in the normal case. This number is small enough to support very high-speed transmission if instruction count were the major overhead. The authors offer some speculations about the actual sources of processing overhead in network protocols.
Article
The transport layer of the protocol suite, especially in connectionless protocols, has considerable functionality and is typically executed in software by the host processor at the end points of the network. It is thus considered a likely source of processing overhead. However, a preliminary examination has suggested to the authors that other aspects of networking may be a more serious source of overhead. To test this proposition, a detailed study was made of the Transmission Control Protocol (TCP), the transport protocol from the Internet protocol suite. In this set of protocols, the functions of detecting and recovering lost or corrupted packets, flow control, and multiplexing are performed at the transport level. The results of that study are presented. It is concluded that TCP is in fact not the source of the overhead often observed in packet processing, and that it could support very high speeds if properly implemented.