Tomasz Bujlow’s research while affiliated with Polytechnic University of Catalonia and other places

Publications (13)


A Survey on Web Tracking: Mechanisms, Implications, and Defenses
  • Article

March 2017 · 534 Reads · 119 Citations

Proceedings of the IEEE

Tomasz Bujlow · Valentin Carela-Espanol · [...]

Privacy seems to be the Achilles' heel of today's web. Most web services make continuous efforts to track their users and to obtain as much personal information as they can from the things they search, the sites they visit, the people they contact, and the products they buy. This information is mostly used for commercial purposes, which go far beyond targeted advertising. Although many users are already aware of the privacy risks involved in the use of internet services, the particular methods and technologies used for tracking them are much less known. In this survey, we review the existing literature on the methods used by web services to track users online, as well as their purposes, implications, and possible user defenses. We present five main groups of methods used for user tracking, which are based on sessions, client storage, client cache, fingerprinting, and other approaches. A special focus is placed on mechanisms that use web caches, operational caches, and fingerprinting, as they usually employ a particularly rich set of creative methodologies. We also show how users can be identified on the web and associated with their real names, e-mail addresses, phone numbers, or even street addresses. We show why tracking is being used and its possible implications for the users. For each of the tracking methods, we present possible defenses. Some of them are specific to a particular tracking approach, while others are more universal (they block more than one threat). Finally, we present the future trends in user tracking and show that they can potentially pose significant threats to the users' privacy.
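
As a concrete illustration of the fingerprinting-based group of methods, the minimal sketch below hashes a set of client-reported attributes into a stable identifier; the attribute names and values are illustrative assumptions, not code or data from the survey.

```python
import hashlib

def fingerprint(attributes: dict) -> str:
    """Derive a stable identifier from client-reported attributes.

    Real fingerprinters combine many more signals (canvas rendering,
    installed fonts, plugins, etc.); this is only the core idea.
    """
    # Sort keys so the same attribute set always yields the same hash.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two visits with identical attributes map to the same identifier, even if
# the user clears cookies and other client-side storage between them.
visit = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
    "screen": "1920x1080x24",
    "timezone": "Europe/Madrid",
    "language": "en-US",
}
print(fingerprint(visit))
```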


Web Tracking: Mechanisms, Implications, and Defenses
  • Article
  • Full-text available

July 2015 · 3,287 Reads · 10 Citations

This article surveys the existing literature on the methods currently used by web services to track users online, as well as their purposes, implications, and possible user defenses. A significant majority of the reviewed articles and web resources are from the years 2012-2014. Privacy seems to be the Achilles' heel of today's web. Web services make continuous efforts to obtain as much information as they can about the things we search, the sites we visit, the people we contact, and the products we buy. Tracking is usually performed for commercial purposes. We present five main groups of methods used for user tracking, which are based on sessions, client storage, client cache, fingerprinting, or yet other approaches. A special focus is placed on mechanisms that use web caches, operational caches, and fingerprinting, as they usually employ a particularly rich set of creative methodologies. We also show how users can be identified on the web and associated with their real names, e-mail addresses, phone numbers, or even street addresses. We show why tracking is being used and its possible implications for the users (price discrimination, assessing financial credibility, determining insurance coverage, government surveillance, and identity theft). For each of the tracking methods, we present possible defenses. Apart from describing the methods and tools used to keep personal data from being tracked, we also present several tools that were used for research purposes. Their main goal is to discover how and by which entity users are being tracked on their desktop computers or smartphones, provide this information to the users, and visualize it in an accessible and easy-to-follow way. Finally, we present the currently proposed future approaches to track the user and show that they can potentially pose significant threats to the users' privacy.
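
One of the cache-based mechanisms mentioned above, ETag tracking, can be sketched from the server's point of view as follows; the storage, header handling, and identifier format are simplified assumptions for illustration only.

```python
import uuid

# Server-side view of ETag-based re-identification: a unique ETag is handed
# out with a long-lived cacheable resource, and on revalidation the browser
# echoes it back in If-None-Match, re-identifying the client without cookies.
known_clients = set()

def serve_tracking_pixel(request_headers):
    etag = request_headers.get("If-None-Match")
    if etag in known_clients:
        # Returning 304 keeps the identifier alive in the browser cache.
        return 304, {"ETag": etag}
    new_etag = uuid.uuid4().hex  # unique per first-time visitor
    known_clients.add(new_etag)
    return 200, {"ETag": new_etag, "Cache-Control": "max-age=31536000"}

# First visit: no ETag yet, so a fresh identifier is assigned.
status, headers = serve_tracking_pixel({})
# Revalidation: the stored ETag is sent back and the client is recognized.
status_again, _ = serve_tracking_pixel({"If-None-Match": headers["ETag"]})
print(status, status_again)  # 200 304
```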

Independent comparison of popular DPI tools for traffic classification

January 2015 · 2,666 Reads · 216 Citations

Computer Networks

Deep Packet Inspection (DPI) is the state-of-the-art technology for traffic classification. According to the conventional wisdom, DPI is the most accurate classification technique. Consequently, most popular products, either commercial or open-source, rely on some sort of DPI for traffic classification. However, the actual performance of DPI is still unclear to the research community, since the lack of public datasets prevents the comparison and reproducibility of their results. This paper presents a comprehensive comparison of 6 well-known DPI tools, which are commonly used in the traffic classification literature. Our study includes 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, nDPI, and Libprotoident). We studied their performance in various scenarios (including packet and flow truncation) and at different classification levels (application protocol, application, and web service). We carefully built a labeled dataset with more than 750,000 flows, which contains traffic from popular applications. We used the Volunteer-Based System (VBS), developed at Aalborg University, to guarantee the correct labeling of the dataset. We released this dataset, including full packet payloads, to the research community. We believe this dataset could become a common benchmark for the comparison and validation of network traffic classifiers. Our results present PACE, a commercial tool, as the most accurate solution. Surprisingly, we find that some open-source tools, such as nDPI and Libprotoident, also achieve very high accuracy.
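
The evaluation at different classification levels mentioned above can be sketched as follows; the flow records, labels, and tool outputs are made-up placeholders, not data or results from the paper.

```python
# Per-level accuracy: a tool may get the application protocol right while
# failing at the finer-grained application or web-service level.
LEVELS = ("protocol", "application", "web_service")

ground_truth = [
    {"protocol": "HTTP", "application": "web browser", "web_service": "youtube.com"},
    {"protocol": "TLS",  "application": "Skype",       "web_service": "none"},
]
tool_output = [
    {"protocol": "HTTP", "application": "web browser", "web_service": "unknown"},
    {"protocol": "TLS",  "application": "unknown",     "web_service": "unknown"},
]

def per_level_accuracy(truth, predicted):
    return {
        level: sum(t[level] == p[level] for t, p in zip(truth, predicted)) / len(truth)
        for level in LEVELS
    }

print(per_level_accuracy(ground_truth, tool_output))
# {'protocol': 1.0, 'application': 0.5, 'web_service': 0.0}
```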


nDPI: Open-source high-speed deep packet inspection

September 2014 · 3,894 Reads · 159 Citations

Network traffic analysis was traditionally limited to packet headers, because the transport protocol and application ports were usually sufficient to identify the application protocol. With the advent of port-independent, peer-to-peer, and encrypted protocols, the task of identifying application protocols became increasingly challenging, thus motivating the creation of tools and libraries for network protocol classification. This paper covers the design and implementation of nDPI, an open-source library for protocol classification using both packet headers and payload. nDPI was extensively validated in various monitoring projects, ranging from Linux kernel protocol classification to the analysis of 10 Gbit traffic, reporting both high protocol detection accuracy and efficiency.
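
The combination of header and payload information described above can be illustrated with the sketch below; the packet budget, port hints, and payload patterns are assumptions chosen for illustration and do not reflect nDPI's actual internals or API.

```python
# A flow-scoped detector: inspect the first few packets of a flow, let a
# payload match override the port hint, and give up after a small budget.
MAX_PACKETS = 8
PORT_HINTS = {80: "HTTP", 443: "TLS", 53: "DNS"}

def detect_flow(dst_port, payloads):
    for payload in payloads[:MAX_PACKETS]:
        if payload.startswith((b"GET ", b"POST ", b"HEAD ", b"HTTP/")):
            return "HTTP"
        if payload[:2] == b"\x16\x03":  # TLS handshake record header
            return "TLS"
    # No payload signature matched within the budget: fall back to the port.
    return PORT_HINTS.get(dst_port, "unknown")

print(detect_flow(8080, [b"", b"GET /index.html HTTP/1.1\r\n"]))  # HTTP on a non-standard port
print(detect_flow(6881, [b"\x00" * 20] * 10))                     # unknown
```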


Is Our Ground-Truth for Traffic Classification Reliable?

March 2014 · 80 Reads · 56 Citations

Lecture Notes in Computer Science

The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground truth built from private datasets and labeled by techniques of unknown reliability. This makes validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools, we have carefully built a labeled dataset of more than 500,000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the available open-source tools, nDPI and especially Libprotoident also achieve very high precision, while other, more frequently used tools (e.g., L7-filter) are not reliable enough and should not be used for ground-truth generation in their current form.
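
The reliability question above hinges on the difference between flows a tool leaves unclassified and flows it mislabels: the former only shrink the ground truth, the latter corrupt it. A minimal sketch of that distinction, with made-up labels, is shown below.

```python
def precision_and_coverage(truth, predicted):
    """Precision over classified flows only, plus the share of flows labeled."""
    classified = {fid: label for fid, label in predicted.items() if label != "unknown"}
    correct = sum(truth[fid] == label for fid, label in classified.items())
    precision = correct / len(classified) if classified else 0.0
    coverage = len(classified) / len(truth)
    return precision, coverage

truth = {1: "HTTP", 2: "DNS", 3: "BitTorrent", 4: "Skype"}
cautious_tool = {1: "HTTP", 2: "DNS", 3: "unknown", 4: "unknown"}
confident_tool = {1: "HTTP", 2: "DNS", 3: "eDonkey", 4: "RTP"}

# The cautious tool labels fewer flows but never pollutes the ground truth.
print(precision_and_coverage(truth, cautious_tool))   # (1.0, 0.5)
print(precision_and_coverage(truth, confident_tool))  # (0.5, 1.0)
```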


Obtaining Internet Flow Statistics by Volunteer-Based System

January 2013 · 21 Reads · 1 Citation

Advances in Intelligent Systems and Computing

In this paper we demonstrate how the Volunteer-Based System for Research on the Internet, developed at Aalborg University, can be used for creating statistics of Internet usage. Since the data is collected on individual machines, the statistics can be produced for both individual users and groups of users, and as such they are also useful for segmenting users into groups. We present results with data collected from real users over several months; in particular, we demonstrate how the system can be used for studying flow characteristics: the number of TCP and UDP flows, average flow lengths, and average flow durations. The paper concludes with a discussion of further statistics that can be produced and of the further development of the system.
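
The flow statistics described above boil down to grouping packet records by their 5-tuple and aggregating per transport protocol; the sketch below illustrates this with a few made-up packet records.

```python
from collections import defaultdict

# Group packet records by the 5-tuple, then derive per-protocol flow counts,
# average flow sizes, and average flow durations.
packets = [
    # (src_ip, dst_ip, src_port, dst_port, proto, bytes, timestamp)
    ("10.0.0.2", "93.184.216.34", 50432, 80, "TCP", 512, 0.00),
    ("10.0.0.2", "93.184.216.34", 50432, 80, "TCP", 1460, 0.12),
    ("10.0.0.2", "8.8.8.8", 53311, 53, "UDP", 74, 0.05),
]

flows = defaultdict(lambda: {"bytes": 0, "packets": 0, "first": None, "last": None})
for src, dst, sport, dport, proto, size, ts in packets:
    flow = flows[(src, dst, sport, dport, proto)]
    flow["bytes"] += size
    flow["packets"] += 1
    flow["first"] = ts if flow["first"] is None else min(flow["first"], ts)
    flow["last"] = ts if flow["last"] is None else max(flow["last"], ts)

for proto in ("TCP", "UDP"):
    selected = [flow for key, flow in flows.items() if key[4] == proto]
    if not selected:
        continue
    avg_bytes = sum(flow["bytes"] for flow in selected) / len(selected)
    avg_duration = sum(flow["last"] - flow["first"] for flow in selected) / len(selected)
    print(f"{proto}: {len(selected)} flows, avg {avg_bytes:.0f} B, avg {avg_duration:.2f} s")
```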


A method for evaluation of quality of service in computer networks

January 2013 · 48 Reads · 4 Citations

Monitoring of Quality of Service (QoS) in high-speed Internet infrastructures is a challenging task. Precise assessments must take into account the fact that the requirements for a given quality level are service-dependent. Backbone QoS monitoring and analysis requires processing large amounts of data and knowing which kinds of applications generate the traffic. To overcome the drawbacks of existing methods for traffic classification, we proposed and evaluated a centralized solution based on the C5.0 Machine Learning Algorithm (MLA) and decision rules. The first task was to collect and provide to C5.0 high-quality training data divided into groups that correspond to different types of applications. It was found that the currently existing means of collecting data (classification by ports, Deep Packet Inspection, statistical classification, public data sources) are not sufficient and do not comply with the required standards. We developed a new system to collect training data, in which the major role is performed by volunteers. Client applications installed on volunteers' computers collect detailed data about each flow passing through the network interface, together with the application name taken from the description of system sockets. This paper proposes a new method for measuring the level of Quality of Service in broadband networks. It is based on our Volunteer-Based System to collect the training data, Machine Learning Algorithms to generate the classification rules, and application-specific rules for assessing the QoS level. We combine both passive and active monitoring technologies. The paper evaluates different implementation possibilities, presents the current implementation of particular parts of the system, their initial runs, and the obtained results, highlighting the parts relevant from the QoS point of view.
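
The service-dependent part of the assessment can be sketched as a table of per-class requirements checked against measured values; the traffic classes and numeric thresholds below are illustrative assumptions, not the rules used in the paper.

```python
# Each traffic class gets its own QoS thresholds; a measurement is judged
# against the thresholds of the class the flow was assigned to.
QOS_REQUIREMENTS = {
    "voip":  {"max_delay_ms": 150,  "max_jitter_ms": 30,   "max_loss_pct": 1.0},
    "video": {"max_delay_ms": 400,  "max_jitter_ms": 50,   "max_loss_pct": 2.0},
    "web":   {"max_delay_ms": 1000, "max_jitter_ms": 1000, "max_loss_pct": 5.0},
}

def qos_ok(traffic_class, delay_ms, jitter_ms, loss_pct):
    req = QOS_REQUIREMENTS[traffic_class]
    return (delay_ms <= req["max_delay_ms"]
            and jitter_ms <= req["max_jitter_ms"]
            and loss_pct <= req["max_loss_pct"])

# The same measurement can be acceptable for web browsing but not for VoIP.
print(qos_ok("voip", delay_ms=220, jitter_ms=12, loss_pct=0.3))  # False
print(qos_ok("web",  delay_ms=220, jitter_ms=12, loss_pct=0.3))  # True
```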


Obtaining application-based and content-based internet traffic statistics

December 2012 · 23 Reads

Understanding Internet traffic is crucial for both academic research and practical network engineering, e.g., for traffic classification, traffic prioritization, and creating realistic scenarios and models of Internet traffic development. In this paper we demonstrate how the Volunteer-Based System for Research on the Internet, developed at Aalborg University, is capable of providing detailed statistics of Internet usage. Since an increasing amount of HTTP traffic has been observed during the last few years, the system also supports creating statistics of different kinds of HTTP traffic, such as audio, video, and file transfers. All statistics can be obtained for individual users of the system, for groups of users, or for all users altogether. This paper presents results with real data collected from a limited number of real users over six months. We demonstrate that the system can be useful for studying characteristics of computer network traffic in an application-oriented or content-type-oriented way, and that it is now ready for a larger-scale implementation. The paper concludes with a discussion of various applications of the system and possibilities for further enhancement.
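
Grouping HTTP traffic by content type, as described above, can be sketched as follows; the mapping rules and the sample response records are illustrative assumptions.

```python
from collections import Counter

# Bucket HTTP responses by Content-Type into coarse categories such as
# audio, video, and file transfers, and aggregate transferred bytes.
def bucket(content_type):
    major = content_type.split("/", 1)[0].lower()
    if major in ("audio", "video", "image", "text"):
        return major
    if content_type in ("application/octet-stream", "application/zip"):
        return "file transfer"
    return "other"

responses = [
    ("text/html", 14_200),
    ("video/mp4", 4_500_000),
    ("application/zip", 820_000),
    ("audio/mpeg", 3_100_000),
]

bytes_per_bucket = Counter()
for content_type, size in responses:
    bytes_per_bucket[bucket(content_type)] += size
print(bytes_per_bucket.most_common())
```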


Classification of HTTP traffic based on C5.0 Machine Learning Algorithm

July 2012 · 165 Reads · 79 Citations

Proceedings - International Symposium on Computers and Communications

Our previous work demonstrated the possibility of distinguishing several kinds of applications with an accuracy of over 99%. Today, most of the traffic is generated by web browsers, which provide different kinds of services based on the HTTP protocol: web browsing, file downloads, audio and voice streaming through third-party plugins, etc. This paper proposes and evaluates two approaches to distinguishing various types of HTTP content: a distributed one running on volunteers' machines and a centralized one running in the core of the network. We also assess the accuracy of the global classifier for both HTTP and non-HTTP traffic. We achieved an accuracy of 94%, which is expected to be even higher in real-life usage. Finally, we provide graphical characteristics of different kinds of HTTP traffic.
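
As an analogous sketch of the approach (scikit-learn does not ship C5.0, so a plain decision tree stands in for it), the example below trains on simple per-flow statistics and predicts a content class; the features, classes, and tiny training set are made-up placeholders, not the paper's dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Features per flow: [avg payload size (B), avg inter-arrival time (ms), duration (s)]
X = [
    [1300, 12, 240],   # bulk download
    [1250, 15, 600],   # bulk download
    [400, 90, 35],     # interactive browsing
    [350, 120, 20],    # interactive browsing
    [900, 25, 1800],   # audio/video streaming
    [950, 22, 2400],   # audio/video streaming
]
y = ["download", "download", "browsing", "browsing", "streaming", "streaming"]

# Train a decision tree and classify an unseen flow.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(model.predict([[980, 20, 2000]]))  # likely 'streaming'
```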


Volunteer-Based System for Research on the Internet Traffic

June 2012 · 64 Reads · 5 Citations

Telfor Journal

T. Bujlow · K. Balachandran · S. Ligaard Nørgaard Hald · [...] · J. Myrup Pedersen

To overcome the drawbacks of existing methods for traffic classification (by ports, Deep Packet Inspection, statistical classification), a new system has been developed, in which data are collected and classified directly by clients installed on machines belonging to volunteers. Our approach combines the information obtained from the system sockets, the HTTP content types, and the data transmitted through network interfaces. It allows grouping packets into flows and associating them with particular applications or types of service. This paper presents the design of our system, its implementation, the testing phase, and the obtained results. The performed threat assessment highlights potential security issues and proposes solutions to mitigate the risks. Furthermore, it shows that the system is feasible in terms of uptime and resource usage, assesses its performance, and proposes future enhancements. We released the system under the GNU General Public License v3.0 and published it as a SourceForge project called Volunteer-Based System for Research on the Internet.
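
The association between sockets and application names that the clients rely on can be illustrated with psutil, as sketched below; this is a simplified stand-in for the released client, not its actual implementation, and enumerating other processes' sockets may require elevated privileges on some systems.

```python
import psutil

# Map each established socket's endpoints to the name of the owning process,
# so that observed flows can later be matched to applications.
def socket_owners():
    owners = {}
    for conn in psutil.net_connections(kind="inet"):
        if conn.pid is None or not conn.raddr:
            continue  # skip listening sockets and sockets without a known owner
        try:
            name = psutil.Process(conn.pid).name()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        key = (conn.laddr.ip, conn.laddr.port, conn.raddr.ip, conn.raddr.port)
        owners[key] = name
    return owners

for endpoint, app in socket_owners().items():
    print(endpoint, "->", app)
```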


Citations (11)


... While Web tracking is generally perceived as a benign technology that provides user-tailored information, it also has the potential to be exploited, raising significant privacy concerns, as evidenced by previous studies [3][4][5]. Users may voluntarily provide personal information on the web, such as through Web forms, or this information may be indirectly collected without their knowledge through methods such as IP header analysis, HTTP requests, search engine query analysis, and JavaScript [6]. ...

Reference:

Combating Web Tracking: Analyzing Web Tracking Technologies for User Privacy
A Survey on Web Tracking: Mechanisms, Implications, and Defenses
  • Citing Article
  • March 2017

Proceedings of the IEEE

... The general idea was first described in [5] and a preliminary limited prototype was implemented in [6]. The current system design was announced in [7], while more technical details on later refinements can be found in [8]. Other papers ( [9], [10] and [11]) demonstrate various applications of our system. ...

Volunteer-Based System for Research on the Internet Traffic
  • Citing Article
  • June 2012

Telfor Journal

... Research [3] shows that DPI (Deep Packet Inspection) technology is used for effective classification of encrypted traffic; unlike classic firewalls, it analyzes not only packet headers but also the payload, starting from the data link layer of the Open System Interconnection (OSI) network model. Deri et al. [3] compared different traffic recognition algorithms and found that the nDPI library recognizes different types of traffic with greater accuracy than the Protocol and Application Classification Engine (PACE), Libprotoident, and the Universitat Politecnica de Catalunya Machine Learning Algorithm (UPC MLA). ...

nDPI: Open-source high-speed deep packet inspection
  • Citing Article
  • September 2014

... If the login device is one the user has used before, the user passes this authentication; if it is a new device for the user, the login is forbidden. Device fingerprinting initially constructs a unique identity for each device using cookies and the hardware and software configuration information of the device [3]. However, this device fingerprint is extremely unstable and easily tampered with by criminals. ...

Web Tracking: Mechanisms, Implications, and Defenses

... Even though we had the full packets of our experiments, we worked only on packet headers because, in the online case, we would not have access to the full packets, as it is known that deep packet inspection that includes packet payloads is an issue for many researchers [25]. One cannot access this due to encryption or privacy, and it is even illegal in some cases. ...

Is Our Ground-Truth for Traffic Classification Reliable?
  • Citing Conference Paper
  • March 2014

Lecture Notes in Computer Science

... Currently, common network traffic classification methods can be divided into the following four categories: port-based traffic classification [10,11], payload-based traffic classification [12], host-behavior-based traffic classification [13], and machine-learning-based traffic classification. ...

Independent comparison of popular DPI tools for traffic classification
  • Citing Article
  • January 2015

Computer Networks

... [14] With the development of machine learning technology, researchers began to build labeled traffic datasets and applied the supervised learning paradigm to train machine learning models to achieve automatic traffic classification. T. Bujlow et al. applied the C5.0 algorithm to traffic classification; this method can generate and combine several decision trees for better prediction [15]. In 2009, the k-Nearest Neighbor (kNN) algorithm was used by S. Huang et al. to divide Internet traffic based on statistical features [16]. This method measures the Euclidean distance between feature vectors, so no training phase is required. ...

Classification of HTTP traffic based on C5.0 Machine Learning Algorithm
  • Citing Conference Paper
  • July 2012

Proceedings - International Symposium on Computers and Communications

... Although the context-awareness concept [4] has existed for decades, very few works [5] have been found that describe the user's digital context on the Internet, while most context models [6,7] describe users' physical activities and environments. Regarding digital context classification, machine-learning-based approaches [8,9] have surpassed the traditional ground-truth-based methods [10,11], as ground-truth-based methods can fail when facing data encryption or mechanisms such as dynamic ports or MAC spoofing [12]; machine-learning-based methods, on the other hand, are able to discover implicit relations among features in large volumes of data and prove to be robust. More specifically, since Internet traffic consists of sequences of packets, recurrent deep learning models [13,14] such as RNN, GRU, and LSTM are frequently used to treat Internet traffic as time series data for classification. ...

Volunteer-based system for classification of traffic in computer networks
  • Citing Conference Paper
  • November 2011

... The C5.0 algorithm is a decision-tree-based data mining and classification method and a refinement of the ID3 and C4.5 algorithms. It has also been widely applied, for example in the study of Bujlow & Pedersen (2012) to distinguish various types of traffic in computer networks with an average accuracy of 99.3-99.9% [7]. ...

A method for classification of network traffic based on C5.0 Machine Learning Algorithm
  • Citing Article
  • January 2012