Article

Linking user online behavior across domains with internet traffic

Abstract

We are facing an era of Online With Offline (OWO) in the smart city: almost everyone uses various online services to connect with friends, watch videos, listen to music, download resources, and so on. Our online behaviors are separated across different domains, which may cause serious problems for cross-domain recommendation, advertising, and criminal tracking in the online and offline worlds, since it is very challenging to link user online behaviors belonging to the same natural person. Existing methods usually tackle the user online behavior linkage problem by estimating the profile content similarity between two different online services. However, the profile contents in heterogeneous online services are unreliable or misaligned, and the proposed methods are always limited to several services in a specific domain. To link an individual’s online behavior across domains, in this paper we propose user Online Behavior Linkage across Domains (OBLD), a novel hybrid model that links user online behavior across domains using Internet traffic. It derives several significant attributes from users’ online behaviors, such as user digital identity, various fingerprints of terminals and browsers, and the spatio-temporal behavior of users, and leverages a supervised classification method to discover the relationship between users’ online behaviors. The proposed model also has an unsupervised setting for datasets with no or few labeled data, provided that a certain percentage of user digital identities can be extracted from the original dataset. Using real-world network traffic collected from two large provinces in China, we evaluate the OBLD model; the linkage precision reaches 89% and 97.9% on the two datasets, respectively. Notably, the inputs of OBLD, i.e., network traffic flows, cover all online behavior of users who connect to the Internet through the monitored networks, which makes it possible to link users’ online behaviors across the whole online world.

... It is feasible to analyze and study students' online behavior by using their online data [5]. The log records the user's network usage, such as login time, logout time, usage time, and usage flow. Although this kind of data is easy to obtain, its structure is complex, its value density is low, and it grows rapidly; it is often ignored, and there is little research on it [6]. ...
... Fan [12] used the user behavior log as the basic data set to study users' personal preferences and analyze potential purchase demand, so as to realize digital research on user needs. Qiao [13] proposed a new hybrid model called OBLD (user Online Behavior Linkage across Domains), which links user online behavior across domains using network traffic. This model derives several important attributes from the user's online behavior, such as the user's digital identity and various fingerprints of terminals and browsers. ...
... A cross-modal learning idea was proposed, and a user profile model based on multimodal fusion was designed [17]. The stacking integration method was used to integrate multiple multimodal learning joint representation networks to learn the corresponding model combination; the attention mechanism was introduced to enable the model to learn the different contributions of the modal representations to the prediction results. ...
Article
Network behavior analysis is an effective method to outline user requirements, and can extract user characteristics by constructing machine learning models. To protect the privacy of data, the shared information in the model is limited to non-directional network behavior information, such as online duration, traffic, etc., which also hides users’ unconscious needs and habits. However, the value density of this type of information is low, and it is still unclear how much student performance is affected by online behavior; in addition, there is a lack of methods for analyzing the correlation between non-directional online behavior and academic performance. In this article, we propose a model for analyzing the correlation between non-directional surfing behavior and academic performance based on user portraits. Different from existing research, we mainly focus on the public student behavior information in the campus network system and conduct in-depth research on it. The experimental results show that online time and online traffic are each negatively correlated with academic performance, and that students’ academic performance can be predicted through the study of non-directional online behavior.
... Collected web logs usually record hundreds of millions of flow records identified by five-tuples, without a user ID. To distinguish the flow records generated by the same real person, in our previous work [50], [51], we found several features that are widely available in Internet traffic, including the IP address, the "online fingerprint," and the spatiotemporal behavior of the user. These features are highly discriminative between different users because they usually do not change within a specific time period. ...
Article
Linking online identities of users among countless heterogeneous network services on the Internet can provide an explicit digital representation of users, which can benefit both research and industry. In recent years, user identity linkage (UIL) through the Internet has become an emerging task with great potential and many challenges. Existing works mainly focus on online social networks that consider inconsistent profiles, content, and networks as features or use sparse location-based data sets to link the online behaviors of a real person. To extend the UIL problem to a general scenario, we try to link the web-browsing behaviors of users, which can help to distinguish specific users from others, such as children or malicious users. More specifically, we propose a Siamese neural network (NN) architecture-based UIL (SAUIL) model that learns and compares the highest-level feature representation of input web-browsing behaviors with deep NNs. Although the number of matching and nonmatching pairs for the UIL problem is highly imbalanced, previous studies have not considered imbalanced UIL data sets. Therefore, we further address the imbalanced learning issue by proposing a cost-sensitive SAUIL (C-SAUIL) model, which assigns higher costs to misclassifying the minority class. In the experiments, the proposed model is robust and exhibits good performance on very large, real-world data sets collected from different regions with distinct characteristics.
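To illustrate the cost-sensitive idea mentioned in the abstract above, here is a minimal sketch of a class-weighted binary cross-entropy in which errors on the minority (matching) class cost more than errors on the majority class. The abstract does not give the C-SAUIL loss, so this is a generic formulation rather than the authors' method; the function name `weighted_bce` and the default costs `cost_pos` and `cost_neg` are assumptions made for illustration.

```python
import math


def weighted_bce(y_true, p_pred, cost_pos=10.0, cost_neg=1.0, eps=1e-12):
    """Mean class-weighted binary cross-entropy.

    y_true: iterable of 0/1 labels (1 = matching pair, assumed minority class).
    p_pred: iterable of predicted probabilities for label 1.
    cost_pos / cost_neg: hypothetical misclassification costs per class.
    """
    y_true = list(y_true)
    p_pred = list(p_pred)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clamp probabilities so log() never sees 0.
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -cost_pos * math.log(p)
        else:
            total += -cost_neg * math.log(1.0 - p)
    return total / len(y_true)


# One matching pair and three non-matching pairs, all predicted at 0.5.
loss = weighted_bce([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

With equal costs this reduces to the ordinary binary cross-entropy; raising `cost_pos` makes a missed match contribute more to the loss than a false match.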
Article
This paper reviews the crime linkage literature to identify how data were pre-processed for analysis, methods used to predict linkage status/series membership, and methods used to assess the accuracy of linkage predictions. Thirteen databases were searched, with 77 papers meeting the inclusion/exclusion criteria. Methods used to pre-process data were human judgement, similarity metrics (including machine learning approaches), spatial and temporal measures, and Mokken Scaling. Jaccard's coefficient and other measures of similarity (e.g., temporal proximity, inter-crime distance, similarity vectors) are the most common ways of pre-processing data. Methods for predicting linkage status were varied and included human (expert) judgement, logistic regression, multi-dimensional scaling, discriminant function analysis, principal component analysis and multiple correspondence analysis, Bayesian methods, fuzzy logic, and iterative classification trees. A common method used to assess linkage-prediction accuracy was to calculate the hit rate, although position on a ranked list was also used, and receiver operating characteristic (ROC) analysis has emerged as a popular method of assessing accuracy. The article has been published open access and is free to download from https://www.sciencedirect.com/science/article/pii/S1359178924001046
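As a small illustration of the Jaccard's coefficient named in the abstract above, the sketch below computes the share of behavioural features two offences have in common, |A ∩ B| / |A ∪ B|. The feature names and the convention of returning 0.0 for two empty sets are assumptions made for this example, not details taken from the review.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| for two collections of features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty feature sets have similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical behavioural features coded for two offences.
crime_1 = {"night", "forced_entry", "residential"}
crime_2 = {"night", "forced_entry", "commercial"}
score = jaccard(crime_1, crime_2)  # 2 shared features out of 4 distinct ones
```

A higher score means the two offences share more of their coded behaviours, which is the kind of pairwise similarity that a linkage model can then take as input.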
Article
The nature of people's web navigation has significantly changed in recent years. The advent of smartphones and other handheld devices has given rise to web users consulting websites with more than one device, or using a shared device. As a result, large volumes of seemingly disjoint data are available, which when analysed together can support decision-making. The task of identifying web sessions by linking such data back to a specific person, however, is hard. The idea of session stitching aims to overcome this by using machine learning inference to identify similar or identical users. Many such efforts use various demographic data or device-based features to train matching algorithms. However, these variables are often not available for every dataset or are recorded differently, making a streamlined setup difficult. Besides, they often result in vast feature spaces which are hard to use for actionable interpretation. In this paper, we present an alternative approach based on the fingerprinting of web pages visited by users in a single session. By learning behavioural patterns from these sequences of page visits, we obtain features that can be used for matching without requiring sensitive user-agent data such as IP, geo location, or device details as is common with other approaches. Using these sequential fingerprints does not rely on pre-defined features, but only requires the recording of web page visits, making our approach actionable. The approach is empirically tested on real-life web logs and compared with matching using regular user-agent features and state-of-the-art embedding techniques. Results in an ecommerce context show that sequential features can still obtain strong performance with fewer features, facilitating decision-making on session stitching and informing subsequent related activities such as marketing or customer analysis.
Article
Tensor factorization has been applied in recommender systems to discover latent factors between multidimensional data such as time, place, and social context. However, tensor-based recommender systems still encounter several problems such as sparsity, cold-start, and so on. In this paper, we introduce a new model, the social tensor, to propose a tensor-based recommendation with a social relationship to deal with the existing problems. In addition, an adaptive method is presented to adjust the range of the social network for an active user. To evaluate our method, we conducted several experiments in the movie domain. The results indicate the ability of our method to improve the recommendation performance, even in the case of a new user. In particular, the proposed method conducts the regeneration and factorization of the tensor in real time. Furthermore, our approach recommends not only a single item, but also multiple factors for the item, such as social, temporal, and spatial contexts.