Log filtering method based on user behaviors

Journal of Physics: Conference Series
PAPER • OPEN ACCESS
Log filtering method based on user behaviors
To cite this article: Nan Wu et al 2022 J. Phys.: Conf. Ser. 2253 012014
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
EECT-2022
Journal of Physics: Conference Series 2253 (2022) 012014
IOP Publishing
doi:10.1088/1742-6596/2253/1/012014
Log filtering method based on user behaviors
Nan Wu, Xueming Tang and Ying Pan*
School of Computer and Information Engineering, Nanning Normal University,
Nanning, Guangxi, People’s Republic of China
* Corresponding author’s e-mail: panying@nnnu.edu.cn
Abstract. In the big data environment, websites on the Internet generate an ever-growing volume
of user behaviors, and designing a universal log filtering method based on user behaviors is a
current research trend. However, existing log filtering techniques suffer from low filtering
accuracy and low efficiency. In this paper, we propose a log filtering method based on user
behaviors. First, user behaviors are divided into multiple sub-behaviors, each of which is
assigned a weight. Log information of user behaviors is obtained and stored through distributed
log collection tools, and the log information of sub-behaviors whose weight is below the weight
threshold is filtered out. Then, the log information of the retained sub-behaviors is processed in
parallel by utility functions, which establish the mapping between the user interest degree and
the sub-behavior indicators. Log information of sub-behaviors below the user interest degree
threshold is deleted, and the log information of the user's preferred sub-behaviors is retained,
forming an optimized data source for recommendation results that is stored in the data cluster.
The method thus performs secondary filtering of massive log information and responds promptly
to users' current requirements and interests, improving processing efficiency.
Keywords. log filtering; user behaviors; information query; information recommendation;
distributed environment
1. Introduction
With the rapid development of the Internet, users generate massive amounts of log information while
using the Internet. When faced with massive amounts of Internet information, it is difficult for users to
obtain the information they are interested in, resulting in information overload problems [1,2]. Therefore,
various recommendation methods have become research hotspots, enabling user groups to obtain real-
time and effective information that they are interested in (such as microblog recommendation, product
recommendation, movie recommendation, etc.).
Log filtering is an essential part of a recommendation system. In the big data environment, websites
on the Internet produce more and more types of user behaviors. Here, user behaviors
refer to a user's browsing behaviors while using the network, and log information is the information
recorded during those browsing behaviors. Designing a universal log filtering method based on user
behaviors is a current research trend [3,4].
Al-Duwairi et al. [5] proposed the LogDoS method, which can filter GET-based message log records
and remove data packets from malicious hosts. Wei et al. [6] used unsupervised multi-autoencoders to
analyze system log files, filter abnormal data from log records, and detect threat data. Vidgof et al. [7]
developed and evaluated an interactive log-delta analysis technique in which analysts interactively
define the filter ranges for log filtering, so that they can explore logs and manually separate typical
behaviors from atypical behaviors. Load data sets are generally unstable, so evaluating their
reliability is a necessary preprocessing step for data filtering. Therefore, Cao et al. [8] proposed
a novel statistical data filtering method that considers the reliability of the data set by analyzing a wide
range of predefined confidence levels.
However, the current log filtering technology has many shortcomings, such as missing data
(incomplete data, lack of ID, time, product ID, etc.). In addition, only the data containing noise and
missing values are filtered, and different recommendation systems use different filtering methods, which
cannot achieve universality [9,10].
To solve the above problems, this paper provides a log filtering method based on user behaviors,
which performs secondary filtering of massive log information and responds promptly to users'
current requirements and interests, improving processing efficiency. In addition, the
method is easy to expand and has some fault tolerance.
2. Implementation of log filtering based on user behaviors
2.1. Basic ideas
The basic idea of the filtering method in the paper is shown in figure 1.
[Figure 1: diagram. User behaviors are divided into sub-behaviors 1 to 4. The behavior function
filters out the log information of the sub-behavior whose weight is below the threshold. The log
information (1 to 3) of each retained sub-behavior is passed to its own utility function (1 to 3),
which filters out the log information whose user interest degree is below the threshold.]
Figure 1. The basic idea of log filtering based on user behaviors.
Users generate a huge volume of behaviors in each business system (e.g., client applications or
pages for online shopping, microblog browsing, or news recommendation). For each business system,
page developers divide user behaviors in advance into multiple sub-behaviors and assign
corresponding weights in the back-end. For example, in an online shopping system, user behaviors
are divided into sub-behaviors such as browsing, clicking, and purchasing, and in a microblog
browsing system, they are divided into sub-behaviors such as browsing, clicking, and searching.
The online shopping system is used below as a reference example. The page developer first
enumerates, in advance, a wide range of sub-behaviors that cover most consumers' shopping habits,
and assigns weights to the sub-behaviors based on the user's purchase probability. The system then
accesses the log tables of the database through an existing distributed log collection tool,
parses the log tasks, and extracts the user's log information. After acquiring the log
information of user behaviors, the system saves it to the data cluster, which can hold a
huge amount of log information of user behaviors and provides reliable information transmission for the
subsequent log filtering stage. Finally, the log information of the sub-behaviors whose weight is
below the weight threshold is filtered out, i.e., the log information of sub-behaviors with little
reference value is removed. In this way, the first filtering of the behavior log is achieved.
The log information of the retained sub-behaviors is then processed in parallel by utility
functions, i.e., each sub-behavior is processed separately. A targeted utility function is
established for each sub-behavior, and the part of its log information that has no reference value
is filtered out again. Each sub-behavior includes attribute information and indicators; the
indicators include multiple sub-indicators with parameters, and the numerical magnitudes of the
sub-indicators can be compared meaningfully. The utility function establishes a mapping between the
user interest degree and the indicators of at least one sub-behavior. The user interest degree is
calculated separately for each type of utility function, and an interest degree threshold is
preset for each. The log information of the sub-behaviors below the interest degree threshold is
filtered out, and the remaining log information, which is not below the threshold, represents the
user's preferred sub-behaviors. Finally, the log information of the user's preferred sub-behaviors
is retained to form an optimized data source for recommendation results, which is stored in the
data cluster as a widely applicable data source for each recommendation end. In this way, the
second filtering of the behavior log is implemented.
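The parallel structure of the second filtering can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record fields, the two toy utility functions, and the thresholds are hypothetical, and a local thread pool stands in for the distributed workers of a real data cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_sub_behavior(records, utility, threshold):
    """Keep the records whose user interest degree is not below the threshold."""
    return [r for r in records if utility(r) >= threshold]


def second_filter(grouped_logs, utilities, thresholds):
    """Filter every sub-behavior as an independent task with its own utility function."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(filter_sub_behavior, records, utilities[name], thresholds[name])
            for name, records in grouped_logs.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical log records that survived the first filtering, grouped by sub-behavior.
grouped_logs = {
    "browsing": [{"page": "p1", "dwell_seconds": 12}, {"page": "p2", "dwell_seconds": 1}],
    "purchasing": [{"product_id": "a", "ordered": True}, {"product_id": "b", "ordered": False}],
}
# Hypothetical utility functions and thresholds, one per sub-behavior.
utilities = {
    "browsing": lambda r: r["dwell_seconds"],
    "purchasing": lambda r: 1 if r["ordered"] else 0,
}
thresholds = {"browsing": 5, "purchasing": 1}

preferred = second_filter(grouped_logs, utilities, thresholds)
```

Because each sub-behavior carries its own utility function and threshold, the tasks share no state and can be distributed freely.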
2.2. The creation of behavior functions
In the first filtering, the behavior function of user behaviors is created and multiple
sub-behaviors are defined, such as browsing behaviors (a single click on a viewed page records
multiple items of browsing data, such as user information, time, address, product ID, current
mouse dwell time, and current page scroll count), clicking behaviors (clicking on searched
products or on recommended products in a list, recording user information, time, address, clicked
product ID, etc.), purchasing behaviors (adding products to the cart and paying for them or not,
recording user information, product ID, payment time, order time, address, etc.), and comparing
behaviors (adding multiple products to the comparison column to compare each parameter, recording
user information, product ID, comparison time, address, etc.). The behaviors overlap to some
extent; for example, browsing involves clicking. In such cases, the log information is extracted
and considered separately for the two sub-behaviors. The weights of the sub-behaviors are
adjusted according to the needs of the user. The behavior function is
$$F(\alpha) = f(x_1, x_2, \ldots, x_m) = \sum_{i=1}^{m} w_i x_i \tag{1}$$
where $w_i$ is the weight of the $i$-th sub-behavior of user $\alpha$, $0 < w_i < 1$, and
$x_1, x_2, \ldots, x_m$ are the $m$ kinds of sub-behaviors of user $\alpha$. In the online
shopping system, the weights of browsing, clicking, and purchasing behaviors are higher than the
threshold, and the weight of comparing behaviors is lower than the threshold, so all the log
information of comparing behaviors is filtered out, and the log information of browsing, clicking,
and purchasing behaviors is retained.
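The first filtering can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `LogRecord` fields, the numeric weights, and the threshold value 0.5 are all hypothetical. The paper states only that browsing, clicking, and purchasing lie above the threshold and that comparing lies below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    user_id: str
    sub_behavior: str
    product_id: str


# Hypothetical weights w_i (0 < w_i < 1) assigned by the page developer.
SUB_BEHAVIOR_WEIGHTS = {
    "browsing": 0.6,
    "clicking": 0.7,
    "purchasing": 0.9,
    "comparing": 0.2,
}
WEIGHT_THRESHOLD = 0.5  # hypothetical value


def first_filter(records, weights, threshold):
    """Drop all log records of sub-behaviors whose weight is below the threshold."""
    return [r for r in records if weights.get(r.sub_behavior, 0.0) >= threshold]


logs = [
    LogRecord("u1", "browsing", "p1"),
    LogRecord("u1", "comparing", "p1"),
    LogRecord("u1", "clicking", "p2"),
    LogRecord("u1", "purchasing", "p2"),
]
retained = first_filter(logs, SUB_BEHAVIOR_WEIGHTS, WEIGHT_THRESHOLD)
```

In this sketch a sub-behavior with no assigned weight is treated as weight 0 and is filtered out, which is a design choice made here and not something the paper specifies.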
2.3. Calculation of utility functions for different sub-behaviors
Sub-behaviors include user information (user ID, account registration time), the user's current
page access time, the current page address, and sub-behavior indicators. The indicators differ
from one sub-behavior to another.
2.3.1. Indicators of sub-behaviors with multiple independent parameters. When the indicators of a
sub-behavior are multiple independent parameters, the parameters have no relative or
complementary relationship with each other, and all of them are worth considering. For example,
the indicators of the browsing sub-behavior are mouse dwell time, current page scroll count, etc.,
where mouse dwell time and current page scroll count are independent parameters. The utility function is
$$G(\beta) = f(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} w_i y_i \tag{2}$$
The weight of each parameter of the sub-behavior is adjusted and assigned according to the user's
needs, and the user interest degree of the current page for the sub-behavior is obtained by
calculation. Here, $w_i$ is the weight of the $i$-th parameter of sub-behavior $\beta$,
$0 < w_i < 1$, and $y_1, y_2, \ldots, y_n$ are the $n$ parameters of sub-behavior $\beta$. For
example, $w_{\text{mouse dwell time}}$ is preset to 0.8 and $w_{\text{current page scroll count}}$
is preset to 0.2, i.e., mouse dwell time is regarded as the stronger signal of user interest. For
a given page, $y_{\text{mouse dwell time}}$ is 5 seconds and $y_{\text{current page scroll count}}$
is 1, so $G(\beta) = 0.8 \times 5 + 0.2 \times 1 = 4.2$. The page developer uses 4.2 as the
interest threshold, i.e., when $G(\beta) \ge 4.2$, the corresponding log information of the page is
retained, and the log information that does not satisfy this condition is filtered out.
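The worked example above can be checked with a short Python sketch. The weights 0.8 and 0.2, the parameter values, and the threshold 4.2 come from the text; the function and key names are hypothetical, and the small tolerance in the comparison is an assumption added here to guard against floating-point rounding at the threshold.

```python
# Weights from the text; key names are hypothetical.
BROWSING_WEIGHTS = {"mouse_dwell_seconds": 0.8, "page_scroll_count": 0.2}
INTEREST_THRESHOLD = 4.2
TOLERANCE = 1e-9  # assumption: absorbs floating-point rounding at the threshold


def weighted_utility(values, weights):
    """G(beta) = sum of w_i * y_i over the independent parameters of the sub-behavior."""
    return sum(weights[name] * values[name] for name in weights)


def is_retained(values):
    """Retain the page's log information when G(beta) is not below the threshold."""
    return weighted_utility(values, BROWSING_WEIGHTS) >= INTEREST_THRESHOLD - TOLERANCE


page = {"mouse_dwell_seconds": 5, "page_scroll_count": 1}
interest = weighted_utility(page, BROWSING_WEIGHTS)  # 0.8 * 5 + 0.2 * 1
keep_page = is_retained(page)
```

A page with a 2-second dwell time and 3 scrolls would score 0.8 × 2 + 0.2 × 3 = 2.2 and would be filtered out under the same threshold.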
2.3.2. Indicators of sub-behaviors with two options of executed and unexecuted. When the
indicators of a sub-behavior are the two options executed and unexecuted, the two options are
either relative or complementary. For example, when the indicators of the purchasing sub-behavior
are the two options purchased and not purchased, the two options are relative. Then the utility
function is
$$G(\beta) = \begin{cases} 1, & \text{executed} \\ 0, & \text{unexecuted} \end{cases} \tag{3}$$
The log information of the sub-behavior whose option takes the value 1 is retained (both the user
interest degree and the interest degree threshold are 1), i.e., the log information of a
sub-behavior that generates order information is retained, as is the log information of a click on
a product that the user searched for.
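A minimal Python sketch of this binary case follows. It applies the retention rule stated in the text: an executed action, such as a purchase that generates an order, scores 1 and is retained. The record layout and names are hypothetical.

```python
INTEREST_THRESHOLD = 1  # from the text: the interest degree threshold is 1


def binary_utility(executed: bool) -> int:
    """Interest degree is 1 when the action was executed, otherwise 0."""
    return 1 if executed else 0


# Hypothetical purchasing log records.
purchase_logs = [
    {"product_id": "p1", "purchased": True},
    {"product_id": "p2", "purchased": False},
    {"product_id": "p3", "purchased": True},
]
retained = [r for r in purchase_logs if binary_utility(r["purchased"]) >= INTEREST_THRESHOLD]
```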
2.3.3. Indicators of the searching sub-behaviors. When the sub-behavior is a searching behavior,
the user inputs keywords to query, and user information, product ID, retrieved keywords, address,
etc. are recorded. The keywords searched by the user are read (for example, the user enters the
keyword "movie ticket" in the search box), and a semantic model is used to obtain the associated
words of the keywords. The semantic model is an existing technology and contains a semantic
extension query interface, a semantic support system, an inference system, and an ontology system.
The semantic extension query interface analyzes user requests, determines the user's semantics,
and binds them to relevant concepts. The semantic support system provides support for semantic
analysis. The inference system serves semantic analysis and knowledge processing, and the ontology
system is used for knowledge representation and knowledge processing. The associated words are
inferred from the keywords input by the user through the semantic model to obtain the information
of the associated objects. For example, if a user's historical orders include "movie ticket" and
"diaper", the associated words can be "movie channel", "baby diapers", etc. The indicator of the
sub-behavior is the similarity between keywords and associated words. When these associated words
appear in the same historical order, the user interest degree of the associated word is defined
as 1. When the associated word does not appear in the historical order, the user interest degree,
denoted $x$, is calculated by a similarity method. Then the utility function is
$$G(\beta) = \begin{cases} 0, & \text{irrelevant} \\ x, & \text{associated} \\ 1, & \text{identical} \end{cases} \tag{4}$$
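The three cases of this utility function can be illustrated with a short Python sketch. The paper does not name a specific similarity method, so the token-level Jaccard similarity used here is only a stand-in, and all function names are hypothetical. The sketch also reads the "identical" case as "the associated word already appears in the user's order history", which is one interpretation of the rule above.

```python
def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lowercase word tokens (an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def search_utility(keyword, associated_word, order_history, similarity=token_jaccard):
    """Return 1 for a word found in the order history, otherwise the similarity x.

    A similarity of 0 corresponds to the irrelevant case of the utility function.
    """
    if associated_word in order_history:
        return 1.0
    return similarity(keyword, associated_word)


history = {"movie ticket", "diaper"}
identical = search_utility("movie ticket", "movie ticket", history)
associated = search_utility("movie ticket", "movie channel", history)  # shares "movie"
irrelevant = search_utility("movie ticket", "baby diapers", history)   # no shared token
```

In a real system the associated words and their similarity would come from the semantic model described above, which can relate words with no tokens in common.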
3. Conclusions
The log filtering method in this paper adopts a distributed mode to collect log information of
user behaviors from various business systems in the network. The optimized result is
obtained through secondary filtering with self-defined functions. This method can quickly and
efficiently process small batches of data, ensuring the efficiency and practicability of log
filtering. At the same time, the method is easy to expand, and the fault-tolerant recovery
mechanism is easy to implement.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grant No.
61862010, Guangxi Collaborative Innovation Center of Multi-source Information Integration and
Intelligent Processing, and Innovation Project of Guangxi Graduate Education No. YCSW2021283.
References
[1] Tang Y, Spektor A, Khatchadourian R and Bagherzadeh M 2021 A Tool for Rejuvenating Feature
Logging Levels via Git Histories and Degree of Interest Preprint arXiv/2112.02758
[2] Wu W, Zhang R and Liu L 2019 A personalized network-based recommendation approach via
distinguishing user's preference International Journal of Modern Physics B 33 1950029
[3] Wan H and Ismail N F 2021 Recommender System for Multiple Databases Based on Web Log
Mining Annals of Emerging Technologies in Computing 5 187-93
[4] Tanaka T, Niibori H, Shiyingxue L I, Nomura S and Tsuda K 2020 Bot Detection Model using
User Agent and User Behavior for Web Log Analysis Procedia Computer Science 176 1621-
25
[5] Al-Duwairi B, Oozkasap O, Uysal A, Kocaogullar C and Yildirim K 2020 LogDos: A Novel
Logging-based DDoS Prevention Mechanism in Path Identifier-Based Information Centric
Networks Computers & Security 99 102071
[6] Wei Y, Chow K P and Yiu S M 2020 Insider Threat Detection Using Multi-autoencoder Filtering
and Unsupervised Learning IFIP Advances in Information and Communication Technology
vol 589 ed Peterson G and Shenoi S (Springer, Cham: New Delhi, India) pp 273-90
[7] Vidgof M, Djurica D, Bala S and Mendling J 2021 Interactive log-delta analysis using multi-
range filtering Software and Systems Modeling 4 1-22
[8] Cao M T, Pham T T, Kuo T C, Bui D M and Nguyen T H 2020 Short-Term Load Forecasting
Enhanced With Statistical Data-Filtering Method 2020. IEEE. Int. Conf. on Power Electronics,
Smart Grid and Renewable Energy (PESGRE2020) (IEEE: Cochin, India) pp 1-8
[9] Jiang M, Zhang Z, Jiang J, Wang Q and Pei Z 2019 A collaborative filtering recommendation
algorithm based on information theory and bi-clustering Neural Computing and Applications
31 8279-87
[10] Feng L, Cai Y, Wei E and Li J 2022 Graph Neural Networks with Global Noise Filtering for
Session-based Recommendation Neurocomputing 472 113-23