Figure 4. Sample of the extended common log format (collected from the web server of the BHU website)


Source publication
Article
Full-text available
Due to the huge, unstructured and scattered amount of data available on the web, it is very tough for users to get relevant information in less time. To achieve this, improvement in the design of web sites, personalization of contents, and prefetching and caching activities are done according to analysis of users' behavior. Users' activities can be captured into a spe...

Context in source publication

Context 1
... the other hand, the extended log format is a customizable log file format which can add some additional attributes like referrer_url, http_user_agent and cookies [3,8]. Table 1 provides the description of the attributes of the extended common log format, and Figure 4 shows a snapshot of the extended common log format [12,28]. This snapshot shows a single entry of the log file. ...
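To make the layout of such an entry concrete, here is a minimal, hypothetical Python sketch (not code from the source article) that parses one line in the Apache combined/extended format; the regex and the sample line are assumptions that only mimic the kind of entry shown in Figure 4.

```python
import re

# Assumed Apache "combined" layout: common log format plus referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Invented sample line in the combined format; not the actual entry of Figure 4.
sample = ('203.0.113.7 - - [10/Oct/2014:13:55:36 +0530] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://www.bhu.ac.in/" "Mozilla/5.0 (Windows NT 6.1)"')

match = LOG_PATTERN.match(sample)
if match:
    entry = match.groupdict()
    print(entry['ip'], entry['request'], entry['referrer'], entry['user_agent'])
```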

Citations

... In 2014, a study [11] presented techniques applied in the preprocessing step of Web usage mining, with their advantages and disadvantages. In 2017, two studies [12], [1] provided overviews of Web Usage Mining (WUM) and explained the process involved in WUM, its applications, and tools. ...
Conference Paper
Full-text available
Browsing the Internet is part of the world population's daily routine. The number of web pages is increasing and so is the amount of published content (news, tutorials, images, videos) provided by them. Search engines use web robots to index web contents and to offer better results to their users. However, web robots have also been used for exploiting vulnerabilities in web pages. Thus, monitoring and detecting web robots' accesses is important in order to keep the web server as safe as possible. Data Mining methods have been applied to web server logs (used as the data source) in order to detect web robots. Thus, the main objective of this work was to observe evidence of the definition or use of web robot detection by analyzing web server-side logs using Data Mining methods. To that end, we conducted a systematic literature mapping, analyzing papers published between 2013 and 2020. In the systematic mapping, we analyzed 34 studies, which allowed us to better understand the area of web robot detection, mapping what is being done, the data used to perform web robot detection, and the tools and algorithms used in the literature. From those studies, we extracted 33 machine learning algorithms, 64 features, and 13 tools. This study is helpful for researchers to find machine learning algorithms, features, and tools to detect web robots by analyzing web server logs.
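As a rough illustration of the kind of session-level features such studies extract from server logs, here is a small Python sketch; the feature names, thresholds, and the naive rule at the end are assumptions for illustration, not the algorithms or features catalogued by the mapping.

```python
def robot_features(session):
    """Compute a few session-level features often used for web-robot detection.
    `session` is a non-empty list of dicts with keys: method, url, referrer, user_agent.
    Feature names and thresholds are illustrative assumptions."""
    n = len(session)
    return {
        'requested_robots_txt': any(r['url'].endswith('/robots.txt') for r in session),
        'head_ratio': sum(r['method'] == 'HEAD' for r in session) / n,
        'empty_referrer_ratio': sum(r['referrer'] in ('', '-') for r in session) / n,
        'image_ratio': sum(r['url'].lower().endswith(('.png', '.jpg', '.gif'))
                           for r in session) / n,
        'declares_bot': any('bot' in r['user_agent'].lower()
                            or 'crawler' in r['user_agent'].lower()
                            for r in session),
    }

def looks_like_robot(features):
    # Naive rule-based decision; the surveyed studies instead train ML classifiers
    # (decision trees, SVMs, etc.) on features like these.
    return (features['requested_robots_txt']
            or features['declares_bot']
            or features['empty_referrer_ratio'] > 0.9)
```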
... Neeraj Kandpal et al. gave a detailed study of different methods used in web usage mining [19]. Mitali Srivastava et al. presented preprocessing techniques for web usage mining [20]. ...
Article
Preprocessing is the most important and time-consuming step in web log mining. Results of web log mining mostly depend on extensive and correct preprocessing. Due to the drastic change in data consumption by users, there is a timely need to improve classical preprocessing algorithms. Although many algorithms exist for preprocessing, there is a need for an algorithm which can dynamically change the results of preprocessing according to need. There is also a pressing need to add a few more parameters to earlier preprocessing algorithms to reduce the error rate. In this paper, we have introduced preprocessing algorithms for web usage data by incorporating some new parameters. These parameters make the results more accurate for effective generation of web usage patterns.
... To improve the later phases of web usage mining, such as pattern discovery and pattern analysis, Mitali Srivastava [5] notes that several data preprocessing techniques such as Data Cleaning, User Identification, Session Identification, Path Completion etc. have been used. In this paper all these techniques are discussed in detail. ...
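Path completion is the least self-explanatory of these steps, so the following is a simplified Python sketch of the usual referrer-based idea: when a request's referrer is an earlier page rather than the previous one, the backtracked (cached) pages are re-inserted. The data layout and logic are illustrative assumptions, not the algorithm of [5].

```python
def complete_path(session):
    """Naive path completion.  `session` is a time-ordered list of (url, referrer)
    tuples for one session.  When a request's referrer is not the previously
    visited page but occurs earlier in the path, assume the user navigated back
    through cached pages (back button) and re-insert those pages."""
    path = []
    for url, referrer in session:
        if path and referrer and referrer in path and referrer != path[-1]:
            idx = len(path) - 1 - path[::-1].index(referrer)  # last occurrence of referrer
            # Re-append the pages traversed backwards, down to and including the referrer.
            path.extend(reversed(path[idx:-1]))
        path.append(url)
    return path

# Example: logged p1, p2, p3, then p4 reached from p1 via the back button.
print(complete_path([('p1', ''), ('p2', 'p1'), ('p3', 'p2'), ('p4', 'p1')]))
# -> ['p1', 'p2', 'p3', 'p2', 'p1', 'p4']
```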
... The most used analysis techniques are statistics; association and sequential rules; classification and clustering; and dependency modeling [3], [8], [9]. These techniques are used separately or combined to discover useful knowledge about users as well as the system. ...
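As a toy illustration of the association-rule technique mentioned above, the sketch below computes support and confidence for page-pair rules over a few invented sessions; the page names and the 0.7 confidence threshold are arbitrary assumptions.

```python
from itertools import combinations

sessions = [  # hypothetical page sets, one per user session
    {'/home', '/courses', '/admissions'},
    {'/home', '/courses'},
    {'/home', '/admissions', '/contact'},
    {'/home', '/courses', '/admissions'},
]

def support(itemset):
    """Fraction of sessions containing every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

# One-directional rules between page pairs, kept if confidence >= 0.7 (arbitrary).
pages = set().union(*sessions)
for a, b in combinations(sorted(pages), 2):
    for x, y in ((a, b), (b, a)):
        if support({x}) > 0:
            conf = support({x, y}) / support({x})
            if conf >= 0.7:
                print(f"{x} -> {y}  support={support({x, y}):.2f}  confidence={conf:.2f}")
```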
... Within the identified review and contribution papers [1]-[3], [6], [7], [9], [16]-[30], the terms user, session and transaction identification, as well as reconstruction and implementation methods, are used interchangeably in the contributions announced in the titles. However, the provided contributions are limited to one of them. ...
... The most used threshold in this regard is a 10-minute time gap between two successive navigated webpages within a single clickstream. The objective function below depicts the reconstruction process of the time-oriented session [6], [9], [16], [33]-[35]. ...
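A minimal Python sketch of such time-oriented session reconstruction is given below: one user's click stream is split whenever the gap between successive requests exceeds a threshold. The 10-minute default mirrors the threshold cited in the snippet, and the data layout is an assumption.

```python
from datetime import timedelta

def time_oriented_sessions(requests, gap=timedelta(minutes=10)):
    """Split one user's requests into sessions on time gaps.
    `requests` is a list of (timestamp, url) tuples sorted by timestamp;
    a new session starts whenever two successive requests are more than
    `gap` apart (10 minutes here, following the commonly cited threshold)."""
    sessions, current = [], []
    for ts, url in requests:
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions
```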
Conference Paper
Full-text available
This paper addresses the issue of Weblog Data structuring within the scope of Web Usage Mining. Web Usage Mining is interested in end-user behavior. Thus, Weblog data need to be cleaned of agent requests and structured per single users and sessions for transaction mining. Since webservers record requests interlaced in sequential order regardless of their source or type, Weblog Data structuring is not trivial when end-users are reticent towards proactive structuration, i.e., authentication or cookies. However, reactive structuration alternatives suffer from quality concerns in terms of session reconstruction. They are agent-centric heuristics that group requests per single pairs of user-agent and IP addresses. This contribution introduces a stream-centric heuristic structuration that drives the session construction to provide relevant single streams before mapping them to single user-agent and IP address pairs. The experimental results of the proposed method demonstrate its efficiency in terms of session construction quality.
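For reference, the agent-centric baseline criticized here can be sketched in a few lines of Python: requests are simply grouped per (IP address, user agent) pair. This is only the baseline heuristic; the stream-centric method proposed in the paper is not reproduced here, and the dictionary keys are assumptions.

```python
from collections import defaultdict

def group_by_agent(log_entries):
    """Agent-centric baseline: treat each (IP address, user agent) pair as one
    candidate user and collect that user's requests in time order.
    `log_entries` is an iterable of dicts with keys 'ip', 'user_agent', 'time'."""
    users = defaultdict(list)
    for entry in sorted(log_entries, key=lambda e: e['time']):
        users[(entry['ip'], entry['user_agent'])].append(entry)
    return users
```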
... On the basis of web usage mining reviewing papers from WEBKDD'99 up to 2017 [2], [4], [5], [9], [13], [14], [18]- [23], the cleaning task can be mapped into three layers. The first is meant to identify clicks from hits. ...
... In a space of k points, the MP is the point m among the existing points with the closest distance d from the center of all the points (Eq. 5). ...
Conference Paper
Full-text available
This paper addresses the issue of Weblog Data cleaning within the scope of Web Usage Mining. Weblog data are information on end-user clicks and underlying user-agent hits recorded by webservers. Since Web Usage Mining is interested in end-user clicks, user-agent hits are referred to as noise to be cleaned before mining. The most referenced and implemented cleaning methods are the conventional and advanced cleaning. They are content-centric filtering heuristics based on the requested resource attribute of weblog databases. These cleaning methods are limited in terms of relevancy, workability and cost constraints, within the context of dynamic and responsive web. In order to deal with these constraints, this contribution introduces a clustering-based cleaning method focused on the genetic features of the logging structure. The introduced cleaning method mines clicks from hits on the basis of their underlying genetics features and statistical properties. The genetics clustering-based cleaning experimentation demonstrates significant advantages compared to the content-centric methods.
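To clarify what the content-centric baseline looks like in practice, here is a hedged Python sketch of conventional cleaning, which drops embedded resources by file extension, failed requests, and non-GET methods; the extension list and status rule are illustrative assumptions, not the clustering-based method this paper proposes.

```python
# Extensions usually treated as embedded resources rather than user clicks (assumption).
NOISE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.css', '.js', '.ico')

def conventional_clean(entries):
    """Content-centric cleaning of weblog entries.
    `entries` is an iterable of dicts with keys 'url', 'method', 'status'."""
    cleaned = []
    for e in entries:
        url = e['url'].split('?', 1)[0].lower()
        if url.endswith(NOISE_EXTENSIONS):
            continue                                  # embedded resource, not a click
        if e['method'] != 'GET' or not (200 <= int(e['status']) < 400):
            continue                                  # failed request or non-page method
        cleaned.append(e)
    return cleaned
```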
... Many researchers have contributed in the area of filtering systems, and their work can be divided into five categories, including (4) filtering through Internet service providers (ISPs) and (5) search-engine or web-server-based filtering. Content filtering has been used to block access to unwanted information and is implemented as an additional layer to mitigate these issues in each of the categories. Despite advances in filtering techniques such as whitelists/blacklists [2], local databases [3] and session-based heuristics [4], there are still some deficiencies in these approaches due to their technical limitations, such as evolving and dynamic content. ...
Article
Full-text available
Web content filtering is one among many techniques to limit exposure to selective content on the Internet. It has become commonplace over time, yet filtering of multilingual web content is still a difficult task, especially when considering the big data landscape. The enormity of the data increases the challenge of developing an effective content filtering system that can work in real time. There are several systems which can filter URLs based on artificial intelligence techniques to identify sites with objectionable content. Most of these systems classify URLs only in the English language. These systems either fail to respond when multilingual URLs are processed, or over-blocking is experienced. This paper introduces a filtering system that can classify multilingual URLs based on predefined criteria for the URL, title, and metadata of a web page. Ontological approaches along with local multilingual dictionaries are used as the knowledge base to facilitate the challenging task of blocking URLs that do not meet the filtering criteria. The proposed work shows high accuracy in classifying multilingual URLs into two categories, white and black. Evaluation results conducted on a large dataset show that the proposed system achieves promising accuracy, on a par with that achieved in the state-of-the-art literature on semantic-based URL filtering.
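As a very rough sketch of the keyword side of such filtering (ignoring the ontological and semantic components described above), the Python fragment below matches URL, title, and metadata text against per-language blocklist dictionaries; the dictionaries, terms, and binary white/black decision are hypothetical.

```python
# Hypothetical blocklist dictionaries keyed by language; the paper instead relies on
# ontologies plus local multilingual dictionaries, which are not reproduced here.
BLOCK_TERMS = {
    'en': {'casino', 'gambling'},
    'es': {'casino', 'apuestas'},
}

def classify_url(url, title='', metadata=''):
    """Return 'black' if any blocklisted term (any language) appears in the URL,
    page title, or metadata; otherwise 'white'.  A keyword-only sketch, far
    simpler than the semantic approach described in the abstract."""
    text = ' '.join((url, title, metadata)).lower()
    terms = set().union(*BLOCK_TERMS.values())
    return 'black' if any(term in text for term in terms) else 'white'

print(classify_url('http://example.com/apuestas-online', title='Casino'))  # -> black
```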
... By imposing barriers to the free flow of data streaming, a compact representation of the raw data is sufficient to produce a brief but reliable decision-making process (Eggers and Khuon, 1990). In addition, the efficiency and scalability of object refinement, a subsequent step applied to the processed data, can be improved through proper pre-processing (Mitali et al., 2003). ...
Article
Full-text available
Tram/train derailments caused by human mistakes make investments in advanced control rooms and information-gathering systems appear overstated. The 2016 disaster in Croydon is recent evidence of the limitations of such systems in mitigating human shortcomings under disrupted circumstances. One intriguing resolution could be to fuse continuous online textual data obtained from tram travellers and use that information for early warning and risk discovery. This resolution draws our attention to data fusion as a resource. The focal subject of this paper is the role of pre-processing steps in low-level data fusion, which have been identified as a way to avoid wasting time and effort during information retrieval. Trends in online text data pre-processing are reviewed, resulting in an outline proposal for admitting travellers' responses through social media channels. The research outcome shows, through a data fusion case, how this could act as an impetus for the railway industry to participate effectively in data exploration and information investigation.
... Most people find online information by using general-purpose search engines rather than accredited health websites or portals (Bernstam et al., 2008). Therefore, we used the three most used search engines, Google, Yahoo and Bing (Silberg et al., 1997; Srivastava, Garg, & Mishra, 2014; Whitten, Nazione, & Lauckner, 2013), to identify web pages that users are likely to encounter when searching for online health information. A list of search terms commonly used by patients suffering from diabetes was obtained from the published literature, and finally we selected the keyword "diabetes mellitus" as it was the most used (Kim & Ladenson, 2002). ...
Article
Full-text available
Objective: Patients' involvement in disease management can decrease the economic burden on diabetic patients and society. Quality health information may help patients become involved in their health management. Thus, individuals need to find additional information from other information resources such as health websites. Nevertheless, health websites vary in quality and reliability. Therefore, it is of great importance to identify trustworthy health websites on diabetes. Thus, the aim of the present study was to evaluate the reliability of health websites concerning diabetes. Materials and methods: The keyword "diabetes mellitus" was entered as a search term into the three most used search engines, Google, Yahoo and Bing. The results from the first three pages reported by each search engine were selected. After excluding 19 websites, 71 unique websites were eligible for examination. The reliability of the websites was evaluated manually by both researchers using the HONcode of conduct tool. Furthermore, the HONcode toolbar function was used to recognize officially verified websites. Results: Only 19 out of 71 websites were officially verified by the HONcode foundation. None of the other retrieved websites achieved all 8 principles. Most of the retrieved websites were commercial (67.6%), and the smallest share belonged to university websites (1.4%). The highest and lowest compliance with the HON principles belonged to justifiability (99%) and attribution (51%), respectively. Conclusion: Diabetic patients need high-quality information from trustworthy websites to decide better about their health. Thus, physicians should be aware of the variable quality of health websites and guide their patients to reliable online resources.
... Several studies have been carried out in the past for session generation (Srivastava et al., 2014). These techniques are broadly classified into three categories: time-oriented, navigation-oriented and integer programming. ...
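Alongside the time-oriented sketch given earlier, a navigation-oriented heuristic can be sketched as follows: a new session starts whenever a request's referrer does not point to a page already in the current session. The data layout and rule are simplified assumptions, not a specific published algorithm.

```python
def referrer_oriented_sessions(requests):
    """Navigation-oriented session reconstruction.
    `requests` is a time-ordered list of (url, referrer) tuples for one user;
    a new session starts when the referrer is empty or not among the pages
    already seen in the current session.  Real variants also combine time
    thresholds with this rule."""
    sessions, current, seen = [], [], set()
    for url, referrer in requests:
        if current and (referrer in ('', '-') or referrer not in seen):
            sessions.append(current)
            current, seen = [], set()
        current.append((url, referrer))
        seen.add(url)
    if current:
        sessions.append(current)
    return sessions
```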
... Web Usage Mining is the application of data mining techniques to usage data, termed weblog data, for different purposes, e.g., system enhancement and adaptation, personalization, recommendation and advertisement [1], [2]. Depending on the analysis purpose and context, the weblog data are completed with other data [3] and are mined using different mining techniques, e.g., statistics, classification, clustering, association/sequential rules and dependency modelling. ...
Chapter
Full-text available
This paper addresses the issue of Weblog Data cleaning within the scope of Web Usage Mining. Weblog data are information on end-user clicks and underlying user-agent hits recorded by webservers. Since Web Usage Mining is interested in end-user behavior, user-agent hits are referred to as noise to be cleaned before mining. The most referenced and implemented cleaning methods are the conventional and advanced cleaning. They are content-centric filtering heuristics, based on the requested resource attribute of the weblog database. These cleaning methods are limited in terms of relevancy, workability and cost constraints, within the context of the dynamic and responsive web. In order to deal with dynamic and responsive web constraints, this contribution introduces a rule-based cleaning method focused on the logging structure rules. The rule-based cleaning method experimentation demonstrates significant advantages compared to the content-centric methods.