Figure 4. Learning curve of the machine learning model with the size of the dataset used for training and testing.
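A minimal sketch (not the authors' code) of how a learning curve such as the one in Figure 4 can be produced: model performance is recorded while the training-set size is increased. scikit-learn, matplotlib, and a random forest are assumed stand-ins here; the synthetic data and all parameters are illustrative only.

```python
# Sketch: plot training vs. validation score as the training set grows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)  # synthetic data

sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, test_scores.mean(axis=1), label="validation score")
plt.xlabel("Training set size")
plt.ylabel("Score")
plt.legend()
plt.show()
```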

Source publication
Conference Paper
Full-text available
The backbone of every search engine is the set of web crawlers, which go through all indexed web pages and update the search indexes with fresh copies, if there are changes. The crawling process provides optimum search results by keeping the indexes refreshed and up to date. This requires an “ideal scheduler” to crawl each web page immediately afte...
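To illustrate the scheduling idea sketched in the abstract, the snippet below shows one simple (hypothetical, not the paper's) way a crawler can order re-crawls: each page is queued again after a per-page interval derived from how often it is estimated to change. The URLs and intervals are made up for illustration.

```python
# Illustrative sketch: a priority queue of pages keyed by their next due crawl time.
import heapq
import time

def schedule(pages, now=None):
    """pages: dict mapping URL -> estimated change interval in seconds."""
    now = now if now is not None else time.time()
    queue = [(now + interval, url) for url, interval in pages.items()]
    heapq.heapify(queue)
    return queue

def next_due(queue):
    """Pop the page whose re-crawl is due soonest."""
    due_at, url = heapq.heappop(queue)
    return url, due_at

queue = schedule({
    "https://example.com/news": 3600,        # changes roughly hourly
    "https://example.com/about": 7 * 86400,  # changes roughly weekly
})
print(next_due(queue))  # the frequently changing page comes up first
```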

Citations

... Since this process is expensive in terms of time and traffic, there is a vast body of research on efficient crawling policies, starting with seminal work by Cho and Garcia-Molina [13,14]. For more recent reviews we refer the reader to [5,29,45,6,34,42,19]. ...
... The content of a page, when used in addition to the change history, has been shown to improve prediction [8,37,44,39,34]. Assuming a fixed change rate for each page, static page features are used as predictors for the first time in [8], alongside historical data on page changes. ...
... The page change frequency estimator in [34] uses as a dynamic predictive feature the change value of a page (computed at eight rates, from every 4 h to every 72 h). This change value covers page layout (or attribute) changes and three types of content changes for page elements: element additions, deletions, and modifications. ...
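A hedged sketch of the kind of "change value" described above: combining layout/attribute changes with counts of element additions, deletions, and modifications between two crawls. The weights and the exact formula are hypothetical; [34] defines its own computation.

```python
# Sketch: aggregate per-page change measurements into a single change value.
def change_value(layout_changes, additions, deletions, modifications,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    w_layout, w_add, w_del, w_mod = weights
    return (w_layout * layout_changes
            + w_add * additions
            + w_del * deletions
            + w_mod * modifications)

# Computed at several crawl rates (e.g. every 4 h up to every 72 h), such values
# become dynamic features for a change frequency estimator.
print(change_value(layout_changes=2, additions=5, deletions=1, modifications=3))
```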
Preprint
Full-text available
Discovering new hyperlinks enables Web crawlers to find new pages that have not yet been indexed. This is especially important for focused crawlers because they strive to provide a comprehensive analysis of specific parts of the Web, thus prioritizing discovery of new pages over discovery of changes in content. In the literature, changes in hyperlinks and content have usually been considered simultaneously. However, there is also evidence suggesting that these two types of changes are not necessarily related. Moreover, many studies about predicting changes assume that a long history of a page is available, which is unattainable in practice. The aim of this work is to provide a methodology for detecting new links effectively using a short history. To this end, we use a dataset of ten crawls at intervals of one week. Our study consists of three parts. First, we obtain insight into the data by analyzing empirical properties of the number of new outlinks. We observe that these properties are, on average, stable over time, but there is a large difference between the emergence of hyperlinks to pages within and outside the domain of a target page (internal and external outlinks, respectively). Next, we provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. These models include the features used earlier in the literature, as well as new features introduced in this work. We analyze the correlation between the features and investigate their informativeness. A notable finding is that, if the history of the target page is not available, then our new features, which represent the history of related pages, are the most predictive of new links in the target page. Finally, we propose ranking methods that serve as guidelines for focused crawlers to efficiently discover new pages and achieve excellent performance with respect to the corresponding targets.
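As a minimal sketch, not the authors' implementation, two of the three targets described above can be modelled with simple off-the-shelf estimators: a logistic model for the presence of new links and a Poisson model for their number. scikit-learn is an assumed stand-in, and the feature matrix below is synthetic; in the paper the features would be, e.g., the recent new-outlink history of the page and of related pages.

```python
# Sketch: presence of new links (classification) and number of new links (count regression).
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                             # hypothetical page/neighbourhood features
n_new_links = rng.poisson(lam=1.0 + 2.0 * X[:, 0])    # target: number of new links (synthetic)
has_new_links = (n_new_links > 0).astype(int)         # target: presence of new links

presence_model = LogisticRegression().fit(X, has_new_links)
count_model = PoissonRegressor().fit(X, n_new_links)

print(presence_model.predict_proba(X[:3])[:, 1])  # probability of any new link
print(count_model.predict(X[:3]))                 # expected number of new links
```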
... Whenever a fresh webpage is added, it is crawled to detect a suitable change frequency, which is recorded in the system. These values are then sent to a machine learning model [72] that predicts the time interval between two crawls for that particular webpage. When the change values and change frequencies for a webpage are sent to the machine learning model, it outputs a time interval, called a loop time, corresponding to that webpage. ...
Article
Full-text available
The majority of currently available webpages are dynamic in nature and are changing frequently. New content gets added to webpages, and existing content gets updated or deleted. Hence, people find it useful to be alert for changes in webpages that contain information that is of value to them. In the current context, keeping track of these webpages and getting alerts about different changes have become significantly challenging. Change Detection and Notification (CDN) systems were introduced to automate this monitoring process and to notify users when changes occur in webpages. This survey classifies and analyzes different aspects of CDN systems and different techniques used for each aspect. Furthermore, the survey highlights the current challenges and areas of improvement present within the field of research.
... Whenever a fresh webpage is added, it is crawled to detect a suitable change frequency, which is recorded in the system. These values are then sent to a machine learning model [Meegahapola et al. 2018] that predicts the time interval between two crawls for that particular webpage. This model was implemented as a modified random forest classifier using the H2O.ai machine learning API [Cook 2016] [Distributed Random Forest (DRF) - H2O 3.12.0.1 documentation 2017]. ...
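A minimal sketch of training an H2O Distributed Random Forest (DRF) to map per-page change statistics to a crawl interval ("loop time"), loosely following the setup described above. The CSV file, column names, and hyperparameters are hypothetical, not taken from the cited work.

```python
# Sketch: train an H2O DRF classifier that predicts a loop-time class per webpage.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Hypothetical training data: one row per webpage with its measured change
# values/frequencies and the loop-time class chosen for it.
pages = h2o.import_file("page_change_history.csv")
features = ["change_value_4h", "change_value_24h", "change_value_72h", "change_frequency"]
target = "loop_time_class"
pages[target] = pages[target].asfactor()  # treat loop time as a class label (classification)

drf = H2ORandomForestEstimator(ntrees=50, max_depth=20, seed=42)
drf.train(x=features, y=target, training_frame=pages)

# The predicted loop-time class for each page indicates when it should be re-crawled.
predictions = drf.predict(pages)
print(predictions.head())
```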
Preprint
Full text can be found here: https://arxiv.org/abs/1901.02660 -- The majority of currently available webpages are dynamic in nature and are changing frequently. New content gets added to webpages, and existing content gets updated or deleted. Hence, people find it useful to be alert for changes in webpages that contain information that is of value to them. In the current context, keeping track of these webpages and getting alerts about different changes have become significantly challenging. Change Detection and Notification (CDN) systems were introduced to automate this monitoring process, and to notify users when changes occur in webpages. This survey classifies and analyzes different aspects of CDN systems and different techniques used for each aspect. Furthermore, the survey highlights the current challenges and areas of improvement present within the field of research.
Article
Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new ‘look back, look around’ (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two previously unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modeling. All experiments were carried out on an original dataset made available by a commercial focused crawler.
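A minimal sketch of the probabilistic count model mentioned above: NGBoost fitted with a Poisson distribution for the number of new outlinks per page. This assumes the ngboost package's NGBRegressor and its Poisson distribution (ngboost.distns.Poisson), as in recent versions of the library; the feature matrix is a synthetic stand-in for the LBLA features (recent new-outlink history of the page and of content-related pages), and all hyperparameters are illustrative.

```python
# Sketch: NGBoost with a Poisson predictive distribution over new-outlink counts.
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Poisson

rng = np.random.default_rng(0)
X = rng.random((500, 4))                          # hypothetical page-history features
y = rng.poisson(lam=1.0 + 3.0 * X[:, 0])          # synthetic Poisson-distributed target

model = NGBRegressor(Dist=Poisson, n_estimators=300, verbose=False)
model.fit(X, y)

# Point predictions (expected number of new outlinks) and full predictive distributions;
# the latter can feed a scoring function that ranks pages for the crawler.
expected_new_outlinks = model.predict(X[:5])
pred_dist = model.pred_dist(X[:5])
print(expected_new_outlinks)
```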