Conference Paper

Focussed Crawling of Environmental Web Resources: A Pilot Study on the Combination of Multimedia Evidence


Abstract

Focussed crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed focussed crawling approach is applied towards the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate the effectiveness of incorporating visual evidence in the link selection process applied by the focussed crawler over the use of textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
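The fusion described in the abstract can be pictured with a small sketch. Everything below (the text-classifier interface, the linear combination and its weight alpha, and the use of the strongest image score on the parent page) is an illustrative assumption rather than the authors' exact formulation.

```python
# Minimal sketch of classifier-guided link scoring that fuses textual local
# context with parent-page visual evidence; names and weighting are assumptions.
def score_link(anchor_text, window_text, url_terms, parent_image_scores,
               text_clf, alpha=0.7):
    """Estimate the relevance of an unvisited link.

    text_clf            -- any trained text classifier exposing predict_proba
                           (e.g. an sklearn pipeline over TF-IDF features)
    parent_image_scores -- relevance scores (0..1) of images found on the
                           parent page, e.g. from a heatmap detector
    """
    local_context = " ".join([anchor_text, window_text, " ".join(url_terms)])
    p_text = text_clf.predict_proba([local_context])[0][1]
    p_visual = max(parent_image_scores, default=0.0)   # strongest visual cue
    return alpha * p_text + (1.0 - alpha) * p_visual
```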


... Webcrawling has been extensively applied in data collection for different applications such as collecting air quality measurements and forecasts from open web pages [14], identifying modern "traditional" medicine from user generated blogs [15], extracting information from social web forums [16], among others (see [17]). To our knowledge, this is the first time that webcrawling has been used or reported in the literature as a method to collect spatial data for the development of high resolution emission inventories. ...
... This can explain the low share of wood-based technologies in this area. Previous studies establish that wood consumption is lower when other heating technologies such as district heating are also available, e.g., [14]. Our results contrast with methods used to spatially distribute emissions from residential heating based on proxy data such as population or dwelling density [4], [26], and it supports the concern that such methods may over-allocate emissions in highly populated areas [5]. ...
Article
Full-text available
In this study we apply two methods for data collection that are relatively new in the field of atmospheric science. The two developed methods are designed to collect essential geo-localized information to be used as input data for a high resolution emission inventory for residential wood combustion (RWC). The first method is a webcrawler that extracts openly available online real estate data in a systematic way, and thereafter structures them for analysis. The webcrawler reads online Norwegian real estate advertisements and collects the geo-position of the dwellings. Dwellings are classified according to the type they belong to (e.g., apartment, detached house) and the heating systems they are equipped with. The second method is a model trained for image recognition and classification based on machine learning techniques. The images from the real estate advertisements are collected and processed to identify wood burning installations, which are automatically classified according to the three classes used in official statistics, i.e., open fireplaces, stoves produced before 1998 and stoves produced after 1998. The model recognizes and classifies the wood appliances with a precision of 81%, 85% and 91% for open fireplaces, old stoves and new stoves, respectively. Emission factors are heavily dependent on technology and this information is therefore essential for determining accurate emissions. The collected data are compared with existing information from the statistical register at county and national level in Norway. The comparison shows good agreement for the proportion of residential heating systems between the webcrawled data and the official statistics. The high resolution and level of detail of the extracted data show the value of open data to improve emission inventories. With the increased amount and availability of data, the techniques presented here add significant value to emission accuracy, and potential applications should also be considered across all emission sectors.
... Early techniques of the first category (e.g., [19]) have relied only on textual evidence for the classification of the retrieval results, while more recent approaches [18] have used visual evidence for performing this post-retrieval filtering; however, the combination of multimedia evidence has not been considered. On the other hand, recent focussed crawling approaches [25] have taken into account both textual and visual evidence for selecting the links to follow during their traversal of the Web graph. ...
... The keyword spices listed in Table 2 are generated using an annotated set of 664 air quality Web resources (284 positive, 380 negative). These were obtained by performing focussed crawling while starting from a set of seed pages that provide air quality measurements and forecasts [25], and manually annotating the crawled results using the relevance scale presented in the next section. This annotated dataset is split in half for training and validation. ...
Conference Paper
This work proposes a framework for the discovery of environmental Web resources providing air quality measurements and forecasts. Motivated by the frequent occurrence of heatmaps in such Web resources, it exploits multimedia evidence at different stages of the discovery process. Domain-specific queries generated using empirical information and machine learning driven query expansion are submitted both to the Web and Image search services of a general-purpose search engine. Post-retrieval filtering is performed by combining textual and visual (heatmap-related) evidence in a supervised machine learning framework. Our experimental results indicate improvements in the effectiveness when performing heatmap recognition based on SURF and SIFT descriptors using VLAD encoding and when combining multimedia evidence in the discovery process.
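A rough sketch of the heatmap-recognition step summarised above (local descriptors aggregated with VLAD and fed to a supervised classifier) might look as follows; the use of OpenCV SIFT, the codebook size, and the LinearSVC classifier are assumptions, not the paper's reported configuration.

```python
import numpy as np
import cv2                                  # OpenCV >= 4.4 ships SIFT by default
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def sift_descriptors(image_path):
    """Extract SIFT descriptors (128-D each) from a grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.zeros((0, 128), np.float32)

def vlad_encode(desc, codebook):
    """Aggregate local descriptors into a VLAD vector over a k-means codebook."""
    k = codebook.n_clusters
    v = np.zeros((k, 128), np.float32)
    if len(desc):
        labels = codebook.predict(desc)
        for i in range(k):
            assigned = desc[labels == i]
            if len(assigned):
                v[i] = (assigned - codebook.cluster_centers_[i]).sum(axis=0)
    v = np.sign(v) * np.sqrt(np.abs(v))       # power normalization
    norm = np.linalg.norm(v)
    return (v / norm if norm else v).ravel()

# Training (illustrative): build a codebook on pooled descriptors, then fit an
# SVM on VLAD vectors of labelled heatmap / non-heatmap images.
# codebook = KMeans(n_clusters=64).fit(np.vstack(all_descriptors))
# clf = LinearSVC().fit([vlad_encode(d, codebook) for d in all_descriptors], labels)
```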
... The studies discussed above cover either of the steps in hidden web crawling. Web crawling is one of the prominent methods applied in data collection for applications such as crawling user-generated blogs for recognition of modern traditional medicine [32], information extraction from social web forums [33], industrial digital ecosystems [34] and carbon emissions [35]. Crawling all domains is difficult, so a crawler is required to focus on certain domains, and intelligent rules are required to stop unproductive crawling and avoid spider traps. ...
Article
Full-text available
Due to the massive size of the hidden web, searching, retrieving and mining rich and high-quality data can be a daunting task. Moreover, with the presence of forms, data cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as applications to special cases, need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is least implemented, and a crawler is required to crawl through the huge web to narrow down the search to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address related problems such as the classification of domains, the prevention of exhaustive searching, and the prioritization of URLs. The crawler also performs well in curating pollution-related data. The crawler targets the relevant web pages and discards the irrelevant ones by implementing rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls the websites on priority for a given topic. The crawler fulfils the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution. The resultant crawled data can be used for finding the reasons for pollution. The crawler can help the user to search the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneer existing work. With an increase in the size of a dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework, which efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.
... As already discussed, the hyperlink selection policy relies on three different classifiers (i.e., link-based, parent Web page and destination Web page classifier), combined according to conditions related to the destination network type and/or the strength of local evidence around a hyperlink on a parent page. Based on recent research [14] and the empirical study conducted in the context of HOMER which indicated that the anchor text and also the URLs of hyperlinks leading to HME information often contain HME-related terms, e.g., the name of the HME, the link-based classifier represents the local context of each hyperlink using: (i) its anchor text, (ii) a text window of x characters (x = 50) surrounding the anchor text that does not overlap with the anchor text of adjacent hyperlinks, and (iii) the terms extracted from the URL. Given that Tor and Freenet URLs inherently contain network-specific terms within the URL domain name which do not convey any meaningful information (i.e., onion URLs contain automatically generated 16-character alpha-semi-numeric hashes, whereas Freenet URLs contain a localhost IP address), the local context representation in the case of Tor or Freenet URLs includes the URL terms extracted only from the URL path or parameters. ...
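The URL-term handling described in this excerpt can be sketched as below; the tokenisation rule and the minimum term length are assumptions.

```python
import re
from urllib.parse import urlparse

def url_terms(url):
    """Extract terms from a URL for a link-based classifier. For Tor (.onion)
    and Freenet (localhost) URLs, the domain carries no topical information,
    so only path and query terms are kept, as described in the excerpt above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if host.endswith(".onion") or host in ("localhost", "127.0.0.1"):
        raw = f"{parts.path} {parts.query}"
    else:
        raw = f"{host} {parts.path} {parts.query}"
    # Split on non-alphanumeric characters; keep terms longer than 2 characters.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", raw) if len(t) > 2]
```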
Article
Full-text available
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, among which is a novel strategy based on the dynamic linear combination of a link-based and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler both for the Surface and the Dark Web.
... Our methodology is motivated by the results of an empirical study performed with the support of HME experts in the context of the HOMER project which indicated that the anchor text of hyperlinks leading to HME information often contains HME-related terms (e.g. the name of the HME), and also that the URL could be informative to some extent, since it may contain relevant information (e.g. the name of the HME). As a result, the focussed crawler follows recent research (Tsikrika et al. 2016) and represents the local context of each hyperlink using: (i) its anchor text, (ii) a text window of x characters (e.g. x = 50) surrounding the anchor text that does not overlap with the anchor text of adjacent links, and (iii) the terms extracted from the URL. ...
Chapter
The Dark Web, a part of the Deep Web that consists of several darknets (e.g. Tor, I2P, and Freenet), provides users with the opportunity of hiding their identity when surfing or publishing information. This anonymity facilitates the communication of sensitive data for legitimate purposes, but also provides the ideal environment for transferring information, goods, and services with potentially illegal intentions. Therefore, Law Enforcement Agencies (LEAs) are very much interested in gathering OSINT on the Dark Web that would allow them to successfully prosecute individuals involved in criminal and terrorist activities. To this end, LEAs need appropriate technologies that would allow them to discover darknet sites that facilitate such activities and identify the users involved. This chapter presents current efforts in this direction by first providing an overview of the most prevalent darknets, their underlying technologies, their size, and the type of information they contain. This is followed by a discussion of the LEAs’ perspective on OSINT on the Dark Web and the challenges they face towards discovering and de-anonymizing such information and by a review of the currently available techniques to this end. Finally, a case study on discovering terrorist-related information, such as home made explosive recipes, on the Dark Web is presented.
... To this end, it estimates the relevance of a hyperlink to an unvisited resource based on its local context. Motivated by the results of the aforementioned empirical study that indicated that the anchor text of hyperlinks leading to HME information often contains HME-related terms (e.g. the name of the HME), and also that the URL could be informative to some extent, since it may contain relevant information (e.g. the name of the HME and HME-related keywords), we follow recent research [13] and represent the local context of each hyperlink using: (i) its anchor text, (ii) a text window of x characters (e.g. x = 50) surrounding the anchor text that ... [Footnotes in the excerpt: (11) http://code.google.com/p/boilerpipe/ (12) http://rp-www.cs.usyd.edu.au/~scilect/sherlock/ (13) This limit of x characters is automatically extended so as to guarantee that it will not split any of the words lying at the edges of the window.] ...
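The window-extension behaviour described in the footnote can be sketched as follows; handling of adjacent anchors and of the surrounding HTML is omitted for brevity.

```python
def context_window(page_text, anchor_start, anchor_end, x=50):
    """Return the anchor text plus x characters on each side, extending the
    window outward so that no word at either edge is split (as the footnote
    describes). Non-overlap with adjacent anchors is not handled here."""
    left = max(0, anchor_start - x)
    right = min(len(page_text), anchor_end + x)
    while left > 0 and not page_text[left - 1].isspace():
        left -= 1                      # extend left to the previous word boundary
    while right < len(page_text) and not page_text[right].isspace():
        right += 1                     # extend right to the next word boundary
    return page_text[left:right]
```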
Conference Paper
This work proposes a novel framework that integrates diverse state-of-the-art technologies for the discovery, analysis, retrieval, and recommendation of heterogeneous Web resources containing multimedia information about homemade explosives (HMEs), with particular focus on HME recipe information. The framework corresponds to a knowledge management platform that enables the interaction with HME information, and consists of three major components: (i) a discovery component that allows for the identification of HME resources on the Web, (ii) a content-based multimedia analysis component that detects HME-related concepts in multimedia content, and (iii) an indexing, retrieval, and recommendation component that processes the available HME information to enable its (semantic) search and provision of similar information. The proposed framework is being developed in a user-driven manner, based on the requirements of law enforcement and security agencies personnel, as well as HME domain experts. In addition, its development is guided by the characteristics of HME Web resources, as these have been observed in an empirical study conducted by HME domain experts. Overall, this framework is envisaged to increase the operational effectiveness and efficiency of law enforcement and security agencies in their quest to keep the citizen safe.
Article
At present, the focused crawler is a crucial method for obtaining effective domain knowledge from massive heterogeneous networks. Most current focused crawling technologies have difficulty obtaining high-quality crawling results. The main difficulties are the establishment of topic benchmark models, the assessment of the topic relevance of hyperlinks, and the design of crawling strategies. In this paper, we use domain ontology to build a topic benchmark model for a specific topic, and propose a novel multiple-filtering strategy based on local ontology and global ontology (MFSLG). A comprehensive priority evaluation method (CPEM) based on the web text and link structure is introduced to improve the computation precision of topic relevance for unvisited hyperlinks, and a simulated annealing (SA) method is used to avoid the focused crawler falling into local optima of the search. By incorporating SA into the focused crawler with MFSLG and CPEM for the first time, two novel focused crawler strategies based on ontology and SA (FCOSA), including FCOSA with only global ontology (FCOSA_G) and FCOSA with both local ontology and global ontology (FCOSA_LG), are proposed to obtain topic-relevant webpages about rainstorm disasters from the network. Experimental results show that the proposed crawlers outperform the other focused crawling strategies on different performance metric indices.
Article
Full-text available
The focused crawler downloads web pages related to the given topic from the Internet. In many research studies, focused crawlers predict the priority values of unvisited hyperlinks by integrating topic similarities based on a text similarity model with weighted factors that are set manually. However, these focused crawlers have flaws in their text similarity models, and the weighted factors used to calculate the priorities of unvisited URLs are determined arbitrarily. To solve these problems, this paper proposes a semantic and intelligent focused crawler based on the Semantic Vector Space Model (SVSM) and the Membrane Computing Optimization Algorithm (MCOA). Firstly, the SVSM method is used to calculate topic similarities between texts and the given topic. Secondly, the MCOA method is used to optimize four weighted factors based on the evolution rules and the communication rule. Finally, the proposed focused crawler predicts the priority of each unvisited hyperlink by integrating the topic similarities of four texts and the optimal four weighted factors. The experimental results indicate that the proposed SVSM-MCOA crawler improves the evaluation indicators compared with the other four focused crawlers. In conclusion, the proposed SVSM and MCOA methods give the focused crawler semantic understanding and intelligent learning ability.
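The priority computation summarised above reduces to a weighted combination of per-text topic similarities. The sketch below uses cosine similarity and four illustrative text fields, and leaves out the MCOA optimisation of the weights.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def link_priority(text_vectors, topic_vector, weights):
    """Priority of an unvisited URL as a weighted sum of the topic similarities
    of four texts (e.g. anchor text, surrounding text, page title, full page;
    the exact fields and the learned weights are defined by the paper, not here)."""
    sims = [cosine(v, topic_vector) for v in text_vectors]   # four similarities
    return sum(w * s for w, s in zip(weights, sims))
```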
Conference Paper
This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e. Tor, I2P and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
Conference Paper
This work investigates the effectiveness of a novel interactive search engine in the context of discovering and retrieving Web resources containing recipes for synthesizing Home Made Explosives (HMEs). The discovery of HME Web resources both on Surface and Dark Web is addressed as a domain-specific search problem; the architecture of the search engine is based on a hybrid infrastructure that combines two different approaches: (i) a Web crawler focused on the HME domain; (ii) the submission of HME domain-specific queries to general-purpose search engines. Both approaches are accompanied by a user-initiated post-processing classification for reducing the potential noise in the discovery results. The design of the application is built based on the distinctive nature of law enforcement agency user requirements, which dictate the interactive discovery and the accurate filtering of Web resources containing HME recipes. The experiments evaluating the effectiveness of our application demonstrate its satisfactory performance, which in turn indicates the significant potential of the adopted approaches on the HME domain.
Article
Full-text available
There is a large amount of meteorological and air quality data available online. Often, different sources provide deviating and even contradicting data for the same geographical area and time. This implies that users need to evaluate the relative reliability of the information and then trust one of the sources. We present a novel data fusion method that merges the data from different sources for a given area and time, ensuring the best data quality. The method is a unique combination of land-use regression techniques, statistical air quality modelling and a well-known data fusion algorithm. We show experiments where a fused temperature forecast outperforms individual temperature forecasts from several providers. Also, we demonstrate that the local hourly NO2 concentration can be estimated accurately with our fusion method while a more conventional extrapolation method falls short. The method forms part of the prototype web-based service PESCaDO, designed to cater personalized environmental information to users.
Conference Paper
Full-text available
It is common practice to disseminate Chemical Weather (air quality and meteorology) forecasts to the general public, via the internet, in the form of pre-processed images which differ in format, quality and presentation, without other forms of access to the original data. As the number of on-line available Chemical Weather (CW) forecasts is increasing, there are many geographical areas that are covered by different models, and their data could not be combined, compared, or used in any synergetic way by the end user, due to the aforementioned heterogeneity. This paper describes a series of methods for extracting and reconstructing data from heterogeneous air quality forecast images coming from different data providers, to allow for their unified harvesting, processing, transformation, storage and presentation in the Chemical Weather portal.
Article
Full-text available
The EU legislative framework related to air quality, together with national legislation and relevant Declarations of the United Nations (UN) requires an integrated approach concerning air quality management (AQM), and accessibility of related information for the citizens. In the present paper, the main requirements of this legislative framework are discussed and main air quality management and information system characteristics are drawn. The use of information technologies is recommended for the construction of such systems. The WWW is considered a suitable platform for system development and integration and at the same time as a medium for communication and information dissemination.
Conference Paper
Full-text available
Shot boundary detection: The shot boundary detection system in 2007 is basically the same as that of last year. We make three major modifications in the system of this year. First, the CUT detector and GT detector use block-based RGB color histograms with different parameters instead of the same ones. Secondly, we add a motion detection module to the GT detector to remove the false alarms caused by camera motion or large object movements. Finally, we add a post-processing module based on the SIFT feature after both the CUT and GT detectors. The evaluation results show that all these modifications bring performance improvements to the system. The brief introduction to each run is shown in the following list:
Thu01: Baseline system: RGB4_48 for CUT and GT detector, no motion detector, no SIFT post-processing, only using the development set of 2005 as training set
Thu02: Same algorithm as thu01, but with RGB16_48 for CUT detector, RGB4_48 for GT detector
Thu03: Same algorithm as thu02, but with SIFT post-processing for CUT
Thu04: Same algorithm as thu03, but with motion detector for GT
Thu05: Same algorithm as thu04, but with SIFT post-processing for GT
Thu06: Same algorithm as thu05, but no SIFT post-processing for CUT
Thu09: Same algorithm as thu05, but with different parameters
Thu11: Same algorithm as thu05, but with different parameters
Thu13: Same algorithm as thu05, but with different parameters
Thu14: Same algorithm and parameters as thu05, but trained with all the development data from 2003-2006
High-level feature extraction: We try a novel approach, Multi-Label Multi-Feature learning (MLMF learning), to learn a joint-concept distribution on the regional level as an intermediate representation. Besides, we improve our Video diver indexing system by designing new features, comparing learning algorithms and exploring novel fusion algorithms. Based on these efforts in improving feature, learning and fusion algorithms, we achieve top results in HFE this year.
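For readers unfamiliar with the block-based RGB histogram features named in the run list, a minimal sketch is given below; the grid size, bin count and threshold are assumptions, since the exact RGB4_48/RGB16_48 parameters are not spelled out here.

```python
import numpy as np
import cv2

def block_rgb_hist(frame, grid=4, bins=4):
    """Block-based RGB colour histogram: split the frame into a grid x grid
    layout and concatenate a per-block, L1-normalised RGB histogram."""
    h, w, _ = frame.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = frame[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            hist = cv2.calcHist([block], [0, 1, 2], None,
                                [bins, bins, bins], [0, 256] * 3).ravel()
            feats.append(hist / (hist.sum() + 1e-9))
    return np.concatenate(feats)

def looks_like_cut(prev_frame, curr_frame, threshold=0.5):
    """Flag an abrupt shot boundary when consecutive frame histograms differ a lot."""
    diff = np.abs(block_rgb_hist(prev_frame) - block_rgb_hist(curr_frame))
    return diff.sum() / 16 > threshold      # 16 blocks, each L1-normalised
```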
Article
Full-text available
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
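The best-first search model described above amounts to a priority-ordered frontier whose scores come from the guiding classifier; a minimal, serial sketch (the class and method names are illustrative):

```python
import heapq

class CrawlFrontier:
    """Best-first crawl frontier: URLs are popped in decreasing order of the
    guiding classifier's relevance estimate."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))   # max-heap via negation

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```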
Conference Paper
Full-text available
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality. We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler. Analysis suggests that quality of content might be improved by post-filtering a very big breadth-first crawl, at the cost of substantially increased network traffic.
Article
Full-text available
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
Conference Paper
Full-text available
Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describes the pages to which they point, is the foundation for a usable World-Wide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find the likelihood of linked pages having similar textual content to be high; the similarity of sibling pages increases when the links from the parent are close together; titles, descriptions, and anchor text represent at least part of the target page; and anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.
Article
Full-text available
Context of a hyperlink or link context is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a support vector machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink as well as the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages allowing us to derive statistically strong results.
Article
Full-text available
Domain-specific Web search engines are effective tools for reducing the difficulty experienced when acquiring information from the Web. Existing methods for building domain-specific Web search engines require human expertise or specific facilities. However, we can build a domain-specific search engine simply by adding domain-specific keywords, called "keyword spices," to the user's input query and forwarding it to a general-purpose Web search engine. Keyword spices can be effectively discovered from Web documents using machine learning technologies. The paper describes domain-specific Web search engines that use keyword spices for locating recipes, restaurants, and used cars.
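The query construction described above can be as simple as appending the learned keyword spices to the user's query before forwarding it to a general-purpose engine; representing the spices as a single disjunction, as below, is a simplification of the Boolean expressions the paper learns.

```python
def apply_keyword_spices(user_query, spices):
    """Append learned domain-specific 'keyword spices' to the user's query
    before forwarding it to a general-purpose search engine (illustrative)."""
    return f"{user_query} ({' OR '.join(spices)})"

# Example (hypothetical spices for the recipe domain):
# apply_keyword_spices("tomato soup", ["ingredients", "tablespoon", "simmer"])
# -> 'tomato soup (ingredients OR tablespoon OR simmer)'
```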
Article
Full-text available
The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.
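The exploitation/exploration balance studied above can be illustrated with an epsilon-greedy selection rule; this is a stand-in for the idea, not one of the crawling policies evaluated in the paper.

```python
import random

def select_next_link(frontier, epsilon=0.1):
    """Pick the next link from a list of (url, score) pairs: usually the
    best-scored one (exploitation), occasionally a random one (exploration)."""
    if not frontier:
        raise IndexError("empty frontier")
    if random.random() < epsilon:
        choice = random.choice(frontier)              # explore
    else:
        choice = max(frontier, key=lambda pair: pair[1])   # exploit
    frontier.remove(choice)
    return choice
```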
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
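In practice LIBSVM is frequently used through wrappers such as scikit-learn's SVC; below is a usage sketch with the cross-validated grid search over C and gamma that the LIBSVM practical guide recommends (the grid values and the RBF kernel are illustrative choices).

```python
from sklearn.svm import SVC                      # scikit-learn's SVC wraps LIBSVM
from sklearn.model_selection import GridSearchCV

# Cross-validated grid search over C and gamma; values are illustrative.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1.0]}
model = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
# model.fit(X_train, y_train)
# model.predict_proba(X_test)    # Platt-style probability estimates
```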
Conference Paper
Environmental data are considered of utmost importance for human life, since weather conditions, air quality and pollen are strongly related to health issues and affect everyday activities. This paper addresses the problem of discovery of air quality and pollen forecast Web resources, which are usually presented in the form of heatmaps (i.e. graphical representation of matrix data with colors). Towards the solution of this problem, we propose a discovery methodology, which builds upon a general purpose search engine and a novel post processing heatmap recognition layer. The first step involves generation of domain-specific queries, which are submitted to the search engine, while the second involves an image classification step based on visual low level features to identify Web sites including heatmaps. Experimental results comparing various visual features combinations show that relevant environmental sites can be efficiently recognized and retrieved.
Article
Progress in computer capabilities has substantially influenced research in air quality modelling, a very complex and multidisciplinary area. It covers remote sensing, land use impacts, initial and boundary conditions, data assimilation techniques, chemical schemes, comparison between measured and modelled data, computer efficiency, parallel computing, coupling with meteorology, long-range transport impact on local air pollution, new satellite data assimilation techniques, real-time and forecasting and sensitivity analysis. This contribution focuses on providing a general overview of the state of the art in air quality modelling from the point of view of the “user community,” which includes policy makers, urban planners and environmental managers. It also tries to bring to the discussion key questions, such as where are the greatest uncertainties in emission inventories and meteorological fields, how well do air quality models simulate urban aerosols, and what are the next generation developments in models to answer new scientific and management questions.
Conference Paper
Analysis and processing of environmental information is considered of utmost importance for humanity. This article addresses the problem of discovery of web resources that provide environmental measurements. Towards the solution of this domain-specific search problem, we combine state-of-the-art search techniques together with advanced textual processing and supervised machine learning. Specifically, we generate domain-specific queries using empirical information and machine learning driven query expansion in order to enhance the initial queries with domain-specific terms. Multiple variations of these queries are submitted to a general-purpose web search engine in order to achieve a high recall performance and we employ a post processing module based on supervised machine learning to improve the precision of the final results. In this work, we focus on the discovery of weather forecast websites and we evaluate our technique by discovering weather nodes for south Finland.
Article
Platt’s probabilistic outputs for Support Vector Machines (Platt, J. in Smola, A., et al. (eds.) Advances in large margin classifiers. Cambridge, 2000) has been popular for applications that require posterior class probabilities. In this note, we propose an improved algorithm that theoretically converges and avoids numerical difficulties. A simple and ready-to-use pseudo code is included.
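The quantity being fitted is a sigmoid over the SVM decision values; a minimal sketch of the resulting posterior is given below (the numerically stable fitting of A and B described in the note is not reproduced).

```python
import numpy as np

def platt_probability(decision_value, A, B):
    """Platt's posterior estimate P(y=1 | f) = 1 / (1 + exp(A*f + B)), where
    A and B are fitted on held-out SVM decision values."""
    return 1.0 / (1.0 + np.exp(A * decision_value + B))
```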
Conference Paper
Raster map images (e.g., USGS) provide much information in digital form; however, the color assignments and pixel labels leave many serious ambiguities. A color histogram classification scheme is described, followed by the application of a tensor voting method to classify linear features in the map as well as intersections in linear feature networks. The major result is an excellent segmentation of roads, and road intersections are detected with about 93% recall and 66% precision.
Article
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
Article
This paper proposes a method for binary image retrieval, where the black-and-white image is represented by a novel feature named the adaptive hierarchical density histogram, which exploits the distribution of the image points on a two-dimensional area. This adaptive hierarchical decomposition technique employs the estimation of point density histograms of image regions, which are determined by a pyramidal grid that is recursively updated through the calculation of image geometric centroids. The extracted descriptor combines global and local properties and can be used in variant types of binary image databases. The validity of the introduced method, which demonstrates high accuracy, low computational cost and scalability, is both theoretically and experimentally shown, while comparison with several other prevailing approaches demonstrates its performance.
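A simplified sketch of the centroid-based recursive decomposition described above is given below; the recursion depth, the fallback for empty regions, and the feature layout are assumptions, not the paper's exact design.

```python
import numpy as np

def ahdh(points, bbox, depth=3):
    """Adaptive hierarchical density histogram sketch: split each region at the
    geometric centroid of its points, record the share of points falling in
    each quadrant, and recurse. `points` is an (N, 2) array of image points."""
    x0, y0, x1, y1 = bbox
    inside = points[(points[:, 0] >= x0) & (points[:, 0] < x1) &
                    (points[:, 1] >= y0) & (points[:, 1] < y1)]
    if depth == 0:
        return []
    if len(inside):
        cx, cy = inside.mean(axis=0)                   # geometric centroid
    else:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0      # fallback: box centre
    quads = [(x0, y0, cx, cy), (cx, y0, x1, cy),
             (x0, cy, cx, y1), (cx, cy, x1, y1)]
    feat = []
    for qx0, qy0, qx1, qy1 in quads:
        n = np.sum((inside[:, 0] >= qx0) & (inside[:, 0] < qx1) &
                   (inside[:, 1] >= qy0) & (inside[:, 1] < qy1))
        feat.append(float(n) / max(len(inside), 1))    # local point density
    for q in quads:
        feat.extend(ahdh(points, q, depth - 1))
    return feat
```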
Conference Paper
The separation of overlapping text and graphics is a challenging problem in document image analysis. This paper proposes a specific method of detecting and extracting characters that are touching graphics. It is based on the observation that the constituent strokes of characters are usually short segments in comparison with those of graphics. It combines line continuation with the feature line width to decompose and reconstruct segments underlying the region of intersection. Experimental results showed that the proposed method improved the percentage of correctly detected text as well as the accuracy of character recognition significantly.
Conference Paper
Previous work on domain specific search services in the area of depressive illness has documented the significant human cost required to set up and maintain closed-crawl parameters. It also showed that domain coverage is much less than that of whole-of-web search engines. Here we report on the feasibility of techniques for achieving greater coverage at lower cost. We found that acceptably effective crawl parameters could be automatically derived from a DMOZ depression category list, with dramatic savings in effort. We also found evidence that focused crawling could be effective in this domain: relevant documents from diverse sources are extensively interlinked; many outgoing links from a constrained crawl based on DMOZ lead to additional relevant content; and we were able to achieve reasonable precision (88%) and recall (68%) using a J48-derived predictive classifier operating only on URL words, anchor text and text content adjacent to referring links. Future directions include implementing and evaluating a focused crawler. Furthermore, the quality of information in returned pages (measured in accordance with evidence-based medicine) is vital when searchers are consumers. Accordingly, automatic estimation of web site quality and its possible incorporation in a focused crawler is the subject of a separate concurrent study.
Article
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Article
Finding specific information in the World-Wide Web (WWW, or Web for short) is becoming increasingly difficult, because of the rapid growth of the Web and because of the diversity of the information offered through the Web. Hypertext in general is ill-suited for information retrieval as it is designed for stepwise exploration. To help readers find specific information quickly, specific overview documents are often included into the hypertext. Hypertext systems often provide simple searching tools such as full text search or title search, that mostly ignore the "hyper-structure" formed by the links. In the WWW, finding information is further complicated by its distributed nature. Navigation, often via overview documents, still is the predominant method of finding one's way around the Web. Several searching tools have been developed, basically in two types:
• A gateway, offering (limited) search operations on small or large parts of the WWW, using a pre-compiled database. The database is often built by an automated Web scanner (a "robot").
• A client-based search tool that does automated navigation, thereby working more or less like a browsing user, but much faster and following an optimized strategy.
This paper highlights the properties and implementation of a client-based search tool called the "fish-search" algorithm, and compares it to other approaches. The fish-search, implemented on top of Mosaic for X, offers an open-ended selection of search criteria. Client-based searching has some definite drawbacks: slow speed and high network resource consumption. The paper shows how combining the fish search with a cache greatly reduces these problems. The "Lagoon" cache program is presented. Caches can call each other, currently only to further reduce network traffic. By moving the algorithm into the cache program, the calculation of the answer to a search request can be distributed among the caching servers.
Article
MPEG-7, formally known as the Multimedia Content Description Interface, includes standardized tools (descriptors, description schemes, and language) enabling structural, detailed descriptions of audio-visual information at different granularity levels (region, image, video segment, collection) and in different areas (content description, management, organization, navigation, and user interaction). It aims to support and facilitate a wide range of applications, such as media portals, content broadcasting, and ubiquitous multimedia. We present a high-level overview of the MPEG-7 standard. We first discuss the scope, basic terminology, and potential applications. Next, we discuss the constituent components. Then, we compare the relationship with other standards to highlight its capabilities
Text/graphics separation in maps. In: Graphics Recognition: Algorithms and Applications
  • R Cao
Discovery, analysis and retrieval of multimodal environmental information
  • A Moumtzidou
  • S Vrochidis
  • I Kompatsiaris