Conference Paper

Configuration Model for Focused Crawlers in Technology Intelligence

Authors:

Abstract

Due to a steady increase in competitive pressure caused by ongoing globalization and dynamically growing markets, technology intelligence has become an important element of strategic business intelligence. The objective of technology intelligence is the systematic identification of future opportunities, but also of threats to companies, arising from new technologies and further technological developments. To operate technology intelligence efficiently, access to up-to-date, relevant, and sufficiently complete information is essential. Indeed, thanks to digitalization, the availability of information is higher than ever. However, this also causes the problem of information overload: the available mass of data has to be searched, sorted, and assessed to identify the information actually needed. In addition, this information processing has to be carried out continuously, or repeated for each new object of investigation, otherwise the results lose their validity. Accordingly, it appears reasonable to automate this process as far as possible using smart software solutions. One of the promising approaches is "focused crawling", which not only runs through given data sources on the web, but also rates each data record to make an autonomous decision about which information is relevant for further processing and which data records should reasonably be analyzed next. To implement such crawlers, different approaches exist in the field of information retrieval, for example different rating and discovery algorithms. This paper presents the status quo of ongoing research to develop a configuration model for focused crawlers that fulfills the varying requirements of technology intelligence tasks. First, the assessment criteria for information in a technology intelligence process and the configuration possibilities of focused crawlers are described. As a result, a first approach to matching the requirements of technology intelligence tasks with the consequences of different focused crawler configurations is presented. Finally, the paper explains how this approach will be refined and validated in future case studies.
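As a rough, non-authoritative illustration of the decision loop sketched in the abstract, the following Python snippet (standard library only) keeps a priority frontier, rates each fetched page, prunes pages below a threshold, and enqueues out-links with the parent's score as priority. The keyword-based relevance function, the seed list, and the threshold are placeholder assumptions, not the configuration model proposed in the paper.

import heapq
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword-based relevance score; a real crawler would use a trained classifier.
TOPIC_TERMS = {"technology", "intelligence", "crawler"}

def relevance(text):
    words = text.lower().split()
    return sum(words.count(t) for t in TOPIC_TERMS) / (len(words) + 1)

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
    def handle_data(self, data):
        self.text.append(data)

def focused_crawl(seeds, threshold=0.01, max_pages=100):
    frontier = [(-1.0, url) for url in seeds]   # max-priority via negated score
    heapq.heapify(frontier)
    seen, relevant = set(seeds), []
    while frontier and len(relevant) < max_pages:
        _priority, url = heapq.heappop(frontier)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        score = relevance(" ".join(parser.text))
        if score < threshold:
            continue                            # prune irrelevant regions of the web graph
        relevant.append((url, score))
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                heapq.heappush(frontier, (-score, absolute))  # inherit parent's score as priority
    return relevant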
... Regarding the previously presented possibilities of Artificial Intelligence, the steps of information retrieval and information evaluation are particularly suitable for automation. After the information needs are defined and at least the initial information sources are chosen, a Focused Crawler can parse the available sources and thereby perform classification and evaluation [86]. Based on this evaluation, the crawler decides how to proceed with the parsing process. ...
... It is based on four processing layers: network, parsing and extraction, representation, and intelligence [51,86,87]. ...
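The four processing layers named in this excerpt can be read as a simple pipeline. The sketch below is only an assumption about how such layers might hand data to one another; the function names and the toy keyword score are illustrative, not the interfaces of the cited architecture.

import re
import urllib.request
from collections import Counter
from urllib.parse import urljoin

def network_layer(url):
    # Network layer: fetch the raw document.
    return urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")

def parsing_extraction_layer(base_url, raw_html):
    # Parsing and extraction layer: strip tags, collect absolute links.
    links = [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', raw_html)]
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return text, links

def representation_layer(text):
    # Representation layer: simple bag-of-words term frequencies.
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))

def intelligence_layer(features, topic_terms=("technology", "crawler")):
    # Intelligence layer: rate relevance; here a toy keyword-overlap score.
    return sum(features[t] for t in topic_terms)

def process(url):
    raw = network_layer(url)
    text, links = parsing_extraction_layer(url, raw)
    score = intelligence_layer(representation_layer(text))
    return score, links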
Article
Technology Management is an important part of a company’s business strategy. Procuring, evaluating, and processing information is crucial to the success of this process. This paper describes a way of dealing with today’s challenge of information overload. It introduces a concept of technology databases based on opportunity evaluation and the systematic development process of technical systems. The idea is to capture two perspectives – generic technology information outside the company and internal technology knowledge – in one comprehensive database to be used as a basis for decision-making in technology management. Complementing this, the paper presents concepts of Artificial Intelligence-based information retrieval and processing that are suitable to efficiently support and semi-automate the filling and updating of such a database. Furthermore, existing software solutions are considered as examples. The finding of this paper is a combined approach of such a technology database with Artificial Intelligence methods for information retrieval that can support the process of technology management more comprehensively than is currently possible.
... In conjunction with the CoE, the crawler is intended to take each sub-project's material as training data to monitor selected research databases for new publications fitting the sub-project's interests. First thoughts on a configuration model to prepare such crawlers are presented in Schuh et al. (2015). ...
Chapter
Scientific Cooperation Engineering researches, fosters and supports scientific cooperation on all hierarchical levels and beyond scientific disciplines as a key resource for innovation in the Cluster of Excellence. State-of-the-art research methods—such as structural equation models, success models, or studies on success factors—that are frequently used in IS research are applied to create profound knowledge and insights into the contribution and optimal realization of scientific inter- and transdisciplinary communication and cooperation. A continuous formative evaluation is used to derive and explore insights into interdisciplinary collaboration and innovation processes from a management perspective. In addition, actor-based empirical studies are carried out to explore critical factors for interdisciplinary cooperation and intercultural diversity management. Based on these results, workflows, physical networking events and tailor-made training programs are created and iteratively optimized towards the cluster’s needs. As Scientific Cooperation Engineering aims to gain empirical and data-driven knowledge, a Scientific Cooperation Portal and a prototypic flowchart application are under development to support workflows and project management. Furthermore, data science methods are currently implemented to recognize synergetic patterns based on bibliometric information and topical proximity, which is analyzed via project terminologies.
Article
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
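The classifier/distiller split described here can be illustrated with a small sketch. The hub score below is a deliberate simplification (counting out-links to pages the classifier judged relevant, instead of running full HITS), and the data structures are hypothetical.

from collections import defaultdict

def distill_hubs(out_links, relevance_scores, threshold=0.5, top_k=10):
    # Rank pages by how many relevant pages they point to (simplified distiller).
    # out_links:        dict mapping url -> iterable of urls it links to
    # relevance_scores: dict mapping url -> classifier score in [0, 1]
    hub_score = defaultdict(int)
    for url, targets in out_links.items():
        for target in targets:
            if relevance_scores.get(target, 0.0) >= threshold:
                hub_score[url] += 1          # page is an access point to a relevant page
    return sorted(hub_score, key=hub_score.get, reverse=True)[:top_k]

# Hubs found this way would be revisited more often and their out-links
# given higher download priority than those of ordinary relevant pages.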
Conference Paper
The large amount of available information on the Web makes it hard for users to locate resources about particular topics of interest. Traditional search tools, e.g., search engines, do not always successfully cope with this problem, that is, with helping users find the right information. In the personalized search domain, focused crawlers are receiving increasing attention as a well-founded alternative for searching the Web. Unlike a standard crawler, which traverses the Web downloading all the documents it comes across, a focused crawler is developed to retrieve documents related to a given topic of interest, reducing the network and computational resources. This chapter presents an overview of the focused crawling domain and, in particular, of the approaches that include a sort of adaptivity. That feature makes it possible to change the system behavior according to the particular environment and its relationships with the given input parameters during the search.
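As a loose illustration of the adaptivity discussed in this overview, the sketch below adjusts the pruning threshold from the recent harvest rate (the share of downloaded pages judged relevant). The update rule and all constants are assumptions made for illustration, not taken from any of the surveyed crawlers.

from collections import deque

class AdaptiveThreshold:
    # Loosen or tighten the relevance threshold based on the recent harvest rate.

    def __init__(self, threshold=0.5, target_rate=0.3, window=50, step=0.05):
        self.threshold = threshold
        self.target_rate = target_rate
        self.recent = deque(maxlen=window)   # 1 = page kept, 0 = page pruned
        self.step = step

    def update(self, page_was_relevant):
        self.recent.append(1 if page_was_relevant else 0)
        if len(self.recent) < self.recent.maxlen:
            return self.threshold
        rate = sum(self.recent) / len(self.recent)
        if rate < self.target_rate:
            self.threshold = max(0.0, self.threshold - self.step)  # too strict: explore more
        elif rate > self.target_rate:
            self.threshold = min(1.0, self.threshold + self.step)  # too lax: focus harder
        return self.threshold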
Article
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
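As a hedged illustration of classifier-guided crawling, the sketch below trains either a Naive Bayes or an SVM model on a handful of exemplary documents and uses the predicted probability of the relevant class as a download priority. It assumes scikit-learn is installed; the training documents are purely illustrative and far too few for a real comparison like the one reported in this article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Illustrative training documents: 1 = on-topic, 0 = off-topic.
docs = [
    "focused crawler for technology intelligence monitoring",
    "topical web crawling guided by relevance classifiers",
    "best-first search over the web graph for technology trends",
    "support vector machines for guiding topical crawlers",
    "recipe for chocolate cake with vanilla frosting",
    "football match results and league table updates",
    "celebrity gossip and fashion week highlights",
    "travel tips for a weekend city trip",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def build_priority_model(scheme="svm"):
    # Swap the classification scheme without touching the rest of the crawler.
    classifier = MultinomialNB() if scheme == "nb" else SVC(probability=True)
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(docs, labels)
    return model

model = build_priority_model("svm")

def download_priority(page_text):
    # Probability of the relevant class (label 1) orders the crawl frontier.
    relevant_index = list(model.classes_).index(1)
    return model.predict_proba([page_text])[0][relevant_index]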
Article
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
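Since the crawlers compared here share one baseline and differ only in the priority assignment function, a plug-in design suggests itself. The sketch below is an assumption about what such a baseline hook could look like; the 0.4 anchor-text weight is an arbitrary illustrative choice, not a value from the cited study.

import heapq

def content_priority(page_score, anchor_score):
    # Priority from the source page's content relevance only.
    return page_score

def combined_priority(page_score, anchor_score, w_anchor=0.4):
    # Weighted mix of page content and link anchor text relevance,
    # the kind of combination reported to perform best in this article.
    return (1.0 - w_anchor) * page_score + w_anchor * anchor_score

def enqueue_links(frontier, links, page_score, score_anchor, priority_fn):
    # frontier:     heap of (negative priority, url) tuples
    # links:        iterable of (url, anchor_text) pairs
    # score_anchor: callable rating anchor-text relevance in [0, 1]
    # priority_fn:  any of the functions above; the baseline stays unchanged
    for url, anchor_text in links:
        priority = priority_fn(page_score, score_anchor(anchor_text))
        heapq.heappush(frontier, (-priority, url))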
Conference Paper
Technologies are among the most crucial factors influencing a company's innovation capability and therefore its sustainable success. In innovation management, the use of search fields as an orientation guide for the identification of new products is widespread. Yet this approach has not been fully transferred to technology management. The identification of relevant technological developments, which lead to future technological opportunities and risks, is becoming more and more of a challenge. In the light of accelerating technical progress, the potential fields of observation for technology intelligence are growing exponentially. The paper presents a model of monitoring radars that uses the search field idea to orient technology intelligence. It is shown that defining the relevant information demand through search fields and corresponding strategies, taking into account the technological basis and company characteristics, leads to higher transparency and efficiency of technology intelligence. Keywords: technology management; search field; search field strategy; search field hierarchy; technology intelligence; monitoring radar.
Conference Paper
The fast and broad availability of knowledge is, at first sight, a benefit for knowledge workers in the information age. On closer examination, however, it poses a major challenge: the amount of data available these days has to be reasonably structured and conditioned. The US Library of Congress alone had collected 235 terabytes of data by April 2011. Technology intelligence, as a fundamental component of technology management, is expected to monitor these data so that technology managers are able to respond to new developments and trends just in time. Focused crawlers are possible tools for meeting this challenge efficiently: programs that explore data collections independently to identify material related to the current working context. To implement such a tool, a multitude of different approaches exists within the field of information retrieval, but they have to be selected and combined individually to fit the requirements of a particular task. Hence, before a focused crawler can make the processes of technology intelligence more efficient, the dedicated requirements have to be identified. In this paper we develop a requirements model to close this gap. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6921172
Conference Paper
Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can suffer from having only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over a time frame of 1998 to 2004. We then identify the missing papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that can locate authors' homepages and then apply focused crawling to download the desired papers, we demonstrate that it is practical to harvest academic papers missing from our digital library. Our harvester achieves a performance with an average recall level of 0.82 overall and 0.75 for the missing documents. Evaluation of the crawler's performance based on the harvest rate shows definite advantages over other crawling approaches; it consistently outperforms a defined baseline crawler on a number of measures.
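A rough sketch of the metadata-guided idea: starting from a candidate author homepage, match document links against the titles of papers missing from the collection. The regular-expression link extraction and the string-similarity cutoff are simplifying assumptions, not the heuristics of the cited system.

import difflib
import re
from urllib.parse import urljoin

def find_candidate_papers(homepage_url, homepage_html, missing_titles, cutoff=0.6):
    # Match document links on an author homepage against missing paper titles.
    candidates = []
    for href, anchor in re.findall(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', homepage_html, re.S):
        if not href.lower().endswith((".pdf", ".ps", ".ps.gz")):
            continue
        anchor_text = re.sub(r"<[^>]+>", " ", anchor).strip().lower()
        for title in missing_titles:
            similarity = difflib.SequenceMatcher(None, anchor_text, title.lower()).ratio()
            if similarity >= cutoff:
                candidates.append((urljoin(homepage_url, href), title, similarity))
    return sorted(candidates, key=lambda c: c[2], reverse=True)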
Article
The effectiveness of technology management is fundamentally influenced by the quality of a firm's technology intelligence process, i.e. the acquisition and assessment of information on technological trends. Although there is a vast literature on different technology intelligence methods, there is a lack of research on the factors influencing the choice of appropriate technology intelligence methods in a specific situation. This paper presents the results of an exploratory case study research in 25 leading European and North American companies in the pharmaceutical, telecommunications equipment and automobile/machinery industries. Major contingency factors of the selection of technology intelligence methods in multinationals are identified and integrated into a contingency-based framework for the use of technology intelligence methods.