Article

Trend detection through temporal link analysis


Abstract

Although time has been recognized as an important dimension in the co-citation literature, to date it has not been incorporated into the analogous process of link analysis on the Web. In this paper, we discuss several aspects and uses of the time dimension in the context of Web information retrieval. We describe the ideal case— where search engines track and store temporal data for each of the pages in their repository, assigning timestamps to the hyperlinks embedded within the pages. We introduce several applications which benefit from the availability of such timestamps. To demonstrate our claims, we use a somewhat simplistic approach, which dates links by approximating the age of the page's content. We show that by using this crude measure alone it is possible to detect and expose significant events and trends. We predict that by using more robust methods for tracking modifications in the content of pages, search engines will be able to provide results that are more timely and better reflect current real-life trends than those they provide today.
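The "somewhat simplistic" dating approach described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): every hyperlink on a page inherits the approximate age of the page itself, taken here from the HTTP Last-Modified header. The example URL and the use of the requests and BeautifulSoup libraries are illustrative assumptions.

```python
# A minimal sketch of the "crude" link-dating idea: each out-link of a page is
# stamped with the page's Last-Modified timestamp. Not the authors' code.
from email.utils import parsedate_to_datetime

import requests
from bs4 import BeautifulSoup


def date_links(page_url):
    """Return (target_url, approximate_timestamp) pairs for a page's out-links."""
    response = requests.get(page_url, timeout=10)
    last_modified = response.headers.get("Last-Modified")
    # Fall back to None when the server does not report a usable date.
    timestamp = parsedate_to_datetime(last_modified) if last_modified else None

    soup = BeautifulSoup(response.text, "html.parser")
    return [(a["href"], timestamp) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for target, ts in date_links("https://example.com/"):  # illustrative URL
        print(ts, "->", target)
```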


... Also, several works have contributed to the temporal characterization of information on the web [6] [3] [7]. More recently, attention has turned to tasks related to result ranking [8], web mining [9], summarization [10] or question answering [11]. Overall, this is an area still in its infancy. ...
... The presented version of BuzzRank is very expensive both in terms of computing power and storage requirements because it needs to store the entire graph and perform PageRank computations for each time interval. Amitay et al. [9] observe HTTP headers, particularly the Last-Modified field, and are able to reveal significant events and trends. The Last-Modified field is used to approximate the age of the page's content. ...
... This erratic behavior is generally attributed to incorrectly configured web servers. Several independent studies have estimated that 35% to 80% of web documents have valid Last-Updated values [6] [9] [4]. Despite this problem, it is possible to explore this information as shown by Amitay et al. [9] ...
Article
Web Information Retrieval (WebIR) is the application of Information Retrieval concepts to the World Wide Web. The most successful approaches in this field have modeled the web's structure as a directed graph and explored this concept using different approaches. Within this line of research, HITS and PageRank are two of the most well known paradigms for evaluating the importance of web documents. Most of this research has origins in the area of citation analysis, but although time is an important dimension in the citation analysis literature, it has not been explored in depth within WebIR. Recent studies show that the web is a highly dynamic environment, with significant changes occurring weekly. Blogspace is a good example of this very active behavior. In this work, temporal web evidence is identified and categorized according to two classes, one based on features extracted from individual documents and the other based on features extracted from the whole web. Also, a broad survey of previous work exploring temporal evidence is presented. Finally, ideas for exploring temporal web evidence in typical web tasks are briefly discussed. The lack of suitable corpora containing temporal evidence has been a deterrent to research in this field. The recent availability of public datasets containing temporal information has raised public awareness of this topic.
... As stated in [11], local trust scores are preferred in social networks with many controversial users. The blogosphere is a network that gathers people with different and often contradictory beliefs, and thus, local and decentralized rating schemes for content and users are better than a universal rating mechanism [8], [13], [9]. The models that capture the freshness of references between web pages [1], [2], or scientific papers [3], [15], are based on the fact that ranking algorithms favor old pages and balance this bias using a link (or citation) weighting scheme, which is based on the age of the web page (or paper). In a post-processing step the authority of a pointed node decays, based on the node's age and the age of its incoming links. ...
... This factor is introduced on the basis that the BS_j blogroll provides a list of friendly blog sites frequently accessed/read by the authors of BS_j. It has been assumed that BR_{BSj}^{tp=k}(BS_i) lies within the [0, 1] range, where a value close to 1 indicates that the target BS_i is a friendly blog site to the evaluator BS_j. In the context of this study, BR_{BSj}^{tp=k}(BS_i) is modeled as a variable assuming values that lie within the [0, 1] range, taking the value of 0 in case BS_i does not belong to the blogroll of BS_j at time period k. ...
... It has been assumed that BR_{BSj}^{tp=k}(BS_i) lies within the [0, 1] range, where a value close to 1 indicates that the target BS_i is a friendly blog site to the evaluator BS_j. In the context of this study, BR_{BSj}^{tp=k}(BS_i) is modeled as a variable assuming values that lie within the [0, 1] range, taking the value of 0 in case BS_i does not belong to the blogroll of BS_j at time period k. In essence, BS_j provides a rating of the friendly blog sites in its blogroll, which could be exploited in order to differentiate the BR_{BSj}^{tp=k}(BS_i) factor for the friendly blog sites comprised in the BS_j blogroll. ...
Conference Paper
Full-text available
The blogosphere is a part of the Web, enhanced with several characteristics that differentiate blogs from traditional websites. The number of different authors, the multitude of user-provided tags, the inherent connectivity between blogs and bloggers, the high update rate, and the time information attached to each post are some of the features that can be exploited in various information retrieval tasks. Traditional search engines perform poorly on blogs since they do not cover these aspects. In an attempt to exploit these features, this paper proposes a personalized recommendation model, which capitalizes on a collaborative rating mechanism that exploits the hyperlinks between blogs. The model assumes that the intention of a blog owner who creates a link to another blog is to provide a recommendation to the blog readers and quantifies this intention in a local score for the blog being pointed. A set of implicit and explicit links between any two blogs affect the exchanged score. The process is iterative and takes into account the opinion of a set of affiliated blogs and the freshness of links.
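A hedged sketch of the kind of iterative, link-based rating scheme described above: each blog passes a recommendation score to the blogs it links to, weighted by the freshness of the link. The half-life decay and the damping-style update below are illustrative assumptions, not the paper's exact formulation.

```python
# Iterative propagation of recommendation scores over blog links, with each
# link's contribution decayed by its age. A sketch, not the paper's model.
import math
from collections import defaultdict


def rate_blogs(links, half_life_days=90.0, iterations=20, retain=0.5):
    """links: iterable of (source_blog, target_blog, link_age_days)."""
    links = list(links)
    blogs = {b for s, t, _ in links for b in (s, t)}
    scores = {b: 1.0 for b in blogs}

    out_weights = defaultdict(list)
    for source, target, age_days in links:
        freshness = math.exp(-math.log(2) * age_days / half_life_days)
        out_weights[source].append((target, freshness))

    for _ in range(iterations):
        incoming = defaultdict(float)
        for source, targets in out_weights.items():
            total = sum(w for _, w in targets) or 1.0
            for target, w in targets:
                incoming[target] += scores[source] * w / total
        # Each blog keeps part of its previous score and absorbs fresh recommendations.
        scores = {b: retain * scores[b] + (1 - retain) * incoming[b] for b in blogs}
    return scores
```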
... Detection of global trends in the community is an additional valuable source of information for constructing recommendations. In [3], the evolution of the relationship graphs over time is analyzed. The application of the proposed method lies in the improved detection of current real-life trends in search engines. ...
... The application of the proposed method lies in the improved detection of current real-life trends in search engines. In the future, new search methods for folksonomies should support adaptation of [3] to the Web 2.0 scenario. Kleinberg [19] summarizes several different approaches to analyze online information streams over time. ...
Conference Paper
Full-text available
Web 2.0 applications like Flickr, YouTube, or Del.icio.us are increasingly popular online communities for creating, editing and sharing content. However, the rapid increase in size of online communities and the availability of large amounts of shared data make discovering relevant content and finding related users a difficult task. Web 2.0 applications provide a rich set of structures and annotations that can be mined for a variety of purposes. In this paper we propose a formal model to characterize users, items, and annotations in Web 2.0 environments. Based on this model we propose recommendation mechanisms using methods from social network analysis, collaborative filtering, and machine learning. Our objective is to construct collaborative recommender systems that predict the utility of items, users or groups based on the multi-dimensional social environment of a given user.
... Together, these two characteristics imply that the Web has become an exceptionally potent repository of programmatically accessible data. Some of the most provocative recent Web applications are those that gather and process large-scale Web data, such as virtual tourism [33], knowledge extraction [15], Web site trust assessment [24], and emerging trend detection [6]. New Web services that want to take advantage of Web-scale data face a high barrier to entry. ...
Conference Paper
Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves. This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and also relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter indexing to cheaply match a page against millions of filters, and the use of document and filter partitioning to scale our prototype implementation to high document throughput and large numbers of filters. We argue that the low-latency, high selectivity, and scalable nature of our system makes it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.
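The filter-evaluation idea described above can be sketched briefly: clients inject filters, and the crawler matches every fetched page against all of them, notifying the owners of matching filters. The regex-based filter language and the callback interface below are assumptions for illustration, not the system's actual API.

```python
# A minimal sketch of an extensible-crawler filter loop: crawling is decoupled
# from deciding whether a page is of interest to a given client.
import re
from collections import defaultdict


class ExtensibleCrawler:
    def __init__(self):
        self.filters = defaultdict(list)  # client_id -> compiled patterns

    def inject_filter(self, client_id, pattern):
        self.filters[client_id].append(re.compile(pattern, re.IGNORECASE))

    def process_page(self, url, content, notify):
        """Evaluate every client's filters against one crawled page."""
        for client_id, patterns in self.filters.items():
            if any(p.search(content) for p in patterns):
                notify(client_id, url)


crawler = ExtensibleCrawler()
crawler.inject_filter("trend-detector", r"\bbreaking news\b")
crawler.process_page("https://example.com/post",
                     "Breaking news: temporal link analysis works.",
                     notify=lambda client, url: print(client, "matched", url))
```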
... Amitay et al. [1] proposed a method to detect trends using time-stamped links. The authors assumed that links were time-stamped with the Last-Modified time of their source pages. ...
... Kleinberg's burst detection algorithm (Kleinberg, 2003) is most widely used to identify emerging topics in science (Amitay et al., 2004; Naaman et al., 2011; Small et al., 2014). ...
Article
This paper aims to synthetically investigate invention profiles and uneven growth of technological knowledge in the emerging nano-energy field, based on patents data extracted from the Derwent Innovation Index (DII) database during the time period 1991–2012. The trend analysis shows that invention in this field has experienced enormous growth and also diversification over the past 22 years. The co-occurrence network of burst technology domains reveals that technology domains constantly burst, and innovative progress in nanotechnology has tremendously contributed to energy production, storage, conversion and harvesting and so on. Nano-energy patented inventions mainly come from a combinatorial process with a very limited role of developing brand-new technological capabilities. Reusing existing technological capabilities including recombination reuse, recombination creation and single reuse is the primary source of inventions. For the impacts of technology networks' embeddedness, we find that network tie strength suppresses the growth of technological knowledge domains, and network status and convergence both facilitate the growth of technological knowledge domains. We expect that this study will provide some enlightenment for inventing or creating new knowledge in emerging fields in complex technological environment.
... VCPs take advantage of the supervised classification framework in (Lichtenwalter et al. 2010), which involves undersampling, bootstrap aggregation, and random forest or random subspace classification algorithms by substituting the simple feature vector derived from topological analysis with the VCP. There are several other supervised classification frameworks (Al Hasan et al. 2006; Wang et al. 2007) for link prediction that use basic topological characteristics, unsupervised link predictors, node attributes, and other information to construct their feature vectors. ...
Article
Full-text available
We describe the vertex collocation profile (VCP) concept. VCPs provide rich information about the surrounding local structure of embedded vertex pairs. VCP analysis offers a new tool for researchers and domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other context. The same resolution that gives the VCP method its analytical power also enables it to perform well when used to accomplish link prediction. We first develop the theory, mathematics, and algorithms underlying VCPs. We provide timing results to demonstrate that the algorithms scale well even for large networks. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across different network families. Unlike many analytical tools, VCPs inherently generalize to multirelational data, which provides them with unique power in complex modeling tasks. To demonstrate this, we apply the VCP method to longitudinal networks by encoding temporally resolved information into different relations. In this way, the transitions between VCP elements represent temporal evolutionary patterns in the longitudinal network data. Results show that VCPs can use this additional data, typically challenging to employ, to improve predictive model accuracies. We conclude with our perspectives on the VCP method and its future in network science, particularly link prediction.
... Amitay et al. [1] have also identified aspects of temporal locality in Web domains, showing that incorporating hyperlink timestamps into link-based page-ranking algorithms can improve retrieval accuracy. ...
Conference Paper
Many relational domains contain temporal information and dynamics that are important to model (e.g., social networks, protein networks). However, past work in relational learning has focused primarily on modeling static "snapshots" of the data and has largely ignored the temporal dimension of these data. In this work, we extend relational techniques to temporally-evolving domains and outline a representational framework that is capable of modeling both temporal and relational dependencies in the data. We develop efficient learning and inference techniques within the framework by considering a restricted set of temporal-relational dependencies and using parameter-tying methods to generalize across relationships and entities. More specifically, we model dynamic relational data with a two-phase process, first summarizing the temporal-relational information with kernel smoothing, and then moderating attribute dependencies with the summarized relational information. We develop a number of novel temporal-relational models using the framework and then show that the current approaches to modeling static relational data are special cases within the framework. We compare the new models to the competing static relational methods on three real-world datasets and show that the temporal-relational models consistently outperform the relational models that ignore temporal information—achieving significant reductions in error ranging from 15% to 70%.
... Research on developing innovative trend detection techniques to be applied to various online components has been performed for a long time. Online components include weblogs [13][3], news media [25], social networks [17] and Wikipedia [7]. However, we focus on presenting the state of the art on trend detection in social networks. ...
Conference Paper
Full-text available
Extracting and representing user interests on the Social Web is becoming an essential part of the Web for personalisation and recommendations. Such personalisation is required in order to provide an adaptive Web to users, where content fits their preferences, background and current interests, making the Web more social and relevant. Current techniques analyse user activities on social media systems and collect structured or unstructured sets of entities representing users' interests. These sets of entities, or user profiles of interest, are often missing the semantics of the entities in terms of: (i) popularity and temporal dynamics of the interests on the Social Web and (ii) abstractness of the entities in the real world. State of the art techniques to compute these values are using specific knowledge bases or taxonomies and need to analyse the dynamics of the entities over a period of time. Hence, we propose a real-time, computationally inexpensive, domain independent model for concepts of interest composed of: popularity, temporal dynamics and specificity. We describe and evaluate a novel algorithm for computing specificity leveraging the semantics of Linked Data and evaluate the impact of our model on user profiles of interests.
... In the literature, different proposals analyze the Web to gain insights into what happens in society: some focus on the blogosphere, others on specific platforms like Twitter and Facebook, and the applications vary from investigating general people's concerns/opinions to measuring Hollywood stars' notoriety, from understanding politicians' popularity to identifying consumers' opinions (e.g., [2] [3] [5] [7] [8] [11] [13] [20] [23] [26] [35] [37]). The main limitations of most of these approaches are: i) they focus on definite scenarios like marketing, and therefore an effective general-purpose approach is missing; ii) they analyze specific platforms like the blogosphere, Twitter or Facebook, and therefore results represent only a portion of society; iii) their analysis is mainly based on the sole magnitude of data, and therefore it is easy to maliciously affect the input data to produce biased results; iv) they give the same importance to all Web resources, and therefore very old unrelated Web resources are considered similar to very recent and correlated Web resources. ...
Article
Full-text available
The high availability of user-generated content on the Web represents a tremendous asset for understanding various social phenomena. Methods and commercial products that exploit the widespread use of the Web as a way of conveying personal opinions have been proposed, but a critical concern is that these approaches may produce a partial, or distorted, understanding of society, because most of them focus on definite scenarios, use specific platforms, base their analysis on the sole magnitude of data, or treat the different Web resources with the same importance. In this paper, we present SIWeb (Social Interests through Web Analysis), a novel mechanism designed to measure the interest society has in a topic (e.g., a real-world phenomenon, an event, a person, a thing). SIWeb is general purpose (it can be applied to any decision-making process), cross-platform (it uses the entire Web space, from social media to websites, from tags to reviews), and time effective (it measures the time correlation between Web resources). It uses fractal analysis to detect the temporal relations behind all the Web resources (e.g., Web pages, RSS, newsgroups, etc.) that talk about a topic and combines this number with the temporal relations to give an insight into the interest society has in that topic. The evaluation of the proposal shows that SIWeb might be helpful in decision-making processes as it reflects the interest society has in a specific topic.
Article
Full-text available
In heterogeneous web information spaces, users struggle to search efficiently for relevant information. This paper proposes a mediator agent system to estimate the semantics of unknown web spaces by learning from the fragments gathered during the users' focused crawling. This process is organized as the following three tasks: (i) gathering semantic information about web spaces from personal agents during focused crawling in unknown spaces, (ii) reorganizing the information by using an ontology alignment algorithm, and (iii) providing relevant semantic information to personal agents right before focused crawling. This makes it possible for the personal agent to recognize the corresponding user's behavior in semantically heterogeneous spaces and to predict the user's search context. For the experiments, we implemented a comparison-shopping system with heterogeneous web spaces. As a result, our proposed method efficiently supported users and also reduced network traffic.
Conference Paper
Full-text available
The estimated number of static web pages in Oct 2005 was over 20.3 billion, which was determined by multiplying the average number of pages per web server based on the results of three previous studies, 200 pages, by the estimated number of web servers on the Internet, 101.4 million. However, based on the analysis of 8.5 billion web pages that we crawled by Oct. 2005, we estimate the total number of web pages to be 53.7 billion. This is because the number of dynamic web pages has increased rapidly in recent years. We also analyzed the web structure using 3 billion of the 8.5 billion web pages that we have crawled. Our results indicate that the size of the "CORE", the central component of the bow tie structure, has increased in recent years, especially in the Chinese and Japanese web.
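For clarity, the arithmetic behind the two estimates quoted above can be reproduced in a couple of lines; the figures are exactly those stated in the abstract.

```python
# Reproducing the 20.3-billion-page estimate quoted above.
pages_per_server = 200      # average from three earlier studies
servers = 101.4e6           # estimated web servers on the Internet, Oct 2005
print(pages_per_server * servers / 1e9)  # ~= 20.3 billion static pages
# versus 53.7 billion pages estimated from the authors' own 8.5-billion-page crawl
```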
Conference Paper
Full-text available
Search engines affect page popularity by making it difficult for currently unpopular pages to reach the top ranks in the search results. This is because people tend to visit and create links to the top-ranked pages. We have addressed this problem by analyzing the previous content of web pages. Our approach is based on the observation that the quality of this content greatly affects link accumulation and hence the final rank of the page. We propose detecting the content that has the greatest impact on the link accumulation process of top-ranked pages and using it for detecting high quality but unpopular web pages. Such pages would have higher ranks assigned.
Conference Paper
Full-text available
Existing search engines contain the picture of the Web from the past and their ranking algorithms are based on data crawled some time ago. However, a user requires not only relevant but also fresh information. We have developed a method for adjusting the ranking of search engine results from the point of view of page freshness and relevance. It uses an algorithm that post-processes search engine results based on the changed contents of the pages. By analyzing archived versions of web pages we estimate temporal qualities of pages, that is, general freshness and relevance of the page to the query topic over certain time frames. For the top quality web pages, their content differences between past snapshots of the pages indexed by a search engine and their present versions are analyzed. Based on these differences the algorithm assigns new ranks to the web pages without the need to maintain a constantly updated index of web documents.
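One plausible shape of such a post-processing step, sketched under stated assumptions (the paper's exact scoring is not reproduced here): compare the indexed snapshot of a page with its current version and fold the similarity into the rank score. The difflib similarity ratio, the discount factor, and the direction of the adjustment are illustrative choices only.

```python
# A hedged sketch of re-ranking by comparing an indexed snapshot with the live page.
import difflib


def freshness_adjusted_score(base_score, indexed_text, current_text, alpha=0.5):
    # Similarity in [0, 1] between what the engine indexed and what is served now.
    similarity = difflib.SequenceMatcher(None, indexed_text, current_text).ratio()
    # One interpretation: pages whose live content drifted far from the indexed
    # copy are discounted (down to a factor of alpha); the paper may weight
    # change differently, e.g. as a freshness signal.
    return base_score * (alpha + (1 - alpha) * similarity)


results = [("pageA", 0.9, "old text", "old text"),
           ("pageB", 0.8, "old text", "totally new content")]
reranked = sorted(results,
                  key=lambda r: freshness_adjusted_score(r[1], r[2], r[3]),
                  reverse=True)
print([name for name, *_ in reranked])
```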
Thesis
Full-text available
The main objective of this study was to determine whether the information-search processes and query expressions used by search engine users can be considered valid indicators for the analysis and study of reading habits and of possible interest in other content offered by libraries in Spain (such as video games or films). To this end, an analysis model is proposed to characterize the search language of internet users who use Google from Spain as their search engine, during the period 2004-2016, when retrieving information about books, reading and libraries, from a historical perspective. In this way, the aim is to add another dimension of analysis to existing studies on reading habits in general, and in Spain in particular. The research has several areas of application for the analysis of the online reader, such as support for indexing and library classification, collection and library evaluation, user-needs studies, OPAC evaluation, digital analytics for library websites or for book-industry entities such as publishers, online bookshops, metasearch engines or websites of authors and literature enthusiasts, library marketing and reading promotion, publishing marketing, altmetrics and cybermetrics, and SEO or ASEO (academic search engine optimization). The analysis of reading habits has a long tradition in the offline world, especially in Spain, where the study of reading habits is an important part of strategic research in the book industry. Different methodologies have been used, from surveys and interviews with readers and non-readers, and analyses of book and press sales, to analyses of loan logs in libraries. With the arrival of e-book reading, in the midst of the internet era, reading on paper has undergone a transformation: users read on the internet and look for their reading material (whether online, as e-books and/or on paper) through the internet, especially using search engines, of which the most used in Spain from the beginning of the century until at least its second decade is Google. It is this change in the ways of locating reading material that motivates investigating how information about reading is searched for in a search engine. Different aspects of these behaviours have previously been investigated with different techniques, within the cognitive paradigm, and especially within the discipline of Information Seeking. After reviewing user search models, such as Marcia Bates's Berrypicking model, Ellis's model, Marchionini's model, and Kuhlthau's Information Search Process model, among others, other modifiers of search behaviour have been studied, leading to studies on User Search Behaviour in search engines, especially concerning query disambiguation and expansion, longitudinal analysis of searching, and Query Intent analysis. It is precisely toward the combination of these latter subdisciplines that this study has been oriented.
For the research, in 2010 more than 30,000 search expressions (also called search phrases, queries, keywords or key phrases) related to books, reading and libraries were obtained from Google Keyword Planner, the search engine's query log, restricting the keyword search to the Spanish language and to searches performed from Spain. Subsequently, the historical data series from 2004 to 2016 was extracted from Google Trends to build a dataset for a longitudinal analysis. The keywords were classified into 27 different facets of search intent, also taking into account modifier aspects and linguistic aspects. They were therefore not classified into mutually exclusive categories; rather, a search expression could belong to several classes simultaneously, so the degree of co-occurrence between the different facets and the identified aspects was studied. The previously classified keywords were then divided along a new dimension of analysis, according to whether they were atemporal (having a long life in the historical series) or temporal, i.e., those that emerged at some point in the series and had a more or less short life. As a result of the analysis, the possibilities of faceting as an improvement on, or complement to, other techniques of query intent analysis were studied; the study model was validated so that it can serve as an initial corpus for future analyses of reading habits in Spain through the study of the demand for information in search engines; subtypes of search intent specific to the reading sector were discovered within the classic search-intent classifications (navigational, informational, transactional); additional facets were identified beyond the merely thematic ones, such as modifiers and language characteristics, which complement the facets found from another dimension of analysis; different usage patterns, new abbreviations and ways in which users express their search needs in natural language were discovered; different media and/or formats were related; and, after a purposive sample selection, several paradigmatic examples of these search trends and their possible causal relationships were examined, observing the effects on the evolution of the demand for information about reading through Google searches in Spain during the period 2004-2016. Finally, in addition to confirming its usefulness in complementing other techniques for analysing reading habits with a technique hitherto unprecedented in the book and library sector, it was observed that the demand for information about reading in Spain has declined gradually during the second decade of the 21st century, which is consistent with other research and data from studies of reading habits, this time from the perspective of online demand, i.e., through the internet.
Conference Paper
Full-text available
Social bookmarking services have become recently popular in the Web. Along with the rapid increase in the amount of social bookmarks, future applications could leverage this data for enhancing search in the Web. This paper investigates the possibility and potential benefits of a hybrid page ranking approach that would combine the ranking criteria of PageRank with the one based on social bookmarks in order to improve the search in the Web. We demonstrate and discuss the results of analytical study made in order to compare both popularity estimates. In addition, we propose a simple hybrid search method that combines both ranking metrics and we show some preliminary experiments using this approach. We hope that this study will shed new light on the character of data in social bookmarking systems and foster development of new, effective search applications for the Web.
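A minimal sketch of a hybrid ranking score in the spirit described above: a linear combination of link-based authority (assumed already normalized to [0, 1]) and social-bookmark popularity. The log-scaling, normalization and mixing weight are assumptions for illustration, not the paper's method.

```python
# A sketch of combining PageRank-style authority with bookmark popularity.
import math


def hybrid_score(pagerank, bookmark_count, max_bookmarks, weight=0.5):
    # Log-scale bookmark counts so a few very popular pages do not dominate;
    # pagerank is assumed to be pre-normalized to [0, 1].
    social = (math.log1p(bookmark_count) / math.log1p(max_bookmarks)
              if max_bookmarks else 0.0)
    return weight * pagerank + (1 - weight) * social


print(hybrid_score(pagerank=0.002, bookmark_count=150, max_bookmarks=10_000))
```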
Article
In recent years, the Web has become a popular medium for disseminating information, news, ideas, and opinions of modern society. Due to this phenomenon, Web information reflects current events and trends happening in the real world, which, in turn, has attracted a lot of interest in using the Web as a sociological research tool for detecting emerging topics and social trends. To facilitate this kind of sociological research, in this paper we study the characteristics of socio-topical web keywords sampled from a series of Thai web snapshots. A socio-topical web keyword, extracted from the content of some web pages, is a keyword relating to some topic of interest in a real-world society. The study was conducted as follows. First, the socio-topical keywords were sampled from the inverted index of each Thai web snapshot. Then, for each sampled keyword, we observed the pattern of changes in the number of documents containing the keyword and in the inverse document frequency (IDF) scores. Finally, we tried to find the relationships between the observed patterns of change and the corresponding real-world events in Thai society.
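The per-snapshot statistic studied above can be made concrete with a small sketch: track a keyword's document frequency and its inverse document frequency, IDF = log(N / df), across snapshots. The snapshot data structure and the sample numbers are illustrative assumptions.

```python
# Tracking document frequency and IDF of a socio-topical keyword over snapshots.
import math


def idf_series(snapshots, keyword):
    """snapshots: list of (label, total_docs, {keyword: doc_frequency})."""
    series = []
    for label, total_docs, doc_freq in snapshots:
        df = doc_freq.get(keyword, 0)
        idf = math.log(total_docs / df) if df else float("inf")
        series.append((label, df, idf))
    return series


print(idf_series([("2008-01", 1_000_000, {"election": 1200}),
                  ("2008-02", 1_050_000, {"election": 4800})], "election"))
```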
Article
Full-text available
Web tracking sites or Web bugs are potential but serious threats to users' privacy during Web browsing. Web sites and their associated advertising sites surreptitiously gather the profiles of visitors and possibly abuse or improperly expose them, even when visitors are unaware that their profiles are being utilized. In order to block such sites in a corporate network, most companies employ filters that rely on blacklists, but these lists are insufficient. In this paper, we propose Web tracking site detection and blacklist generation based on temporal link analysis. Our proposal analyzes traffic at the network gateway so that it can monitor all tracking sites in the administrative network. The proposed algorithm constructs a graph between sites and their visit times in order to characterize each site. Then, the system classifies suspicious sites using machine learning. We confirm that public blacklists contain at most 22-70% of the known tracking sites. The machine learning approach can identify the blacklisted sites with a true positive rate of 62-73%, which is more accurate than any single blacklist. Although the learning algorithm falsely identified 15% of unlisted sites, 96% of these were verified to be unknown tracking sites by means of manual labeling. These unknown tracking sites can serve as good candidates for entries in a new blacklist.
Article
Full-text available
In recent years, blogging has become an exploding passion among Internet communities. By combining the grassroots blogging with the richness of expression available in video, videoblogs (vlogs for short) will be a powerful new media adjunct to our existing televised news sources. Vlogs have gained much attention worldwide, especially with Google's acquisition of YouTube. This article presents a comprehensive survey of videoblogging (vlogging for short) as a new technological trend. We first summarize the technological challenges for vlogging as four key issues that need to be answered. Along with their respective possibilities, we give a review of the currently available techniques and tools supporting vlogging, and envision emerging technological directions for future vlogging. Several multimedia technologies are introduced to empower vlogging technology with better scalability, interactivity, searchability, and accessibility, and to potentially reduce the legal, economic, and moral risks of vlogging applications. We also make an in-depth investigation of various vlog mining topics from a research perspective and present several incentive applications such as user-targeted video advertising and collective intelligence gaming. We believe that vlogging and its applications will bring new opportunities and drives to the research in related fields.
Conference Paper
Full-text available
One of the grand research and industrial challenges in recent years is efficient web search, inherently involving the issue of page ranking. In this paper we address the issue of representing and quantifying web ranking trends as a measure of web pages. We study the rank position of a web page among different snapshots of the web graph and propose normalized measures of ranking trends that are comparable among web graph snapshots of different sizes. We define the rank change rate (racer) as a measure quantifying the web graph evolution. Thereafter, we examine different ways to aggregate the rank change rates and quantify the trends over a group of web pages. We outline the problem of identifying highly dynamic web pages and discuss possible future work. In our experimental evaluation we study the dynamics of web pages, especially those highly ranked.
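The paper's exact definition of racer is not reproduced in this abstract; the sketch below shows only one plausible normalized rank-change measure: the change in a page's rank position between two snapshots, divided by the snapshot size so that values are comparable across graphs of different sizes. The aggregation by mean absolute change is likewise an assumption.

```python
# A hedged sketch of a normalized rank-change measure between two web graph snapshots.
def rank_change_rate(old_rank, new_rank, old_size, new_size):
    # Normalize each rank position by its snapshot size before differencing.
    return new_rank / new_size - old_rank / old_size


def aggregate_change(pages):
    """pages: list of (old_rank, new_rank, old_size, new_size); mean absolute change."""
    rates = [abs(rank_change_rate(*p)) for p in pages]
    return sum(rates) / len(rates) if rates else 0.0


print(aggregate_change([(10, 500, 1_000_000, 1_200_000),
                        (3, 4, 1_000_000, 1_200_000)]))
```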
Conference Paper
Web archives like the Internet Archive preserve the evolutionary history of large portions of the Web. Access to them, however, is still via rather limited interfaces – a search functionality is often missing or ignores the time axis. Time-travel search alleviates this shortcoming by enriching keyword queries with a time-context of interest. In order to be effective, time-travel queries require historical PageRank scores. In this paper, we address this requirement and propose rank synopses as a novel structure to compactly represent and reconstruct historical PageRank scores. Rank synopses can reconstruct the PageRank score of a web page as of any point during its lifetime, even in the absence of a snapshot of the Web as of that time. We further devise a normalization scheme for PageRank scores to make them comparable across different graphs. Through a comprehensive evaluation over different datasets, we demonstrate the accuracy and space-economy of the proposed methods.
Article
Full-text available
The link structure of the web is analyzed to measure the authority of pages, which can be taken into account for ranking query results. Due to the enormous dynamics of the web, with millions of pages created, updated, deleted, and linked to every day, temporal aspects of web pages and links are crucial factors for their evaluation. Users are interested in important pages (i.e., pages with high authority score) but are equally interested in the recency of information. Time—and thus the freshness of web content and link structure—emanates as a factor that should be taken into account in link analysis when computing the importance of a page. So far only minor effort has been spent on the integration of temporal aspects into link-analysis techniques. In this paper we introduce T-Rank Light and T-Rank, two link-analysis approaches that take into account the temporal aspects freshness (i.e., timestamps of most recent updates) and activity (i.e., update rates) of pages and links. Experimental results show that T-Rank Light and T-Rank can produce better rankings of web pages.
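A hedged sketch of the general idea, not the T-Rank algorithm itself: bias a standard PageRank computation by weighting each link with its freshness (time since its last update), here via networkx. The exponential decay and half-life are illustrative choices; T-Rank additionally models page activity (update rates), which is omitted here.

```python
# Freshness-weighted link analysis: fresher links carry more weight in PageRank.
import math

import networkx as nx


def freshness_weighted_rank(edges, now_days, half_life_days=180.0):
    """edges: iterable of (source, target, last_update_time_in_days)."""
    graph = nx.DiGraph()
    for source, target, updated in edges:
        age = max(now_days - updated, 0.0)
        weight = math.exp(-math.log(2) * age / half_life_days)
        graph.add_edge(source, target, weight=weight)
    return nx.pagerank(graph, alpha=0.85, weight="weight")


print(freshness_weighted_rank([("a", "b", 990.0), ("c", "b", 400.0),
                               ("b", "a", 995.0)], now_days=1000.0))
```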
Article
Time has been successfully used as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary task. Classic approaches have used HTTP header fields to estimate a document's last update time. The main problem with this approach is that it is applicable to a small part of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample containing 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. If we only consider isolated documents, we are able to date 52% of them. Including the document's neighborhood, we are able to estimate the date of more than 85% of the same sample. Also, we find that estimates differ significantly according to the type of neighbors used. The most reliable estimates are based on the document's media assets, while the worst estimates are based on incoming links. These results are experimentally evaluated with a real world application using different datasets.
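A minimal sketch of neighborhood-based dating in the spirit described above: when a document's own date is missing, fall back to the dates of its media assets, out-links and in-links, in that order (media assets being the most reliable and incoming links the least, per the abstract). The aggregation by median and the use of numeric timestamps are illustrative choices.

```python
# Estimating a document's age from its neighborhood when its own date is missing.
from statistics import median


def estimate_date(own_date, media_dates, outlink_dates, inlink_dates):
    """All dates are numeric timestamps (e.g., Unix epoch seconds) or None."""
    if own_date is not None:
        return own_date
    # Try neighbor groups in decreasing order of assumed reliability.
    for neighbor_dates in (media_dates, outlink_dates, inlink_dates):
        dated = [d for d in neighbor_dates if d is not None]
        if dated:
            return median(dated)
    return None  # undatable even with its neighborhood


print(estimate_date(None, [None, 1_100_000_000, 1_110_000_000], [], [1_300_000_000]))
```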
Article
Full-text available
Wikipedia is an online encyclopedia (www.wikipedia.org), available in more than 100 languages. If we consider each article as a node and each hyperlink between articles as a link, we have a wikigraph, the link structure of Wikipedia. We can extract one wikigraph for each available language, with sizes ranging from fewer than 1,000 nodes to more than 500,000 nodes and more than 5 million links. Associated with each node there are timestamps, indicating the creation and update dates of each page, which allow us to study how the graph properties evolve over time. In the first part of our study we observe that wikigraphs maintain the main characteristics of webgraphs, for which temporal information is usually not available. We then study the temporal evolution of several topological properties of wikigraphs and relate these measures to the number of updates of the documents.
Article
Understanding the evolution of research topics is crucial to detect emerging trends in science. This paper proposes a new approach and a framework to discover the evolution of topics based on dynamic co-word networks and the communities within them. The NEViewer software was developed according to this approach and framework. Compared to existing studies and science mapping software tools, our work is innovative in three aspects: (a) it designs a longitudinal framework based on the dynamics of co-word communities; (b) it proposes a community labelling algorithm and community evolution verification algorithms; and (c) it visualizes the evolution of topics at the macro and micro level using alluvial diagrams and coloured networks, respectively. A case study in computer science and a careful assessment were carried out, demonstrating that the new method and the NEViewer software are feasible and effective.
Conference Paper
For the temporal analysis of news articles or the extraction of temporal expressions from such documents, accurate document creation times are indispensable. While document creation times are available as time stamps or HTML metadata in many cases, depending on the document collection in question, this data can be inaccurate or incomplete in others. Especially in digitally published online news articles, publication times are often missing from the article or inaccurate due to (partial) updates of the content at a later time. In this paper, we investigate the prediction of document creation times for articles in citation networks of digitally published news articles, which provide a network structure of knowledge flows between individual articles in addition to the contained temporal expressions. We explore the evolution of such networks to motivate the extraction of suitable features, which we utilize in a subsequent prediction of document creation times, framed as a regression task. Based on our evaluation of several established machine learning regressors on a large network of English news articles, we show that the combination of temporal and local structural features allows for the estimation of document creation times from the network.
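A hedged sketch of the prediction setup described above: a standard regressor over a mix of temporal features (e.g., earliest and latest temporal expression in the text) and structural features of the citation network (e.g., degrees, neighbor timestamps). The feature names, the synthetic data, and the choice of regressor are assumptions for illustration, not the paper's evaluated configuration.

```python
# Regression of document creation times from temporal and network features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Each row: [earliest_expr, latest_expr, in_degree, out_degree, mean_neighbor_time]
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X[:, 4] + 0.1 * rng.standard_normal(500)  # synthetic creation times for the demo

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out articles:", model.score(X_test, y_test))
```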
Article
Full-text available
Introduction. The strong dynamic nature of the Web is a well-known reality. Nonetheless, research on Web dynamics is still a minor part of mainstream Web research. This is largely the case in Web link analysis. In this paper we investigate and measure the impact of time in link-based ranking algorithms on a particular subset of the Web, specifically blogs. Method. Using a large collection of blog posts that span more than three years, we compare a traditional link-based ranking algorithm with a time-biased alternative, providing some insights into the evolution of link data over time. We designed two experiments to evaluate the use of temporal features in authority estimation algorithms. In the first experiment we compare time-independent and time-sensitive ranking algorithms with a reference rank based on the total number of visits to each blog. In the second, we use feedback from communication media domain experts to contrast different rankings of Portuguese news Websites. Results. The distribution of citations to a Web document over time contains valuable information. Based on several examples we show that time-independent algorithms are unable to capture the correct popularity of sites with high citation activity. Using a reference rank based on the number of visits to a site, we show that a time-biased approach has a better performance. Conclusions. Although both time-independent and time-aware approaches are based on the same raw data, the experiments indicate that they can be treated as complementary signals for relevance assessment by information retrieval systems. We show that temporal information present in blogs can be used to derive stable time-dependent features, which can be successfully used in the context of Web document ranking.
Article
The Web is a useful data source for knowledge extraction, as it provides diverse content virtually on any possible topic. Hence, a lot of research has been recently done for improving mining in the Web. However, relatively little research has been done taking directly into account the temporal aspects of the Web. In this chapter, we analyze data stored in Web archives, which preserve content of the Web, and investigate the methodology required for successful knowledge discovery from this data. We call the collection of such Web archives past Web; a temporal structure composed of the past copies of Web pages. First, we discuss the character of the data and explain some concepts related to utilizing the past Web, such as data collection, analysis and processing. Next, we introduce examples of two applications, temporal summarization and a browser for the past Web.
Article
Topic time reflects the temporal feature of topics in Web news pages, which can be used to establish and analyze topic models for many time-sensitive text mining tasks. However, there are two critical challenges in discovering topic time from Web news pages. The first issue is how to normalize different kinds of temporal expressions within a Web news page, e.g., explicit and implicit temporal expressions, into a unified representation framework. The second issue is how to determine the right topic time for topics in Web news. Aiming at solving these two problems, we propose a systematic framework for discovering topic time from Web news. In particular, for the first issue, we propose a new approach that can effectively determine the appropriate referential time for implicit temporal expressions and further present an effective defuzzification algorithm to find the right explanation for a fuzzy temporal expression. For the second issue, we propose a relation model to describe the relationship between news topics and topic time. Based on this model, we design a new algorithm to extract topic time from Web news. We build a prototype system called Topic Time Parser (TTP) and conduct extensive experiments to measure the effectiveness of our proposal. The results suggest that our proposal is effective in both temporal expression normalization and topic time extraction.
Article
Traditional link-based web ranking algorithms run on a single web snapshot without concern for the dynamics of web pages and links. In particular, the correlation between web page freshness and classic PageRank is negative (see [11]). For this reason, in recent years a number of authors have introduced algorithms for PageRank actualization. We introduce our new algorithm, called Actual PageRank, which generalizes some previous approaches and therefore provides better capability for capturing the dynamics of the Web. To the best of our knowledge we are the first to conduct ranking evaluations of a freshness-aware variation of PageRank on a large data set. The results demonstrate that our method achieves more relevant and fresher results than both classic PageRank and its "fresh" modifications.
Article
This study proposes a temporal analysis method to utilize heterogeneous resources such as papers, patents, and web news articles in an integrated manner. We analyzed the time gap phenomena between the three resources and two academic areas by conducting text mining-based content analysis. To this end, a topic modeling technique, Latent Dirichlet Allocation (LDA), was used to estimate the optimal time gaps among the three resources (papers, patents, and web news articles) in two research domains. The contributions of this study are summarized as follows: firstly, we propose a new temporal analysis method to understand the content characteristics and trends of multiple heterogeneous resources in an integrated manner. We applied it to measure the exact time intervals between academic areas by understanding the time gap phenomena. The results of the temporal analysis showed that the resources of the medical field were more up to date than those of the computer field and were thus disclosed to the public more promptly. Secondly, we adopted a power-law exponent measurement and content analysis to evaluate the proposed method. With the proposed method, we demonstrate how to analyze heterogeneous resources more precisely and comprehensively.
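The topic-modeling step referred to above can be sketched briefly: fit LDA over documents drawn from the different resource types and time slices, so that their topic distributions can later be aligned to estimate time gaps. The toy corpus, the scikit-learn implementation, and the parameters are illustrative assumptions, not the study's setup.

```python
# Fitting LDA over a toy mixed corpus (paper, patent, news) to obtain topic mixtures.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "nanostructured electrodes for energy storage",       # paper, 2010 (toy)
    "patent claim on nanowire solar energy harvesting",    # patent, 2012 (toy)
    "news report on nano energy battery breakthrough",     # news, 2013 (toy)
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # per-document topic mixtures, comparable across sources
```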
Article
Full-text available
The blogosphere is a part of the World Wide Web, enhanced with several characteristics that differentiate blogs from traditional websites. The number of different authors, the multitude of user-provided tags, the inherent connectivity between blogs and bloggers, the high update rate, and the time information attached to each post are some of the features that can be exploited in various information retrieval tasks in the blogosphere. Traditional search engines perform poorly on blogs since they do not cover these aspects. In an attempt to exploit these features and assist any specialized blog search engine to provide a better ranking of blogs, we propose a rating mechanism, which capitalizes on the hyperlinks between blogs. The model assumes that the intention of a blog owner who creates a link to another blog is to provide a recommendation to the blog readers, and quantifies this intention in a score transferred to the blog being pointed. A set of implicit and explicit links between any two blogs, along with the links’ type and freshness, affect the exchanged score. The process is iterative and the overall ranking score for a blog is subject to its previous score and the weighted aggregation of all scores assigned by all other blogs.
Conference Paper
In recent years, the problem of page authority computation based on user browsing behavior has attracted a lot of attention. However, the proposed methods have a number of limitations. In particular, they run on a single snapshot of a user browsing graph, ignoring the substantially dynamic nature of user browsing activity, which makes such methods recency unaware. This paper proposes a new method for computing page importance, referred to as Fresh BrowseRank. The score of a page under our algorithm equals its weight in the stationary distribution of a flexible random walk, which is controlled by recency-sensitive weights of vertices and edges. Our method generalizes some previous approaches, provides better capability for capturing the dynamics of the Web and of user behavior, and overcomes essential limitations of BrowseRank. The experimental results demonstrate that our method achieves more relevant and fresher ranking results than classic BrowseRank.
Conference Paper
This proposal is concerned with the addition of a time stamp (a date) to the triples normally used in the representation of folksonomies. We motivate our approach by its usefulness for detecting trends in social networks.
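The data structure this amounts to is small enough to sketch: the usual (user, tag, resource) folksonomy triple extended with a timestamp, turning each tag assignment into a quadruple that supports trend detection over time. The field names below are illustrative.

```python
# A timestamped tag assignment: a folksonomy triple plus a date.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TagAssignment:
    user: str
    tag: str
    resource: str
    tagged_at: datetime  # the added time stamp


posts = [TagAssignment("alice", "websearch", "http://example.org/paper",
                       datetime(2009, 5, 1, tzinfo=timezone.utc))]
```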
Conference Paper
The web was invented to quickly exchange data between scientists, but it became a crucial communication tool to connect the world. However, the web is extremely ephemeral. Most of the information published online becomes quickly unavailable and is lost forever. There are several initiatives worldwide that struggle to archive information from the web before it vanishes. However, search mechanisms to access this information are still limited and do not satisfy their users who demand performance similar to live-web search engines. This demo presents the Portuguese Web Archive, which enables search over 1.2 billion files archived from 1996 to 2012. It is the largest full-text searchable web archive publicly available [17]. The software developed to support this service is also publicly available as a free open source project at Google Code, so that it can be reused and enhanced by other web archivists. A short video about the Portuguese Web Archive is available at vimeo.com/59507267. The service can be tried live at archive.pt.
Article
Introducing and studying two types of time series, referred to as R1 and R2, we try to enrich the set of time series available for time dependent informetric studies. In a first part we focus on mathematical properties, while in a second part we check if these properties are visible in real data. This practical application uses data in the social sciences related to top Chinese universities. R1 sequences always increase over time, tending relatively fast to one, while R2 sequences have a decreasing tendency tending to zero in practical cases. They can best be used over relatively short periods of time. R1 sequences can be used to detect the rate with which cumulative data increase, while R2 sequences detect the relative rate of development. The article ends by pointing out that these time series can be used to compare innovative activities in firms. Clearly, this investigation is just a first attempt. More studies are needed, including comparisons with other related sequences.
Article
Full-text available
This paper discusses how mass digitisation of music has led to an emerging discipline of Music Information Retrieval (MIR), which has focussed more on systems than on users, and identifies the area of information need for work purposes as a focus for planned research. A literature review provides an overview of developments in MIR, pointing out its multidisciplinary nature, which causes problems in evaluation and retrieval. Two types of systems, content-based and context-based are discussed, and it is suggested that each type meets differing user needs depending on the level of specialist or interest of the user and that information behaviour and need differs according to the type of user. Evaluation is discussed, suggesting there are historical links with text retrieval while proposing music retrieval has sufficient additional complexities to justify its own discipline. A discussion of user research suggests that both content and context should be considered, and that different users respond in different ways to music, leading to the requirement for systems which reflect a variety of approaches and interpretations, needs and uses. It is proposed that a range of music industry professionals are interviewed using semi-structured interviews, and observation in order to investigate their information needs and behaviour, and that the systems they use are evaluated by existing techniques of precision and recall as well as from interview and observation data. Interview questions will be based on a semiotic music analysis framework. Analysis and discussion of the data will be by reference to existing information need models and a reflexive communication model while a cognitive information seeking and retrieval model will ground the research in current thinking. It is planned that the analysis will allow the researcher to determine whether an ideal MIR system can serve the needs of the music industry professional. Finally discussion issues are raised which highlight the holistic focus and interdisciplinary approach of the project.
Article
As the same information appears on many Web pages, we often want to know which page is the first one that discussed it, or how the information has spread on the Web as time passes. In this paper, we develop two methods: a method of detecting the first page that discussed the given information, and a method of generating a graph showing how the number of pages discussing it has changed along the timeline. To extract such information, we need to determine which pages discuss the given topic, and also need to determine when these pages were created. For the former step, we design a metric for estimating the degree of inclusion between a piece of information and a page. For the latter step, we develop a technique for extracting creation timestamps from web pages. Although timestamp extraction is a crucial component in temporal Web analysis, no research has shown how to do it in detail. Both steps are, however, still error-prone. In order to improve noise elimination, we examine not only the properties of each page but also the temporal relationships between pages. If the temporal relationships between a candidate page and other pages are unlikely given typical patterns of information spread on the Web, we eliminate the candidate page as noise. The results of our experiments show that our methods achieve high precision and are suitable for practical use.
Article
It is hard to predict what the major challenge in search will be 100 years from now. The challenge may not even be related to information retrieval itself but could be the result of shortages of electricity, network disruptions due to insurgencies, information manipulation or access denial by an uncontrolled computer-based artificial intelligence (as imagined in the science fiction movie The Terminator). Of course, we could simply extrapolate the current ongoing trends, which we know do have an effect on information storage and retrieval, and might hope this gives an indication of some of the challenges that may affect finding and understanding information in a long-term perspective. In this article our focus will be on challenges that can be traced back to Time.
Conference Paper
As the number of resources on the web exceeds by far the number of documents one can track, it becomes increasingly difficult to remain up to date on one's own areas of interest. The problem becomes more severe with the increasing fraction of multimedia data, from which it is difficult to extract a conceptual description of the contents. One way to overcome this problem is social bookmarking tools, which are rapidly emerging on the web. In such systems, users set up lightweight conceptual structures called folksonomies, thus overcoming the knowledge acquisition bottleneck. As more and more people participate in the effort, the use of a common vocabulary becomes more and more stable. We present an approach for discovering topic-specific trends within folksonomies. It is based on a differential adaptation of the PageRank algorithm to the triadic hypergraph structure of a folksonomy. The approach allows for any kind of data, as it does not rely on the internal structure of the documents. In particular, this makes it possible to consider different data types in the same analysis step. We run experiments on a large-scale real-world snapshot of a social bookmarking system.
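The differential-PageRank idea can be sketched compactly. The code below is an illustrative reduction, not the authors' implementation: it assumes each (user, tag, resource) assignment is converted into an undirected graph over the three nodes, spreads weight with a damped update, and ranks nodes by the difference between a topic-biased run and an unbiased baseline.

```python
# Illustrative sketch of a differential, preference-biased PageRank-style
# ranking over a folksonomy graph; names and parameters are assumptions.
from collections import defaultdict

def build_graph(assignments):
    """Connect user, tag and resource of each tag assignment (assumed conversion)."""
    graph = defaultdict(set)
    for user, tag, resource in assignments:
        for a, b in [(user, tag), (tag, resource), (user, resource)]:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def weight_spread(graph, preference, damping=0.7, iterations=50):
    """Damped weight spreading with a preference vector (illustrative update rule)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = damping * incoming + (1 - damping) * preference.get(n, 0.0)
        rank = new_rank
    return rank

def topic_specific_rank(graph, topic_node):
    n = len(graph)
    baseline = weight_spread(graph, {node: 1.0 / n for node in graph})  # no preference
    biased = weight_spread(graph, {topic_node: 1.0})                    # all preference on the topic
    return {node: biased[node] - baseline[node] for node in graph}      # differential ranking

if __name__ == "__main__":
    tags = [("alice", "python", "docs.python.org"),
            ("bob", "python", "realpython.com"),
            ("bob", "news", "bbc.co.uk")]
    g = build_graph(tags)
    for node, score in sorted(topic_specific_rank(g, "python").items(), key=lambda x: -x[1]):
        print(f"{node:20s} {score:+.4f}")
```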
Conference Paper
Full-text available
Several visualizations have emerged which attempt to visualize all or part of the World Wide Web. Those visualizations, however, fail to present the dynamically changing ecology of users and documents on the Web. We present new techniques for Web Ecology and Evolution Visualization (WEEV). Disk Trees represent a discrete time slice of the Web ecology. A collection of Disk Trees forms a Time Tube, representing the evolution of the Web over longer periods of time. These visualizations are intended to aid authors and webmasters with the production and organization of content, assist Web surfers making sense of information, and help researchers understand the Web.
Article
Full-text available
The past decade has witnessed the birth and explosive growth of the World Wide Web, both in terms of content and user population. Figure 1 shows the exponential growth in the number of Web servers. The number of users online has been growing exponentially as well. Whereas in 1996 there were 61 million users, at the close of 1998 over 147 million people had internet access worldwide. In the year 2000, the number of internet users more than doubled again to 400 million [1]. With its remarkable growth, the Web has popularized electronic commerce, and as a result an increasing segment of the world's population conducts commercial transactions online. From its very onset, the Web has demonstrated a tremendous variety in the size of its features. Surprisingly, we found that there is order to the apparent arbitrariness of its growth. One observed pattern is that there are many small elements contained within the web, but few large ones. A few sites consist of millions of pages, but millions of sites only contain a handful of pages. Few sites contain millions of links, but many sites have one or two. Millions of users flock to a few select sites, giving little attention to millions of others. This diversity can be expressed in mathematical fashion as a distribution of a particular form, called a power law, meaning that the probability of attaining a certain size x is proportional to 1/x to a power t, where t is greater than or equal to 1. When a distribution of some property has a power law form, the system looks the same at all length scales. What this means is that if one were to look at the distribution of site sizes for one arbitrary range, say just sites which have between 10,000 and 20,000 pages, it would look the same as for a different range, say 10 to 100 pages. In other words, zooming in or out in the distribution, one keeps obtaining the same result. It also means that if one can determine the distribution of pages per site for a range of
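As a small illustration of the power-law claim, the sketch below draws sizes from a distribution with P(x) proportional to x^(-t) and recovers the exponent with the standard maximum-likelihood estimator; the sample size and parameters are made up for illustration and are not taken from the study.

```python
# Illustrative sketch: sample from a continuous power law and recover the
# exponent with the maximum-likelihood estimator t = 1 + n / sum(ln(x_i / x_min)).
import math
import random

def sample_power_law(t, x_min, n, rng):
    """Inverse-transform sampling for a continuous power law with exponent t."""
    return [x_min * (1 - rng.random()) ** (-1 / (t - 1)) for _ in range(n)]

def estimate_exponent(samples, x_min):
    tail = [x for x in samples if x >= x_min]
    return 1 + len(tail) / sum(math.log(x / x_min) for x in tail)

if __name__ == "__main__":
    rng = random.Random(0)
    pages_per_site = sample_power_law(t=2.1, x_min=1.0, n=50_000, rng=rng)  # made-up data
    print(f"estimated exponent: {estimate_exponent(pages_per_site, x_min=1.0):.2f}")  # close to 2.1
```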
Article
Full-text available
The Web is growing and evolving at a rapid pace. The data of the Internet Archive (http://www.archive.org) represents a unique opportunity to explore and investigate how the Web evolves over time. However, it is difficult to conveniently extract the link structure of the Web at a certain point in time. We introduce the notion of TimeLinks and motivate their use as a convenient infrastructure for experimenting with the evolving link structure of the Web graph. TimeLinks are directed edges that incorporate a first and last crawler access timestamp. TimeLink collections can then be used to generate the link structure of the Web graph at any given moment in time. In particular, we are interested in how these changes of hyperlinks can be used for various ranking algorithms in the context of Web information retrieval (e.g., indegree and host-indegree of popular web sites over time). Since the Internet Archive currently contains over 100 terabytes of compressed data, our experiments were able to examine a large collection of billions of hyperlinks.
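The TimeLink structure described above can be sketched as follows; the field names and helper functions are illustrative, not the authors' schema, but they show how a collection of timestamped edges yields the link graph (and, for example, in-degrees) at any chosen moment.

```python
# Illustrative sketch of TimeLinks: directed edges with first/last crawl
# timestamps, from which the Web graph at a given moment can be reconstructed.
from dataclasses import dataclass
from datetime import date
from collections import Counter

@dataclass(frozen=True)
class TimeLink:
    source: str
    target: str
    first_seen: date   # first crawl in which the link was observed
    last_seen: date    # last crawl in which the link was observed

def graph_at(links, when):
    """Edges are assumed to exist between their first and last crawler observation."""
    return [(l.source, l.target) for l in links if l.first_seen <= when <= l.last_seen]

def indegree_at(links, when):
    return Counter(target for _, target in graph_at(links, when))

if __name__ == "__main__":
    links = [
        TimeLink("a.com", "news.com", date(2001, 1, 1), date(2002, 6, 1)),
        TimeLink("b.com", "news.com", date(2001, 9, 1), date(2003, 1, 1)),
        TimeLink("c.com", "shop.com", date(2002, 1, 1), date(2002, 3, 1)),
    ]
    print(indegree_at(links, date(2001, 10, 1)))  # Counter({'news.com': 2})
```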
Conference Paper
Full-text available
We introduce a simple and efficient method for clustering and identifying temporal trends in hyper-linked document databases. Our method can scale to large datasets because it exploits the underlying regularity often found in hyper-linked document databases. Because of this scalability, we can use our method to study the temporal trends of individual clusters in a statistically meaningful manner. As an example of our approach, we give a summary of the temporal trends found in a scientific literature database with thousands of documents
Article
Full-text available
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text document collections. This level of complexity makes an “off-the-shelf” database management and information retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained key words and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones: those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated.
Conference Paper
Full-text available
We describe a set of tools that detect when World Wide Web pages have been modified and present the modifications visually to the user through marked-up HTML. The tools consist of three components: w3newer, which detects changes to pages; snapshot, which permits a user to store a copy of an arbitrary Web page and to compare any subsequent version of a page with the saved version; and HtmlDiff, which marks up HTML text to indicate how it has changed from a previous version. We refer to the tools collectively as the AT&T Internet Difference Engine (AIDE). This paper discusses several aspects of AIDE, with an emphasis on systems issues such as scalability, security, and error conditions. 1 Introduction Use of the World Wide Web (W3) has increased dramatically over the past couple of years, both in the volume of traffic and the variety of users and content providers. The W3 has become an information distribution medium for academic environments (its original motivation), commercial...
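A condensed sketch of that snapshot-and-compare workflow is given below. It is not the actual AIDE code: the storage layout, hashing, and use of Python's difflib are assumptions standing in for w3newer, snapshot, and HtmlDiff respectively.

```python
# Illustrative snapshot-and-compare workflow: store a copy of a page, detect
# whether it changed, and produce a marked-up diff of the two versions.
import difflib
import hashlib
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")  # assumed local storage location

def snapshot(url: str, page_text: str) -> Path:
    """Save a copy of the page so later versions can be compared against it."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    path = SNAPSHOT_DIR / (hashlib.md5(url.encode()).hexdigest() + ".html")
    path.write_text(page_text, encoding="utf-8")
    return path

def has_changed(url: str, current_text: str) -> bool:
    path = SNAPSHOT_DIR / (hashlib.md5(url.encode()).hexdigest() + ".html")
    return not path.exists() or path.read_text(encoding="utf-8") != current_text

def html_diff(old_text: str, new_text: str) -> str:
    """Mark up the change as an HTML table, roughly in the spirit of an HTML diff."""
    return difflib.HtmlDiff().make_table(old_text.splitlines(), new_text.splitlines())

if __name__ == "__main__":
    old = "<p>Welcome to our site.</p>"
    new = "<p>Welcome to our redesigned site.</p>"
    snapshot("http://example.com", old)
    print(has_changed("http://example.com", new))  # True
    print(html_diff(old, new)[:80], "...")
```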
Article
Full-text available
The World Wide Web contains an enormous amount of information, but it can be exceedingly difficult for users to locate resources that are both high in quality and relevant to their information needs. We develop algorithms that exploit the hyperlink structure of the WWW for information discovery and categorization, the construction of high-quality resource lists, and the analysis of on-line hyperlinked communities.
Article
Full-text available
We introduce a simple and efficient method for clustering and identifying temporal trends in hyper-linked document databases. Our method can scale to large datasets because it exploits the underlying regularity often found in hyper-linked document databases. Because of this scalability, we can use our method to study the temporal trends of individual clusters in a statistically meaningful manner. As an example of our approach, we give a summary of the temporal trends found in a scientific literature database with thousands of documents. 1 Introduction Over the past decade, the World Wide Web has become an increasingly popular medium for publishing scientific literature. Since many researchers release preprints on the web, the scientific literature on the web is often far more timely than a similar snapshot of paper journals and proceedings, especially when one considers review and publication delays. As such, the scientific literature on the web may represent one of the more up-to-da...
Article
Full-text available
The World Wide Web contains an enormous amount of information, but it can be exceedingly difficult for users to locate resources that are both high in quality and relevant to their information needs. We develop algorithms that exploit the hyperlink structure of the WWW for information discovery and categorization, the construction of high-quality resource lists, and the analysis of on-line hyperlinked communities. 1 Introduction The World Wide Web contains an enormous amount of information, but it can be exceedingly difficult for users to locate resources that are both high in quality and relevant to their information needs. There are a number of fundamental reasons for this. The Web is a hypertext corpus of enormous size --- approximately three hundred million Web pages as of this writing --- and it continues to grow at a phenomenal rate. But the variation in pages is even worse than the raw scale of the data: the set of Web pages taken as a whole has almost no unifying structure, wi...
Article
How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them. One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
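The two notions of change used above, a binary checksum comparison and a graded degree of change over word features, can be illustrated with a short sketch; the particular feature vector and distance below are assumptions, not the study's exact definitions.

```python
# Illustrative sketch: a checksum only says whether a page changed at all, while
# cosine distance between word-frequency vectors gives a graded degree of change.
import hashlib
import math
from collections import Counter

def checksum(page_text: str) -> str:
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

def word_vector(page_text: str) -> Counter:
    return Counter(page_text.lower().split())

def degree_of_change(old_text: str, new_text: str) -> float:
    """1 - cosine similarity of the word vectors: 0.0 = identical, 1.0 = disjoint."""
    v1, v2 = word_vector(old_text), word_vector(new_text)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return 1.0 - (dot / norm if norm else 0.0)

if __name__ == "__main__":
    week1 = "stock prices fell sharply on monday"
    week2 = "stock prices rose sharply on tuesday"
    print(checksum(week1) == checksum(week2))        # False: the page changed
    print(round(degree_of_change(week1, week2), 2))  # ~0.33: a moderate change
```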
Article
We propose two new tools to address the evolution of hyperlinked corpora. First, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. Second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. We develop these tools in the context of Blogspace, the space of weblogs (or blogs). Our study involves approximately 750 K links among 25 K blogs. We create a time graph on these blogs by an automatic analysis of their internal time stamps. We then study the evolution of connected component structure and microscopic community structure in this time graph. We show that Blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding, not just in metrics of scale but also in metrics of community structure and connectedness. By randomizing link destinations in Blogspace, but retaining sources and timestamps, we introduce a concept of randomized Blogspace. Herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. Having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. We extend recent work of Kleinberg (2002) to discover dense periods of “bursty” intra-community link creation. Furthermore, we find that the blogs that give rise to these communities are significantly more enduring than an average blog.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
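The mutually reinforcing update at the heart of this formulation can be sketched in a few lines. The code below implements only the standard hub/authority iteration, not the paper's full topic-distillation procedure; the iteration count and normalization are illustrative choices.

```python
# Illustrative hub/authority iteration: authority scores sum hub scores over
# in-links, hub scores sum authority scores over out-links, then normalize.
import math

def hits(edges, iterations=50):
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority update: sum of hub scores of pages linking to the node
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # hub update: sum of authority scores of pages the node links to
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        for scores in (auth, hub):  # normalize to keep values bounded
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

if __name__ == "__main__":
    edges = [("hub1", "siteA"), ("hub1", "siteB"), ("hub2", "siteA"), ("siteB", "siteA")]
    hub, auth = hits(edges)
    print(max(auth, key=auth.get))  # 'siteA', pointed to by the best hubs
```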
Conference Paper
This paper introduces and evaluates a new paradigm, called Knowledge Agents, that incorporates agent technology into the process of domain-specific Web search. An agent is situated between the user and a search engine. It specializes in a specific domain by extracting characteristic information from search results. Domains are thus user-defined and can be of any granularity and specialty. This information is saved in a knowledge base and used in future searches. Queries are refined by the agent based on its domain-specific knowledge and the refined queries are sent to general purpose search engines. The search results are ranked based on the agent's domain-specific knowledge, thus filtering out pages which match the query but are irrelevant to the domain. A topological search of the Web for additional relevant sites is conducted from a domain-specific perspective. The combination of a broad search of the entire Web with domain-specific textual and topological scoring of results enables the knowledge agent to find the most relevant documents for a given query within a domain of interest. The knowledge acquired by the agent is continuously updated and persistently stored; thus, users can benefit from the search results of others in common domains.
Article
Today, when searching for information on the World Wide Web, one usually performs a query through a term-based search engine. These engines return, as the query's result, a list of Web sites whose contents match the query. For broad topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user. However, much information is contained in the link-structure of the World Wide Web. Information such as which pages are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two distinct types of Web sites: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing relationship: a good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light of this, he devised an algorithm aimed at finding authoritative sites. We present SALSA, a new stochastic approach for link structure analysis, which examines random walks on graphs derived from the link structure. We show that both SALSA and Kleinberg's mutual reinforcement approach employ the same meta-algorithm. We then prove that SALSA is equivalent to a weighted in-degree analysis of the link-structure of World Wide Web subgraphs, making it computationally more efficient than the mutual reinforcement approach. We compare the results of applying SALSA to the results derived through Kleinberg's approach. These comparisons reveal a topological phenomenon called the TKC effect (Tightly Knit Community) which, in certain cases, prevents the mutual reinforcement approach from identifying meaningful authorities.
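The equivalence stated above suggests a very simple sketch: approximate SALSA's authority ranking by a normalized in-degree count. The code below ignores the per-connected-component normalization of the full algorithm and is meant only to illustrate why the approach is computationally cheap.

```python
# Illustrative reduction: within a connected component, SALSA authority weights
# are proportional to in-degree, so a normalized in-degree count serves as a proxy.
from collections import Counter

def salsa_authority_scores(edges):
    """Approximate SALSA authority ranking by normalized in-degree."""
    indegree = Counter(target for _, target in edges)
    total_links = sum(indegree.values()) or 1
    return {page: count / total_links for page, count in indegree.items()}

if __name__ == "__main__":
    edges = [("h1", "cnn.com"), ("h2", "cnn.com"), ("h3", "cnn.com"), ("h3", "bbc.co.uk")]
    print(salsa_authority_scores(edges))  # cnn.com: 0.75, bbc.co.uk: 0.25
```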
Article
The Web harbors a large number of communities — groups of content-creators sharing a common interest — each of which manifests itself as a set of interlinked Web pages. Newsgroups and commercial Web directories together contain of the order of 20,000 such communities; our particular interest here is in emerging communities — those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a Web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms, the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.
Article
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a “Lost and Found” service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
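One standard way to make syntactic similarity concrete, in the spirit of this line of work, is to represent each document by its set of word k-grams (shingles) and compare documents by Jaccard resemblance. The sketch below shows only that core idea; the fingerprinting and sampling needed to scale it to the whole Web are omitted, and the parameter k is an illustrative choice.

```python
# Illustrative sketch: word k-gram "shingles" plus Jaccard resemblance as a
# measure of syntactic similarity between two documents.
def shingles(text: str, k: int = 4) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(doc_a: str, doc_b: str, k: int = 4) -> float:
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 1.0

if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog near the river bank"
    mirror = "the quick brown fox jumps over the lazy dog near the river"
    unrelated = "stock markets closed higher today after the interest rate decision"
    print(round(resemblance(original, mirror), 2))     # high: near-duplicate pages
    print(round(resemblance(original, unrelated), 2))  # 0.0: unrelated pages
```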
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
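The link-analysis component underlying such a system can be illustrated with a minimal PageRank power iteration; the damping factor, convergence threshold, and handling of dangling pages below are conventional illustrative choices rather than details taken from the paper.

```python
# Illustrative PageRank power iteration; dangling pages redistribute their
# rank mass uniformly, and iteration stops once the total change is tiny.
def pagerank(links, damping=0.85, tol=1e-8):
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    while True:
        dangling = sum(rank[p] for p in nodes if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        if sum(abs(new_rank[p] - rank[p]) for p in nodes) < tol:
            return new_rank
        rank = new_rank

if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))  # 'c' collects the most rank
```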
Article
A problem, raised by Wallace (JASIS, 37, 136–145, [1986]), on the relation between the journal's median citation age and its number of articles is studied. Leaving open the problem as such, we give a statistical explanation of this relationship when replacing the median by the mean in Wallace's problem. The cloud of points found by Wallace is explained in the sense that the points are scattered over the area in the first quadrant limited by a curve of the form [formula], where E is a constant. This curve is obtained by using the Central Limit Theorem in statistics and, hence, has no intrinsic informetric foundation. The article closes with some reflections on explanations of regularities in informetrics, based on statistical, probabilistic or informetric results, or on a combination thereof.
Conference Paper
How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them. One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
Conference Paper
We propose two new tools to address the evolution of hyperlinked corpora. First, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. Second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. We develop these tools in the context of Blogspace , the space of weblogs (or blogs). Our study involves approximately 750K links among 25K blogs. We create a time graph on these blogs by an automatic analysis of their internal time stamps. We then study the evolution of connected component structure and microscopic community structure in this time graph. We show that Blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. This expansion shows no sign of abating, although measures of connectedness must plateau within two years. By randomizing link destinations in Blogspace, but retaining sources and timestamps, we introduce a concept of randomized Blogspace . Herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. Having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. We extend recent work of Kleinberg [11] to discover dense periods of "bursty" intra-community link creation.
Conference Paper
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from 3 years ago. This paper provides an in-depth description of our large-scale web search engine - the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want.
Article
store a series of large Web archives. It is now an exciting challenge for us to observe evolution of the Web. In this paper, we propose a method for observing evolution of web communities. A web community is a set of web pages created by individuals or associations with a common interest on a topic. So far, various link analysis techniques have been developed to extract web communities. We analyze evolution of web communities by comparing four Japanese web archives crawled from 1999 to 2002. Statistics of these archives and community evolution are examined, and the global behavior of evolution is described. Several metrics are introduced to measure the degree of web community evolution, such as growth rate, novelty, and stability. We developed a system for extracting detailed evolution of communities using these metrics. It allows us to understand when and how communities emerged and evolved. Some evolution examples are shown using our system.
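The metrics named above (growth rate, novelty, stability) can be sketched with simple set operations over two snapshots of a community's member pages. The definitions below are assumptions chosen for illustration; the authors' exact formulations are not reproduced.

```python
# Illustrative, assumed set-based definitions of community evolution metrics
# between two archive snapshots of the same community.
def evolution_metrics(community_old: set, community_new: set) -> dict:
    shared = community_old & community_new
    return {
        # relative change in size between the two archive snapshots
        "growth_rate": (len(community_new) - len(community_old)) / len(community_old),
        # fraction of the new snapshot that was not present before
        "novelty": len(community_new - community_old) / len(community_new),
        # fraction of the old snapshot that survived into the new one
        "stability": len(shared) / len(community_old),
    }

if __name__ == "__main__":
    community_1999 = {"siteA", "siteB", "siteC"}
    community_2002 = {"siteB", "siteC", "siteD", "siteE"}
    print(evolution_metrics(community_1999, community_2002))
    # {'growth_rate': 0.33..., 'novelty': 0.5, 'stability': 0.66...}
```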
Article
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise—that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
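A heavily simplified sketch of this burst model is shown below. Kleinberg's formulation is an infinite-state automaton over inter-arrival gaps; the version here is an illustrative two-state reduction over per-window counts with Poisson emission costs and a fixed cost gamma for entering the burst state, solved with a Viterbi-style dynamic program.

```python
# Illustrative two-state burst detector: the "burst" state assumes an event rate
# s times the baseline, entering it costs gamma, and a Viterbi-style dynamic
# program finds the cheapest state sequence over the observed window counts.
import math

def detect_bursts(counts, s=3.0, gamma=1.0):
    base_rate = max(sum(counts) / len(counts), 1e-9)
    rates = (base_rate, s * base_rate)          # state 0: baseline, state 1: burst

    def emit_cost(state, count):
        lam = rates[state]
        return lam - count * math.log(lam) + math.lgamma(count + 1)  # -log Poisson pmf

    # cost[state] = cheapest cost of any path ending in `state`
    cost = [emit_cost(0, counts[0]), gamma + emit_cost(1, counts[0])]
    back = []
    for count in counts[1:]:
        step, new_cost = [], []
        for state in (0, 1):
            stay = cost[state]
            switch = cost[1 - state] + (gamma if state == 1 else 0.0)
            step.append(state if stay <= switch else 1 - state)
            new_cost.append(min(stay, switch) + emit_cost(state, count))
        cost, back = new_cost, back + [step]
    # trace back the optimal state sequence
    states = [0 if cost[0] <= cost[1] else 1]
    for step in reversed(back):
        states.append(step[states[-1]])
    return list(reversed(states))               # 1 marks windows inside a burst

if __name__ == "__main__":
    weekly_mentions = [1, 0, 2, 1, 9, 12, 11, 2, 1, 0]  # made-up counts per window
    print(detect_bursts(weekly_mentions))               # burst flags around the 9-12-11 peak
```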
Article
We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. The goal of ARC is to compile resource lists similar to those provided by Yahoo! or Infoseek. The fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. We describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine. Keywords: Search, taxonomies, link analysis, anchor text, information retrieval. 1. Overview The subject of this paper is the design and evaluation of an automatic resource compiler. An autom...
Article
In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy "fresh," making it crucial to synchronize the copy effectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the "freshness" very significantly compared to current policies in use. 1 Introduction Local copies of remote data sources are frequently made to improve performance or availability. For instance, a data warehouse may copy remote sales and customer tables for local analysis. Similarly, a web search engine copies portions of the web, and then indexes them to help users navigate the web. In many cases, the remote source is updated ...
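One way to make the freshness notion concrete is the common Poisson change model: if an element changes at rate lambda and its copy is re-synchronized every I time units, the expected fraction of time the copy is fresh works out to (1 - e^(-lambda I)) / (lambda I). The sketch below computes this quantity for a few illustrative intervals; the numbers are not the paper's measurements.

```python
# Illustrative sketch: expected freshness of one element under periodic
# synchronization, assuming changes arrive as a Poisson process.
import math

def expected_freshness(lam: float, interval_days: float) -> float:
    """Average fraction of the sync interval during which the local copy is fresh."""
    x = lam * interval_days
    return (1 - math.exp(-x)) / x if x > 0 else 1.0

if __name__ == "__main__":
    for interval in (1, 7, 30):  # re-crawl daily, weekly, monthly (illustrative)
        print(f"change rate 0.1/day, sync every {interval:2d} days: "
              f"freshness = {expected_freshness(0.1, interval):.2f}")
```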
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. Categories and S...
Article
This paper addresses the problem of topic distillation on the World Wide Web, namely, given a typical user query to find quality documents related to the query topic. Connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. The essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. We identify three problems with the existing approach and devise algorithms to tackle them. The results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis.
Knowledge agents on the Web
  • Y Aridor
  • D Carmel
  • R Lempel
  • A Soffer
Aridor, Y., Carmel, D., Lempel, R., Soffer, A., & Maarek, Y.S. (2000, July 7–9). Knowledge agents on the Web. Proceedings of the Fourth International Workshop on Cooperative Information Agents (CIA 2000), Boston, MA. Also in M. Klusch & L. Kerschberg (Eds.), Lecture notes in artificial intelligence 1860 (pp. 15–26). Springer.
Improved algorithms for topic distillation in a hyperlinked environment
  • K Bharat
  • M Henzinger
Bharat, K., & Henzinger, M. (1998). Improved algorithms for topic distillation in a hyperlinked environment. In W. Bruce Croft, A. Moffat, C.J. van Rijsbergen, R. Wilkinson, & J. Zobel (Eds.), Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) (pp. 104–111). Melbourne, Australia: ACM.
Extracting evolution of Web communities from a series of Web archives
  • M Toyoda
  • M Kitsuregawa
Toyoda, M., & Kitsuregawa, M. (2003, August 26–30). Extracting evolution of Web communities from a series of Web archives. Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2003) (pp. 28–37), Nottingham, UK, ACM.
HTTP/1.1, Section 14: Header Field Definitions
W3C, Hypertext Transfer Protocol. (2003). HTTP/1.1, Section 14: Header Field Definitions. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Visualizing the evolution of Web ecologies
  • E H Chi
  • J Pitkow
  • J Mackinlay
  • P Pirolli
  • R Gossweiler
  • S K Card
Chi, E.H., Pitkow, J., Mackinlay, J., Pirolli, P., Gossweiler, R., & Card, S.K. (1998, April 18–23). Visualizing the evolution of Web ecologies. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '98) (pp. 400–407), Los Angeles, CA.
A signal processing approach to generating natural language reports from time series
  • S Boyd
Boyd, S. (1999). A signal processing approach to generating natural language reports from time series. Unpublished doctoral dissertation, Macquarie University, NSW, Australia.
The anatomy of a large-scale hypertextual Web search engine
  • S Brin
  • L Page
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. WWW7/Computer Networks & ISDN, 30(1–7), 107–117.
The use of journal impact factors and citation analysis for evaluation of science
  • E Garfield
Garfield, E. (1998). The use of journal impact factors and citation analysis for evaluation of science. Unpublished presentation at the Cell Separation, Hematology and Journal Citation Analysis Mini Symposium in tribute to Arne Bøyum, Rikshospitalet, Oslo, April 17, 1998. Retrieved July 25, 2004, from http://www.garfield.library.upenn.edu/papers/eval_of_science_oslo.html
Bursty and hierarchical structure in streams
  • J M Kleinberg
Kleinberg, J.M. (2002, July 23–26). Bursty and hierarchical structure in streams. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002) (pp. 91–101), Edmonton, Alberta, Canada, ACM.
TimeLinks: Exploring the evolving link structure of the Web
  • R Kraft
  • E Hastor
  • R Stata
Kraft, R., Hastor, E., & Stata, R. (2003, June). TimeLinks: Exploring the evolving link structure of the Web. Proceedings of the Second Workshop on Algorithms and Models for the Web-Graph, Budapest, Hungary.
  • Egghe