Article

Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers


Abstract

Points to the way in which computer scientists and librarians working with the World Wide Web are turning to traditional library and information science techniques, such as cataloguing and classification, to bring order to the chaos of the Web. Explores cataloguing opportunities offered by the ephemeral nature of materials on the Web and examines several of the Web's unique characteristics. Suggests coupling automated filtering and measurement to the Web record cataloguing process, with particular reference to the ephemeral nature of Web documents and the ability to measure Uniform Resource Locator (URL) and Web document characteristics and migrate them to catalogue records using automated procedures. Reports results of an ongoing longitudinal study of 361 randomly selected Web pages and their Web sites, the data being collected weekly using the Flashsite 1.01 software package. Four basic approaches to ordering information on the Web were studied: postcoordinate keyword and full-text indexes; application of both precoordinate and postcoordinate filters or identifiers to the native document by either authors or indexers; use of thesauri and other classification schemes; and bibliometric techniques employing mapping of hypertext links and other citation systems. Concludes that off-the-shelf technology exists that allows the monitoring of Web sites and Web pages to 'measure' Web page and Web site characteristics, to process quantified changes, and to write those changes to bibliographic records. Capturing semantic or meaningful change is more complex, but it can be approximated using existing software.
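The automated monitor-and-write-back workflow the abstract describes can be illustrated with a minimal sketch. This is not the Flashsite 1.01 procedure (which is not documented here); it only shows the general idea of fingerprinting weekly snapshots of a page and flagging a catalogue record when the content changes. All names below are hypothetical:

```python
import hashlib

def content_fingerprint(page_bytes: bytes) -> str:
    """Return a stable fingerprint for a fetched page's raw content."""
    return hashlib.sha256(page_bytes).hexdigest()

def update_record(record: dict, page_bytes: bytes) -> dict:
    """Compare a freshly fetched page against the stored catalogue record
    and return an updated record flagging whether the content changed."""
    new_fp = content_fingerprint(page_bytes)
    changed = "fingerprint" in record and record["fingerprint"] != new_fp
    return {**record, "fingerprint": new_fp, "changed": changed}

# Week 1: first observation, nothing to compare against yet.
record = update_record({"url": "http://example.org/page"}, b"<html>version 1</html>")
# Week 2: the page body differs, so the record is flagged as changed.
record = update_record(record, b"<html>version 2</html>")
```

A fingerprint comparison of this kind captures only that *some* change occurred; as the abstract notes, deciding whether the change is semantically meaningful is a harder problem.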


... While this project has been collecting university link data since 2000, in the context of web analysis this is a long-term perspective and has already been used to provide significant insight into the patterns and relationships inherent in academic hyperlinks [2] [3] [4]. While much research has been carried out on academic web links, and longitudinal studies have been undertaken on internet web sites and domains [5] [6] [7], the research questions in this paper have been chosen in an attempt to fill a critical gap in current webometrics research. By undertaking a longitudinal study of academic web spaces, it is hoped that patterns and trends in inlinks and outlinks over time, particularly with regard to academic research, can be identified and explained. ...
... Web link studies carried out from a purely longitudinal perspective are few and far between. A series of papers, believed to be the longest continuous study of a single set of URLs, has carried out analyses on a random selection of 361 URLs since December 1996 [5, 16, 17, 18]. Significant findings are that the half-life of a web page is approximately two years and that web page content appears to have stabilized over time. ...
Article
Full-text available
Longitudinal studies of web change are needed to assess the stability of webometric statistics and this paper forms part of an on-going longitudinal study of three national academic web spaces. It examines the relationship between university inlinks and research productivity over time and identifies reasons for individual universities experiencing significant increases and decreases in inlinks over the last six years. The findings also indicate that between 66 and 70% of outlinks remain the same year on year for all three academic web spaces, although this stability conceals large individual differences. Moreover, there is evidence of a level of stability over time for university site inlinks when measured against research productivity. Surprisingly, however, inlink counts can vary significantly from year to year for individual universities, for reasons unrelated to research, which undermines their use in webometrics studies.
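The reported 66-70% year-on-year outlink stability can be compounded to estimate stability over longer horizons. This is a back-of-the-envelope sketch only; the assumption that each year's retention rate applies independently is ours, not the study's:

```python
def surviving_outlink_fraction(annual_retention: float, years: int) -> float:
    """Fraction of an original set of outlinks still present after `years`,
    assuming the same retention rate applies independently each year."""
    return annual_retention ** years

# With roughly 68% of outlinks retained year on year, only about
# 31% of the original outlinks would remain after three years.
three_year_share = surviving_outlink_fraction(0.68, 3)
```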
... The current importance of the web has spawned many attempts to analyse sets of web pages in order to derive general conclusions about its properties and patterns of use (Crowston and Williams, 1997; Hooi-Im et al., 1998; Ingwersen, 1998; Lawrence and Giles, 1998a; Miller and Mather, 1998; Henzinger et al., 1999; Koehler, 1999; Lawrence and Giles, 1999a; Smith, 1999; Snyder and Rosenbaum, 1999; Henzinger et al., 2000; Thelwall, 2000a; Thelwall, 2000b; Thelwall, 2000d). The enormous size of the web means that any study can only use a fraction of it. ...
... Using a copy of the whole database of a large search engine (Broder et al., 2000). Using a systematic approach to select links from the directory structure of a search engine (Hooi-Im et al., 1998; Callaghan and Pie, 1998; Cockburn and Wilson, 1996; Koehler, 1999; Crowston and Williams, 2000). Selecting pages from a random walk seeded with pages from a predefined large initial set (Henzinger et al., 1999; Bar-Yossef et al., 2000; Rusmevichientong et al., 2001). ...
Article
Full-text available
There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well-known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
... Note that ideally variant C breaks down the inclusion hierarchy of A in such a way that none of its web pages is polymorphic any more. [6] reports on an analysis in which the maximum website level is 179. ...
... This paper is based upon an ongoing longitudinal study of Web page life cycle behavior initiated in December 1996 (Koehler 1999a; Koehler 1999b). Previously published findings suggest that an aging collection of Web pages and sites may very well manifest patterned and predictable behavior. ...
... Web genres and their defining conventions are "still in the process of becoming" [1]. In the present study, the term genre was used in a broad sense for describing types of functions and purposes of web sites and web pages (e.g., [11], [14], [20]). ...
Chapter
Full-text available
The chapter outlines an exploratory empirical investigation of genre connectivity in an academic web space, i.e., how web page genres are connected by links. The data set contained source and target pages on shortest link paths between different topical domains at UK universities. The pages were categorized into 9 institutional and 8 personal meta genres (bundled genre categories). The most frequent genre pairs were institutional link lists linking to institutional homepages and personal link lists linking to personal publications. Some genres function as outlink-prone 'hook' genres (e.g. link lists) and some as inlink-prone 'lug' genres (e.g. institutional homepages). A genre network graph is used to discuss web spaces as webs of genres with genre drift and topic drift, i.e., changes in page genres and page topics along link paths. Complementarities of genre drift and topic drift may affect small-world properties in the shape of short link distances between different topical clusters in academic web spaces.
... Page genres were classified by the author for all visited 530 source and target pages in the 10 path nets. The term genre is here used in a broad sense in accordance with contemporary Web terminology for describing types of functions and purposes of Web sites and Web pages (e.g., KOEHLER, 1999). The visited Web pages represented a rich diversity of genres reflecting a multitude of creators, purposes and potential audiences. ...
Article
Full-text available
Combining webometric and social network analytic approaches, this study developed a methodology to sample and identify Web links, pages, and sites that function as small-world connectors affecting short link distances along link paths between different topical domains in an academic Web space. The data set comprised 7669 subsites harvested from 109 UK universities. A novel corona-shaped Web graph model revealed reachability structures among the investigated subsites. Shortest link path nets functioned as investigable small-world link structures - 'mini small worlds' - generated by deliberate juxtaposition of topically dissimilar subsites. Indicative findings suggest that personal Web page authors and computer science subsites may be important small-world connectors across sites and topics in an academic Web space. Such connectors may counteract balkanization of the Web into insularities of disconnected and unreachable subpopulations.
Article
This webometric study identifies web links, pages, and sites that function as small-world connectors affecting short link distances across topics in an academic web space. A five-step methodology is developed to sample and identify small-world properties by zooming stepwise into more and more fine-grained web node levels and link structures among 7669 subsites harvested from 109 UK universities. The methodology includes shortest path nets functioning as investigable small-world link structures, 'mini small worlds', generated by deliberate juxtaposition of topically dissimilar subsites. The network analysis tool Pajek identified all shortest link paths within the data set between 10 pairs of subsites. The study includes a novel corona-shaped model of reachability structures in a web subgraph. Indicative findings suggest that personal link creators and computer science subsites may be important small-world connectors across sites and topics in an academic web space. Such small-world connectors are important as they counteract balkanization of the Web into insularities of disconnected and unreachable subpopulations. The study also suggests how the Web is a web of genres with richly diversified genre connectivity and with genre drift, i.e. changes in page genres along link paths that may affect small-world properties.
Article
Purpose: This paper seeks to address the question: do university web sites publish the same kind of information and use the same kind of hyperlinks year on year, or do these change over time? Design/methodology/approach: A link classification exercise is used to identify temporal changes in the distribution of different types of academic web links, using the academic web spaces of the UK, Australia and New Zealand in the years 2000 and 2006. Findings: Significant increases in “research oriented”, “social/leisure” and “superficial” links were identified, as well as notable decreases in “technical” and “personal” links. Some of the changes identified may be explained by general changes in the management of university web sites and some by more wide‐spread internet trends, e.g. dynamic pages, blogs and social networking. The increase in the proportion of research‐oriented links is particularly hopeful for future link analysis research. Originality/value: This is an important issue from the perspective of interpreting the results of academic web analyses, because changes in link types over time would also force interpretations of link analyses to change over time.
Conference Paper
Web documents present digital and hybrid librarians with a set of bibliographic management issues heretofore of no or minor significance for materials in print. These include frequent content change as well as the rate at which Web documents are removed or moved by their authors or creators. There have been a number of author-side and cataloger-side initiatives to assist in the management of Web documents, but these do not adequately address change. This paper explores some of those options. It addresses the impact of document change and demise on digital and hybrid collections. It offers suggestions on the management of the change and demise phenomena.
Article
Changes in the topography of the Web can be expressed in at least four ways: (1) more sites on more servers in more places, (2) more pages and objects added to existing sites and pages, (3) changes in traffic, and (4) modifications to existing text, graphic, and other Web objects. This article does not address the first three factors (more sites, more pages, more traffic) in the growth of the Web. It focuses instead on changes to an existing set of Web documents. The article documents changes to an aging set of Web pages, first identified and collected in December 1996 and followed weekly thereafter. Results are reported through February 2001. The article addresses two related phenomena: (1) the life cycle of Web objects, and (2) changes to Web objects. These data reaffirm that the half-life of a Web page is approximately 2 years. There is variation among Web pages by top-level domain and by page type (navigation, content). Web page content appears to stabilize over time; aging pages change less often than they once did.
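The two-year half-life reported above can be read as a rough survival curve. The exponential-decay form below is an illustrative assumption on our part; the article estimates the half-life empirically rather than fitting this model:

```python
def surviving_share(t_years: float, half_life_years: float = 2.0) -> float:
    """Share of a fixed, unweeded set of web pages still resolving after
    t_years, under a simple exponential-decay reading of the half-life."""
    return 0.5 ** (t_years / half_life_years)

share_2y = surviving_share(2.0)  # 0.5, by definition of the half-life
share_4y = surviving_share(4.0)  # 0.25, after two half-lives
```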
Article
As the web is continuously changing, perhaps growing exponentially since its inception, a major potential problem for webometrics is that web statistics may be obsolete by the time they are published in the academic literature. It is important therefore to know as much as possible about how the web is changing over time. This paper studies the UK, Australian and New Zealand academic webs from 2000 to 2005, finding that the number of static pages and links in each of the three academic webs appears to have stabilised as far back as 2001. This stabilisation may be partly due to increases in dynamic pages which are normally excluded from webometric analyses. Nevertheless, the results are encouraging evidence that webometrics for academic spaces may have a longer-term validity than would have been previously assumed.
Article
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy A problem common to all current web link analyses is that, as the web is continuously evolving, any web-based study may be out of date by the time it is published in academic literature. It is therefore important to know how web link analyses results vary over time, with a low rate of variation lengthening the amount of time corresponding to a tolerable loss in quality. Moreover, given the lack of research on how academic web spaces change over time, from an information science perspective it would be interesting to see what patterns and trends could be identified by longitudinal research, and the study of university web links seems to provide a convenient means by which to do so. The aim of this research is to identify and track changes in three academic webs (UK, Australia and New Zealand) over time, tracking various aspects of academic webs including site size and overall linking characteristics, and to provide theoretical explanations of the changes found. This should therefore provide some insight into the stability of previous and future webometric analyses. Alternative Document Models (ADMs), created with the purpose of reducing the extent to which anomalies occur in counts of web links at the page level, have been used extensively within webometrics as an alternative to using the web page as the basic unit of analysis. This research carries out a longitudinal study of ADMs in an attempt to ascertain which model gives the most consistent results when applied to the UK, Australia and New Zealand academic web spaces over the last six years. The results show that the domain ADM gives the most consistent results with the directory ADM also giving more reliable results than are evident when using the standard page model.
Aggregating at the site (or university) level appears to provide less consistent results than using the page as the standard unit of measure, and this finding holds true over all three academic webs and for each time period examined over the last six years. The question of whether university web sites publish the same kind of information and use the same kind of hyperlinks year on year is important from the perspective of interpreting the results of academic link analyses, because changes in link types over time would also force interpretations of link analyses to change over time. This research uses a link classification exercise to identify temporal changes in the distribution of different types of academic web links, using three academic web spaces in the years 2000 and 2006. Significant increases in ‘research oriented’, ‘social/leisure’ and ‘superficial’ links were identified as well as notable decreases in the ‘technical’ and ‘personal’ links. Some of the changes identified may be explained by general changes in the management of university web sites and some by more wide-spread Internet trends, e.g., dynamic pages, blogs and social networking. The increase in the proportion of research-oriented links is particularly hopeful for future link analysis research. Identifying quantitative trends in the UK, Australian and New Zealand academic webs from 2000 to 2005 revealed that the number of static pages and links in each of the three academic webs appears to have stabilised as far back as 2001. This stabilisation may be partly due to an increase in dynamic pages which are normally excluded from webometric analyses. In response to the problem for webometricians due to the constantly changing nature of the Internet, the results presented here are encouraging evidence that webometrics for academic spaces may have a longer-term validity than would have been previously assumed.
The relationship between university inlinks and research activity indicators over time was examined, as well as the reasons for individual universities experiencing significant increases and decreases in inlinks over the last six years. The findings indicate that between 66% and 70% of outlinks remain the same year on year for all three academic web spaces, although this stability conceals large individual differences. Moreover, there is evidence of a level of stability over time for university site inlinks when measured against research. Surprisingly, however, inlink counts can vary significantly from year to year for individual universities, for reasons unrelated to research, underlining that webometric results should be interpreted cautiously at the level of individual universities. Therefore, on average since 2001 the university web sites of the UK, Australia and New Zealand have been relatively stable in terms of size and linking patterns, although this hides a constant renewing of old pages and areas of the sites. In addition, the proportion of research-related links seems to be slightly increasing. Whilst the former suggests that webometric results are likely to have a surprisingly long shelf-life, perhaps closer to five years than one year, the latter suggests that webometrics is going to be increasingly useful as a tool to track research online. While there have already been many studies involving academic web spaces, and much work has been carried out on the web from a longitudinal perspective, this thesis concentrates on filling a critical gap in current webometric research by combining the two and undertaking a longitudinal study of academic webs. In comparison with previous web-related longitudinal studies this thesis makes a number of novel contributions.
Some of these stem from extending established webometric results, either by introducing a longitudinal aspect (looking at how various academic web metrics such as research activity indicators, site size or inlinks change over time) or by their application to other countries. Other contributions are made by combining traditional webometric methods (e.g. combining topical link classification exercises with longitudinal study) or by identifying and examining new areas for research (for example, dynamic pages and non-HTML documents). No previous web-based longitudinal studies have focused on academic links, and so the main findings, that (for UK, Australian and New Zealand academic webs between 2000 and 2006) certain academic link types exhibit changing patterns over time, that approximately two-thirds of outlinks remain the same year on year, and that the number of static pages and links appears to have stabilised, are significant and novel.
Article
Web pages and Web sites, some argue, can either be collected as elements of digital or hybrid libraries, or, as others would have it, the WWW is itself a library. We begin with the assumption that Web pages and Web sites can be collected and categorized. The paper explores the proposition that the WWW constitutes a library. We conclude that the Web is not a digital library. However, its component parts can be aggregated and included as parts of digital library collections. These, in turn, can be incorporated into "hybrid libraries." These are libraries with both traditional and digital collections. Material on the Web can be organized and managed. Native documents can be collected in situ, disseminated, distributed, catalogued, indexed, controlled, in traditional library fashion. The Web therefore is not a library, but material for library collections is selected from the Web. That said, the Web and its component parts are dynamic. Web documents undergo two kinds of change. The first type, the type addressed in this paper, is "persistence" or the existence or disappearance of Web pages and sites, or in a word the lifecycle of Web documents. "Intermittence" is a variant of persistence, and is defined as the disappearance but reappearance of Web documents. At any given time, about five percent of Web pages are intermittent, which is to say they are gone but will return. Over time a Web collection erodes. Based on a 120-week longitudinal study of a sample of Web documents, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. That is to say, an unweeded Web document collection created two years ago would contain the same number of URLs, but only half of those URLs point to content. The second type of change Web documents experience is change in Web page or Web site content.
Again based on the Web document samples, very nearly all Web pages and sites undergo some form of content change within the period of a year. Some change content very rapidly while others do so infrequently (Koehler, 1999a). This paper examines how Web documents can be efficiently and effectively incorporated into library collections. This paper focuses on Web document lifecycles: persistence, attrition, and intermittence. While the frequency of content change has been reported (Koehler, 1999a), the degree to which those changes affect meaning and therefore the integrity of bibliographic representation is not yet fully understood. The dynamics of change set Web libraries apart from the traditional library as well as many digital libraries. This paper seeks then to further our understanding of the Web page and Web site lifecycle. These patterns challenge the integrity and the usefulness of libraries with Web content. However, if these dynamics are understood, they can be controlled for or managed.
Article
This seventeen-month IIa Internet Study addressed the problem of access and retrieval of relevant information for the end user in the evolving digital environment. The effort evaluated World Wide Web search engines and commercial Internet resource search and maintenance tools for efficacy in creating an Area Studies Digital Library (ASDL) collection. The researchers identified document types unique to the Internet and applied bibliographic and data transfer standards. These standards were modified to respond to, and take advantage of, the unique features of the Internet environment. Commercial software, limited programming, and a custom database application were integrated to create a dynamically maintainable catalog of Internet and World Wide Web resources in scope for the collection. In the course of the effort, the researchers developed search strategies which contributed to comprehensive coverage and minimized the manual effort required to produce a usable collection. The two key products of the project were an evaluated database of Internet resources and a semiautomated methodology for development and maintenance of a specialized Internet collection.
Article
Critically assesses the value of the Internet to the library and information science profession by examining its role in three areas: the impact on academic and scholarly periodical publishing and the dissemination of information; the contribution to reference work; and the impact on recreational reading. Concludes that the Internet will be most useful in the publication and distribution of scholarly electronic journals and reviews recent trends in Internet published electronic periodicals and the problems requiring solution, particularly those relating to bibliographic description. Commercial publishers are taking a keen interest in these developments but it remains to be seen how the academic community will react, given their past criticisms of periodical publishers. The design of electronic periodicals will probably be brought more into line with their printed counterparts. Screen layouts and display features will need to be improved to facilitate skimming and browsing and, if the hypertext linking facilities are retained, then a major breakthrough in readability will be achieved. Nevertheless the issues of authority, accuracy and currency of information remain to be solved.
Article
Links in hypertext and hypermedia are guides for users to browse, navigate and locate digital information. They function, in many ways, as index terms for collections of digital objects. The quality of such links usually determines how well hypertext and hypermedia information can be represented intellectually. Since users of the World Wide Web, an excellent tool for organizing and presenting hypertext and hypermedia information, have realized that it is hard to locate precise information from the Internet even with the help of various search engines, it would be interesting to look at the origin of the problem. That is, how much information on the Web contains quality links within the hyper-networked structure? The present author explored this research question by surveying documents selected from the Web sites of the Alexandria Project, CNET, the Library of Congress and the National Information Standards Organization. The quality of hyperlinks was measured by two widely used indexing parameters: exhaustivity and specificity. The study also offers advice and suggestions for people to create quality hyperlinks for their digital collections.
Article
Describes a study of the retrieval results of World Wide Web search engines. Research quantified accurate matches versus matches of arguable quality for 200 subjects relevant to undergraduate curricula. Both "evaluative" engines (Magellan, Point Communications) and "nonevaluative" engines (Lycos, InfoSeek, AltaVista) were examined. Taking into account indexing depth and searching vagaries, AltaVista and InfoSeek performed the best.
Article
It is well known that selectivity leaves a lot to be desired in searching for information resources on the Internet with existing search systems (Desai, 1995c). This has prompted a number of researchers to turn their attention to the development and implementation of models for indexing and searching information resources on the Internet. In this article, we examine briefly the results of a simple query on a number of existing search systems and then discuss two proposed index metadata structures for indexing and supporting search and discovery: The Dublin Core Elements List and the Semantic Header. We also present an indexing and discovery system based on the Semantic Header. © 1997 John Wiley & Sons, Inc.