Invisible institutional repositories
Addressing the low indexing ratios of
IRs in Google Scholar
Kenning Arlitsch and Patrick S. O’Brien
J. Willard Marriott Library, University of Utah, Salt Lake City,
Utah, USA
Purpose – Google Scholar has difficulty indexing the contents of institutional repositories, and the
authors hypothesize the reason is that most repositories use Dublin Core, which cannot express
bibliographic citation information adequately for academic papers. Google Scholar makes specific
recommendations for repositories, including the use of publishing industry metadata schemas over
Dublin Core. This paper aims to test a theory that transforming metadata schemas in institutional
repositories will lead to increased indexing by Google Scholar.
Design/methodology/approach – The authors conducted two surveys of institutional and
disciplinary repositories across the USA, using different methodologies. They also conducted three
pilot projects that transformed the metadata of a subset of papers from USpace, the University of
Utah’s institutional repository, and examined the results of Google Scholar’s explicit harvests.
Findings – Repositories that use GS-recommended metadata schemas and express them in HTML
meta tags experienced significantly higher indexing ratios. The ease with which search engine
crawlers can navigate a repository also seems to affect indexing ratio. The second and third metadata
transformation pilot projects at Utah were successful, ultimately achieving an indexing ratio of greater
than 90 percent.
Research limitations/implications – The second survey is limited to 40 titles from each of seven
repositories, for a total of 280 titles. A larger survey that covers more repositories may be useful.
Practical implications – Institutional repositories are achieving significant mass, and the rate of
author citations from those repositories may affect university rankings. Lack of visibility in Google
Scholar, however, will limit the ability of IRs to play a more significant role in those citation rates.
Social implications – Transforming metadata can be a difficult and tedious process. The Institute
of Museum and Library Services has recently awarded a National Leadership Grant to the University
of Utah to continue SEO research with its partner, OCLC Inc., and to develop a toolkit that will include
automated transformation mechanisms.
Originality/value – Little or no research has been published about improving the indexing ratio of
institutional repositories in Google Scholar. The authors believe that they are the first to address the
possibility of transforming IR metadata to improve indexing ratios in Google Scholar.
Keywords Search engines, Digital libraries, Google Scholar, Institutional repositories,
Search engine optimization, Metadata
Paper type Research paper
Library Hi Tech
Vol. 30 No. 1, 2012
pp. 60-81
© Emerald Group Publishing Limited
DOI 10.1108/07378831211213210
Received October 2011
Revised November 2011
Accepted November 2011
The authors would like to thank Dr Awesome for her expertise, edits, and unflagging support.

Search engine optimization (SEO) research conducted at the University of Utah has
revealed that many institutional repositories (IRs) have a low indexing ratio[1] in Google
Scholar (GS). IRs were developed to manage and ensure long-term access to academic
publications, and GS was created as a search engine for those publications, whether they
reside in IRs, at publisher repositories, or other research-oriented sites. This paper
addresses the reasons for the low indexing ratio of many IRs, which the authors believe
stem mainly from the metadata requirements of GS, which stand in contrast to the
practices of many IRs. The authors conducted two surveys of IRs across the country and
implemented three pilot projects designed to increase the indexing ratio of the IR at Utah.
Transforming the metadata schema in those pilot projects led to a significant improvement
in GS indexing ratio of the sample set. Additional reasons for the low indexing ratio of IRs
can be tied to the ease with which GS’s crawlers can navigate a given repository.
While much has been written about search engine optimization for general websites,
very little has been published about SEO specifically for digital repositories and even
less for institutional or disciplinary repositories. The subject of this paper developed
from more general digital repository SEO research the authors have conducted at the
University of Utah’s J. Willard Marriott Library for the past 18 months, research whose
continuation has recently been funded by a National Leadership Grant from the
Institute of Museum and Library Services (IMLS). OCLC is a formal partner on this
grant. The authors have dramatically improved the indexing ratio of Utah’s digital
library (including IR) in Google’s main index, and some of that work will be discussed
as background.
Digital repositories relevant to this article are defined as databases that store
digitized or born-digital objects, making them freely accessible to the public. These
repositories typically run on web server technologies, use descriptive metadata, and,
because their missions are generally directed at open access, they all benefit from
successful harvesting and indexing by internet search engines. IRs are that subset of
digital repositories that capture and manage the intellectual output of academic
institutions or disciplines.
Research question
The USpace IR at the University of Utah currently experiences an indexing ratio of less
than 0.1 percent in GS, even though the indexing ratio for the same repository
improved from approximately 18 to 98 percent in Google’s main index after numerous
SEO problems were addressed. The authors hypothesized that the average indexing
ratio in GS of IRs across the USA is low, and that altering repository metadata to
follow one of the publishing industry schemas recommended by Google Scholar would
lead to a substantial improvement in the indexing ratio of USpace content.
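To make the hypothesis concrete, the following Python sketch shows how a single repository record might be expressed as Highwire Press style meta tags, one of the publishing industry schemas named in GS's inclusion guidelines. The record layout, the function name, and the PDF URL are assumptions made for illustration; they are not taken from USpace or from the authors' pilot templates.

```python
from html import escape

# Hypothetical repository record. The field names are illustrative and do not
# correspond to any particular repository platform.
RECORD = {
    "title": "Invisible institutional repositories",
    "creators": ["Arlitsch, Kenning", "O'Brien, Patrick S."],
    "date": "2012",
    "journal": "Library Hi Tech",
    "volume": "30",
    "issue": "1",
    "first_page": "60",
    "last_page": "81",
    "pdf_url": "https://repository.example.edu/papers/example.pdf",  # placeholder
}


def citation_meta_tags(record):
    """Return <meta> elements that use Highwire Press style citation_* names."""
    pairs = [("citation_title", record["title"])]
    # One citation_author tag per author, in the order given.
    pairs += [("citation_author", name) for name in record["creators"]]
    pairs += [
        ("citation_publication_date", record["date"]),
        ("citation_journal_title", record["journal"]),
        ("citation_volume", record["volume"]),
        ("citation_issue", record["issue"]),
        ("citation_firstpage", record["first_page"]),
        ("citation_lastpage", record["last_page"]),
        ("citation_pdf_url", record["pdf_url"]),
    ]
    # escape() protects quotes and angle brackets inside attribute values.
    return [f'<meta name="{name}" content="{escape(value)}">' for name, value in pairs]


if __name__ == "__main__":
    print("\n".join(citation_meta_tags(RECORD)))
```

Journal title, volume, issue, and page range each receive their own tag here, which is the kind of citation detail the authors argue Dublin Core cannot express adequately.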
Research method
The authors conducted two surveys to identify the indexing ratio of other IRs, and the
methodologies used for both are explained more fully in the survey methodologies
section. In brief, the authors selected repositories from the OpenDOAR directory
(University of Nottingham, 2011), gathered total item numbers from each repository,
and then systematically sampled GS for evidence that those items had been included in
its index. Two surveys (survey 1 and survey 2) were conducted, each using a
substantially different searching methodology.
To test improvements to the USpace IR the authors gathered indexing ratio
statistics and created a feedback loop to measure results of changes they made to
repository items, demonstrated in three pilot studies. They submitted sitemaps
containing URLs for all items in USpace to GS, and then used Google Webmaster Tools
(Google, 2011) to observe the activity of crawlers and the resulting indexing ratios. The
authors adjusted the metadata for a subset of repository articles, and following a
re-harvest they gathered additional statistics (pilot 1). After this approach failed, the
authors conducted discussions with OCLC and GS to confirm a new approach, which
led to the development of metadata templates for various academic paper types whose
effectiveness was tested during explicit harvests by GS (pilots 2 and 3).
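The sitemap submission step can be illustrated with a short Python sketch that writes a minimal sitemap in the sitemaps.org 0.9 format. The item URLs are placeholders, and the paper does not describe how the USpace sitemaps were actually generated, so this should be read as one possible approach rather than the authors' method.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder item URLs; a real repository would list one preferred URL per item.
ITEM_URLS = [
    "https://repository.example.edu/item/1",
    "https://repository.example.edu/item/2",
    "https://repository.example.edu/item/3",
]


def build_sitemap(urls):
    """Return a sitemap document (as a string) with one <url><loc> per item."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(ITEM_URLS))
```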
Google and Google Scholar
Google and Google Scholar are separate indexes, and GS has a different focus from its
much larger parent. Dr Anurag Acharya, GS’s founding engineer, has stated that the
goal is to offer the “most comprehensive list of research papers available on the Web,”
and that GS limits its results to “peer reviewed papers, theses, books, abstracts, and
technical reports” (Assisi, 2005). More recently, GS has added patents and legal cases
to the items it indexes.
GS has its own crawlers (also known as spiders or robots) that visit repositories and
publisher sites, among others, to harvest content appropriate for its index. A
peculiarity of GS’s presentation of academic papers is that it generally provides a link
directly to the PDF document. This is expedient for users as it gets them directly to the
content, but it also strips any context that may have been provided by the repository’s
HTML display. In other words, metadata, institutional logos, and other information
normally displayed to users are lost unless they are inserted into the PDF itself. The
practice can also affect the reporting of visitation statistics through website analytics
software that utilizes page tagging. For instance, Google Analytics requires a tracking
code inserted in the HTML of each page of a given website to gather statistics. Each
time that page is displayed in a web browser it is counted as a visit by Google
Analytics, but separating the PDF file from the HTML display means that the visit will
not be counted because the required tracking code is not executed when the PDF is
called directly. This problem can be overcome by having the webserver execute a PHP
script containing the tracking code before serving the requested PDF, but it is unlikely
that many repository managers are doing this, or are even aware that their visitation
and download statistics may be underreported as a result of GS’s item display practice.
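The idea of recording a visit before the PDF is served can be sketched as follows. The paper refers to a PHP script that carries the analytics tracking code; this example instead uses a small Python WSGI application with an in-memory counter, purely to show the order of operations. The paths, the counter, and the placeholder file contents are all assumptions.

```python
# In-memory stand-in for an analytics backend (an assumption for illustration).
download_counts = {}

# Placeholder "files"; a real server would read PDFs from disk or a database.
PDF_STORE = {"/papers/example.pdf": b"%PDF-1.4 placeholder bytes"}


def pdf_app(environ, start_response):
    """WSGI application that counts a direct PDF request, then serves the file."""
    path = environ.get("PATH_INFO", "")
    body = PDF_STORE.get(path)
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Record the hit first, so direct links that bypass the HTML page still count.
    download_counts[path] = download_counts.get(path, 0) + 1
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

An application of this shape could be run locally with the standard library's `wsgiref.simple_server.make_server`; in production the counting step would more plausibly forward the hit to whatever analytics service the repository already uses.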
Literature review
Internet search engines dominate general information-seeking behavior of users, and
Google is by far the most popular search engine, consistently capturing a 65 percent share
of the “explicit core search market” (Comscore, 2011). Bing powers Microsoft and Yahoo!
search sites, capturing another 30 percent of market share. The dominance of search
engines is also apparent in the academic sector. A 2005 survey by OCLC demonstrated
that 89 percent of college students began their research with internet search engines, and
that only 2 percent began at library websites (DeRosa and OCLC, 2005). A repeat of that
survey five years later demonstrated that the situation for libraries had only worsened,
as 0 percent of respondents reported visiting library websites at the outset of their
research (DeRosa et al., 2010). That same report saw a slight drop in traditional search
engine use, but also noted for the first time the use of social media search engines for
initial research. Another 2005 survey in the UK found that “students prefer to locate
information or resources via a search engine above all options, and Google is the search
engine of choice” (Griffiths and Brophy, 2005). The information-seeking behaviors of
young academic researchers in Sweden displayed an “almost complete dominance of
Google as a starting point for searching scientific information” (Haglund and Olsson,
Faculty search behavior is similar. A study of active faculty researchers at four
major universities reported that “researchers find Google and Google Scholar to be
amazingly effective” for their information retrieval needs and accept the results as
“good enough in many cases” (Kroll and Forsman, 2010). Rieger reports a high degree
of use and satisfaction with internet search engines. She notes that “both faculty and
students prefer search engines over other resources to support their academic work”
and that “there is a broader awareness of specialized Google tools such as Google
Scholar and Google Book among faculty members and graduate students” (Rieger,
2009). In a comparison of GS to Web of Science, Mikki states that “the amount of
qualified scholarly content has increased considerably in Google Scholar since it was
launched in 2004,” and that it has developed into a serious research and citation study
tool that should be included in information literacy programs (Mikki, 2009).
A review of the literature pertaining to SEO in libraries reveals that much of the
published research deals with general websites (e.g. Cahill and Chalut, 2009; Rushton
et al., 2008). The minimal research dealing with digital repositories sometimes concludes
by suggesting that content be replicated outside the database in a static format in order
to make it friendlier to search engines, a method that seems arcane and burdensome, but
may have been the best option at the time. “Unless links are located on a static web page,
crawlers won’t find them, and many such links are not followed” (DeRidder, 2008). Page
rank in search engines is another factor that plays into repository visibility. Malaga has
shown that 62 percent of users click only on results that appear in the first search engine
results page (Malaga, 2008). The high use of internet search engines as primary search
mechanisms suggests that digital repositories created by libraries are likely to be nearly
invisible to users if their contents are not indexed in these search engines.
Search engine and metadata optimization for institutional repositories are also
addressed only minimally in the published literature, and the value and use of GS is
sometimes questioned. McKay offers that “authors are quite right in perceiving [IRs] as
‘islands of information,’ [...] a condition that can be addressed by search-engine
harvesting [...]” She goes on to say “Google Scholar is not usually the first information
source” consulted by academics, though that may have been truer when the article was
published in 2007, when GS was relatively new and contained much less content
(McKay, 2007). Increased use of GS is demonstrated in a more recent University of
Mississippi study in which use rose from 4 to 27 percent of major library databases
over a four-year period (Herrera, 2010). A 2006 article on optimizing metadata for
search engines acknowledges “the problem may not lie with the search engines but
with the data providers,” and introduces the concept of “data shoogling” to offer more
Google-friendly metadata in digital collections (Dawson and Hamilton, 2006). It does
not, however, specifically address institutional repositories or GS. A survey of 540
librarians at 108 ARL libraries notes complaints of “inadequate use of metadata by
Google Scholar” (Drewry, 2007), which may support this paper’s hypothesis that GS
does not find metadata supplied by libraries to be appropriately structured or unique.
Beel, Gipp, and Wilde offer related and significant strategies for optimizing
academic papers themselves for better inclusion in search engines, and in GS in
particular. Their advice includes optimizing graphics for indexing purposes, writing
relevant document titles, and selecting appropriate keywords (Beel et al., 2010). While
optimization of the academic papers themselves merits continued exploration and
testing, it is beyond the scope of this article.
Background on SEO research for digital repositories at the University of Utah
Digital repositories of every type face a common challenge: having their content found
by interested users in a crowded sea of information on the internet. Getting found means
the repository items must be included in the indexes of major search engines, because
that is where the vast majority of users start looking. Unfortunately, many digital
repositories show poorly in the results from major search engines. In 2010 the authors
conducted a survey of 650 known objects across the thirteen repositories of the Mountain
West Digital Library (MWDL), which revealed a disturbing pattern: only 38 percent of
digital objects searched by title were found in Google’s index. Worse, this Google search
engine results page (SERP) consisted mostly of links back to a search results screen in
the local repository, rather than linking directly to the objects. Only 15 percent of the hits
on the SERP provided users with direct links to the objects. The known-item title
searching method employed by the survey probably produced the best results possible at
the time; searching by keyword or subject term would likely have presented even fewer
items from the repositories of the libraries and archives in the MWDL.
Search engines can be thought of as “users with substantial constraints: they can’t
read text in images, can’t interpret JavaScript or applets, and can’t ‘view’ many other
kinds of multimedia content” (Hagans, 2005). In order to be indexed by internet search
engines, repository databases and the servers on which they reside must be receptive
to crawlers sent out by the search engines. The crawlers follow links to each digital
object in the repository, a process greatly facilitated by the submission of sitemaps that
function as formal invitations and guides, revealing the preferred URL for each of the
repository’s objects. The crawlers “harvest” metadata and other information about the
objects, sending that information back to the search engine where it is analyzed by
algorithms that take many factors into account in deciding whether to add the
metadata to the search engine’s index. Crawlers that encounter difficulties in the
harvesting phase will throw off errors that can be analyzed and addressed using free
Webmaster Tools services offered by both Google and Bing. Errors that are not
addressed in a timely manner may discourage crawlers from returning, leading to
continuing low search engine indexing ratios, or worse, being dropped from the index
altogether. Technical problems encountered by crawlers may include, but are not
limited to the following:
• conflicts between sitemaps and robots.txt files;
• slow server response time;
• dead links or failure to provide appropriate redirects;
• labyrinths created by repository software, including poorly implemented framesets and JavaScripts, as well as multiple URLs for the same object;
• poor application of metadata, including re-use of the same metadata terms for multiple objects; and
• metadata schemas deemed unacceptable by the specific search engine.
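The first problem in the list above, a sitemap that advertises URLs which robots.txt forbids, can be checked mechanically. The Python sketch below uses the standard library's robots.txt parser; the robots.txt rules, the URLs, and the choice of "Googlebot" as the user agent are illustrative assumptions rather than details from any surveyed repository.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that blocks a browse interface for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse
"""

# Illustrative sitemap entries: one item page and one browse-by-date listing.
SITEMAP_URLS = [
    "https://repository.example.edu/item/101",
    "https://repository.example.edu/browse?type=dateissued",
]


def blocked_sitemap_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the sitemap URLs that the given robots.txt tells a crawler to skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_URLS):
        print("Listed in sitemap but disallowed by robots.txt:", url)
```

Any URL this check reports is sending crawlers a mixed message: the sitemap invites a visit that robots.txt then refuses.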
Additional challenges with search engine optimization for digital repositories may be
framed as administrative. These include:
• aligning the goals of the digital library with institutional goals;
• informing, training, motivating, and coordinating staff from various
• establishing an environment of continuous monitoring and addressing crawler errors as they arise; and
• institutionalizing tools to analyze metrics, and using them to inform and convince stakeholders of the impact of the digital library.
Results at Utah
Analyzing the SEO problems and applying a variety of solutions has resulted in
dramatic improvements to the indexing ratio of Utah’s digital repositories in Google.
The digital repositories (including the IR) managed by the J. Willard Marriott Library
at the University of Utah currently run on CONTENTdm v5.4 (OCLC, 2011). While
version 5.4 includes some features that are considered unfriendly to search engines,
such as JavaScripts, framesets (for compound objects), and multiple URLs for digital
objects, those ultimately proved not to be barriers to successful SEO (see Figure 1).
Version 6, which was released in late 2010, eliminates those barriers altogether, and
Utah is planning a migration.
Figure 1 shows increases in average indexing ratios for all digital collections over a
period of 15 months. It also shows improvements in the highest indexing ratio achieved
for collections with more than 500 URLs.
Increased indexing ratios have thus far led to a 200 percent increase in referrals
from Google, and an 80 percent increase in visits to all digital collections. Indexing
ratios of USpace, the University of Utah’s IR, have also increased from approximately
18 percent to 98 percent, but only in Google, not in Google Scholar (see Figure 2).
Figure 1. Google index ratio improvement for general digital collections at Utah
Open access and institutional repositories
The open access movement was launched to improve access to publicly funded
research, and to help libraries deal with rampant inflation in journal subscription
prices. According to Peter Suber the open access movement is dependent on internet
technologies and the consent of the author or copyright holder (Suber, 2004).
Institutional repositories were one product of this movement; they capture the
intellectual output of the faculty, staff, and students of universities or academic
disciplines, and assure perpetual and free access to that output (barring embargo
periods or other publisher restrictions). IRs often include electronic theses and
dissertations (ETD), and most are managed by academic libraries and some by
scholarly societies. Over the past decade IRs have variously enjoyed advances and
suffered setbacks, but through the consistent work of many individuals at numerous
institutions they are achieving enough mass to become viable sources of research
publications. They also hold the promise of contributing significantly to author citation
rates. Recent research in the UK suggests that institutional repositories may play a
crucial role in measuring research output, and in turn may affect university rankings
(Key Perspectives and Brown, 2009). The Times Higher Education publishes an annual
ranking of the top world universities, and research citations contribute 32.5 percent
toward each university’s score (The Times Higher Education, 2010).
Libraries have not developed a mechanism to aggregate and search IRs, and thus
GS has become the best de facto search engine available for IR content. But just as
institutional repositories are gaining enough mass to make them useful and credible
sources of research output, the difficulties associated with SEO threaten to undermine
their potential. Faculty and other authors who contribute publications to IRs may lose
interest if their publications can’t be located (and cited) in academically-oriented search
engines like GS.
Figure 2. Increase in USpace indexing ratios in Google
Surveys of IR indexing ratios in Google Scholar
In October and December 2011 the authors conducted two surveys of institutional and
disciplinary repositories to arrive at a preliminary determination of how well GS was
indexing them. The IRs were identified through the Directory of Open Access
Repositories, also known as OpenDOAR (University of Nottingham, 2011).
Only institutional or disciplinary repositories housed in the USA were selected for
these surveys. They were chosen for their academic content, and to represent an
approximate real-world distribution of several repository software types: DSpace, Digital
Commons, EPrints, IR+, CONTENTdm, DigiTool, and arXiv (see Table I). While there
are a number of other software types in use, many of them are not found in the U.S. Some
repositories found in OpenDOAR were ruled out because it was immediately obvious
that they included other types of non-IR digital collections, such as photographs.
According to OpenDOAR the arXiv repository software is used only by arXiv, but it was
included in Survey 1 because of its size and importance to the scientific community (see
Table I for a complete listing of the repositories selected for survey 1).
Survey methodologies
Search engine indexing is a dynamic environment. Crawlers return to repositories
periodically to pick up new additions, sometimes discarding items if they run into
errors, and the repositories themselves are (hopefully) continually growing. Therefore
these surveys should be understood to be a snapshot from a specific moment in time.
OpenDOAR records list the number of items in most repositories, but those figures are
usually outdated. The authors determined the current number of repository items from
figures available on the sites themselves, and in one case by contacting the repository
manager. DSpace repositories make it easy to browse by title to reveal all the items in
the repository. In the case of Digital Commons, a dynamic script posts the current total
items in the repository. EPrints repositories had a page that listed the number of items by
type, the sum of which represented all the items in the repository. Other sites offered
similar methods of determining the total number of items contained in the repository.
Survey 1 Methodology
In the first survey searches were conducted to determine the number of items indexed
by GS from a given repository by using the “site” operator, i.e. search queries in GS
were structured in the following manner: “site:repositoryURL.” This operator must be
used with caution, because in GS it only searches the primary versions of academic
papers. In other words, a paper that has been formally published in a journal will be
considered the primary version. Additional versions of that paper, including those that
appear in IRs may be indexed by GS, but are considered other versions and will only be
revealed by clicking the “versions” link (see Figure 3). Because the other versions do
not appear on the initial search results page, it is incorrect to assume that the number
of results of a search using the site operator shows all the items that GS has indexed
from that repository.
The data from this survey confirm a low average primary publication indexing ratio
of only 30 percent (see Table I and Figure 4). Being mindful of casting aspersions, the
authors are fully aware that their own IR (USpace) currently shows a near zero percent
indexing ratio in GS.
Table I. Survey 1 of IRs showing primary publication version indexing ratios
Repository name | Repository software | Items in repository | Items in GS | Indexing ratio (%)
Boston College-eScholarship@BC | DigiTool | 1,635 | 1 | 0
UW – ResearchWorks Archive | DSpace | 11,285 | 304 | 3
Univ of Rochester Research | IR+ | 16,184 | 983 | 6
CaltechAuthors | Eprints | 22,000 | 2,290 | 10
D-Scholarship@Pitt | Eprints | 5,888 | 686 | 12
Columbia Univ-Academic Commons | Fedora/Blacklight | 4,631 | 586 | 13
IU Scholarworks | DSpace | 7,782 | 1,030 | 13
Texas A&M Repository | DSpace | 46,324 | 7,250 | 16
UW Madison-Minds@UW | DSpace | 15,078 | 2,520 | 17
eCommons@Cornell | DSpace | 18,544 | 3,410 | 18
Harvard Univ DASH | DSpace | 6,193 | 1,710 | 28
Univ of Oregon-Scholars Bank | DSpace | 9,740 | 2,840 | 29
Michigan Deep Blue | DSpace | 66,038 | 22,200 | 34
BYU Scholars Archive | CONTENTdm | 7,421 | 2,520 | 34
IUPUI Scholar | DSpace | 2,109 | 800 | 38
Cornell-Digital Commons@ILR | Digital Commons | 14,669 | 5,880 | 40
Cornell-arXiv | arXiv | 706,906 | 330,000 | 47
Aquatic Commons | Eprints | 5,722 | 3,230 | 56
Virginia Tech CS Tech Reports | Eprints | 983 | 586 | 60
Digital Commons@UNLincoln | Digital Commons | 50,657 | 30,200 | 60
Baylor U BearDocs | DSpace | 928 | 829 | 89
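The arithmetic behind the survey 1 figures can be reproduced with a few lines of Python. The sketch assumes that an indexing ratio is the number of items GS returned for the site search divided by the number of items in the repository, which is consistent with the figures in Table I; the variable and function names are illustrative.

```python
from statistics import mean

# (repository, items in repository, items returned by a GS "site:" search),
# copied from Table I.
SURVEY_1 = [
    ("Boston College-eScholarship@BC", 1635, 1),
    ("UW ResearchWorks Archive", 11285, 304),
    ("Univ of Rochester Research", 16184, 983),
    ("CaltechAuthors", 22000, 2290),
    ("D-Scholarship@Pitt", 5888, 686),
    ("Columbia Univ-Academic Commons", 4631, 586),
    ("IU Scholarworks", 7782, 1030),
    ("Texas A&M Repository", 46324, 7250),
    ("UW Madison-Minds@UW", 15078, 2520),
    ("eCommons@Cornell", 18544, 3410),
    ("Harvard Univ DASH", 6193, 1710),
    ("Univ of Oregon-Scholars Bank", 9740, 2840),
    ("Michigan Deep Blue", 66038, 22200),
    ("BYU Scholars Archive", 7421, 2520),
    ("IUPUI Scholar", 2109, 800),
    ("Cornell-Digital Commons@ILR", 14669, 5880),
    ("Cornell-arXiv", 706906, 330000),
    ("Aquatic Commons", 5722, 3230),
    ("Virginia Tech CS Tech Reports", 983, 586),
    ("Digital Commons@UNLincoln", 50657, 30200),
    ("Baylor U BearDocs", 928, 829),
]


def indexing_ratio(indexed, total):
    """Percentage of a repository's items that were found in the index."""
    return 100.0 * indexed / total


RATIOS = {name: indexing_ratio(found, total) for name, total, found in SURVEY_1}
AVERAGE = mean(RATIOS.values())  # unweighted: each repository counts once

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio:.0f}%")
    print(f"Average of the per-repository ratios: {AVERAGE:.0f}%")
```

Under that assumption the unweighted mean of the 21 ratios rounds to 30 percent, in line with the average reported for survey 1. A pooled ratio (all indexed items divided by all items) would be a different statistic, weighted heavily toward arXiv because of its size.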
Survey 2 Methodology
In the second survey the authors used a similar approach to the one they had employed
in 2010 to survey the repositories of the Mountain West Digital Library, i.e. they
searched in GS for known repository items by their titles. This method is, of course,
slower and more laborious, but it is also more accurate, allowing articles to be counted
whether they appear as the primary link in the initial list of results or are hidden
behind the “versions” link.
Figure 3. Google Scholar search result showing link to other versions of the paper
Figure 4. Survey 1 results showing indexing ratios of repository primary
The authors created a data set for seven repositories from survey 1 by using crawler
software to harvest titles from each repository. This method mimicked the process
used by internet search engine crawlers, and collected 500 to 1,400 article titles from
each repository and saved them into Excel spreadsheets. In some cases, scholarly
papers in the IRs were easy to identify and entire collections could be crawled. In other
cases, it was difficult to isolate the publicly available scholarly papers for an
automated crawler because the repositories do not follow the GS recommendations.
These difficulties in crawling the IR resulted in less than optimal sampling of titles
within the IR collection; in fact the sample may have been biased to favor a higher
indexing ratio because the authors made efforts to harvest only academic papers.
Using a sampling methodology developed for verifying database backups (LaRock,
2010), those titles were then randomized, and forty titles from each set were searched by
copying article titles from the spreadsheets and pasting them into the GS search box.
The authors used Zotero to create metadata records and snapshots for each search
result, whether the article was found or not. “Versions” links were followed whenever
found and the resulting screen was also captured as a snapshot attached to the same
metadata record in Zotero.
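The randomization step can be sketched in Python as below. LaRock's (2010) procedure is not reproduced here; a seeded shuffle followed by taking the first 40 titles is assumed as a stand-in, and the title list is synthetic rather than harvested from any repository.

```python
import random


def sample_titles(titles, k=40, seed=None):
    """Shuffle a copy of the harvested titles and return the first k."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    shuffled = list(titles)    # copy, so the harvested list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:k]


if __name__ == "__main__":
    # Synthetic stand-in for one repository's harvested titles.
    harvested = [f"Title {n}" for n in range(1, 758)]
    for title in sample_titles(harvested, seed=2011)[:5]:
        print(title)
```

Recording the seed alongside the search snapshots would allow the same 40 titles to be drawn again if a survey had to be repeated.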
Of the seven repositories that were sampled, three showed very high indexing ratios
(88-98 percent), while the other four showed ratios below 50 percent (see Table II). A
discussion about the likely reasons for these differences follows in the section titled
“Survey Conclusions.”
Survey conclusions
The first survey had limitations in terms of calculating a complete index ratio for each
IR. However, since use of the site operator in GS reveals only the primary versions of
the articles, the average indexing ratio of 30 percent indicates that most IRs do not
contain very many primary articles. This raises some interesting questions about the
purpose of IRs. Specifically, how much value is really derived from having pre-prints in
the IR, given the amount of labor required to put them there, particularly if the primary
publisher is open access as well? On the other hand, IRs that largely contain grey
literature that is not published elsewhere will likely see a higher indexing ratio with GS
precisely because those are the primary articles.
Data from the second survey are much more interesting. Because the authors used
crawler software to harvest article titles, they encountered many of the same problems
that Internet search engine crawlers face when trying to harvest institutional
repositories. The crawling and indexing guidelines shown in Table II were drawn from
stated requirements and recommendations from GS’s Webmaster Inclusion Guidelines
website (Google Scholar, 2010). In general, IRs that followed these guidelines had a
much higher indexing ratio (88-98 percent) than sites that did not (38-48 percent). For
the purposes of this paper, the most validating differences were found in the expression
of publisher metadata schemas (Bepress, Highwire Press, PRISM, or Eprints) in the
meta tags within the header tags of the HTML display pages (see Figure 5). Those
repositories that did not make their metadata available in one of the recommended
publisher schemas within the HTML meta tags generally fared much more poorly than
those that did. Further, the repositories that offered absolute URLs to the PDF files for
their documents also had far higher indexing ratios than those that did not. Finally,
improving crawler efficiency by providing chronological listings of papers, recently
added papers, and a limited number of clicks to publicly available scholarly papers
also seemed to positively affect indexing ratio.

Table II. Survey 2 indexing ratios for seven institutional repositories
Guideline | Cornell | Oregon | Cal Tech | Texas A&M Faculty | UW Aquatic Tech Reports | Columbia | Rochester
Indexing ratio (%) | 98 | 88 | 88 | 48 | 46 | 45 | 38
Software | Digital | DSpace | ePrints | DSpace | DSpace | Fedora/ | IR+
Titles available/captured | Unknown/1,421 | 4,067/1,463 | 1,306 | 763/757 | 563/539 | 3,819/1,432 | 1,562/926
Crawling guidelines
Browse by date | No | Yes | Yes | Yes | Yes | No | No
Recently added | No | No | Yes | No | No | No | No
10 clicks from home page | Yes | Yes | Yes | No, only first 200 | No, only first 200 | Yes | No
Robots.txt | Yes | Yes, not in | Yes | Yes, disallows browse by | Yes, disallows browse by date | Yes, not
Sitemap index | Yes | No | No | Yes, not compliant with | No | No | No
Indexing guidelines
Meta Tag Schema in HTML | BePress | DC | ePrints and | None | DC and DCTERMS | None | None
Title | Yes | Yes | Yes | No | Yes | No | No
Author | Yes | Yes | Yes | No | Yes | No | No
Pub Date | Yes | Yes | Yes | No | DCTERMS | No | No
Publisher | Yes | Yes | Yes | No | No | No | No
Journal | No | No | Yes | No | No | No | No
Volume | No | No | Yes | No | No | No | No
Issue | No | No | Yes | No | No | No | No
First page | No | No | Yes | No | No | No | No
Last page | No | No | Yes | No | No | No | No
Absolute URL to PDF | Yes | Yes | Yes | No | No | No | No
Institution | n/a | n/a | n/a | n/a | n/a | No | n/a
Dissertation name | n/a | n/a | n/a | n/a | n/a | No | n/a
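Two repositories in Table II had a robots.txt file that disallowed the browse-by-date pages. A repository manager could test for this kind of self-inflicted block with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the robots.txt content, the paths, and the `blocked_paths` helper are hypothetical and are not taken from any surveyed repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides a browse-by-date listing from all crawlers
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse-date
"""


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the paths that the given crawler may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


paths = ["/browse-date/2010", "/handle/123/456", "/recent"]
print(blocked_paths(ROBOTS_TXT, paths))
```

If a listing page that links to every paper shows up in the result, crawlers lose one of their most efficient routes into the collection.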
GS makes specific recommendations for IR software on its Inclusion Guidelines for
Webmasters site (see reference below), but the surveys in this paper demonstrate that
software makes little or no difference; the problem cuts across institutions, repository
focus, and repository software. Instead, indexing ratio success has much more to do
with how carefully a repository follows the guidelines described above. The GS
software recommendation reads:
If you’re a university repository, we recommend that you use the latest version of Eprints,
Digital Commons, or DSpace software to host your papers. If you use a less common
hosting product or service, or an older version of these, please read the rest of this
document and make sure that your website meets our technical guidelines (Google
Scholar, 2010).
Why Google Scholar has difficulty with institutional repositories
Librarians are great believers in standards, and while building digital repositories they
have dutifully followed them for scanning, metadata creation, harvesting, and web
services, among others. Search engines, however, are not required to honor standards.
For example, in August 2008 Google announced that it was “Retiring support for
OAI-PMH[2] in Sitemaps” (Mueller, 2008), causing consternation across the library
community. Two years later, GS made the following announcement on its Webmaster
inclusion guidelines site: “Use Dublin Core tags (e.g. DC.title) as a last resort - they
work poorly for journal papers [...]” (Google Scholar, 2010).
Although Dublin Core is recognized to be a standard of the lowest common
denominator, libraries have used it widely for most digital repositories, including IRs.
The Dublin Core schema works “poorly for journal papers” because it does not include
adequate fields for citation data and because it is interpreted inconsistently. Citation
information such as journal name, volume and issue number, and the page span
of the article is usually entered into a single field, such as DC.Relation or DC.Source in
simple Dublin Core, and there is no specified format or consistency. This makes it
difficult for a search engine like GS to accurately parse and index the data into its
individual bibliographic components. The Dublin Core Metadata Initiative website
(DCMI, 2005) does include guidelines for encoding bibliographic citation information
using a qualification of the DC.Identifier field (called “bibliographicCitation”) but this is
still only a single field. It is also unlikely that many repositories have updated to reflect
the relatively recent development of DC Qualifiers. Dublin Core also does not facilitate
various academic paper types: there is no specific field to distinguish a pre-print from a
journal article, a book chapter from a book, a working paper from a conference
proceeding, or a dissertation. In short, libraries are not focusing enough on making
metadata machine-readable.
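The parsing problem can be made concrete with a small sketch. The regular expression below handles one hypothetical free-text citation format and silently fails on two other plausible renderings of the same citation. The citation strings and the pattern are illustrative assumptions; they are not formats documented by GS or DCMI.

```python
import re

# Matches only one style: "Journal, 71(4), 322-347"
CITATION_PATTERN = re.compile(
    r"^(?P<journal>.+?),\s*"
    r"(?:Vol\.\s*)?(?P<volume>\d+)\s*"
    r"\((?P<issue>\d+)\),\s*"
    r"(?:pp\.\s*)?(?P<first>\d+)-(?P<last>\d+)$"
)


def parse_citation(text):
    """Return the citation components as a dict, or None if the format is unknown."""
    match = CITATION_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Three hypothetical ways a cataloger might enter the same citation in one field
examples = [
    "College and Research Libraries, 71(4), 322-347",
    "College and Research Libraries, Vol. 71 No. 4, pp. 322-47",
    "C&RL 71:4 (2010) 322-347",
]
for text in examples:
    print(parse_citation(text))
```

Only the first string parses. A schema with separate fields for journal, volume, issue, and pages removes the need to guess at all.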
Instead of Dublin Core, GS recommends using one of the following schemas:
Highwire Press, Eprints, Bepress, and PRISM. These schemas are more adept at
structuring citation data appropriately. Highwire Press, a division of Stanford
University, developed its schema for journal articles and GS extended the tags to
cover additional academic paper types, such as working papers, dissertations,
manuscripts, conference papers, books and book chapters. The authors used the
extended Highwire Press tags in their pilot projects to test the hypothesis that
transforming metadata would lead to an increase in indexing ratio in GS for an IR.
Figure 5. Example of HTML meta tags using of Bepress
Pilot 1
Due to USpace’s non-existent showing in GS, the authors began to strategize
methods to modify USpace metadata to fit the recommendations. GS explains how
Highwire Press tags could map to Dublin Core fields (Google Scholar, 2010). Thus the
first step was to begin aligning existing Dublin Core fields with those mappings (see
Table III).
The indexing ratio for USpace at the University of Utah prior to the pilot (July 5,
2010) was poor, at best, and can be summarized as follows:
(1) Index ratio for the three primary USpace IR collections containing 6,482 papers:
    • ranged between 4 percent and 23 percent within Google;
    • average overall Google Index Ratio was 18.33 percent (1,188/6,482); and
    • index ratio within GS was less than 0.1 percent.
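The overall figure above follows directly from the definition of indexing ratio used in this paper (indexed URLs divided by total URLs in the repository). A minimal sketch, with a hypothetical function name:

```python
def indexing_ratio(indexed_urls, total_urls):
    """Indexing ratio as a percentage: indexed URLs divided by total URLs."""
    if total_urls <= 0:
        raise ValueError("total_urls must be positive")
    return 100.0 * indexed_urls / total_urls


# Google index ratio for the three USpace collections before the pilot
print(round(indexing_ratio(1188, 6482), 2))  # 18.33
```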
The following steps were taken to address the poor indexing ratio:
(1) Sitemaps representing three IR collections were submitted through Google
Webmaster Tools:
    • A total of 6,482 URLs were submitted; each collection contained between 500 and 4,200 academic papers.
(2) Errors generated during Google crawls were analyzed using Webmaster Tools
and improvements were made:
    • improved server performance;
    • implemented unique title and description tags containing the paper’s name and abstract, respectively; and
    • implemented “rel=canonical” tags, indicating the preferred URL of each digital object (there were often multiple URLs pointing to each paper).

Table III. Map used in first GS pilot
Highwire Press tags | Dublin Core tags
citation_author | DC.creator
citation_date | DC.issued
citation_title | DC.title
citation_publisher | DC.publisher
citation_journal_title | DC.relation.ispartof
citation_volume | DC.citation.volume
citation_issue | DC.citation.issue
citation_firstpage | DC.citation.spage
citation_lastpage | DC.citation.epage
citation_issn | n/a
citation_isbn | n/a
citation_keywords | DC.subject
citation_dissertation_institution | DC.publisher
citation_technical_report_institution | DC.publisher
citation_technical_report_number | n/a
citation_language | DC.language
citation_conference_title | DC.publisher
citation_pdf_url | DC.identifier
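For readers unfamiliar with sitemaps, the following sketch builds a minimal XML sitemap in the sitemaps.org 0.9 format from a list of paper URLs. It is not the script used at Utah; the `build_sitemap` helper and the example.edu URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = url  # text is XML-escaped on output
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical canonical URLs for two papers in one collection
sitemap_xml = build_sitemap([
    "https://repository.example.edu/papers/1",
    "https://repository.example.edu/papers/2",
])
print(sitemap_xml)
```

Listing only the canonical URL of each paper keeps the sitemap consistent with the “rel=canonical” tags described above.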
To address the metadata requirements per the Google Scholar inclusion guidelines, the
authors did the following:
(1) Mapped Dublin Core to Google-supported Highwire Press tags:
    • Extended Dublin Core fields according to GS recommendations: journal volume (DC.volume); journal issue (DC.issue); starting page number (DC.citation.spage); and ending page number (DC.citation.epage).
(2) A total of 20 papers were selected for a pilot:
    • Verified metadata was accurate and mapped correctly to the HTML “meta name=” fields on display templates as understood from GS inclusion guidelines (see Table III and Figure 6).
    • Ensured each of the 20 papers had a full-text PDF that met GS inclusion guideline requirements.
    • Embedded the metadata schema directly into five of the PDF files of the papers.
    • Provided a “landing page” per GS inclusion guidelines, containing links to the 20 IR pilot papers, that was within a few clicks of the home page. This landing page contained links to both a paper’s HTML page and its full-text PDF.
Figure 6. Converting bibliographic
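The mapping step can be sketched as a small transformation from a Dublin Core record to Highwire Press meta tags, using a subset of the Table III map. This is an illustration, not the authors' display-template code: the record layout and the `highwire_meta_tags` helper are hypothetical, and emitting one `citation_author` tag per author is an assumption of this sketch.

```python
from html import escape

# Subset of the Table III map (Dublin Core field -> Highwire Press tag)
DC_TO_HIGHWIRE = {
    "DC.creator": "citation_author",
    "DC.issued": "citation_date",
    "DC.title": "citation_title",
    "DC.publisher": "citation_publisher",
    "DC.relation.ispartof": "citation_journal_title",
    "DC.citation.volume": "citation_volume",
    "DC.citation.issue": "citation_issue",
    "DC.citation.spage": "citation_firstpage",
    "DC.citation.epage": "citation_lastpage",
}


def highwire_meta_tags(dc_record):
    """Render HTML meta tags for every mapped field present in the record."""
    tags = []
    for dc_field, highwire_name in DC_TO_HIGHWIRE.items():
        values = dc_record.get(dc_field, [])
        if isinstance(values, str):
            values = [values]  # allow a single value or a list of values
        for value in values:
            content = escape(value, quote=True)
            tags.append(f'<meta name="{highwire_name}" content="{content}">')
    return tags


# Hypothetical record layout, filled with the journal article values from Table AI
record = {
    "DC.creator": ["Maloney, Krisellen", "Antelman, Kristin",
                   "Arlitsch, Kenning", "Butler, John"],
    "DC.issued": "2010",
    "DC.title": "Future leaders’ views on organizational culture",
    "DC.citation.volume": "71",
    "DC.citation.issue": "4",
    "DC.citation.spage": "322",
    "DC.citation.epage": "347",
}
tags = highwire_meta_tags(record)
print("\n".join(tags))
```

The resulting lines would be placed inside the HTML head of the paper's display page.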
The experiment delivered a significant increase in the Google index ratio for the IR
collections (see Figure 2), and as of October 16, 2011 the Google Index ratio for the IR
collections was 97.82 percent (10,306/10,536). However, there was no effect on the IR’s
GS index ratio. In fact, not one of the twenty USpace papers that had been isolated and
optimized was included in the GS index[3].
Pilot 2
During the summer of 2011 the authors consulted with OCLC and Google Scholar with
the aim of developing and testing a second pilot project. Nineteen papers from USpace
were selected for the second pilot:
(1) Six of seven GS paper types were represented and the full-text PDF document
was included for each paper. The book paper type was out of scope for this pilot
(see Appendix for examples of each paper type):
    • dissertation and thesis;
    • conference article;
    • working paper;
    • manuscript and pre-print;
    • journal article; and
    • book chapter.
(2) CONTENTdm v6.0 display templates were augmented:
    • embedded Highwire Press meta tags in the HTML page header of display templates using an automated script (see Figure 7);
    • created a browse by year page that provided links to papers in chronological order of publishing date; and
    • created a recently added page that listed papers added to the IR within the last 30 days.
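The two listing pages can be sketched as simple groupings over paper records. This is not the CONTENTdm template code; the record fields (`published`, `added`) and both helper functions are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def browse_by_year(papers):
    """Group papers by publication year, each group in chronological order."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["published"].year].append(paper)
    return {
        year: sorted(items, key=lambda p: p["published"])
        for year, items in sorted(by_year.items())
    }


def recently_added(papers, today, window_days=30):
    """Return papers added within the last `window_days`, newest first."""
    cutoff = today - timedelta(days=window_days)
    recent = [p for p in papers if p["added"] >= cutoff]
    return sorted(recent, key=lambda p: p["added"], reverse=True)


# Hypothetical records for three papers
papers = [
    {"title": "A", "published": date(2009, 5, 1), "added": date(2011, 7, 20)},
    {"title": "B", "published": date(2010, 8, 1), "added": date(2011, 6, 1)},
    {"title": "C", "published": date(2010, 2, 1), "added": date(2011, 8, 1)},
]
print({year: [p["title"] for p in items] for year, items in browse_by_year(papers).items()})
print([p["title"] for p in recently_added(papers, today=date(2011, 8, 10))])
```

Both pages give a crawler a short, predictable path from the home page to every paper.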
The second pilot was a moderate success, with 62 percent of papers indexed on the first
harvest. However, due to unexpected campus network and power outages that took
down the test server for an extended period, the pilot was cut short and the results were
dropped from GS’s index.
Pilot 3
For the third and final pilot project, the authors uploaded 56 papers with full-text PDF
files, and transformed the Dublin Core metadata to Highwire Press tags as described
earlier. The same six paper types were represented as before. This time more than 90
percent appeared in the GS index after four weeks. Continuing conversations with GS
and OCLC will help address lingering issues, but the authors consider this success to
be a significant breakthrough.
Figure 7. Highwire Press tags embedded in HTML
Transforming metadata
The thought of manually transforming metadata for an IR might induce nausea in
repository managers. Fortunately, the IMLS NLG grant recently awarded to the
University of Utah intends, as one of its deliverables, to help address this problem.
OCLC is a partner in the grant and will develop formal crosswalks between Dublin
Core and one or more of the publishing industry schemas recommended by GS.
Automated transformation and linked data mechanisms will also be developed to
minimize the work required to express citation data more effectively for indexing. The
products of that grant will be published in a toolkit by 2014 or sooner.
Transforming metadata to GS-preferred metadata schemas is very likely to raise
the indexing ratio of IRs. The second and third pilot projects described in this paper were
successful, demonstrating that transforming from Dublin Core metadata tags to more
precise bibliographic Highwire Press tags increased the GS indexing ratio of the sample
data set from 0 percent to 62 percent in the second pilot, and then to more than 90 percent
in the third. The authors are cautiously optimistic that continuing discussions with GS
and OCLC will eliminate most remaining indexing problems. Transforming metadata
to EPrints, PRISM, and Bepress schemas is also likely to have a positive effect, though
this assertion will require additional testing.
The low indexing ratio of IRs in GS cuts across institutions and repository software.
Despite GS’s endorsement of three software packages, the surveys conducted for this
paper demonstrate that software is not a deciding factor for indexing ratio in GS. Each
of the three recommended software packages showed good indexing ratios for some
repositories and poor ratios for others. Rather, the major deciding factors seem to lie in:
    • whether the IR has provided crawlers an efficient method to access its scholarly papers; and
    • whether acceptable metadata schemas are provided that offer precise bibliographic information within the HTML page header tags.
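The second factor can be checked mechanically. The sketch below scans a page's HTML head for `citation_` meta tags and reports which of a small set of tags are missing and whether the PDF URL is absolute. The choice of required tags (title, author, date) and the sample page are assumptions for illustration, not a restatement of the GS guidelines.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CitationTagAuditor(HTMLParser):
    """Collect citation_* meta tags that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name.startswith("citation_"):
                self.tags.setdefault(name, []).append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_page(html_text, required=("citation_title", "citation_author", "citation_date")):
    """Report missing tags and whether any citation_pdf_url is an absolute URL."""
    auditor = CitationTagAuditor()
    auditor.feed(html_text)
    missing = [name for name in required if name not in auditor.tags]
    pdf_urls = auditor.tags.get("citation_pdf_url", [])
    absolute = any(
        urlparse(url).scheme in ("http", "https") and bool(urlparse(url).netloc)
        for url in pdf_urls
    )
    return {"missing": missing, "absolute_pdf_url": absolute}


# Hypothetical display page with a relative PDF link and no date tag
page = """<html><head>
<meta name="citation_title" content="A hypothetical paper">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_pdf_url" content="/files/paper.pdf">
</head><body></body></html>"""
print(audit_page(page))
```

Run across a sample of display pages, a report like this gives a quick view of how closely a repository follows the indexing guidelines summarized in Table II.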
While transforming metadata seems to be an effective route to getting indexed,
individual IRs may have additional SEO-related problems that must be addressed as
well. Slow or misconfigured servers, failure to submit viable sitemaps, crawler errors
that remain unresolved, failure to provide appropriate server response codes, lack of
communication across the organization, and a host of other potential problems must be
considered for effective SEO that will raise repositories’ visibility in all search engine
indexes. Advanced methods for optimizing PDF files may also help to assure inclusion
in the GS index. More research and testing are needed, but it is fair to say that a
crawler-friendly repository will fare much better in GS than one that poses difficulties
to crawlers. Upgrading to current repository software packages may help in this
endeavor as product development teams become aware of and address SEO issues.
The growing use of GS by researchers underscores the need to address the problem
of low IR indexing ratio. As the economic recession has tightened university budgets,
more emphasis is being placed on assessment and measurement of outputs. IRs have
the potential to raise author citation rates, and in turn to affect university rankings, but
this potential may be seriously hampered if IR content is redundant or invisible to
researchers who use GS.
Notes
1. Indexing ratio is defined here as the number of unique URLs from a given repository found
in a search engine’s index divided by the total number of URLs in the repository.
2. (Open Archives Initiative Protocol for Metadata Harvesting, a common standard for sharing
metadata in the library community).
3. USpace added a second theses and dissertations collection after the first GS pilot was started
in July, 2010.
References
Assisi, F.C. (2005), “Anurag Acharya helped Google’s scholarly leap”, INDOlink – Science
& Technology, available at: (accessed
13 October 2011).
Beel, J., Gipp, B. and Wilde, E. (2010), “Academic search engine optimization”, Journal of
Scholarly Publishing, Vol. 41 No. 2, pp. 176-90.
Cahill, K. and Chalut, R. (2009), “Optimal results: what libraries need to know about Google and
search engine optimization”, The Reference Librarian, Vol. 50 No. 3, pp. 234-47.
Comscore (2011), “comScore releases September 2011 US search engine rankings”, available at:
September_2011_U.S._Search_Engine_Rankings (accessed 22 October 2011).
Dawson, A. and Hamilton, V. (2006), “Optimising metadata to make high-value content more
accessible to Google users”, Journal of Documentation, Vol. 62, pp. 307-27.
DCMI (2005), “Guidelines for encoding bibliographic citation information in Dublin Core
metadata”, Dublin Core Metadata Initiative, available at:
dc-citation-guidelines/ (accessed 26 October 2011).
DeRidder, J.L. (2008), “Googlizing a digital library”, The Code4Lib Journal, No. 2, available at: (accessed 5 October 2011).
DeRosa, C. and OCLC (2005), Perceptions of Libraries and Information Resources: A Report to the
OCLC Membership, OCLC Online Computer Library Center, Dublin, OH.
DeRosa, C. et al. (2010), Perceptions of Libraries, 2010: Context and Community, OCLC Online
Computer Library Center, Inc., Dublin, OH, available at:
2010perceptions.htm (accessed 4 October 2011).
Drewry, J.M. (2007), Google Scholar, Windows Live Academic Search and Beyond: A study of new
tools and changing habits in ARL libraries, University of North Carolina at Chapel Hill,
Chapel Hill, NC, available at: (accessed
21 October 2011).
Google (2011), “Google Webmaster Central”, available at:
(accessed 29 October 2011).
Google Scholar (2010), “Inclusion Guidelines for Webmasters”, available at:
com/intl/en/scholar/inclusion.html (accessed 4 October 2011).
Griffiths, J.R. and Brophy, P. (2005), “Student searching behavior and the web: use of academic
resources and Google”, Library Trends, Spring, pp. 539-54.
Hagans, A. (2005), “High accessibility is effective search engine optimization”, A List Apart,
available at: (accessed 4 October 2011).
Haglund, L. and Olsson, P. (2008), “The impact on university libraries of changes in information
behavior among academic researchers: a multiple case study”, The Journal of Academic
Librarianship, Vol. 34 No. 1, pp. 52-9.
Herrera, G. (2010), “Google Scholar users and user behaviors: an exploratory study”, College and
Research Libraries, available at:
abstract (accessed 4 October 2011).
Key Perspectives and Brown, S. (2009), “A comparative review of research assessment regimes in
five countries and the role of libraries in the research assessment process: a pilot study
commissioned by OCLC Research”, OCLC Research, Dublin, OH.
Kroll, S. and Forsman, R. (2010), A Slice of Research Life Information Support for Research in the
United States, OCLC Research, Dublin, OH.
LaRock, T. (2010), “Statistical sampling for verifying database backups”, simple-talk, available
database-backups/ (accessed 10 December 2011).
McKay, D. (2007), “Institutional repositories and their ‘other’ users: usability beyond authors”,
Ariadne, No. 52, available at: (accessed 15 October 2011).
Malaga, R.A. (2008), “Worst practices in search engine optimization”, Communications of the
ACM, Vol. 51 No. 12, p. 147.
Mikki, S. (2009), “Google Scholar compared to Web of Science: a literature review”, Nordic
Journal of Information Literacy in Higher Education, Vol. 1 No. 1, pp. 41-51.
Mueller, J. (2008), “Retiring support for OAI-PMH in Sitemaps”, Official Google Webmaster
Central Blog, available at:
support-for-oai-pmh-in.html (accessed 19 October 2011).
OCLC (2011), “CONTENTdm Digital Collection Management Software”, CONTENTdm (OCLC –
Digital Collection Services), available at: (accessed
27 October 2011).
Rieger, O.Y. (2009), “Search engine use behavior of students and faculty: user perceptions and
implications for future research”, First Monday, Vol. 14 No. 12, available at: http:// (accessed
21 October 2011).
Rushton, E.E., Kelehan, M.D. and Strong, M.A. (2008), “Searching for a new way to reach patrons:
a search engine optimization pilot project at Binghamton University Libraries”, Journal of
Web Librarianship, Vol. 2 No. 4, pp. 525-47.
Suber, P. (2004), “Very brief introduction to open access”, available at:
,peters/fos/brief.htm (accessed 15 October 2011).
(The) Times Higher Education (2010), “The Times Higher Education World University Rankings
2010-2011”, available at:
(accessed 4 October 2011).
University of Nottingham (2011), “OpenDOAR – Home Page – Directory of Open Access
Repositories”, available at: (accessed 12 October 2011).
Table AI. Highwire Press metadata mappings for seven paper types
Meta tag | Pre-print | Journal article
1. citation_author | Maloney, Krisellen; Antelman, Kristin; Arlitsch, Kenning; Butler, John | Maloney, Krisellen; Antelman, Kristin; Arlitsch, Kenning; Butler, John
2. citation_date | 2009 | 2010
3. citation_title | Future leaders’ views on organizational culture | Future leaders’ views on organizational culture
4. citation_publisher | N/A | Association of College and Research Libraries
5. citation_journal_title | N/A | College and research
6. citation_volume | | 71
7. citation_issue | | 4
8. citation_firstpage | 1 | 322
9. citation_lastpage | 56 | 347
10. citation_doi | |
11. citation_issn | |
12. citation_isbn | |
13. citation_keywords | Organizational culture | Organizational culture
16. citation_technical_report_institution | Uspace Institutional Repository, University of |
17. citation_technical_report_number | N/A |
18. citation_language | en | en
21. citation_pdf_url | |
22. citation_abstract_html_url | |
Table AIII. Highwire Press metadata mappings for seven paper types
Meta tag | Book chapter | Book
1. citation_author | Riloff, Ellen M. | Ram, Ashwin
2. citation_date | 1999 | 1999
3. citation_title | Information extraction as a stepping stone toward story | Understanding language: understanding computational models of reading
4. citation_publisher | MIT Press | MIT Press
8. citation_firstpage | 435 | 1
9. citation_lastpage | 460 | 519
12. citation_isbn | 0-262-18192-4 | 0-262-18192-4
13. citation_keywords | Information extraction; Story | Information extraction; Story
18. citation_language | en | en
20. citation_inbook_title | Understanding language: understanding computational models of reading |
21. citation_pdf_url | |
22. citation_abstract_html_url | |
Notes: Not relevant: 5. citation_journal_title; 6. citation_volume; 7. citation_issue; 10. citation_doi;
11. citation_issn; 14. citation_dissertation_institution; 15. citation_dissertation_name;
16. citation_technical_report_institution; 17. citation_technical_report_number; 19. citation_conference_title
Table AII. Highwire Press metadata mappings for seven paper types
Meta tag | PhD | Masters
1. citation_author | Rague, Brian William | Wu, Shangduan
2. citation_date | 2010/08 | 2010/07
3. citation_title | A CS1 pedagogical approach to parallel thinking | Electronic structure and transport property of disordered graphene
8. citation_firstpage | 1 | 1
9. citation_lastpage | 234 | 84
13. citation_keywords | Computer; CS1; Education; Parallel; Programming | Disorder; Electronic structure; Graphene; Transport property; Electronic structure
14. citation_dissertation_institution | University of Utah, College of | University of Utah, College of
15. citation_dissertation_name | PhD | MS
18. citation_language | en | en
21. citation_pdf_url | |
22. citation_abstract_html_url | |
Notes: Not relevant: 4. citation_publisher; 5. citation_journal_title; 6. citation_volume; 7. citation_issue;
10. citation_doi; 11. citation_issn; 12. citation_isbn; 16. citation_technical_report_institution;
17. citation_technical_report_number; 19. citation_conference_title; 20. citation_inbook_title
About the authors
Kenning Arlitsch is Associate Dean for IT Services at the J. Willard Marriott Library, University of
Utah. He recently completed a 12-month sabbatical, during which he conducted research with
OCLC on search engine optimization and network level library technologies. Mr Arlitsch began
building the Marriott’s digital library program in 1999, and founded the multi-state Mountain West
Digital Library, the Utah Digital Newspapers program, and co-founded the Western Soundscape
Archive. His department is responsible for digitization, interface design and development, ILS,
repository management, and server infrastructure for the library and its extended digital
programs. He holds a BA in English from Alfred University in New York, and a Master’s degree in
Library and Information Science from the University of Wisconsin-Milwaukee. He is also a
graduate of the Frye Leadership Institute (2005) and of the Research Libraries Leadership Fellows
program (2009), sponsored by the Association of Research Libraries. Kenning Arlitsch is the
corresponding author and can be contacted at:
Patrick S. O’Brien is an expert in customer focused, data driven sales and marketing
operations. He specializes in the use of new media channels and internet marketing to increase
product visibility, acquire new customers, and improve customer satisfaction. He first began
incorporating search engine optimization (SEO) into demand generation marketing programs in
1997. He is a former Accenture Strategy Consultant with over 15 years’ experience working with
business executives on converting marketing strategy into actionable results within the
pharmaceutical, biotechnology, healthcare, financial services and telecommunications industries.
Mr O’Brien holds a BA in Economics from UCLA and an MBA in Marketing and Finance from
The University of Chicago, Booth School of Business.
Table AIV. Highwire Press metadata mappings for seven paper types
Meta tag | Working paper
1. citation_author | Wolfinger, Nicholas H.; McKeever, Matthew
2. citation_date | 2006-07-26
3. citation_title | Thanks for nothing: changes in income and labor force participation for never-married mothers since 1982
6. citation_volume |
7. citation_issue |
8. citation_firstpage | 1
9. citation_lastpage | 43
10. citation_doi |
13. citation_keywords | Motherhood; Single Mothers; Income; Population surveys
16. citation_technical_report_institution | Institute of Public and International Affairs (IPIA), University of Utah
17. citation_technical_report_number | 2006-07-04
18. citation_language | en
19. citation_conference_title | 101st American Sociological Association (ASA) Annual meeting; 2006 Aug 11-14; Montreal, Canada
21. citation_pdf_url |
22. citation_abstract_html_url |
Notes: Not relevant: 4. citation_publisher; 5. citation_journal_title; 11. citation_issn; 12. citation_isbn;
14. citation_dissertation_institution; 15. citation_dissertation_name; 20. citation_inbook_title
To purchase reprints of this article please e-mail:
Or visit our web site for further details:
... According to Lynch, a repository is a set of services offered by universities to lecturers / students for the management and dissemination of digital material created by the institution [2]. This is essentially the organization's commitment to the stewardship of digital materials, including long-term preservation and proper organization and access or distribution. ...
... Procces entri repository(2). ...
... They characterized the plight of repositories very simply: "[d]igital repositories of every type face a common challenge: having their content found by interested users in a crowded sea of information on the Internet" (Arlitsch & O'Brien, 2012, p. 64). Arlitsch and O'Brien's (2012) study used a content analysis method to determine how well repositories were indexed by Google Scholar. They found that all repositories, irrespective of system platform (ContentDM, Digital Commons, DSpace, EPrints, or Fedora), had low indexing ratios, with most performing under 60%, thus making them essentially invisible to Google Scholar. ...
... They found that all repositories, irrespective of system platform (ContentDM, Digital Commons, DSpace, EPrints, or Fedora), had low indexing ratios, with most performing under 60%, thus making them essentially invisible to Google Scholar. Institutional repositories are powerful tools with "the potential to raise author citation rates, and in turn to affect university rankings, but this potential may be hampered if IR content is redundant or invisible to researchers who use GS [Google Scholar]" (Arlitsch & O'Brien 2012). Their 2013 book goes on to outline the ways they increased their own repository's visibility in both Google and Google Scholar. ...
... [ [8][9][10][11] -Google Google Scholar . SEO (search engine optimization). ...
... SEO (search engine optimization). [6][7][8][9][10][11], , [1][2][3][4][5] B e i j i n g Z h e j i a n g G u a n g d o n g J i a n g s u S h a n g h a i H e n a n H e b e i F u j i a n S i c h u a n S h a n d o n g J i a n g x i H u n a n T i a n j i n H u b e i A n h u i S h a n x i L i a o n i n g H a i n a n C h o n g q i n g J i l i n Y u n n a n S h a a n x i H e i l o n g j i a n g G u a n g x i X i n j i a n g A n h u i B e i j i n g F u j i a n G a n s u G u a n g d o n g G u a n g x i G u i z h o u H a i n a n H e b e i H e n a n H e i l o n g j i a n g H u b e i H u n a n J i l i n J i a n g s u J i a n g x i L i a o n i n g I n n e r M o n g o l i a N i n g x i a Q i n g h a i S h a n d o n g S h a n x i S h a a n x i S h a n g h a i S i c h u a n T i a n j i n T i b e t X i n j i a n g Y u n n a n Z h e j i a n g C h o n g q i n g A n h u i B e i j i n g F u j i a n G a n s u G u a n g d o n g G u a n g x i G u i z h o u H a i n a n H e b e i H e n a n H e i l o n g j i a n g H u b e i H u n a n J i l i n J i a n g s u J i a n g x i L i a o n i n g I n n e r M o n g o l i a N i n g x i a Q i n g h a i S h a n d o n g S h a n x i S h a a n x i S h a n g h a i S i c h u a n T i a n j i n T i b e t X i n j i a n g Y u n n a n Z h e j i a n g C h o n g q i n g , . . , . . ...
... The development of the metadata is vital to improve discoverability and usage 9 . The libraries followed interoperability, harvesting, and standards for the information retrieval function, and the Dublin Core is the most popular, well-recognised, and widely used standard 10 . However, various challenges are found in the metadata dimension, such as insufficient resources to create metadata, the quality, interoperability among schemes, the lack of controlled vocabulary, etc. ...
Full-text available
Institutional Repositories (IRs) are effective systems for managing and disseminating institutions’ in scholarly communication. More specifically, an IR enhances the visibility and discoverability of the content and validates the repository’s importance. Knowledge Organisation System (KOS) strengthens the digital content organisation, connects users with collections, and improves information retrieval functionalities. This paper investigates the present status of user interface features and incorporates KOS in the institutional repository of technical institutions, restricted to Centrally Funded Technical Institutions in India. A group of twenty-four web-accessible institutional repositories was identified, and their KOS and user interface features were evaluated. It was found that user interfaces of all IRs under study comply with essential search and navigation functionalities, such as simple and advanced search, browsing, faceted or filtering approaches, and integration with multiple KOS. Only a few of them include complex KOS, such as control vocabulary. All repositories show their search results in both normal-text and metadata views. Some have specific display features, such as highlighting the query or displaying a thumbnail. Google is one of the most popular search engines that indexes IR content for visibility and discoverability, and approximately 90 per cent of repositories are linked with NDLI. Global visibility and impact participation are moderate, and they require attention.
... Google Scholar crawls the web and indexes any document with an academic-looking structure (Martín-Martín et al., 2018). The crawler then follows links to metadata which is then evaluated by the Google Scholar algorithm to determine whether or not the information is added to the Google Scholar index (Kenning & Patrick, 2012;Mahelingga, 2021;Williamson & Mirza, 2015). Google Scholar itself has its visual arrangement limitations in categorizing document structures that seem academics through a convention (Google, 2020). ...
Full-text available
A reference manager such as Mendeley Desktop is needed to make it easier for scientific paper writers to automatically generate bibliography. However, metadata of several refereed journal articles cannot be extracted properly by Mendeley. Errors in generating a journal’s metadata will certainly disadvantage the authors of the cited article and journal. Mistakes in citation can also threaten the citer with plagiarism. This study seeks to find an in-house style that can be accurately detected by Mendeley Desktop as a reference for journal managers to make their journals’ metadata can be properly extracted. The study investigates samples of 27 articles from 27 different journals published by LIPI. The accuracy of metadata extraction of the sample articles is examined on Mendeley Desktop version 1.19.4 and refers to Mendeley’s metadata extraction pipeline with eight assessment variables. The accuracy percentage from Mendeley Desktop's metadata extraction is analyzed using a descriptive approach. It was found that variable of journal titles in all journals can not be extracted which gives the maximum accuracy percentage of 87.5%, with only four journals that get this percentage. There are nine indicators that can be followed to achieve this percentage. For journal managers, setting the layout of journal articles based on Mendeley Desktop’s or Google Scholar’s convention in extracting metadata needs to be considered in the digital era. However, advances in information technology have an influence on the visual layout and the in-house style of scientific papers.
... An investigation into appropriate metadata formats, such as MARC21, EAD and Dublin Core with RDF, shows how particular map data can be stored (Beamer, 2009). Transforming metadata schemas in institutional repositories will lead to increased indexing by Google Scholar (Arlitsch & O'Brien, 2012). Identification of criteria for the evaluation and integration of visual search interfaces, proposing guidelines and recommendations to improve information retrieval tasks with emphasis on the educational context ...
Thesauri and visual vocabularies are commonly constructed with the help of TemaTres, which shows the relations between terms for a specific facet and sub-facet. This paper explores additional facilities for users as part of a cloud computing system. It integrates external repositories and the software interface in both offline and online environments. How can these external repositories be integrated with TemaTres? Which metadata sets are available for data interoperability and crosswalks? How can these external repositories be accessed from the TemaTres metadata interface? To address these questions, the paper uses the popular thesaurus construction and visual vocabulary software TemaTres for easy integration of external repositories. The whole process is designed and developed by configuring TemaTres files such as config.tematres.php and image icons. This integrated framework helps users and librarians access thesauri and visual vocabularies from different external repositories. Finally, it creates a common metadata access interface for users.
... A 2017 study of natural resources and environmental scientists found that "while institutional repositories were commended by interviewees for providing permanent archiving and long-term preservation, for supporting storage and download, and for ensuring accessibility and credibility… [they were] not particularly valued for searchability and discoverability" (Shen 2017, 120). While efforts have been made to improve discovery for institutional repositories (Arlitsch and O'Brien 2012), Mannheimer, Sterman, and Borda (2016) find that research data are discovered and reused most often if they are: (1) archived in disciplinary research data repositories; and (2) indexed in multiple online locations. ...
Objective: Promoting discovery of research data helps archived data realize its potential to advance knowledge. Montana State University (MSU) Dataset Search aims to support discovery and reporting for research datasets created by researchers at institutions. Methods and Results: The Dataset Search application consists of five core features: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by MSU researchers. Conclusion: The Dataset Search application is designed to be easily customized and implemented by other institutions. Indexes like Dataset Search can improve search and discovery for content archived in data repositories, therefore amplifying the impact and benefits of archived data.
... In these cases, the work is novel to the AFM event and will not be well represented elsewhere and therefore is not indexed by sites such as Google Scholar. To improve the visibility of these works and all of our publications more generally, we plan to use appropriate meta-data schema such as Highwire Press, ePrints or other tag systems. The information required to populate these tags is already available within our CE model described earlier, so making these available as meta-data tags within the generated pages is a simple but valuable extension. ...
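As an illustration of the kind of meta-data tags mentioned above, the following is a minimal sketch of how Highwire Press-style `citation_*` tags might be generated for a page. The function name `highwire_meta_tags` and the sample record are hypothetical and do not come from any of the cited works; the tag names follow the Highwire Press convention described in Google Scholar's inclusion guidelines, and a real repository would populate them from its own data model.

```python
from html import escape


def highwire_meta_tags(title, authors, publication_date, pdf_url):
    """Return HTML <meta> tags using Highwire Press-style citation_* names.

    Hypothetical helper for illustration only: one tag per author,
    values HTML-escaped so they are safe inside attribute quotes.
    """
    pairs = [("citation_title", title)]
    pairs += [("citation_author", author) for author in authors]
    pairs.append(("citation_publication_date", publication_date))
    pairs.append(("citation_pdf_url", pdf_url))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


# Hypothetical record; every value here is a placeholder.
record = {
    "title": "Sample Paper Title",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "publication_date": "2012/01/15",
    "pdf_url": "https://example.org/papers/sample.pdf",
}

print(highwire_meta_tags(**record))
```

The output would be placed in the `<head>` of the item's landing page. Emitting one `citation_author` tag per author, rather than a single delimited string, keeps each name unambiguous for a crawler.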
Scientific publications from a group or consortium often form a coherent larger body of work with underlying threads and relationships. Rich social, structural, and topical networks between authors and organizations can be identified, and to convey these we have created the publicly available “Science Library” as a user‐centric, interactive portal. A key consideration in this endeavor is rapid and efficient curation of the corpus of publications, both in terms of assuring quality and of minimizing the effort required. For this to be sustainable it must offer substantial benefits to the community and avoid excessive operational cost through cumbersome or complex processes. We describe the agility of the Science Library implementation as a controlled natural language (CNL) semantic knowledge graph and describe the different roles within the community to ensure efficient curation, validation, and provenance of the content. By describing the process of curation and validation, alongside the CNL‐based definition of the model, we show how relatively non‐technical users are able to interact with, and contribute to, the Science Library. This provides an extensible approach, initially based around digital library and virtual community capabilities, that can be applied more broadly to support other desired capabilities of Science Gateways.
An implementation in the RODERIC repository of the Universitat de València is presented. To analyze the impact of the implementation, eight indicators were defined and analyzed in Google Search and Google Scholar as applicable: visits, visits to bibliographic records, documents downloaded, impressions, clicks, CTR, average position in the SERP, and position in the SERP. These were analyzed over two consecutive one-year periods, before and after the implementation. The results differ between the two search engines. In the case of Google Search, although the number of impressions increased considerably (21.05%), both clicks (10.38%) and the number of sessions (15.03%) decreased. In the case of Google Scholar, sessions increased slightly (6.25%). The number of records viewed and of documents downloaded from the repository improved by 16.21% and 12.18%, respectively.
Purpose: The purpose of this research is to design and develop an integrated open portal for searching and retrieving information from the institutional repositories of universities and higher education and research institutes in Iran. This paper discusses the collection and aggregation infrastructure, conceptual model, and architectural design and components that a practical institutional repository search portal should have. The architecture and services of the Iranian Institutional Repositories Integrated Search (IRIS) Portal are designed as the first such example in the country. A functional development of this portal will be provided as a local system after implementation. Methodology: The design and development of IRIS drew on background knowledge gathered by reviewing the scientific literature and its methodologies, assessing needs and the feasibility of existing infrastructure at the national level, and analyzing similar international systems. On that basis, the initial product implementation plan, called “Design and Development of Iranian Institutional Repositories Integrated Search (IRIS) Portal”, was specified along with all the details of the subsystems and related workflows. The system was developed in five stages, using the Scrum agile framework as an iterative and incremental methodology for managing the IRIS software project. Findings and Conclusion: To design and develop this system, we reviewed similar models used in other countries. The general architecture of the system is based on the data provider and service provider model using the OAI standard. The architecture comprises three major parts: a data mining system, a system for data processing management and entity linking, and a provision system.
The design of the IRIS system considered features such as ease of use, a customizable and extensible web user interface (especially with common languages), high visibility to search engines and indexing tools, permanent access to resources, support for search logic, and a unique identifier for each document. A very important requirement for developing this system is that data be published freely and under open and permanent source licenses. Otherwise, one cannot expect to develop a suitable system that can mine, collect and aggregate data in an integrated manner. After the system was implemented, it was found that the data entered by the data providers in this model mainly had structural problems, so the IRIS service may eventually encounter some defects. It seems that policies prepared by the Ministry of Science, Research and Technology (MSRT) could ensure the regular updating and loading of data in organizational repositories and, ultimately, in the IRIS system.
This article introduces and discusses the concept of academic search engine optimization (ASEO). Based on three recently conducted studies, guidelines are provided on how to optimize scholarly literature for academic search engines in general and for Google Scholar in particular. In addition, we briefly discuss the risk of researchers' illegitimately 'over-optimizing' their articles.
The scope of the article is to give a literature review comparing the two services. To obtain insight into Google Scholar, it is tested against Web of Science (WoS), the most recognized proprietary database for peer reviewed journal content. Both databases are multidisciplinary, provide links to library holdings and offer opportunities for export of references. In addition they have the powerful feature of tracking citing items. Comparisons are based on database content, recall and research impact measures. The article touches on library teaching issues at higher education institutions, and argues why Google Scholar, along with WoS, is worth including in library programs for information literacy teaching. Google Scholar is popular among faculty staff and students, but has been met with scepticism by library professionals and is therefore not yet established as a subject for teaching.
The University of Mississippi Library created a profile to provide linking from Google Scholar (GS) to library resources in 2005. Although Google Scholar does not provide usage statistics for institutions, use of Google Scholar is clearly evident in looking at library link resolver logs. The purpose of this project is to examine users of Google Scholar with existing data from interlibrary loan transactions and library Web site click-through logs and analytics. Questions about user status and discipline, as well as behaviors related to use of other library resources, are explored.
Search engine use is one of the most popular online activities. According to a recent OCLC report, nearly all students start their electronic research using a search engine instead of the library Web site. Instead of viewing search engines as competition, however, librarians at Binghamton University Libraries decided to employ search engine optimization strategies to make their Web site more visible on the search engine result pages. Although search engine optimization is used frequently by commercial Web sites, few libraries have attempted to optimize their own sites. This article describes Binghamton University's experiences in developing and implementing an optimization pilot project. The research presented in this article has importance for libraries who may be considering an optimization project for their own sites. (Contains 9 tables and 20 notes.)
Search engine optimization, or the practice of designing a web site so that it rises to the top of the results page when users search for particular keywords or phrases, has become so prevalent on the modern web that it has a significant influence on Google search results. This article examines the techniques used by search engine optimization practitioners, the difference between “white hat” and “black hat” optimization tactics, and why it is important for library staff to understand these techniques and their impact on search engine results pages. It also looks at ways that library staff can help their users develop awareness of the factors that influence search results and how to better assess the quality and relevance of results listings.
To better understand the information needs of young university researchers, an observational study was performed at three universities in Stockholm, Sweden. The observations revealed that most of the researchers used Google for everything, that they were confident that they could manage on their own, and that they relied heavily on immediate access to electronic information. They had very little contact with the library, and little knowledge about the value librarian competence could add. One important conclusion of the project is that librarians have to leave the library building and start working in the research environment, as well as putting some thought into the fact that library use is considered complicated, but Google (etc.) is easy. The findings of this project will influence changes in library services in both the near and the more distant future.
This paper examines the use of Web search engines by faculty and students to support learning, teaching, and research. We explore the academic tasks supported by search engine use to investigate if and how students and scholars vary in their use patterns. We also investigate the satisfaction levels with search outcomes and trust in search engines in supporting specific tasks. This study is based on triangulating three data-gathering methods, including a Web-based survey, interviews, and search log reviews. One of the goals of the study is to demonstrate how each methodology exhibits a unique strength in collecting information about different dimensions of search behavior and perceptions. We conclude that, although there are variations in search engine use among the faculty, graduate and undergraduate students surveyed, there is convergence in means of overall satisfaction with the outcomes of their searches and trust in search engines in supporting their studies and research. The paper concludes with a discussion of the implications of the findings for future search engine research and information practitioners.