Beyond Search: An Overview of Emerging Applications
for Scholarly Digital Libraries in the Era of Big Data
Debarshi Kumar Sanyal
Indian Institute of Technology Kharagpur, Kharagpur-721302, West Bengal, INDIA
Email: debarshisanyal@gmail.com
ABSTRACT
Nowadays, scholarly digital libraries are common. On the one hand, there are
domain-specific libraries like IEEE Xplore and the ACM Digital Library; on the
other hand, there are multidisciplinary digital libraries like the newly built
National Digital Library of India that cater to a more general audience but
index a significant volume of scholarly works. These libraries provide access to
millions of documents but do not always provide discovery mechanisms beyond
browse and keyword-based search. Innovative retrieval services and analytic
tools are needed to help users explore these rich repositories and keep abreast of
the latest developments in the academic landscape. In this chapter, we review
the emerging applications that are of critical importance to the usability of these
libraries. In particular, we consider information visualization, impact
measurement and recommendation services as the key emerging applications.
We also briefly discuss how they can be introduced in the National Digital
Library of India.
Keywords: Digital library, Big scholarly data, Information visualization,
Recommendation Systems, Usability
1 INTRODUCTION
Scholarly data normally refer to works that report research results, are written by
subject-matter experts, and are published in peer-reviewed venues like edited books,
conference proceedings and journals (Cornell University, 2018). They present new
knowledge or novel perspectives on existing knowledge. Scholarly data are being
produced at stupendous rates. It has been reported that global scientific output
doubles every nine years (Noorden, 2014a). Unsurprisingly, data scientists now talk
about big scholarly data (Xia et al., 2017). This definition is generally widened to
include other academic resources like presentation slides, audio and video lectures,
project descriptions, patents, and datasets. A scholarly digital library (SDL) stores or
indexes scholarly works. An example is the e-print service arXiv
(https://arxiv.org/) that contains more than 1.3 million e-prints in Physics,
Mathematics, Computer Science, Quantitative Biology, Quantitative Finance,
Statistics, Electrical Engineering and Systems Science, and Economics. In this
chapter, we focus on large SDLs which index upwards of several thousand scholarly
works. The digital libraries of many technical publishers like IEEE
(https://ieeexplore.ieee.org/Xplore/home.jsp)
and Springer
(https://link.springer.com/) fall comfortably in this category. Another
example is the National Digital Library of India (NDLI)
(https://ndl.iitkgp.ac.in/) designed by the Indian Institute of
Technology Kharagpur to provide a single search window to a wide variety of digital
resources present in different knowledge repositories in India and abroad. It indexes
more than 17 million resources of more than 60 types including articles, scholarly as
well as popular publications, theses, question papers, presentations, datasets, audio
and video lectures, albums, historical records, law judgments and patents. Unlike
many national libraries, NDLI indexes a multitude of scholarly works like research
papers; so we include it under SDL.
As more and more digital contents get added to SDLs, it becomes difficult to
explore their contents effectively. Most of them offer simple search, browse and, in
some cases, recommendation services. Although keyword and faceted search are
common, semantic search (e.g., using machine learning (Xiong et al., 2017)) is
gradually being introduced in digital libraries. Multimodal search, though rare today,
is also likely to become prevalent in the near future.
Given the issues of big data, it is important to have sophisticated tools beyond
search to explore the repositories and discover knowledge from them. We identify
the following three categories of emerging applications that we believe can enhance
the usability of SDLs: information visualization, impact measurement, and
recommendation services.
Recommendation services are present in some digital libraries like IEEE Xplore
and Scopus; they are absent in arXiv and many other libraries. They offer a
personalized discovery mechanism for users guiding them in the wilderness of big
data. Hence, they are very useful and are likely to be more popular in future.
Measuring the impact of different venues and using these measures to rank search results is
also important, as the number of venues is growing very fast. It helps users identify
the results published in top-tier venues and hence, presumably, the most
significant ones. Although many commercial search engines like Google Scholar
display citation data, they are hardly found in SDLs like NDLI. For large digital
libraries, it is beneficial if users can visually examine the contents at a certain level
of abstraction. This helps them understand whether the source will satisfy their
information needs. Again, most SDLs like IEEE Xplore, arXiv and NDLI do not
support it. Thus, these applications are not widely implemented despite their
potential benefit. There is also a lack of consensus on the best approach to each of
them. As big scholarly data make their presence more prominent, they will become
ever more important. These three categories capture a large gamut of potential tools
for SDLs. We survey these applications and highlight the existing issues in them.
Towards the end of the chapter, we briefly describe the challenges in incorporating
these services in NDLI.
2 LARGE SCHOLARLY DIGITAL LIBRARIES
Nowadays, SDLs are quite common and range from focused, domain-specific ones
to world-scale multi-disciplinary repositories. Well known SDLs in Computer
Science are the ACM Digital Library (https://dl.acm.org/) and CiteSeerX
(http://citeseerx.ist.psu.edu/index). IEEE Xplore covers Computer
Science, Electrical and Electronics Engineering while PubMed Central archives
biomedical and life science journal literature. ScienceDirect
(https://www.sciencedirect.com/) and arXiv are multi-disciplinary
databases (spanning technology, science and social science). Most of them allow free
access to metadata like title, authors, and abstract while restricting full text access to
subscribers only. In fact, many of these libraries like ACM Digital Library index a
large number of resources but host only a subset of them; the remaining resources
come from related digital repositories.
Scholarly search: A digital library is typically equipped with a search engine so
that users can explore its contents. In addition, there are web-scale academic search
engines like Google Scholar (https://scholar.google.co.in/), Microsoft
Academic (https://academic.microsoft.com/)
and Semantic Scholar
(https://www.semanticscholar.org/)
that index scholarly works. Anurag
Acharya, the key inventor of Google Scholar, defined scholarly works as,
“‘Scholarly’ is what everybody else in the scholarly field considers scholarly. It
sounds like a recursive definition but it does settle down.” (Noorden, 2014c). There
is no precise test, but one can start with a paper known for its scholarly
content (as judged by experts), collect the papers it cites and repeat the process.
These engines provide powerful means to discover scholarly content from anywhere
on the Web. Hence, they often serve as the gateway to more specific SDLs.
However, the metadata and content furnished by search engines are usually more
noisy than those available from specific publishers or libraries. Academic search is
usually classified as either navigational (where the user is looking for a specific
scholarly document; she usually uses precise metadata like digital identifiers such as
DOI or ISBN, or full title or author name with title or author name with venue/year,
etc.) or informational (where the user is seeking some information in the library)
(Khabsa et al., 2016). A third kind - transactional query - is also possible in an SDL,
though it is less commonplace. This term is used in the taxonomy of web search to
indicate queries in which the user’s intent is to perform some web-mediated activity
like search for best sites to download music or buy gadgets (Broder, 2002). If an
SDL is linked to multiple book sellers or to physical libraries that lend books, users
might issue transactional queries, too. The SDL should then display the various sites
from where the user can borrow or buy the desired book. However, at present, this
is not so common. To satisfy users’ needs, a scholarly search engine uses lexical
search (where keywords taken as free text or as values of specific metadata fields are
matched against the document corpora) or semantic search (where the search engine
tries to infer the intent of the query and retrieve results accordingly). Search engines
might construct elaborate linked data graphs and knowledge graphs offline to make
semantic search possible. Although many national libraries (like Library of Congress
USA, British Library, National Library of France and Europeana) do maintain linked
data (Hallo et al., 2016)
and some of them support SPARQL queries, semantic search
is not frequently seen in SDLs (e.g., in most of the technology SDLs mentioned
earlier).
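To make the lexical search described above concrete, it can be sketched as an inverted index with AND semantics over metadata text. The toy records and field names below are our own illustration, not any particular SDL's schema:

```python
from collections import defaultdict

# Toy corpus of scholarly metadata records (illustrative only).
corpus = {
    1: {"title": "Deep learning for citation analysis", "venue": "JCDL"},
    2: {"title": "A survey of scholarly big data", "venue": "IEEE Big Data"},
    3: {"title": "Citation count prediction", "venue": "CIKM"},
}

def build_index(corpus):
    """Map each lowercased token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in corpus.items():
        for text in fields.values():
            for token in text.lower().split():
                index[token].add(doc_id)
    return index

def lexical_search(index, query):
    """AND-semantics keyword search: return docs containing every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

index = build_index(corpus)
print(sorted(lexical_search(index, "citation")))         # [1, 3]
print(sorted(lexical_search(index, "citation survey")))  # []
```

Semantic search, by contrast, would try to retrieve document 2 for a query like "scholarly big data overview" even though "overview" never occurs in it, typically by consulting a knowledge graph or learned embeddings offline.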
Academic social networks: Closely related to SDLs are academic social networks
like ResearchGate (https://researchgate.net), Academia.edu
(https://www.academia.edu/)
and Mendeley
(https://www.mendeley.com/). They allow researchers to create personal
profiles, upload papers and related artifacts (like datasets, presentations, and code),
interact with other researchers, and track the influence (e.g., citations, downloads,
views, etc.) of their papers. These initiatives encourage unrestricted deliberations
among scientists (which might create an alternative to peer review) and sharing
papers openly (which helps many researchers cross over the paywalls in
subscription-based SDLs) (Noorden, 2014b). SDLs might find it profitable to
integrate with them or at least borrow some of their features on collaborative
interaction.
Curating content: While SDLs and academic social networks normally require
maintainers to manually upload papers and other artifacts, some SDLs like
CiteSeerX and Web-scale academic search engines typically deploy web crawlers to
harvest scholarly resources from the Web. Most articles are in Portable Document
Format (PDF). Usually SDLs, search engines and social networks extract metadata
and citations from the articles so that users can easily access them without viewing
the entire paper (Wu et al., 2015). Some of them also parse the articles for more
specific components like mathematical equations, chemical formulae, figures, tables,
algorithms, etc. using rule-based or machine learning techniques. They form the
basis for higher level representations and analytics on these works.
3 EMERGING APPLICATIONS FOR LARGE SDLS
Several applications and analytic tools can be built on top of large SDLs. They help
to understand the patterns hidden in the voluminous scholarly data stored in the
repositories. We look at some of the popular and emerging applications below that
we believe are critical to the usability of a large SDL.
3.1 Information Visualization
Search and retrieval form the most important functionality offered by a digital
library, for without a proper discovery service, the scholarly works would remain
inaccessible to users. Keyword-based search is most common, although most SDLs allow
faceted search that applies specific filters on the metadata. Clustering and
visualization of search results can help users navigate through them easily. When the
user’s information needs are less well-defined, she would prefer to browse the
library. Again, visualizations can aid this process enormously (Greene et al., 2000).
Visualization tools must consider what data source to use, what analysis to perform
(e.g., discourse analysis, text summarization, etc.), what kind of visualization task to
execute (lower-level tasks like categorization, overview, etc. on the analysis
result), what kind of graphics to use (e.g., scatter plots, maps, etc.) and how many
dimensions to use (2D or 3D) (Kucher and Kerren, 2015).
In an SDL, visualization can be used at different levels, namely, collection level,
document level and intra-document level (Herrmannova and Knoth, 2012).
Attributes of a collection can be visualized using tag clouds (Hassan-Montero and
Herrero-Solana, 2006), scatter plots (Marks et al., 2005), and line graphs (Michel et
al., 2011). It can provide overviews of the topics or, simply, the top words in the
collection, or the temporal variation in topics/words in the collection. It can also
show thematic partitions of the collection with each partition containing documents
of a specific theme.
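As a minimal sketch of the data behind such a collection-level view, the term frequencies that would drive a tag cloud can be computed as follows; the titles and stop-word list are invented for illustration:

```python
from collections import Counter

# Toy collection of document titles (illustrative only).
titles = [
    "Deep learning for scholarly search",
    "Scholarly big data: a survey",
    "Visualizing scholarly collections with tag clouds",
]
STOPWORDS = {"for", "a", "with", "the", "of"}

# Count content words across the collection.
counts = Counter(
    word
    for title in titles
    for word in title.lower().replace(":", "").split()
    if word not in STOPWORDS
)

# The most frequent terms would be rendered largest in the tag cloud.
for word, freq in counts.most_common(3):
    print(word, freq)
```

A real SDL would compute such counts offline over abstracts or full text, and a temporal line graph of the same counts per publication year gives the word-trend views mentioned above.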
Fig. 1. The above map developed by Infobaleen (wikimap.infobaleen.com) shows
Wikipedia topics; each box is a cluster of related Wikipedia pages (Retrieved on December 31,
2017).
Figure 1 shows an information map of Wikipedia; each box is a cluster of related
Wikipedia pages. Similar visualizations could be designed for collections in an SDL.
Collection-level views are very useful in digital humanities, for example, to track the
variation in the frequency of a word’s use (Michel et al., 2011) or trace the temporal
change of a word’s linguistic sense. At the second level, one is more interested in seeing
how the documents connect with each other, e.g., through correlation of topics, co-
citation or co-authorship. The third level focuses on visualizing the contents of a
single document at some level of abstraction. For example, it may show the topics in
it or build a semantic model (like a knowledge graph) of the document (Ronzano and
Saggion, 2015).
We feel visualizations should support several zoom levels and furnish precise
numerical statistics whenever requested by the user. Numerous techniques available
for text mining, generation of ontology and linked data, and information
visualization can be used to design graphical representations of data within an SDL
(Liu et al., 2014). However, visualization methods have been traditionally tested on
small data sources and may face scalability issues with big data, especially if
computation needs to be done in real-time. They are hardly found in real-world
SDLs. Efforts should also be made to visualize other resources like datasets,
question banks, lectures, etc. available in SDLs (like NDLI) and present them in an
integrated canvas with text documents.
3.2 Impact Measurement
Researchers have devoted considerable effort to measuring the impact of scholarly
publications (Mingers and Leydesdorff, 2015). It is an important activity as it helps
to identify significant works and pursue further research on their proposed
ideas. It also helps to identify the influential leaders in different branches of
knowledge. This information can be used by different stakeholders like funding
agencies for dispensation of research grants, academic and research institutes in
deciding promotions for their staff, authors in identifying the most influential
publication venues, researchers in understanding the state-of-the-art, accreditation
bodies in determining relative positions of research institutes, and academic search
engines in ranking retrieval results. Impact measurement has been typically at the
following levels.
Fig. 2. Google Scholar shows citation count for each article; it is also considered when ranking
retrieved results. (Retrieved December 31, 2017).
Articles: The most common metric in this area is citation count. It is widely
accepted as an indicator of the influence of a scholarly article. For example, as
shown in Figure 2, Google Scholar shows citation counts for every scholarly article.
It is also shown in CiteSeerX, IEEE Xplore and the ACM Digital Library. However, it
has many limitations, including inflation by self-citations, giving equal importance to
all citations, favoring older papers, and an inability to account for
variations in citation patterns across disciplines. So, several alternatives have been
suggested that consider the nature (for example, appreciation, criticism, etc.) of the
citations, consider the age of the paper, and adjust for discipline-specific citation
patterns. Researchers have also tried to predict the citation counts of new
publications (Yan et al., 2011).
The increasing presence of scholarly works on social networks and other online
platforms has motivated the design of altmetrics to measure research impact (Priem
et al., 2010). They are alternatives to the traditional citation-based metrics discussed above.
They can be roughly divided into (1) usage (i.e., views or downloads of the paper),
(2) captures (e.g., bookmarks), (3) mentions (e.g., comments, blog posts, Wikipedia
articles, reviews), (4) social media presence (e.g., tweets, likes, shares), and (5)
citations (e.g., in Web of Science) (Melero, 2015). Their proponents argue that they
can be accumulated fast (unlike the huge gestation time for citations), capture
influence of a paper from a rich diversity of sources, apply to many forms of
publishing like papers, datasets, code, experimental designs, etc., are amenable to
algorithmic processing and are more robust to fraudulent manipulation. They are
already visible in BioMed Central and Scopus. See Figure 3.
Fig. 3. Altmetrics for an article in Scopus (Retrieved December 31, 2017).
Authors: The most common metric here is the h-index. An author’s h-index is h if
she has h publications that have been cited at least h times each. Thus, it overlooks
the citation counts of her very highly cited articles. The g-index improves on it.
Given a set of articles of an author arranged in decreasing order of their citation
counts, the g-index is the (unique) largest number g such that the top g articles
received (together) at least g² citations.
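Both definitions translate directly into code; a small sketch with invented citation counts:

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
    return h

def g_index(citations):
    """Largest g such that the top g papers together have >= g^2 citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cites, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

papers = [10, 8, 5, 4, 3]   # one author's per-paper citation counts (invented)
print(h_index(papers))      # 4 (four papers with at least 4 citations each)
print(g_index(papers))      # 5 (top 5 papers have 30 >= 25 citations)
```

The example shows why the g-index is considered an improvement: the heavily cited first paper lifts g to 5 while h stays at 4.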
Venues: Journals have long been established as the authoritative venues to
archive scholarly work. Several metrics to measure their impact have been proposed.
Examples include Journal Impact Factor (defined for journals listed in Journal
Citation Reports (subset of Web of Science); JIF of a journal for a given year is the
number of citations, received in that year, of articles published in that journal during
the two preceding years, divided by the total number of articles published in that
journal during the two preceding years), Scimago Journal Rank (SJR) (based on
Scopus database, SJR of a journal is the ratio of the number of weighted citations
received in a year to the total number of publications in the last 3 years; the weight is
related to the importance of the source from where the citation comes), Source-
Normalized Impact per Paper (SNIP) (based on Scopus, SNIP of a journal for a year
divides the journal’s citation count per paper by the citation potential in its subject
area; it compensates for the skew in citation counts across disciplines), and the
recently proposed CiteScore (based on Scopus database, CiteScore of a journal for a
year is the ratio of the number of citations received in the year to the number of all
documents published in the journal in the preceding three years; here, computation is
done over all documents indexed by Scopus) (Elsevier, n.d.). Several other metrics
are proposed in the literature (Bradshaw and Brook, 2016). There are hardly any
popular ranking metrics for a conference series although variants of h-index are
sometimes used (modeling the conference as an author) (Google Scholar, n.d.).
Since the above metrics are not free of criticism and depend on the volume of the
indexed literature, SDLs should ideally display multiple metrics. But that is
practically difficult due to the huge data and computation needed.
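As an illustration, the JIF definition above reduces to a simple ratio; a sketch with hypothetical numbers:

```python
def impact_factor(citations_this_year_to_prev2, articles_prev2_years):
    """JIF for year Y: citations received in Y to items published in Y-1 and Y-2,
    divided by the number of citable articles published in Y-1 and Y-2.
    (CiteScore is analogous but uses a three-year window over all Scopus documents.)"""
    return citations_this_year_to_prev2 / articles_prev2_years

# Hypothetical journal: 450 citations in 2018 to its 2016-2017 articles,
# and 300 citable articles published in 2016-2017.
print(round(impact_factor(450, 300), 2))  # 1.5
```

SJR and SNIP differ mainly in the numerator: SJR weights each citation by the prestige of its source, while SNIP divides by the citation potential of the subject area.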
3.3 Recommendation Services
The extraordinary rate at which scholarly literature is being produced baffles even
the most avid of readers. It is almost impossible to keep aware of all publications
even in a focused sub-field. Hence, digital libraries are increasingly providing
recommendation services that suggest scholarly resources to the users based on their
personal profiles. Figure 4 shows how Elsevier displays recommendations when a user
views a paper. Additionally, recommendations on potential author collaborations,
reviewers, publication venues, queries, courses, and tags for articles are useful in
digital libraries that index research works and are frequently used by active
researchers. Recommendations can be personalized for individual users or groups of
users (e.g., researchers in a specific domain).
Primarily, three techniques have been used to produce research paper
recommendations: content-based filtering (CBF) (a user is recommended an item
similar to the one she liked earlier), collaborative filtering (CF) (based on the notion
that people who agreed in their evaluation of certain items in the past will probably
agree again in the future) and hybrid (uses features of CBF and CF). CBF methods
typically use words or n-grams from article title, keywords, abstract and full text and
compare them across articles using techniques like vector space models (commonly
using term frequency-inverse document frequency as a weighting scheme and cosine
similarity to infer relatedness), probabilistic topic models (like Latent Dirichlet
Allocation), etc. Content analysis and comparison can be improved with external
knowledge bases. More generally, they may also use author names and co-citation
statistics. CBF is the most common method, possibly, due to lack of sufficient
transaction logs (Beel et al., 2016). In this comprehensive review (Beel et al., 2016),
Beel et al. point to the lack of systematic comparison of different research paper
recommender system approaches proposed in the literature. This is due to diversity
in evaluation methods (metrics; strategy, i.e., online or offline), low number of
participants in user studies, lack of information on runtimes and scalability,
heterogeneity in chosen baselines, diversity and unavailability of datasets and last,
but not the least, sparse information on the algorithms used. Hence, it is unclear if
CBF is better than CF or the other way round and which papers have advanced the
best designs. The recent paper (Beel and Dinesh, 2017) elaborates more on the
dismal scenario and encourages researchers to implement and evaluate recommender
systems in the real world and open source their projects. A promising step in this
direction is the Babel project (http://babel.eigenfactor.org/). It is,
though, clear that a scholarly recommender system should deliver relevant
recommendations in a timely manner; the recommendations displayed should be
small in number (say, at most five, to avoid information overload) and possess
diversity (e.g., showing related videos and datasets with articles). Recommendation
systems should also succinctly display the rationale behind their recommendations
and allow users to explicitly provide feedback for improvement. It may be noted here
that great care is needed in the design of recommendation services; otherwise there is a
risk of trapping the user in a biased zone created by algorithms and artificial
intelligence. This will stifle innovation and discovery instead of encouraging them.
More research is also needed in other types of academic recommenders that we have
briefly mentioned at the beginning of this subsection.
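The vector-space variant of CBF described above (TF-IDF weighting with cosine similarity) can be sketched as follows; the toy abstracts are invented and the TF-IDF computation is hand-rolled rather than taken from any particular library:

```python
import math
from collections import Counter

# Toy article abstracts (illustrative only).
docs = {
    "A": "neural networks for citation recommendation",
    "B": "citation recommendation with neural models",
    "C": "medieval history of trade routes",
}

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF weight vector for each document."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    vectors = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        vectors[d] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(docs)
# Recommend the article most similar to the one the user is viewing ("A").
scores = {d: cosine(vecs["A"], v) for d, v in vecs.items() if d != "A"}
print(max(scores, key=scores.get))  # B
```

Document B is recommended because it shares the discriminative terms "neural", "citation" and "recommendation" with A, while C scores zero; a CF approach would instead compare users' interaction histories rather than document content.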
Fig. 4. Recommendations (on right pane) in Elsevier ScienceDirect digital library (shown when
a user selects an article) (Retrieved December 31, 2017).
4 NDLI AND THE SCOPE OF SCHOLARLY APPLICATIONS
NDLI is a metadata hub that connects to different content holders like institutional
repositories and publisher sites. It caters to academic and non-academic readers, a
wide academic age group from kindergarten to research scholars and a host of
different languages. It indexes millions of resources of various types including
scholarly (like research papers, academic books, audio-visual lectures, etc.) and
popular publications (like children’s story books). However, in this section, we
restrict our analysis to scholarly resources.
Fig. 5. Search in NDLI; various filters are shown on the left and right panes (Retrieved
December 31, 2017).
Currently NDLI provides a simple browse and search interface with a number of
filters on metadata attributes like author name, language, learning resource type, etc.
See Figure 5. However, it does not contain the advanced services discussed above.
Although it may not be justifiable to compare an indigenous initiative like NDLI
with other advanced international scholarly platforms, it is worthwhile to explore the
scope of including these services in NDLI in future.
NDLI catalogues millions of resources from various sources like institutional
digital repositories from all over India and digital libraries maintained by
international publishers. Given this size and diversity, it would be useful to provide
visualizations of the various sources so that users can get a graphical overview of
their contents.
NDLI does not show any impact figures against publications, authors or venues.
This is indeed difficult because of copyright restrictions; in many cases, NDLI is not
allowed to parse full text or extract references from them. One alternative is to
collaborate with external indexing services like the Web of Science or open citation
databases (https://i4oc.org/).
With regard to recommendation, the overwhelming quantity and diversity of
resources in NDLI hold the promise of rich recommendations. Recommendations
using the CBF method can easily be provided by analyzing bibliographic metadata.
However, in many cases metadata are sparse and noisy. Moreover, the resources span
multiple domains, which should be used to cluster them before recommendations
are produced (West et al., 2016). Click-logs should also be analyzed to discover
library search behavior of users so that effective CF-based recommendations can be
added.
We emphasize that evaluating a digital library is a very complex task and requires
measuring several attributes (Sandusky, 2002; Fuhr et al., 2007). The above tools
alone do not produce a perfect digital library. But if it already has a rich content base
(as is true for NDLI), these tools can make it considerably more usable for
researchers.
5 CONCLUSION
We presented a brief review of emerging tools for SDLs. The exponential growth rate
of scholarly works makes such tools absolutely essential for a library user, whether she is
a casual reader or a dedicated researcher. Although a plethora of applications is
possible, we restricted our attention to three primary categories, namely, information
visualization, impact measurement and recommendation services that, we believe,
are the most important. We have seen there are many approaches to implement each
application, each having its own pros and cons. But for each of them, the huge data
size results in considerable computational load. Hence, it is important to decide
whether to implement them as an offline or an online process (wherever there is a
choice). We also briefly discussed the scope of including them in NDLI which is
envisioned to revolutionize learning across all sections of Indian society.
ACKNOWLEDGEMENTS
This work is supported by Development of National Digital Library of India as a
National Knowledge Asset of the Nation sponsored by Ministry of Human Resource
Development, Government of India.
REFERENCES
Beel, J. and Dinesh, S. (2017). Real-world recommender systems for academia: The pain and gain in
building, operating, and researching them. In Proceedings of BIR@ECIR, pages 6–17.
Beel, J., Gipp, B., Langer, S., and Breitinger, C. (2016). Paper recommender systems: a literature
survey. International Journal on Digital Libraries, 17(4):305–338.
Bradshaw, C. J. and Brook, B. W. (2016). How to rank journals. PloS ONE, 11(3):e0149852.
Broder, A. (2002). A taxonomy of web search. In ACM SIGIR Forum, volume 36, pages 3–10.
ACM.
Cornell University. (Apr 16, 2018). Distinguishing scholarly from non-scholarly periodicals: A
checklist of criteria. Retrieved July 11, 2018, from
http://guides.library.cornell.edu/c.php?g=31867&p=201758.
Elsevier. (n.d.). Measuring a journal’s impact. Retrieved Jul 11, 2018, from
https://www.elsevier.com/authors/journal-authors/measuring-a-journals-impact.
Fuhr, N., Tsakonas, G., Aalberg, T., Agosti, M., Hansen, P., Kapidakis, S., Klas, C.-P., Kovács, L.,
Landoni, M., Micsik, A., et al. (2007). Evaluation of digital libraries. International Journal on
Digital Libraries, 8(1):21–38.
Google Scholar. (n.d.). Retrieved July 11, 2018, from
https://scholar.google.co.in/citations?view_op=top_venues&hl=en&vq=eng_computationallinguistics.
Greene, S., Marchionini, G., Plaisant, C., and Shneiderman, B. (2000). Previews and overviews in
digital libraries: Designing surrogates to support visual information seeking. Journal of the
American Society for Information Science, 51(4):380–393.
Hallo, M., Luján-Mora, S., Maté, A., and Trujillo, J. (2016). Current state of linked data in digital
libraries. Journal of Information Science, 42(2):117–127.
Hassan-Montero, Y. and Herrero-Solana, V. (2006). Improving tag-clouds as visual information
retrieval interfaces. In Proceedings of the International Conference on Multidisciplinary
Information Sciences and Technologies, pages 25–28.
Herrmannova, D. and Knoth, P. (2012). Visual search for supporting content exploration in large
document collections. D-Lib Magazine, 18(7):8.
Khabsa, M., Wu, Z., and Giles, C. L. (2016). Towards better understanding of academic search. In
Proceedings of the IEEE/ACM Joint Conference on Digital Libraries (JCDL), pages 111–114.
IEEE.
Kucher, K. and Kerren, A. (2015). Text visualization techniques: Taxonomy, visual survey, and
community insights. In Proceedings of IEEE Pacific Visualization Symposium (PacificVis),
pages 117–121. IEEE.
Liu, S., Cui, W., Wu, Y., and Liu, M. (2014). A survey on information visualization: recent advances
and challenges. The Visual Computer, 30(12):1373–1393.
Marks, L., Hussell, J. A., McMahon, T. M., and Luce, R. E. (2005). ActiveGraph: A digital library
visualization tool. International Journal on Digital Libraries, 5(1):57–69.
Melero, R. (2015). Altmetrics–a complement to conventional metrics. Biochemia Medica,
25(2):152–160.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy,
D., Norvig, P., Orwant, J., et al. (2011). Quantitative analysis of culture using millions of
digitized books. Science, 331:176–182.
Mingers, J. and Leydesdorff, L. (2015). A review of theory and practice in scientometrics. European
Journal of Operational Research, 246(1):1–19.
Van Noorden, R. (2014a). Global scientific output doubles every nine years. Retrieved from
http://blogs.nature.com/news/2014/05/global-scientific-output-doubles-every-nine-years.html.
Van Noorden, R. (2014b). Online collaboration: Scientists and the social network. Nature News,
512:126–129. Retrieved from
https://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711.
Van Noorden, R. (2014c). Google Scholar pioneer on search engine's future. Retrieved from
https://www.nature.com/news/google-scholar-pioneer-on-search-engine-s-future-1.16269.
Priem, J., Taraborelli, D., Groth, P., and Neylon, C. (2010). Altmetrics: A manifesto. Retrieved from
http://altmetrics.org/manifesto.
Ronzano, F. and Saggion, H. (2015). Dr. Inventor framework: Extracting structured information
from scientific publications. In Proceedings of the International Conference on Discovery
Science, pages 209–220. Springer.
Sandusky, R. J. (2002). Digital library attributes: Framing usability research. In Proceedings of
Workshop on Usability of Digital Libraries at JCDL, volume 2, pages 35–38.
West, J. D., Wesley-Smith, I., and Bergstrom, C. T. (2016). A recommendation system based on
hierarchical clustering of an article-level citation network. IEEE Transactions on Big Data,
2(2):113–123.
Wu, J., Williams, K. M., Chen, H.-H., Khabsa, M., Caragea, C., Tuarob, S., Ororbia, A. G., Jordan,
D., Mitra, P., and Giles, C. L. (2015). CiteSeerX: AI in a digital library search engine. AI
Magazine, 36(3):35–48.
Xia, F., Wang, W., Bekele, T. M., and Liu, H. (2017). Big scholarly data: A survey. IEEE
Transactions on Big Data, 3(1):18–35.
Xiong, C., Power, R., and Callan, J. (2017). Explicit semantic ranking for academic search via
knowledge graph embedding. In Proceedings of the 26th International Conference on World
Wide Web, pages 1271–1279.
Yan, R., Tang, J., Liu, X., Shan, D., and Li, X. (2011). Citation count prediction: Learning to estimate
future citations for literature. In Proceedings of the 20th ACM International Conference on
Information and Knowledge Management, pages 1247–1252. ACM.