Article (PDF available)

Searching the Enterprise

Authors: Udo Kruschwitz and Charlie Hull

Abstract

Search has become ubiquitous, but that does not mean that search has been solved. Enterprise search, which is, broadly speaking, the use of information retrieval technology to find information within organisations, is a good example to illustrate this. It is an area of huge importance for businesses, yet one that has attracted relatively little academic interest. This monograph explores the main issues involved in enterprise search from both a research and a practical point of view. We first plot the landscape of enterprise search and its links to related areas. This allows us to identify key features before we survey the field in more detail. Throughout the monograph we discuss the topic as part of the wider information retrieval research field, and we use Web search as a common reference point, as this is likely the search application area that the average reader is most familiar with.
... Some related work in other use cases has already shown promising results in this direction [17], [18]. This might open up new possibilities for using these models in enterprise search settings where confidential data must remain on-premise [19]. ...
Preprint
Full-text available
We assessed the performance of commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which is focused on answer generation, both models demonstrated competitive abilities with leading systems. Remarkably, they achieved this with simple zero-shot learning, grounded with relevant snippets. Even without relevant snippets, their performance was decent, though not on par with the best systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was able to compete with GPT-4 in the grounded Q&A setting on factoid and list answers. In Task 11b Phase A, focusing on retrieval, query expansion through zero-shot learning improved performance, but the models fell short compared to other systems. The code needed to rerun these experiments is available through GitHub.
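To make the grounded zero-shot setup above concrete, here is a minimal sketch of snippet-grounded question answering; it assumes the OpenAI Python client (v1 interface), and the prompt wording and the grounded_answer helper are illustrative assumptions, not the authors' actual code.

# Minimal sketch of snippet-grounded zero-shot question answering.
# Assumes the OpenAI Python client (v1 interface); prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grounded_answer(question: str, snippets: list[str]) -> str:
    # Concatenate the retrieved snippets into a grounding context.
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the biomedical question using only the snippets below.\n"
        f"Snippets:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # or "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output helps evaluation
    )
    return response.choices[0].message.content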
... Unlike search for leisure or personal interest there is a vast area of search contexts which are found in a work environment. Professional search falls into that scope, i.e. search over domain-specific document collections and often with search tasks that are recall-oriented rather than precision-focused (Kruschwitz and Hull, 2017; Verberne et al., 2019). Beyond applications where such search effort can directly be measured in financial terms (e.g. in patent search, e-discovery or the compilation of systematic reviews) there are many other fields where these costs are more implicit, e.g. in the area of genocide studies that rely on the analysis of vast quantities of different resources (Bachman, 2020; Hinton, 2012). ...
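For readers less familiar with the recall/precision distinction drawn above: a recall-oriented task tries to retrieve as many of the relevant documents as possible, while a precision-focused task aims to place a few relevant documents at the top of the ranking. A minimal sketch of the two set-based measures (names are illustrative):

# Precision: fraction of retrieved documents that are relevant.
# Recall: fraction of all relevant documents that were retrieved.
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall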
Article
Full-text available
Recent progress in natural language processing has been impressive in many different areas, with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into the tools and resources applied to a variety of domain-specific applications. The bottleneck, however, remains the lack of annotated gold-standard collections as soon as one's research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research (which also includes the work of experts who have a professional interest in accessing, exploring and searching large-scale document collections on the topic, such as lawyers). We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts, which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of identifying violence-related witness statements at paragraph level, (3) to explore first steps towards transfer learning within the domain. We consider our contribution to address, in particular, this year's hot topic on Language Technology for All.
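As a rough illustration of the benchmark setup described above, the paragraph classification task can be approached with an off-the-shelf transformer classifier; the model name below is a placeholder (a checkpoint fine-tuned on the target corpus would be used in practice), not the paper's published configuration.

# Generic sketch of transformer-based paragraph classification.
# "bert-base-uncased" is a placeholder; its classification head is untrained,
# so a checkpoint fine-tuned on the target corpus would be used in practice.
from transformers import pipeline

classifier = pipeline("text-classification", model="bert-base-uncased")

paragraph = "The witness then described what happened in the village that night."
print(classifier(paragraph))  # e.g. [{'label': ..., 'score': ...}]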
... Talent search is an example of professional search [26] and shares many similarities with other forms of professional search (patent search, systematic reviews, enterprise search, and e-discovery) that are very different from Web search or search by job seekers, such as increased session duration, the need for contextual background knowledge, and the monetary value of the search [10,12]. ...
... e.g. searching in smaller collections with little redundancy that may contain only a single matching document for a query. A prime example of such an application is enterprise-wide search (enterprise search) (Kruschwitz & Hull 2017). Here, IIR offers itself as a natural solution to a practical problem. ...
... Systematically captured and stored workflow data can support mechanisms to ensure the quality and validity of data that can be examined or reviewed in the context of its source and transformations over time (Khan et al. 2016; Kruschwitz and Hull 2017). Workflow data allows users to track provenance information such as its evolution. ...
Article
Full-text available
Open-source intelligence is a rapidly expanding area of the security and intelligence industry, involving the collection of internet-located open data from various sources and turning that data into actionable intelligence, which is reused where possible and relevant. While creating or processing the raw input data, capturing and managing the corresponding provenance information (e.g., workflow, state, raw evidence, reports, and summaries) in a way that simplifies its retrieval and reuse is essential. In comparison, scientific workflows and the tools that support them are routinely used in the majority of academic research disciplines, managing diverse sets of data resources and their provenance. Based on the techniques established within the academic community, we have developed a system for managing this open-source intelligence data and the associated provenance information. This enhances the efficiency of retrieving stored data products and reusing them to support intelligence-led security decision-making. The open-source intelligence company partnered with in this project has an operational envelope that includes collecting and analyzing personal subject information. Therefore, it must understand the scope of its data holdings appropriately, especially in light of obligations under the General Data Protection Regulation. The system developed allows for tracking requests for intelligence products, ownership of the collection, analysis and generation of intelligence briefs, and the delivery of those final products to the customer for future billing. This adds further layers of efficiency to operations and hence reduces the costs of producing intelligence products.
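A minimal sketch of the kind of provenance record such a system might keep per intelligence product; the field names and states are assumptions for illustration, not the schema of the system described above.

# Illustrative provenance record for an open-source intelligence product.
# Field names and states are assumptions, not the paper's actual schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ProvenanceRecord:
    request_id: str        # the intelligence request being tracked
    owner: str             # analyst responsible for collection and analysis
    state: str             # e.g. "requested", "collected", "analysed", "delivered"
    raw_evidence: list[str] = field(default_factory=list)      # source URLs, artefacts
    derived_products: list[str] = field(default_factory=list)  # reports, summaries
    history: list[tuple[datetime, str]] = field(default_factory=list)

    def transition(self, new_state: str) -> None:
        # Record each workflow transition so a data product can later be
        # reviewed in the context of its source and transformations over time.
        self.history.append((datetime.utcnow(), f"{self.state} -> {new_state}"))
        self.state = new_state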
... Generic search tools, such as desktop search, database search, and web search, can help engineers use specific keywords to find experts and relevant documents, such as material information, technical reports, and patents. Enterprise search supports finding documents within the organization that contain relevant information, as well as people with the right expertise [30]. PLM/PDM systems, e.g., Teamcenter, enable engineers to find different kinds of product information, for instance, CAD drawings, BOM data, parts information, and manufacturing instructions, across more domains and departments. ...
Article
Full-text available
Product design is crucial for product success. Many approaches can improve product design quality, such as concurrent engineering and design for X. This study focuses on applying product usage information (PUI) during product development. As emerging technologies become widespread, an enormous amount of product-related information is available in the middle of a product's life, such as customer reviews, condition monitoring, and maintenance data. In recent years, the literature has described the application of data analytics technologies such as machine learning to promote the integration of PUI during product development. However, as of today, PUI is not efficiently exploited in product development. One of the critical issues in achieving this is identifying and integrating task-relevant PUI that fits the purposes of different product development tasks, yet preparing such task-relevant PUI is often neglected. This study addresses this research gap and rectifies the related shortcomings and challenges. By considering the context in which PUI is utilized, this paper presents a systematic procedure to help identify and specify developers' information needs and to propose relevant PUI fitting the actual information needs of their current product development task. We capitalize on an application scenario to demonstrate the applicability of the proposed approach.
Book
Information seeking is a fundamental human activity. In the modern world, it is frequently conducted through interactions with search systems. The retrieval and comprehension of information returned by these systems is a key part of decision making and action in a broad range of settings. Advances in data availability coupled with new interaction paradigms, and mobile and cloud computing capabilities, have created a broad range of new opportunities for information access and use. In this comprehensive book for professionals, researchers, and students involved in search system design and evaluation, search expert Ryen White discusses how search systems can capitalize on new capabilities and how next-generation systems must support higher-order search activities such as task completion, learning, and decision making. He outlines the implications of these changes for the evolution of search evaluation, as well as challenges that extend beyond search systems in areas such as privacy and societal benefit. The book discusses many new technologies and their role in the search process, covers important issues involving data availability and privacy in depth (educating searchers in the benefits and potential costs involved in using big and small data), and combines research from information retrieval, information science, and human-computer interaction.
Conference Paper
Email is still among the most popular online activities. People spend a significant amount of time sending, reading and responding to email in order to communicate with others, manage tasks and archive personal information. Most previous research on email is based on either relatively small data samples from user surveys and interviews, or on consumer email accounts such as those from Yahoo! Mail or Gmail. Much less has been published on how people interact with enterprise email even though it contains less automatically generated commercial email and involves more organizational behavior than is evident in personal accounts. In this paper, we extend previous work on predicting email reply behavior by looking at enterprise settings and considering more than dyadic communications. We characterize the influence of various factors such as email content and metadata, historical interaction features and temporal features on email reply behavior. We also develop models to predict whether a recipient will reply to an email and how long it will take to do so. Experiments with the publicly-available Avocado email collection show that our methods outperform all baselines with large gains. We also analyze the importance of different features on reply behavior predictions. Our findings provide new insights about how people interact with enterprise email and have implications for the design of the next generation of email clients.
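A minimal sketch of the kind of feature-based reply prediction outlined above; the features and the choice of gradient-boosted trees are assumptions for illustration, and the paper's own models and feature set differ in detail.

# Sketch: predict whether a recipient replies to an email from simple features.
# Feature choices are illustrative, not the paper's actual feature set.
from sklearn.ensemble import GradientBoostingClassifier

# Each row: [subject_length, body_length, n_recipients,
#            past_reply_rate_with_sender, hour_sent]
X_train = [
    [42, 350, 1, 0.80, 9],
    [10, 2000, 25, 0.05, 18],
    [55, 120, 2, 0.60, 11],
    [8, 5000, 40, 0.01, 23],
]
y_train = [1, 0, 1, 0]  # 1 = replied, 0 = no reply

model = GradientBoostingClassifier().fit(X_train, y_train)
print(model.predict_proba([[30, 400, 1, 0.70, 10]])[0][1])  # estimated P(reply)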
Article
Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media (such as blog articles, forum posts, product reviews, and tweets). This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (and are thus relatively easy for computers to handle), text has less explicit structure, requiring computer processing to understand the content encoded in text. The current technology of natural language processing has not yet reached the point of enabling a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to the management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic. This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed, and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding, from a large collection of text data, the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they assist users in analyzing patterns in text data to extract and discover actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users. This book covers the major concepts, techniques, and ideas in information retrieval and text data mining from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of information retrieval and text mining to real-world text data, and how to experiment with and improve some of the algorithms for interesting application tasks. This book can be used as a textbook for computer science undergraduates and graduates, library and information scientists, or as a reference book for practitioners working on relevant problems in managing and analyzing text data.
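As a small illustration of the retrieval side described above, here is a TF-IDF ranking sketch; it uses scikit-learn for brevity rather than the book's companion MeTA toolkit, and the toy documents are made up.

# Minimal TF-IDF retrieval sketch: rank documents by similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "enterprise search finds documents inside an organisation",
    "text mining extracts actionable knowledge from text data",
    "recommender systems suggest relevant items to users",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

query_vector = vectorizer.transform(["search documents in the enterprise"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]  # indices of best-matching documents first
print([docs[i] for i in ranking])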
Conference Paper
Online visitors often do not find the content they were expecting on specific pages of a large enterprise website, and subsequently search for it in the site's search box. In this paper, we propose methods to leverage website search logs to identify missing or expected content on webpages of the enterprise website, while showing how several scenarios make this a non-trivial problem. We further discuss how our methods can be easily extended to address concerns arising from the identified missing content.
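The underlying idea can be sketched as aggregating site-search queries by the page they were issued from, so that frequent on-page searches surface candidate missing content; the log schema below is an assumption for illustration, not the paper's full method.

# Sketch: surface frequent queries issued from each page as candidate
# missing or expected content. The log schema is an illustrative assumption.
from collections import Counter, defaultdict

# (page_url, query) pairs extracted from the site-search log
log = [
    ("/products/router-x", "firmware download"),
    ("/products/router-x", "firmware download"),
    ("/products/router-x", "manual pdf"),
    ("/support", "warranty status"),
]

queries_per_page = defaultdict(Counter)
for page, query in log:
    queries_per_page[page][query] += 1

for page, counts in queries_per_page.items():
    # Frequent searches launched from a page hint at content that
    # visitors expected to find there but did not.
    print(page, counts.most_common(2))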
Chapter
This chapter introduces patent search in a way that should be accessible and useful both to researchers in information retrieval and other areas of computer science and to professionals seeking to broaden their knowledge of patent search. It gives an overview of the process of patent search, including its different forms. It goes on to describe the differences among the various domains of patent search (engineering, chemicals, gene sequences, and so on) and the tools currently used by searchers in each domain. It concludes with an overview of open issues.
Article
Aggregated search is the task of providing integrated access to multiple heterogeneous search services in a unified interface: a single query box and a common presentation of results. In the web search domain, aggregated search systems are responsible for integrating results from specialized search services, or verticals, alongside the core web results. For example, search portals such as Google, Bing, and Yahoo! provide access to vertical search engines that focus on different types of media (images and video), different types of search tasks (search for local businesses and online products), and even applications that can help users complete certain tasks (language translation and math calculations). Aggregated search systems perform two main tasks. The first task (vertical selection) is to predict which verticals (if any) to present in response to a user's query. The second task (vertical presentation) is to predict where and how to present each selected vertical alongside the core web results. The goal of this work is to provide a comprehensive summary of previous research in aggregated search. We first describe why aggregated search requires unique solutions. Then, we discuss different sources of evidence that are likely to be available to an aggregated search system, as well as different techniques for integrating evidence in order to make vertical selection and presentation decisions. Next, we survey different evaluation methodologies for aggregated search and discuss prior user studies that have aimed to better understand how users behave with aggregated search interfaces. Finally, we review different advanced topics in aggregated search.
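A minimal sketch of the vertical-selection step described above, with one relevance scorer per vertical; the hand-written scorers and the 0.5 threshold are illustrative assumptions (real systems learn these predictors from query logs and other evidence).

# Sketch: vertical selection as one relevance scorer per vertical.
# The scorers and threshold are illustrative assumptions.
def score_images(query: str) -> float:
    return 0.9 if "photo" in query or "picture" in query else 0.1

def score_local(query: str) -> float:
    return 0.8 if "near me" in query else 0.1

VERTICALS = {"images": score_images, "local": score_local}

def select_verticals(query: str, threshold: float = 0.5) -> list[str]:
    # Present a vertical alongside the core web results only if its
    # predicted relevance to the query exceeds the threshold.
    return [name for name, scorer in VERTICALS.items()
            if scorer(query) >= threshold]

print(select_verticals("pizza near me"))  # ['local']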