
Shuo ZhangBloomberg · AI
Shuo Zhang
Phd
Research Scientist at Bloomberg
About
63
Publications
6,253
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
801
Citations
Citations since 2017
Introduction
Research interests: Information Retrieval, Table Retrieval, Knowlege Graph, Conversational Recommender System
Publications
Publications (63)
Information Synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized across languages. To address this problem, we introduce a new dataset InfoSyncC and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxe...
Explanations in conventional recommender systems have demonstrated benefits in helping the user understand the rationality of the recommendations and improving the system's efficiency, transparency, and trustworthiness. In the conversational environment, multiple contextualized explanations need to be generated, which poses further challenges for e...
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. The evaluation is often limited to single-turn or is very time-intensive. As an alternative, user simulators that mimic user behavior allow us to consider a broad set of user goals to generate human-like conversations for simulated evaluatio...
We present an extensible user simulation toolkit to facilitate automatic evaluation of conversational recommender systems. It builds on an established agenda-based approach and extends it with several novel elements, including user satisfaction prediction, persona and context modeling, and conditional natural language generation. We showcase the to...
Existing approaches to constructing training data for Natural Language Inference (NLI) tasks, such as for semi-structured table reasoning , are either via crowdsourcing or fully automatic methods. However, the former is expensive and time-consuming and thus limits scale, and the latter often produces naive examples that may lack complex reasoning....
Existing approaches to constructing training data for Natural Language Inference (NLI) tasks, such as for semi-structured table reasoning, are either via crowdsourcing or fully automatic methods. However, the former is expensive and time-consuming and thus limits scale, and the latter often produces naive examples that may lack complex reasoning. T...
Recent methods based on pre-trained language models have exhibited superior performance over tabular tasks (e.g., tabular NLI), despite showing inherent problems such as not using the right evidence and inconsistent predictions across inputs while reasoning over the tabu-lar data (Gupta et al., 2021). In this work, we utilize Pattern-Exploiting Tra...
Recent methods based on pre-trained language models have exhibited superior performance over tabular tasks (e.g., tabular NLI), despite showing inherent problems such as not using the right evidence and inconsistent predictions across inputs while reasoning over the tabular data. In this work, we utilize Pattern-Exploiting Training (PET) (i.e., str...
User simulation has been a cost-effective technique for evaluating conversational recommender systems. However, building a human-like simulator is still an open challenge. In this work, we focus on how users reformulate their utterances when a conversational agent fails to understand them. First, we perform a user study, involving five conversation...
When pre-trained contextualized embedding-based models developed for unstructured data are adapted for structured tabular data, they perform admirably. However, recent probing studies show that these models use spurious correlations, and often predict inference labels by focusing on false evidence or ignoring it altogether. To study this issue, we...
The ability to reason about tabular or semi-structured knowledge is a fundamental problem for today's Natural Language Processing (NLP) systems. While significant progress has been achieved in the direction of tabular reasoning, these advances are limited to English due to the absence of multilingual benchmark datasets for semi-structured data. In...
User simulation has been a cost-effective technique for evaluating conversational recommender systems. However, building a human-like simulator is still an open challenge. In this work, we focus on how users reformulate their utterances when a conversational agent fails to understand them. First, we perform a user study, involving five conversation...
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. The evaluation is often limited to single-turn or very time-intensive. As an alternative, user simulators that mimic user behavior allow us to consider a broad set of user goals to generate human-like conversations for simulated evaluation....
A large amount of information is stored in data tables. Users can search for data tables using a keyword-based query. A table is composed primarily of data values that are organized in rows and columns providing implicit structural information. A table is usually accompanied by secondary information such as the caption, page title, etc., that form...
Simulation is used as a low-cost and repeatable means of experimentation. As Information Retrieval (IR) researchers, we are no strangers to the idea of using simulation within our own field---such as the traditional means of IR system evaluation as manifested through the Cranfield paradigm. While simulation has been used in other areas of IR resear...
Automatic detection of click-bait and incongruent news headlines is crucial to maintaining the reliability of the Web and has raised much research attention. However, most existing methods perform poorly when news headlines contain contextually important cardinal values, such as a quantity or an amount. In this work, we focus on this particular cas...
Hybrid data combining both tabular and tex-tual content (e.g., financial reports) are quite pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Text...
Hybrid data combining both tabular and textual content (e.g., financial reports) are quite pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Textu...
Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we prop...
Tables on the Web contain a vast amount of knowledge in a structured form. To tap into this valuable resource, we address the problem of table retrieval: answering an information need with a ranked list of tables. We investigate this problem in two different variants, based on how the information need is expressed: as a keyword query or as an exist...
Tables on the Web contain a vast amount of knowledge in a structured form. To tap into this valuable resource, we address the problem of table retrieval: answering an information need with a ranked list of tables. We investigate this problem in two different variants, based on how the information need is expressed: as a keyword query or as an exist...
We describe the development, characteristics and availability of a test collection for the task of Web table retrieval, which uses a large-scale Web Table Corpora extracted from the Common Crawl. Since a Web table usually has rich context information such as the page title and surrounding paragraphs, we not only provide relevance judgments of query...
Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we prop...
We describe the development, characteristics and availability of a test collection for the task of Web table retrieval, which uses a large-scale Web Table Corpora extracted from the Common Crawl. Since a Web table usually has rich context information such as the page title and surrounding paragraphs, we not only provide relevance judgments of query...
Conversational recommender systems support users in accomplishing recommendation-related goals via multi-turn conversations. To better model dynamically changing user preferences and provide the community with a reusable development framework, we introduce IAI MovieBot, a conversational recommender system for movies. It features a task-specific dia...
Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities. They are a unique and valuable resource that is utilized in a broad range of information access tasks. To aid knowledge editors in the manual process of expanding a category system, this paper presents a...
Conversational information access is an emerging research area. Currently, human evaluation is used for end-to-end system evaluation, which is both very time and resource intensive at scale, and thus becomes a bottleneck of progress. As an alternative, we propose automated evaluation by means of simulating users. Our user simulator aims to generate...
Tabular data provide answers to a significant portion of search queries. However, reciting an entire result table is impractical in conversational search systems. We propose to generate natural language summaries as answers to describe the complex information contained in a table. Through crowdsourcing experiments, we build a new conversation-orien...
Tables are powerful and popular tools for organizing and manipulating data. A vast number of tables can be found on the Web, which represent a valuable knowledge resource. The objective of this survey is to synthesize and present two decades of research on web tables. In particular, we organize existing literature into six main categories of inform...
Tables are a powerful and popular tool for organizing and manipulating data. A vast number of tables can be found on the Web, which represents a valuable knowledge resource. The objective of this survey is to synthesize and present two decades of research on web tables. In particular, we organize existing literature into six main categories of info...
When working with any sort of knowledge base (KB) one has to make sure it is as complete and also as up-to-date as possible. Both tasks are non-trivial as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such they require a significant amount of labor. Tables on the Web, on the other han...
We address the task of auto-completing data cells in relational tables. Such tables describe entities (in rows) with their attributes (in columns). We present the CellAutoComplete framework to tackle several novel aspects of this problem, including: (i) enabling a cell to have multiple, possibly conflicting values, (ii) supplementing the predicted...
We address the task of auto-completing data cells in relational tables. Such tables describe entities (in rows) with their attributes (in columns). We present the CellAutoComplete framework to tackle several novel aspects of this problem, including: (i) enabling a cell to have multiple, possibly conflicting values, (ii) supplementing the predicted...
This tutorial synthesizes and presents research on web tables over the past two decades. We group the tasks into six main categories of information access tasks: (i) table extraction, (ii) table interpretation, (iii) table search, (iv) question answering on tables, (v) knowledge base augmentation, and (vi) table completion. For each category, we id...
Tables are an extremely powerful visual and interactive tool for structuring and manipulating data, making spreadsheet programs one of the most popular computer applications. In this paper we introduce and address the task of recommending related tables: given an input table, identifying and returning a ranked list of relevant tables. One of the ma...
Tables contain valuable knowledge in a structured form. We employ neural language modeling approaches to embed tabular data into vector spaces. Specifically, we consider different table elements, such caption, column headings, and cells, for training word and entity embeddings. These embeddings are then utilized in three particular table-related ta...
Tables contain valuable knowledge in a structured form. We employ neural language modeling approaches to embed tabular data into vector spaces. Specifically, we consider different table elements, such caption, column headings, and cells, for training word and entity embeddings. These embeddings are then utilized in three particular table-related ta...
Many information needs revolve around entities, which would be better answered by summarizing results in a tabular format, rather than presenting them as a ranked list. Unlike previous work, which is limited to retrieving existing tables, we aim to answer queries by automatically compiling a table in response to a query. We introduce and address th...
We introduce SmartTable, an online spreadsheet application that is equipped with intelligent assistance capabilities. With a focus on relational tables, describing entities along with their attributes, we offer assistance in two flavors: (i) for populating the table with additional entities (rows) and (ii) for extending it with additional entity at...
Tables are one of those "universal tools'' that are practical and useful in many application scenarios. Tables can be used to collect and organize information from multiple sources and then turn that information into knowledge (and ultimately to support decision-making) by performing various operations, like sorting, filtering, and joins. Because o...
We introduce and address the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables. This task is not only interesting on its own account, but is also being used as a core component in many other table-based information access scenarios, such as table completion or table mining. The main novel contribution of this...
We address the task of generating query suggestions for task-based search. The current state of the art relies heavily on suggestions provided by a major search engine. In this paper, we solve the task without reliance on search engines. Specifically, we focus on the first step of a two-stage pipeline approach, which is dedicated to the generation...
This paper describes our participation in the Triple Scoring task of WSDM Cup 2017, which aims at ranking triples from a knowledge base for two type-like relations: profession and nationality. We introduce a supervised ranking method along with the features we designed for this task. Our system has been top ranked with respect to average score diff...
We introduce Nordlys, a toolkit for entity-oriented and semantic search. It provides functionality for entity cataloging, entity retrieval, entity linking, and target type identification. Nordlys may be used as a Python library or as a RESTful API, and also comes with a web-based user interface. The toolkit is open source and is available at http:/...
Tables are among the most powerful and practical tools for organizing and working with data. Our motivation is to equip spreadsheet programs with smart assistance capabilities. We concentrate on one particular family of tables, namely, tables with an entity focus. We introduce and focus on two specifc tasks: populating rows with additional instance...
We address the task of ranking objects (such as people, blogs, or verticals) that, unlike documents, do not have direct term-based representations. To be able to match them against keyword queries, evidence needs to be amassed from documents that are associated with the given object. We present two design patterns, i.e., general reusable retrieval...
This paper describes our participation in the Triple Scoring task ofWSDM Cup 2017, which aims at ranking triples from a knowledge base for two type-like relations: profession and nationality. We introduce a supervised ranking method along with the features we designed for this task. Our system has been top ranked with respect to average score diffe...
To further enhance the system performance of the heterogeneous network selection algorithm, a Markov process based utility optimization method for multi-access selection in heterogeneous wireless networks is presented in this paper. The method adopts the birth and death process to analyze the system performance of heterogeneous networks, fully cons...
In the environment of heterogeneous wireless networks, it is vital to select a currently optimal network for applications and subscribers. The use of multiple attribute decision making (MADM) for heterogeneous network selection can provide subscribers with satisfactory service quality. Converting heterogeneous network selection into a MADM problem,...