Michael Collins’s research while affiliated with Google Inc. and other places


Publications (140)


Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
  • Preprint

May 2024 · 5 Reads · Bernd Bohnet · Kevin Swersky · Rosanne Liu · [...] · Noah Fiedel

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.
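The pairwise ranking step described in the abstract can be made concrete with a small sketch. The Python snippet below is not the paper's code; the bradley_terry function, the MM-style update, and the toy win matrix are illustrative assumptions showing how Bradley-Terry strengths can be fit from side-by-side win counts to rank QA systems.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, n_iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of side-by-side comparisons in which system i's answer
    was preferred over system j's. Returns a strength vector (summing to 1);
    a higher value means the system's answers are preferred more often.
    """
    m = wins.shape[0]
    totals = wins + wins.T            # total comparisons between each pair
    p = np.ones(m) / m                # uniform initial strengths
    for _ in range(n_iters):
        win_counts = wins.sum(axis=1)
        # Minorization-maximization update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
        denom = np.array([
            sum(totals[i, j] / (p[i] + p[j]) for j in range(m) if j != i)
            for i in range(m)
        ])
        new_p = np.where(denom > 0, win_counts / np.maximum(denom, 1e-12), p)
        new_p = new_p / new_p.sum()   # strengths are only identified up to a constant factor
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

# Toy example: wins[i, j] = how often system i beat system j in side-by-side evaluation.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(bradley_terry(wins))            # system 0 should get the highest strength
```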



  • Examples of utterances in context, and their explicatures.
  • An excerpt from the annotation guidelines, showing the two steps in the task. Step A corresponds to ascertaining the explicature of the system response; Step B corresponds to the "according to" test.
  • Some examples of system responses shown to annotators, paired with paraphrases of "information provided by the system response". This corresponds directly to the explicature of the system response.
  • Instructions given to the raters for Step B, corresponding to the "according to" test.
  • Examples from the table-to-text annotations.

Measuring Attribution in Natural Language Generation Models
  • Article
  • Full-text available

December 2023 · 43 Reads · 49 Citations · Computational Linguistics

Large neural models have brought a new challenge to natural language generation (NLG): It has become imperative to ensure the safety and reliability of the output of models that generate freely. To this end, we present an evaluation framework, Attributable to Identified Sources (AIS), stipulating that NLG output pertaining to the external world is to be verified against an independent, provided source. We define AIS and a two-stage annotation pipeline for allowing annotators to evaluate model output according to annotation guidelines. We successfully validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset). We provide full annotation guidelines in the appendices and publicly release the annotated data at https://github.com/google-research-datasets/AIS.
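As a rough illustration of how a two-stage pipeline of this kind can be aggregated into a single score, the sketch below assumes a per-response judgment with a Step A (interpretability) flag and a Step B (attribution) flag. The AISJudgment dataclass and the ais_score helper are hypothetical names, not the released annotation tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AISJudgment:
    """One annotator judgment in a hypothetical two-stage AIS-style pipeline."""
    interpretable: bool   # Step A: can the response's explicature be understood in context?
    attributable: bool    # Step B: is all of that information "according to" the provided source?

def ais_score(judgments: List[AISJudgment]) -> float:
    """Fraction of responses judged Attributable to Identified Sources.

    A response only reaches Step B if it passed Step A; uninterpretable
    responses count as not attributable.
    """
    if not judgments:
        return 0.0
    positives = sum(1 for j in judgments if j.interpretable and j.attributable)
    return positives / len(judgments)

# Toy usage with made-up annotations.
ratings = [AISJudgment(True, True), AISJudgment(True, False), AISJudgment(False, False)]
print(f"AIS = {ais_score(ratings):.2f}")   # 0.33
```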


  • Figure A.5: Examples from the table-to-text annotations.
  • Summary of tasks used in the human annotation study.
  • Annotator agreement, measured as inter-annotator agreement (left half of the table) or as agreement with expert consensus (right half of the table, only measured on the QReCC and CNN/DM tasks). Metrics include F1: an F1 measure comparing individual ratings to the consensus rating; PA: pairwise agreement, the percentage of rating pairs that agree; α: Krippendorff's alpha, computed over pairs of individual ratings.
  • Ablation survey results: each group of annotators was asked to complete a post-task survey.
Measuring Attribution in Natural Language Generation Models

July 2023 · 55 Reads · 21 Citations · Computational Linguistics

Large neural models have brought a new challenge to natural language generation (NLG): it has become imperative to ensure the safety and reliability of the output of models that generate freely. To this end, we present an evaluation framework, Attributable to Identified Sources (AIS), stipulating that NLG output pertaining to the external world is to be verified against an independent, provided source. We define AIS and a two-stage annotation pipeline for allowing annotators to evaluate model output according to annotation guidelines. We successfully validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset). We provide full annotation guidelines in the appendices and publicly release the annotated data at https://github.com/google-research-datasets/AIS.


Figure 1: Example of one of our transition-based coreference systems, the Link-Append system. The system processes a single sentence at a time, using an input encoding of the prior sentences annotated with coreference clusters, followed by the new sentence. As output, the system makes predictions that link mentions in the new sentence either to previously created coreference clusters (e.g., "You → [1") or, when a new cluster is created, to previous mentions (e.g., "the apartment → your house"). The system predicts "SHIFT" when processing of the sentence is complete. Note that in the figure we use the word indices 2 and 17 to distinguish the two occurrences of "I" in the text.
Coreference Resolution through a seq2seq Transition-Based System

March 2023 · 47 Reads · 29 Citations · Transactions of the Association for Computational Linguistics

Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 datasets, with an 83.3 F1-score for English (a 2.3-point higher F1-score than previous work [Dobrovolskii, 2021]) using only CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 over previous work), and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010 datasets for experiments in the zero-shot setting, the few-shot setting, and the supervised setting using all available training data. We obtain substantially higher zero-shot F1-scores than previous approaches for 3 out of 4 languages and significantly exceed previous supervised state-of-the-art results for all five tested languages. We provide the code and models as open source.
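To make the transition-based formulation more tangible, here is a toy interpreter for Link/Append-style actions of the kind illustrated in Figure 1. The apply_transitions function and its action encoding are illustrative assumptions, not the paper's actual seq2seq output format.

```python
from typing import Dict, List, Tuple

def apply_transitions(clusters: Dict[int, List[str]],
                      actions: List[Tuple[str, str]]) -> Dict[int, List[str]]:
    """Toy interpreter for Link/Append-style coreference transitions.

    clusters maps a cluster id to its mentions. Each action is either
    ("append", "mention -> [k")   : add the mention to existing cluster k, or
    ("link",   "mention -> other"): start a new cluster containing both mentions.
    ("shift", "") ends processing of the current sentence.
    """
    for kind, arg in actions:
        if kind == "shift":
            break
        mention, _, target = arg.partition(" -> ")
        if kind == "append":                       # link to an existing cluster, e.g. "You -> [1"
            cluster_id = int(target.strip("[ "))
            clusters.setdefault(cluster_id, []).append(mention)
        elif kind == "link":                       # create a new cluster from two mentions
            new_id = max(clusters, default=0) + 1
            clusters[new_id] = [target, mention]
    return clusters

# Toy example mirroring the figure: "You" joins existing cluster 1, while
# "the apartment" starts a new cluster together with "your house".
state = {1: ["I_2"]}
state = apply_transitions(state, [("append", "You -> [1"),
                                  ("link", "the apartment -> your house"),
                                  ("shift", "")])
print(state)
```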


Learning to Reject with a Fixed Predictor: Application to Decontextualization

January 2023 · 4 Reads

We study the problem of classification with a reject option for a fixed predictor, applicable in natural language processing. We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong H-consistency guarantee. For evaluation, we choose the decontextualization task, and provide a manually-labelled dataset of 2,000 examples. Our algorithm significantly outperforms the baselines considered, with a ∼25% improvement in coverage when halving the error rate, which is only ∼3% away from the theoretical limit.
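The paper learns a rejection function by minimizing a surrogate loss; the sketch below shows only the simpler confidence-thresholding baseline, to make the coverage-versus-error trade-off mentioned above concrete. The coverage_error_curve helper and the toy data are illustrative assumptions.

```python
import numpy as np

def coverage_error_curve(confidence: np.ndarray, correct: np.ndarray, thresholds: np.ndarray):
    """Confidence-threshold rejection for a fixed predictor.

    For each threshold t we accept examples with confidence >= t and report
    (coverage, error on accepted examples). Raising t trades coverage for accuracy.
    """
    curve = []
    for t in thresholds:
        accepted = confidence >= t
        coverage = accepted.mean()
        error = 1.0 - correct[accepted].mean() if accepted.any() else 0.0
        curve.append((t, coverage, error))
    return curve

# Toy data: confidences and correctness of 8 predictions from a fixed predictor.
conf = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3])
ok = np.array([1, 1, 1, 0, 1, 0, 0, 1], dtype=float)
for t, cov, err in coverage_error_curve(conf, ok, np.array([0.0, 0.5, 0.75])):
    print(f"threshold={t:.2f}  coverage={cov:.2f}  error={err:.2f}")
```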


Figure 1: Multi-Source UDPRE Transfer Learning Curves. Baseline approaches are dotted, while ESR variants are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled datasets per language. The plots indicate a significant advantage of ESR over the baselines in low-data regions.
Figure 2: ''From Scratch'' MBERT Transfer Learning Curves. Baseline approaches are dotted, while ESR variants are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled datasets per language. The plots indicate a significant advantage of ESR over the baselines in all regions.
Improving Low-Resource Cross-lingual Parsing with Expected Statistic Regularization

January 2023 · 29 Reads · 4 Citations · Transactions of the Association for Computational Linguistics

We present Expected Statistic Regularization (ESR), a novel regularization technique that utilizes low-order multi-task structural statistics to shape model distributions for semi-supervised learning on low-resource datasets. We study ESR in the context of cross-lingual transfer for syntactic analysis (POS tagging and labeled dependency parsing) and present several classes of low-order statistic functions that bear on model behavior. Experimentally, we evaluate the proposed statistics with ESR for unsupervised transfer on 5 diverse target languages and show that all statistics, when estimated accurately, yield improvements to both POS and LAS, with the best statistic improving POS by +7.0 and LAS by +8.5 on average. We also present semi-supervised transfer and learning-curve experiments that show ESR provides significant gains over strong cross-lingual-transfer-plus-fine-tuning baselines for modest amounts of labeled data. These results indicate that ESR is a promising and complementary approach to model-transfer approaches for cross-lingual parsing.
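A deliberately tiny sketch of the idea behind ESR follows: penalize the gap between a statistic's expected value under the model and a target value estimated elsewhere. The expected_statistic_penalty function and the noun-proportion statistic are illustrative assumptions; the paper's statistics also cover dependency-label and distance behavior, and the penalty is optimized jointly with the task loss.

```python
import numpy as np

def expected_statistic_penalty(tag_probs: np.ndarray,
                               stat_weights: np.ndarray,
                               target: float) -> float:
    """Toy expected-statistic regularizer.

    tag_probs    : [num_tokens, num_tags] marginal tag distributions predicted by the model.
    stat_weights : [num_tags] weights defining a low-order statistic; an indicator vector
                   for the NOUN tag turns the statistic into "expected proportion of nouns".
    target       : value of that statistic estimated from related languages or a small sample.

    Returns a squared penalty on the gap between the model's expected statistic and the
    target, which can be added to the unsupervised training objective.
    """
    expected = (tag_probs @ stat_weights).mean()
    return float((expected - target) ** 2)

# Toy example: encourage roughly 30% of tokens to be tagged NOUN (tag index 0).
probs = np.array([[0.6, 0.4],
                  [0.2, 0.8],
                  [0.5, 0.5]])
noun_indicator = np.array([1.0, 0.0])
print(expected_statistic_penalty(probs, noun_indicator, target=0.30))
```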



Coreference Resolution through a seq2seq Transition-Based System

November 2022 · 60 Reads

Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 datasets, with an 83.3 F1-score for English (a 2.3-point higher F1-score than previous work (Dobrovolskii, 2021)) using only CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 over previous work), and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010 datasets for experiments in the zero-shot setting, the few-shot setting, and the supervised setting using all available training data. We get substantially higher zero-shot F1-scores than previous approaches for 3 out of 4 languages and significantly exceed previous supervised state-of-the-art results for all five tested languages.


Towards Computationally Verifiable Semantic Grounding for Language Models

November 2022 · 39 Reads

The paper presents an approach to semantic grounding of language models (LMs) that conceptualizes the LM as a conditional model generating text given a desired semantic message formalized as a set of entity-relationship triples. It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message. Compared to a baseline that generates text using greedy search, we demonstrate two techniques that improve the fluency and semantic accuracy of the generated text: The first technique samples multiple candidate text sequences from which the semantic parser chooses. The second trains the language model while keeping the semantic parser frozen to improve the semantic accuracy of the auto-encoder. We carry out experiments on the English WebNLG 3.0 data set, using BLEU to measure the fluency of generated text and standard parsing metrics to measure semantic accuracy. We show that our proposed approaches significantly improve on the greedy search baseline. Human evaluation corroborates the results of the automatic evaluation experiments.
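The sample-and-rerank technique described above can be sketched as follows. The sample_text and parse_text callables stand in for the language model and the frozen semantic parser, and triple_f1 is an assumed matching criterion; none of these names come from the paper.

```python
from typing import Callable, Set, Tuple

Triple = Tuple[str, str, str]

def triple_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """F1 between the triples recovered by the parser and the input message."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def rerank_candidates(message: Set[Triple],
                      sample_text: Callable[[Set[Triple]], str],
                      parse_text: Callable[[str], Set[Triple]],
                      num_samples: int = 8) -> str:
    """Sample several verbalizations and keep the one whose parse best matches the message."""
    candidates = [sample_text(message) for _ in range(num_samples)]
    return max(candidates, key=lambda c: triple_f1(parse_text(c), message))

# Toy usage with stand-in functions (a real system would call the LM and the parser here).
from itertools import cycle

message = {("Alan_Turing", "birthPlace", "London")}
candidate_pool = cycle(["Alan Turing lived in Cambridge.",
                        "Alan Turing was born in London."])
toy_parses = {"Alan Turing lived in Cambridge.": set(),
              "Alan Turing was born in London.": {("Alan_Turing", "birthPlace", "London")}}
best = rerank_candidates(message,
                         sample_text=lambda msg: next(candidate_pool),
                         parse_text=lambda text: toy_parses[text])
print(best)   # the candidate whose parse matches the input triples is selected
```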


Citations (73)


... We now turn to (neural) MT, where decoding algorithms for sequence generation have been investigated in detail. Here, beam search is the standard method for syntax- and phrase-based models (Rush et al., 2013), as well as for neural encoder-decoders (Freitag and Al-Onaizan, 2017). However, an important difference between the two is that candidates in phrase-based MT are completed in the same number of steps, whereas neural models generate hypotheses of different lengths and are biased toward shorter output (Huang et al., 2017). ...

Reference:

Decoding Strategies for Neural Referring Expression Generation
Optimal Beam Search for Machine Translation
  • Citing Conference Paper
  • January 2013

... Process-based reward models (PRMs) [Lightman et al., 2023, Uesato et al., 2022] are an alternative approach that directly predicts the correctness of intermediate CoT reasoning steps. Likewise, various other approaches rely on annotated CoT datasets for benchmarking [Jacovi et al., 2024, Amini et al., 2019, Liu et al., 2020, Xi et al., 2024, Nguyen et al., 2024, Xie et al., 2024, McLeish et al., 2024]. While these benchmarks and methodologies can be valuable for improving LLM reasoning, collecting annotated data can be very costly and is not readily scalable to other tasks. ...

A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
  • Citing Conference Paper
  • January 2024

... There are several ways to optimize input queries (Gao et al., 2023) in RAG systems to enhance both retrieval and generation. Key methods include query expansion and rewriting (Jagerman et al., 2023; Amplayo et al., 2023), which align with prompting techniques like least-to-most prompting (Zhou et al., 2023), decomposed prompting, and step-back prompting. These techniques are particularly valuable for knowledge-intensive QA tasks requiring complex multi-hop reasoning (Tang and Yang, 2024; Rosset et al., 2024). ...

Query Refinement Prompts for Closed-Book Long-Form QA
  • Citing Conference Paper
  • January 2023

... Factuality is one of the most challenging aspects of Large Language Models (LLMs), referring to a model's ability to generate factually accurate responses in information-seeking scenarios. Commonly, this area of research can be divided into two distinct scenarios: (1) factuality with respect to a given context, such as a user request and grounding documents, such that the model response is fully grounded in the input (by this, we imply that a model response has the highest degree of faithfulness to the given context as defined by Rashkin et al., 2023), and (2) factuality with respect to external sources and general world knowledge (Tang et al., 2024; cf. Pan et al., 2023; Rashkin et al., 2023; Zhao et al., 2024b). ...

Measuring Attribution in Natural Language Generation Models

Computational Linguistics

... Furthermore, RAG can generate a list of citations attached to the generated answers, linking them to the retrieved documents so users can verify the accuracy of the output. This process is known as source attribution (Rashkin et al., 2023; Bohnet et al., 2023; Khalifa et al., 2024). ...

Measuring Attribution in Natural Language Generation Models

Computational Linguistics

... The current state-of-the-art results on OntoNotes (Pradhan et al., 2013), a frequently used English coreference resolution dataset, are achieved by autoregressive models with billions of parameters: Liu et al. (2022) propose a specialized autoregressive system, while Bohnet et al. (2023) employ a text-to-text paradigm. However, both these architectures must call the trained model repeatedly to process a single sentence. ...

Coreference Resolution through a seq2seq Transition-Based System

Transactions of the Association for Computational Linguistics

... 0.50, 0.75, 1.0}. Regarding automatic metrics, we report ROUGE-L F1 (Lin & Hovy, 2003) and BERTScore F1 (Zhang et al., 2020) for lexical and semantic relevance, respectively; self-BLEU (Zhu et al., 2018) for lexical diversity; and EDNA (Narayan et al., 2022), a metric quantifying diversity and faithfulness by combining document-summary entailment (Laban et al., 2022) and self-entailment. Alignment, Diversity, and Faithfulness. ...

A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
  • Citing Conference Paper
  • January 2022

... Attention-based methods (Su et al., 2021) capture the interaction between an aspect and its context. Recently, some syntax-aware methods (Tian et al., 2021; Li et al., 2021b; Effland and Collins, 2023) utilized Graph Neural Networks (GNNs) based on syntactic dependency trees to exploit syntactic structure information. However, these methods heavily rely on labeled data and may fail to solve unseen aspect categories. ...

Improving Low-Resource Cross-lingual Parsing with Expected Statistic Regularization

Transactions of the Association for Computational Linguistics

... In open-domain QA, the goal is to find the answer to a (typically short) question over a large corpus such as Wikipedia (Voorhees and Tice, 2000). Passage retrieval has been an essential component of many state-of-the-art open-domain QA systems (Karpukhin et al., 2020; Lewis et al., 2020b; Piktus et al., 2021; Min et al., 2021; Zhu et al., 2021). An effective retrieval component can reduce the search space for answer extraction and identify the supporting context for users to verify the answer. ...

NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned
  • Citing Article
  • January 2021

... Explainability for neural network-based models comes in two formats: extractive rationales (including the pre-LLM era) and free-text rationales. Extractive rationales (Li et al., 2016; Sundararajan et al., 2017; Lundberg and Lee, 2017; Jin et al., 2019) involve analyzing the influence of input tokens on the predicted output via various methods such as gradient-based analysis of input tokens (Sundararajan et al., 2017; Lundberg and Lee, 2017), input perturbation (Poerner et al., 2018; Kádár et al., 2017), attention heatmap analysis (Pruthi et al., 2020; Stacey et al., 2022; Wiegreffe and Pinter, 2019), and models trained for this purpose (Lei et al., 2016; Chan et al., 2022; Jain et al., 2020; Situ et al., 2021; Liu et al., 2023). However, extractive rationales have limited applicability as discussed previously; hence we focus on free-text rationales. ...

Evaluating Explanations: How Much Do Explanations from the Teacher Aid Students?

Transactions of the Association for Computational Linguistics