Jimmy Lin

TAIWAN POWER RESEARCH INSTITUTE · ICT RESEARCH LABORATORY

About

329
Publications
36,208
Reads
12,516
Citations


Publications (329)
Preprint
Pre-trained language models have achieved state-of-the-art results in various natural language processing tasks. Most of them are based on the Transformer architecture, which distinguishes tokens with the token position index of the input sequence. However, sentence index and paragraph index are also important to indicate the token position in a do...
Preprint
The semantics of a text is manifested not only by what is read, but also by what is not read. In this article, we will study how implicit "not read" information, such as end-of-paragraph (EOP) and end-of-sequence (EOS), affects the quality of text generation. Transformer-based pretrained language models (LMs) have demonstrated the ability to gen...
Preprint
Full-text available
There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions...
Preprint
Full-text available
Techniques for automatically extracting important content elements from business documents such as contracts, statements, and filings have the potential to make business operations more efficient. This problem can be formulated as a sequence labeling task, and we demonstrate the adaption of BERT to two types of business documents: regulatory filing...
Conference Paper
Document ranking experiments should be repeatable. However, the interaction between multi-threaded indexing and score ties during retrieval may yield non-deterministic rankings, making repeatability not as trivial as one might imagine. In the context of the open-source Lucene search engine, score ties are broken by internal document ids, which are...
Conference Paper
Millions of consumers issue voice queries through television-based entertainment systems such as the Comcast X1, the Amazon Fire TV, and Roku TV. Automatic speech recognition (ASR) systems are responsible for transcribing these voice queries into text to feed downstream natural language understanding modules. However, ASR is far from perfect, often...
Conference Paper
Is neural IR mostly hype? In a recent SIGIR Forum article, Lin expressed skepticism that neural ranking models were actually improving ad hoc retrieval effectiveness in limited data scenarios. He provided anecdotal evidence that authors of neural IR papers demonstrate "wins" by comparing against weak baselines. This paper provides a rigorous evalua...
Conference Paper
Modern in-home entertainment platforms (representing the evolution of the humble television of yesteryear) are packed with features and content: they offer a dizzying array of programs spanning hundreds of channels as well as a catalog of on-demand programs offering tens of thousands of options. Furthermore, the entertainment platform may serve a...
Conference Paper
Full-text available
Neural network models for many NLP tasks have grown increasingly complex in recent years, making training and deployment more difficult. A number of recent papers have questioned the necessity of such architectures and found that well-executed, simpler models are quite effective. We show that this is also the case for document classification: in a...
Preprint
Is neural IR mostly hype? In a recent SIGIR Forum article, Lin expressed skepticism that neural ranking models were actually improving ad hoc retrieval effectiveness in limited data scenarios. He provided anecdotal evidence that authors of neural IR papers demonstrate "wins" by comparing against weak baselines. This paper provides a rigorous evalua...
Preprint
Motivated by recent commentary that has questioned today's pursuit of ever-more complex models and mathematical formalisms in applied machine learning and whether meaningful empirical progress is actually being made, this paper tries to tackle the decades-old problem of pseudo-relevance feedback with "the simplest thing that can possibly work". I p...
Preprint
Full-text available
Pre-trained language representation models achieve remarkable state of the art across a wide range of tasks in natural language processing. One of the latest advancements is BERT, a deep pre-trained transformer that yields much better results than its predecessors do. Despite its burgeoning popularity, however, BERT has not yet been applied to docu...
Preprint
Full-text available
Recently, a simple combination of passage retrieval using off-the-shelf IR techniques and a BERT reader was found to be very effective for question answering directly on Wikipedia, yielding a large improvement over the previous state of the art on a standard benchmark dataset. In this paper, we present a data augmentation technique using distant su...
Preprint
We present simple BERT-based models for relation extraction and semantic role labeling. In recent years, state-of-the-art performance has been achieved using neural models by incorporating lexical and syntactic features such as part-of-speech tags and dependency trees. In this paper, extensive experiments on datasets for these two tasks show that w...
Chapter
In the framework of axiomatic information retrieval, the semantic term matching technique proposed by Fang and Zhai in SIGIR 2006 has been shown to be effective in addressing the vocabulary mismatch problem, with experimental evidence provided from newswire collections. This paper reproduces and generalizes these results in Anserini, an open-source...
Chapter
We tackle the problem of transferring relevance judgments across document collections for specific information needs by reproducing and generalizing the work of Grossman and Cormack from the TREC 2017 Common Core Track. Their approach involves training relevance classifiers using human judgments on one or more existing (source) document collections...
Preprint
Full-text available
In the natural language processing literature, neural networks are becoming increasingly deep and complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding...
Preprint
Following recent successes in applying BERT to question answering, we explore simple applications to ad hoc document retrieval. This required confronting the challenge posed by documents that are typically longer than the length of input BERT was designed to handle. We address this issue by applying inference on sentences individually, and then agg...
Conference Paper
Countless voice-enabled user interfaces rely on keyword spotting (KWS) systems for wake word detection and simple command recognition. As a practical matter, these applications run on "edge" devices, where dozens of different platforms exist; typically, platform-dependent implementations are required whenever keyword spotting capabilities are needed...
Preprint
This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on existing cross-ontology links between DBpedia and Wi...
Preprint
Full-text available
We demonstrate an end-to-end question answering system that integrates BERT with the open-source Anserini information retrieval toolkit. In contrast to most question answering and reading comprehension models today, which operate over small amounts of input text, our system integrates best practices from IR with a BERT-based reader to identify answ...
Preprint
Voice-enabled commercial products are ubiquitous, typically enabled by lightweight on-device keyword spotting (KWS) and full automatic speech recognition (ASR) in the cloud. ASR systems require significant computational resources in training and for inference, not to mention copious amounts of annotated speech data. KWS systems, on the other hand,...
Preprint
There exists a plethora of techniques for inducing structured sparsity in parametric models during the optimization process, with the final goal of resource-efficient inference. However, to the best of our knowledge, none target a specific number of floating-point operations (FLOPs) as part of a single end-to-end optimization objective, despite rep...
Preprint
In recent years, we have witnessed a dramatic shift towards techniques driven by neural networks for a variety of NLP tasks. Undoubtedly, neural language models (NLMs) have reduced perplexity by impressive amounts. This progress, however, comes at a substantial cost in performance, in terms of inference latency and energy consumption, which is part...
Preprint
This paper explores the problem of ranking short social media posts with respect to user queries using neural networks. Instead of starting with a complex architecture, we proceed from the bottom up and examine the effectiveness of a simple, word-level Siamese architecture augmented with attention-based mechanisms for capturing semantic soft matche...
Preprint
Used for simple command recognition on devices from smart routers to mobile phones, keyword spotting systems are everywhere. Ubiquitous as well are web applications, which have grown in popularity and complexity over the last decade with significant improvements in usability under cross-platform conditions. However, despite their obvious advantage...
Article
This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing systems, they are primarily designed to facilitate the publication of research papers. As such, these systems ar...
Preprint
Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space where better perplexity typically comes at the cost of greater computation complexity. In a software keyboard application on mobile devices, this translates into higher power consumption and shorter battery life. This paper represents the first attempt, to our knowledge, i...
Conference Paper
Full-text available
We tackle the challenge of understanding voice queries posed against the Comcast Xfinity X1 entertainment platform, where consumers direct speech input at their "voice remotes". Such queries range from specific program navigation (e.g., watch a movie) to requests with vague intents and even queries that have nothing to do with watching TV. We prese...
Preprint
Document ranking experiments should be repeatable: running the same ranking model over the same collection with the same queries should yield exactly the same output. However, the presence of different documents with the same score may yield non-deterministic rankings, making repeatability not as trivial as one might imagine. In the context of our...
Conference Paper
A recently-introduced product of Comcast, a large cable company in the United States, is a "voice remote" that accepts spoken queries from viewers. We present an analysis of a large query log from this service to answer the question: "What do viewers say to their TVs?" In addition to a descriptive characterization of queries and sessions, we descri...
Conference Paper
Real-time summarization systems that monitor document streams to identify relevant content have a few options for delivering system updates to users. In a mobile context, systems could send push notifications to users' mobile devices, hoping to grab their attention immediately. Alternatively, systems could silently deposit updates into "inboxes" th...
Preprint
RDF data in the linked open data (LOD) cloud is very valuable for many different applications. In order to unlock the full value of this data, users should be able to issue complex queries on the RDF datasets in the LOD cloud. SPARQL can express such complex queries, but constructing SPARQL queries can be a challenge to users since it requires know...
Conference Paper
Twitter's data engineering team is faced with the challenge of processing billions of events every day in batch and in real time, and we have built various tools to meet these demands. In this paper, we describe TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitor...
Preprint
Full-text available
Despite substantial interest in applications of neural networks to information retrieval, neural ranking models have only been applied to standard ad hoc retrieval tasks over web pages and newswire documents. This paper proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network), a novel neural ranking model specifically designed...
Article
The canonical analytics architecture today consists of a browser connected to a backend in the cloud. In all deployments that we are aware of, the browser is simply a dumb rendering endpoint. As an alternative, this paper explores split-execution architectures that push analytics capabilities into the browser. We show that, by taking advantage of t...
Article
Serverless architectures organized around loosely-coupled function invocations represent an emerging design for many applications. Recent work mostly focuses on user-facing products and event-driven processing pipelines. In this paper, we explore a completely different part of the application space and examine the feasibility of analytical processi...
Conference Paper
Large-scale retrieval systems often employ cascaded ranking architectures, in which an initial set of candidate documents is iteratively refined and re-ranked by increasingly sophisticated and expensive ranking models. In this paper, we propose a unified framework for predicting a range of performance-sensitive parameters based on minimizing end-t...
Article
We examine the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact. Adopting a straightforward decomposition of the problem into entity detection, entity linking, relation prediction, and evidence combination, we explore simple yet strong baselines. On the SimpleQuest...
Article
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software...
Conference Paper
We tackle the novel problem of navigational voice queries posed against an entertainment system, where viewers interact with a voice-enabled remote controller to specify the TV program to watch. This is a difficult problem for several reasons: such queries are short, even shorter than comparable voice queries in other domains, which offers fewer op...
Conference Paper
There is growing interest in systems that generate timeline summaries by filtering high-volume streams of documents to retain only those that are relevant to a particular event or topic. Continued advances in algorithms and techniques for this task depend on standardized and reproducible evaluation methodologies for comparing systems. However, time...
Article
Nearly all previous work on small-footprint keyword spotting with neural networks quantifies model footprint in terms of the number of parameters and multiply operations for an inference pass. These values are, however, proxy measures since empirical performance in actual deployments is determined by many factors. In this paper, we study the power co...
Article
We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model de...
Article
Full-text available
We describe Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow. These models are useful for recognizing "command triggers" in speech-based interfaces (e.g., "Hey Siri"), which serve as explicit cues for audio recordings of utterances that are sent to the cl...
Conference Paper
There is an emerging consensus that time is an important indicator of relevance for searching streams of social media posts. In a process similar to pseudo-relevance feedback, the distribution of document timestamps from the results of an initial query can be leveraged to infer the distribution of relevant documents, for example, using kernel dens...
Conference Paper
Serverless architectures represent a new approach to designing applications in the cloud without having to explicitly provision or manage servers. The developer specifies functions with well-defined entry and exit points, and the cloud provider handles all other aspects of execution. In this paper, we explore a novel application of serverless archi...
Conference Paper
We propose a utility-based framework for the evaluation of push notification systems that monitor document streams for users' topics of interest. Our starting point is that users derive either positive utility (i.e., "gain") or negative utility (i.e., "pain") from consuming system updates. By separately keeping track of these quantities, we can mea...
Conference Paper
Quantization, the pre-calculation and conversion to integers of term/document weights in an inverted index, is a well studied aspect of search engines that substantially improves retrieval efficiency. Previous work has considered the impact of quantization on effectiveness-efficiency tradeoffs in retrieval, for example, exploring the relationship b...
Article
Full-text available
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software...
Conference Paper
Information retrieval test collections are typically built using data from large-scale evaluations in international forums such as TREC, CLEF, and NTCIR. Previous validation studies on pool-based test collections for ad hoc retrieval have examined their reusability to accurately assess the effectiveness of systems that did not participate in the or...
Conference Paper
Software toolkits play an essential role in information retrieval research. Most open-source toolkits developed by academics are designed to facilitate the evaluation of retrieval models over standard test collections. Efforts are generally directed toward better ranking and less attention is usually given to scalability and other operational consi...
Conference Paper
We present a system for identifying interesting social media posts on Twitter and delivering them to users' mobile devices in real time as push notifications. In our problem formulation, users are interested in broad topics such as politics, sports, and entertainment: our system processes tweets in real time to identify relevant, novel, and salient...
Conference Paper
As an empirical discipline, information access and retrieval research requires substantial software infrastructure to index and search large collections. This workshop is motivated by the desire to better align information retrieval research with the practice of building search applications from the perspective of open-source information retrieval...
Conference Paper
Real-time push notification systems monitor continuous document streams such as social media posts and alert users to relevant content directly on their mobile devices. We describe a user study of such systems in the context of the TREC 2016 Real-Time Summarization Track, where system updates are immediately delivered as push notifications to the m...
Conference Paper
We propose a heuristic called "one answer per document" for automatically extracting high-quality negative examples for answer selection in question answering. Starting with a collection of question-answer pairs from the popular TrecQA dataset, we identify the original documents from which the answers were drawn. Sentences from these source documen...
Conference Paper
Due to Twitter's terms of service that forbid redistribution of content, creating publicly downloadable collections of tweets for research purposes has been a perpetual problem for the research community. Some collections are distributed by making available the ids of the tweets that comprise the collection and providing tools to fetch the actual c...
Conference Paper
Full-text available
In recent years, neural networks have been applied to many text processing problems. One example is learning a similarity function between pairs of text, which has applications to paraphrase extraction, plagiarism detection, question answering, and ad hoc retrieval. Within the information retrieval community, the convolutional neural network model...
Article
Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving...
Article
We explore different approaches to integrating a simple convolutional neural network (CNN) with the Lucene search engine in a multi-stage ranking architecture. Our models are trained using the PyTorch deep learning toolkit, which is implemented in C/C++ with a Python frontend. One obvious integration strategy is to expose the neural network directl...
Article
Full-text available
Time is an important relevance signal when searching streams of social media posts. The distribution of document timestamps from the results of an initial query can be leveraged to infer the distribution of relevant documents, which can then be used to rerank the initial results. Previous experiments have shown that kernel density estimation is a s...
Article
Full-text available
Most work on natural language question answering today focuses on answer selection: given a candidate list of sentences, determine which contains the answer. Although important, answer selection is only one stage in a standard end-to-end question answering pipeline. This paper explores the effectiveness of convolutional neural networks (CNNs) for a...
Conference Paper
With the advent of online social networks, there is an increasing demand for storage and processing of graph-structured data. Social networking applications pose new challenges to data management systems due to demand for real-time querying and manipulation of the graph structure. Recently, several specialized systems for graph-structured d...
Article
Full-text available
We tackle the novel problem of navigational voice queries posed against an entertainment system, where viewers interact with a voice-enabled remote controller to specify the program to watch. This is a difficult problem for several reasons: such queries are short, even shorter than comparable voice queries in other domains, which offers fewer oppor...
Conference Paper
This demonstration explores the novel and unconventional idea of implementing an analytical RDBMS in pure JavaScript so that it runs completely inside a browser with no external dependencies. Our prototype, called Afterburner, generates compiled query plans that exploit two JavaScript features: typed arrays and asm.js. On the TPC-H benchmark, we sh...
Article
We tackle the challenge of topic classification of tweets in the context of analyzing a large collection of curated streams by news outlets and other organizations to deliver relevant content to users. Our approach is novel in applying distant supervision based on semi-automatically identifying curated streams that are topically focused (for exampl...
Article
Full-text available
Scalable web search systems typically employ multi-stage retrieval architectures, where an initial stage generates a set of candidate documents that are then pruned and re-ranked. Since subsequent stages typically exploit a multitude of features of varying costs using machine-learned models, reducing the number of documents that are considered at e...
Conference Paper
This paper explores a simple question: How would we provide a high-quality search experience on Mars, where the fundamental physical limit is speed-of-light propagation delays on the order of tens of minutes? On Earth, users are accustomed to nearly instantaneous responses from web services. Is it possible to overcome orders-of-magnitude longer lat...
Conference Paper
The basic idea behind selective search is to partition a collection into topical clusters, and for each query, consider only a subset of the clusters that are likely to contain relevant documents. Previous work on web collections has shown that it is possible to retain high-quality results while considering only a small fraction of the collection....
Conference Paper
We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index o...
Article
Full-text available
This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings wit...
Conference Paper
Modern multi-stage retrieval systems are comprised of a candidate generation stage followed by one or more reranking stages. In such an architecture, the quality of the final ranked list may not be sensitive to the quality of the initial candidate pool, especially in terms of early precision. This provides several opportunities to increase retrieva...
Conference Paper
The size of a search engine index and the time to search are inextricably related through the compression codec. This investigation examines this tradeoff using several relatively unexplored SIMD-based codecs including QMX, TurboPackV, and TurboPFor. It uses (the non-SIMD) OPTPFor as a baseline. Four new variants of QMX are introduced and also comp...
Conference Paper
Nugget-based evaluations, such as those deployed in the TREC Temporal Summarization and Question Answering tracks, require human assessors to determine whether a nugget is present in a given piece of text. This process, known as nugget annotation, is labor-intensive. In this paper, we present two active learning techniques that prioritize the seque...