Ming Zhou

New York University | NYU · Department of Pathology

About

541 Publications
77,560 Reads
28,629 Citations

Publications (541)
Preprint
Task generalization has been a long-standing challenge in Natural Language Processing (NLP). Recent research attempts to improve the task generalization ability of pre-trained language models by mapping NLP tasks into human-readable prompted forms. However, these approaches require laborious and inflexible manual collection of prompts, and differen...
Conference Paper
Tabular and textual question answering requires systems to perform reasoning over heterogeneous information, considering table structure and the connections between table and text. In this paper, we propose a ChAin-centric Reasoning and Pre-training framework (CARP). CARP utilizes a hybrid chain to model the explicit intermediate reasoning process acr...
Article
The 5th edition of the WHO Classification of Tumours of the Urinary and Male Genital Systems encompasses several updates to the classification and diagnosis of prostatic carcinoma as well as incorporating advancements in assessment of its prognosis, including recent grading modifications. Some of the salient aspects include: 1) recognition that PIN...
Preprint
Question Answering (QA) is a longstanding challenge in natural language processing. Existing QA works mostly focus on specific question types, knowledge domains, or reasoning skills. The specialty in QA research hinders systems from modeling commonalities between tasks and generalization for wider applications. To address this issue, we present Pro...
Article
The 5th edition of the WHO Classification of Tumours of the Urinary and Male Genital Systems contains relevant revisions and introduces a group of molecularly defined renal tumour subtypes. Herein we present the World Health Organization (WHO) 2022 perspectives on papillary and chromophobe renal cell carcinoma with emphasis on their evolving classi...
Preprint
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion th...
Preprint
Tabular and textual question answering requires systems to perform reasoning over heterogeneous information, considering table structure and the connections between table and text. In this paper, we propose a ChAin-centric Reasoning and Pre-training framework (CARP). CARP utilizes a hybrid chain to model the explicit intermediate reasoning process acr...
Article
Full-text available
Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent va...
Article
Most succinate dehydrogenase (SDH)-deficient renal cell carcinomas (RCCs) demonstrate stereotypical morphology characterized by bland eosinophilic cells with frequent intracytoplasmic inclusions. However, variant morphologic features have been increasingly recognized. We therefore sought to investigate the incidence and characteristics of SDH-defic...
Chapter
In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through four novel generation tasks, including Adversarial Image Captioning (AIC), Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA),...
Preprint
We propose a novel task of jointly repairing program code and generating commit messages. Code repair and commit message generation are two essential and related tasks for software development. However, existing work usually performs the two tasks independently. We construct a multilingual triple dataset including buggy code, fixed code, and commi...
Preprint
Full-text available
Multilingual pre-trained models have demonstrated their effectiveness in many multilingual NLP tasks and enabled zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, due to significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cro...
Article
Accurate diagnosis of cribriform Gleason pattern 4 (CrP4) prostate adenocarcinoma (PCa) is important due to its independent association with adverse clinical outcomes and as a growing body of evidence suggests that it impacts clinical decision making in PCa management. To identify reproducible features for diagnosis of CrP4, we assessed interobserv...
Preprint
Full-text available
Complex reasoning aims to draw a correct inference based on complex rules. As a hallmark of human intelligence, it involves a degree of explicit reading comprehension, interpretation of logical knowledge and complex rule application. In this paper, we take a step forward in complex reasoning by systematically studying the three challenging and doma...
Article
The Genitourinary Pathology Society (GUPS) undertook a critical review of the recent advances in bladder cancer focusing on important topics of high interest for the practicing surgical pathologist and urologist. This review represents the second of 2 manuscripts ensuing from this effort. Herein, we address the effective reporting of bladder cancer...
Article
The Genitourinary Pathology Society (GUPS) undertook a critical review of the recent advances in bladder neoplasia with a focus on issues relevant to the practicing surgical pathologist for the understanding and effective reporting of bladder cancer, with particular emphasis on the newly accumulated evidence post-2016 World Health Organization (WHO...
Chapter
Relation extraction benefits a variety of applications requiring relational understanding of unstructured texts, such as question answering. Recently, capsule network-based models have been proposed for improving relation extraction with better capability of modeling complex entity relations. However, they fail to capture the syntactic structure in...
Preprint
Full-text available
Logical reasoning of text requires understanding critical logical information in the text and performing inference over them. Large-scale pre-trained models for logical reasoning mainly focus on word-level semantics of text while struggling to capture symbolic logic. In this paper, we propose to understand logical symbols and expressions in the tex...
Preprint
Analytical reasoning is an essential and challenging task that requires a system to analyze a scenario involving a set of particular circumstances and perform reasoning over it to draw conclusions. In this paper, we study the challenge of analytical reasoning of text and introduce a new dataset consisting of questions from the Law School Admission...
Preprint
Full-text available
Standard automatic metrics (such as BLEU) are problematic for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones nor can they identify the specific discourse phenomena that caused the translation errors. To address these problems, we propose an automatic metric Blon...
Article
Full-text available
The Genitourinary Pathology Society (GUPS) reviewed recent advances in renal neoplasia, particularly post-2016 World Health Organization (WHO) classification, to provide an update on existing entities, including diagnostic criteria, molecular correlates, and updated nomenclature. Key prognostic features for clear cell renal cell carcinoma (RCC) rem...
Preprint
Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and compariso...
Article
The Genitourinary Pathology Society (GUPS) undertook a critical review of the recent advances in renal neoplasia, particularly focusing on the newly accumulated evidence post-2016 World Health Organization (WHO) classification. In the era of evolving histo-molecular classification of renal neoplasia, morphology is still key. However, entities (or g...
Preprint
In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as differing in the extent to which previous tokens can be attended to, and BANG bridges AR and NAR generation by designing a novel model structure for large-scale pre-trai...
Article
Full-text available
The International Society of Urological Pathology (ISUP) hosts a reference image database supervised by experts with the purpose of establishing an international standard in prostate cancer grading. Here, we aimed to identify areas of grading difficulties and compare the results with those obtained from an artificial intelligence system trained in...
Preprint
In a sponsored search engine, generative retrieval models have recently been proposed to mine relevant advertisement keywords for users' input queries. Generative retrieval models generate outputs token by token on a path of the target library prefix tree (Trie), which guarantees all of the generated outputs are legal and covered by the target library. I...
Preprint
Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for uns...
Preprint
Deepfake detection, the task of automatically discriminating machine-generated text, is increasingly critical with recent advances in natural language generative models. Existing approaches to deepfake detection typically represent documents with coarse-grained representations. However, they struggle to capture factual structures of documents, whic...
Preprint
Full-text available
We propose a novel language-independent approach to improve the efficiency for Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC). ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. Then, ESC leverages a seq2seq model...
Preprint
In this paper, we propose a novel data augmentation method, referred to as Controllable Rewriting based Question Data Augmentation (CRQDA), for machine reading comprehension (MRC), question generation, and question-answering natural language inference tasks. We treat the question data augmentation task as a constrained question rewriting problem to...
Chapter
In a sponsored search engine, generative retrieval models have recently been proposed to mine relevant advertisement keywords for users’ input queries. Generative retrieval models generate outputs token by token on a path of the target library prefix tree (Trie), which guarantees all of the generated outputs are legal and covered by the target library. I...
Preprint
Evaluation metrics play a vital role in the growth of an area, as they define the standard for distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metrics are BLEU and perfect accuracy, but they are not well suited to evaluating code, because BLEU was originally designed to evaluate the natural langua...
Preprint
Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code sem...
Article
Full-text available
This study aims to describe the pathological features and clinical outcomes in anterior-dominant prostate cancer (APCA) compared to posterior/posterolateral-dominant prostate cancer (PPCA) among men treated with radical prostatectomy for localized prostate cancer. This is a single-institution, matched case-control analysis of short-term clinical ou...
Preprint
Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription. The separation model extracts a single speaker signal from a mixed speech. In this paper, we use transformer and conformer in lieu of recurrent neural networks in the separation system, as we believe capturing global information w...
Article
Importance: For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice. Objective: To ev...
Preprint
Full-text available
In this work, we formulate cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, the information-theoretic framework inspires us to propose a pre-trainin...
Article
Full-text available
Context.— Controversies and uncertainty persist in prostate cancer grading. Objective.— To update grading recommendations. Data Sources.— Critical review of the literature along with pathology and clinician surveys. Conclusions.— Percent Gleason pattern 4 (%GP4) is as follows: (1) report %GP4 in needle biopsy with Grade Groups (GrGp) 2 and 3, an...
Preprint
Generating inferential texts about an event from different perspectives requires reasoning over the different contexts in which the event occurs. Existing works usually ignore the context that is not explicitly provided, resulting in a context-independent semantic representation that struggles to support the generation. To address this, we propose an approac...
Preprint
This paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning and weight sharing. The model learns universal representations that can map objects that occurred in different modalities or express...
Preprint
Full-text available
Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset with fine-grained token-level...
Preprint
Natural Questions is a new challenging machine reading comprehension benchmark with two-grained answers, which are a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer). Despite the effectiveness of existing methods on this benchmark, they treat these two sub-tasks individually during training while...
Preprint
We study the detection of propagandistic text fragments in news articles. Instead of merely learning from input-output datapoints in training data, we introduce an approach to inject declarative knowledge of fine-grained propaganda techniques. We leverage declarative knowledge expressed in both natural language and first-order logic. The former ref...
Preprint
Verifying the correctness of a textual statement requires not only semantic reasoning about the meaning of words, but also symbolic reasoning about logical operations like count, superlative, aggregation, etc. In this work, we propose LogicalFactChecker, a neural network approach capable of leveraging logical operations for fact checking. It achiev...
Preprint
Full-text available
In this paper, we introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, which is a key component of transformer, a state-of-the-art model for various NLP tasks. In contrast to the conventional dropout mechanisms which randomly drop units or connections, the proposed DropHead is a...
Preprint
We study question answering over a dynamic textual environment. Although neural network models achieve impressive accuracy via learning from input-output examples, they rarely leverage various types of knowledge and are generally not interpretable. In this work, we propose a graph-based approach, where a heterogeneous graph is automatically built w...
Preprint
End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder only through simple speech recognition i...
Preprint
Pre-training text representations has recently been shown to significantly improve the state-of-the-art in many natural language processing tasks. The central goal of pre-training is to learn text representations that are useful for subsequent tasks. However, existing approaches are optimized by minimizing a proxy objective, such as the negative lo...
Preprint
Extractive methods have proven to be very effective in automatic document summarization. Previous works perform this task by identifying informative contents at sentence level. However, it is unclear whether performing extraction at sentence level is the best solution. In this work, we show that unnecessity and redundancy issues exist when extracti...