Huizhen Wang's research while affiliated with Northeastern University (Shenyang, China) and other places

Publications (39)

Chapter
Open relation extraction aims at extracting novel relations from open-domain corpora. However, most recent works typically treat entities and tokens equally while encoding sentences, without taking full advantage of the guiding role of entities in representation learning. In this work, we propose the Entity-Aware Relation Representation learning fr...
Preprint
Unsupervised Bilingual Dictionary Induction methods based on the initialization and the self-learning have achieved great success in similar language pairs, e.g., English-Spanish. But they still fail and have an accuracy of 0% in many distant language pairs, e.g., English-Japanese. In this work, we show that this failure results from the gap betwee...
Preprint
Deep encoders have been proven to be effective in improving neural machine translation (NMT) systems, but training an extremely deep encoder is time consuming. Moreover, why deep models help NMT is an open question. In this paper, we investigate the behavior of a well-tuned deep Transformer system. We find that stacking layers is helpful in improvi...
Article
Shift-reduce parsing has been studied extensively for diverse grammars due to the simplicity and running efficiency. However, in the field of constituency parsing, shift-reduce parsers lag behind state-of-the-art parsers. In this paper we propose a semi-supervised approach for advancing shift-reduce constituency parsing. First, we apply the uptrain...
Conference Paper
This paper proposes a method to improve shift-reduce constituency parsing by using lexical dependencies. The lexical dependency information is obtained from a large amount of auto-parsed data that is generated by a baseline shift-reduce parser on unlabeled data. We then incorporate a set of novel features defined on this information into the shift-...
Article
Opinion polling has been traditionally done via customer satisfaction studies in which questions are carefully designed to gather customer opinions about target products or services. This paper studies aspect-based opinion polling from unlabeled free-form textual customer reviews without requiring customers to answer any questions. First, a multi-a...
Article
This paper describes the NiuTrans system developed by the Natural Language Processing Lab at Northeastern University for the Patent Machine Translation Task at NTCIR-9. We present our submissions to the nine tracks of CWMT2011, and show several improvements to our core phrase-based and syntax-based engines, including: an approach to improving searc...
Article
This paper addresses an issue of incorporating topic knowledge to improve Chinese word sense disambiguation. The key is how to learn topic knowledge as features in the design of classifiers for disambiguating word senses. This paper presents two solutions to learn topic knowledge. In the first solution, a Chinese domain knowledge dictionary named N...
Conference Paper
In statistical machine translation (SMT), syntax-based models generally rely on the syntactic information provided by syntactic parsers in source language, target language or both of them. However, whether or how parsers impact the performance of syntax-based systems is still an open issue in the MT field. In this paper, we make an attempt to explo...
Article
To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses one classifier to identify unlabeled examples with the least confidence. Uncer...
Article
This paper addresses the issue of sentiment word identification given an opinionated sentence, which is very important in sentiment analysis tasks. The most common way to tackle this problem is to utilize a readily available sentiment lexicon such as HowNet or SentiWordNet to determine whether a word is a sentiment word. However, in practice, words...
Article
This paper proposes a simple but powerful approach for automatically obtaining technical term translation pairs in the patent domain from the Web. First, several technical terms are used as seed queries and submitted to a search engine. Second, an extraction algorithm is proposed to extract key word translation pairs from the returned web pages....
Article
In this paper, we propose a novel Chinese-English organization name translation method with the assistance of mix-language web resources. Firstly, all the implicit out-of-vocabulary terms in the input Chinese organization name are recognized by a CRFs model. Then the input Chinese organization name is translated without considering these recognized...
Article
The labor-intensive task of labeling data is a serious bottleneck for many supervised learning approaches for natural language processing applications. Active learning aims to reduce the human labeling cost for supervised learning methods. Determining when to stop the active learning process is a very important practical issue in real-world applica...
Conference Paper
In this paper, we present a simple and effective method to address the issue of how to generate diversified translation systems from a single Statistical Machine Translation (SMT) engine for system combination. Our method is based on the framework of boosting. First, a sequence of weak translation systems is generated from a baseline system in an i...
Conference Paper
This paper proposes a new approach for translating Chinese organization names that uses an example-based method along with Web assistance. It consists of two phases: first, it generates a translation candidate for the input Chinese organization name by an example-based translation method; second, it uses the Web to amend this translation candida...
Article
This paper presents an approach to translating Chinese organization names into English based on correlative expansion. First, some candidate translations are generated using a statistical translation method, and several correlative named entities for the input are retrieved from a correlative named entity list. Second, three kinds of expansi...
Article
Among the techniques to solve the knowledge bottleneck problem of supervised learning models, active learning is a promising method. One of the popular techniques of active learning is uncertainty sampling which, however, often presents problems when outliers are selected. To solve this problem, this paper presents a density-based re-ranking techni...
Conference Paper
One of the popular techniques of active learning for data annotation is uncertainty sampling, which, however, often presents problems when outliers are selected. To solve this problem, this paper proposes a density-based re-ranking technique, in which a density measure is adopted to determine whether an unlabeled example is an outlier. The motivati...
Conference Paper
The performance of machine learning methods heavily depends on the volume of used training data. For the purpose of dataset enlargement, it is of interest to study the problem of unifying multiple labeled datasets with different annotation standards. In this paper, we focus on the case of unifying datasets for sequence labeling problems with natura...
Conference Paper
This paper presents an unsupervised approach to aspect-based opinion polling from raw textual reviews without explicit ratings. The key contribution of this paper is three-fold. First, a multi-aspect bootstrapping algorithm is proposed to learn from unlabeled data aspect-related terms of each aspect to be used for aspect identification. Second, an...
Article
Aspect-based sentiment summarization systems generally use sentences associated with relevant aspects extracted from the reviews as the basis for summarization. However, in real reviews, a single sentence often exhibits several aspects for opinions. This paper proposes a two-stage segmentation model to address the challenge of identifying multiple...
Conference Paper
This paper describes our English patent mining system for the NTCIR-7 patent mining task, which maps a research paper abstract into the IPC taxonomy. Our system is built on the k-Nearest Neighbor framework, in which various similarity calculation and ranking methods are used. We employ two re-ranking techniques to improve the performance by the us...
Article
A new divergence-based approach to feature selection for naive Bayes text classification is proposed in this paper. In this approach, the discrimination power of each feature is directly used for ranking various features through a criterion named overall-divergence, which is based on the divergence measures evaluated between various class density f...
Article
We present a machine learning approach for coreference resolution of noun phrases. In our method, we use CRFs as a basic training model, and use an active learning method to generate combined features so as to use existing features more effectively. We also propose a novel clustering algorithm which uses both linguistic knowledge and statistical knowl...
Conference Paper
Text segmentation has a wide range of applications such as information retrieval, question answering and text summarization. In recent years, the use of semantics has been proven to be effective in improving the performance of text segmentation. Particularly, in finding the subtopic boundaries, there have been efforts in focusing on either maximizi...
Conference Paper
This paper addresses two issues of active learning. First, to solve the problem that uncertainty sampling often fails by selecting outliers, this paper presents a new selective sampling technique, sampling by uncertainty and density (SUD), in which a k-Nearest-Neighbor-based density measure is adopted to determine whether an unlabeled examp...
Conference Paper
In this paper, we address the issue of deciding when to stop active learning for building a labeled training corpus. First, this paper presents a new stopping criterion, classification-change, which considers the potential ability of each unlabeled example to change decision boundaries. Second, a multi-criteria-based combination strategy...
Conference Paper
In this paper we focus on the problem of class discrimination issues to improve performance of text classification, and study a discrimination-based feature selection technique in which the features are selected based on the criterion of enlarging separation among competing classes, referred to as discrimination capability. The proposed approach di...
Article
We participated in the Third International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter NEUCipSeg in the close track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSRA), and University of Pennsylvania/University of Colorado (UPENN). Bas...
Conference Paper
The dotplotting method, employed by Reynar (1994), is a state-of-the-art algorithm for automatic linear text segmentation. However, several problems are found in its measure for assessing density that represents topical coherence: the density function is asymmetric, leading to the apparent false conclusion that forward scan may result in different...
Conference Paper
The technology of topic tracking can help people find what they are interested in from the vast sea of information. Since topics develop dynamically, the topic excursion problem may appear in the tracking process. To overcome this problem and the shortcomings of current adaptive methods, we propose a new adaptive method for topic tracking. We call it time ad...
Conference Paper
This paper presents a cluster-based text categorization system which uses class distributional clustering of words. We propose a new clustering model which considers the global information over all the clusters. The model can group words into clusters based on the distribution of class labels associated with each word. Using these learned...
Article
This paper presents our systems for the Chinese Personal Name Disambiguation task in CIPS-SIGHAN 2010. We submitted two different systems for this task, and both achieved the best performance. This paper introduces the multi-stage clustering framework and some key techniques used in our systems, and demonstrates exp...
Article
This paper describes the NiuTrans system developed by the Natural Language Processing Lab at Northeastern University for the NTCIR-9 Patent Machine Translation task (NTCIR-9 PatentMT). We present our submissions to the two tracks of NTCIR-9 PatentMT, and show several improvements to our phrase-based Statistical MT engine, including: a hybrid reorde...
Article
In this paper, we address the problem of knowing when to stop the process of active learning. We propose a new statistical learning approach, called minimum expected error strategy, to defining a stopping criterion through estimation of the classifier's expected error on future unlabeled examples in the active learning process. In experiments on ac...
Article
The Dotplotting method has been widely used for text segmentation for its merits in detecting lexical repetition in global context. However, a theoretical analysis of its segmentation criterion function finds several deficiencies. The original function cannot make full use of the text structure features and does not suit the text segmentation task...

Citations

... For comparison, we also include the results reported by two recent papers which achieved particularly strong performance for distant languages (Zhou et al., 2019; Li et al., 2020). Table 1 shows the UBLI results for distance language pairs. ...
... chatbots) increases, emotion recognition in conversation becomes more important. Emotions are additional information that can better understand the speaker's state of conversation, and this is used to design an empathic dialogue system [1,2,3]. Emotion also helps provide personalized results, such as social media opinion mining [4] and recommendation systems [5]. ...
... Indeed, Suzuki et al. (2009) show that it is possible to reach improvements in dependency parsing beyond what is possible with word clustering when combining a discriminative model that uses word clusters with an ensemble of generative models that are used as features. Zhu et al. (2013) show that it is also possible to use lexical dependency statistics learned from a large corpus to improve a state-of-the-art shift-reduce parser for constituents. ...
... A general formulation for the application of the perceptron algorithm to various problems, including shift-reduce constituency parsing, has been introduced by Zhang and Clark (2011b). Improvements have followed (Zhu et al., 2012; Zhu et al., 2013). A similar strategy has been shown to work well for CCG parsing (Zhang and Clark, 2011a), too. ...
... Malioutov et al. [24] formalize segmentation as a graph-partitioning problem and propose a minimum cut model based on tf ·idf features to segment lectures. Ye et al. [25] minimize between-segment similarity while maximizing within-segment similarity. However, the above complicated approaches are known as global methods: when we perform segmentation between two successive sentences, future context information is needed. ...
... k) A syntactic constraint-led approach to Collaborative Fusion Machine Translation (CFMT), presented in [52], is based on group Rorschach tests, in addition to syntactic constraint-led machine translation processes. A simple approach for obtaining technical term translation pairs is proposed in [20]. The system has the following steps: first, several technical terms are used as seed queries and submitted to a search engine. Second, an extraction algorithm is proposed to extract some key word translation pairs from the returned web pages. ...
... NLP is a strong tool in such a way that it can identify the credit transfer for a standard course in different universities without studying the course more than once [13]. Some basic studies supported that astronomy improvising student attitudes towards science and evaluation capability Lexicon based method [12]. While converting the user opinions into tokens/words then to Word Cloud, parsing is an essential trick for successful conversion. ...
... Mukherjee et al. presented two improved topic models to extract aspects and sentiment words in [28]. Zhu et al. proposed a two-stage segmentation model to identify multiple single-aspect and single-polarity units within one sentence in [29]. ...
... In [1][2][3][4], authors developed clustering techniques over the rich extracted biographic features, such as gender, nationality, origin, date of birth, family relationships, address, title, etc. The approach taken by [5][6][7] and [8] applied hierarchical clustering over the similarity weighted by vector space model between two namesakes. In [9], Lili Jiang changed documents into graph based on the co-occurrence of features, and then graph clustering and partitioning is developed to realize name disambiguation. ...
... These techniques serve to extract information about technological innovations for businesses to make strategic investment [7] and to resolve problems inherent within the patent data for search and analysis. Research based methods generally enrich the patent data, such as with metadata [1], the classification taxonomy from the patent office [11], derived data from the patent text and text summarisations [4], informative data from patent citations and academic citations [6], patent domains [5], relationship with other patents [2], and a combination of methods for forecasting technologies [7]. Figure 1 shows the high level architecture of the system. A Data Extractor module is responsible for extraction-transformation-loading (ETL) of issued and published patents from the USPTO Bulk Data store. ...