Huyen Thi Minh Nguyen

  • Vietnam National University, Hanoi

About

48
Publications
13,076
Reads
616
Citations

Publications (48)
Conference Paper
Full-text available
This paper presents two neural models for multilingual grammatical error detection and their results in the MultiGED-2023 shared task. The first model uses a simple, purely supervised character-based approach. The second model uses a large language model which is pretrained on 100 different languages and fine-tuned on the provided datasets of the s...
Article
Full-text available
This paper presents VieCap4H, a grand data challenge on automatic image caption generation for the healthcare domain in Vietnamese. VieCap4H is held as part of the eighth annual workshop on Vietnamese Language and Speech Processing (VLSP 2021). The task is considered as an image captioning task. Given a static image, mostly about healthcare-related...
Preprint
Full-text available
This paper presents VieCap4H, a grand data challenge on automatic image caption generation for the healthcare domain in Vietnamese. VieCap4H is held as part of the eighth annual workshop on Vietnamese Language and Speech Processing (VLSP 2021). The task is considered as an image captioning task. Given a static image, mostly about healthcare-related...
Preprint
Full-text available
This paper reports on the ReINTEL Shared Task for Responsible Information Identification on social network sites, which is hosted at the seventh annual workshop on Vietnamese Language and Speech Processing (VLSP 2020). Given a piece of news with respective textual, visual content and metadata, participants are required to classify whether the news...
Preprint
Full-text available
The paper describes the organisation of the "HateSpeech Detection" (HSD) task at the VLSP workshop 2019 on detecting the fine-grained presence of hate speech in Vietnamese textual items (i.e., messages) extracted from Facebook, which is the most popular social network site (SNS) in Vietnam. The task is organised as a multi-class classification task...
Chapter
Lexical resources play an essential role in text processing. In this paper, we present our work on the construction of a Vietnamese medical terminology integratable into the UMLS multilingual metathesaurus (Unified Medical Language System). The construction of the Vietnamese medical terminology is done by collecting terms from existing lexical sour...
Conference Paper
Bilingual terminologies are important resources for natural language processing as well as for human use. The automatic acquisition of bilingual terminologies is mostly based on bilingual corpora. However, monolingual corpora could also be a good source for extracting bilingual terms. In fact, as English is used for international publications, we o...
Chapter
This paper presents a new approach for text detection using sparse representation over learned dictionaries. More specifically, the K-SVD algorithm is used for constructing two dictionaries, one for the background and one for the text. Then, text detection is done by comparing the reconstruction errors of each image patch over the two dictionaries. Re...
Article
Full-text available
Clinical texts contain textual data recorded by doctors during medical examinations. Sentences in clinical texts are generally short, narrative, not strictly adhering to Vietnamese grammar and contain many medical terms which are not present in general dictionaries. In this paper, we investigate the tasks of lexical analysis and phrase chunking for...
Article
Sentence-aligned bilingual corpora are an important language resource used in many natural language processing applications, such as comparative linguistics research, cross-lingual information retrieval, and bilingual dictionary construction. In machine translation in particular, the quality and size of the bilingual corpus play a decisive role...
Article
Full-text available
Recently, deep learning methods have achieved good results in dependency parsing for many natural languages. In this paper, we investigate the use of bidirectional long short-term memory network models for both transition-based and graph-based dependency parsing for the Vietnamese language. We also report our contribution in building a Vietnamese d...
Article
Full-text available
A machine translation (MT) application from Vietnamese to the K’Ho ethnic minority language is presented. The application aims to introduce the statistical machine translation (STMT) method. Because Vietnamese and K’Ho both belong to the Austroasiatic language family but to different language groups, the transfer step is usually...
Preprint
In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language sentences and its application for the Vietnamese language. We present our effort in building Vietnamese PropBank, the first Vietnamese SRL corpus and a software system for labelling semantic roles of Vietnamese texts. In particular, we present a...
Article
Full-text available
In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language sentences and its application for the Vietnamese language. We present our effort in building Vietnamese PropBank, the first Vietnamese SRL corpus and a software system for labelling semantic roles of Vietnamese texts. In particular, we present a n...
Conference Paper
Lexicon is an important resource in natural language processing (NLP), as it provides NLP systems with lexical information at different levels, from morphology to semantics. For Vietnamese, lexical resources are available for several basic tools such as word segmentation, part-of-speech tagging and syntactic parsing. In this paper, we discuss the c...
Article
Full-text available
In natural language processing, part-of-speech (POS) tagging plays an important role, serving as the input or output of many other tasks (syntactic parsing, semantic analysis...). One of the issues related to POS tagging is determining the POS tagset. This can be addressed with unsupervised machine learning methods. This paper...
Chapter
In this work, we propose to use distributed word representations in a greedy, transition-based dependency parsing framework. Instead of using a very large number of sparse indicator features, the multinomial logistic regression classifier employed by the parser learns and uses a small number of dense features, therefore it can work very fast. The d...
Article
Full-text available
This paper presents the development of a grammar and a syntactic parser for the Vietnamese language. We first discuss the construction of a lexicalized tree-adjoining grammar using an automatic extraction approach. We then present the construction and evaluation of a deep syntactic parser based on the extracted grammar. This is a complete system th...
Conference Paper
Full-text available
In this work, we propose to use distributed word representations in a greedy, transition-based dependency parsing framework. Instead of using a very large number of sparse indicator features, the multinomial logistic regression classifier employed by the parser learns and uses a small number of dense features, therefore it can work very fast. The d...
Conference Paper
Full-text available
We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and the maximal-matching strategy, which is augmented by statistical methods to resolve ambiguities of segmentation. The Vietnamese lexicon in use is compactly represented by a...
Conference Paper
Full-text available
The problem of Vietnamese syntactic parsing, especially constituency parsing, has recently been tackled by several research groups. A common effort of the Vietnamese language processing community has allowed the creation of VietTreebank, a reference parsed corpus containing about 10,000 sentences for the constituency parsing task. In this paper, we...
Conference Paper
Full-text available
This paper presents a method for automatically inducing the parts-of-speech of the Vietnamese language from a large text corpus. We first build a class-based bigram language model using several statistical algorithms assigning words to classes based on their ability to combine with neighbouring words. We then show that this model is able to extract...
Article
The Sketch Engine is a corpus query system based on grammatical relations of a language. This system has been widely used in lexicography, particularly for building dictionaries of different languages such as English, Japanese, Chinese etc. This paper presents an approach to applying the Sketch Engine to Vietnamese in which a method for building co...
Conference Paper
Full-text available
This paper presents an empirical study on the application of the maximum entropy approach for part-of-speech tagging of Vietnamese text, a language with special characteristics which largely distinguish it from occidental languages. Our best tagger explores and includes useful knowledge sources for tagging Vietnamese text and gives a 93.40% overall...
Article
Full-text available
In this paper, we present a system that automatically extracts lexicalized tree adjoining grammars (LTAG) from treebanks. We first discuss in detail extraction algorithms and compare them to previous works. We then report the first LTAG extraction result for Vietnamese, using a recently released Vietnamese treebank. The implementation of an open so...
Conference Paper
Full-text available
We present for the first time a computational model for the reduplication of the Vietnamese language. Reduplication is a popular phenomenon of Vietnamese in which reduplicative words are created by the combination of multiple syllables whose phonics are similar. We first give a systematic study of Vietnamese reduplicative words, bringing into foc...
Article
Full-text available
A treebank is an important resource for both research and application of natural language processing. For Vietnamese, we still lack this kind of corpus. This paper presents up-to-date results of a project for Vietnamese treebank construction. Since Vietnamese is an isolating language and has no word delimiter, there are many ambiguities in sentence...
Conference Paper
Full-text available
We present in this paper an initial investigation into the use of a metagrammar for explicitly sharing abstract grammatical specifications for the Vietnamese language. We first introduce the essential syntactic mechanisms of the Vietnamese language. We then show that the basic subcategorization frames of Vietnamese can be compactly represented by c...
Article
Full-text available
We present in this paper a comparison between three segmentation systems for the Vietnamese language. Indeed, the majority of Vietnamese words is built by semantic composition from about 7,000 syllables, that also have a meaning as isolated words. So the identification of word boundaries in a text is not a simple task, and ambiguities often appear....
Article
Full-text available
Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing (NLP). As there does not exist any published work in formal linguistics nor any recognizable standard for Vietnamese word definition and word categories, the fundamental tasks for automatic Vietnamese language processing, such as part-of...
Article
Full-text available
The work presented in this document deals with the constitution of linguistic resources and tools for the fundamental tasks of automatic processing of the Vietnamese language, both in monolingual and multilingual contexts. We present possible solutions to the problems of morpho-syntactic annotation (definition of “standardized” lexical descriptors,...
Thesis
Full text accessible only to members of the Université de Lorraine
Conference Paper
Full-text available
This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The ARCADE II project aims at exploring the techniques of multilingual text alignment through a fine evaluation of the existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted...
Conference Paper
Full-text available
In this paper, we present the first sizable grammar built for Vietnamese using LTAG, developed over the past two years, named vnLTAG. This grammar aims at modelling written language and is general enough to be both application- and domain-independent. It can be used for the morpho-syntactic tagging and syntactic parsing of Vietnamese texts, as well...
Article
Full-text available
This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The ARCADE II project aims at exploring the techniques of multilingual text alignment through a fine evaluation of the existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted...
Article
Full-text available
The automatic alignment of parallel corpora is a very rich source of information for automatic translation, multilingual document indexing, information retrieval, etc. The rapid growth of the use of "minority" languages in online documents makes it necessary to develop methods that can easily adapt to any language. We present an evolution over pr...
Article
Full-text available
Vietnamese is spoken by about 80 million people around the world, yet very few concrete works on this language have been noticed in Natural Language Processing (NLP) until now. The fundamental problems in automatic analysis of Vietnamese, such as part-of-speech (POS) tagging, parsing, etc. are extremely difficult due to the lack of formal linguist...
Article
Full-text available
Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing. As there does not exist any published work in formal linguistics or any recognizable standard for Vietnamese word categories, the fundamental works in Vietnamese text analysis such as part-of-speech tagging, parsing, etc. are very dif...
Article
Full-text available
In this article, we discuss the construction of tagsets for the morpho-syntactic analysis of Vietnamese, taking into account the linguistic specificities of this language. This construction is inspired by the MULTEXT(*) model, with the aim of moving toward multilingual applications as well as the reusability of the tagsets...
