Linguistic Rules Based Chinese Error Detection for Second Language Learning

Choy, D. et al. (Eds.) (2013). Work-in-Progress Poster Proceedings of the 21st International Conference on
Computers in Education. Indonesia: Asia-Pacific Society for Computers in Education
Linguistic Rules Based Chinese Error Detection for Second Language Learning
Lung-Hao LEEa,c, Li-Ping CHANGb, Kuei-Ching LEEa,c, Yuen-Hsien TSENGa*, and Hsin-Hsi CHENc
aInformation Technology Center, National Taiwan Normal University, Taiwan
bMandarin Training Center, National Taiwan Normal University, Taiwan
cDept. of Computer Science and Information Engineering, National Taiwan University, Taiwan
*samtseng@ntnu.edu.tw
Abstract: In this paper, we handcraft a set of linguistic rules with syntactic information to
detect errors in Chinese sentences written by second language learners (SLL). Experimental results
lead to conclusions similar to those reported for the well-known ALEK system used by ETS for English
learning. The resulting Chinese sentence error detection system should be helpful for Chinese self-learners.
Keywords: Computer-aided language learning, second language learning, computer
education
1. Introduction
Second Language Learners (SLL) often write ungrammatical sentences containing various types of errors,
and they tend to make mistakes when writing Chinese sentences in the early stages of learning Chinese. For
example, a learner may want to express “這對夫妻很恩愛” (The couple is very affectionate to each
other), but the word 恩愛 (affectionate) is mistakenly written as the similar word 恩情 (kind), as
observed in learner corpora. Error detection systems that indicate the different kinds of errors
in a given sentence are therefore valuable to SLL for self-learning.
The Assessing LExical Knowledge (ALEK) system (Chodorow and Leacock, 2000) adopted
statistical analysis to detect errors in English sentences. Using 20 target words from the Test of
English as a Foreign Language (TOEFL), it performed with about 80% precision and 20% recall. Izumi
et al. (2003) detected English grammatical and lexical errors made by Japanese learners. More recently,
relative position and parse template language models were proposed to detect various types of errors in
Chinese written by US learners (Wu et al., 2010). Unlike most previous studies, which have
focused on corpus-based statistical methods, we develop a rule-based system to detect the
common errors in Chinese sentences written by SLL.
In this work, we manually construct a set of linguistic rules with syntactic information to detect
erroneous sentences of the kinds frequently written by SLL. If a sentence satisfies at least one
syntactic rule, the system regards the input sentence as erroneous and responds with
suggestions that indicate the possible errors.
2. Linguistic Rules Based Chinese Error Detection
Chinese is written without word boundaries. As a result, prior to most Natural
Language Processing (NLP) tasks, texts must undergo automatic word segmentation. Automatic
Chinese word segmenters are generally built from an input lexicon and probability models. However, they
usually suffer from the unknown word (i.e., out-of-vocabulary, or OOV) problem. In this study, the
corpus-based learning method for merging unknown words described in Chen and Ma (2002) is
adopted to tackle the OOV problem. This is followed by a reliable and cost-effective POS-tagging
method, similar to the approach proposed by Tsai and Chen (2004), which labels the segmented words
with their parts of speech. Take the Chinese sentence “歐巴馬是美國總統” (Obama is the president of the USA)
for instance. It is segmented and tagged as a “POS:Word” sequence as follows:
Nb:歐巴馬 SHI:是 Nc:美國 Na:總統. Among these words, the translation of the foreign proper name
歐巴馬 (Obama) is not likely to be included in a lexicon and is therefore extracted by the unknown word
detection method. The special POS tag “SHI” represents the be-verb “是”. The
complete set of part-of-speech tags is defined in the technical report by CKIP (1993).
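To make the “POS:Word” representation concrete, the following minimal sketch splits a tagged sequence into (tag, word) pairs. It assumes that items are separated by whitespace and that the first colon separates the tag from the word; the function name `parse_tagged` is hypothetical and is not part of the system described here. Segmentation and tagging themselves are not reproduced.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (POS tag, word)


def parse_tagged(sequence: str) -> List[Token]:
    """Split a whitespace-separated "POS:Word" sequence into (tag, word) pairs."""
    tokens = []
    for item in sequence.split():
        # Split on the first colon only, so the word part is kept whole.
        tag, _, word = item.partition(":")
        tokens.append((tag, word))
    return tokens


# The example sentence from the text, with the be-verb under the SHI tag.
tokens = parse_tagged("Nb:歐巴馬 SHI:是 Nc:美國 Na:總統")
sentence = "".join(word for _, word in tokens)
```

Joining the word parts recovers the unsegmented sentence, while the tag parts are what the syntactic rules inspect.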
To represent the syntactic rules so that they can be applied easily to detect errors in Chinese
sentences written by SLL, several symbols are defined. Some of them are explained as follows: 1) The
symbol “*” is a wildcard. For example, all the subordinate tags of “Nh”, i.e., “Nhaa,” “Nhab,”
“Nhac,” “Nhb,” and “Nhc,” can be denoted as “Nh*”. 2) The symbol “-” means an exclusion from the
previous representation. For example, the expression “N*-Nab-Nbc” denotes that the
corresponding word should be any noun (N*) excluding countable entity nouns (Nab) and surnames
(Nbc). 3) The symbol “/” means alternative (the “or” situation). The expression “一些/這些/那些”
(some/these/those) means that any one of these three words satisfies the rule. 4) The rule mx{W1 W2}
denotes that the two words W1 and W2 should not co-exist (they should be mutually exclusive). 5) The symbol
“<” denotes the followed-by condition. For instance, the expression “Nhb < Nep” means that the POS tag
“Nep” follows the tag “Nhb”, which may occur several words ahead of “Nep”.
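The tag-level symbols above (wildcard, exclusion, alternative, and the followed-by condition) can be illustrated with a short sketch. It is only one possible reading of the notation: it assumes that “-” binds more tightly than “/”, so that “N*-Nb*-Nc*/A” means “(N* except Nb* and Nc*) or A”, and it relies on Python's standard `fnmatch` module for the wildcard. The names `tag_matches` and `follows` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if `tag` satisfies `pattern`, e.g. "N*-Nab-Nbc" or "Dc/Dd"."""
    for alternative in pattern.split("/"):           # "/" separates alternatives
        include, *excludes = alternative.split("-")  # "-" introduces exclusions
        included = fnmatchcase(tag, include)         # "*" acts as a wildcard
        excluded = any(fnmatchcase(tag, ex) for ex in excludes)
        if included and not excluded:
            return True
    return False


def follows(tags: List[str], first: str, second: str) -> bool:
    """Return True if a tag matching `second` occurs anywhere after one matching `first`."""
    for i, tag in enumerate(tags):
        if tag_matches(first, tag):
            # The earliest match of `first` leaves the longest remainder to search.
            return any(tag_matches(second, later) for later in tags[i + 1:])
    return False
```

Under these assumptions, `tag_matches("N*-Nab-Nbc", "Nhab")` holds while `tag_matches("N*-Nab-Nbc", "Nab")` does not, and `follows(["Nhb", "VC", "Nep"], "Nhb", "Nep")` holds because “Nep” appears two positions after “Nhb”.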
Using rule symbols such as those above, we manually construct syntactic rules to cover frequent
errors in Chinese sentences written by SLL. We adopt the “Analysis of 900 Common
Erroneous Samples of Chinese Sentences” (Cheng, 1997) as the development set for handcrafting the
linguistic rules with syntactic information. Based on these samples, compiled by Chinese teachers in
Beijing, we constructed 60 syntactic rules to detect the errors in the samples. Table 1 shows some rules
along with their example sentences. If an input sentence satisfies any syntactic rule, our
system reports the input as an erroneous sentence. This can be helpful to SLL for
self-learning of Chinese.
Table 1: Some developed syntactic rules and their detected erroneous sentences.

Rule: Dfa N*-Nb*-Nc*/A/VA*/VB*/VC*/VD*/VE*/VF*/V_12
Example: Nhaa:她 VK1:覺得 Nhab:自己 DE:的 Nab:丈夫 Dfa:很 Nhab:私人
(She feels that her husband is very private)
Notes: “私人” (private) is an improper word in this sentence. The correct word should be “自私”
(selfish). So the correct sentence is “她覺得自己的丈夫很自私”.

Rule: mx{Dbab:可以 Dbab:能}
Example: VH11:舊 DE:的 Nab:雜誌 Dbab:可以 Dbab:能 VD:借 Td:嗎
(Can old magazines be borrowed?)
Notes: “能” (able) is a redundant word in this sentence. This word cannot be collocated with
another word “可以” (can). The correct sentence is “舊的雜誌能借嗎”.

Rule: *:/*:< Da*/Db*:--/Dc/Dd
Example: Nab:自行車 P02:被 Nb:丁力 Dc:沒 VC:騎走
(The bicycle is not ridden by Dingli)
Notes: The word “沒” (not) is put in a wrong position. This sentence contains a word ordering
error. The correct sentence is “自行車沒被丁力騎走”.
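As a rough illustration of how the second rule in Table 1 could be applied, the sketch below treats each rule as a function over (tag, word) pairs and flags a sentence when any rule fires. The names `mx`, `RULES`, and `is_erroneous` are hypothetical. The second word of the rule is taken to be 能, as the accompanying notes indicate, and the tags of the corrected sentence are assumed to be the same as in the erroneous one. The paper does not describe the actual implementation.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]               # (POS tag, word)
Rule = Callable[[List[Token]], bool]  # returns True when the rule fires


def mx(first: Token, second: Token) -> Rule:
    """Build a mutual-exclusion rule: it fires when both tagged words co-occur."""
    def rule(tokens: List[Token]) -> bool:
        return first in tokens and second in tokens
    return rule


# Only one of the handcrafted rules is reproduced here.
RULES: List[Rule] = [mx(("Dbab", "可以"), ("Dbab", "能"))]


def is_erroneous(tokens: List[Token], rules: List[Rule] = RULES) -> bool:
    """A sentence is reported as erroneous if it satisfies at least one rule."""
    return any(rule(tokens) for rule in rules)


erroneous = [("VH11", "舊"), ("DE", "的"), ("Nab", "雜誌"), ("Dbab", "可以"),
             ("Dbab", "能"), ("VD", "借"), ("Td", "嗎")]
# Assumed tagging of the corrected sentence 舊的雜誌能借嗎.
corrected = [("VH11", "舊"), ("DE", "的"), ("Nab", "雜誌"),
             ("Dbab", "能"), ("VD", "借"), ("Td", "嗎")]
```

With these inputs, `is_erroneous(erroneous)` is true and `is_erroneous(corrected)` is false. A sentence that matches no rule is simply passed as correct, which is consistent with a rule set that favours precision over recall.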
3. Experiments and Performance Evaluation
The test data come from a set of real erroneous sentences written by learners of Chinese as a second
language at National Cheng Kung University in Taiwan (Wu et al., 2010). Each erroneous sentence
(positive instance) is paired with a correct one (negative instance) in this test set. In total, there
are 1,866 pairs of sentences, collected around 2009.
Table 2 shows the confusion matrix of our approach. The results indicate that our linguistic
rules for error detection achieve an accuracy of 58.47% = (418+1764)/(1866+1866), while maintaining
a promising precision of 80.38% = 418/(418+102) and a recall of 22.40% = 418/(418+1448). This
performance level is similar to that of the ALEK system used by Educational Testing Service (ETS)
for erroneous English sentence detection (Chodorow and Leacock, 2000). In addition, maintaining a low
false-alarm rate (the ratio of correct sentences that are detected as erroneous) is important
for a system to be practical. In the experiments, our approach achieved a false-alarm rate of 5.47%
(among 1,866 correct sentences, 102 were detected as erroneous). This shows that our approach can
detect errors without causing much trouble to users.
Table 2: Confusion matrix using our linguistic rule based detection.

                            Gold Standard
                            Positive    Negative
System Result   Positive    418         102
                Negative    1448        1764
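The reported figures can be recomputed from the four counts in the confusion matrix. The short sketch below does so; the variable names are illustrative, and the row and column interpretation (rows as system output, columns as gold standard) follows from the formulas given in the text.

```python
# Counts from Table 2 (rows: system output, columns: gold standard).
tp, fp = 418, 102    # sentences flagged as erroneous: truly erroneous / actually correct
fn, tn = 1448, 1764  # sentences passed as correct: actually erroneous / truly correct

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
false_alarm = fp / (fp + tn)  # share of correct sentences flagged as erroneous

print(f"accuracy    {accuracy:.2%}")     # 58.47%
print(f"precision   {precision:.2%}")    # 80.38%
print(f"recall      {recall:.2%}")       # 22.40%
print(f"false alarm {false_alarm:.2%}")  # 5.47%
```

Because the test set is balanced (1,866 erroneous and 1,866 correct sentences), the accuracy is the mean of the recall and the true-negative rate, which explains why it stays near 58% despite the high precision.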
4. Conclusions and Future Work
This paper proposes a linguistic rule based approach to Chinese error detection. The syntactic rules,
handcrafted from a smaller development set, achieve promising performance on an entirely different
and larger test set, while maintaining a favorably low false-alarm rate. The major contributions of this
work include: 1) showing the usefulness of the common error samples manually analyzed and collected by
Chinese teachers in previous work; 2) demonstrating the feasibility of linguistic rules handcrafted from
these samples; and 3) developing a system to help SLL self-learn Chinese.
This work is our first exploration of automatically detecting erroneous Chinese sentences. The
results can be extended to automatic essay evaluation, which is especially useful for Massive
Open Online Courses (MOOCs), because manually evaluating a large volume of Chinese writing
homework and exams is very challenging.
Acknowledgements
This research was partially supported by the National Science Council (NSC), Taiwan, under grant
NSC102-2221-E-002-103-MY3, and by the “Aim for the Top University Project” of National Taiwan
Normal University (NTNU), sponsored by the Ministry of Education, Taiwan. We are also grateful for
the support of the International Research-Intensive Center of Excellence Program of NTNU and NSC,
Taiwan, under grant NSC102-2911-I-003-301.
References
Chen, K.-J., & Ma, W.-Y. (2002). Unknown word extraction for Chinese documents. Proceedings of COLING’02
(pp. 169-175). Taipei, Taiwan: ACL Press.
Cheng, M. (1997). Analysis of 900 Common Erroneous Samples of Chinese Sentences - for Chinese Learners
from English Speaking Countries (in Chinese). Beijing, CN: Sinolingua.
Chinese Knowledge Information Processing (CKIP) Group. (1993). Categorical analysis of Chinese. ACLCLP
Technical Report # 93-05, Academia Sinica. Available online at:
http://rocling.iis.sinica.edu.tw/CKIP/tr/9305_2013%20revision.pdf
Chodorow, M., & Leacock, C. (2000). An unsupervised method for detecting grammatical errors. Proceedings of
NAACL’00 (pp. 140-147). Seattle, Washington: ACL Press.
Izumi, E., Uchimoto, K., Saiga, T., Supnithi, T., & Isahara, H. (2003). Automatic error detection in the Japanese
learner’s English spoken data. Proceedings of ACL’03 (pp. 145-148). Sapporo, Japan: ACL Press.
Tsai, Y.-F., & Chen, K.-J. (2004). Reliable and cost-effective POS-tagging. International Journal of
Computational Linguistics and Chinese Language Processing, 9(1), 83-96.
Wu, C.-H., Liu, C.-H., Harris, M., & Yu, L.-C. (2010). Sentence correction incorporating relative position and
parse template language model. IEEE Trans. on Audio, Speech, and Language Processing, 18(6),
1170-1181.