NE-LP: Normalized Entropy and Loss Prediction based
Sampling for Active Learning in Chinese Word Segmentation
on EHRs
Tingting Cai · Zhiyuan Ma · Hong Zheng · Yangming Zhou*
Tingting Cai, Hong Zheng, Yangming Zhou
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
E-mail: y30190775@mail.ecust.edu.cn (T. Cai), zhenghong@ecust.edu.cn (H. Zheng), ymzhou@ecust.edu.cn (Y. Zhou)

Zhiyuan Ma
Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai 200093, China
E-mail: yuliar3514@usst.edu.cn

Corresponding author: Yangming Zhou (ymzhou@ecust.edu.cn)

Abstract Electronic Health Records (EHRs) in hospital information systems contain patients' diagnoses and treatments, so EHRs are essential to clinical data mining. Among the tasks in the mining process, Chinese Word Segmentation (CWS) is a fundamental and important one, and most state-of-the-art methods rely heavily on large-scale manually annotated data. Since annotation is time-consuming and expensive, efforts have been devoted to techniques such as active learning to locate the most informative samples for modeling. In this paper, we follow this trend and present an active learning method for CWS in EHRs. Specifically, a new sampling strategy combining Normalized Entropy with Loss Prediction (NE-LP) is proposed to select the most valuable data. Meanwhile, to minimize the computational cost of learning, we propose a joint model including a word segmenter and a loss prediction model. Furthermore, to capture interactions between adjacent characters, bigram features are also applied in the joint model. To illustrate the effectiveness of NE-LP, we conducted experiments on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine. The results demonstrate that NE-LP consistently outperforms conventional uncertainty-based sampling strategies for active learning in CWS.

Keywords Active learning · Chinese word segmentation · Deep learning · Electronic health records
1 Introduction
Electronic Health Records (EHRs) systematically col-
lect patients’ clinical information, such as health pro-
files, examination results, and treatment plans [11]. By analyzing EHRs, a great deal of useful patient-related information can be discovered [42]. In recent years, research on EHRs has attracted increasing attention, and most of it is based on English EHRs [21, 28]. English EHRs have several inherent advantages for further clinical research. Firstly, English words are
delimited by white spaces, thus saving the efforts of
word segmentation. Secondly, many commonly-used En-
glish terminology databases, such as Unified Medical
Language System (UMLS) [3] and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT, http://www.snomed.org),
can be employed as professional lexicons for NLP down-
stream tasks. Lexicons of other languages generally can-
not match the UMLS and SNOMED-CT in terms of
comprehensiveness [44].
Since Chinese EHRs are recorded without explicit word delimiters, such as "患者被诊断为糖尿病酮症酸中毒。" (The patient was diagnosed with diabetic ketoacidosis.), Chinese Word Segmentation (CWS) is
ketoacidosis.), Chinese Word Segmentation (CWS) is
a prerequisite for pre-processing Chinese EHRs. It is a
fundamental and crucial step for many clinical NLP sys-
tems. Currently, state-of-the-art CWS methods usually
require large amounts of manually-labeled data to reach
their full potential. However, compared to the general
domain, CWS in the medical domain is more difficult.
On one hand, EHRs involve many medical terminolo-
gies, such as “高血压性心脏病” (Hypertensive Heart
Disease) and “罗氏芬” (Rocephin), so only annotators
with medical backgrounds are qualified to label EHRs.
On the other hand, EHRs may involve personal priva-
cies of patients. Therefore, they cannot be released on
large scales for labeling. The above two reasons lead to
the high annotation cost and insufficient training cor-
pus for CWS in medical texts.
CWS is usually formulated as a sequence labeling task [20], which can be solved by supervised learning approaches, such as Hidden Markov Models (HMMs) [9] and Conditional Random Fields (CRFs) [18]. However, these methods rely heavily on handcrafted features. To reduce the effort of feature engineering, neural network based methods have begun to thrive [5, 10, 22]. However, due to insufficient annotated training data, conventional CWS models trained on open corpora often suffer from significant performance degradation when transferred to specific domains, and the medical domain in particular has rarely been explored.
One solution to this obstacle is to use active learning, where only a small number of samples are selected and labeled in an active manner. Active learning methods are favored by researchers in many Natural Language Processing (NLP) tasks, such as text classification [40] and Named Entity Recognition (NER) [4]. However, only a handful of works have been conducted on CWS [20], and few focus on the medical domain.
Given the aforementioned challenges and the state of current research, we propose a word segmentation method based on active learning. To select the most informative data, we introduce a sampling strategy called NE-LP, which combines Normalized Entropy (NE) with Loss Prediction (LP). Specifically, we leverage the normalized entropy of class posterior probabilities from a Bi-directional Long Short-Term Memory and Conditional Random Field (BiLSTM-CRF) based word segmenter to define uncertainty. Then, we attach a "loss prediction model" based on self-attention [33] to the word segmenter, which aims to predict the loss of the input data. The final decision on which samples to label is made by computing a weighted sum of the normalized token entropy and the predicted loss. Besides, to capture coherence over characters, we additionally add n-gram features to the input of the joint model, and experimental results show that for specific texts, such as our medical texts, bigram features perform best.
To sum up, the main contributions of our work are
summarized as follows:
– We propose a novel word segmentation method incorporating active learning and hybrid features. The former lightens the burden of labeling large amounts of data, and the latter combines bigram features with character embeddings to better represent the coherence between adjacent characters.
– To improve the performance of active learning, we propose a simple yet effective sampling strategy called NE-LP, which is based on a joint model including a word segmenter and a loss prediction model. Instead of solely relying on the uncertainty of boundary classification to choose the most informative samples for labeling, our method uses normalized token entropy to estimate uncertainty from the outputs of the word segmenter at the statistical level; moreover, we also employ self-attention as a loss prediction model to simulate human understanding of words at the deep learning level.
– Instead of evaluating the performance on simulated data, we use cardiovascular disease data collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine to illustrate the improvements of the proposed method. Experimental results show that NE-LP is superior to mainstream uncertainty-based sampling strategies in terms of F1-score.
The rest of this paper is organized as follows. Section 2 briefly reviews related work on CWS and active learning. Section 3 details the proposed method for CWS, followed by experimental evaluations in Section 4. Finally, conclusions and potential research directions are summarized in Section 5.
2 Related Work
2.1 Chinese Word Segmentation
Due to the practical significance, CWS has attracted
considerable research efforts, and a great number of so-
lution methods have been proposed in the literature in
the past decades [29, 38, 47]. Generally, all the existing
approaches fall into two categories: statistical machine
learning and deep learning [20].
Statistical Machine Learning Methods. Initially,
statistical machine learning methods were widely-used
in CWS. Xue and Shen [38] employed a maximum en-
tropy tagger to automatically assign Chinese charac-
ters. Zhao et al. [46] used CRF for tag decoding and
considered both feature template selection and tag set
selection. However, these methods greatly rely on man-
ual feature engineering [27], while handcrafted features
are difficult to design, and the sizes of these features
are too large for practical use [5]. In such a case, deep
learning methods have been increasingly employed for
the ability to minimize the efforts in feature engineer-
ing.
Deep Learning Methods. Recently, researchers
tended to apply various neural networks for CWS and
achieved remarkable performance. To name a few, Zheng
et al. [47] used deep layers of neural networks to learn
feature representations of characters. Chen et al. [5]
adopted LSTM to capture the previous important in-
formation. Wang and Xu [34] proposed a Convolutional
Neural Network (CNN) to capture rich n-gram features
without any feature engineering. Yang [41] conducted
extensive experiments to analyze the effect of BERT
and demonstrated its advantages in solving the CWS
task. Gan and Zhang [10] investigated self-attention for
CWS and observed that self-attention gives highly com-
petitive results. La Su and Liu [17] presented a hybrid
word segmentation algorithm based on Bi-directional
Gated Recurrent Unit (BiGRU) and CRF to learn the
semantic features of the corpus. Ma et al. [25] found
that BiLSTM can achieve better results on many of the
popular CWS datasets as compared to models based on
more complex neural network architectures. Therefore,
in this paper, we adopt BiLSTM-CRF as our base word
segmenter due to its simple architecture, yet remarkable
performance.
Open-source CWS Tools. In recent years, more
and more open-source CWS tools are emerging, such
as Jieba and PyHanLP. These tools are widely-used
due to their convenience and strong performance on CWS in general fields. However, terminologies and uncommon words in the medical field lead to unsatisfactory segmentation results. We experimentally compare seven well-known open-source CWS tools on EHRs. As shown in Table 4, we find that since these open-source tools are trained on general-domain corpora, the results are not good enough to meet the needs of subsequent NLP tasks when applied to the medical field.
Domain-Specific CWS Methods. Currently, a
handful of domain-specific CWS approaches have been
studied, but they focused on decentralized domains.
In the metallurgical field, Shao et al. [29] proposed a
domain-specific CWS method based on BiLSTM model.
In the medical field, Xing et al. [37] proposed an adap-
tive multi-task transfer learning framework to fully lever-
age domain-invariant knowledge from high resource do-
main to the medical domain. However, transfer learning still focuses largely on corpora from the general domain; when it comes to a specific field, large amounts of manually-annotated data are still necessary. Active learning can alleviate this problem to a certain extent, since the model asks labelers to annotate only the data that is most beneficial to improving model performance [6]. However, due to the challenges of performing active learning on CWS, only a few studies have been conducted. For judgment documents, Yan et al. [39] adopted a local annotation strategy, which selects substrings around the informative characters in active learning. However, their method still stays at the statistical level. Therefore, compared to the above method, we utilize a new active learning approach for CWS in medical text, which combines normalized entropy with loss prediction to effectively reduce annotation cost.
2.2 Active Learning
Active learning [1] mainly aims to ease the data collection process by automatically deciding which instances should be labeled by annotators, thus saving the cost of
annotation [6]. In active learning, the sampling strat-
egy plays a key role. Over the past few years, the rapid
development of active learning has resulted in various
sampling strategies, such as query-by-committee [12],
distribution-based sampling [14] and uncertainty-based
sampling [19].
Query-by-committee. This method constructs a
committee comprising multiple independent models, and
selects samples by measuring disagreement among these
models. However, constructing diverse committee mem-
bers increases experimental cost and reduces efficiency
to a certain extent.
Distribution-based Sampling. This method picks
up samples representing the distribution of the unla-
beled data. Intuitively, learning over a representative
subset would be competitive over the whole pool. It can
be formulated as a discrete optimization problem [15].
However, this approach may sample too heavily from
regions where the model is already proficient.
Uncertainty-based Sampling. In recent years,
uncertainty-based method has attracted considerable
attention since it performs well and saves much time
in most cases. It is widely-used in sequence labeling
tasks [4,23]. In the uncertainty-based sampling, a naive
way to define uncertainty is to use the posterior prob-
ability of a predicted label [7], or the margin between the posterior probabilities of the most likely and the second most likely labels [2]. The entropy of the label posterior probabilities can also be used to generalize the definition of uncertainty [26].

Fig. 1 The overall architecture of the joint model, where the loss prediction model is attached to the base model.
However, in some complicated tasks, such as CWS
and NER, only considering the uncertainty of data is
obviously not enough. Therefore, we further take loss values into account and select samples from two perspectives: uncertainty and loss.
3 Joint Model Incorporated Active Learning Framework for Chinese Word Segmentation

3.1 Overview

An active learning algorithm for CWS is composed of two parts: a learning engine and a selection engine. The learning engine is essentially a word segmenter, which is mainly utilized for training. The selection engine picks unlabeled samples based on a preset sampling strategy and submits these samples for human annotation. Then, we incorporate them into the training set after annotation is finished, thus continuously improving the F1-score of the word segmenter as the training set grows. In this paper, we propose a joint model as the selection engine. Fig. 1 shows the overall architecture of the joint model, where the loss prediction model predicts the loss value of the input data. Moreover, the loss prediction model is (i) attached to the base model, and (ii) jointly learned with the base model. Here, the base model is employed as the learning engine.
Algorithm 1 demonstrates the procedure of CWS based on active learning with the NE-LP sampling strategy. First, with the initial training set, we train a joint model including a segmenter and a loss prediction model. Then, the joint model selects the n highest-ranking samples according to the NE-LP strategy, which are expected to improve the performance of the segmenter to the largest extent. Afterwards, medical experts annotate these instances manually. Finally, these annotated instances are incorporated into the training set, and we use the new training set to re-train the joint model. The above steps iterate until the desired F1-score is achieved or the number of iterations reaches a predefined threshold.
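As a rough illustration of this iterative procedure (the full pseudocode is given in Algorithm 1 below), a minimal Python-style skeleton of the loop might look as follows; train_joint_model, nelp_score and request_annotation are hypothetical placeholders for the components described in this section, not code from the paper.

    # Sketch of the NE-LP active learning loop; the three helpers are hypothetical placeholders.
    def active_learning(labeled, unlabeled, iterations=10, n_per_iter=1000):
        segmenter, loss_predictor = train_joint_model(labeled)      # learning engine
        for _ in range(iterations):
            # Score every unlabeled sentence with the NE-LP strategy (Eq. 15).
            ranked = sorted(unlabeled,
                            key=lambda s: nelp_score(s, segmenter, loss_predictor),
                            reverse=True)
            selected = ranked[:n_per_iter]
            # Medical experts annotate the selected sentences (selection engine output).
            labeled = labeled + request_annotation(selected)
            unlabeled = [s for s in unlabeled if s not in selected]
            # Re-train the joint model on the enlarged labeled set.
            segmenter, loss_predictor = train_joint_model(labeled)
        return segmenter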
Algorithm 1: NE-LP based Active Learning for Chinese Word Segmentation
Input: labeled data L, unlabeled data U, the number of iterations M, the number of samples selected per iteration n, partitioning function Split, size τ
Output: a word segmentation model f* with the smallest testing set loss l_min
begin
  Initialize: Training_τ, Testing_τ ← Split(L, τ)
  train a joint model with a word segmenter f_τ and a loss prediction model t_τ
  estimate the testing set loss l_τ on f_τ
  label U by f_τ
  for i = 1 to M do
    for Sample ∈ U do
      compute Uncertainty_Sample from the output of f and predict Loss_Sample by t
      calculate the weighted sum of Uncertainty_Sample and Loss_Sample
    end
    select the n highest-ranking samples R
    relabel R by annotators
    form a new labeled dataset Training_R ← Training_τ ∪ {R}
    form a new unlabeled dataset U_R ← U_τ \ {R}
    train a joint model with f_R and t_R
    estimate the new testing loss l_R on f_R
    compute the loss reduction δ_R ← l_R − l_τ
    if δ_R < 0 then l_min ← l_R else l_min ← l_τ end
  end
  f* ← f with the smallest testing set loss l_min
end
return f*

Fig. 2 demonstrates the detailed architecture of the joint model. First, we pre-process the EHRs at the character level, separating each character of the raw EHRs. For instance, given a sentence L = [C_0 C_1 C_2 \ldots C_{n-1} C_n], where C_i represents the i-th character, the separated form is L_s = [C_0, C_1, C_2, \ldots, C_{n-1}, C_n], and we obtain the character embeddings by converting character indexes into fixed-dimensional dense vectors. Afterwards, to capture interactions between adjacent characters, bigram embeddings are utilized to represent the coherence over characters. We construct the bigram feature for each character by concatenating it with the previous character, i.e., B = [x_0 x_1, x_1 x_2, \ldots, x_{t-1} x_t]. We employ Word2Vec [13] to train the bigram features and obtain bigram embedding vectors. Then, we concatenate the character embeddings and bigram embeddings as the input of the BiLSTM layer. Finally, the CRF layer makes positional tagging decisions over individual characters, and the self-attention layer learns to simulate the loss defined in the base model.
Fig. 2 The detailed architecture of the joint model, where BiLSTM-CRF is employed as the word segmenter and BiLSTM-Self-Attention as the loss prediction model. The loss prediction model shares the BiLSTM layer parameters with the word segmenter to learn better feature representations for loss prediction.
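To make the input construction above concrete, the sketch below builds per-character bigram strings (pairing each character with its predecessor; padding the first position is our assumption) and trains 128-dimensional bigram vectors. The gensim Word2Vec call is only an assumed tool choice for illustration; the paper does not specify an implementation.

    from gensim.models import Word2Vec   # assumed tooling (gensim >= 4.0 API)

    def char_bigrams(sentence, pad="<S>"):
        # Pair each character with its previous character, e.g. "糖尿病" -> ["<S>糖", "糖尿", "尿病"].
        chars = list(sentence)
        return [(chars[i - 1] if i > 0 else pad) + chars[i] for i in range(len(chars))]

    corpus = ["患者被诊断为糖尿病酮症酸中毒。", "淋巴结未及明显肿大。"]
    bigram_corpus = [char_bigrams(s) for s in corpus]

    # 128-dimensional bigram embeddings, matching d_big in Table 3.
    w2v = Word2Vec(sentences=bigram_corpus, vector_size=128, window=5, min_count=1, sg=1)
    print(w2v.wv["糖尿"].shape)   # (128,); concatenated with the character embedding as BiLSTM input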
3.2 BiLSTM-CRF based Word Segmenter
CWS can be formalized as a sequence labeling problem
with character position tags, which are (‘B’, ‘M’, ‘E’,
‘S’), so we convert the labeled data into the ‘BMES’
format, in which each character in the sequence is as-
signed with a label as follows: B=beginning of a word,
M=middle of a word, E=end of a word and S=single
word. For example, the segmented Chinese sentence "淋巴结/未/及/明显/肿大/。" (No obvious enlargement of lymph nodes was found.) is labeled as 'BMESSBEBES'. In this paper, we use BiLSTM-CRF as the base model for CWS, which is widely used in sequence labeling.
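As a small worked example of the BMES scheme, the helper below converts a list of gold-segmented words into per-character tags; it reproduces the labeling of the example sentence and is only an illustrative utility, not code from the paper.

    def words_to_bmes(words):
        # Map each segmented word to its per-character position tags.
        tags = []
        for w in words:
            if len(w) == 1:
                tags.append("S")                                   # single-character word
            else:
                tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # begin / middle / end
        return "".join(tags)

    print(words_to_bmes(["淋巴结", "未", "及", "明显", "肿大", "。"]))   # -> BMESSBEBES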
3.2.1 BiLSTM Layer
LSTM is an improvement over the traditional Recurrent Neural Network (RNN), and it is widely used for sentence modeling. For example, Zhang et al. [45] utilized LSTM to model variable-length title sentences of clothing items for clothes matching. Formally, the LSTM unit performs the following operations at time step t:

f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)    (1)
i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)    (2)
o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)    (3)
c_t = c_{t-1} \odot f_t + i_t \odot \sigma_c(W_c x_t + U_c h_{t-1} + b_c)    (4)
h_t = \sigma_h(c_t) \odot o_t    (5)

where x_t, c_{t-1} and h_{t-1} are the inputs of the LSTM, W_* and U_* are weight matrices, b_* are bias vectors, and \odot denotes element-wise multiplication. c_t is the internal memory cell for dealing with the vanishing gradient, while h_t is the main output of the LSTM.

Obviously, the hidden state h_t of the current LSTM unit relies only on the previous state h_{t-1}, while ignoring the next one h_{t+1}. However, future information from the backward direction is also useful for CWS [32]. BiLSTM can capture features from both directions of a sequence, thus understanding the syntactic and semantic context from a deeper perspective than LSTM. Assuming that the hidden states of the forward and backward LSTMs are \overrightarrow{h_t} and \overleftarrow{h_t}, respectively, the context vector of BiLSTM is obtained by concatenating the two hidden vectors as h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}].

In our model, the character embedding and the bigram feature embedding are first concatenated and then fed into a basic BiLSTM layer:

m_i = [C_i; B_i]    (6)
h_i = \mathrm{BiLSTM}(\overrightarrow{h_{i-1}}, \overleftarrow{h_{i+1}}, m_i)    (7)

where C_i \in \mathbb{R}^{d_{cha}} denotes the character embedding vector, B_i \in \mathbb{R}^{d_{big}} represents the bigram feature vector, m_i \in \mathbb{R}^{d_{cha}+d_{big}} is the concatenated embedding, and h_i \in \mathbb{R}^{d_{hid}} is the hidden state of BiLSTM. d_{cha}, d_{big} and d_{hid} are hyper-parameters indicating the dimensions of the character embedding, the bigram embedding and the hidden state of BiLSTM, respectively.
3.2.2 CRF Layer
For CWS, it is necessary to consider the dependencies
of adjacent tags. For example, a B (Begin) tag should
be followed by an M (Middle) tag or an E (End) tag,
and cannot be followed by an S (Single) tag. Given the
observed sequence, CRF has a single exponential model
for the joint probability of the entire sequence of labels,
so it can solve the label bias problem effectively, which
motivates us to use CRF to model the tag sequence
jointly, not independently [35].
A is an important parameter in CRF called the transition matrix, which can be set manually or learned by the model. A_{y_i, y_{i+1}} denotes the transition probability from label y_i to y_{i+1}. y^* represents the most likely tag sequence of x, and it can be formalized as:

y^* = \arg\max_{y} p(y \mid x; A)    (8)
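Equation (8) is typically solved with Viterbi decoding over per-character emission scores and the transition matrix A. The numpy sketch below is a generic illustration over the four BMES tags, not the authors' implementation.

    import numpy as np

    def viterbi_decode(emissions, transitions):
        """Return the best tag sequence y* = argmax_y p(y | x; A).
        emissions: (seq_len, n_tags) per-character tag scores; transitions[i, j]: score of tag i -> tag j."""
        seq_len, n_tags = emissions.shape
        score = emissions[0].copy()                        # best path score ending in each tag
        backptr = np.zeros((seq_len, n_tags), dtype=int)
        for t in range(1, seq_len):
            total = score[:, None] + transitions + emissions[t][None, :]
            backptr[t] = total.argmax(axis=0)              # best previous tag for each current tag
            score = total.max(axis=0)
        best = [int(score.argmax())]
        for t in range(seq_len - 1, 0, -1):                # follow back-pointers
            best.append(int(backptr[t, best[-1]]))
        return best[::-1]

    tags = ["B", "M", "E", "S"]
    emissions = np.random.randn(6, 4)
    transitions = np.zeros((4, 4))
    transitions[tags.index("B"), tags.index("S")] = -1e4   # e.g. forbid the illegal B -> S transition
    print([tags[i] for i in viterbi_decode(emissions, transitions)])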
3.3 Self-Attention based Loss Prediction Model
To select the most appropriate sentences from a large unlabeled corpus, we attach a self-attention based loss prediction model to the base word segmenter, inspired by [43]. The word segmenter is learned by minimizing its losses. If we can predict the losses of input data, it is intuitive to choose samples with high predicted losses, which tend to be more beneficial to improving the current segmenter.
3.3.1 Self-Attention Layer
The attention mechanism [33] is a popular method for
NLP tasks in recent years [10,31], which mainly aims to
map a query to a series of key-value pairs [36]. Formally,
attention performs the following three operations:
f(Q, K_i) = \frac{Q^T K_i}{\sqrt{d_k}}    (9)

a_i = \mathrm{softmax}(f(Q, K_i)) = \frac{\exp(f(Q, K_i))}{\sum_{j=1}^{Len} \exp(f(Q, K_j))}    (10)

\mathrm{Attention}(Q, K, V) = \sum_{i=1}^{Len} a_i V_i    (11)

where d_k denotes the dimension of the keys and values, and Len is the length of the input sequence.
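The following numpy sketch illustrates Equations (9)-(11) for a single query over a sequence of keys and values; it is a generic illustration of scaled dot-product attention rather than the exact layer used in the joint model.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def attention(Q, K, V):
        # Q: (d_k,), K and V: (Len, d_k); implements Eqs. (9)-(11) for one query.
        d_k = K.shape[1]
        scores = K @ Q / np.sqrt(d_k)    # f(Q, K_i) = Q^T K_i / sqrt(d_k)
        a = softmax(scores)              # attention weights a_i
        return a @ V                     # sum_i a_i * V_i

    # Self-attention: Q, K and V all come from the same hidden states H.
    H = np.random.randn(8, 64)           # Len = 8 tokens, d_k = 64
    outputs = np.stack([attention(H[i], H, H) for i in range(len(H))])
    print(outputs.shape)                 # (8, 64)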
In the self-attention mechanism, Q, K and V take the same value, i.e., attention is computed between each token and all other tokens in the sequence. Self-attention can learn the internal structure of the sequence and is more sensitive to the differences between input and output, so we use self-attention to learn the loss of the word segmenter, and we define that a sequence with a higher self-attention score has a higher loss.

Fig. 3 The method for training the loss prediction model. Given an input, the word segmenter and the loss prediction model output a segmentation prediction and a predicted loss, respectively. Next, a segmentation loss can be computed from the segmentation prediction and the annotation. Then, the segmentation loss is regarded as a ground-truth loss for the loss prediction model, and is used to compute the loss-prediction-model loss.
3.3.2 Loss Learning
Fig. 3 shows how the loss prediction model is trained. Given the input data x, the segmentation prediction can be obtained through the word segmenter: s_{pre} = \mathrm{Seg}(x). Similarly, we can get the loss prediction through the loss prediction model: loss_{pre} = \mathrm{Loss}(x). Next, the segmentation loss can be computed as loss_{Seg} = L_{Seg}(s_{pre}, s_{true}), where s_{true} represents the true annotation of x. Then, loss_{Seg} is regarded as a ground-truth target for the loss prediction model, so we can compute the loss of the loss prediction model as loss_{Loss} = L_{Loss}(loss_{pre}, loss_{Seg}). The final loss function of the joint model is defined as:

L_{joint} = L_{Seg}(s_{pre}, s_{true}) + L_{Loss}(loss_{pre}, loss_{Seg})    (12)

When training the loss prediction model, we seek to minimize the mean squared error between the predicted loss and the segmentation loss:

L_{Loss} = \frac{1}{n} \sum_{i=1}^{n} [loss_{pre} - loss_{Seg}]^2    (13)

where n is the number of samples.
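The loss computation of Equations (12) and (13) can be sketched in PyTorch as follows. Detaching the segmentation loss before using it as a regression target is our assumption (so that the loss-prediction term does not back-propagate into the segmenter a second time); the paper does not state this detail.

    import torch

    def joint_loss(seg_loss_per_sample, predicted_loss):
        """L_joint = L_Seg + L_Loss (Eqs. 12-13).
        seg_loss_per_sample: (n,) segmentation losses (e.g. CRF negative log-likelihood per sentence).
        predicted_loss: (n,) scalar outputs of the self-attention loss prediction model."""
        l_seg = seg_loss_per_sample.mean()
        target = seg_loss_per_sample.detach()                # ground-truth loss target (detached by assumption)
        l_loss = torch.mean((predicted_loss - target) ** 2)  # MSE of Eq. (13)
        return l_seg + l_loss

    seg_losses = torch.rand(4, requires_grad=True)
    pred_losses = torch.rand(4, requires_grad=True)
    print(joint_loss(seg_losses, pred_losses))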
3.4 NE-LP Sampling Strategy
To judge whether the samples are effective and bene-
ficial to improve the model performance, we propose a
novel sampling strategy called NE-LP in active learning
by combining the normalized entropy of the segmentation prediction with the predicted loss. The former measures the uncertainty from the outputs of the word segmenter at the statistical level, and can be computed as Equation (14), while the latter takes the segmentation loss into consideration based on a loss prediction model to imitate
human understanding of words at the deep learning level.

Uncertainty(x) = \frac{\sum_{i=1}^{N} p_{Seg}(x) \log p_{Seg}(x)}{\log\frac{1}{N}\sqrt{Len}}    (14)

where p_{Seg} represents the output probability of the word segmenter, and N denotes the number of label classes. To ensure that the normalized entropy and the loss are of the same order of magnitude, we scale the normalized entropy by \frac{1}{\sqrt{Len}}, where Len is the length of the input sequence.

Table 1 Detailed information of the EHRs.
  Type                 Count  Contents
  Hospital records     957    Admission dates, history of present illness
  Medical records      992    Chief complaints, physical examination
  Ward round records   952    General, heart rates, laboratory findings
  Discharged records   967    Treatment plans, dates of discharge
For CWS, we hypothesize that if a sample has both
high uncertainty and high loss, it is probably infor-
mative to the current word segmenter, and we verify
this assumption in our experiments. Therefore, the fi-
nal sampling strategy NE-LP can be formalized as:

S_{NE\text{-}LP}(x) = \alpha \frac{\sum_{i=1}^{N} p_{Seg}(x) \log p_{Seg}(x)}{\log\frac{1}{N}\sqrt{Len}} + \beta \, Loss(x)    (15)

where \alpha and \beta are the weight coefficients of the normalized entropy and the loss prediction, respectively.
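Putting Equations (14) and (15) together, a sketch of the NE-LP score for one sentence could look as follows; it takes the per-character marginal tag distributions from the segmenter and the scalar output of the loss prediction model. Summing the class entropies over all tokens of the sentence is our reading of the formula.

    import numpy as np

    def nelp_score(marginals, predicted_loss, alpha=1.0, beta=1.0):
        """NE-LP score of one sentence (Eqs. 14-15).
        marginals: (Len, N) per-character tag probabilities p_Seg from the segmenter.
        predicted_loss: scalar output of the loss prediction model."""
        Len, N = marginals.shape
        eps = 1e-12
        # Normalized entropy: sum of p log p, divided by log(1/N) and scaled by 1/sqrt(Len).
        entropy_sum = np.sum(marginals * np.log(marginals + eps))
        normalized_entropy = entropy_sum / (np.log(1.0 / N) * np.sqrt(Len))
        return alpha * normalized_entropy + beta * predicted_loss

    p = np.random.dirichlet(np.ones(4), size=12)   # a 12-character sentence, 4 BMES classes
    print(nelp_score(p, predicted_loss=0.8))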
4 Experiments & Analysis
4.1 Datasets
We collect 204 EHRs with cardiovascular diseases from
the Shuguang Hospital Affiliated to Shanghai Univer-
sity of Traditional Chinese Medicine and each contains
27 types of records. We choose 4 different types with
a total of 3,868 records from them, which are hospi-
tal records, medical records, ward round records and
discharge records. The detailed information of EHRs is
listed in Table 1.
We divide the 3,868 records, including 27,442 sentences, into a training set, a testing set and a validation set with the ratio of 6:2:2. Then, we randomly select 4,950 sentences from the training set as the initial labeled set, and the remaining 11,525 sentences as the unlabeled set, i.e., we obtain the initial labeled set and the unlabeled set by splitting the training set with a ratio of 3:7. Statistics of the datasets are listed in Table 2.

Table 2 Statistics of the datasets.
  Dataset              Sentences  Words    Characters
  Training set         16,465     400,878  706,362
  Initial labeled set  4,950      120,699  212,598
  Unlabeled set        11,525     280,179  493,764
  Testing set          5,489      131,624  233,759
  Validation set       5,489      135,406  238,954
4.2 Baseline Sampling Strategies
We compare our proposed NE-LP with the following
baseline sampling strategies:
– Least Confidence (LC) [7]. The LC strategy selects the samples whose most likely tag sequence the model is least confident about. Despite its simplicity, this approach has proven effective in various tasks.

S_{LC}(x) = 1 - p(y^* \mid x)    (16)

where x is the instance to be predicted and y^* represents the most likely tag sequence of x.
– Maximum Token Entropy (MTE) [26]. The MTE strategy evaluates the uncertainty of a token by its entropy. The closer the marginal probability distribution is to uniform, the larger the entropy.

S_{MTE}(x) = -\sum_{i=1}^{N} p(y^* \mid x) \cdot \log p(y^* \mid x)    (17)

where N represents the number of classes.
– Minimum Token Margin (MTM) [2]. To measure informativeness, MTM considers the first and second most likely assignments and takes the difference between their probabilities.

S_{MTM}(x) = \max p(y^* \mid x) - \max{}' p(y^* \mid x)    (18)

where \max{}' denotes the second maximum probability.
– Lowest Token Probability (LTP) [23]. The LTP strategy looks for the most likely sequence assignment and hopes that each token in the sequence has a high probability. It selects the tokens whose probability under the most likely tag sequence y^* is lowest.

S_{LTP}(x) = 1 - \min_{y_i^* \in y^*} p(y_i^* \mid x)    (19)

where y_i^* is the most probable label of x at position i in the sequence.
– RAND. Selecting samples randomly from the unlabeled data at each iteration.

Table 3 Hyper-parameter settings.
  Hyper-parameter                   Setting
  Maximum sequence length           len = 200
  Character embedding dimension     d_cha = 128
  Bigram embedding dimension        d_big = 128
  Concatenated embedding dimension  d_con = 256
  BiLSTM hidden unit number         n_hid = 512
  Dropout rate                      rate = 0.2
The main goal of the evaluation is to compare the effectiveness of the different sampling strategies at picking informative samples; a sketch of how these scores can be computed is given below.
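A rough numpy illustration of the four uncertainty scores above, computed from per-character marginal tag probabilities. Approximating p(y*|x) by the product of per-token maxima and reading MTE/MTM as the maximum token entropy and minimum token margin are our interpretations for illustration, not the authors' exact implementation.

    import numpy as np

    def uncertainty_scores(marginals):
        """marginals: (Len, N) per-character tag probabilities; higher score = more uncertain."""
        eps = 1e-12
        token_max = marginals.max(axis=1)                   # best tag probability per token
        token_second = np.sort(marginals, axis=1)[:, -2]    # second-best per token
        token_entropy = -np.sum(marginals * np.log(marginals + eps), axis=1)
        return {
            "LC":  1.0 - np.prod(token_max),                # Eq. (16), p(y*|x) approximated
            "MTE": token_entropy.max(),                     # Eq. (17), max per-token entropy
            "MTM": -(token_max - token_second).min(),       # Eq. (18), smaller margin -> higher score
            "LTP": 1.0 - token_max.min(),                   # Eq. (19)
        }

    p = np.random.dirichlet(np.ones(4), size=10)
    print(uncertainty_scores(p))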
4.3 Parameter Settings
Hyper-parameter configuration may have a great impact on the performance of a neural network. The hyper-parameter configurations of our method are listed in Table 3.
We initialize the bigram embeddings via Word2Vec on the whole dataset. The dimension of the character embeddings is set to be the same as that of the bigram embeddings. Then, we concatenate the two embeddings, with a dimension of 256, as the input of the BiLSTM layer. The number of BiLSTM hidden units is twice the dimension of the concatenated embeddings. Dropout [30] is applied to the outputs of the BiLSTM layer in order to prevent our model from overfitting.
In active learning, we fix the number of iterations at 10, since no sampling strategy improves noticeably after 10 iterations. At each iteration, we select 1,000 sentences from the unlabeled data for the joint model to learn.
4.4 Experimental Results
4.4.1 Comparisons between Different Open-source
CWS Tools
We select seven widely-used and mainstream open-source CWS tools: SnowNLP (https://github.com/isnowfy/snownlp), PyHanLP (https://github.com/hankcs/pyhanlp), Jieba (https://github.com/fxsjy/jieba), THULAC (https://github.com/thunlp/THULAC-Python), PyNLPIR (https://github.com/tsroten/pynlpir), FoolNLTK (https://github.com/rockyzhengwu/FoolNLTK) and pkuseg (https://github.com/lancopku/pkuseg-python). We evaluate them on our whole dataset with a total of 27,443 sentences. Some open-source CWS tools, such as Jieba and pkuseg, allow users to train a new model from scratch with their own training data. To make the comparison fair, we use the default settings for all evaluated CWS tools without any additional corpus or personalized dictionaries.

Table 4 Experimental results of different open-source CWS tools.
  CWS tool   Precision  Recall  F1-score
  SnowNLP    59.40      56.68   58.04
  PyHanLP    65.01      70.89   67.82
  Jieba      70.36      71.48   70.91
  THULAC     68.67      77.36   72.76
  PyNLPIR    69.14      76.89   72.81
  FoolNLTK   72.85      76.98   74.86
  pkuseg     78.93      75.86   77.37
As shown in Table 4, we find that pkuseg performs
the best with the F1-score of 77.37% while SnowNLP
shows the lowest of 58.04%, and THULAC has the high-
est recall of 77.36%. However, since these open-source
tools are trained on general-domain corpora, the results are still not satisfactory when applied to specific fields such as the medical domain. Therefore, we need to
train a new segmenter on medical texts.
4.4.2 Comparisons between Different Models for CWS
To select a base word segmenter that is most suitable
for medical texts, we compare different types of mod-
els including both statistical machine learning and deep
learning. These models are trained on the whole training set for 20 epochs. The results are listed in Table 5.
All deep neural networks obtain higher F1-scores than the statistical machine learning model CRF, by margins between 2.35% and 11.84%, since neural networks can effectively learn feature representations.
We further observe that self-attention-CRF shows
a relatively low F1-score of 86.48% since only a single
self-attention layer cannot extract useful feature repre-
sentations. Thus, to capture more features, we employ
Transformer-CRF, i.e., we use the encoder part of the
model proposed by [33] as the feature extractor, which
is composed of a multi-head attention sub-layer and
a position-wise fully connected feed-forward network.
Results show that Transformer-CRF has an F1-score
of 91.25%, which is a 4.77% improvement compared to
self-attention-CRF.
Among bi-directional RNNs, BiLSTM-CRF shows
the highest F1-score of 95.89%, while BiRNN-CRF and
BiGRU-CRF achieve 95.30% and 95.71%, respectively.
BiLSTM and BiGRU are optimizations for BiRNN since
they introduce gating mechanisms to solve the problem of long-distance dependencies: BiLSTM contains three gates (forget, input and output gates), while BiGRU has two gates (reset and update gates).

Table 5 Experimental results of different models for CWS.
  Model               Precision  Recall  F1-score
  CRF                 83.39      84.88   84.13
  Self-Attention-CRF  85.74      87.23   86.48
  Transformer-CRF     90.72      91.78   91.25
  LSTM-CRF            92.76      93.46   93.11
  CNN-CRF             93.73      94.58   94.15
  BiRNN-CRF           94.90      95.71   95.30
  BiGRU-CRF           95.36      96.06   95.71
  BiLSTM-CRF          95.81      95.97   95.89
  CNN-BiLSTM-CRF      95.77      96.18   95.97
Furthermore, we notice that BiLSTM-CRF outperforms LSTM-CRF by a margin of 2.78%, which shows
that BiLSTM can understand the syntactic and seman-
tic contexts better than LSTM. Compared to CNN-
CRF, the F1-score of BiLSTM-CRF improves by 1.74%.
However, CNN is able to extract more local features,
while BiLSTM may ignore some key local contexts im-
portant for CWS when modeling the whole sentence.
Therefore, when combining BiLSTM and CNN as fea-
ture extractor, the F1-score reaches the peak of 95.97%,
which outperforms BiLSTM-CRF by a small margin of
0.08%.
Given the above experimental results, considering
the computational cost, complexity of model architec-
ture and final results, we adopt BiLSTM-CRF as our
base segmenter since the performance does not improve
greatly when incorporating CNN, but it costs more time
due to a more complex architecture.
4.4.3 Effectiveness of N-gram Features in
BiLSTM-CRF based Word Segmenter
To investigate the effectiveness of n-gram features in
BiLSTM-CRF based word segmenter, we also compare
different n-gram features on EHRs. The results are shown
in Table 6.
By using additional n-gram features in BiLSTM-
CRF based word segmenter, there is an obvious im-
provement of F1-score, where bigram features achieve
97.70% while trigram and four-gram reach 97.32% and
96.71%, respectively. Specifically, bigram, trigram and
four-gram features outperform character-only features
by margins of 1.81%, 1.43% and 0.82%, which indicates
that n-gram features can effectively capture the seman-
tic coherence between characters.
Table 6 Experimental Results with Different N-gram Fea-
tures in BiLSTM-CRF.
Model + feature Precision Recall F1-score
BiLSTM-CRF 95.81 95.97 95.89
BiLSTM-CRF + Four-gram 96.72 96.70 96.71
BiLSTM-CRF + Trigram 97.19 97.44 97.32
BiLSTM-CRF + Bigram 97.59 97.80 97.70
Table 7 Statistics of words whose characters are of different
lengths.
N-character words Training set Testing set Val set
N = 2 147,143 48,717 49,413
N = 3 30,379 10,043 10,385
N = 4 8,187 2,707 2,857
Furthermore, we explore the reason why bigram features perform better than trigram and four-gram features. We analyze the number of words consisting of 2, 3 and 4 characters in our datasets. As shown in Table 7, the reason for this phenomenon is that 2-character words appear most often in the datasets, occurring 147,143, 48,717 and 49,413 times in the training, testing and validation sets, respectively. There-
fore, in our texts, bigram features can effectively cap-
ture the likelihood of 2 characters being a legal word,
and they are most beneficial to model performance im-
provement.
Given the experimental results, we use bigrams as additional features for the BiLSTM-CRF based word segmenter.
4.4.4 Comparisons between Different Weight
Coefficients of Normalized Entropy and Loss
Prediction
To study which part has more influence on the final per-
formance, we conduct an experiment on different weight
coefficients of normalized entropy and loss prediction
with bigram features. We compare five different groups
of parameters in Equation (15).
From the learning curves in Fig. 4, it is clear that when the weight coefficients α and β are both set to 1, the results are better than the others in early iterations, and then tend to become uniform, except for the coefficients of 1 and 100.
Furthermore, we find that, when α and β are 100 and 1, i.e., we enlarge the effect of loss prediction, the F1-scores are higher than the results when α and β are 1 and 100. We believe the reason is that loss prediction is task-agnostic, as the model is learned from losses regardless of the target task, while normalized entropy is more effective for tasks like classification, which are learned by minimizing the cross-entropy between predictions and labels.

Fig. 4 Comparisons between different weight coefficients of normalized entropy and loss prediction.
When the weight coefficients α and β are both set to 1, the performance is the best, which shows that combining the two parts can make full use of their respective advantages to achieve better results; thus we choose this group of parameters for subsequent experiments.
4.4.5 Comparisons between Different Sampling
Strategies
In this experiment, we compare the baseline sampling
strategies introduced in Section 4.2 with our proposed
method NE-LP. We evaluate the performance of each strategy by its F1-score on the testing set. To prove the effectiveness of our proposed method, we conduct our experiments in two configurations: with additional bigram features and with character-only features. For each iteration, we train for 30 epochs with bigram features, which is a good trade-off between speed and performance, and for 50 epochs without bigram features to ensure model convergence.
As illustrated in Fig. 5, all sampling strategies per-
form better than the RAND baseline. From the left of
Fig. 5, we find that LC and MTE greatly outperform
MTM and LTP in early rounds while from the right
of Fig. 5, we notice that MTE and LTP work effec-
tively with the bigram features, but LC suffers from
performance drop. The reason may be that, under the influence of bigram features, LC is not accurate enough to localize the best tokens to label. In the last few itera-
tions, LTP achieves great performance, which indicates
that LTP is effective on selecting samples that are not
sufficiently learned with more training data.
Furthermore, we observe that the F1-scores improve
greatly when adding bigram features, which again ver-
ifies the effectiveness of bigram features.
Regardless of whether to add bigram features, our
approach NE-LP shows the best performance for all
active learning cycles. The performance gaps between
our method NE-LP and entropy-based MTE are ob-
vious since NE-LP not only captures the uncertainty
of sequences, but also takes segmentation losses into
consideration.
4.4.6 Comparisons between Different Sampling
Strategies with Different Sizes of Initial labeled Set
Furthermore, we also investigate the effects of different
initial labeled set sizes on the final performance. Instead
of using the ratio of 3:7, we now divide the training set
with the ratio of 1:9 to get the initial labeled set and
unlabeled set.
As depicted in Fig. 6, we find that our proposed
method NE-LP still outperforms other uncertainty-
based sampling strategies in all iterations, which shows
that our method can always select informative samples
beneficial to current model improvement regardless of
the size of initial labeled set.
The performance trends of these sampling strategies
are similar to those in Fig. 5 in most cases. NE-LP
shows the best performance, MTE achieves better F1-
scores than LC, MTM and LTP, while RAND obtains
the lowest results.
However, we observe that the performance gaps be-
tween NE-LP and MTE are less obvious than those
in Fig. 5. The reason may be that when the ratio is
1:9, losses tend to be smaller than those with the ra-
tio of 3:7. Therefore, in NE-LP, compared to loss pre-
diction, normalized entropy has a greater impact on
performance, leading to the phenomenon that the F1-
score curve of NE-LP is close to MTE when the ratio
of the initial labeled set to the unlabeled set is 1:9. However, despite the small gaps, NE-LP still outperforms MTE. Therefore, we cannot ignore the importance of loss prediction, since it also plays a role in improving the performance.
4.5 Discussion
From the above experimental results, we have verified
the effectiveness of BiLSTM-CRF with bigram features
in CWS, and our proposed sampling strategy NE-LP
in active learning on EHRs. In this subsection, we fur-
ther conduct four groups of experiments for discussion.
We first show the effectiveness of the pre-trained lan-
guage model. Then, we investigate the effect of sentence
length on model performance, as well as the in-domain adaptation ability of NE-LP. Finally, we explore the effect of the loss function in the loss prediction model.

Fig. 5 Comparisons between different sampling strategies when the ratio of the initial labeled set to the unlabeled set is 3:7.

Fig. 6 Comparisons between different sampling strategies when the ratio of the initial labeled set to the unlabeled set is 1:9.
4.5.1 Effectiveness of Pre-trained Language Model
The past two years have witnessed significant improve-
ments brought by highly-performant pre-trained lan-
guage models, such as BERT [8], which achieves state
of the art in sequence labeling tasks. It can learn deep
bidirectional contextual representations by jointly con-
ditioning on both left and right contexts in all lay-
ers. To investigate the performance of pre-trained lan-
guage model on CWS, we compare BiLSTM adding ad-
ditional bigram features with BERT. We fine-tune the Chinese BERT-base model (https://github.com/google-research/bert), which includes 12 layers, 768 hidden units and 12 attention heads. CRF is employed to make positional tagging decisions over individual characters for both models. We train them on the whole training set for 20 epochs.

Table 8 Experimental results of BiLSTM incorporating bigram features and the pre-trained language model BERT.
  Model          Precision  Recall  F1-score  Size
  BiLSTM+Bigram  97.59      97.80   97.70     5.24M
  BERT           97.89      98.00   97.94     406M

Table 8 presents the experimental results of BiLSTM incorporating bigram features and BERT. We ob-
serve that BERT achieves higher results in all met-
rics with the precision, recall and F1-score of 97.89%,
98.00% and 97.94%, outperforming BiLSTM by the mar-
gins of 0.3%, 0.2% and 0.24%, respectively. It proves the
effectiveness of bidirectional pre-training for language
representations. However, due to the complex architec-
ture, we find BERT is much heavier than BiLSTM in
model size, which means that BERT has greater costs in
computation and slower inference speed. In actual settings, especially with the limited time and resources in a hospital, such heavy models can hardly be brought into operation. Compared to heavier pre-trained models, the
classic architectures (e.g., BiLSTM with bigram fea-
tures) are more suitable for such a fundamental and
crucial task in many practical scenarios due to their
higher efficiency.
Table 9 Statistics of sentence length in different datasets.
  Dataset       Average  Maximum  Most frequent
  Training set  44.46    964      9
  Testing set   44.62    945      9
  Val set       45.48    948      9
  Whole set     44.69    964      9

Table 10 The coverage of different sentence lengths (covered sentences / total).
  Length  Training set     Testing set    Val set
  30      8,028 / 16,465   2,763 / 5,489  2,739 / 5,489
  50      11,524 / 16,465  3,874 / 5,489  3,817 / 5,489
  100     15,092 / 16,465  5,029 / 5,489  4,978 / 5,489
  150     15,932 / 16,465  5,290 / 5,489  5,292 / 5,489
  200     16,180 / 16,465  5,368 / 5,489  5,378 / 5,489
  250     16,310 / 16,465  5,425 / 5,489  5,417 / 5,489
4.5.2 The Effect of Sentence Length
To explore the effect of sentence length on model per-
formance, we experimentally compare models with dif-
ferent sentence lengths on CWS. In this experiment,
the models are trained on the whole training set for 50 epochs. Statistics of sentence length in the different
datasets are listed in Table 9, and we set the sentence
length to 30, 50, 100, 150, 200, and 250, respectively.
Fig. 7 The performance of models with different sentence
lengths.
As shown in Fig. 7, BiLSTM-CRF and BERT-CRF
show similar performance-length curves, which reach
a peak at 100-character sentences. Based on the re-
sults in Table 10, one possible reason is that the sen-
tence length of 100 can cover a majority of sentences
in datasets. Furthermore, we observe that the F1-score
of LSTM-CRF decreases when sentence length reaches
250, which indicates that long sentences are semanti-
cally challenging for LSTM-CRF. Among all models,
Table 11 Experimental Results of BiLSTM-CRF with Dif-
ferent Sampling Strategies after 10 Iterations Tested on Chi-
nese Diabetes Dataset
Sampling Strategies Precision Recall F1-score
RAND 53.96 64.74 58.86
MTM 54.22 65.01 59.13
LC 54.95 65.73 59.86
LTP 55.39 65.76 60.13
MTE 55.65 65.46 60.16
NE-LP 56.97 65.91 61.12
BiLSTM-CRF with bigram features shows the most
stable performance-length curve, which indicates that bigram feature representations can stabilize the performance across different sentence lengths.
4.5.3 In-domain Adaptation
To investigate how transferable our method is, we test
the performance of BiLSTM-CRF on an in-domain dataset
for discussion. We take our cardiovascular disease EHR
dataset as the source texts and select an article from a Chinese diabetes dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=22288) as the target medical texts.
The dataset is drawn from authoritative diabetes-related Chinese journals spanning 7 years and includes clinical research, clinical cases, etc. However, since this dataset is labeled for NER and relation extraction, we cannot directly apply it to CWS. Therefore, we randomly pick one article from the diabetes dataset to label, due to the expensive annotation cost. The named entities in the article are utilized as a lexicon, and we annotate every word in the lexicon with position tags. For the
rest of the article, we invite labelers with medical back-
grounds to annotate. The final labeled article contains
45 sentences, 1,139 words, and 2,132 characters. In this
experiment, BiLSTM-CRF is employed as the word seg-
menter, which is trained on EHR samples selected by
different sampling strategies after 10 iterations, and the
ratio of initial labeled set and unlabeled set is 3:7.
From the results in Table 11, we observe that our
model suffers from performance degradation when trans-
ferred to a new textual medical dataset. Furthermore,
we also find that recall is always higher than precision,
which indicates that the model tends to split a multi-
character word into multiple single-character words. One
possible reason for these phenomena is that the target texts come from an article in a Chinese diabetes journal, which contains many verb and adjective entities,
such as “研发针对胰升糖素信号途径的药物将为糖尿病
治疗掀开新的一页。” (The development of drugs for
glucagon signaling pathway will open a new page in the
treatment of diabetes.). However, such entities rarely
appear in EHRs. Another reason may be that the writ-
ing styles of target texts are different from the source
texts (i.e., EHRs). EHRs include patients’ examination
results, diagnoses, etc., which are recorded objectively, while journal articles generally introduce research progress on a specific disease from the authors' perspective and are written more subjectively.
Among all sampling strategies, NE-LP achieves the
highest results with the precision, recall, and F1-score
of 56.97%, 65.91%, and 61.12%, and outperforms the other sampling strategies by margins of 0.96% to 2.26% in F1-score, which demonstrates the in-domain adaptation ability of our proposed NE-LP.
4.5.4 The Effect of Loss Function in Loss Prediction
Model
We further conduct experiments to explore the influence of the loss function in the loss prediction model. The ratio of the initial labeled set to the unlabeled set is set to 3:7, and the weight coefficients α and β in Equation (15) are both set to 1. We experimentally compare MSE with the ranking loss proposed by [43] in the field of computer vision, which is defined as:

L_{loss}(\hat{l}_i, \hat{l}_j, l_i, l_j) = \max\left(0, -\mathbb{1}(l_i, l_j) \cdot (\hat{l}_i - \hat{l}_j) + \xi\right), \quad \mathbb{1}(l_i, l_j) = \begin{cases} +1, & \text{if } l_i > l_j \\ -1, & \text{otherwise} \end{cases}    (20)

where \hat{l}_i, \hat{l}_j denote a pair of predicted losses, (l_i, l_j) is a pair of real losses, and ξ is a pre-defined positive margin. For example, when l_i > l_j, no loss is given to the loss prediction model only if \hat{l}_i is higher than \hat{l}_j + ξ; otherwise, a loss is given to the loss prediction model to make \hat{l}_i increase and \hat{l}_j decrease. In other words, when the output of the loss prediction model is consistent with the ranking of the real losses, the gradient is zero; otherwise, the gradients push up the prediction that should be higher and push down the one that should be lower.
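A PyTorch sketch of this pairwise ranking loss (Eq. 20) for a mini-batch split into B/2 pairs; pairing consecutive samples and averaging over the pairs are our simplifications for illustration.

    import torch

    def ranking_loss(pred, real, margin=1.0):
        """Pairwise ranking loss of Eq. (20). pred, real: (B,) predicted and real losses, B even."""
        pred_i, pred_j = pred[0::2], pred[1::2]      # B/2 consecutive pairs (our pairing choice)
        real_i, real_j = real[0::2], real[1::2]
        sign = torch.where(real_i > real_j,
                           torch.ones_like(real_i), -torch.ones_like(real_i))   # indicator 1(l_i, l_j)
        return torch.clamp(-sign * (pred_i - pred_j) + margin, min=0).mean()

    pred = torch.rand(8, requires_grad=True)
    real = torch.rand(8)
    print(ranking_loss(pred, real))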
Experimental results are given in Fig. 8. We observe
that two groups of experiments with different base mod-
els (i.e., BiLSTM and BiLSTM + Bigram) show similar
performance curves. In both cases, NE-LP with ranking
loss achieves better performance than NE-LP with MSE
in the early and last few iterations. Yoo and Kweon [43]
think that directly employing MSE as the loss function
may let the loss prediction model adapt roughly to the
scale changes of the loss linstead of fitting to the ex-
act value since the scale of the real loss lchanges in
the learning progress of the word segmenter.

Fig. 8 The effect of the loss function in the loss prediction model.

However, we can see that NE-LP with MSE obtains higher F1-
scores than NE-LP with ranking loss in the intermedi-
ate iterations, which indicates that compared with the
field of computer vision, MSE is still effective in NLP
tasks. Furthermore, MSE is more practical due to its
simplicity, while ranking loss has restrictions on setting
hyper-parameters, for example, the mini-batch size B
should be an even number since we need to make B/2
data pairs in order to consider the difference between a
pair of loss predictions.
5 Conclusion and Future Work
To relieve the efforts of EHRs annotation, we propose
an effective word segmentation method based on active
learning with a novel sampling strategy called NE-LP.
NE-LP effectively utilizes the output of a joint model
and combines normalized entropy with self-attention
based loss prediction. Compared to the widely-used and
mainstream uncertainty-based sampling methods, our
sampling strategy selects samples from the statistical
perspective and deep learning level. In addition, to cap-
ture coherence between characters, we further add bi-
gram features to the joint model. Based on EHRs col-
lected from the Shuguang Hospital Affiliated to Shang-
hai University of Traditional Chinese Medicine, we eval-
uate our method on CWS. Compared to conventional
uncertainty-based sampling strategies, NE-LP achieves
the best performance, which proves the effectiveness of
our method to a certain extent.
As possible research directions, in order to imple-
ment highly performant pre-trained language models
into actual settings, we plan to employ some fast ver-
sions of pre-trained neural networks (e.g., FastBERT [24]
and TinyBERT [16]) as the feature extractor for EHRs
segmentation and loss prediction. Moreover, considering the characteristics of the CWS task and model, we believe that our method can also be applied to other tasks, such as
NER and relation extraction.
Acknowledgements We would like to thank the reviewers
for their useful comments and suggestions which helped us
to considerably improve the work. We also kindly thank Ju
Gao from Shuguang Hospital Affiliated to Shanghai Univer-
sity of Traditional Chinese Medicine for providing us clinical
datasets, and Ping He from Shanghai Hospital Development
Center for her help. This work was supported by the Zhe-
jiang Lab (No.2019ND0AB01), the National Natural Science
Foundation of China (No. 61903144) and the National Key
R&D Program of China for “Precision medical research” (No.
2018YFC0910550).
Compliance with ethical standards
Conflict of interest No conflict of interest exists in the submission of this manuscript.
References
1. Angluin, D.: Queries and concept learning. Machine
Learning 2(4), 319–342 (1988)
2. Balcan, M.F., Broder, A., Zhang, T.: Margin based active
learning. In: International Conference on Computational
Learning Theory, pp. 35–50. Springer (2007)
3. Bodenreider, O.: The unified medical language system
(UMLS): integrating biomedical terminology. Nucleic
Acids Research 32(Database-Issue), 267–270 (2004)
4. Cai, T., Zhou, Y., Zheng, H.: Cost-quality adaptive active
learning for Chinese clinical named entity recognition. In:
2020 IEEE International Conference on Bioinformatics
and Biomedicine, pp. 528–533. IEEE (2020)
5. Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long
short-term memory neural networks for Chinese word
segmentation. In: Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing,
pp. 1197–1206 (2015)
6. Cheng, K., Lu, Z.: Active learning bayesian support vec-
tor regression model for global approximation. Informa-
tion Sciences 544, 549–563 (2021)
7. Culotta, A., McCallum, A.: Reducing labeling effort for
structured prediction tasks. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 5, pp. 746–751
(2005)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT:
Pre-training of deep bidirectional transformers for lan-
guage understanding. arXiv preprint arXiv:1810.04805
(2018)
9. Eddy, S.R.: Profile hidden markov models. Bioinformat-
ics (Oxford, England) 14(9), 755–763 (1998)
10. Gan, L., Zhang, Y.: Investigating self-attention network
for Chinese word segmentation. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing 28,
2933–2941 (2020)
11. Gesulga, J.M., Berjame, A., Moquiala, K.S., Galido, A.:
Barriers to electronic health record system implementa-
tion and information systems resources: A structured re-
view. Procedia Computer Science 124, 544–551 (2017)
12. Gilad-Bachrach, R., Navot, A., Tishby, N.: Query by
committee made real. In: Advances in Neural Informa-
tion Processing Systems, pp. 443–450 (2006)
13. Goldberg, Y., Levy, O.: Word2Vec explained: deriv-
ing mikolov et al.’s negative-sampling word-embedding
method. arXiv preprint arXiv:1402.3722 (2014)
14. Guo, Y.: Active instance sampling via matrix partition.
Advances in Neural Information Processing Systems 23,
802–810 (2010)
15. Hasan, M., Roy-Chowdhury, A.K.: Context aware active
learning of activity recognition models. In: Proceedings of
the IEEE International Conference on Computer Vision,
pp. 4543–4551 (2015)
16. Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L.,
Wang, F., Liu, Q.: TinyBERT: Distilling BERT for natu-
ral language understanding. In: Proceedings of the 2020
Conference on Empirical Methods in Natural Language
Processing: Findings, pp. 4163–4174 (2020)
17. La Su, Y., Liu, W.: Research on the LSTM mongolian and
Chinese machine translation based on morpheme encod-
ing. Neural Computing and Applications 32(1), 41–49
(2020)
18. Lafferty, J.D., McCallum, A., Pereira, F.C.: Conditional
random fields: Probabilistic models for segmenting and
labeling sequence data. In: Proceedings of the 18th In-
ternational Conference on Machine Learning, pp. 282–
289 (2001)
19. Lewis, D.D., Gale, W.A.: A sequential algorithm for
training text classifiers. In: Proceedings of the 17th An-
nual International Conference on Research and Develop-
ment in Information Retrieval, pp. 3–12. Springer (1994)
20. Li, S., Zhou, G., Huang, C.R.: Active learning for Chi-
nese word segmentation. In: Proceedings of International
Conference on Computational Linguistics 2012: Posters,
pp. 683–692 (2012)
21. Lindberg, D.S., Prosperi, M., Bjarnadottir, R.I., Thomas,
J., Crane, M., Chen, Z., Shear, K., Solberg, L.M., Snig-
urska, U.A., Wu, Y., et al.: Identification of important
factors in an inpatient fall risk prediction model to im-
prove the quality of care using EHR and electronic ad-
ministrative data: A machine-learning approach. Interna-
tional Journal of Medical Informatics 143, 104272 (2020)
22. Liu, J., Wu, F., Wu, C., Huang, Y., Xie, X.: Neural chi-
nese word segmentation with dictionary. Neurocomput-
ing 338, 46–54 (2019)
23. Liu, M., Tu, Z., Wang, Z., Xu, X.: LTP: A new active
learning strategy for BERT-CRF based named entity
recognition. arXiv preprint arXiv:2001.02524 (2020)
24. Liu, W., Zhou, P., Zhao, Z., Wang, Z., Deng, H., Ju,
Q.: FastBERT: a self-distilling BERT with adaptive in-
ference time. In: Proceedings of the 58th Association for
Computational Linguistics, pp. 6035–6044 (2020)
25. Ma, J., Ganchev, K., Weiss, D.: State-of-the-art Chinese
word segmentation with Bi-LSTMs. In: Proceedings of
the 2018 Conference on Empirical Methods in Natural
Language Processing, pp. 4902–4908 (2018)
26. Marcheggiani, D., Artieres, T.: An experimental com-
parison of active learning strategies for partially labeled
sequences. In: Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing, pp.
898–906 (2014)
27. Peng, F., Feng, F., McCallum, A.: Chinese segmentation
and new word detection using conditional random fields.
In: Proceedings of the 20th international conference on
Computational Linguistics, pp. 562–568 (2004)
28. Rasmy, L., Tiryaki, F., Zhou, Y., Xiang, Y., Tao, C., Xu,
H., Zhi, D.: Representation of EHR data for predictive
modeling: a comparison between UMLS and other termi-
nologies. Journal of the American Medical Informatics
Association 27(10), 1593–1599 (2020)
29. Shao, D., Zheng, N., Yang, Z., Chen, Z., Xiang, Y., Xian,
Y., Yu, Z.: Domain-specific Chinese word segmentation
based on bi-directional long-short term memory model.
IEEE Access 7, 12993–13002 (2019)
30. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
Salakhutdinov, R.: Dropout: a simple way to prevent neu-
ral networks from overfitting. Journal of Machine Learn-
ing Research 15(1), 1929–1958 (2014)
31. Sun, D., Yaqot, A., Qiu, J., Rauchhaupt, L., Jumar, U.,
Wu, H.: Attention-based deep convolutional neural net-
work for spectral efficiency optimization in MIMO sys-
tems. Neural Computing and Applications (2020)
32. Tang, P., Yang, P., Shi, Y., Zhou, Y., Lin, F., Wang,
Y.: Recognizing Chinese judicial named entity using
BiLSTM-CRF. arXiv preprint arXiv:2006.00464 (2020)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Atten-
tion is all you need. In: Advances in Neural Information
Processing Systems, pp. 5998–6008 (2017)
34. Wang, C., Xu, B.: Convolutional neural network with
word embeddings for Chinese word segmentation. In:
Proceedings of the 8th International Joint Conference on
Natural Language Processing, pp. 163–172 (2017)
35. Wang, Q., Zhou, Y., Ruan, T., Gao, D., Xia, Y., He, P.:
Incorporating dictionaries into deep neural networks for
the Chinese clinical named entity recognition. Journal of
biomedical informatics 92, 103–133 (2019)
36. Wei, W., Wang, Z., Mao, X., Zhou, G., Zhou, P., Jiang,
S.: Position-aware self-attention based neural sequence
labeling. Pattern Recognition 110, 107636 (2021)
37. Xing, J., Zhu, K., Zhang, S.: Adaptive multi-task transfer
learning for Chinese word segmentation in medical text.
In: Proceedings of the 27th International Conference on
Computational Linguistics, pp. 3619–3630 (2018)
38. Xue, N., Shen, L.: Chinese word segmentation as lmr tag-
ging. In: Proceedings of the second SIGHAN workshop
on Chinese language processing-Volume 17, pp. 176–179.
Association for Computational Linguistics (2003)
39. Yan, Q., Wang, L., Li, S., Liu, H., Zhou, G.: Active
learning for Chinese word segmentation on judgements.
In: National CCF Conference on Natural Language Pro-
cessing and Chinese Computing, pp. 839–848. Springer
(2017)
40. Yan, Y.F., Huang, S.J., Chen, S., Liao, M., Xu, J.: Ac-
tive learning with query generation for cost-effective text
classification. In: Proceedings of the AAAI Conference
on Artificial Intelligence, vol. 34, pp. 6583–6590 (2020)
41. Yang, H.: BERT meets Chinese word segmentation.
arXiv preprint arXiv:1909.09292 (2019)
42. Yang, J., Yu, Q., Guan, Y., Jiang, Z.: An overview of
research on electronic medical record oriented named en-
tity recognition and entity relation extraction. Acta Au-
tomatica Sinica 40(8), 1537–1562 (2014)
43. Yoo, D., Kweon, I.S.: Learning loss for active learning. In:
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 93–102 (2019)
44. Yuan, Z., Liu, Y., Yin, Q., Li, B., Feng, X., Zhang, G., Yu,
S.: Unsupervised multi-granular Chinese word segmenta-
tion and term discovery via graph partition. Journal of
Biomedical Informatics 110, 103542 (2020)
45. Zhang, H., Huang, W., Liu, L., Chow, T.W.S.: Learning
to match clothing from textual feature-based compatible
relationships. IEEE Transactions on Industrial Informat-
ics 16(11), 6750–6759 (2020)
46. Zhao, H., Huang, C.N., Li, M., Lu, B.L.: Effective tag
set selection in Chinese word segmentation via condi-
tional random field modeling. In: Proceedings of the 20th
Pacific Asia Conference on Language, Information and
Computation, pp. 87–94 (2006)
47. Zheng, X., Chen, H., Xu, T.: Deep learning for Chinese
word segmentation and pos tagging. In: Proceedings of
the 2013 Conference on Empirical Methods in Natural
Language Processing, pp. 647–657 (2013)