arXiv:2011.00318v1 [cs.CL] 31 Oct 2020
Effective Approach to Develop a Sentiment Annotator For Legal Domain in
a Low Resource Setting
Gathika Ratnayaka*,1, Nisansa de Silva*, Amal Shehan Perera*, and Ramesh Pathirana+
*Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka
+Faculty of Law, University of Colombo, Sri Lanka
1gathika.14@cse.mrt.ac.lk
Abstract
Analyzing the sentiments of legal opinions
available in Legal Opinion Texts can facili-
tate several use cases such as legal judgement
prediction, contradictory statements identi-
fication and party-based sentiment analysis.
However, the task of developing a legal do-
main specific sentiment annotator is challeng-
ing due to resource constraints such as lack of
domain specific labelled data and domain ex-
pertise. In this study, we propose novel tech-
niques that can be used to develop a sentiment
annotator for the legal domain while minimiz-
ing the need for manual annotations of data.
1 Introduction
Legal Opinion Texts that elaborate on the incidents,
arguments, legal opinions, and judgements associ-
ated with previous court cases are an integral part of
case law. As the information that can be acquired
from these documents has the potential to be directly
applied in similar legal cases, legal officials use
them as information sources to support their
arguments and opinions when handling a new legal
scenario. Therefore, developing methodologies and
tools that can automatically extract valuable
information from legal opinion texts, and derive
useful insights from the extracted data, is of
significant importance when it comes to assisting
legal officials via automated systems.
Sentiment analysis can be considered as one such
information extraction technique that has significant
potential to facilitate various information extraction
tasks. A legal case is
built around two major parties that are opposed to
each other. The party that brings forward the lawsuit
is usually called the plaintiff and the other party is
known as the defendant. At the beginning of a le-
gal opinion text, a summary of the case is given, de-
scribing the incidents associated with the case and
also explaining how each party is related to those
incidents. Legal opinions or the opinions of judges
about the associated events and laws related to the
court case can be considered as the most important
type of information available in a legal opinion text.
Such opinions may have a positive, neutral, or neg-
ative impact on a particular party. In addition to the
opinions that are directly related to the conduct of
the parties, legal opinion texts also provide interpre-
tations related to previous judgements and also on
statutes that are relevant to the legal case. Such opin-
ions may elaborate on the justifications, purposes,
drawbacks and loopholes that are associated with
a particular statute or a precedent. Moreover, the
descriptions also contain information related to the
proceeding of court cases such as adjournment of the
case and lack of evidence which can be considered
as factors that can directly have an impact on the out-
comes. When all of the above mentioned factors are
considered, sentiment analysis on legal opinion texts
can be considered as a task that can facilitate a wide
range of use cases. Despite its potential and useful-
ness, the attempts to perform sentiment analysis in
legal domain are limited. This study aims to address
this issue by developing a sentiment annotator that
can identify sentiments in a given sentence/phrase
extracted from legal opinion texts related to the
United States Supreme Court. Information that can
be derived from such a sentiment annotator can then
be adapted to facilitate more downstream tasks such
as identifying advantageous and disadvantageous
arguments for a particular party, contradictory
opinion detection (Ratnayaka et al., 2019), and
predicting outcomes of legal cases (Liu and Chen, 2018).
In order to develop a reliable sentiment anno-
tator using supervised learning, it is required to
have a large amount of labelled data to train the
underlying classification model. However, creat-
ing such sophisticated datasets with manually an-
notated data (by domain experts) for a specialised
domain like legal opinion texts is not practical
due to extensive resource and time requirements
(Gamage et al., 2018; Sharma et al., 2018). In a low
resource setting, transfer learning can be used as a
potential technique to overcome the requirement of
creating a sophisticated data set. However, directly
adapting models or datasets from a source domain to the
legal domain has drawbacks, especially due to negative
transfer, a phenomenon that occurs due to dissimilarities
between two domains. Domain specific usage of words,
together with domain specific sentiment polarities
and meanings of words, is a major reason for
negative transfer when adapting datasets/models
from one domain to another
(Sharma et al., 2018). In this study, we demonstrate
methodologies that can be used to overcome
drawbacks due to negative transfer, when adapting a
dataset from a source domain to the legal domain.
2 Related Work
It can be observed that the early attempts
(Thelwall et al., 2010) of developing automatic sen-
timent analysis mechanisms make use of senti-
ment lexicons such as AFINN (Nielsen, 2011),
ANEW (Bradley and Lang, 1999), and SentiWordNet
(Baccianella et al., 2010). The sentiment polarity
and the strength of a particular word change
from one lexicon to another depending on the do-
main that is being considered when developing the
lexicon (Nielsen, 2011) due to the domain-specific
behaviors of words. However, in recent works re-
lated to sentiment analysis that are based on machine
learning and deep learning techniques, the learning
algorithms are allowed to learn the sentiments as-
sociated with words and how their compositions af-
fect the overall sentiment of a particular text. The
Recursive Neural Tensor Network (RNTN) model
proposed by Socher et al. (Socher et al., 2013) can
be considered as an important step towards this
direction and it has shown promising results for
sentiment classification in movie reviews. How-
ever, the performances of such approaches that
are based on recursive neural network architec-
tures have been surpassed more recently by the
approaches that make use of pretrained language
models (e.g., BERT (Devlin et al., 2018)), and such
approaches have now become the state of the art
for sentiment classification (Munikar et al., 2019).
From this point onwards, the RNTN model proposed
in (Socher et al., 2013) will be denoted as RNTN_m.
Though the applications of sentiment analysis in
the legal domain are limited, there is an emerging
interest within the law-tech community to explore
how sentiment analysis can be used to facilitate
the legal processes (Conrad and Schilder, 2007;
Liu and Chen, 2018). The study by Gamage et
al. (Gamage et al., 2018) on performing sentiment
analysis in US legal opinion texts can be considered
as the closest to our work. However, the direct appli-
cability of their approach into our study is prevented
due to some limitations. In (Gamage et al., 2018),
the sentiment annotator was developed to perform
a binary-classification task (negative sentiment and
non-negative sentiment). Moreover, one of the
key steps in (Gamage et al., 2018) is to identify
words that have different sentiments in the legal
domain when compared with their sentiments in
the movie domain. The identification of such
words with domain-specific sentiments had been
performed manually by human annotators. How-
ever, manually going through a set of words with a
significant size is not ideal for a low resource setting
in which the intention is to create an optimum out-
come from a limited amount of human annotations.
Though (Sharma et al., 2018) proposes an automatic
approach based on word embeddings to minimize
negative transfer by identifying transferable words
that can be used for cross domain sentiment classi-
fication, the proposed approach aims only at binary
sentiment classification that considers only the pos-
itive and negative sentiment classes.
3 Methodology
3.1 Identifying words that can cause negative
transfer
In order to minimize the resource requirements, our
intention is to utilize a labeled high resource source
domain to facilitate sentiment analysis in a low re-
source target domain. The Stanford Sentiment Tree-
bank (SST-5) (Socher et al., 2013) which consists of
Rotten Tomatoes movie reviews labelled according to
their sentiments was taken as the source dataset and
a corpus of legal opinion texts was selected to ex-
tract legal phrases that will be used as the target
dataset. As the first step, 3 categories were iden-
tified to which the words available in the source
dataset can be assigned. The first category is the
Domain Generic words, the words that behave in
a similar manner across the movie review domain
and the legal domain. The second category is the
Domain Specific words, the words that behave
differently in the two domains and have the
potential to cause negative transfer. Within this
category, the most frequently used sense/meaning of a
word in one domain may differ from that of the other
domain. Additionally, such a word may have different
sentiment polarities across the two domains.
The third category is the Under Represented Words,
which are words that occur frequently in the target
domain (legal domain), but are absent from, or occur
with very low frequency in, the source dataset.
Due to the resource limitations, it is not feasible
to identify domain specific words, domain generic
words, and under represented words manually by
going through each word in the legal opinion text
corpus. Therefore, the following steps were followed
to minimize the requirements for manual annotation.
As the first step, stop words in the legal
opinion text corpus were removed utilizing the
Van stop list (Van Rijsbergen, 1979). Next, the word
frequency, which is the frequency of occurrence of
a particular word within the corpus, was calculated
for each word. Then, the set of words was arranged
in descending order of word frequency
to create the sorted word set W. From W,
first k-words (most frequent k-words) were chosen
as the considered set of words S. Here
$$k = \min_{j} \left\{ j \in \mathbb{Z}^{+} \;\middle|\; \sum_{i=1}^{j} (w_i) \ge 0.95 \cdot \sum_{i=1}^{n} (w_i) \right\},$$
where $w_i$ is the $i$th element of W and $n$ is the total
number of elements in W.
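The cut-off rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that the terms $(w_i)$ in the sums stand for word frequencies, and the function name `select_considered_words`, its `coverage` parameter, and the toy inputs are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def select_considered_words(tokens, stop_words, coverage=0.95):
    """Return the most frequent words that cover `coverage` of all occurrences."""
    counts = Counter(t for t in tokens if t not in stop_words)
    # W: words in descending order of frequency. Ties are broken
    # alphabetically only to keep the toy example deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(freq for _, freq in ranked)
    # k is the smallest j whose cumulative frequency reaches coverage * total.
    running_totals = accumulate(freq for _, freq in ranked)
    for k, running in enumerate(running_totals, start=1):
        if running >= coverage * total:
            return [word for word, _ in ranked[:k]]
    return []  # empty corpus


tokens = ["court"] * 10 + ["appeal"] * 6 + ["the"] * 20 + ["held"] * 3 + ["oddity"]
considered = select_considered_words(tokens, stop_words={"the"})
```

In this toy corpus the three most frequent non-stop words account for 19 of the 20 remaining occurrences, so k = 3 and the single rare word is left out of S.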
Algorithm 1
Function assignSentiment_o(w, sentiment)
    if sentiment == N then D_on ← D_on ∪ {w}, O_i ← O_i − {w}
    else if sentiment == P then D_op ← D_op ∪ {w}, O_i ← O_i − {w}
    end if
EndFunction
Function assignSentiment_n(w, sentiment)
    if sentiment == N then D_nn ← D_nn ∪ {w}
    else if sentiment == P then D_np ← D_np ∪ {w}, N_i ← N_i − {w}
    else if sentiment == O then D_no ← D_no ∪ {w}, N_i ← N_i − {w}
    end if
EndFunction
Function assignSentiment_p(w, sentiment)
    if sentiment == N then D_pn ← D_pn ∪ {w}, P_i ← P_i − {w}
    else if sentiment == P then D_pp ← D_pp ∪ {w}
    else if sentiment == O then D_po ← D_po ∪ {w}, P_i ← P_i − {w}
    end if
EndFunction
P_i = P_m, N_i = N_m, O_i = O_m, D_on = {}, D_op = {}
D_nn, D_np, D_no, D_pp, D_pn, D_po = {}
n = 0, p = 0
While 1 + |D_on| > n or 1 + |D_op| > p do
    n = 1 + |D_on|, p = 1 + |D_op|
    for word w in O_i do
        l = mostSimilar_l(w)
        if underRepresented(w) and afinnAssignable(w) then
            assignSentiment_o(w, afinn(w))
        else if domainSpecific(w) and afinnAssignable(w) then
            assignSentiment_o(w, afinn(w))
        else if domainGeneric(l) and l ∈ N_m ∪ D_on then
            if notAntonym(w, l) then assignSentiment_o(w, N)
        else if domainGeneric(l) and l ∈ P_m ∪ D_op then
            if notAntonym(w, l) then assignSentiment_o(w, P)
        end if
    end for
end
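The main loop of Algorithm 1 can be read as a fixed-point iteration over the neutral words: it repeats until a full pass adds nothing to the negative or positive sets. The following Python sketch is one possible reading under stated assumptions, not the authors' code. The attributes described in the text (most similar legal word, domain generic, under represented, AFINN assignable, AFINN sentiment, antonym check) are passed in as hypothetical callables, the first two branches are merged because they perform the same action, and a word is treated as domain specific whenever it is not domain generic.

```python
def annotate_neutral_words(O_m, N_m, P_m, nearest_legal, is_generic,
                           is_under_rep, afinn_assignable, afinn, not_antonym):
    """Move neutral words into D_on / D_op until a pass changes nothing."""
    N_m, P_m = set(N_m), set(P_m)
    O_i, D_on, D_op = set(O_m), set(), set()

    def assign(w, sentiment):
        # Only a negative (N) or positive (P) label removes w from the pool.
        if sentiment == "N":
            D_on.add(w)
            O_i.discard(w)
        elif sentiment == "P":
            D_op.add(w)
            O_i.discard(w)

    changed = True
    while changed:
        before = (len(D_on), len(D_op))
        for w in sorted(O_i):  # sorted() takes a snapshot, so O_i may shrink
            lw = nearest_legal(w)
            if (is_under_rep(w) or not is_generic(w)) and afinn_assignable(w):
                assign(w, afinn(w))
            elif is_generic(lw) and lw in N_m | D_on:
                if not_antonym(w, lw):
                    assign(w, "N")
            elif is_generic(lw) and lw in P_m | D_op:
                if not_antonym(w, lw):
                    assign(w, "P")
        changed = (len(D_on), len(D_op)) != before
    return D_on, D_op, O_i
```

Because newly labelled words join the lookup sets, a word whose nearest legal neighbour was still neutral in one pass can receive a label in a later pass, which is why the loop runs until the set sizes stop growing.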
Next, the Stanford Sentiment Annotator (RNTN_m) was used to annotate
the sentiment of each word in the considered word set S. After
the annotation process, the words were distributed into three sets
P_m, N_m, O_m based on the annotated sentiment. The set P_m is made
up of words that were annotated as Very Positive or Positive and the
set N_m is made up of words that were annotated as Very Negative or
Negative. The words that were annotated as having a Neutral sentiment
were included in O_m. The sets P_m, N_m, O_m consist of 336, 253,
and 4992 words respectively. Identifying words in O_m that have
different sentiments across the two domains by manually going through
each word is resource intensive as it contains nearly 5000 words.
To overcome this challenge and to minimize the required number of
manual annotations, we developed a heuristic approach to identify
words in the neutral word set (O_m) that can have different
(deviated) sentiments. It should also be noted that in our
algorithmic approach, words with deviated sentiments are identified
while automatically assigning each word a legal sentiment
(Algorithm 1 and Algorithm 2).
Algorithm 2
n = 0, p = 0
While 1 + |D_nn| > n or 1 + |D_np| > p do
    n = 1 + |D_nn|, p = 1 + |D_np|
    Q = N_i ∪ D_on ∪ D_nn, R = P_m ∪ D_op ∪ D_np
    for word w in N_i do
        l = mostSimilar_l(w)
        if domainGeneric(w) then assignSentiment_n(w, N)
        else if domainSpecific(w) and afinn(w) == N then
            assignSentiment_n(w, N)
        else if domainSpecific(w) and notAntonym(w, l) then
            if l ∈ Q then assignSentiment_n(w, N)
            else if domainGeneric(l) and l ∈ R then
                assignSentiment_n(w, P)
        end if
    end for
end
for word w in N_i do assignSentiment_n(w, O)
n = 0, p = 0
While 1 + |D_pp| > p or 1 + |D_pn| > n do
    p = 1 + |D_pp|, n = 1 + |D_pn|
    Q = N_m ∪ D_on ∪ D_pn, R = P_i ∪ D_op ∪ D_pp
    for word w in P_i do
        l = mostSimilar_l(w)
        if domainGeneric(w) then assignSentiment_p(w, P)
        else if domainSpecific(w) and afinn(w) == P then
            assignSentiment_p(w, P)
        else if domainSpecific(w) and notAntonym(w, l) then
            if l ∈ R then assignSentiment_p(w, P)
            else if domainGeneric(l) and l ∈ Q then
                assignSentiment_p(w, N)
        end if
    end for
end
for word w in P_i do assignSentiment_p(w, O)
P_l = D_op ∪ D_np ∪ D_pp, N_l = D_on ∪ D_nn ∪ D_pn
Though it is feasible to manually annotate all the words in P_m
and N_m, we have developed our algorithmic approach to identify
words that can have deviated sentiments in P_m and N_m as well
(Algorithm 2). Such a heuristic can be used to reduce the number
of annotations required in case the number of words identified
from O_m as having deviated sentiments exceeds the annotation
budget. Moreover, such an automatic approach has the potential to
be utilized as a mechanism to generate domain specific sentiment
lexicons.
Within our approach to distinguish domain specific words from
domain generic words, two key pieces of information that can be
derived from word embedding models are considered: (1) the cosine
similarity between the vector representations of two words u, v,
denoted Cosine_domain(u, v), and (2) the most similar word for a
particular word w, denoted mostSimilar_domain(w). Domain specific
word embeddings have been utilized within our approach to
distinguish domain specific words from domain generic words. The
Word2Vec model publicly available in the SigmaLaw dataset
(Sugathadasa et al., 2017), which has been trained using a United
States legal opinion text corpus, was selected as the legal domain
specific word embedding model. The SST-5 dataset does not contain
an adequate amount of text data to be used as a corpus to create
an effective word embedding model. Therefore, we selected the
IMDB movie review corpus (Maas et al., 2011) to train the movie
review domain specific Word2Vec embedding model. From this point
onwards, Cosine_legal and Cosine_movie-reviews will be denoted by
Cosine_l and Cosine_m respectively. Similarly, mostSimilar_legal(w)
will be denoted by l(w), while m(w) denotes
mostSimilar_movie-reviews(w).
First, for a given word w, we obtain l(w) and m(w). As Word2Vec
(Mikolov et al., 2013) embeddings are based on distributional
similarity, it can be assumed that the most similar word output by
a domain specific embedding model for a particular word is related
to the domain specific sense of that word. For example, convicted
is obtained as l(charged). The word convicted is associated with
the sense of accusation, which is the most frequent sense of
charged in the legal domain. However, m(charged) yields
sympathizing, which is associated with the sense of filled with
excitement or emotion, the most frequent sense of charged in movie
reviews. After obtaining the most similar words for a given word
w, we define a value domainSimilarity(w) such that
domainSimilarity(w) = Cosine_l(l(w), m(w)). As we are considering
the legal embedding model when getting the cosine similarity
values, a higher domainSimilarity(w) value will suggest that the
legal sense and the movie sense of the word w have a similar
meaning in the legal domain, while a lower domainSimilarity(w)
will suggest that the meanings of the two senses are less similar
to each other. For example, the value obtained for
domainSimilarity(charged) was 0.06 while it was 0.53 for
domainSimilarity(convicted) (convicted has a similar sense across
the two domains).
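The domainSimilarity score can be illustrated with a small, self-contained Python sketch. It is not the authors' code: the two-dimensional vectors are invented toy values, the helper names are hypothetical, and the sketch assumes that m(w) is also present in the legal vocabulary so that both neighbours can be compared in the legal embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Nearest neighbour of `word` (excluding itself) in one embedding model."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))


def domain_similarity(word, legal_vecs, movie_vecs):
    l_w = most_similar(word, legal_vecs)  # l(w)
    m_w = most_similar(word, movie_vecs)  # m(w)
    # Both neighbours are compared inside the legal embedding space.
    return cosine(legal_vecs[l_w], legal_vecs[m_w])


# Toy embeddings (hypothetical values chosen to mimic the "charged" example).
legal_vecs = {"charged": (1.0, 0.0), "convicted": (0.9, 0.1), "sympathizing": (0.0, 1.0)}
movie_vecs = {"charged": (0.0, 1.0), "convicted": (1.0, 0.0), "sympathizing": (0.1, 0.9)}
score = domain_similarity("charged", legal_vecs, movie_vecs)
```

With these toy vectors the legal neighbour of charged is convicted, the movie neighbour is sympathizing, and the resulting score is low, mirroring the behaviour reported for charged. With trained Word2Vec models, a library such as gensim could supply the neighbour lookup and similarity calls instead of the hand-written helpers.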
The next step is to identify a threshold based on
domainSimilarity(w) to heuristically distinguish whether a word w
is domain generic or not. To that end, we made use of an already
available Verb Similarity dataset¹ developed for the legal domain.
The dataset consists of 959 verb pairs manually annotated based on
whether the two verbs in a pair have a similar meaning or not.
First, a threshold t based on cosine similarity was defined. For
two given verbs v_i, v_j, if Cosine_l(v_i, v_j) ≥ t, the two verbs
are considered as having a similar meaning. From the experiments,
it was observed that precision is less than 0.5 when the threshold
value is equal to 0.1. Therefore, 0.2 is selected as the threshold
value to identify domain generic words based on the
domainSimilarity(w) score. In other words, if domainSimilarity(w)
is greater than or equal to 0.2, the word w will be considered as
domain generic and the attribute domainGeneric(w) will be set to
true. Otherwise, the attribute domainSpecific(w) will be set to
true. Though we have used the aforementioned approach to determine
the threshold, it is a heuristic and domain specific value that
can be decided based on different experimental techniques (when
applying this methodology to another domain).
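The threshold experiment described above amounts to measuring precision on the annotated verb pairs for each candidate value of t. A minimal Python sketch follows. The pair scores and labels are invented for illustration and do not come from the Verb Similarity dataset, and the function names are hypothetical.

```python
def precision_at_threshold(pairs, t):
    """pairs: iterable of (cosine_score, is_similar) for annotated verb pairs.

    A pair is predicted "similar" when its score is at least t; precision is
    the share of those predictions that the annotators also marked similar.
    """
    predicted = [is_similar for score, is_similar in pairs if score >= t]
    if not predicted:
        return None  # nothing predicted similar, so precision is undefined
    return sum(predicted) / len(predicted)


def is_domain_generic(domain_similarity, t=0.2):
    """Apply the selected threshold to a domainSimilarity score."""
    return domain_similarity >= t


# Hypothetical annotated pairs: (Cosine_l score, human judgement).
toy_pairs = [(0.05, False), (0.12, False), (0.15, True),
             (0.25, True), (0.30, False), (0.40, True)]
```

Applying `is_domain_generic` to the two scores reported in the text gives the expected split: 0.53 (convicted) is treated as domain generic and 0.06 (charged) as domain specific.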
Even if a word behaves in a similar manner across the two domains,
it can still be assigned a wrong sentiment (neutral sentiment) due
to under representation. However, it is important to identify words
with sentiment polarities (positive or negative), as descriptions
with positive or negative sentiments tend to contain more specific
information that will be useful in legal analysis. To identify the
sentiment polarities of under represented words, we made use of the
AFINN (Nielsen, 2011) sentiment lexicon (denoted as set A from this
point onwards), which consists of 3352 words annotated based on
their sentiment polarity (positive, neutral, negative) and
sentiment strength, considering the domain of Twitter discussions.
If the frequency of a word w is less than 3 in the source dataset,
underRepresented(w) is set to true. Assigning the AFINN sentiment
to an under represented word or a domain specific word w can have
a positive impact if the most frequently used sense of w in the
Twitter discussion domain is closer to its sense in the legal
domain than to its sense in the movie review domain. In order to
heuristically determine this factor, we have defined an attribute
named afinnSimilarity such that afinnSimilarity(w) =
Cosine_t(w, l(w)) − Cosine_t(w, m(w)), where w is a given word and
Cosine_t is the cosine similarity obtained using a publicly
available Word2Vec model (Godin, 2019) trained using tweets. If
Cosine_t(w, l(w)) > Cosine_t(w, m(w)), it can be assumed that the
sense of word w in Twitter discussions is closer to its sense in
the legal domain than to its sense in movie reviews. Thus, if
afinnSimilarity(w) > 0 and w ∈ A, the attribute
afinnAssignable(w) is set to true.

¹ https://osf.io/bce9f/
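The two attributes defined in this paragraph can be written down directly. The Python sketch below is illustrative only: `cos_t` stands in for the tweet-trained similarity function, and the lookup table, lexicon entry and counts used in the example are hypothetical values.

```python
def afinn_similarity(w, l_w, m_w, cos_t):
    """Cosine_t(w, l(w)) - Cosine_t(w, m(w))."""
    return cos_t(w, l_w) - cos_t(w, m_w)


def afinn_assignable(w, l_w, m_w, cos_t, afinn_lexicon):
    """True when w is in the lexicon A and its tweet sense leans legal."""
    return w in afinn_lexicon and afinn_similarity(w, l_w, m_w, cos_t) > 0


def under_represented(w, source_counts, min_count=3):
    """True when w occurs fewer than `min_count` times in the source data."""
    return source_counts.get(w, 0) < min_count


# Hypothetical tweet-space similarities for one word and its two neighbours.
toy_scores = {("charged", "convicted"): 0.6, ("charged", "sympathizing"): 0.2}


def toy_cos_t(a, b):
    return toy_scores.get((a, b), 0.0)


ok = afinn_assignable("charged", "convicted", "sympathizing",
                      toy_cos_t, afinn_lexicon={"charged": -3})
```

Here the tweet-space similarity to the legal neighbour exceeds the similarity to the movie neighbour and the word is in the lexicon, so the AFINN label would be considered usable for this word.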
Algorithm 1 and Algorithm 2 are two parts of one major algorithmic
approach (Algorithm 1 executes first). Therefore, the functions and
attributes defined in Algorithm 1 apply globally to both
algorithms, and the states of the attributes after executing
Algorithm 1 are transferred to Algorithm 2. In the algorithms, P,
N, O denote positive, negative, and neutral sentiments
respectively. afinn(w) is the AFINN sentiment categorization of a
given word w. It can be observed that the sentiment of l(w) is also
considered when determining the correct sentiment of a word. For a
word in O_m, the sentiment of l(w) will be assigned if l(w) is
domain generic (Algorithm 1). This step was followed as another way
to identify words with sentiment polarities (positive or negative).
The sentiments of domain generic words in P_m or N_m will not be
changed under any condition. For a domain specific word w in P_m or
N_m, if l(w) has an opposite sentiment polarity to that of w, the
sentiment of l(w) will be assigned to w only if l(w) is domain
generic. All the domain specific words in P_m or N_m that do not
satisfy any of the conditions that are required to assign a
positive or negative polarity (Algorithm 2) will be assigned a
neutral sentiment. This step is taken because such domain specific
words have a relatively higher probability of having opposite
sentiment polarities in the legal domain, and are thus capable of
transferring wrong information to the classification models
(Sharma et al., 2018). Assigning neutral sentiment will reduce the
impact of negative transfer that can be caused by such words
(neutral sentiment is better than having the opposite sentiment
polarity). Furthermore, it should be noted that an antonym of a
particular word w can be given as l(w) by the embedding model due
to semantic drift. To tackle this challenge, WordNet
(Fellbaum, 2012) was used to check whether a given word w and l(w)
are antonyms. If they are not antonyms, the notAntonym(w, l(w))
attribute is set to true.

After running Algorithm 1 and Algorithm 2 with P_m, O_m, N_m as
the inputs, the word sets D_on, D_op were obtained, which consist
of the words that the overall algorithm picked from O_m as having
negative and positive sentiments respectively. D_on, D_op together
with P_m, N_m were given to a legal expert in order to annotate
the words in these sets based on their sentiments. |D_on| = 220
and |D_op| = 116, thus reducing the required amount of annotations
to 925 (925 = |W|, where W = D_op ∪ D_on ∪ P_m ∪ N_m). After the
annotation process, three word sets N_a, O_a, P_a were obtained
that contain the words annotated as having negative, neutral and
positive sentiments respectively. Then word sets D_n, D_o, D_p
were created such that D_n = {w ∈ W | w ∈ N_a & w ∉ N_m},
D_p = {w ∈ W | w ∈ P_a & w ∉ P_m}, D_o = {w ∈ W | w ∈ O_a & w ∉ O_m}.
P_l contains the set of words identified by the overall algorithm
as having positive sentiment and N_l contains the words identified
as having negative sentiment (without human intervention).
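The three deviated word sets are plain set comprehensions over the annotated vocabulary W. The sketch below expresses the definitions of D_n, D_p and D_o in Python. The function name and the toy word sets are hypothetical, and only the set definitions themselves are taken from the text.

```python
def deviated_sets(W, N_a, O_a, P_a, N_m, O_m, P_m):
    """Words whose expert label (a) differs from the source model label (m)."""
    D_n = {w for w in W if w in N_a and w not in N_m}
    D_o = {w for w in W if w in O_a and w not in O_m}
    D_p = {w for w in W if w in P_a and w not in P_m}
    return D_n, D_o, D_p


# Toy example: "charged" was neutral for the source model but the expert
# marks it negative, while "masterpiece" was positive and becomes neutral.
W = {"charged", "masterpiece", "guilty", "granted"}
D_n, D_o, D_p = deviated_sets(
    W,
    N_a={"charged", "guilty"}, O_a={"masterpiece"}, P_a={"granted"},
    N_m={"guilty"}, O_m={"charged", "granted"}, P_m={"masterpiece"},
)
```

A word that keeps its source label, such as guilty in this example, appears in none of the three sets.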
3.2 Fine Tuning the RNTN Model
As an approach to develop a sentiment classifier for legal opinion
texts, RNTN_m (Stanford Sentiment Annotator) (Socher et al., 2013)
was fine tuned following a methodology similar to that proposed by
(Gamage et al., 2018). In that methodology, there is no need to
further train the RNTN_m model or to modify the neural tensor layer
of the model. Instead, the approach is purely based on replacing
the word vectors. In this approach, if a word v in a word sequence
S has a deviated sentiment s_d in the legal domain when compared
with its sentiment s_m as output by RNTN_m, the vector
corresponding to v will be replaced by the vector of a word u,
where u is a word from a list of predefined words that has the
sentiment s_d as output by RNTN_m. When choosing u from the list of
predefined words, the PoS tag of v in word sequence S is considered
in order to preserve the syntactic properties of the language. For
example, if we consider the phrase Sam is charged for a crime, as
charged is a word that has a deviated sentiment, the vector
corresponding to charged will be substituted by the vector of hated
(hated is the word that matches the PoS of charged from the
predefined word list corresponding to the negative class)
(Gamage et al., 2018). When extending the approach proposed in
(Gamage et al., 2018) to three class sentiment classification, a
predefined word list for the positive class was developed by
mapping a set of selected words that have positive sentiment in
RNTN_m to each PoS tag. The mapping can be represented as a
dictionary R, where R = {JJ: beautiful, JJR: better, JJS: best,
NN: masterpiece, NNS: masterpieces, RB: beautifully,
RBR: beautifully, RBS: beautifully, VB: reward, VBZ: appreciates,
VBP: reward, VBD: won, VBN: won, VBG: pleasing}. For the negative
class and the neutral class, the PoS-word mappings provided by
(Gamage et al., 2018) for negative and non-negative classes were
used respectively. Furthermore, instead of annotating each word in
the selected vocabulary to identify words with deviated sentiments,
we used the word sets D_n, D_o, D_p that were derived using the
approaches described in Section 3.1. In Section 4, the fine tuned
RNTN model developed in this study is denoted as RNTN_l.
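The substitution step can be pictured as a token-level lookup performed before the word vectors are fetched. The Python sketch below is a simplified illustration under stated assumptions, not the authors' implementation. The positive dictionary is R as given in the text. The negative dictionary contains only the single entry that the text's example reveals, and the full list from (Gamage et al., 2018) is not reproduced. The function name, the sentiment labels, and the assumption that the source model labels charged as neutral are all hypothetical.

```python
# Positive-class substitutes: the dictionary R given in the text.
POSITIVE = {
    "JJ": "beautiful", "JJR": "better", "JJS": "best",
    "NN": "masterpiece", "NNS": "masterpieces",
    "RB": "beautifully", "RBR": "beautifully", "RBS": "beautifully",
    "VB": "reward", "VBZ": "appreciates", "VBP": "reward",
    "VBD": "won", "VBN": "won", "VBG": "pleasing",
}
# Only the VBN entry of the negative list is known from the text's example.
NEGATIVE = {"VBN": "hated"}
SUBSTITUTES = {"positive": POSITIVE, "negative": NEGATIVE}


def substitute_deviated(tagged_tokens, legal_sentiment, source_sentiment):
    """Return the words whose vectors should be looked up for each token.

    tagged_tokens: list of (word, PoS tag) pairs for one word sequence S.
    legal_sentiment / source_sentiment: dicts from word to sentiment label.
    """
    lookup_words = []
    for word, pos in tagged_tokens:
        s_d = legal_sentiment.get(word)
        deviated = s_d is not None and s_d != source_sentiment.get(word)
        if deviated and pos in SUBSTITUTES.get(s_d, {}):
            lookup_words.append(SUBSTITUTES[s_d][pos])  # vector of u replaces v
        else:
            lookup_words.append(word)
    return lookup_words


phrase = [("Sam", "NNP"), ("is", "VBZ"), ("charged", "VBN"),
          ("for", "IN"), ("a", "DT"), ("crime", "NN")]
result = substitute_deviated(phrase, {"charged": "negative"}, {"charged": "neutral"})
```

For the example phrase only charged is swapped, so the model receives the vector of hated at that position while the sentence structure is unchanged.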
3.3 Adapting the BERT based approaches
An approach based on BERT_large embeddings (Munikar et al., 2019)
has achieved state of the art results for sentiment classification
of sentences in the SST-5 dataset. In order to adapt the same
approach for our task, the following steps were followed. First,
sentences with their sentiment labels were extracted from the SST-5
training set. The SST-5 training set consists of 8544 sentences
labelled for 5 class sentiment classification. As our focus is on
3 class classification, the sentiment labels in the SST training
set were converted for 3 class sentiment classification by mapping
the very positive and positive labels to positive and the very
negative and negative labels to negative. Next, following a similar
methodology as described in (Munikar et al., 2019),
canonicalization, tokenization and special token addition were
performed as the preprocessing steps. Then, the classification
model was designed following the same model architecture described
in (Munikar et al., 2019), which consists of dropout regularization
and a softmax classification layer on top of the pretrained BERT
layer. Similarly to (Munikar et al., 2019), BERT_large uncased was
used as the pretrained model and, during the training phase,
dropout with a probability of 0.1 was applied as a measure of
preventing overfitting. Cross entropy loss was used as the cost
function and stochastic gradient descent was used as the optimizer
(batch size was 8). Then, the model was trained using the SST-5
training sentences. As information related to the number of
training epochs could not be found in (Munikar et al., 2019), we
experimented with 2 and 3 epochs and calculated the accuracies with
a test set of 500 legal phrases (Section 4). When trained for 2
epochs, the accuracy was 57%, and for 3 epochs it was reduced to
52%, possibly due to overfitting to the source data. Therefore, 2
was chosen as the number of training epochs. This model will be
denoted as BERT_m in the next sections.
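The label conversion described above is a fixed five-to-three mapping. A minimal Python sketch follows. The string label names and the example sentences are assumptions made for illustration, since the text does not specify how the SST-5 labels were encoded.

```python
# Collapse the five SST-5 classes into the three classes used in this study.
FIVE_TO_THREE = {
    "very negative": "negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "very positive": "positive",
}


def to_three_class(examples):
    """examples: list of (sentence, five_class_label) pairs."""
    return [(sentence, FIVE_TO_THREE[label]) for sentence, label in examples]


converted = to_three_class([
    ("A stunning film.", "very positive"),
    ("It drags badly.", "very negative"),
    ("It exists.", "neutral"),
])
```

The mapping leaves the number of training sentences unchanged. Only the label space shrinks from five classes to three.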
In order to fine tune the BERT based approach for legal sentiment
classification, the following steps were followed. First, we
selected the sentences in the SST training data that contain words
identified as having deviated sentiments (words in
D_o ∪ D_p ∪ D_n). If the sentiment label of a sentence S that has
a deviated sentiment word w is different from the sentiment label
assigned to w by the legal expert, then S is removed from the
original SST training dataset as a measure of reducing negative
transfer. For example, if there is a sentence S with the word
charged and the sentiment of S is positive or neutral (the
sentiment of charged is negative in the legal domain), then S is
removed from the training set. After removing such sentences, the
training set was reduced to 6318 instances and this new training
set will be denoted by D from this point forward. Next, for each
word w in D_n or D_p, we randomly selected 2 sentences that contain
w from the legal opinion text corpus. Then, the sentiments of the
selected sentences were manually annotated by a legal expert. As
|D_n| = 206 and |D_p| = 82, only 576 new annotations were needed
(|D_o| = 230, but words in D_o were not considered for this
approach as they have a neutral legal sentiment). Then, these 576
sentences from legal opinion texts were combined with the sentences
in D, thus creating a new training set L that consists of 6894
instances. The above mentioned steps were followed to reduce
negative transfer from the source dataset and also to fine tune
the dataset to the legal domain. Then, L was used to train a BERT
based model using the same architecture, hyperparameters and number
of training epochs that were used to train BERT_m. The model
obtained after this training process is denoted as BERT_l.
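The filtering rule for the source training set can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: sentences are represented as lists of lowercase tokens, `expert_sentiment` holds the expert label only for deviated words (those in D_o ∪ D_p ∪ D_n), and the function name and toy sentences are hypothetical.

```python
def remove_conflicting(sentences, expert_sentiment):
    """Drop sentences whose label disagrees with a deviated word they contain.

    sentences: list of (tokens, label) pairs from the source training set.
    expert_sentiment: dict from deviated word to its expert-assigned label.
    """
    kept = []
    for tokens, label in sentences:
        # Words without an expert label can never cause a conflict.
        conflict = any(expert_sentiment.get(tok, label) != label for tok in tokens)
        if not conflict:
            kept.append((tokens, label))
    return kept


source = [
    (["an", "emotionally", "charged", "performance"], "positive"),
    (["he", "was", "charged", "and", "it", "shows"], "negative"),
    (["a", "fine", "film"], "positive"),
]
filtered = remove_conflicting(source, {"charged": "negative"})

# Size bookkeeping reported in the text: 2 sentences per word in D_n and D_p.
new_annotations = 2 * (206 + 82)
final_size = 6318 + new_annotations
```

Only the first toy sentence is dropped, because it pairs charged with a positive label. The bookkeeping lines reproduce the 576 new annotations and the final training set size of 6894 stated in the text.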
4 Experiments and Results
4.1 Identification of words with deviated
sentiments
In order to evaluate the effectiveness of the proposed algorithmic
approach when it comes to identifying the legal sentiment of a
word, we have compared the positive word list (P_l) and negative
word list (N_l) identified by the algorithm with P_m and N_m
respectively, as shown in Table 1. The way in which P_l and N_l
were obtained is described in Algorithm 2. It can be observed that
the precision of identifying words with negative sentiments is 80%
in the algorithmic approach, an improvement of 19 percentage points
when compared with RNTN_m (Socher et al., 2013). Furthermore, the
number of correctly identified negative words has increased to 317
from 154. Though the precision of identifying words with positive
sentiment is only 62%, there is an improvement of 21 percentage
points when compared with RNTN_m. Precision of identifying words
with positive sentiment is relatively low due to the fact that
most of the words that have a positive sentiment in generic
language usage have a neutral sentiment in the legal domain.
Sophisticated analysis in relation to the neutral class could not
be performed due to the large number of words available in O_m.
When considering these results, it can be seen that the proposed
algorithm has shown promising results when it comes to determining
the legal domain specific sentiment of a word. Additionally, it
implies that the proposed algorithmic approach is successful in
identifying words that have different sentiments across the two
domains. This approach can also be extended to other domains easily
as domain specific word embedding models can be trained using an
unlabelled corpus. Furthermore, the proposed algorithmic approach
also has the potential to be used in automatic generation of
domain specific sentiment lexicons.
Table 1: Evaluating the word lists generated from Algorithm 1 and Algorithm 2

            Number of Words            Percentages
Polarity    N_m   N_l   P_m   P_l      N_m    N_l    P_m    P_l
Negative    154   317    17    20      61%    80%     5%     7%
Neutral      96    73   180    89      38%    19%    54%    31%
Positive      3     4   139   181       1%     1%    41%    62%
Total       253   394   336   290     100%   100%   100%   100%
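The precision figures discussed in Section 4.1 follow directly from the word counts in Table 1. The short Python check below recomputes them. The helper names are hypothetical, and the counts are the ones listed in the table.

```python
def precision(correct, predicted):
    """Share of the words placed in a list that the expert agrees with."""
    return correct / predicted


def as_percent(pair):
    correct, predicted = pair
    return round(100 * precision(correct, predicted))


# (expert-confirmed words, list size) pairs read from Table 1.
negative = {"RNTN_m": (154, 253), "algorithm": (317, 394)}
positive = {"RNTN_m": (139, 336), "algorithm": (181, 290)}

negative_gain = as_percent(negative["algorithm"]) - as_percent(negative["RNTN_m"])
positive_gain = as_percent(positive["algorithm"]) - as_percent(positive["RNTN_m"])
```

The differences between the rounded percentages give the gains quoted in the text for the negative and positive word lists.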
Table 2: Precision (P), Recall (R) and F-Measure (F) obtained from the considered models

                     Negative            Neutral             Positive
Model                P     R     F       P     R     F       P     R     F       Accuracy
RNTN_m               0.51  0.68  0.58    0.44  0.52  0.48    0.48  0.10  0.16    0.48
RNTN_l (Improved)    0.55  0.70  0.62    0.54  0.51  0.52    0.73  0.44  0.55    0.57
BERT_m               0.68  0.73  0.70    0.47  0.68  0.56    0.57  0.13  0.21    0.57
BERT_l (Improved)    0.72  0.79  0.75    0.58  0.55  0.57    0.70  0.62  0.66    0.67
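The F-Measure column in Table 2 is assumed here to be the usual harmonic mean of precision and recall. The text does not state the formula, so the sketch below is a consistency check under that assumption, applied to two BERT_l rows of the table.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


bert_l_negative = round(f_measure(0.72, 0.79), 2)
bert_l_positive = round(f_measure(0.70, 0.62), 2)
```

Since the table reports precision and recall already rounded to two decimals, recomputed values for other rows can differ from the printed F-Measure in the last digit.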
4.2 Sentiment Classification
In order to evaluate the performance of the considered
models on legal sentiment classification, a test set
consisting of sentences from legal opinion texts annotated
according to their sentiment is required. As the first
step of preparing the test set, 500 sentences were
randomly picked from the legal opinion text corpus such
that there is no overlap between the test set and the
sentences used to train BERT_l. Then the sentiment of
each sentence was annotated by a legal expert. According
to the human annotations, the numbers of data instances
belonging to the negative, neutral and positive classes in
the test set were 211, 168, and 121 respectively. The
results obtained by each model on the test set are shown
in Table 2. The effectiveness of the fine tuning
approaches proposed in this study is evident, as the RNTN
fine tuning has achieved an accuracy increase of 9
percentage points, while fine tuning the dataset for BERT
training has achieved an accuracy increase of 10
percentage points, when compared with the performances of
the respective source models. It can be observed that
BERT_m has the same accuracy as RNTN_l. However, the
performance of the RNTN_l model is relatively consistent
across all 3 classes, while the recall and F-measure of
BERT_m in relation to the positive class are significantly
low. It should be noted that the BERT_l model, which was
trained after fine tuning the dataset for the legal
domain, outperforms all other models. Furthermore, the
state of the art accuracy for 5 class sentiment
classification of sentences in the SST-5 dataset is 55.5%
(Munikar et al., 2019). An accuracy of 67% for 3 class
classification in the legal domain can be considered
satisfactory given the added language complexities in
legal opinion texts, even though the number of classes has
been reduced to 3. Most importantly, the accuracy
enhancement of 10 percentage points compared with BERT_m
was achieved by including only 576 new sentences from
legal opinion texts that were annotated by a legal expert.
Therefore, it can be concluded that the transfer learning
approach described in Section 3.3 is an effective way to
develop a domain specific sentiment annotator with
considerable accuracy while utilizing a minimal amount of
annotation.
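To put these accuracies in context, the class distribution of the test set gives the accuracy of a trivial classifier that always predicts the most frequent class, and the gains discussed above are absolute differences in accuracy. The Python sketch below works both out from the figures reported in this section and in Table 2. The majority-class baseline is an added point of comparison and not a result reported in this study, and the function and variable names are illustrative assumptions.

```python
# Test-set class counts as reported in Section 4.2.
CLASS_COUNTS = {"negative": 211, "neutral": 168, "positive": 121}

# Accuracies from Table 2.
ACCURACY = {"RNTN_m": 0.48, "RNTN_l": 0.57, "BERT_m": 0.57, "BERT_l": 0.67}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


def gain_in_points(source, improved):
    """Absolute accuracy gain in percentage points, rounded to an integer."""
    return round((ACCURACY[improved] - ACCURACY[source]) * 100)


if __name__ == "__main__":
    print(f"test sentences: {sum(CLASS_COUNTS.values())}")
    print(f"majority-class baseline: {majority_baseline(CLASS_COUNTS):.3f}")
    print(f"RNTN gain: {gain_in_points('RNTN_m', 'RNTN_l')} points")
    print(f"BERT gain: {gain_in_points('BERT_m', 'BERT_l')} points")
```

With these inputs the class counts sum to 500, the majority-class baseline is 0.422, and the gains are 9 and 10 percentage points for RNTN and BERT respectively.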
5 Conclusion
Developing a sentiment annotator to analyze the
sentiments of legal opinions can be considered as the
primary contribution of this study. In order to achieve
this primary objective in a low resource setting, we have
proposed effective approaches based on transfer learning
while utilizing domain specific word representations to
overcome negative transfer. As a part of the overall
methodology, we have also proposed an algorithmic approach
that is capable of identifying words whose sentiments
deviate between the source and target domains, and of
assigning the target domain specific sentiment to those
words. The data sets prepared within this study for
testing and training purposes have been made publicly
available². Moreover, the methodologies formulated in
this study are designed so that they can be easily
adapted to any other domain.
Acknowledgments
This research was funded by the SRC/LT/2018/08 grant of
the University of Moratuwa.
² https://osf.io/zwhm8/
References
[Baccianella et al.2010] Stefano Baccianella, Andrea
Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet
3.0: an enhanced lexical resource for sentiment
analysis and opinion mining. In LREC, volume 10,
pages 2200–2204.
[Bradley and Lang1999] Margaret M Bradley and Peter J
Lang. 1999. Affective norms for English words
(ANEW): Instruction manual and affective ratings.
Technical Report C-1, The Center for Research in
Psychophysiology.
[Conrad and Schilder2007] Jack G Conrad and Frank
Schilder. 2007. Opinion mining in legal blogs. In
Proceedings of the 11th international conference on
Artificial intelligence and law, pages 231–236.
[Devlin et al.2018] Jacob Devlin, Ming-Wei Chang,
Kenton Lee, and Kristina Toutanova. 2018.
BERT: Pre-training of deep bidirectional transformers
for language understanding. arXiv preprint
arXiv:1810.04805.
[Fellbaum2012] Christiane Fellbaum. 2012. WordNet.
The encyclopedia of applied linguistics.
[Gamage et al.2018] Viraj Gamage, Menuka
Warushavithana, Nisansa de Silva, Amal Shehan
Perera, Gathika Ratnayaka, and Thejan Rupasinghe.
2018. Fast approach to build an automatic sentiment
annotator for legal domain using transfer learning.
arXiv preprint arXiv:1810.01912.
[Godin2019] Frédéric Godin. 2019. Improving and
Interpreting Neural Networks for Word-Level Prediction
Tasks in Natural Language Processing. Ph.D. thesis,
Ghent University, Belgium.
[Liu and Chen2018] Yi-Hung Liu and Yen-Liang Chen.
2018. A two-phase sentiment analysis approach for
judgement prediction. Journal of Information Science,
44(5):594–607.
[Maas et al.2011] Andrew Maas, Raymond E Daly,
Peter T Pham, Dan Huang, Andrew Y Ng, and
Christopher Potts. 2011. Learning word vectors for sentiment
analysis. In Proceedings of the 49th annual meeting of
the association for computational linguistics: Human
language technologies, pages 142–150.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai
Chen, Greg S Corrado, and Jeff Dean. 2013.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information
processing systems, pages 3111–3119.
[Munikar et al.2019] Manish Munikar, Sushil Shakya,
and Aakash Shrestha. 2019. Fine-grained sentiment
classification using BERT. In 2019 Artificial Intelligence
for Transforming Business and Society (AITB),
volume 1, pages 1–5. IEEE.
[Nielsen2011] Finn Årup Nielsen. 2011. A new ANEW:
Evaluation of a word list for sentiment analysis in
microblogs. arXiv preprint arXiv:1103.2903.
[Ratnayaka et al.2019] Gathika Ratnayaka, Thejan
Rupasinghe, Nisansa de Silva, Viraj Salaka Gamage,
Menuka Warushavithana, and Amal Shehan Perera.
2019. Shift-of-perspective identification within legal
cases. arXiv preprint arXiv:1906.02430.
[Sharma et al.2018] Raksha Sharma, Pushpak
Bhattacharyya, Sandipan Dandapat, and Himanshu Sharad
Bhatt. 2018. Identifying transferable information
across domains for cross-domain sentiment
classification. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 968–978.
[Socher et al.2013] Richard Socher, Alex Perelygin, Jean
Wu, Jason Chuang, Christopher D Manning,
Andrew Y Ng, and Christopher Potts. 2013. Recursive
deep models for semantic compositionality over a
sentiment treebank. In Proceedings of the 2013
conference on empirical methods in natural language
processing, pages 1631–1642.
[Sugathadasa et al.2017] Keet Sugathadasa, Buddhi
Ayesha, Nisansa de Silva, Amal Shehan Perera,
Vindula Jayawardana, Dimuthu Lakmal, and Madhavi
Perera. 2017. Synergistic union of word2vec and
lexicon for domain specific semantic similarity. In
2017 IEEE International Conference on Industrial
and Information Systems (ICIIS), pages 1–6. IEEE.
[Thelwall et al.2010] Mike Thelwall, Kevan Buckley,
Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010.
Sentiment strength detection in short informal text.
Journal of the American Society for Information
Science and Technology, 61(12):2544–2558.
[Van Rijsbergen1979] C Van Rijsbergen. 1979.
Information retrieval: theory and practice. In Proceedings of
the Joint IBM/University of Newcastle upon Tyne
Seminar on Data Base Systems, pages 1–14.