Neural Network Based Rhetorical Status Classification for Japanese Judgment Documents
Hiroaki YAMADA a, Simone TEUFEL a,b and Takenobu TOKUNAGA a
a School of Computing, Tokyo Institute of Technology, Japan
b University of Cambridge, Computer Laboratory, U.K.
Abstract. We address the legal text understanding task, and in particular we treat Japanese judgment documents in civil law. Rhetorical status classification (RSC) is the task of classifying sentences according to the rhetorical functions they fulfil; it is an important preprocessing step for our overall goal of legal summarisation. We present several improvements over our previous RSC classifier, which was based on CRF. The first is a BiLSTM-CRF based model which improves performance significantly over previous baselines. The BiLSTM-CRF architecture is able to additionally take the context in terms of neighbouring sentences into account. The second improvement is the inclusion of section heading information, which resulted in the overall best classifier. Explicit structure in the text, such as headings, is an information source which is likely to be important to legal professionals during the reading phase; this makes the automatic exploitation of such information attractive. We also considerably extended the size of our annotated corpus of judgment documents.
Keywords. Japanese NLP, Legal NLP, Argument understanding, Machine learning, Sentence classification, Natural language processing, Neural network, Deep learning, Rhetorical status classification
1. Introduction
As in all other areas of life, information overload has become problematic in the legal domain. Legal practitioners, including lawyers and judges, need to find relevant documents for their cases and efficiently extract case-relevant information from them. In the Japanese legal system, one of the main sources used for this task is the judgment document, an important type of legal document which is the direct output of court trials and contains the judgment, the facts and the grounds [1,2]. Judgment documents are typically long and linguistically complex, so that it becomes impossible to read all relevant documents carefully. Summaries of judgment documents are a promising solution to this problem, as they would help legal professionals decide which documents to read with full attention. Our final goal is to develop methods for automatically generating such summaries.
Our project is based on the observation that the structure of the legal argument can guide summarisation. In Japanese judgment documents, a common structure exists (Figure 1), which centres around the so-called "Issue Topic," a legal concept corresponding to pre-defined main points which are to be discussed in a particular court case. An example of an Issue Topic, from a case concerning damage compensation for a traffic accident involving a bus, is the question of the degree of the plaintiff's own negligence.
Figure 1. Argument structure of judgment document
A case consists of several Issue Topics (three in the figure), and each is associated with a conclusion by the judge, and with supporting arguments for the decision. The task of argument structure extraction can be divided into four subtasks [3]: 1. Issue Topic Identification: find sentences that describe an Issue Topic; 2. Rhetorical Status Classification: determine the rhetorical status of each sentence; 3. Issue Topic Linking: associate each sentence with exactly one Issue Topic; 4. FRAMING Linking: link two sentences if one provides argumentative support for the other.
In this paper, we focus on Rhetorical Status Classification (RSC), the task of classifying sentences according to their rhetorical role (e.g. BACKGROUND or CONCLUSION). In the legal domain, this task is often seen as a preprocessing step for later tasks such as legal information extraction, extractive summarisation and argument mining [4,5,6]. We define seven RSC categories as follows; Table 1 lists them and gives an example for each. FACT covers descriptions of the facts giving rise to the case; BACKGROUND is reserved for quotations or citations of law materials (legislation and relevant precedent cases); CONCLUSION marks the decisions of the judge; and IDENTIFYING is a category used for text that states discussion topics. The primary argumentative material is contained in the two categories FRAMING-main and FRAMING-sub. FRAMING-main marks material which directly supports the judge's conclusion, whereas FRAMING-sub is one of the two categories which can support FRAMING-main (the other being BACKGROUND). These categories are crucial for downstream argumentative structure mining (task 4). Material that cannot be classified into any of the above classes is covered by the OTHER category.
In our previous work, RSC performance was acceptable overall, but differed across categories: in particular, performance was low in some of the categories most important for downstream tasks. BACKGROUND, an important category listing relevant law materials, achieved only F=0.32, and CONCLUSION, which covers the most important argumentative sentences, only F=0.39. We were also not fully satisfied with the performance of the two FRAMING categories.
In this paper, we present our improved RSC classifier for Japanese judgment documents, which uses a neural network-based architecture. One of the new information sources for our model is information coming from headings in the text. Our method is motivated by human readers' scanning behaviour during reading. We also present our new, considerably larger annotated corpus of Japanese judgments.
2. Data and Annotation
The corpus we used in previous work [3] consists of 89 Japanese judgment documents of Civil law cases from lower courts, with annotations of argumentative structure.
Table 1. Examples for RSC categories (translated)
IDENTIFYING: Based on the agreed facts and the gist of the whole argument, we discuss each issue in the following.
CONCLUSION: Therefore, the plaintiff's claim is unreasonable since we just found that the officer was not negligent.
FACT: The duties of an execution officer are ... and officer D properly conducted ...
BACKGROUND: It is reasonable to find the officer negligent when the officer did not take the appropriate ... (1997/7/15 ruling of the Third Petty Bench of the Supreme Court).
FRAMING-main: The measures performed by the officer comply with the normal procedure for inspection.
FRAMING-sub: It is considered that officer D entered the estate to confirm the circumstance ...
Table 2. RSC class distribution of our corpora in percent
                              FACT  FR-main  FR-sub  CONC  IDEN  BACK   OTH    sent.
Previous (89 doc)             23.1     19.5    11.5   3.9   2.1   0.3  39.7   37,371
New (train & test, 110 doc)   23.5     19.1    10.6   3.8   2.0   0.3  40.6   44,677
The documents were sourced from a website maintained by the Supreme Court of Japan¹ by a random selection process. Our new corpus extends this set to 120 documents (48,370 sentences, 3.2 million characters) following the same principles, and the same expert annotator (a PhD candidate in a graduate school of Japanese Law, who was paid for this work) was used. The annotation is kept consistent with the preceding paper, i.e., annotations for all four subtasks above are obtained at the same time. Category assignment is exclusive, i.e., only one category can be assigned to each sentence. We reserved ten of the 120 documents as development data for hyperparameter tuning. The remaining 110 documents are used for the experiments reported here. Table 2 shows the category distribution and totals for our test and training corpus of 110 documents, against the previously used test and training corpus of 89 documents.
3. Conditional Random Field baseline model
Previous work on RSC in legal documents found that RSC is strongly affected by context in terms of other rhetorical roles [3,5]. We therefore use Conditional Random Fields (CRF) [7]² as a strong baseline model.
As features, we use the seven features from [3]: the bigram, sentence location and sentence length features (the latter calculated in characters). We also use 8 modality features based on Masuoka's modality expression classification [9], namely the modalities "truth judgment" (4 features; e.g., "hazu da" (can be expected to be) or "beki da" (should be)), "value judgment" (3 features), and "explanation". The function expression feature distinguishes the 199 semantic equivalence classes contained in the function expression dictionary by [10] (such as "evidential" and "contradictory conjunction"); this covers 16,801 separate surface expressions. The cue phrase feature contains an additional 22 phrases from a textbook used during the training of judges [11] and from five judgment documents not included in the training and test data. Finally, the law names feature distinguishes 494 specific law names as features and adds a binary feature indicating the presence of any law name in the sentence. A document is then input into the CRF model as a sequence of sentences, where each sentence is represented by the features above.
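To illustrate how such a feature-based sentence sequence is passed to a linear-chain CRF, here is a minimal sketch using the python-crfsuite binding of CRFsuite; the feature functions and example sentences are simplified placeholders, not our exact feature set:

import pycrfsuite

def sentence_features(sentence, index, doc_length):
    # Toy sentence-level features; stand-ins for the bigram, modality,
    # function expression, cue phrase and law name features described above.
    return {
        'location': index / doc_length,             # relative sentence position
        'length': float(len(sentence)),             # sentence length in characters
        'has_law_name': float('Code' in sentence),  # placeholder law-name check
    }

# Each document is one training sequence: feature dicts paired with labels.
doc = ["The officer was not negligent.", "Therefore, the claim is dismissed."]
labels = ["FRAMING-main", "CONCLUSION"]

trainer = pycrfsuite.Trainer(verbose=False)
xseq = [sentence_features(s, i, len(doc)) for i, s in enumerate(doc)]
trainer.append(xseq, labels)
trainer.train('rsc-baseline.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('rsc-baseline.crfsuite')
print(tagger.tag(xseq))  # predicted rhetorical status per sentence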
¹ http://www.courts.go.jp/
² We used Okazaki's implementation of CRFsuite [8].
Figure 2. BiLSTM-CRF model for RSC. w: words from an input sentence; c: characters from an input sentence; s: contextualised vectors of sentences; y: predicted RSC category.
Figure 3. BiLSTM-CRF model with heading. ch: characters from an input heading; s: contextualised vectors of sentences, the same as in Figure 2.
4. BiLSTM-CRF based model
We tested our BiLSTM-CRF based sentence sequence labelling architecture presented
here against the baseline model.
The Bidirectional LSTM (BiLSTM) [12], which can encode both preceding and succeeding context, has recently become a standard deep learning method for modelling sequences; BiLSTM-CRF architectures built on it have been used for Named Entity Recognition (NER) and POS-tagging [13]. Context in the form of surrounding text as well as surrounding labels can be taken into account with this architecture: past and future input features can be modelled through BiLSTM layers, whereas sequences of labels can be modelled through a CRF layer. Variants of BiLSTM-CRF differ in how they encode the token vectors which form the input to the sequence-level BiLSTM layer: [14] uses a Convolutional Neural Network (CNN)-based character-level representation in addition to word embeddings, whereas [15] uses a character-level representation encoded by another BiLSTM encoder. Our BiLSTM-CRF model has three main components: a sentence encoder layer, a BiLSTM-sentence layer, and a CRF layer (Figure 2).
Sentence encoder layer. Our target units are sentences, not words as in the POS-tagging and NER tasks, so they need to be encoded into vectors before being passed to the BiLSTM layer. The sentence encoder layer consists of two components, LSTM-word and CNN-char. LSTM-word takes word embeddings of sentences as input and outputs the summarised vector for each sentence. CNN-char is a simple CNN with one layer of convolution [16], which takes character embeddings of sentences as input and generates the summarised vector for each sentence. In pre-experiments, using both LSTM-word and CNN-char showed improved performance over using either of these on their own. While LSTM-word should encode the overall meaning of input sentences, CNN-char should capture characteristic combinations of characters, such as typical combinations of Chinese characters and Hiragana characters. Outputs from LSTM-word and CNN-char are concatenated.
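A minimal PyTorch sketch of this two-part sentence encoder (dimensions follow Table 3; this is our reading of the description, not the authors' released code):

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """LSTM over word embeddings + one-layer CNN over character embeddings;
    the two summary vectors are concatenated into one sentence vector."""
    def __init__(self, n_words, n_chars, emb_dim=300, lstm_dim=64,
                 channels=256, window=5):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.char_emb = nn.Embedding(n_chars, emb_dim)
        self.lstm_word = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        self.cnn_char = nn.Conv1d(emb_dim, channels, kernel_size=window,
                                  padding=window // 2)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n_word_tokens); char_ids: (batch, n_char_tokens)
        _, (h_n, _) = self.lstm_word(self.word_emb(word_ids))
        word_vec = h_n[-1]                               # (batch, lstm_dim)
        chars = self.char_emb(char_ids).transpose(1, 2)  # (batch, emb, len)
        char_vec = torch.relu(self.cnn_char(chars)).max(dim=2).values  # max-pool
        return torch.cat([word_vec, char_vec], dim=1)    # sentence vector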
BiLSTM-sentence layer / CRF layer. We use the architecture proposed in [13]. The BiLSTM-sentence layer takes a sequence of sentences as input and concatenates the hidden state vectors from two LSTMs run bidirectionally; the output of this step should correspond to a contextualised representation of the input sentence vector, which is then input to a CRF layer that computes the final output.
Dropout. For regularisation, we include dropout [17] after the LSTM-word and the BiLSTM-sentence layers.
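A sketch of these sequence-level layers, using the third-party pytorch-crf package (pip install pytorch-crf) as a stand-in for the CRF layer; hidden sizes follow Table 3, but the implementation details are assumptions:

import torch.nn as nn
from torchcrf import CRF

class BiLstmCrfTagger(nn.Module):
    """BiLSTM over sentence vectors, then a CRF over the label sequence."""
    def __init__(self, sent_dim, hidden=128, num_labels=7, dropout=0.2):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.emit = nn.Linear(2 * hidden, num_labels)  # emission scores
        self.crf = CRF(num_labels, batch_first=True)

    def loss(self, sent_vecs, labels, mask):
        # sent_vecs: (batch, n_sentences, sent_dim) from the sentence encoder
        emissions = self.emit(self.drop(self.bilstm(sent_vecs)[0]))
        return -self.crf(emissions, labels, mask=mask)  # negative log-likelihood

    def predict(self, sent_vecs, mask):
        emissions = self.emit(self.bilstm(sent_vecs)[0])
        return self.crf.decode(emissions, mask=mask)    # best label sequences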
4.1. Input data and embeddings
The inputs to the sentence encoder layer are vector representations of words and characters. As for word inputs, we use the SentencePiece algorithm [18] to tokenise a sentence into tokens, a step necessitated by the fact that the Japanese script does not use an explicit word separator. SentencePiece is an unsupervised text tokeniser which allows us to tokenise without any pre-defined dictionaries. We trained the tokeniser on 15 thousand Civil and Criminal law judgment documents published between 1989 and 2017, using the same web source as our test and training corpus, but excluding the documents used in it (note that the domains differ slightly, as our test and training corpus consists only of Civil law cases). The tokenised words are then input into the embedding layer.
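A minimal example of this tokenisation step with the SentencePiece Python package; the file names and vocabulary size here are placeholders, not our training configuration:

import sentencepiece as spm

# Train an unsupervised subword model on raw judgment text; no pre-defined
# dictionary is required, which suits the unsegmented Japanese script.
spm.SentencePieceTrainer.train(input='judgments.txt', model_prefix='jp_legal',
                               vocab_size=8000)

sp = spm.SentencePieceProcessor(model_file='jp_legal.model')
print(sp.encode('原告の請求を棄却する。', out_type=str))  # subword tokens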
As for character inputs, we simply split a sentence into characters and input them to the embedding layer. The meaning-bearing part of most open-class Japanese words is due to one or more Chinese characters, which are semi-compositionally combined. The characters themselves might therefore contribute additional meaning components and similarities between words beyond the word identities themselves. Each embedding layer converts the input to embedding vectors, which form the input to the sentence encoder. We initialise the embedding layer for characters with GloVe [19] vectors pre-trained on judgment documents of Civil law cases published in the last 14 years (2004–2017).³
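A sketch of this initialisation step; glove here is an assumed mapping from characters to 300-dimensional arrays loaded from a pre-trained vector file:

import numpy as np
import torch
import torch.nn as nn

def init_char_embeddings(char_to_id, glove, dim=300):
    # Random init for characters without a pre-trained vector, GloVe otherwise.
    weights = np.random.normal(scale=0.1, size=(len(char_to_id), dim))
    for ch, idx in char_to_id.items():
        if ch in glove:
            weights[idx] = glove[ch]          # copy the pre-trained vector
    emb = nn.Embedding(len(char_to_id), dim)
    emb.weight.data.copy_(torch.from_numpy(weights))
    return emb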
4.2. Input and output handling
An input to the model is a sequence of sentences. We restrict the length of the sequence to an odd number w.⁴ We obtain a sequence of inputs by sliding a window of size w from the beginning to the end of the document, sentence by sentence. The n-th input from document D can be represented as $Q_D^n(w) = \{S_D^{n-(w-1)/2}, \ldots, S_D^n, \ldots, S_D^{n+(w-1)/2}\}$, where $S_D^i$ is the i-th sentence in document D.
³ We found in pre-experiments that halving the corpus we used for the tokenisation experiment (disregarding the older half) led to better results. The target embedding vector dimension is set to 300.
⁴ Preliminary experiments where an entire document was input as a single sequence showed low results. The average length of documents was 403.1 lines, which proved too long even for LSTMs, despite their ability to store a good amount of long-term context.
At the beginning and the end of the document, we fill padding tokens where necessary. Since each sentence appears in w windows, it receives w predictions, one for each relative position in the input; we use the prediction from the window in which the sentence is the middle element.⁵
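A sketch of this sliding-window construction, following the description above (the padding token and the example window size are placeholders):

PAD = '<pad>'  # placeholder padding token

def windows(sentences, w=11):
    # Yield, for each sentence, the length-w window centred on it; the model's
    # prediction for the middle position is the one that is kept.
    assert w % 2 == 1
    half = (w - 1) // 2
    padded = [PAD] * half + list(sentences) + [PAD] * half
    for n in range(len(sentences)):
        yield padded[n:n + w]

for q in windows(['s1', 's2', 's3', 's4'], w=3):
    print(q)  # ['<pad>', 's1', 's2'], ['s1', 's2', 's3'], ...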
5. BiLSTM-CRF based model with Headings
We next present a new model which uses the information contained in the documents' headings. Exploiting explicit structural information from the text, such as headings, could model the reading strategy of legal professionals. In particular, we hypothesise that when a human reader notices a new heading in a document, they might interpret it as a signal of rhetorical status change.
In addition to the components of the BiLSTM-CRF model, a dedicated network for handling heading information is added to the model (Figure 3). The network consists of three parts: a heading encoder, a BiLSTM-heading layer, and a heading-sentence concatenator. The heading encoder is a character-based LSTM encoder which summarises the input character embeddings of a heading and outputs a heading vector. The BiLSTM-heading layer is similar to the BiLSTM-sentence layer in that it generates a contextualised representation of headings per input, but it is activated only for headings: for heading lines, the heading itself is input to the upper network layers; otherwise, a special placeholder character is used, which signals the absence of a heading. The outputs from the BiLSTM-sentence layer and the BiLSTM-heading layer are concatenated and input to a fully-connected layer. The CRF layer then receives the output from the fully-connected layer.
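A sketch of this concatenation step; the dimensions are assumptions based on Table 3, with sent_ctx coming from the BiLSTM-sentence layer and head_ctx from the BiLSTM-heading layer:

import torch
import torch.nn as nn

class HeadingFusion(nn.Module):
    def __init__(self, sent_ctx_dim=256, head_ctx_dim=128, num_labels=7):
        super().__init__()
        # Fully-connected layer producing the emission scores fed to the CRF.
        self.fc = nn.Linear(sent_ctx_dim + head_ctx_dim, num_labels)

    def forward(self, sent_ctx, head_ctx):
        # head_ctx encodes the heading characters for heading lines and the
        # "no heading" placeholder for all other lines (see text above).
        return self.fc(torch.cat([sent_ctx, head_ctx], dim=-1))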
As headings are not explicitly annotated in our corpus, we detect them automatically using a binary rule-based heading detector based on the presence of sentence-final punctuation and on sentence length. The detector's performance at finding headings was F=0.89 (R=0.99, P=0.81), measured on all 2,061 lines in 5 random documents (manually annotated by the first author). 622 lines were headings (lines which only contain headings and nothing else) and 1,439 were non-headings (either normal sentences, or lines which erroneously contain both a heading and the beginning of a normal sentence).
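A minimal sketch of such a rule-based detector; the punctuation set and length threshold below are our assumptions for illustration, not the published rules:

SENTENCE_FINAL = ('。', '．', '.')  # assumed sentence-final punctuation marks

def is_heading(line, max_len=20):
    # Heuristic: headings are short lines without sentence-final punctuation.
    line = line.strip()
    return 0 < len(line) <= max_len and not line.endswith(SENTENCE_FINAL)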
6. Experiment
6.1. Experimental setting
We use 110 documents from our corpus for training and testing of the two BiLSTM models described in sections 4 and 5. The hyperparameters of the BiLSTM-CRF models were tuned empirically on the development data (10 documents); the values used for the experiments are shown in Table 3. We use five-fold cross-validation at the document level.
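Folding at the document level (rather than the sentence level) keeps all sentences of a judgment on the same side of each split; a sketch with scikit-learn, using placeholder document identifiers:

import numpy as np
from sklearn.model_selection import KFold

docs = np.array([f'doc{i}' for i in range(110)], dtype=object)  # placeholders
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(docs):
    train_docs, test_docs = docs[train_idx], docs[test_idx]
    # train on train_docs and evaluate on test_docs; no judgment document
    # contributes sentences to both sides of a fold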
In order to make sure that any performance improvement over our previous work is due to the architecture and not only to the use of heading information per se, we also make the heading information available to the CRF, in the form of a binary feature expressing heading existence, a variant we call CRF+H.⁶ This means that we report results for a total of four models (CRF, CRF+H, BiLSTM-CRF, BiLSTM-CRF+H). We test the significance of differences in macro-averaged F using a Monte Carlo paired permutation test with randomisation at the sentence level, R=100,000 samples, and a significance level of α=0.05 (two-tailed).
⁵ Due to a quirk in the experiments, we only pad at the beginning of documents, not at the end. This leads to some cases in each document where the predicted item is not in the middle of the outputs. In those cases, we use the last prediction of the output.
⁶ CRF+H also receives the strings of the headings through the bigram feature.
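A sketch of such a significance test; metric stands for any scoring function over paired system outputs, e.g. macro-averaged F, and the implementation details are ours, not the authors':

import random

def paired_permutation_test(preds_a, preds_b, gold, metric, R=100_000, seed=0):
    # Two-tailed Monte Carlo paired permutation test: swap the two systems'
    # outputs on each sentence with probability 0.5 and recompute the metric.
    rng = random.Random(seed)
    observed = abs(metric(preds_a, gold) - metric(preds_b, gold))
    hits = 0
    for _ in range(R):
        pa, pb = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:
                a, b = b, a
            pa.append(a)
            pb.append(b)
        if abs(metric(pa, gold) - metric(pb, gold)) >= observed:
            hits += 1
    return (hits + 1) / (R + 1)  # p-value estimate; significant if below 0.05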
Table 3. Hyperparameters for BiLSTM-CRF models
epochs: 1
word emb dim: 300
char emb dim: 300
LSTM-word: 64
CNN-char window: 5
CNN-char channels: 256
LSTM-word dropout: 0.2
BiLSTM-sent: 128 + 128
BiLSTM-sent dropout: 0.2
heading emb dim*: 64
heading encoder*: 64
BiLSTM-heading*: 64 + 64
final concat*: 128
* if applicable (heading model only)
6.2. Results
Overall results are shown in Table 4. BiLSTM-CRF+H (F=0.654 with setting w=11) significantly outperforms both CRF (F=0.630) and CRF+H (F=0.632), showing that the deep learning architecture with heading information indeed represents an overall improvement. The effect also holds without heading information: BiLSTM-CRF (w=21) (F=0.651) is significantly better than CRF (F=0.630) and CRF+H (F=0.632). The BiLSTM-CRF model family overall outperforms the CRF model family.
Although the macro-averaged F difference between BiLSTM-CRF and BiLSTM-CRF+H is not significant, several individual categories show significant improvement when heading information is added (see Table 5), namely BACKGROUND (F=0.341), FRAMING-main (F=0.651) and CONCLUSION (F=0.449). These are three of the four categories we care about most, as they carry most of the information for the legal argumentation and form the basis of our further planned processing in this application.
However, this success is paid for with a significant decrease (from F=0.527 to 0.474) for the FRAMING-sub category. These results notwithstanding, we still promote the heading-enabled BiLSTM as our preferred model, as the three improved categories include the previously weakest of those four categories (BACKGROUND increased from F=0.319 to 0.341). With CONCLUSION and FRAMING-sub now both performing at roughly F=0.45 to 0.47, this leaves us overall in a better situation than without the heading information.
The confusability between FRAMING-main and FRAMING-sub is likely one of the main reasons for the remaining errors. Table 6 shows the confusion matrix of BiLSTM-CRF+H: 1,990 out of 4,727 FRAMING-sub sentences (42%) are wrongly classified as FRAMING-main. According to the agreement study of the RSC annotation scheme in the previous study [3], the distinction between these two categories is hard even for human annotators. The problem is that both categories appear in similar locations and have similar surface characteristics, e.g. phrases corresponding to "therefore" in Japanese.
7. Related Work
Rhetorical Status Classification is a commonly used approach in legal text processing for associating text pieces with their rhetorical status. Our rhetorical annotation scheme of six categories plus the OTHER category is an adaptation of previous schemes for the UK law system [4] and the Indian law system [5]. For the automation of RSC, CRF and other machine learning models have been employed. For RSC of the UK law system, Hachey and Grover used various supervised machine learning systems, achieving the best results with C4.5 [20] using only the location feature (F=0.65); the second-best result (F=0.61) was achieved using a Support Vector Machine [21] with all features (location, thematic words, sentence length, quotation, entities and cue phrases). For RSC of the Indian law system, a CRF classifier with various features similar to our CRF model achieved F=0.82 [5].
Table 4. Macro-averaged results for models
Models Precision Recall F
CRF 0.681 0.603 0.630
CRF + Heading 0.685 0.605 0.632
BiLSTM-CRF (w=11) 0.663 0.635 0.647
BiLSTM-CRF (w=21) 0.686 0.629 0.651
BiLSTM-CRF (w=31) 0.673 0.615 0.638
BiLSTM-CRF + Heading (w=11) 0.679 0.636 0.654
BiLSTM-CRF + Heading (w=21) 0.657 0.628 0.640
BiLSTM-CRF + Heading (w=31) 0.653 0.620 0.633
Table 5. Results of models by classes (F)
Category CRF BiLSTM-CRF BiLSTM-CRF+H
BACKGROUND 0.344 0.319 0.341
CONCLUSION 0.381 0.415 0.449
FACT 0.853 0.890 0.879
FRAMING-main 0.594 0.642 0.651
FRAMING-sub 0.471 0.527 0.474
IDENTIFYING 0.792 0.798 0.806
OTHER 0.972 0.969 0.975
BiLSTM-CRF is w=21 and BiLSTM-CRF+H is w=11.
Table 6. Confusion matrix of BiLSTM-CRF + Heading (w=11); rows: gold label, columns: prediction
        BGD    CCL     FCT    FRm    FRs   IDT     OTR   Total
BGD      38      0      19     38     29     0       3     127
CCL       0    699      42    847     28     6      90   1,712
FCT       3     36   9,544    500    181    15     235  10,514
FRm      18    548     745  5,836  1,214    44     132   8,537
FRs      33     31     628  1,990  1,944    49      52   4,727
IDT       2     15      30     96     53   710      24     930
OTR       2     73     191     76     18     7  17,763  18,130
Total    96  1,402  11,199  9,383  3,467   831  18,299  44,677
Walker et al. develop a rule-based RSC classifier from a small amount of labelled data [6]. Their task is to identify rhetorical roles of sentences such as "Finding", which states whether a propositional condition of a legal rule is determined to be true, false or undecided; "Evidence", such as the testimony of a lay witness or a medical record; "Reasoning", which reports the reasoning underlying the findings of fact (i.e. a premise); "Legal-Rule", which states legal rules; "Citation", which references legal authorities or other law materials; and "Others". There are close similarities to our categories. 530 sentences were used to develop the rule set for their classifier, and the paper reports a comparison between this low-cost rule-based classifier (F=0.52) and machine learning alternatives.
Some F-measures from previous studies are higher than ours, but this reflects the difficulty of our task: none of the other schemes makes distinctions as fine-grained as we do, particularly at the lower levels of argumentative support, such as those expressed by the FRAMING-main vs. FRAMING-sub distinction.
Another piece of work performs deontic sentence classification in contract documents [22], using a hierarchical RNN-based architecture. The sentences are classified into "Obligation", "Prohibition", "Obligation List Intro", "Obligation List Item", and "Prohibition List Item". The model is based on a BiLSTM sequential sentence classifier which, like our models, considers both the sequence of words in each sentence and the sequence of sentences, but it does not employ a label sequence optimiser such as our CRF layer.
Outside the legal document processing community, RSC is often used in the area of
scientific paper processing for the extraction of relevant material and for summarisation.
An RNN-based model similar to ours has been proposed for the RSC of sentences in
medical scientific abstracts [23]. Our model shares the basic design (a sentence encoder,
a context encoder, and a CRF layer) with this model; however, their model does not
consider heading information.
8. Conclusion
In this paper, we proposed applying a BiLSTM-CRF based model to rhetorical status classification. It performs RSC as sequential labelling, taking inter-sentence context into account. We also proposed adding a dedicated network which conveys contextualised heading information, after headings have been recognised by a simple automatic heading detector. This model showed significant improvements over the plain BiLSTM-CRF model in BACKGROUND, FRAMING-main and CONCLUSION. We also extended the size of our annotated corpus of Japanese judgment documents. The resulting system showed a significant improvement over our CRF based baseline models.
There are several possible directions for future work. One of these is to train our model with a curriculum learning strategy [24]. Curriculum learning is a training approach that presents training examples to a model in a meaningful order, gradually increasing difficulty. RSC seems to fit this training scheme very well, as it shows various patterns of sequences, from simple ones such as category repetitions ("FACT, FACT, FACT, ...") to more complicated ones such as "FRAMING-sub, FRAMING-sub, FRAMING-main, FRAMING-sub, BACKGROUND, ...". Curriculum learning might therefore help our model learn to distinguish difficult categories (e.g. FRAMING-sub vs. FRAMING-main) in an efficient way. We also plan to conduct an extrinsic evaluation, in which lawyers use the results of the RSC in a summarisation task.
Acknowledgments. This work was supported by the Tokyo Tech World Research Hub Initiative (WRHI) Program of the Institute of Innovative Research, Tokyo Institute of Technology.
References
[1] Ministry of Justice, Japan, “Form of Rendition”, Code of Civil Procedure, Article 252.
[2] Ministry of Justice, Japan, “Judgment Document”, Code of Civil Procedure, Article 253.
[3] H. Yamada, S. Teufel and T. Tokunaga, Building a corpus of legal argumentation in Japanese judgement documents: towards structure-based summarisation, Artificial Intelligence and Law 27(2) (2019), 141–170.
[4] B. Hachey and C. Grover, Extractive summarisation of legal texts, Artificial Intelligence and Law 14(4)
(2006), 305–345.
[5] M. Saravanan and B. Ravindran, Identification of Rhetorical Roles for Segmentation and Summarization
of a Legal Judgment, Artificial Intelligence and Law 18(1) (2010), 45–76.
[6] V.R. Walker, K. Pillaipakkamnatt, A.M. Davidson, M. Linares and D.J. Pesce, Automatic Classification
of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning (2019).
[7] J.D. Lafferty, A. McCallum and F.C.N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in: Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 282–289. ISBN 1-55860-778-1.
[8] N. Okazaki, CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
[9] T. Masuoka, Nihongo Modariti Tankyu (Japanese Modality Investigations), Kuroshio shuppan, 2007.
[10] S. Matsuyoshi, S. Sato and T. Utsuro, A Dictionary of Japanese Functional Expressions with Hierarchical Organization, Journal of Natural Language Processing 14(5) (2007), 123–146.
[11] Judicial Research and Training Institute of Japan, The guide to write civil judgements (in Japanese),
10th edn, Housou-kai, 2006.
[12] A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other
neural network architectures, Neural networks 18(5–6) (2005), 602–610.
[13] Z. Huang, W. Xu and K. Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, CoRR
abs/1508.01991 (2015).
[14] X. Ma and E. Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, in: Proceed-
ings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Association for Computational Linguistics, 2016, pp. 1064–1074.
[15] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, Neural Architectures for Named
Entity Recognition, in: Proceedings of the 2016 Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Language Technologies, Association for Computational
Linguistics, San Diego, California, 2016, pp. 260–270.
[16] Y. Kim, Convolutional Neural Networks for Sentence Classification, in: Proceedings of the 2014 Confer-
ence on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics, Doha, Qatar, 2014, pp. 1746–1751.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: a simple way
to prevent neural networks from overfitting, The journal of machine learning research 15(1) (2014),
1929–1958.
[18] T. Kudo and J. Richardson, SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing, in: Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing: System Demonstrations, Association for Computational Linguistics,
Brussels, Belgium, 2018, pp. 66–71.
[19] J. Pennington, R. Socher and C. Manning, Glove: Global Vectors for Word Representation, in: Pro-
ceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543.
[20] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.
[21] C. Cortes and V. Vapnik, Support-vector networks, Machine learning 20(3) (1995), 273–297.
[22] I. Chalkidis, I. Androutsopoulos and A. Michos, Obligation and Prohibition Extraction Using Hierarchi-
cal RNNs, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguis-
tics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018,
pp. 254–259.
[23] D. Jin and P. Szolovits, Hierarchical Neural Networks for Sequential Sentence Classification in Medical
Scientific Abstracts, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3100–3109.
[24] Y. Bengio, J. Louradour, R. Collobert and J. Weston, Curriculum Learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, ACM, New York, NY, USA, 2009, pp. 41–48. ISBN 978-1-60558-516-1.