New Methods for Attribution of Rabbinic Literature
Moshe Koppel Dror Mughaz Navot Akiva
{koppel,myghaz,navot}@cs.biu.ac.il
Dept. of Computer Science
Bar-Ilan University
Introduction
In this paper, we will demonstrate how recent developments in the nascent field of
automated text categorization can be applied to Hebrew and Hebrew-Aramaic texts.
In particular, we illustrate the use of new computational methods to address a number
of scholarly problems concerning the classification of rabbinic manuscripts. These
problems include answering the following questions:
1. Which of a set of known authors is the most likely author of a given document
of unknown provenance?
2. Were two given corpora written/edited by the same author or not?
3. Which of a set of documents preceded which, and did some of them influence the others?
4. From which version (manuscript) of a document is a given fragment taken?
We will apply our techniques to a number of representative problems involving
corpora of rabbinic texts.
Text Categorization
Text categorization is one of the major problems of the field of machine learning
(Sebastiani 2002). The idea is that we are given two or more classes of documents and
we need to find some formula (usually called a “model”) that reflects statistical
differences between the classes and that can then be used to classify a new document.
For example, we might wish to classify a document as being about one of a number of
possible topics, as having been written by a man or a woman, as having been written
by one of a given set of candidate authors and so forth.
Figure 1: Architecture of a text categorization system.
In Figure 1 we show the basic architecture of a text categorization system in which we
are given examples of two classes of documents, Class A and Class B. The first step,
document representation, involves defining a set of text features which might
potentially be useful for categorizing texts in a given corpus (for example, words that
are neither too common nor too rare) and then representing each text as a vector in
which entries represent (some non-decreasing function of) the frequency of each
feature in the text. Optionally, one may then use various criteria for reducing the
dimension of the feature vectors (Yang & Pedersen 1997).
Once documents have been represented as vectors, there are a number of learning
algorithms that can be used to construct models that distinguish between vectors
representing documents in Class A and vectors representing documents in Class B.
Yang (1999) compares and assesses some of the most promising algorithms, which
include k-nearest-neighbor, neural nets, Winnow, SVM, etc. One particular type of
model which is easy to understand, and which we use in this paper, is known as a
linear separator. A linear separator works as follows: each feature of a text is assigned a certain number of points for either Class A or Class B. (The class to which the points go and the precise number of points are determined by the learning algorithm on the basis of the training documents.) A new document is then classified by scanning it and tallying the points it contains for each class; the class with the most points in the document is the class to which the document is assigned.
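To make the representation and scoring steps concrete, the following sketch (not the actual system used in this study) represents a text as a vector of relative feature frequencies and scores it with a linear separator; the feature list and weights are hypothetical placeholders that would in practice be produced by the learning algorithm.

```python
from collections import Counter

def to_vector(text, features):
    """Represent a text as the relative frequency of each chosen feature."""
    tokens = text.split()
    total = max(len(tokens), 1)
    counts = Counter(tokens)
    return [counts[f] / total for f in features]

def classify(vector, weights, threshold=0.0):
    """Linear separator: positive weights count as 'points' for Class A,
    negative weights as points for Class B."""
    score = sum(w * x for w, x in zip(weights, vector))
    return "A" if score > threshold else "B"

# Hypothetical example: three placeholder features and weights learned elsewhere.
features = ["feature1", "feature2", "feature3"]
weights = [2.5, -1.0, 0.7]
print(classify(to_vector("feature1 feature3 feature1 other", features), weights))
```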
Style-Based Text Categorization
Driven largely by the problem of Internet search, the text categorization literature has
dealt primarily with classification of texts by topic: to which category in some
directory of topics should a document (typically, a web page) be assigned. There has,
however, been a considerable amount of research on authorship attribution, which is
what concerns us in this paper. Most of this work has taken place within what is often
called the "stylometric" community (Holmes 1998, McEnery & Oakes 2000), which
has tended to use statistical methods substantially different in flavor from those
typically used by researchers in the machine learning community. Nevertheless, in
recent years machine learning techniques have been used with increasing frequency
for solving style-based problems. The granddaddy of such works is that of Mosteller and Wallace (1964), who applied Naïve Bayes to the disputed Federalist Papers. Other such works include those of Matthews and Merriam (1993) on the works of Shakespeare, Argamon et al (1998) on news stories, Koppel et al (2002) on gender, Wolters and Kirsten (1999) on genre, de Vel et al (2001) on email authorship, and Stamatatos et al (2001) on Greek texts.
Classification according to topic is a significantly easier problem than classifying
according to author style. The kinds of features which researchers use for categorizing
according to topic typically are frequencies of content words. For example,
documents about sports can be distinguished from documents about politics by
checking the frequencies of sports-related or politics-related words. In contrast, for
categorizing according to author style one needs to use precisely those linguistic
features that are content-independent. We wish to find those stylistic features of a
given author’s writing that are independent of any particular topic. Thus, in the past
researchers have used for this purpose lexical features such as function words
(Mosteller & Wallace 1964), syntactic features (Baayen et al 1996, Argamon et al
1999, Stamatatos et al 2001), or complexity-based features such as word and sentence
length (Yule 1938). As we will see, different applications call for different types of
features.
Hebrew texts present special problems in terms of feature selection for style-based
classification. In particular, function words tend to be conflated into word affixes in Hebrew, which reduces the number of free-standing function words but increases the number of morphological features that can be exploited. The richness of Hebrew morphology also makes part-of-speech tagging a considerably messier task than in languages such as English, in which each part of speech is typically realized as a separate word. In any case, we did not use a Hebrew part-of-speech tagger for this
study. A good deal of work has been done by Radai (1978, 1979, 1982) on
categorization of Biblical documents but Radai’s work was not done in the machine
learning paradigm used in this paper.
Our Approach
In this paper we will solve four problems all involving texts in Hebrew-Aramaic.
Problem 1: We are given responsa (letters written in response to legal questions) of
two authorities in Jewish law, Rashba and Ritba. Both lived in thirteenth-century Spain, and Ritba was a student of Rashba; their styles are regarded as very similar to each other. The object is to identify a given responsum as having been authored by Rashba or by Ritba.
Problem 2: We are given one corpus written by a nineteenth century Baghdadi
scholar, Ben Ish Chai, and another corpus believed to have been written by him under
a pseudonym. We need to determine if the same person wrote the two corpora or not.
Problem 3: We are given three sub-corpora of the classic work of Jewish mysticism,
Zohar. Scholars are uncertain whether a single author wrote all three corpora and,
if not, which corpora influenced which others. We will resolve the authorship issue
and propose the likeliest relationship between the corpora.
Problem 4: We are given four manuscripts, one printed version and three hand-
written by different scribes, of the same tractate of the Babylonian Talmud. The
object is to determine from which manuscript a given fragment is taken. (We are given the text of the fragment, not the original manuscript, so handwriting is not relevant.)
Problem 1: Authorship Attribution
The problem of authorship attribution is the simplest one we will consider in this
paper and its solution forms the basis for the solutions of all the other problems. It is a
straightforward application of the techniques described above: we are given the
writings of a set of authors and are asked to classify previously unseen documents as
belonging to one or the other of these authors.
To illustrate how this is done, we consider the problem of determining whether a
given responsum was written by Rashba, a leading thirteenth century rabbinic scholar,
or by his student, Ritba. We consider this problem merely as an exercise; to the best
of our knowledge, there are no extant responsa of disputed authorship in which these
two scholars are the candidate authors. We are given 209 responsa from each of Ritba and Rashba. We select a list of lexical features as follows: the 500 most frequent
words in the corpus are selected and all those that are deemed content-words are
eliminated manually. We are left with 304 features. Strictly speaking, these are not all
function words but rather words that are typical of the legal genre generally without
being correlated with any particular sub-genre. Thus a word like הלאש would be
allowed, although in other contexts it would not be considered a function word.
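As a rough illustration (not the exact procedure used here, since the content-word filtering was done by hand), the candidate features can be gathered by counting word frequencies over the whole corpus and then removing a human-supplied list of content words:

```python
from collections import Counter

def candidate_features(documents, n_most_frequent=500, content_words=frozenset()):
    """Take the most frequent words in the corpus and drop those that a human
    judge has marked as content words; what remains is the feature list."""
    counts = Counter(token for doc in documents for token in doc.split())
    frequent = [word for word, _ in counts.most_common(n_most_frequent)]
    return [word for word in frequent if word not in content_words]
```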
An important aspect of this experiment is the pre-processing that must be applied to
the text before vectors can be constructed. Since the texts we have of the responsa have undergone editing, we must take care to ignore possible effects of differences
in the texts resulting from variant editing practices. Thus, we expand all abbreviations
and unify variant spellings of the same word.
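The normalization step can be sketched as a pair of lookup tables applied to every token; the tables below are hypothetical stand-ins (with English dummy entries) for the actual abbreviation and spelling lists, which were compiled manually for the Hebrew-Aramaic corpus:

```python
# Hypothetical normalization tables; the real lists were compiled by hand.
ABBREVIATIONS = {"e.g.": "for_example"}
SPELLING_VARIANTS = {"colour": "color"}

def normalize(text):
    """Expand abbreviations and unify variant spellings before vectorization."""
    tokens = []
    for token in text.split():
        token = ABBREVIATIONS.get(token, token)
        token = SPELLING_VARIANTS.get(token, token)
        tokens.append(token)
    return " ".join(tokens)
```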
After representing each of our training examples as a numerical vector, we use as our
learning algorithm a generalization of the Balanced Winnow algorithm of Littlestone
(1987) that has previously been shown to be effective for text-categorization by topic
(Lewis et al 1996, Dagan et al 1997) and by style (Koppel et al 2003).
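For readers unfamiliar with the learner, the following is a schematic Balanced Winnow in the spirit of Littlestone (1987); the multiplicative update and parameter values shown are one common textbook variant, not necessarily the exact generalization used in our experiments:

```python
def train_balanced_winnow(vectors, labels, n_features,
                          alpha=1.5, beta=0.5, threshold=1.0, epochs=10):
    """labels are +1 (Class A) or -1 (Class B); vectors are non-negative
    feature-frequency vectors of length n_features."""
    u = [1.0] * n_features   # weights voting for Class A
    v = [1.0] * n_features   # weights voting for Class B
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum((u[i] - v[i]) * x[i] for i in range(n_features))
            prediction = 1 if score >= threshold else -1
            if prediction != y:                   # mistake-driven update
                for i in range(n_features):
                    if x[i] > 0:                  # only active features change
                        if y == 1:
                            u[i] *= alpha; v[i] *= beta   # promote A, demote B
                        else:
                            u[i] *= beta; v[i] *= alpha   # promote B, demote A
    return u, v
```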
In order to test the accuracy of our methods, we need to test the models on documents
that were not seen by the computer during the training phase. To do this properly, we
use a technique known as five-fold cross-validation, which works as follows: We take
all the documents in our corpus and randomly divide them into five sets. We then
choose four of the sets and learn a model that distinguishes between Rashba and
Ritba. Once we have done this we take the fifth set and apply the learned model to
this set to see how well the model works. We do this five times, each time holding out
a different one of the five sets as a test set. We then record the overall accuracy of our models at classifying the held-out test examples.
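A minimal version of this protocol, assuming generic train and predict callables rather than the specific learner described above, looks as follows:

```python
import random

def five_fold_accuracy(documents, labels, train_fn, predict_fn, seed=0):
    """Shuffle the corpus, split it into five folds, train on four folds, test
    on the held-out fifth, and report the average accuracy over five rounds."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    accuracies = []
    for held_out in range(5):
        test_idx = folds[held_out]
        train_idx = [i for f in range(5) if f != held_out for i in folds[f]]
        model = train_fn([documents[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, documents[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```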
Application of the Balanced Winnow algorithm on our full feature set in five-fold
cross-validation experiments yielded test accuracy of 85.8%. After removing features
which received low weights and then re-running Balanced Winnow from scratch, we
obtained accuracy of 90.5%.
It is interesting to note the features that turn out to be most effective for distinguishing
between these authors. The word ךכיפלו is used over 30 times more frequently by
Rashba, while רכזנה is used over 40 times more frequently by Ritba. Similarly,
תרמא and תלאש are used significantly more by Rashba. Table 1 shows a number of
features that are used with significantly different frequency by Rashba and Ritba,
respectively. Note that Rashba tends to employ more second person and plural first
person pronouns than does Ritba. This might be taken as evidence of attempts by
Rashba to encourage “involvedness” (Biber et al 1998) on the part of his correspondents.
Feature      Rashba    Ritba
טושפ          0.86      5.18
הלאש          0.96      7.59
רכזנה         0.96     45.14
יתעד          1.50      6.65
שרופמ         0.64      2.50
תעד           3.63     10.18
תלאש         13.68      5.35
ןניסרגד       4.60      0.78
ונייהו        3.74      0.52
תרמא          3.74      1.04
ךכיפלו        5.45      0.17
ונאש          2.89      1.29
Table 1: Frequencies (per 10,000 words) of various words in the Rashba and Ritba corpora, respectively.
We have run other, similar experiments (Mughaz 2003), too numerous to present in detail here. For example, we have found that glosses of the Tosafists can be classified according to their provenance (Évreux, Sens, Germany) with accuracy of 90% (see
Urbach 1954). Likewise, sections of Midrash Halakhah can be classified as
originating in the school of Rabbi Aqiba or the school of Rabbi Yishmael with
accuracy of 95% (see Epstein 1957).
Problem 2: Unmasking Pseudonymous Authors
The second problem we consider is that of authorship verification. In the authorship
verification problem, we are given examples of the writing of a single author and are
asked to determine if given texts were or were not written by this author. As a
categorization problem, verification is significantly more difficult than attribution and
little, if any, work has been performed on it in the learning community. As we have
seen, when we wish to determine if a text was written by, for example, Rashba or
Ritba, it is sufficient to use their respective known writings, to construct a model
distinguishing them, and to test the unknown text against the model. If, on the other
hand, we need to determine if a text was written by Rashba or not, it is very difficult
if not impossible to assemble an exhaustive, or even representative, sample of not-
Rashba. The situation in which we suspect that a given author may have written some
text but do not have an exhaustive list of alternative candidates is a common one.
The particular authorship verification problem we will consider here is a genuine
literary conundrum. We are given two nineteenth century collections of Hebrew-
Aramaic responsa. The first, RP (Rav Pe'alim), includes 509 documents authored by an Iraqi rabbinic scholar known as Ben Ish Chai. The second, TL (Torah Lishmah), includes 524 documents that Ben Ish Chai claims to have found in an archive. There
is ample historical reason to believe that he in fact authored the manuscript but did not
wish to take credit for it for personal reasons (Ben-David 2003). What do the texts tell
us?
The first thing we do is to find four more collections of responsa written by four other
authors working in roughly the same area during (very) roughly the same period.
These texts are Zivhei Zedeq (Iraq, nineteenth century), Shoel veNishal (Tunisia,
nineteenth century), Darhei Noam (Egypt, seventeenth century), and Ginat Veradim
(Egypt, seventeenth century). We begin by checking whether we are able to distinguish
one collection from another using standard text categorization techniques. We select a
list of lexical features as follows: the 200 most frequent words in the corpus are
selected and all those that are deemed content-words are eliminated manually. We are
left with 130 features. After pre-processing the text as in the previous experiment, we
constructed vectors of length 130 in which each element represented the relative
frequency (normalized by document length) of each feature.
We then used Balanced Winnow as our learner to distinguish pairwise between the
various collections. Five-fold cross-validation experiments yield accuracy of greater
than 95% for each pair. In particular, we are able to distinguish between RP and TL
with accuracy of 98.5%.
One might thus be led to conclude that RP and TL are by different authors. It is still
possible, however, that in fact only a small number of features are doing all the work
of distinguishing between them. It is in fact typical for an author to use a small number of features in a consistently different way in different works. These
differences might result from thematic differences between the works, from
differences in genre or purpose, from chronological stylistic drift, or from deliberate
attempts by the author to mask his or her identity.
In order to test whether the differences found between RP and TL reflect relatively
shallow differences that can be expected between two works of the same author or
reflect deeper differences that can be expected between two different authors, we
invented a new technique that we call unmasking (Koppel et al 2004, Koppel and
Schler 2004) that works as follows:
We begin by learning models to distinguish TL from each of the other authors
including RP. As noted, such models are quite effective. In each case, we then
eliminate the five highest-weighted features and learn a new model. We iterate this
procedure ten times. The depth of difference between a given pair can then be gauged
by the rate with which results degrade as good features are eliminated.
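A sketch of the unmasking loop follows; train_fn is a stand-in for the learner (e.g., Balanced Winnow) and must report its accuracy on the training data together with the weight it assigned to each feature:

```python
def unmask(docs_a, docs_b, features, train_fn, iterations=10, drop_per_round=5):
    """Repeatedly train a model separating docs_a from docs_b, record its
    accuracy, and eliminate the most heavily weighted features.
    train_fn(docs_a, docs_b, features) -> (accuracy, weights_by_feature)."""
    features = list(features)
    curve = []
    for _ in range(iterations):
        accuracy, weights = train_fn(docs_a, docs_b, features)
        curve.append(accuracy)
        ranked = sorted(features, key=lambda f: abs(weights.get(f, 0.0)),
                        reverse=True)
        dropped = set(ranked[:drop_per_round])
        features = [f for f in features if f not in dropped]
    return curve   # a curve that falls off a shelf suggests a same-author pair
```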
The results (shown in Figure 2) could not be more striking. For TL versus each author
other than RP, we are able to distinguish with gradually degrading effectiveness as the
best features are dropped. But for TL versus RP, the effectiveness of the models drops
right off a shelf. This indicates that just a few features, possibly deliberately inserted
as a ruse or possibly a function of slightly differing purposes assigned to the works,
distinguish between the works. For example, the frequency (per 10000 words) of the
word הז in RP is 80 and in TL is 116. A cursory glance at the texts is enough to
establish why this is the case: the author of TL ended every responsum with the phrase
םולש הז היהו, thus artificially inflating the frequency of these words. Indeed the
presence or absence of this phrase alone is enough to allow highly accurate
classification of a given responsum as either RP or TL. Once features of this sort are
eliminated, the works become indistinguishable – a phenomenon which does not
occur when we compare TL to each of the other collections. In other words, many
features can be used to distinguish TL from works in our corpus other than RP, but
only a few distinguish TL from RP. Most features distribute similarly in RP and TL. A
wonderful illustrative example of this is the word וכו', the respective frequencies of which in the various corpora are as follows: TL: 29, RP: 28, Shoel veNishal: 4, Ginat Veradim: 4, Darhei Noam: 41, Zivhei Zedeq: 77.
We have shown elsewhere (Koppel and Schler 2004), that the evidence offered in
Figure 2 is sufficient to conclude that the authors of RP and TL are one and the same:
Ben Ish Chai.
Figure 2: Accuracy (y-axis, from 50% to 100%) on training data of learned models comparing TL to the other collections as the best features are eliminated, five per iteration (x-axis, iterations 1-11). The dotted line at the bottom is RP vs. TL.
Problem 3: Chronology and Dependence
Given three or more corpora, we can attempt to learn dependencies among the corpora
by checking pairwise similarities. To illustrate what we mean, we consider three sub-
corpora of Zohar, the central text of Jewish mysticism:
HaIdra (47 passages from Zohar vol. 3, pp. 127b-141a; 287b-296b)
Midrash HaNe'elam (67 passages from Zohar vol. 1, pp. 97a-140a)
Raya Mehemna (100 passages from Zohar vol. 3, 215b-283a)
For simplicity, we will refer to these three corpora as I, M and R, respectively. These
sub-corpora are of disputed provenance and their chronology and cross-influence are
not well established.
Lexical features were chosen in a similar fashion to that described above. Separate
models were constructed to distinguish between each pair from among the three
corpora. In five-fold cross-validation experiments on each pair, unseen documents
were classified with approximately 98% accuracy. In addition, degradation using
unmasking is slow. From this we conclude that these corpora were written by three
different authors.
The next stage of the experiment is an attempt to determine the relationship between the three corpora. We learn a model that distinguishes two of the corpora from each other and then use this model to classify the third corpus as more similar to one or the other. In this way we hope to determine possible dependencies among the corpora, as sketched below.
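The procedure amounts to training on two corpora and tallying how the passages of the third corpus fall; in the sketch below, train_fn and predict_fn stand in for the learner and classifier used above:

```python
def similarity_vote(corpus_x, corpus_y, corpus_z, train_fn, predict_fn):
    """Train a model that separates corpus X from corpus Y, then count how many
    passages of the third corpus Z are classified as closer to X or to Y."""
    model = train_fn(corpus_x, corpus_y)
    votes_for_x = sum(1 for passage in corpus_z
                      if predict_fn(model, passage) == "X")
    return votes_for_x, len(corpus_z) - votes_for_x
```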
In our initial experiments, nothing could be concluded: in each of the three experiments, the passages of the third corpus split about evenly between being more similar to the first and to the second. We then ran the experiment again, this time using grammatical prefixes and suffixes as features. Using the expanded feature set, we were able to distinguish pairwise between the corpora with the same 98% accuracy as with the original lexical feature set. However, the results of
the second experiment changed dramatically. When we learn models distinguishing
between R and M and then use them to classify I, all I passages are classified as closer
to R. Similarly, when we learn models distinguishing between R and I and then use
them to classify M, all M passages are classified as closer to R. However, when we
learn models distinguishing between M and I and then use them to classify R, the
results are ambiguous.
The reason for this is rooted in the fact that, like our other corpora, Zohar is written in
a dialect that combines Aramaic and Hebrew. One of the main distinguishing features
of Hebrew versus Aramaic is the use of certain affixes. For example, in Hebrew the plural noun suffixes are תו and םי, while in Aramaic ןי and אנ are used. Similarly, the relative pronoun ("which") is expressed in Hebrew by the prefix ש, while in Aramaic ד is used. We
find (see Table 2) that M is characterized by a large number of Hebrew affixes and I
is distinguished by a large number of Aramaic affixes. R falls neatly in the middle.
Feature     I        R        M
הש*          0.00     0.17     8.12
שו*          3.87     4.24     2.92
*תי          6.44    11.61     4.37
*וה         13.70    25.73     7.36
*יו          8.81     4.41     3.75
בו*         10.38    11.61     4.37
*תו          6.21    11.54    15.82
*תא          3.02     6.53     9.44
*םי          9.57    21.42    33.66
*אנ         32.92    14.87     7.84
יו*          2.34     4.65    10.73
*ןי         87.28    61.20    17.63
לש*          0.72     2.34     9.59
Table 2: Frequencies of prefixes and suffixes in the I, R, and M corpora.
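Affix frequencies of the kind shown in Table 2 can be computed with a simple counter over tokens. The sketch below is illustrative only: the prefix and suffix lists are a small subset rather than the full feature set used in the experiment, the per-10,000 normalization is an assumption, and the raw startswith/endswith test is deliberately crude.

```python
def affix_frequencies(text,
                      prefixes=("ש", "ו", "ב", "ד"),
                      suffixes=("ים", "ות", "ין", "נא"),
                      per=10000):
    """Count how often tokens begin or end with each affix, per `per` tokens."""
    tokens = text.split()
    total = max(len(tokens), 1)
    features = {}
    for p in prefixes:
        features["prefix_" + p] = sum(t.startswith(p) for t in tokens) * per / total
    for s in suffixes:
        features["suffix_" + s] = sum(t.endswith(s) for t in tokens) * per / total
    return features
```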
A number of possible conclusions might be drawn from this. For example, the
phenomena uncovered here might support the hypothesis that R lies chronologically
between M and I. However, scholars of this material believe that a more likely
interpretation is that M and I were contemporaneous and independent of each other and
that R was subsequent to both and may have drawn from each of them.
Problem 4: Assigning manuscript fragments
Our final experiment is a version of an attribution experiment. However, in this case
we wish to distinguish between different versions of the same text. The question is
whether we can exploit differences in orthographic style to correctly assign some text
fragment to the manuscript from which it was taken.
In our experiment, we are given four versions of the same Talmudic text (tractate
Rosh Hashana of the Babylonian Talmud), each version having been transcribed by a
different scribe. We break each of the four manuscripts into 67 fragments
(corresponding to pages in the printed version). The object is to determine from which
version a given fragment might have come.
Note that since we are distinguishing between different versions of the same texts, we
can't realistically expect lexical or morphological features to distinguish very well.
After all, the texts consist of the same words. Rather, the features that are likely to
help here are precisely those that were disqualified in our earlier experiments, namely,
orthographic ones.
Rather than identify these features manually, we proceeded as follows. First, we
simply gathered a list of all lexical features that appeared at least ten times in the
texts. Variant spellings of the same word were treated as separate features. In order to
identify promising features, we used an "instability" measure (Koppel et al, 2003) that
grants a high score to a feature that appears with different frequency in different
versions of the same document.
Specifically, let {d_1, d_2, …, d_n} be a set of texts (in our case n = 67) and let {d_i^1, d_i^2, …, d_i^m} be m > 1 different versions of d_i (in our case m = 4). For each feature c, let c_i^j be the relative frequency of c in document d_i^j. For multiple versions of a single text d_i, let k_i = Σ_j c_i^j and let H(c_i) = −Σ_j [(c_i^j / k_i) log(c_i^j / k_i)] / log m. (We can think of c_i^j / k_i as the probability that a random appearance of c in d_i is in version d_i^j, so that H(c_i) is just the usual entropy measure.) Thus, for example, if a feature c assumed the identical value in every version of a document d_i, H(c_i) would be 1. To extend the definition to the whole set {d_1, d_2, …, d_n}, let K = Σ_i k_i and let H(c) = Σ_i [(k_i / K) · H(c_i)]. Finally, let H'(c) = 1 − H(c). H'(c) does exactly what we want: features whose frequency varies across different versions of the same document score higher than those that have the same frequency in each version.
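The instability score H'(c) translates directly into code; the sketch below takes, for a single feature c, the relative frequencies of c in each of the m versions of each of the n texts:

```python
import math

def instability(freqs_by_text):
    """Compute H'(c) = 1 - H(c) for one feature c.
    freqs_by_text[i][j] is the relative frequency of c in version j of text i."""
    k = [sum(versions) for versions in freqs_by_text]   # k_i = sum over versions
    K = sum(k)
    if K == 0:
        return 0.0                                      # feature never occurs
    m = len(freqs_by_text[0])
    H = 0.0
    for versions, k_i in zip(freqs_by_text, k):
        if k_i == 0:
            continue                                    # contributes zero weight
        h_i = -sum((c / k_i) * math.log(c / k_i)
                   for c in versions if c > 0) / math.log(m)
        H += (k_i / K) * h_i
    return 1.0 - H
```

A feature that appears with the same relative frequency in every version of every text gets H'(c) = 0, while a feature concentrated in a single version gets a score close to 1.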
We then ranked all features according to H'(c). Those that ranked highest were those
that permitted variant orthographic representations. In particular, some scribes used
abbreviations or acronyms or non-standard spellings in places where other scribes did
not. We chose as our feature set the 200 highest-ranked features according to H' in the training corpus. Using Naïve Bayes on this feature set in five-fold cross-validation experiments yielded accuracy of 85.4%.
Thus, by and large, we are able to correctly assign a fragment to its manuscript of origin. This work recapitulates and extends, in automated fashion, a significant amount of research carried out manually by scholars of Talmudic literature (Friedman 1996).
Among the main distinguishing features we find different substitutions for the Name
of God ( ה ,'יי ,'י"י''' , ), variant abbreviations ( תכד ,'יתכד ,'ביתכד ), and a number of
acronyms ( א"ר ,ת"ר ,מ"ט ,ת"ש ,ס"ד ) used in some manuscripts but not in others.
There is one major limitation to the approach we used here. We assume that within a
given manuscript the frequency of a given feature is reasonably invariant from
fragment to fragment. This is only true if we are considering various versions of a
single thematically homogeneous text. If we wish to train on versions of various texts
as a basis for identifying the scribe/editor of a manuscript of a different text, we need to make a more realistic assumption. This can be done by normalizing our feature
frequencies differently: we must count the number of appearances of a particular
orthographic variant of a word in a manuscript fragment relative to the total number
of appearances of all variants of that word in the fragment. This value should indeed
remain reasonably constant for a single scribe/editor across all texts.
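A sketch of this alternative normalization follows: for each underlying word, the feature value of every orthographic variant is its share of all occurrences of any variant of that word in the fragment. The variant_groups mapping is a hypothetical stand-in for a manually compiled table of variant spellings.

```python
from collections import Counter

def variant_frequencies(tokens, variant_groups):
    """variant_groups maps an underlying word to the list of its orthographic
    variants; each variant's feature value is its share of all occurrences of
    that word's variants in this fragment."""
    counts = Counter(tokens)
    features = {}
    for word, variants in variant_groups.items():
        total = sum(counts[v] for v in variants)
        for v in variants:
            features[(word, v)] = counts[v] / total if total else 0.0
    return features
```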
Conclusions
We have shown that the range of issues considered in the field of text categorization
can be significantly broadened to include problems of great importance to scholars in
the humanities. Methods already used in text categorization require a bit of adaptation
to handle these problems. First, the proper choice of feature sets (lexical,
morphological and orthographic) is required. In addition, juxtaposition of a variety of
classification experiments can be used to handle issues of pseudonymous writing,
chronology and other problems in surprising ways. We have seen that for a variety of
textual problems concerning Hebrew-Aramaic texts, proper selection of feature sets
combined with these new techniques can yield results of great use to scholars in these
areas.
References
Argamon-Engelson, S., M. Koppel, G. Avneri (1998). Style-based text categorization: What newspaper
am I reading?, in Proc. of AAAI Workshop on Learning for Text Categorization, 1998, pp. 1-4
Baayen, H., H. van Halteren, F. Tweedie (1996). Outside the cave of shadows: Using syntactic
annotation to enhance authorship attribution, Literary and Linguistic Computing, 11, 1996.
Ben-David, Y. L. (2002). Shevet mi-Yehudah (in Hebrew), Jerusalem, 2002 (no publisher listed)
Biber, D., S. Conrad, R. Reppen (1998). Corpus Linguistics: Investigating Language Structure and
Use, (Cambridge University Press, Cambridge, 1998).
Dagan, I., Y. Karov, D. Roth (1997), Mistake-driven learning in text categorization, in EMNLP-97:
2nd Conf. on Empirical Methods in Natural Language Processing, 1997, pp. 55-63.
de Vel, O., A. Anderson, M. Corney and George M. Mohay (2001). Mining e-mail content for author
identification forensics. SIGMOD Record 30(4), pp. 55-64
Epstein, Y.N. (1957). Mevo'ot le-Sifrut ha-Tana'im, Jerusalem, 1957
Friedman, S. (1996) The Manuscripts of the Babylonian Talmud: A Typology Based Upon
Orthographic and Linguistic Features. In: Bar-Asher, M. (ed.) Studies in Hebrew and Jewish
Languages Presented to Shelomo Morag [in Hebrew], p. 163-190. Jerusalem, 1996.
Holmes, D. (1998). The evolution of stylometry in humanities scholarship, Literary and Linguistic
Computing, 13, 3, 1998, pp. 111-117.
Koppel, M., N. Akiva and I. Dagan (2003), A corpus-independent feature set for style-based text
categorization, in Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis
and Synthesis, Acapulco, Mexico.
Koppel, M., S. Argamon, A. Shimony (2002). Automatically categorizing written texts by author
gender, Literary and Linguistic Computing 17,4, Nov. 2002, pp. 401-412
Koppel, M., D. Mughaz and J. Schler (2004). Text categorization for authorship verification in Proc.
8th Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, 2004.
Koppel, M. and J. Schler (2004), Authorship Verification as a One-Class Classification Problem, to appear
in Proc. of ICML 2004, Banff, Canada
Lewis, D., R. Schapire, J. Callan, R. Papka (1996). Training algorithms for text classifiers, in Proc.
19th ACM/SIGIR Conf. on R&D in IR, 1996, pp. 298-306.
Littlestone, N. (1987). Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm, Machine Learning, 2, 4, 1987, pp. 285-318.
Matthews, R. and Merriam, T. (1993). Neural computation in stylometry: An application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing, 8(4):203-209.
McEnery, A., M. Oakes (2000). Authorship studies/textual statistics, in R. Dale, H. Moisl, H. Somers
eds., Handbook of Natural Language Processing (Marcel Dekker, 2000).
Merriam, T. and Matthews, R. (1994). Neural computation in stylometry: An application to the works of Shakespeare and Marlowe. Literary and Linguistic Computing, 9(1):1-6.
Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist.
Reading, Mass. : Addison Wesley, 1964.
Mughaz, D. (2003). Classification Of Hebrew Texts according to Style, M.Sc. Thesis, Bar-Ilan
University, Ramat-Gan, Israel, 2003.
Radai, Y. (1978). Hamikra haMemuchshav: Hesegim Bikoret uMishalot (in Hebrew), Balshanut Ivrit
13: 92-99
Radai, Y. (1979). Od al Hamikra haMemuchshav (in Hebrew), Balshanut Ivrit 15: 58-59
Radai, Y. (1982). Mikra uMachshev: Divrei Idkun (in Hebrew), Balshanut Ivrit 19: 47-52
Sebastiani, F. (2002). Machine learning in automated text categorization, ACM Computing Surveys 34
(1), pp. 1-45
Stamatatos, E., N. Fakotakis & G. Kokkinakis, (2001). Computer-based authorship attribution without
lexical measures, Computers and the Humanities 35, pp. 193—214.
Tishbi, Y. (1949). Mishnat haZohar (in Hebrew), Magnes: Jerusalem, 1949.
Urbach, E. E. (1954). Baalei haTosafot (in Hebrew), Bialik: Jerusalem, 1954.
Wolters, M. and Kirsten, M. (1999): Exploring the Use of Linguistic Features in Domain and Genre
Classification, Proceedings of the Meeting of the European Chapter of the Association for
Computational Linguistics
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Journal of Information
Retrieval, Vol 1, No. 1/2, pp 67--88, 1999.
Yang, Y. and Pedersen, J.O. (1997). A comparative study on feature selection in text categorization,
Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420.
Yule, G.U. (1938). "On Sentence Length as a Statistical Characteristic of Style in Prose with
Application to Two Cases of Disputed Authorship", Biometrika, 30, 363-390, 1938.
... Automatic classification of Hebrew-Aramaic texts is almost an uninvestigated research domain. CHAT, a system for stylistic classification of Hebrew-Aramaic texts is presented in[15,16,20]. It presents applications of several TC tasks to Hebrew-Aramaic texts: 1. ...
... Various machine learning methods have been applied for TC[29], e.g.: Naïve Bayes[28], C4.5[8]and Winnow[15,16,20]. However, the Support Vector Machines (SVM) method[4,31]has been chosen to be applied in this model since it seems to be the most successful for TC[5,6,10,11,33]. ...
Conference Paper
Full-text available
Text classification is an important and challenging research domain. In this paper, identifying historical period and ethnic origin of documents using stylistic feature sets is investigated. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present various interesting problems for stylistic classification. Firstly, these documents include words from both languages. Secondly, Hebrew and Aramaic are richer than English in their morphology forms. The classification is done using six different sets of stylistic features: quantitative features, orthographic features, topographic features, lexical features and vocabulary richness. Each set of features includes various baseline features, some of them formalized by us. SVM has been chosen as the applied machine learning method since it has been very successful in text classification. The quantitative set was found as very successful and superior to all other sets. Its features are domain-independent and language-independent. It will be interesting to apply these feature sets in general and the quantitative set in particular into other domains as well as into other.
... Another works that are related to document classification and address the challenges of Hebrew involve the classification of Hebrew-Aramaic documents according to style (Koppel, Mughaz, & Akiva, 2006;Mughaz, 2003); authorship verification, including forgers and pseudonyms (Koppel, Mughaz, & Akiva, 2003;Koppel, Schler, & Mughaz, 2004) HaCohen-Kerner et al. (2011) used six machine learning techniques for identifying citations. To achieve this goal they used four feature types, n-gram, stop word-based, quantitative and orthographic, and tested them separately and together. ...
Conference Paper
Full-text available
Aim/Purpose: Finding and tagging citation on an ancient Hebrew religious document. These documents have no structured citations and have no bibliography. Background: We look for common patterns within Hebrew religious texts. Methodology: We developed a method that goes over the texts and extracts sentences con-taining the names of three famous authors. Within these sentences we find common ways of addressing those three authors and with these patterns we find references to various other authors. Contribution: This type of text is rich in citations and references to authors, but because there is no structure of references it is very difficult for a computer to automatically identify the references. We hope that with the method we have developed it will be easier for a computer to identify references and even turn them into hyper-links. Findings: We have provided an algorithm to solve the problem of non-structured cita-tions in an old Hebrew plain text. The algorithm definitely was able to find many citations but it has missed out some types of citations. Impact on Society: When the computer recognizes references, it will be able to build (at least par-tially) a bibliography that currently does not exist in such texts at all. Over time, OCR scans more and more ancient texts. This method can make people's access and understanding much. Future Research: After we identify the references, we plan to automatically create a bibliography for these texts and even transform those references into hyperlinks.
... The use of the set of affixes proved itself, and the affixes were the dominant features of the identification task. Kopel [26,27,28] continued this work and developed an unmasking method (or method for identifying counterfeits) according to the style of writing. ...
Article
Full-text available
This article presents a unique method in text and data mining for finding the era, i.e., mining temporal data, in which an anonymous author was living. Finding this era can assist in the examination of a fake document or extracting the time period in which a writer lived. The study and the experiments concern Hebrew, and in some parts, Aramaic and Yiddish rabbinic texts. The rabbinic texts are undated and contain no bibliographic sections, posing an interesting challenge. This work proposes algorithms using key phrases and key words that allow the temporal organization of citations together with linguistic patterns. Based on these key phrases, key words, and the references, we established several types of “Iron-clad,” Heuristic and Greedy rules for estimating the years of birth and death of a writer in an interesting classification task. Experiments were conducted on corpora, including documents authored by 12, 24, and 36 rabbinic writers and demonstrated promising results.
... Several works presented by HaCohen-Kerner et al. [7,8] use stylistic feature sets. Other works presented by Koppel et al., [14,15] use a few hundreds of single words. HaCohen-Kerner et al. [6] investigate as a supervised ML task the use of 'bag of words' for the classification of Hebrew-Aramaic documents according to their historical period and the ethnic origin of their authors. ...
Article
Authorship attribution of text documents is a “hot” domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The application domain is RESPONSA, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 RESPONSA, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. The clustering tasks that have been performed were according to two or three or four or five authors. Clustering has been performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (HVW); and two unsupervised machine learning methods: K-means and Expectation Maximization (EM). The best clustering tasks according to two or three or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the “majority” (baseline) results. The EM method has been found to be superior to K-means for the discussed tasks. FW has been found as the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution, using supervised ML methods, and genre and sentiment classification.
... Several works presented by HaCohen-Kerner et al. [7,8] use stylistic feature sets. Other works presented by Koppel et al., [14,15] use a few hundreds of single words. HaCohen-Kerner et al. [6] investigate as a supervised ML task the use of 'bag of words' for the classification of Hebrew-Aramaic documents according to their historical period and the ethnic origin of their authors. ...
Conference Paper
This research investigates whether it is appropriate to use word lists as features for clustering documents to their authors, to the documents' countries of origin or to the historical periods in which they were written. We have defined three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (VFW). The application domain is articles referring to Jewish law written in Hebrew and Aramaic. The clustering experiments have been done using The EM algorithm. To the best of our knowledge, performing clustering tasks according to countries or periods are novel. The improvement rates in these tasks vary from 11.53% to 39.43%. The clustering tasks according to 2 or 3 authors achieved results above 95% and present superior improvement rates (between 15.61% and 56.51%); most of the improvements have been achieved with FW and VFW. These findings are surprising and contrast the initial assumption that FFW is the prime word list for clustering tasks.
... CHAT, a system for stylistic classification of Hebrew-Aramaic texts is presented in [27,28,32]. CHAT present applications of several TC tasks to Hebrew- Aramaic texts: ...
Article
We use text classification to distinguish automatically between original and translated texts in Hebrew, a morphologically complex language. To this end, we design several linguistically informed feature sets that capture word-level and sub-word-level (in particular, morphological) properties of Hebrew. Such features are abstract enough to allow for the development of accurate, robust classifiers, and they also lend themselves to linguistic interpretation. Careful evaluation shows that some of the classifiers we define are, indeed, highly accurate, and scale up nicely to domains that they were not trained on. In addition, analysis of the best features provides insight into the morphological properties of translated texts.
Article
Full-text available
One of the crowning achievements of Yaacov Choueka’s illustrious career has been his guidance of the Bar-Ilan Responsa project from a fledgling research project to a major enterprise awarded the Israel Prize in 2008. Much of the early work on the Responsa project ultimately proved to be foundational in the now burgeoning area of information retrieval, the science of searching large digitized corpora for information. In this paper, I will very briefly review some of the project’s achievements and will discuss some of the directions the project might consider in order to meet ongoing challenges. (The reader wishing to read an insider’s detailed review of the project’s achievements and challenges is referred to (Choueka 1990).) The Responsa project was initiated by Aviezri Fraenkel in 1963, well before massive searchable text corpora became commonplace. In order to appreciate the challenges faced by researchers involved with the Responsa project in those early days, it is instructive to compare the corpus to the most well-known corpus extant at the time, namely, the Brown corpus developed at Brown University (Kucera & Francis 1967).
Article
Full-text available
The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98 per cent accuracy.
Article
Full-text available
We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors for the case of both aggregated e-mail topics as well as across different e-mail topics. An extended set of e-mail document features including structural characteristics and linguistic patterns were derived and, together with a Support Vector Machine learning algorithm, were used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
Article
Full-text available
The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it feasible to limit word features to content words for text classifications? This is examined for 5 domain and 4 genre classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither question can be answered reliably for any of the tasks. However, the results suggest that both questions have to be examined separately for each task at hand, because in some cases, the additional information can indeed improve performance.
Article
This session describes an experiment in author- ship attribution in which statistical measures and methods that have been widely applied to words and their frequencies of use are applied to rewrite rules as they appear in a syntactically annotated corpus. The outcome of this experiment suggests that the frequencies with which syntactic rewrite rules are put to use provide at least as good a cue to authorship as word usage. Moreover, one me- thod, which focuses on the use of the lowest-fre- quency syntactic rules, has a higher resolution than traditional word-based analyses, and promi- ses to be a useful new technique for authorship attribution.
Article
This paper traces the historical development of the use of statistical methods in the analysis of literary style. Commencing with stylometry’s early origins, the paper looks at both successful and unsuccessful applications, and at the internal struggles as statisticians search for a proven methodology. The growing power of the computer and the ready availability of machine-readable texts are transforming modern stylometry, which has now attracted the attention of the media. Stylometry’s interaction with more traditional literary scholarship is also discussed.