
Moshe Koppel - Bar Ilan University
About
148 Publications · 79,676 Reads
6,560 Citations
Publications (148)
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new languag...
We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew, outperforming existing models on most benchmarks. Additionally, we release two fine-tuned versions of the model, designed to perform two specific foundational tasks in the analysis of Hebrew texts: prefix segmentation and morphological tagging. These fine-tuned m...
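For orientation, here is a minimal sketch of querying such a model through the Hugging Face transformers fill-mask pipeline. The model identifier "dicta-il/dictabert" is an assumption about where the released checkpoint is hosted, not something stated in the abstract; substitute the actual model card if it differs.

```python
# Minimal sketch: querying a Hebrew masked-language model with the
# transformers fill-mask pipeline. The model ID below is assumed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dicta-il/dictabert")

# Ask the model to complete a masked Hebrew sentence ("[MASK] is a big city").
for prediction in fill_mask("[MASK] היא עיר גדולה"):
    print(prediction["token_str"], round(prediction["score"], 3))
```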
We present a new pre-trained language model (PLM) for modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than standard Hebrew PLMs before. We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performan...
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel (BERT Embeddings for Rabbinic-Encoded Language). Whilst other PLMs exist for processing Hebrew texts (e.g., HeBERT, AlephBert), they are all trained on modern Hebrew texts, which diverges substantially from Rabbinic Hebrew in terms of its lexicographical, morphologi...
One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not exist sufficient examples of the minority analyses in order to properly evaluate performance, nor to tr...
We consider a thought experiment in which voters could submit binary preferences regarding each of a pre-determined list of independent relevant issues, so that majorities could be tallied per issue. It might be thought that if such voting became technically feasible and widespread, parties and coalitions could be circumvented altogether and would...
Many classical texts are available in multiple versions that almost always differ from each other due to transcription error and editorial discretion. One of the central challenges in the study of such texts is the preparation of a ‘synoptic’ text: an aligned presentation of the various versions in which corresponding words or phrases, even if not...
We present a system for automatic diacritization of Hebrew text. The system combines modern neural models with carefully curated declarative linguistic knowledge and comprehensive manually constructed tables and dictionaries. Besides providing state of the art diacritization accuracy, the system also supports an interface for manual editing and cor...
Brain-computer interfaces (BCIs) have been employed to provide different patient groups with communication and control that does not require the use of limbs that have been damaged. In this study, we explored BCI-based navigation in three long term amputees. Each participant attempted motor execution with the affected limb, and performed motor exec...
The identification of pseudepigraphic texts—texts not written by the authors to which they are attributed—has important historical, forensic, and commercial applications. Any method for identifying such pseudepigrapha must ultimately depend on some measure of a given document’s similarity to the other documents in a corpus. We show that for this pu...
This paper demonstrates the use of genetic algorithms for evolving: 1) a grandmaster-level evaluation function, and 2) a search mechanism for a chess program, the parameter values of which are initialized randomly. The evaluation function of the program is evolved by learning from databases of (human) grandmaster games. At first, the organisms are...
This paper demonstrates the use of genetic algorithms for evolving a grandmaster-level evaluation function for a chess program. This is achieved by combining supervised and unsupervised learning. In the supervised learning phase the organisms are evolved to mimic the behavior of human grandmasters, and in the unsupervised learning phase these evolv...
In this paper we demonstrate how genetic algorithms can be used to reverse engineer an evaluation function's parameters for computer chess. Our results show that using an appropriate mentor, we can evolve a program that is on par with top tournament-playing chess programs, outperforming a two-time World Computer Chess Champion. This performance gai...
In this paper we demonstrate how genetic algorithms can be used to reverse engineer an evaluation function's parameters for computer chess. Our results show that using an appropriate expert (or mentor), we can evolve a program that is on par with top tournament-playing chess programs, outperforming a two-time World Computer Chess Champion. This per...
We consider several fundamental principles of rabbinic approaches to handling uncertainty for legal purposes. We find that non-numerical versions of ideas subsequently developed in the literature on interpretations of probabilistic statements are useful for explicating these rabbinic principles.
Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it...
Purpose: Social network sites have been widely adopted by politicians in the last election campaigns. To increase the effectiveness of these campaigns, the potential electorate must be identified, since targeted ads are much more effective than non-targeted ads. Therefore, the purpose of this paper is to propose and implement a new methodology for auto...
Purpose: Reliability and political bias of mass media has been a controversial topic in the literature. The purpose of this paper is to propose and implement a methodology for fully automatic evaluation of the political tendency of the written media on the web, which does not rely on subjective human judgments.
Design/methodology/approach: The u...
In this paper, we shed new light on the authenticity of the Corpus Caesarianum, a group of five commentaries describing the campaigns of Julius Caesar (100-44 BC), the founder of the Roman empire. While Caesar himself has authored at least part of these commentaries, the authorship of the rest of the texts remains a puzzle that has persisted for ni...
Introduction: One of the main goals of brain computer interface (BCI) research is providing disabled patients with some levels of communication, control of external devices, and mobility. For such patients to use BCI we need to address two major questions: i) can motor brain circuits that have been deprived of input following trauma still be used f...
We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are the representation of each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at m...
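As a rough illustration of the matching idea sketched in this abstract (not the authors' implementation), the following reduces each word to its two rarest letters and indexes four-word windows by leave-one-out keys, so that windows differing in a single word still collide in the index:

```python
# Illustrative sketch only: two-rarest-letters word representation plus
# leave-one-out indexing of 4-word windows to find candidate parallels.
from collections import Counter
from itertools import combinations

def reduce_word(word, letter_freq):
    # Keep the two least frequent letters of the word, in canonical order.
    letters = sorted(set(word), key=lambda c: (letter_freq[c], c))[:2]
    return "".join(sorted(letters))

def candidate_parallels(docs, window=4):
    letter_freq = Counter(c for doc in docs for word in doc for c in word)
    reduced = [[reduce_word(w, letter_freq) for w in doc] for doc in docs]
    index = {}
    for d, doc in enumerate(reduced):
        for i in range(len(doc) - window + 1):
            win = tuple(doc[i:i + window])
            # Index each window once per "skipped" position, so windows
            # differing in at most one word share at least one key.
            for skip in range(window):
                key = win[:skip] + win[skip + 1:]
                index.setdefault(key, set()).add((d, i))
    pairs = set()
    for positions in index.values():
        for (d1, i1), (d2, i2) in combinations(sorted(positions), 2):
            if d1 != d2:
                pairs.add(((d1, i1), (d2, i2)))
    return pairs

# Toy usage with two tokenized "documents" that differ in one word.
docs = ["in the beginning god created the heavens".split(),
        "in the beginning god made the heavens".split()]
print(candidate_parallels(docs))
```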
We have developed an automated method to separate biblical texts according to author or scribal school. At the core of this new approach is the identification of correlations in word preference that are then used to quantify stylistic similarity between sections. In so doing, our method ignores literary features, such as possible repetitions, narrat...
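A very rough sketch of the general idea (not the authors' pipeline): represent each section by its relative word frequencies, score pairwise stylistic similarity, and cluster the sections into two putative authorial groups. The sample sections and all parameter choices here are illustrative.

```python
# Sketch, assuming word-frequency vectors and cosine similarity as the
# stylistic measure; the published method's features and clustering differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

sections = [
    "and god said let there be light and there was light",
    "these are the generations of the heavens and of the earth",
    "and god saw the light that it was good",
    "in the day that the lord god made the earth and the heavens",
]

# Relative frequencies of words in each section.
counts = CountVectorizer().fit_transform(sections).toarray().astype(float)
freqs = counts / counts.sum(axis=1, keepdims=True)

# Cluster sections into two groups by cosine distance
# (older scikit-learn versions call the `metric` parameter `affinity`).
labels = AgglomerativeClustering(n_clusters=2, metric="cosine",
                                 linkage="average").fit_predict(freqs)
print(labels)
```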
We discuss a real-world application of a recently proposed machine learning method for authorship verification. Authorship verification is considered an extremely difficult task in computational text classification, because it does not assume that the correct author of an anonymous text is included in the candidate authors available. To determine w...
In this paper we consider automatic political tendency recognition in a variety of genres. To this end, four different types of texts in Hebrew with varying levels of political content (manifestly political, semipolitical, non-political) are examined. It is found that in each case, training and testing in the same genre yields strong results. More...
We show how automatically extracted citations in historical corpora can be used to measure the direct and indirect influence of authors on each other. These measures can in turn be used to determine an author's overall prominence in the corpus and to identify distinct schools of thought. We apply our methods to two major historical corpora. Using s...
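Purely as an illustration of how direct citations can be turned into an overall prominence score (the paper's own direct/indirect influence measures may differ), one could run a PageRank-style walk over the citation graph; the toy edges below are hypothetical.

```python
# Illustration only: author prominence via PageRank over a citation graph.
# An edge A -> B means A cites B, so score flows toward cited authors.
import networkx as nx

citations = [("Rashi", "Rif"), ("Tosafot", "Rashi"),
             ("Rosh", "Rif"), ("Rosh", "Tosafot")]

graph = nx.DiGraph(citations)
prominence = nx.pagerank(graph)
for author, score in sorted(prominence.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {score:.3f}")
```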
Objective. We have developed an efficient generic machine learning (ML) tool for real-time fMRI whole brain classification, which can be used to explore novel brain-computer interface (BCI) or advanced neurofeedback (NF) strategies. Approach. We use information gain for isolating the most relevant voxels in the brain and a support vector machine cl...
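A small sketch of the described pipeline on synthetic data: a mutual-information filter (used here as a stand-in for the paper's information-gain voxel selection) followed by a linear support vector machine. Data shapes and the number of selected voxels are arbitrary.

```python
# Sketch: voxel selection by mutual information + linear SVM, on fake data.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))          # 200 scans x 5000 voxels
y = rng.integers(0, 2, size=200)          # two mental states
X[y == 1, :20] += 1.0                     # make 20 voxels informative

clf = make_pipeline(SelectKBest(mutual_info_classif, k=50),
                    SVC(kernel="linear"))
print(cross_val_score(clf, X, y, cv=5).mean())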
We show that it is possible to automatically detect machine translated text at sentence level from monolingual corpora, using text classification methods. We show further that the accuracy with which a learned classifier can detect text as machine translated is strongly correlated with the translation quality of the machine translation system that...
Objective. We have developed a brain–computer interface (BCI) system based on real-time functional magnetic resonance imaging (fMRI) with virtual reality feedback. The advantage of fMRI is the relatively high spatial resolution and the coverage of the whole brain; thus we expect that it may be used to explore novel BCI strategies, based on new type...
Felsenthal and Machover have made substantial contributions to the measurement of voting power. It is worth bearing in mind, however, that the notion of political power is actually a quite general one of which voting power is one instantiation. In this brief paper, we consider political power in the general sense and propose a definition. We will s...
Almost any conceivable authorship attribution problem can be reduced to one fundamental problem: whether a pair of (possibly short) documents were written by the same author. In this article, we offer an (almost) unsupervised method for solving this problem with surprisingly high accuracy. The main idea is to use repeated feature subsampling method...
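A hedged sketch of the repeated feature subsampling idea: over many random feature subsets, check whether the second document is closer to the first than any impostor document is, and treat a high win rate as evidence of shared authorship. The character-trigram features, the 0.5 threshold, and the toy texts are illustrative choices, not the paper's settings.

```python
# Sketch of repeated feature subsampling for the "same author?" question.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def same_author_score(doc_a, doc_b, impostors, trials=100, keep=0.4, seed=0):
    texts = [doc_a, doc_b] + list(impostors)
    X = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(texts).toarray()
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    wins = 0
    for _ in range(trials):
        cols = rng.choice(n_features, size=int(keep * n_features), replace=False)
        sims = cosine_similarity(X[0:1, cols], X[1:, cols])[0]
        wins += int(np.argmax(sims) == 0)   # is doc_b the closest to doc_a?
    return wins / trials

score = same_author_score("the quick brown fox jumps over the lazy dog",
                          "a quick brown fox leapt over a lazy dog",
                          ["completely unrelated text about stock markets",
                           "another impostor document concerning chess engines"])
print("same author" if score > 0.5 else "different authors", score)
```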
One of the crowning achievements of Yaacov Choueka’s illustrious career has been his guidance of the Bar-Ilan Responsa project from a fledgling research project to a major enterprise awarded the Israel Prize in 2008. Much of the early work on the Responsa project ultimately proved to be foundational in the now burgeoning area of information retriev...
Given an unsegmented multi-author text, we wish to automatically separate out distinct authorial threads. We present a novel, entirely unsupervised, method that achieves strong results on multiple testbeds, including those for which authorial threads are topically identical. Unlike previous work, our method requires no specialized linguistic tools...
Full natural language understanding requires identifying and analyzing the meanings of metaphors, which are ubiquitous in both text and speech. Over the last thirty years, linguistic metaphors have been shown to be based on more general conceptual metaphors, partial semantic mappings between disparate conceptual domains. Though some achievements ha...
This paper considers four versions of the authorship attribution problem that are typically encountered in the forensic context and offers algorithmic solutions for each. Part I describes the simple authorship attribution problem described above. Part II considers the long-text verification problem, in which we are asked if two long texts are by th...
Distinguishing between literal and metaphorical language is a major challenge facing natural language processing. Heuristically, metaphors can be divided into three general types in which type III metaphors are those involving an adjective-noun relationship (e.g. “dark humor”). This paper describes our approach for automatic identification of type...
Attribution of anonymous texts, if not based on factors external to the text (such as paper and ink type or document provenance, as used in forensic document examination), is largely, if not entirely, based on considerations of language style. We will consider here the question of how to best deconstruct a text into quantitative features for purpos...
Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author's unique "signature", and show that such signatures are typical of many authors when...
Given a multi-author document, we use unsupervised methods to identify distinct authorial threads. Although this problem is of great practical interest for security and forensic reasons, as well as for commercial purposes, this paper is, to the best of our knowledge, the first presentation of a general-purpose method for solving it.
We introduce the “fundamental problem” of authorship attribution: determining if two, possibly short, documents were written by a single author. A solution to this problem can serve as a building block for solving almost any conceivable authorship attribution problem. Our preliminary work on this problem is based on earlier work in authorship attri...
We define a family of solutions for n-person bargaining problems which generalizes the discrete Raiffa solution and approaches the continuous Raiffa solution. Each member of this family is a stepwise solution, which is a pair of functions: a step-function that determines a new disagreement point for a given bargaining problem, and a solution functi...
Given a corpus of financial news items labelled according to the market reaction following their publication, we investigate 'contemporaneous' and forward-looking stock price movements. Our approach is to provide a pool of relevant textual features to a machine learning algorithm to detect substantial stock price variations. Our two working hypothes...
The Fourth International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN10) was held in conjunction with the 2010 Conference on Multilingual and Multimodal Information Access Evaluation (CLEF-10) in Padua, Italy. The workshop was organized as a competition covering two tasks: plagiarism detection and Wikipedia vandali...
Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in the ...
Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual au...
In this paper we demonstrate how genetic algorithms can be used to reverse engineer an evaluation function’s parameters for computer chess. Our results show that using an appropriate expert (or mentor), we can evolve a program that is on par with top tournament-playing chess programs, outperforming a two-time World Computer Chess Champion. This per...
The Babylonian Talmud, compiled from the 2nd to 7th centuries C.E., is the primary source for all subsequent Jewish laws. It is not written in apodeictic style, but rather as a discursive record of (real or imagined) legal (and other) arguments crossing a wide range of technical topics. Thus, it is not a simple matter to infer general methodologica...
In certain judgmental situations where a "correct" decision is presumed to exist, optimal decision making requires evaluation of the decision-makers' capabilities and the selection of the appropriate aggregation rule. The major and so far unresolved difficulty is the former necessity. This article presents the optimal aggregation rule that simultan...
Given the judgments of multiple voters regarding some issue, it is generally assumed that the best way to arrive at some collective judgment is by following the majority. We consider here the now common case in which each voter expresses some (binary) judgment regarding each of a multiplicity of independent issues and assume that each voter has som...
While it has often been observed that the product of translation is somehow different than non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the pr...
In this paper we introduce a novel method for automatically tuning the search parameters of a chess program using genetic algorithms. Our results show that a large set of parameter values can be learned automatically, such that the resulting performance is comparable with that of manually tuned parameters of top tournament-playing chess programs.
In this paper we introduce a novel method for automatically tuning the search parameters of a chess program using genetic algorithms (GA). Our results show that a large set of parameter values can be learned automatically, such that the resulting performance is comparable with that of manually tuned parameters of top tournament-playing chess progra...
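A skeleton of the kind of genetic-algorithm loop described here, over a vector of search parameters. The fitness function below is a placeholder; in the papers it is derived from actual play against, or agreement with, a strong reference program, and all constants are illustrative.

```python
# Skeleton GA over search-parameter vectors; fitness() is a stub.
import random

N_PARAMS, POP_SIZE, GENERATIONS = 6, 20, 50

def fitness(params):
    # Placeholder: reward closeness to an arbitrary "good" setting.
    target = [50, 90, 200, 30, 8, 3]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def mutate(params, rate=0.2):
    return [p + random.randint(-10, 10) if random.random() < rate else p
            for p in params]

def crossover(a, b):
    cut = random.randrange(1, N_PARAMS)
    return a[:cut] + b[cut:]

population = [[random.randint(0, 255) for _ in range(N_PARAMS)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print(max(population, key=fitness))
```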
The computational analysis of the style of natural language texts, computational stylistics, seeks to develop automated methods to (1) effectively distinguish texts with one stylistic character from those of another, and (2) give a meaningful representation of the differences between textual styles. Such methods have many potential applications in...
The main aim of this paper is to study the power of legislators in the Lower House of the Czech Parliament in 1996–2004 with respect to power distribution and its uncertainty. A discrepancy between a priori computed power indices and the outcome of voting leads to the necessity of identifying the possible source of uncertainty. This paper studies uncertainty in...
In certain judgmental situations where a “correct” decision is presumed to exist, optimal decision making requires evaluation of the decision-maker's capabilities and the selection of the appropriate aggregation rule. The major and so far unresolved difficulty is the former necessity. This paper presents the optimal aggregation rule that simultaneo...
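For context (the abstract is truncated and the notation here is mine), the classical reference point is that when each voter's competence is already known, expected accuracy is maximized by a log-odds weighted majority; the paper addresses the harder case in which the competences must be estimated as part of the aggregation.

```latex
% Known result (Nitzan & Paroush, 1982): with independent voters whose
% competences p_i are known, weight each binary judgment d_i \in \{-1,+1\}
% by the log-odds of that voter being correct.
\[
  \hat{d} \;=\; \operatorname{sign}\!\Big(\sum_{i=1}^{n} w_i\, d_i\Big),
  \qquad
  w_i \;=\; \log\frac{p_i}{1-p_i}.
\]
```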
Prominent formal thought disorder, expressed as unusual language in speech and writing, is often a central feature of schizophrenia. Since a more comprehensive understanding of the phenomenology surrounding thought disorder is needed, this study investigates these processes by examining writing in schizophrenia using a novel computer-aided analysis. Thirty-...
The measurement of disproportionality, volatility and malapportionment often employs similar indices. Yet the debate on the issue of adequate measurement has remained open. We offer a formal and rigorous list of properties that roughly subsume those of Taagepera and Grofman (Party Polit 9(6):659–677, 2003). One of these properties, Dalton's principl...
This paper demonstrates the use of genetic algorithms for evolving a grandmaster-level evaluation function for a chess program. This is achieved by combining supervised and unsupervised learning. In the supervised learning phase the organisms are evolved to mimic the behavior of human grandmasters, and in the unsupervised learning phase these evolv...
The authorship profiling problem is of growing importance in the global information environment, and can help police identify characteristics of the perpetrator of a crime when there are specific suspects to consider. The approach is to apply machine learning to text categorization, for which the corpus of training documents, each labeled according to...
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typical...
We show how an Arabic language religious-political document can be automatically classified according to the ideological stream and organizational affiliation that it represents. Tests show that our methods achieve near-perfect accuracy.
In this paper we demonstrate how genetic algorithms can be used to reverse engineer an evaluation function's parameters for computer chess. Our results show that using an appropriate mentor, we can evolve a program that is on par with top tournament-playing chess programs, outperforming a two-time World Computer Chess Champion. This performance...
The goal of the workshop was to bring together experts and prospective researchers around the exciting and future-oriented topic of plagiarism analysis, authorship identification, and high similarity search. This topic receives increasing attention, which results, among other things, from the fact that information about nearly any subject can be found on the...
The growth of the blogosphere offers an unprecedented opportunity to study language and how people use it on a large scale. We present an analysis of over 140 million words of English text drawn from the blogosphere, exploring if and how age and gender affect writing style and topic. Our primary result is that a number of stylistic and content-base...
This paper proposes a method for automatic correction of bias in speaker recognition systems, especially fusion-based systems. The method is based on a post-classifier which learns the relative performance obtained by the constituent systems in key trials, given the training and testing conditions in which they occurred. These conditions generally...
This paper presents a simplified post-classification framework for enhancing the performance of a given speaker recognition classifier by means of other "auxiliary" classifiers. We call it Virtual Fusion, since the assisting classifiers are used only for training the post-classifier and are not necessary in operating mode. Experiments performed...
In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the "depth of difference" between two example sets and offer evidence that this method solves the authorship verif...
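One plausible reading of this "depth of difference" idea (it resembles the unmasking technique) is sketched below: repeatedly separate chunks of the two example sets with a linear model, remove the strongest features, and inspect how quickly the separation accuracy collapses. Chunk sizes, feature counts, and the toy texts are all illustrative.

```python
# Minimal unmasking-style sketch, not the exact published procedure.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(chunks_a, chunks_x, rounds=5, drop=3):
    texts = chunks_a + chunks_x
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    X = CountVectorizer(max_features=250).fit_transform(texts).toarray()
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(rounds):
        Xa = X[:, active]
        curve.append(cross_val_score(LinearSVC(dual=False), Xa, y, cv=3).mean())
        weights = np.abs(LinearSVC(dual=False).fit(Xa, y).coef_[0])
        active = active[np.argsort(weights)[:-drop]]   # drop strongest features
    return curve

# Toy usage: two lists of text chunks (in practice, slices of long texts).
print(unmasking_curve(["to be or not to be that is the question"] * 6,
                      ["call me ishmael some years ago never mind how long"] * 6))
```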
We present a web mining method for discovering and enhancing relationships in which a specified concept (word class) participates. We discover a whole range of relationships focused on the given concept, rather than generic known relationships as in most previous work. Our method is based on clustering patterns that contain concept words and...
In this paper we present a general machine learning framework for score bias reduction and analysis in speaker recognition systems. The general principle is to learn a meta-system using recognition systems' errors, given the training and testing conditions in which they occurred. In the context of speaker recognition, the proposed method is able to...
We introduce a new measure on linguistic features, called stability, which captures the extent to which a language element, such as a word or a syntactic construct, is replaceable by semantically equivalent elements. This measure may be perceived as quantifying the degree of available "synonymy" for a language item. We show that frequent but unstable fea...
In this paper, we use a blog corpus to demonstrate that we can often identify the author of an anonymous text even where there are many thousands of candidate authors. Our approach combines standard information retrieval methods with a text categorization meta-learning scheme that determines when to even venture a guess.
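A simplified sketch of that two-stage idea: rank candidate authors by similarity to the anonymous text, and only "venture a guess" when the best candidate clearly beats the runner-up. The fixed margin threshold here is a stand-in for the learned meta-classifier, and the candidate texts are toy data.

```python
# Sketch: similarity ranking over candidates plus a simple "when to guess" rule.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def attribute(anonymous_text, candidate_texts, margin=0.05):
    names = list(candidate_texts)
    vec = TfidfVectorizer(analyzer="char", ngram_range=(4, 4))
    X = vec.fit_transform([candidate_texts[n] for n in names] + [anonymous_text])
    sims = cosine_similarity(X[-1], X[:-1])[0]
    order = sims.argsort()[::-1]
    if sims[order[0]] - sims[order[1]] < margin:
        return None                     # not confident enough to guess
    return names[order[0]]

candidates = {"alice": "i rather think the garden is lovely this morning",
              "bob":   "the engine room reeked of oil and burnt copper wire",
              "carol": "quarterly revenue grew despite persistent headwinds"}
print(attribute("the garden smelled lovely in the morning light", candidates))
```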
Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutra...
Analysis of a corpus of tens of thousands of blogs - incorporating close to 300 million words - indicates significant differences in writing style and content between male and female bloggers as well as among authors of different ages. Such differences can be exploited to determine an unknown author's age and gender on the basis of a blog's vocabul...
This paper emphasizes the benefits of embedding data categorization within fusion of classifiers for text-independent speaker verification. A selective fusion framework is presented which considers data idiosyncrasies by assigning particular test samples to appropriate fusion schemes. As an extension, incompatible data can be spotted and excluded f...
In this paper, we show that stylistic text features can be exploited to determine an anonymous author's native language with high accuracy. Specifically, we first use automatic tools to ascertain frequencies of various stylistic idiosyncrasies in a text. These frequencies then serve as features for support vector machines that learn to classify tex...
This paper presents an improved speaker verification technique that is especially appropriate for surveillance scenarios. The main idea is a meta-learning scheme aimed at improving fusion of low- and high-level speech information. While some existing systems fuse several classifier outputs, the proposed method uses a selective fusion scheme...
Text authored by an unidentified assailant can offer valuable clues to the assailant's identity. In this paper, we show that stylistic text features can be exploited to determine an anonymous author's native language with high accuracy.
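A hedged sketch of the classification setup these abstracts describe: frequencies of simple stylistic markers (here just a handful of English function words; the papers use richer idiosyncrasy features) fed to a linear support vector machine. The texts and labels below are toy data.

```python
# Sketch: function-word frequencies + linear SVM for native-language guessing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

FUNCTION_WORDS = ["the", "a", "of", "to", "in", "that", "is", "was", "for", "with"]

texts = ["the report was prepared for the committee in a hurry",
         "report was prepared to committee with hurry",
         "she gave a talk that was well received in the workshop",
         "she gave talk that was received well in workshop"]
native_language = ["english", "russian", "english", "russian"]  # toy labels

clf = make_pipeline(CountVectorizer(vocabulary=FUNCTION_WORDS),
                    LinearSVC(dual=False))
clf.fit(texts, native_language)
print(clf.predict(["paper was submitted to journal with delay"]))
```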
Sentiment analysis is an example of polarity learning. Most research on learning to identify sentiment ignores "neutral" examples and instead performs training and testing using only examples of significant polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons and show how neutral examples help u...
The textual entailment problem is to determine if a given text entails a given hypothesis. This paper describes first a general generative probabilistic setting for textual entailment. We then focus on the sub-task of recognizing whether the lexical concepts present in the hypothesis are entailed from the text. This problem is recast as one of tex...
This paper describes the Bar-Ilan system participating in the Recognising Textual Entailment Challenge. The paper proposes first a general probabilistic setting that formalizes the notion of textual entailment. We then describe a concrete alignment-based model for lexical entailment, which utilizes web co-occurrence statistics in a bag of words r...
The textual entailment task – determining if a given text entails a given hypothesis – provides an abstraction of applied semantic inference. This paper describes first a general generative probabilistic setting for textual entailment. We then focus on the sub-task of recognizing whether the lexical concepts present in the hypothesis are entailed f...