About
133
Publications
24,961
Reads
6,523
Citations
Additional affiliations
September 1996 - present
Publications (133)
This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet p...
Safety cases play a significant role in the development of safety-critical systems. The key components in a safety case are safety arguments, which are designated to demonstrate that the system is acceptably safe. Inappropriate reasoning with safety arguments could undermine a system’s safety claims, which in turn could contribute to safety-related failure...
Safety cases play a significant role in the development of safety-critical systems. The key components in a safety case are safety arguments, which are designated to demonstrate that the system is acceptably safe. Inappropriate reasoning with safety arguments could undermine a system's safety claims, which in turn could contribute to safety-related failure...
Despite increasing research into their use as a vehicle for Human-Computer Dialogue and Inter-Agent Communication, Dialogue Games have not seen good uptake in industry. One of the reasons for this is the lack of methodologies and tooling for the development, evaluation, and exploitation of such systems. In this paper we build on the ProtOCL method...
Surnames (family names) and forenames generally evolve over generations and can be used to understand population origins, migration, identity, social norms and cultural customs. These forenames or surnames may have hidden structures associated with them, called communities. Each community might have strong correlation among several forenames and...
Traditional methods for identifying communities in networks are based on direct link structures, which ignore the content information shared among groups of entities. Recently, community detection approaches that use both link and content information have been studied. It is necessary to identify communities with different sentiment distributions based on corr...
This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum lik...
Word Sense Induction (WSI) is the task of identifying the different uses (senses) of a target word in a given text in an unsupervised manner, i.e. without relying on any external resources such as dictionaries or sense-tagged data. This paper presents a thorough description of the SemEval-2010 WSI task and a new evaluation setting for sense inducti...
This paper presents a novel method for mining suspicious websites from the World Wide Web by using state-of-the-art pattern mining and machine learning methods. In this document, the term "suspicious website" is used to mean any website that contains known or suspected violations of law. Although we present our evaluation on illegal online organ tradi...
In this paper, we present an agglomerative hierarchical clustering algorithm for labelling morphs. The algorithm aims to capture allomorphs and homophonous morphemes for a deeper analysis of segmentation results of a morphological segmentation system. Most morphological segmentation systems focus only on segmentation rather than labelling morphs ac...
This paper presents a novel methodology for learning the behavioural profiles of sexual predators by using state-of-the-art machine learning and computational linguistics methods. The presented methodology aims to distinguish between predatory and non-predatory conversations and is evaluated on real-world data. All the text fragments within a...
We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This allows detecting morphological similarities easily leading to improved morphological segmentation. Our evaluation using (K...
Automatic evaluation of speech segmentation is problematic as predicted segment boundaries never align precisely. For this reason, most researchers apply ad-hoc methods for measuring the accuracy of speech segmentation. This makes judging the relative merits of each method extremely subjective and difficult. We address this problem by proposing a new me...
A further investigation into the role of linguistic evolution as an alternative to biological evolution in the emergence of syntax is presented. This follows on from the idea that languages themselves are evolving entities, which adapt to be easily acquired by the human learner. It has already been shown that it is possible for the rudimentary elem...
In this paper, we highlight the problems of polysemy in word space models of compositionality detection. Most models represent each word as a single prototype-based vector without addressing polysemy. We propose an exemplar-based model which is designed to handle polysemy. This model is tested for compositionality detection and it is found to outpe...
Relationship mining or Relation Extraction (RE) is the task of identifying the different relations that might exist between two or more named entities. Relation extraction can be exploited in order to enhance the usability of a variety of applications, including web search, information retrieval, question answering and others. This paper presents a...
In this paper we address the problem of question recommendation from large archives of community question answering data by exploiting the users' information needs. Our experimental results indicate that questions based on the same or similar information need can provide excellent question recommendation. We show that a translation model can be effec...
A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition. It enables one to use more accurate acoustic models since the length of phonemes provides more information for parametrization. Our method relies on the values of...
In distributional semantics studies, there is growing attention to compositionally determining the distributional meaning of word sequences. Yet, compositional distributional models depend on a large set of parameters that have not been explored. In this paper we propose a novel approach to estimate parameters for a class of compositional dis...
Taxonomies are an important resource for a variety of Natural Language Processing (NLP) applications. Despite this, the current state-of-the-art methods in taxonomy learning have disregarded word polysemy, in effect, developing taxonomies that conflate word senses. In this paper, we present an unsupervised method that builds a taxonomy of senses le...
This paper presents an unsupervised graph-based method for automatic word sense induction and disambiguation. The innovative part of our method is the assignment of either a word or a word pair to each vertex of the constructed graph. Word senses are induced by clustering the constructed graph. In the disambiguation stage, each induced cluster is...
This paper presents the description and evaluation framework of SemEval-2010 Word Sense Induction & Disambiguation task, as well as the evaluation results of 26 participating systems. In this task, participants were required to induce the senses of 100 target words using a training set, and then disambiguate unseen instances of the same words using...
The paper presents an evaluation of Polish phone segmentation for different types of phones. The categorisation was based on acoustic properties. The segmentation method is based on the discrete wavelet transform and has been published previously. The results show that several types of transitions, especially from and to vowels, cause more errors than othe...
There is significant evidence in the literature that integrating knowledge about multiword expressions can improve shallow parsing accuracy. We present an experimental study to quantify this improvement, focusing on compound nominals, proper names and adjective-noun constructions. The evaluation set of multiword expressions is derived from WordNet...
Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulates the problem in a graph-based setting and applies unsupervised graph clustering to obtain a set of clusters. Recent studie...
Many existing methods for bilingual lexicon learning from comparable corpora are based on similarity of context vectors. These methods suffer from noisy vectors that greatly affect their accuracy. We introduce a method for filtering this noise allowing highly accurate learning of bilingual lexicons. Our method is based on the notion of in-domain te...
We propose a new clustering algorithm for the induction of morphological paradigms. Our method is unsupervised and exploits the syntactic categories of the words acquired by an unsupervised syntactic category induction algorithm [1]. Previous research [2,3] on joint learning of morphology and syntax has shown that both types of knowledge affect...
One of the basic problems of efficiently generating information-seeking dialogue in interactive question answering is to find the topic of an information-seeking question with respect to the answer documents. In this paper we propose an approach to solving this problem using concept clusters. Our empirical results on TREC collections and our ambigu...
This paper presents the evaluation setting for the SemEval-2010 Word Sense Induction (WSI) task. The setting of the SemEval-2007 WSI task consists of two evaluation schemes, i.e. unsupervised evaluation and supervised evaluation. The first one evaluates WSI methods in a similar fashion to Information Retrieval exercises using F-Score. However,...
Word Sense Induction (WSI) is the task of identifying the different senses (uses) of a target word in a given text. This paper focuses on the unsupervised estimation of the free parameters of a graph-based WSI method, and explores the use of eight Graph Connectivity Measures (GCM) that assess the degree of connectivity in a graph. Given a tar...
To model combinatorial decision problems involving uncertainty and probability, we extend the stochastic constraint programming framework proposed in [Walsh, 2002] along a number of important dimensions (e.g. to multiple chance constraints and to a range of new objectives). We also provide a new (but equivalent) semantics based on scenarios. Using...
A semantic language modelling method for speech recognition is presented. The method is somewhat similar to latent semantic analysis, but it needs less memory and training data. Even so, it gave better experimental results, reported as the percentage of correctly recognized sentences from a corpus. It differs from latent semantic ana...
To model combinatorial decision problems involving uncertainty and probability, we introduce scenario based stochastic constraint programming. Stochastic constraint programs contain both decision variables, which we can set, and stochastic variables, which follow a discrete probability distribution. We provide a semantics for stochastic constraint...
This article describes a method for classifying dialogue utterances and detecting the interlocutor's agreement or disagreement. This labelling can help improve dialogue management by providing additional information on the utterance's content without deep parsing. The proposed technique improves upon state of the art approaches by using a Support V...
Identifying whether a multi-word expression (MWE) is compositional or not is important for numerous NLP applications. Sense induction can partition the context of MWEs into semantic uses and therefore aid in deciding compositionality. We propose an unsupervised system to explore this hypothesis on compound nominals, proper names and adjec...
This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [4][12] on learning of morphology and syntax has shown that both kinds of knowledge affect each other making it possible to use one type of knowledge to help the other. In this work, we make use of syntactic inform...
Interactive question answering (QA), where a dialogue interface enables follow-up and clarification questions, is a recent although long-advocated field of research. We report on the design and implementation of YourQA, our open-domain, interactive QA system. YourQA relies on a Web search engine to obtain answers to both fact-based and complex ques...
This paper demonstrates an efficient technique for extracting bilingual word pairs from non-parallel but comparable corpora. Instead of using the common approach of taking high frequency words to build up the initial bilingual lexicon, we show that contextually relevant terms that co-occur with cognate pairs can be efficiently utilized to build a biling...
This paper presents the evaluation setting for the SemEval-2010 Word Sense Induction (WSI) task. The setting of the SemEval-2007 WSI task consists of two evaluation schemes, i.e. unsupervised evaluation and supervised evaluation. The first one evaluates WSI methods in a similar fashion to Information Retrieval exercises using F-Score. However, F-Sc...
A speech recognition system based on HTK for Polish is presented. It was trained on 365 utterances, all spoken by 26 males. The features of Polish with respect to speech recognition are described. Some aspects of speech recognition differ in comparison to English. Errors in recognition were analysed in detail in an attempt to find reasons and scen...
Argumentation is an emerging topic in the field of human computer dialogue. In this paper we describe a novel approach to dialogue management that has been developed to achieve persuasion using a textual argumentation dialogue system. The paper introduces a layered management architecture that mixes task-oriented dialogue techniques with chat-bot...
Word Sense Induction (WSI) is the task of identifying the different senses (uses) of a target word in a given text. Traditional graph-based approaches create and then cluster a graph, in which each vertex corresponds to a word that co-occurs with the target word, and edges between vertices are weighted based on the co-occurrence frequency of their...
Segmenting speech signals on the basis of time-frequency analysis is the most natural approach. Boundaries are located in places where the energy of some frequency subband changes rapidly. A speech segmentation method based on the discrete wavelet transform, the resulting power spectrum and its derivatives is presented. This information allows to l...
In our previous work we introduced a hybrid, GA&ILP-based approach for learning of stem-suffix segmentation rules from an unmarked list of words. Evaluation of the method was made difficult by the lack of word corpora annotated with their morphological segmentation. Here the hybrid approach is evaluated indirectly, on the task of tag prediction. A...
Automatic Term Recognition (ATR) is defined as the task of identifying domain specific terms from technical corpora. Termhood-based approaches measure the degree to which a candidate term refers to a domain specific concept. Unithood-based approaches measure the attachment strength of a candidate term's constituents. These methods have been evaluated usi...
A new method of semantic modelling for speech recognition is presented. The method has some similarities to latent semantic analysis, but it gave better experimental results, which are provided as the percentage of correctly recognised sentences from a corpus. The main difference is a choice of similar topics influencing a matrix describing probability...
A speech recognition system based on HTK for Polish is presented. It was trained on 365 utterances, all spoken by 26 males. We describe specific features of Polish with respect to speech recognition. Polish is, like other Slavic languages, non-positional and highly inflective. This is the reason why some aspects of speech recognition may differ...
The Polish text corpus was analysed to find information about phoneme statistics. We were especially interested in triphones as they are commonly used in many speech processing applications like HTK speech recogniser. An attempt to create the full list of triphones for Polish language is presented. A vast amount of phonetically transcribed...
In this paper, we address the problem of personalization in question answering (QA). We describe the personalization component of YourQA, our web-based QA system, which creates individual models of users based on their reading level and interests. First, we explain how user models are dynamically created, saved and updated to filter and re-rank th...
In this paper, we study novel structures to represent information in three vital tasks in question answering: question classification, answer classification and answer reranking. We define a new tree structure called PAS to represent predicate-argument relations, as well as a new kernel function to exploit its representative power. Our experime...
We study the impact of syntactic and shallow semantic information in automatic classification of questions and answers and answer re-ranking. We define (a) new tree structures based on shallow semantics encoded in Predicate Argument Structures (PASs) and (b) new kernel functions to exploit the representational power of such structures with Supp...
This paper is an outcome of ongoing research and presents an unsupervised method for automatic word sense induction (WSI) and disambiguation (WSD). The induction algorithm is based on modeling the co-occurrences of two or more words using hypergraphs. WSI takes place by detecting high-density components in the co-occurrence hypergraphs. WSD assig...
An important issue in the construction of domain ontologies is the task of identifying terms and their corresponding definitions. Though many methods exist for automatic extraction of terminology from plain text, the semantic interpretation of these terms is either manual or semi-automatic. In this paper we present an unsupervised method for automa...
The paper presents the decision list learning system Clog and the results of using it to learn nominal inflections of English, Romanian, Czech, Slovene, and Estonian. The dataset used to induce rules for the synthesis and analysis of the inflectional paradigms of nouns and adjectives of these languages is the Multext-East multilingual tagged corpus...
Most question answering and information retrieval systems are insensitive to different users' needs and preferences, as well as their reading level. In (Quarteroni and Manandhar, 2006), we introduce a hybrid QA-IR system based on a user model. In this paper we focus on how the system filters and re-ranks the search engine results for a query ac...
This paper presents a novel unsupervised methodology for automatic disambiguation of nouns found in unrestricted corpora. The proposed method is based on extending the context of a target word by querying the web, and then measuring the overlap of the extended context with the topic signatures of the different senses by using Bayes rule. The algo...
Most question answering (QA) and information retrieval (IR) systems are insensitive to different users' needs and preferences, and also to the existence of multiple, complex or controversial answers. We introduce adaptivity in QA and IR by creating a hybrid system based on a dialogue interface and a user model. Keywords: question answerin...
In the field of natural language dialogue, a new trend is exploring persuasive argumentation theories. Applying these theories to human-computer dialogue management could lead to a more comfortable experience for the user and give way to new applications. In this paper, we study the different aspects of persuasive communication needed for health-...
Most question answering (QA) and information retrieval (IR) systems are insensitive to different users' needs and preferences, and also to the existence of multiple, complex or controversial answers. We propose the notion of adaptivity in QA and IR by introducing a hybrid QA-IR system based on a user model. Our current prototype filters and re-rank...
Semantic similarity or, inversely, semantic distance measures are useful in a variety of circumstances, from spell checking applications to a lightweight replacement for parsing within a natural language engine. Within this work, we examine the (Jiang & Conrath 1997) algorithm; evaluated by (Budanitsky & Hirst 2000) as being the best performing, a...
In most approaches to speech recognition, the speech signals are segmented using constant-time segmentation, for example into 25 ms blocks. Constant segmentation risks losing information about the phonemes. Different sounds may be merged into single blocks and individual phonemes lost completely. A more satisfactory approach is to attempt to segmen...
In this paper a new method of speech segmentation is suggested. It is based on power fluctuations of the wavelet spectrum for a speech signal. In most approaches to speech recognition, the speech signals are segmented using constant-time segmentation. Constant segmentation needs to use windows to decrease the boundary distortions. A more natur...
Statistical data on phonemes, useful in continuous speech recognition systems, are presented. This paper explains the basics of a simple system for estimating phoneme, diphone and triphone statistics from a text corpus of the Polish language. The results obtained are presented for an example text database. A possible application of the statistics is suggested.
To model combinatorial decision problems involving uncertainty and probability, we introduce scenario based stochastic constraint programming. Stochastic constraint programs contain both decision variables, which we can set, and stochastic variables, which follow a discrete probability distribution. We provide a semantics for stochastic constraint...
We examine the implementation of clarification dialogues, a mechanism for ensuring that question answering systems take into account user goals by allowing them to ask a series of related questions either by refining or expanding on previous questions with follow-up questions, in the context of open domain Question Answering systems. We devel...
Interoperability is a highly valued concept in providing services to a smart home environment. Presenting a consistent, seamless interface to the user requires individual devices within the smart home be able to communicate with each other. This in turn indicates the necessity for common standards and means of information interchange and at times e...
In the field of Artificial Intelligence, question answering is a bridge between information retrieval and natural language understanding. We propose to introduce adaptivity in question answering by creating a system based on a dialogue interface and a user modelling component. Our system will be able to handle complex questions and to present answe...
Conventional document search techniques are constrained by attempting to match individual keywords or phrases to source documents. Thus, these techniques miss documents that contain semantically similar terms, thereby achieving a relatively low degree of recall. At the same time, processing capabilities and tools for syntactic and semantic anal...
We examine clarification dialogue, a mechanism for refining user questions with follow-up questions, in the context of open domain Question Answering systems. We develop an algorithm for clarification dialogue recognition through the analysis of collected data on clarification dialogues and examine the importance of clarification dialogue recogniti...
A method is presented for automatically extending WordNet with the telic relationships proposed in Pustejovsky's lexicon model. The method extracts telic relationships from WordNet glosses by first selecting a telic word through a pattern matcher aided by a part-of-speech tagger and then employing a word disambiguation module to select the specific...
An algorithm for calculating semantic similarity between sentences using a variety of linguistic information is presented and applied to the problem of Question Answering. This semantic similarity measure is used in order to determine the semantic relevance of an answer with respect to a question. The algorithm is evaluated against the TREC Question...
To model combinatorial decision problems involving uncertainty and probability, we extend the stochastic constraint programming framework proposed in [Walsh, 2002] along a number of important dimensions (e.g. to multiple chance constraints and to a range of new objectives). We also provide a new (but equivalent) semantics based on scenarios. Using...
An algorithm for calculating semantic similarity between sentences using a variety of linguistic information is presented and applied to the problem of Question Answering. This semantic similarity measure is used in order to determine the semantic relevance of an answer with respect to a question. The algorithm is evaluated against the TREC Question...
Ontologies are a tool for Knowledge Representation that is now widely used, but the effort employed to build an ontology is high. We describe here a procedure to automatically extend an ontology such as WordNet with domain-specific knowledge. The main advantage of our approach is that it is completely unsupervised, so it can be applied to different...
We describe here a procedure to combine two different existing techniques for Ontology Enrichment with domain-specific concepts. The resulting algorithm is fully unsupervised, and the level of precision is higher than when they are used separately, so we believe that both algorithms benefit from each other. The experiments have been perfo...
We provide a constraint-based computational model of linear precedence as employed in the HPSG grammar formalism. An extended feature logic which adds a wide range of constraints involving precedence is described. A sound, complete and terminating deterministic constraint solving procedure is given. A deterministic computational model is achieved by...
A preliminary analysis of our QA system implemented for TREC-11 is presented, with an initial evaluation.
We describe here the algorithm and criteria we have used for the identification of events in a text, and for finding the temporal relations between them. Some of our ad hoc heuristics provide very good results, with precision and recall values around 90%, that are able to provide us a fairly good ordering of events from different kinds of texts....
We develop a new computational model for representing the fine-grained meanings of near-synonyms and the differences between them. We also develop a lexical-choice process that can decide which of several near-synonyms is most appropriate in a particular ...
This was our first entry at TREC and the system we presented was, due to time constraints, an incomplete prototype. Our main aims were to verify the usefulness of syntactic analysis for QA and to experiment with different semantic distance metrics in view of a more complete and fully integrated future system. To this end we made use of...
Computational learning of natural language is often attempted without using the knowledge available from other research areas such as psychology and linguistics. This can lead to systems that solve problems that are neither theoretically nor practically useful. In this paper we present a system CLL which aims to learn natural language syntax in...
This article presents a combination of unsupervised and supervised learning techniques for the generation of word segmentation rules from a raw list of words. First, a language bias for word segmentation is introduced and a simple genetic algorithm is used in the search for a segmentation that corresponds to the best bias value. In the second pha...
Context free grammars parse faster than TFS grammars, but have disadvantages. On our test TFS grammar, precompilation into CFG results in a speedup of 16 times for parsing without taking into account additional mechanisms for increasing parsing efficiency. A formal overview is given of precompilation and parsing. Modifications to ALE rules permit a...
The identification of phrases in a sentence can be useful as a pre-processing step before attempting full parsing. There is already much literature about finding simple non-recursive non-overlapping Noun Phrases. We have modified the learning paradigm CLOG [4] to produce transformation lists, and we arrived at several interesting conclusions ab...
The identification of phrases in a sentence can be useful as a pre-processing step before attempting full parsing. There is already much literature about finding simple non-recursive non-overlapping Noun Phrases. We have modified the learning paradigm CLOG [4] to produce transformation lists, and we arrived at several interesting conclusions ab...
Constraint-based grammars based on Head Driven Phrase Structure Grammar employ typed feature structures (TFSs) which are often deeply nested. Unification in Prolog (B. Carpenter, G. Penn, ALE: The Attribute Logic Engine User’s Guide, version 2.0, Technical report, Philosophy Department, Carnegie Mellon University, Pittsburgh, PA, 1994) is then too...
This paper describes some of the results of the project The Reusability of Grammatical Resources. The aim of the project is to extend current grammar formalisms with notational devices and constraint solvers in order to aid the development of reusable grammars. The project took the Advanced Linguistic Engineering Platform (ALEP) as its starting poi...