ABSTRACT: Based on an earlier proposed procedure and data, we extended our signature database and examined the differences between signature samples recorded at different times and the relevance of training data selection. We found that the false accept and false reject rates strongly depend on the selection of the training data, but samples taken during different time intervals hardly affect the error rates.
Proceedings of the 1st International Workshop on Automated Forensic Handwriting Analysis: A Satellite Workshop of ICDAR-2011. 09/2011; 768:6-10.
ABSTRACT: In the current paper we consider the task of object classification in wireless sensor networks. Assuming that each feature needed for classification is acquired by a sensor, a new approach is proposed that aims at minimizing the number of features used for classification while maintaining a given correct classification rate. In particular, we address the case where a sensor may have a failure before its battery is exhausted. In experiments with data from the UCI repository, the feasibility of this approach is demonstrated.
Proceedings of the 12th WSEAS international conference on Neural networks, fuzzy systems, evolutionary computing & automation; 04/2011
ABSTRACT: In this paper we present a slightly modified machine learning approach for text classification working exclusively from positive and unlabeled samples. Our method ensures that the positive class is not underrepresented during the iterative training process, and it can achieve a 30% better F-value when the number of positive examples is low.
Text, Speech and Dialogue - 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011. Proceedings; 01/2011
ABSTRACT: In the current paper we consider the task of object classification in wireless sensor networks. Due to restricted battery capacity, minimizing the energy consumption is a main concern in wireless sensor networks. Assuming that each feature needed for classification is acquired by a sensor, a sequential classifier combination approach is proposed that aims at minimizing the number of features used for classification while maintaining a given correct classification rate. In experiments with data from the UCI repository, the feasibility of this approach is demonstrated.
Multiple Classifier Systems - 10th International Workshop, MCS 2011, Naples, Italy, June 15-17, 2011. Proceedings; 01/2011
ABSTRACT: The CoNLL-2010 Shared Task was dedicated to the detection of uncertainty cues and their linguistic scope in natural language texts. The motivation behind this task was that distinguishing factual and uncertain information in texts is of essential importance in information extraction. This paper provides a general overview of the shared task, including the annotation protocols of the training and evaluation datasets, the exact task definitions, the evaluation metrics employed and the overall results. The paper concludes with an analysis of the prominent approaches and an overview of the systems submitted to the shared task.
ABSTRACT: Given a set of m identical bins of size 1, the online input consists of a (potentially infinite) stream of items in (0,1]. Each item is to be assigned to a bin upon arrival. The goal is to cover all bins, that is, to reach a situation where a total size of items of at least 1 is assigned to each bin. The cost of an algorithm is the sum of all used items at the moment when the goal is first fulfilled. We consider three variants of the problem: the online problem, where there is no restriction on the input items, and two semi-online models, where the items arrive sorted by size, either non-decreasing or non-increasing. The offline problem is considered as well.
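To make the cost measure in the abstract above concrete, here is a minimal Python sketch. It is not the algorithm studied in the paper: it uses a simple greedy rule (fill one bin until it is covered, then move to the next) chosen only for illustration, and the names `greedy_cover`, `stream` and `m` are hypothetical.

```python
def greedy_cover(stream, m):
    """Assign items to m unit-size bins one at a time, in arrival order.

    Illustrative greedy rule (an assumption, not the paper's method):
    keep adding items to the current bin until its load reaches 1,
    then switch to the next bin.

    Returns (cost, covered), where cost is the total size of the items
    used up to the moment all bins are covered (or the stream ends),
    and covered tells whether all m bins reached load >= 1.
    """
    load = 0.0      # load of the bin currently being filled
    current = 0     # number of bins already covered
    cost = 0.0      # total size of items used so far
    for size in stream:
        if current == m:
            break   # goal already fulfilled; later items are not used
        if not 0 < size <= 1:
            raise ValueError("item sizes must lie in (0, 1]")
        load += size
        cost += size
        if load >= 1.0:
            current += 1
            load = 0.0
    return cost, current == m


if __name__ == "__main__":
    # Sizes chosen to be exactly representable in binary floating point.
    print(greedy_cover([0.5, 0.5, 0.75, 0.5, 0.25], 2))
```

In this run the first bin is covered exactly (0.5 + 0.5), while the second receives 0.75 + 0.5 and so overshoots by 0.25; that overshoot is the kind of waste the cost measure penalizes.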
ABSTRACT: Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, short references are made to dependency grammars we considered important in the development of our Treebank. Second, mention is made of existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, then checked and corrected them, thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics; at the same time, we are planning to convert the entire corpus into dependency-tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized, among others, in information extraction and machine translation.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta; 01/2010
ABSTRACT: In online clustering problems, the classification of points into sets (called clusters) is done in an online fashion. Points arrive one by one at arbitrary locations, to be assigned to clusters at the time of arrival. A point can be assigned to an existing cluster, or a new cluster can be opened for it. We study a one-dimensional variant on a line, where there is no restriction on the length of a cluster, and the cost of a cluster is the sum of a fixed set-up cost and its diameter. The goal is to minimize the sum of costs of the clusters used by the algorithm.
We study several variants, all maintaining the essential property that a point which was assigned to a given cluster must remain assigned to this cluster, and clusters cannot be merged. In the strict variant, the diameter and the exact location of the cluster must be fixed when it is initialized. In the flexible variant, the algorithm can shift the cluster or expand it, as long as it contains all points assigned to it. In an intermediate model, the diameter is fixed in advance while the exact location can be modified. We give tight bounds on the competitive ratio of any online algorithm in each of these variants. In addition, for each one of the models, we consider also the semi-online case, where points are presented sorted by their
Mathematical Foundations of Computer Science 2010, 35th International Symposium, MFCS 2010, Brno, Czech Republic, August 23-27, 2010. Proceedings; 01/2010
ABSTRACT: This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief annotator (also responsible for setting up the annotation guidelines) who resolved cases where the annotators disagreed. We will report our statistics on corpus size, ambiguity levels and the consistency of annotations.
ABSTRACT: Detecting uncertain and negative assertions is essential in most BioMedical Text Mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus).
The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist (also responsible for setting up the annotation guidelines) who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation, and over 10% of them contain one or more linguistic annotations suggesting negation or uncertainty.
Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.
ABSTRACT: The average case analysis of algorithms usually assumes independent, identical distributions for the inputs. Kenyon introduced the random-order ratio, a new average case performance metric for bin packing heuristics, and gave upper and lower bounds for it for the Best Fit heuristic. We introduce an alternative definition of the random-order ratio and show that the two definitions give the same result for Next Fit. We also show that the random-order ratio of Next Fit equals its asymptotic worst case, i.e., it is 2.
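For readers unfamiliar with the heuristic named in the abstract above, the following is a minimal Python sketch of the standard Next Fit rule for bin packing: keep a single open bin, and when the next item does not fit, close that bin for good and open a new one. The function name `next_fit` and the `capacity` parameter are illustrative choices, not taken from the paper.

```python
def next_fit(items, capacity=1.0):
    """Pack items in arrival order using the Next Fit rule.

    Only one bin is open at a time. If the next item does not fit,
    the open bin is closed permanently and a new bin is opened.
    Returns the list of bins, each bin being a list of item sizes.
    """
    bins = []       # closed bins
    current = []    # items in the open bin
    load = 0.0      # total size in the open bin
    for size in items:
        if not 0 < size <= capacity:
            raise ValueError("item sizes must lie in (0, capacity]")
        if current and load + size > capacity:
            bins.append(current)        # close the open bin
            current, load = [], 0.0     # open a fresh one
        current.append(size)
        load += size
    if current:
        bins.append(current)
    return bins


if __name__ == "__main__":
    # Alternating sizes force a new bin for every item, although the
    # two 0.5 items could have shared a single bin.
    print(len(next_fit([0.5, 0.6, 0.5, 0.6])))
```

Because a closed bin is never revisited, the order of arrival matters a great deal for Next Fit, which is what makes a random-order performance measure a natural object of study for it.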
ABSTRACT: To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among other criteria, the selection required the given word form to be frequent in Hungarian language usage, as measured by the frequency rates available in the Hungarian National Corpus (HNC; Váradi, 2000), and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság (HVG) subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (the whole HVG article), and information on the lemma, POS-tagging and automatic tokenization is also available.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
ABSTRACT: Wordnets are lexical databases in which words are organized into clusters based on their meanings, and they are linked to each other through different semantic and lexical relations. The first wordnet, the Princeton WordNet, was created for English; it was followed by various wordnets created within the framework of the EuroWordNet and BalkaNet projects, among others. Here we focus on the development of wordnets in general and of the Hungarian WordNet (HuWN). The process of constructing HuWN is illustrated by examples, some language-specific and language-independent problems encountered during the construction process are discussed, and basic statistical data on HuWN are presented. Finally, two subontologies of HuWN, namely the financial domain ontology and the legal domain ontology, are also presented, and possible applications of wordnets are outlined.
ABSTRACT: A labeled natural language corpus is often difficult, expensive or time-consuming to obtain as its construction requires expert human effort. On the other hand, unlabeled texts are available in abundance thanks to the World Wide Web. The importance of utilizing unlabeled data in machine learning systems is growing. Here, we investigate classic semi-supervised approaches and examine the potential advantages of applying special techniques for Natural Language Processing tasks.
Proceedings of the 6th WSEAS international conference on Computational intelligence, man-machine systems and cybernetics; 12/2007
ABSTRACT: Semi-structured medical texts like discharge summaries are rich sources of information that can support physicians' research through statistical analysis of similar cases. In this paper we introduce a system based on Machine Learning algorithms that successfully classifies discharge records according to the smoking status of the patient. We distinguish between five classes: current smoker, past smoker, smoker (where a decision between the former two classes cannot be made), non-smoker and unknown (where the document contains no data on smoking status). Such systems are useful for examining the connection between certain social habits and diseases like cancer or asthma. We trained and tested our model on the shared task organized by the I2B2 (Informatics for Integrating Biology and the Bedside) research center, and despite the small amount of training data available, our system shows promising results in identifying the smoking habits of patients based on their medical discharge summaries.
Proceedings of the 5th WSEAS international conference on System science and simulation in engineering; 12/2006
ABSTRACT: This report presents a recent Hungarian project started in the spring of 2005. The goals of the project are to produce a Hungarian version of the EuroWordNet ontology database, to extend it with concepts specific to the business domain, and to develop a demonstration version of an ontology-based Information Extraction (IE) system. The system will extract summarized data from short business articles concerning company mergers, acquisitions, profits, new products, new plants etc. A consortium of three leading Hungarian human language technology institutions won substantial governmental support lasting until 2007.