
Rafael C. Carrasco
- Computer Science & Physics
- University of Alicante
About
106
Publications
12,327
Reads
2,398
Citations
Introduction
Current institution
Additional affiliations
January 2005 - December 2010
January 1988 - December 1991
Publications (106)
Diversity indices have been traditionally used to capture the biodiversity of ecosystems by measuring the effective number of species or groups of species. In contrast to abundance, which grows with the amount of data available and is sensitive to the appearance of small groups, diversity indices provide a more robust indicator of the variability o...
Diversity indices have been traditionally used to capture the biodiversity of ecosystems by measuring the effective number of species or groups of species. In contrast to abundance, which is correlated with the amount of data available, diversity indices provide a more robust indicator of the variability of individuals. These types of indices can b...
For some decades now, galleries, libraries, archives, and museums (GLAM) institutions have provided access to information resources in digital format. Although some datasets are openly available, they are often not used to their full potential. Recently, approaches such as the so‐called Labs within GLAM institutions promote the reuse of digital col...
Cultural heritage institutions have recently started to share their metadata as Linked Open Data (LOD) in order to disseminate and enrich them. The publication of large bibliographic data sets as LOD is a challenge that requires the design and implementation of custom methods for the transformation, management, querying and enrichment of the data....
Translation models based on hierarchical phrase-based statistical machine translation (HSMT) have shown better performance than their non-hierarchical phrase-based counterparts for some language pairs. The standard approach to HSMT learns and applies a synchronous context-free grammar with a single non-terminal. The hypothesis behind the grammar refin...
The normalization of edit sequences sorts the operations according to the document position they modify instead of the instant when they were generated. The stress on the spatial distribution of operations simplifies the analysis of conflictive types of operations and provides an alternative formalism, also within the operational transformation sche...
Cultural heritage institutions have recently begun to consider the benefits of sharing their collections using linked open data to disseminate and enrich their metadata. As datasets become very large, challenges appear, such as ingestion, management, querying and enrichment. Furthermore, each institution has particular features related to important...
This paper presents a new method with which to assist individuals with no background in linguistics to create monolingual dictionaries such as those used by the morphological analysers of many natural language processing applications.
The involvement of non-expert users is especially critical for under-resourced languages which either lack or canno...
The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The entries in the catalogue have been recently migrated to a new relational database whose data model adheres to the conceptual models promoted by the International Federation of Library A...
Bibliographic collections in traditional libraries often compile records from distributed sources where variable criteria have been applied to the normalization of the data. Furthermore, the source records often follow classical standards, such as MARC21, where a strict normalization of author names is not enforced. The identification of equivalent...
The 200,000 records in the catalogue of the Biblioteca Virtual Miguel de Cervantes have been migrated to a new relational database whose data model adheres to the FRBR and FRAD specifications. The database content has been later mapped to RDF triples which employ the RDA vocabulary to describe the entities, as well as their properties and relations...
The BVC section of the impact-es diachronic corpus of historical Spanish compiles 86 books —containing approximately 2 million words. About 27% of the words —providing a representative coverage of the most frequent word forms— have been annotated with their lemma, part of speech, and modern equivalent following the Text Encoding Initiative guidelin...
This paper describes an open-source tool which computes statistics of the differences between a reference text and the output of an OCR engine. It also facilitates the spotting of mismatches by generating an aligned bitext where the differences are highlighted and cross-linked.
The tool accepts a variety of input formats (both for the reference and...
Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords: pairs of parallel words with a high probability of...
The impact-es diachronic corpus of historical Spanish compiles over one hundred books —containing approximately 8 million words— in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under...
The impact-es diachronic corpus of historical Spanish compiles over one hundred books—containing approximately 8 million words—in addition to a complementary lexicon which links more than 10,000 lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an ope...
We compare different strategies to apply statistical machine translation techniques in order to retrieve documents that are a plausible translation of a given source document. Finding the translated version of a document is a relevant task; for example, when building a corpus of parallel texts that can help to create and evaluate new machine transl...
It often occurs that local copies of a text are modified by users but that the local modifications are not synchronized (thus allowing the merged text to become the source for later editions) until later when, for instance, the network connection is reestablished. Since text editions usually affect a small fraction of the whole content, the history...
Although optimal staff scheduling often requires elaborate computational methods, those cases which are not highly constrained can be efficiently solved using simpler approaches. This paper describes how a simple procedure, combining random and greedy strategies with heuristics, has been successfully applied in a Spanish hospital to assign guard sh...
We describe a technique that maps unranked trees to arbitrary hash codes using a bottom-up Deterministic Tree Automaton (DTA). In contrast to other hashing techniques based on automata, our procedure builds a pseudo-minimal DTA for this purpose. A pseudo-minimal automaton may be larger than the minimal one accepting the same language but, in turn,...
We describe an algorithm that allows the incremental addition or removal of unranked ordered trees to a minimal frontier-to-root deterministic finite-state tree automaton (DTA). The algorithm takes a tree t and a minimal DTA A as input; it outputs a minimal DTA A′ which accepts the language L(A) accepted by A incremented (or decremented) with the t...
The amount of information that is stored in digital form in more than one language is growing very fast as a consequence of globalization. Furthermore, there are countries and supra-national entities whose legislation enforces the translation (and storage) of all the official texts into all their official languages. Two texts that are mutual tr...
A frontier-to-root deterministic finite-state tree automaton (DTA) can be used as a compact data structure to store collections of unranked ordered trees. DTAs are usually sparser than string automata, as most transitions are undefined, and therefore special care must be taken in order to minimize them efficiently. However, it is difficult to find simple...
This paper describes the current status of development of an open-source shallow-transfer machine translation (MT) system for the (European) Portuguese ↔ Spanish language pair, developed using the OpenTrad Apertium MT toolbox (www.apertium.org). Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech t...
Recent work has shown that the extraction of symbolic rules improves the generalization performance of recurrent neural networks trained with complete (positive and negative) samples of regular languages. This paper explores the possibility of inferring the rules of the language when the network is trained instead with stochastic, positive-only dat...
Starting from basic couplings of the photons to mesons, nucleons and isobars, a microscopic many-body theory is developed which allows one to evaluate different photonuclear reactions at intermediate energies. The theory is applied to obtain the total photonuclear cross section and the separation between absorption and (γ, π) reaction channels.
The increase in the amount of data available in digital libraries calls for the development of search engines that allow the users to find quickly and effectively what they are looking for. The XML tagging makes possible the addition of structural information in digitized content. These metadata offer new opportunities to a wide variety of new serv...
In this paper, we describe some techniques to learn probabilistic k-testable tree models, a generalization of the well-known k-gram models, that can be used to compress or classify structured data. These models are easy to infer from samples and allow for incremental updates. Moreover, as shown here, backing-off schemes can be defined to solve data...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition, and machine translation are some of them. In Part I of this paper, we sur...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition or in fields to which pattern recognition is linked. In Part I of this paper, we surveyed these objects and studied their properties. In this Part II, we study the relations between probabilistic finite-state automata and other well-known devices that ge...
Probabilistic k-testable models (usually known as k-gram models in the case of strings) can be easily identified from samples and allow for smoothing techniques to deal with unseen events during pattern classification. In this paper, we introduce the family of stochastic k-testable tree languages and describe how these models can approximate any st...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked. In part I of this paper, we surveyed these objects and studied their properties. In this part II, we study the relations between probabilistic finite-state automata and other well known devices that g...
In this paper, we present a natural generalization of k-gram models for tree stochastic languages based on the k-testable class. In this class of models, frequencies are estimated for a probabilistic regular tree grammar which is bottom-up deterministic. One of the advantages of this approach is that the model can be updated in an incremental fashio...
A simple, robust sliding-window part-of-speech tagger is presented and a method is given to estimate its parameters from an untagged corpus. Its performance is compared to a standard Baum-Welch-trained hidden-Markov-model part-of-speech tagger. Transformation into a finite-state machine (behaving exactly as the tagger itself) is demonstrated...
This paper describes the application of a new model to learn probabilistic context-free grammars (PCFGs) from a tree bank corpus. The model estimates the probabilities according to a generalized k-gram scheme for trees. It allows for faster parsing, decreases considerably the perplexity of the test samples and tends to give more structured and refine...
We describe a general approach to compute a similarity measure between distributions generated by probabilistic tree automata that may be used in a number of applications in the pattern recognition field. In particular, we show how this similarity can be computed for families of structured (XML) documents. In such case, the use of regular expressio...
Probabilistic k-testable models (usually known as k-gram models in the case of strings) can be easily identified from samples and allow for smoothing techniques to deal with unseen events. In this paper we introduce the family of stochastic k-testable tree languages and describe how these models can approximate any stochastic rational tree language...
In a previous work, a new probabilistic context-free grammar (PCFG) model for natural language parsing derived from a tree bank corpus has been introduced. The model estimates the probabilities according to a generalized k-gram scheme for trees. It allows for faster parsing, decreases considerably the perplexity of the test samples and tends...
In this paper, we compare three different approaches to build a probabilistic context-free grammar for natural language parsing from a tree bank corpus: 1) a model that simply extracts the rules contained in the corpus and counts the number of occurrences of each rule 2) a model that also stores information about the parent node's category and, 3)...
We propose a new algorithm which allows for the identification of any stochastic deterministic regular language as well as the determination of the probabilities of the strings in the language. The algorithm builds the prefix tree acceptor from the sample set and merges systematically equivalent states. Experimentally, it proves very fast and the t...
In this paper, we describe a generalization for tree stochastic languages of the k-gram models. These models are based on the k-testable class, a subclass of the languages recognizable by ascending tree automata. One of the advantages of this approach is that the probabilistic model can be updated in an incremental fashion. Another feature is that...
In this paper, we compare three different approaches to build a probabilistic context-free grammar for natural language parsing from a tree bank corpus: (1) a model that simply extracts the rules contained in the corpus and counts the number of occurrences of each rule; (2) a model that also stores information about the parent node’s category, and...
Daciuk et al. [Computational Linguistics 26(1):3–16 (2000)] describe a method for constructing incrementally minimal, deterministic, acyclic finite-state automata (dictionaries) from sets of strings. But acyclic finite-state automata have limitations: For instance, if one wants a linguistic application to accept all possible integer numbers or Inte...
We define deterministic augmented letter transducers (DALTs), a class of finite-state transducers which provide an efficient way of implementing morphological analysers which tokenize their input (i.e., divide texts in tokens or words) as they analyse it, and show how these morphological analysers may be maintained (i.e., how surface form–lexical for...
Regular tree automata (RTA) or, equivalently, forest regular grammars (FRG) have been recently proposed for use as XML (extensible markup language) schemata. They are more powerful than usual XML DTDs (document-type definitions), make the implementation, optimization and pruning of XML queries easier and allow for the implementation of context-sensi...
Recently, a number of authors have explored the use of recursive neural nets (RNN) for the adaptive processing of trees or tree-like structures. One of the most important language-theoretical formalizations of the processing of tree-structured data is that of finite-state tree automata (FSTA). In many cases, the number of states of a nond...
Recently, a number of authors have explored the use of recursive neural nets (RNN) for the adaptive processing of trees or tree-like structures. One of the most important language-theoretical formalizations of the processing of tree-structured data is that of deterministic finite-state tree automata (DFSTA). DFSTA may easily be realized a...
Finite-state machines are the most pervasive models of computation, not only in theoretical computer science, but also in all of its applications to real-life problems, and constitute the best characterized computational model. On the other hand, neural networks, proposed almost sixty years ago by McCulloch and Pitts as a simplified model of nerv...
In this paper, we present a natural generalization of k-gram models for tree stochastic languages based on the k-testable class. In this class of models, frequencies are estimated for a probabilistic regular tree grammar which is bottom-up deterministic. One of the advantages of this approach is that the model can be updated in an incremental fash...
There has been a lot of interest in the use of discrete-time recurrent neural nets (DTRNN) to learn finite-state tasks, with interesting results regarding the induction of simple finite-state machines from input–output strings. Parallel work has studied the computational power of DTRNN in connection with finite-state computation. This article descr...
In many applications, objects are represented by a collection of unorganized points that scan the surface of the object. In such cases, an efficient way of storing this information is of interest. In this paper we present an arithmetic compression scheme that uses a tree representation of the data set and allows for better compression rates than ge...
We generalize a former algorithm for regular language identification from stochastic samples to the case of tree languages. It can also be used to identify context-free languages when structural information about the strings is available. The procedure identifies equivalent subtrees in the sample and outputs the hypothesis in linear time with the n...
Recently, a number of authors have explored the use of tree-walking (also called recursive) neural nets (TWNN) for the adaptive processing of data which present themselves as trees or tree-like structures such as directed acyclic graphs. On the other hand, one of the most important language-theoretical formalizations of the processing of tree-struct...
In recent years, there has been a lot of interest in the use of discrete-time recurrent neural nets (DTRNN) to learn finite-state tasks, and in the computational power of DTRNN, particularly in connection with finite-state computation. This paper describes a simple strategy to devise stable encodings of sequential finite-state translators (SFST) in a s...
In this paper, the identification of stochastic regular languages is addressed. For this purpose, we propose a class of algorithms which allow for the identification of the structure of the minimal stochastic automaton generating the language. It is shown that the time needed grows only linearly with the size of the sample set and a measure of t...
A number of researchers have used discrete-time recurrent neural nets (DTRNN) to learn finite-state machines (FSM) from samples of input and output strings. Trained DTRNN usually show FSM behaviour for strings up to a certain length, but not beyond; this is usually called instability. Other authors have shown that DTRNN may actually behave as FSM f...
In this paper, we explore the applicability to compression tasks of the algorithms for regular language inference from stochastic samples. We compare two arithmetic encoders based upon two different kinds of formal languages: string languages and tree languages. The experiments show that tree-based methods outperform the predictive capability of stri...
Stochastic grammars provide a formal background in order to deal with tasks where a random source of structured data is involved. In particular, stochastic tree grammars can be useful if hierarchical relations are established among the elementary components of the data. Grammatical inference methods are often checked with training samples generated...
We generalize a former algorithm for regular language identification from stochastic samples to the case of tree languages or, equivalently, string languages where structural information is available. We also describe a method to compute efficiently the relative entropy between the target grammar and the inferred one, useful for the evaluation of...
Works dealing with grammatical inference of stochastic grammars often evaluate the relative entropy between the model and the true grammar by means of large test sets generated with the true distribution. In this paper, an iterative procedure to compute the relative entropy between two stochastic deterministic regular grammars is proposed. Résumé...
We consider the problem of learning context-free grammars from stochastic structural data. For this purpose, we have developed an algorithm (tlips) which identifies any rational tree set from stochastic samples and approximates the probability distribution of the trees in the language. The procedure identifies equivalent subtrees in the sample and...
Recent work has shown that the extraction of symbolic rules improves the generalization performance of recurrent neural networks trained with complete (positive and negative) samples of regular languages. This paper explores the possibility of inferring the rules of the language when the network is trained instead with stochastic, positive-only da...
Recent work has shown that second-order recurrent neural networks (2ORNNs) may be used to infer deterministic finite automata (DFA) when trained with positive and negative string examples. This paper shows that 2ORNN can also learn DFA from samples consisting of pairs (W, μ_W) where W is a noisy string of input vectors describing the degree of rese...
Recent work has shown that the extraction of symbolic rules improves the generalization power of recurrent neural networks trained with complete samples of regular languages. This paper explores the possibility of learning rules when the network is trained with stochastic data. For this purpose, a network with two layers is used. If an automaton is...
The recently introduced algorithm LAESA finds the nearest neighbour prototype in a metric space. The average number of distances computed in the algorithm does not depend on the number of prototypes but it shows linear space and time complexities. In this paper, a new algorithm (TLAESA) is proposed which has a sublinear time complexity and keeps th...
Recent work has shown that second-order recurrent neural networks (2ORNNs) may be used to infer regular languages. This paper presents a modified version of the real-time recurrent learning (RTRL) algorithm used to train 2ORNNs, that learns the initial state in addition to the weights. The results of this modification, which adds extra flexibility...
A symmetrized version of the Nagendraprasad-Wang-Gupta thinning algorithm (Digital Signal Processing 3 (1993) 97) is presented, which produces simpler and more elegant skeletons of handwritten characters at zero extra computational cost.
Recent work has shown that second-order recurrent neural networks (2ORNNs) may be used to infer deterministic finite automata (DFA) when trained with positive and negative string examples. This paper shows that 2ORNN can also learn DFA from samples consisting of pairs (W, μ(W)) where W is a noisy string of input vectors describing the degree of res...
Differential cross sections of nucleons excited in photonuclear reactions in medium and heavy nuclei are studied by considering all relevant reaction mechanisms leading to the excitation of protons or neutrons. We take advantage of previous microscopic studies for the absorption and scattering of photons and photoproduced pions, and implement a sim...
An abstract is not available.
This volume presents the proceedings of the Second International Colloquium on Grammatical Inference (ICGI-94), held in Alicante, Spain in September 1994.
Besides 25 research papers carefully selected and refereed by the program committee, the book contains a survey by E. Vidal. The book is devoted to all those aspects of automatic learning that ex...
A local model for the nuclear medium modifications to the photoproduction of η mesons through the N*(1535) resonance is applied to the study of the inclusive reaction in medium and heavy nuclei. The use of effective Lagrangians and many-body quantum theory allows one to incorporate the nuclear decay channels of the N*...
We develop a local approximation to the Δh model for coherent π0 photoproduction in nuclei which allows one to perform reliable calculations in heavy nuclei where the traditional Δh approach is technically unfeasible. We evaluate the cross section in different nuclei and compare our results with available data in 12C.
We study the contribution to ordinary Compton nuclear scattering of the resonant channel γ + A → (A′π⁻) → γ + A with the π⁻ bound in the nucleus. We show that the interference of this resonant channel with background amplitudes produces significant peaks in the elastic backward differential cross section as a function of the incoming...
The double-differential cross section for inclusive (γ, π) reactions in nuclei is studied by combining elements of microscopic many-body theories previously developed. Pion production in nuclei changes with respect to the impulse approximation not only because of nuclear modifications to the primary (γ, π) reaction compared to the free case, but mo...
We use a recently developed microscopic approach to photonuclear reactions at intermediate energies in order to calculate the enhancement κ of the nuclear dipole sum rule over its classical Thomas-Reiche-Kuhn value (60 mb MeV) NZ/A. The difficulties in comparing κ, evaluated with the double commutator, with the observable photonuclear cross...
Starting from the basic interactions between photons, pions, nucleons and isobars we reconstruct a standard model providing an adequate description of the γN→πN reaction. With this, and the ph, Δh effective interactions used with success in the pion-nuclear reactions, we develop a systematic many-body expansion in the number of ph excitations in th...
Using a microscopic many-body approach to pion and photonuclear reactions, we study the mechanisms of pion and photon absorption, with emphasis on the number of nucleons involved in the genuine absorption process.
Results of computer simulation of pion production in photonuclear reactions are presented. Differential cross sections and photon absorption cross sections are shown. (AIP)
The problem of inclusive radiative pion capture is reanalyzed from a many-body point of view which allows one to investigate effects like the Pauli blocking and the polarization of the medium by the spin-isospin interaction in a systematic way. Standard approximations are improved by means of this method, which is however much simpler technically than...
In this work we apply a microscopic many-body approach to photonuclear reactions [1] in order to improve the understanding of the dipole sum rule. At the same time, the sum rule provides us with a test of consistency of the underlying theory at low photon energies.
A new, robust sliding-window part-of-speech tagger is presented, which itself is an approximation of an existing model, and a method is described to estimate its parameters from an untagged corpus. The approximation reduces the memory requirements without a significant loss in accuracy. Its performance is compared to that of the original sliding...
We describe a technique that maps unranked trees to their hash codes using a bottom-up deterministic tree automaton (DTA). In contrast to techniques implemented with minimal tree automata, our procedure builds a pseudo-minimal DTA. Pseudo-minimal automata are larger than the minimal ones but in turn the mapping can be arbitrary, so it can be dete...
This paper describes a set of tools and Java classes that allow the Lucene text search engine to use morphological information to index and search; in particular, it describes the use of the linguistic resources developed for the Apertium open-source machine translation platform to extract morphological information while indexing. We describe which...