
Colin de la Higuera - Nantes Université
Publications: 130
Reads: 14,650
Citations: 2,610
Publications (130)
This contribution presents two perspectives: one based on AI expertise and the other on the importance of knowledge as human knowledge. The latter focuses on the purpose of education by revisiting Alexander von Humboldt’s concept of Bildung as the main purpose of education. Resisting the shift to radically pragmatic models of education, without B...
(Re)launched by UNESCO since 2016, open education (OE) pursues the Sustainable Development Goals (SDGs) and thus aims to ensure equitable, inclusive, quality education and lifelong learning opportunities for all. It is now being deployed internationally through multiple initiatives...
A classical problem in grammatical inference is to identify a language from a set of examples. In this paper, we address the problem of identifying a union of languages from examples that belong to several different unknown languages. Indeed, decomposing a language into smaller pieces that are easier to represent should make learning easier than ai...
In a number of fields, it is necessary to compare a witness string with a distribution. One possibility is to compute the probability of the string for that distribution. Another, giving a more global view, is to compute the expected edit distance from a string randomly drawn to the witness string. This number is often used to measure the performan...
In a number of fields, one has to compare a witness string with a distribution. One possibility is to compute the probability of the string for that distribution. Another, giving a more global view, is to compute the expected edit distance from a string randomly drawn to the witness string. This number is often used to measure the performance of a pr...
Two combinatorial maps \(M_1\) and \(M_2\) overlap if they share a sub-map, called an overlapping pattern, which can be extended without conflicting with either \(M_1\) or \(M_2\). Isomorphism and subisomorphism are two particular cases of map overlaps which have been studied in the literature. In this paper, we show that finding the largest...
When learning languages or grammars, an attractive alternative to using a large corpus is to learn by interacting with the environment. This can allow us to deal with situations where data is scarce or expensive, but testing or experimenting is possible. The situation, which arises in a number of fields, is formalised in a setting called active lea...
This book provides a thorough introduction to the subfield of theoretical computer science known as grammatical inference from a computational linguistic perspective. Grammatical inference provides principled methods for developing computationally sound algorithms that learn structure from strings of symbols. The relationship to computational lingu...
Grammatical inference is a subfield of theoretical computer science which aims to characterize, understand, and solve learning problems in terms of formal languages and grammars. The field of computational linguistics faces many different kinds of tasks which involve natural languages and learning. Many of these tasks aim to automate decisions and...
Research in the field of grammatical inference deals with learnability of languages. In general, the setup is as follows. Given a family of languages, one specific language is selected and a set of sample strings is extracted. The learner now has to identify the language, from the family of languages, that was used to generate the sample strings.
We conclude the book with a brief summary of what has been covered, the main lessons we wish to impart, and the open problems where research efforts ought to be directed.
This paper investigates the possibilities offered by the increasingly common availability of scientific video material. In particular, it investigates how best to study research results by combining recorded talks and their corresponding scientific articles. To do so, it outlines desired properties of an interesting e-research system based on cogni...
We exhibit a family of computably enumerable sets which can be learned within polynomial resource bounds given access only to a teacher, but which requires exponential resources to be learned given access only to a membership oracle. In general, we compare the families that can be learned with and without teachers and oracles for four measures of e...
Generalized maps describe the subdivision of objects in cells and are widely used to model 2D and 3D images. In this context, several pattern recognition tasks involve solving submap isomorphism problems (to decide if a map is included in another map) or, more generally, computing maximum common submaps (to measure the distance between two maps). R...
Probabilistic context-free grammars (PCFGs) are used to define distributions over strings, and are powerful modelling tools in a number of areas, including natural language processing, software engineering, model checking, bio-informatics, and pattern recognition. A common important question is that of comparing the distributions generated or model...
Approximating distributions over strings is a hard learning problem. Typical techniques involve using finite state machines as models and attempting to learn these; these machines can either be hand built and then have their weights estimated, or built by grammatical inference techniques: the structure and the weights are then learned simultaneousl...
As a field, Grammatical Inference addresses both theoretical and empirical learning problems, and the collection of papers within this special issue attests both to the diversity of these problems as well as the advances and insights that are being made by the researchers working within it. Thus we hope this special issue is of interest to the read...
We prove the existence of a canonical form for semi-deterministic transducers with incomparable sets of output strings. Based on this, we develop an algorithm which learns semi-deterministic transducers given access to translation queries. We also prove that there is no learning algorithm for semi-deterministic transducers that uses only domain kno...
The problem of finding the consensus (most probable string) for a distribution generated by a weighted finite automaton or a probabilistic grammar is related to a number of important questions: computing the distance between two distributions or finding the best translation (the most probable one) given a probabilistic finite state transducer. The...
With the participation of François Bancilhon (Data publica), François Bourdoncle (Dassault systèmes), Stephan Clemencon (Telecom ParisTech), Colin de la Higuera (U. Nantes, SIF), Gilbert Saporta (CNAM) and Francoise Soulie-Fogelman (Kxen). François Bourdoncle and Paul Hermelin have been appointed "leaders" of the French Big Data sector. Their mission...
Graphs are used as models in a variety of situations. In some cases, e.g. to model images or maps, the graphs will be drawn in the plane, and this feature can be used to obtain new algorithmic results. In this work, we introduce a special class of graphs, called open plane graphs, which can be used to represent images or maps for robots: they are p...
In this paper we present a novel algorithm for learning probabilistic subsequential transducers from a randomly drawn sample. We formalize the properties of the training data that are sufficient conditions for the learning algorithm to infer the correct machine. Finally, we report experimental evidence to back up the correctness of our proposed alg...
This paper reviews the development of active learning in the last decade from the perspective of the treatment of data, a major source of undecidability, and therefore a key problem in achieving practicality. Starting with the first case studies, in which data was completely disregarded, we revisit different steps towards dealing with data explicitly in...
This chapter provides an overview of the most well-known settings and paradigms. The task of language learning deals with finding a language given a sample taken from that language. Typically, this language is described using a grammar, which is a compact, finite representation of a possibly infinite language. In general, language learning is perfo...
Combinatorial maps describe the subdivision of objects in cells, and incidence and adjacency relations between cells, and they are widely used to model 2D and 3D images. However, there is no algorithm for comparing combinatorial maps, which is an important issue for image processing and analysis. In this paper, we address two basic comparison probl...
Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some recent developments in the area of grammar induc...
The problem of finding the most probable string for a distribution generated by a weighted finite automaton or a probabilistic grammar is related to a number of important questions: computing the distance between two distributions or finding the best translation (the most probable one) given a probabilistic finite state transducer. The problem is u...
In order to use structural techniques from graph-based pattern recognition, a necessary first step consists in automatically extracting a graph from an image. We propose to extract plane graphs, because of the algorithmic properties these graphs have for isomorphism-related problems. We also consider the problem of extracting semantically well-fou...
Although MATLAB has become one of the mainstream languages for the machine learning community, there is still skepticism among the Grammatical Inference (GI) community regarding the suitability of MATLAB for implementing and running GI algorithms. In this paper we will present implementation results of several GI algorithms, e.g., RPNI (Regular Pos...
The problem of inducing, learning or inferring grammars has been studied for decades, but only in recent years has grammatical inference emerged as an independent field with connections to many scientific disciplines, including bio-informatics, computational linguistics and pattern recognition. This book meets the need for a comprehensive and unifi...
The terms grammatical inference and grammar induction both seem to indicate that techniques aiming at building grammatical formalisms when given some information about a language are not concerned with automata or other finite state machines. This is far from true, and many of the more important results in grammatical inference rely heavily on auto...
We address the problem of searching for a pattern in a plane graph, that is, a planar drawing of a planar graph. We define plane subgraph isomorphism and give a polynomial algorithm for this problem. We show that this algorithm may be used even when the pattern graph has holes.
Active language learning is an interesting task for which theoretical results are known and several applications exist. In order to better understand what the better strategies may be, a new competition called Zulu (http://labh-curien.univ-st-etienne.fr/zulu/) is launched: participants are invited to learn deterministic finite automata from membe...
In this paper, we address the problem of searching for a pattern in a plane graph, i.e., a planar drawing of a planar graph. To do that, we propose to model plane graphs with 2-dimensional combinatorial maps, which provide nice data structures for modelling the topology of a subdivision of a plane into nodes, edges and faces. We define submap isomo...
Current methods for evaluating research are based on counting the number of citations received for publications. Thus, the more an article is cited, the more important its impact is considered to be. In this article, we propose a new method for assessing the reputation of scientific journals, based on a web application in which the votes of expert...
When dealing with language, (machine) learning can take many different faces, of which the most important are those concerned with learning languages and grammars from data. Questions in this context have been at the intersection of the fields of inductive inference and computational linguistics for the past fifty years. To go back to the pioneerin...
Comparison of standard language learning paradigms (identification in the limit, query learning, PAC learning) has always been a complex question. Moreover, when computational constraints are added to the question of converging to a target, the picture becomes even less clear: how much do queries or negative examples help? Can we find good algorithm...
When facing the question of learning languages in realistic settings, one has to tackle several problems that do not admit simple solutions. On the one hand, languages are usually defined by complex grammatical mechanisms for which the learning results are predominantly negative, as the few algorithms are not really able to cope with noise. On...
Current methods for evaluating research are based on counting the number of citations received for publications. Thus, the more an article is cited, the more important its impact is considered to be. In this article, we propose a new method for assessing the reputation of scientific journals, based on a Web application that gathers the vot...
Abstract: There are many paradigms for studying the learnability of classes of languages: identification in the limit, learning from queries, probably approximately correct learning. Comparing these frameworks is difficult. To show their full richness, we focus on two classes...
In order to better fit a variety of pattern recognition problems over strings, using a normalised version of the edit or Levenshtein distance is considered to be an appropriate approach. The goal of normalisation is to take into account the lengths of the strings. We define a new normalisation, contextual, where each edit operation is divided by th...
Applied Artificial Intelligence discusses applications of grammar induction (GI) that identifies grammar. A grammar is a rule-based, generative model of the elements in a possibly infinite set, where these elements are complex, structured objects like strings, trees, and graphs. The GI problem is to identify a grammar given some of the elements in...
In most countries where computer science is taught, there are usually a number of courses in theoretical topics, closer to discrete mathematics, and often known under the generic name of theoretical computer science. Formal languages, graph theory, algorithm design, logics, complexity and computability are some of the topics that appear in su...
There are a number of established paradigms to study the learnability of classes of functions or languages: query learning, identification in the limit, Probably Approximately Correct learning. Comparison between these paradigms is hard. Moreover, when computational constraints are added to the question of converging, the picture becomes even less c...
In the 1980s, Angluin developed an active learning paradigm based on an oracle able to answer membership queries and equivalence queries. However, while in the various applications of active learning answers to the former are often easy to obtain, having access to the latter is not always...
During the 1980s, Angluin introduced an active learning paradigm using an oracle capable of answering both membership and equivalence queries. However, practical evidence tends to show that while the former are often available, this is usually not the case for the latter. We propose new queries, called correction queries, which we study in the framewo...
Whereas there are a number of methods and algorithms to learn regular languages, moving up the Chomsky hierarchy is proving to be a challenging task. Indeed, several theoretical barriers make the class of context-free languages hard to learn. To tackle these barriers, we choose to change the way we represent these languages. Among the formalisms t...
To study the problem of learning from noisy data, the common approach is to use a statistical model of noise. The influence of the noise is then considered according to pragmatic or statistical criteria, by using a paradigm taking into account a distribution of the data. In this article, we study the noise as a nonstatistical phenomenon, by definin...
We propose 10 different open problems in the field of grammatical inference. In all cases, problems are theoretically oriented but correspond to practical questions. They cover the areas of polynomial learning models, learning from ordered alphabets, learning deterministic Pomdps, learning negotiation processes, learning from context-free backg...
Algorithms that infer deterministic finite automata from given data and that comply with the identification in the limit condition have been thoroughly tested and are in practice often preferred to elaborate heuristics. Even if there is no guarantee of identification from the available data, the existence of associated characteristic sets means tha...
Grammatical inference (also known as grammar induction) is a field transversal to a number of research areas including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology, and speech recognition. Specificities of the problems that are studied include those related...
The field of grammatical inference (also known as grammar induction) is transversal to a number of research areas including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology and speech recognition. There is no uniform literature on the subject and one can find ma...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition, and machine translation are some of them. In Part I of this paper, we sur...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition or in fields to which pattern recognition is linked. In Part I of this paper, we surveyed these objects and studied their properties. In this Part II, we study the relations between probabilistic finite-state automata and other well-known devices that ge...
Stochastic deterministic finite automata have been introduced and are used in a variety of settings. We use them to model musical styles: the same automaton can be used to classify new melodies but also to generate them. Through grammatical inference these automata are learned and new pieces of music can be parsed. We show that this works by propos...
Grammatical inference deals with learning grammars or automata from different kinds of textual information. A general paradigm allowing us to describe the convergence of the process is that of identification in the limit. When trying to combine this paradigm with complexity issues, problems arise. We revisit identification in the limit from a (slightly) ca...
Powerful methods and algorithms are known to learn regular languages. Aiming at extending them to more complex grammars, we choose to change the way we represent these languages. Among the formalisms that make it possible to define classes of languages, that of string-rewriting systems (SRS) has outstanding properties. Indeed, SRS are expressive enough to...
Stochastic deterministic finite automata have been introduced and are used in a variety of settings. We report here a number of results concerning the learnability of these finite state machines. In the setting of identification in the limit with probability one, we prove that stochastic deterministic finite automata cannot be identified from only...
Grammatical inference consists in learning formal grammars for unknown languages when given sequential learning data. Classically this data is raw: strings that belong to the language and possibly strings that do not. In this paper, we present a generic setting that makes it possible to express domain and typing background knowledge. Algorithmic solutions are...
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked. In part I of this paper, we surveyed these objects and studied their properties. In this part II, we study the relations between probabilistic finite-state automata and other well known devices that g...
In this paper, we propose a way of incorporating additional knowledge in probabilistic automata inference, by using typed automata. We compare two kinds of knowledge that are introduced into the learning algorithms. A statistical clustering algorithm and a part-of-speech tagger are used to label the data according to statistical or syntactic inform...
Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based...
Büchi automata are used to recognize languages of infinite strings. Such languages have been introduced to describe the behavior of real-time systems or infinite games. The question of inferring them from infinite examples has already been studied, but it may seem more reasonable to believe that the data from which we want to learn is a set of fini...
Learning context-free grammars is generally considered a very hard task. This is even more the case when learning has to be done from positive examples only. In this context one possibility is to learn stochastic context-free grammars, by making the implicit assumption that the distribution of the examples is given by such an object. Nevertheless t...
Grammatical inference consists in learning formal grammars for unknown languages when given learning data. Classically this data is raw: strings that belong to the language and possibly strings that do not. We present in this paper the possibility of learning when presented with additional information such as the knowledge that the hidden languag...
Linearity and determinism seem to be two essential conditions for polynomial learning of grammars to be possible. We propose a general condition valid for certain subclasses of the linear grammars given which these classes can be polynomially identified in the limit from given data. This enables us to give new proofs of the identification of well k...
Linearity and determinism seem to be two essential conditions for polynomial language learning to be possible. We compare several definitions of deterministic linear grammars, and for a reasonable definition prove the existence of a canonical normal form. This enables us to obtain positive learning results in case of polynomial learning from a...
We present in this paper a new learning problem called learning distributions from experts. In the case we study the experts are stochastic deterministic finite automata (sdfa). We deal with the situation arising when wanting to learn sdfa from unrepeated examples. This is intended to model the situation where the data is not generated automaticall...
When concerned about efficient grammatical inference two issues are relevant: the first one is to determine the quality of the result, and the second is to try to use polynomial time and space. A typical idea to deal with the first point is to say that an algorithm performs well if it infers in the limit the correct language. The second point has l...
Büchi automata are used to recognize languages of infinite words.
Signatures, as introduced to cope with semantics of programming languages, provide an interesting framework for knowledge representation. They allow structured and recursive properties that attribute representation does not permit, and at the same time seem to be algorithmically treatable. We recall how a generalisation relation on terms can be int...