Simon Mille

University Pompeu Fabra | UPF · Department of Information and Communication Technologies (DTIC)

About

76
Publications
9,623
Reads
866
Citations
Introduction
Simon Mille currently works at the Department of Information and Communication Technologies (DTIC), University Pompeu Fabra. Simon does research in Computational Linguistics. Their most recent publication is 'KRISTINA: A Knowledge-Based Virtual Conversation Agent'.

Publications (76)
Preprint
Full-text available
The acquisition of high-quality human annotations through crowdsourcing platforms like Amazon Mechanical Turk (MTurk) is more challenging than expected. The annotation quality might be affected by various aspects like annotation instructions, Human Intelligence Task (HIT) design, and wages paid to annotators, etc. To avoid potentially low-quality a...
Preprint
Full-text available
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requi...
Conference Paper
Full-text available
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductio...
Preprint
Full-text available
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproduction...
Preprint
Full-text available
Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (...
Preprint
Full-text available
Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such,...
Preprint
Multilingual Transformer-based language models, usually pretrained on more than 100 languages, have been shown to achieve outstanding results in a wide range of cross-lingual transfer tasks. However, it remains unknown whether the optimization for different languages conditions the capacity of the models to generalize over syntactic structures, and...
Preprint
Full-text available
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. However, due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-...
Conference Paper
Full-text available
In this paper, we present a pipeline system that generates architectural landmark descriptions using textual, visual and structured data. The pipeline comprises five main components: (i) a textual analysis component, which extracts information from Wikipedia pages; (ii) a visual analysis component, which extracts information from copyright-free ima...
Chapter
Full-text available
In this paper, based on the recent outcome of two shared tasks on structured data verbalisation, and examining one system in particular, we present some evidence why grammar-based systems are particularly relevant for the verbalisation of structured data as found in the Semantic Web. We then define possible future lines of research, centered around...
Article
Full-text available
MindSpaces provides solutions for creating functionally and emotionally appealing architectural designs in urban spaces. Social media services, physiological sensing devices and video cameras provide data from sensing environments. State-of-the-Art technology including VR, 3D design tools, emotion extraction, visual behaviour analysis, and textual...
Conference Paper
In this paper, we report on adapting a Natural Language Generation system to Semantic Web datasets, and on the results obtained in two triple verbalization challenges.
Conference Paper
Full-text available
During climate-related crises vast volumes of heterogeneous multimodal information are generated. Meaningfully processing and communicating this information for efficient decision support is a key challenge. The paper describes applying Semantic Web technologies for decision support during such crises. We are proposing the application of these tech...
Conference Paper
Full-text available
are in great need of acquiring, re-using and re-purposing visual and textual data to recreate, renovate or produce a novel target space, building or element. This is in line with the abrupt increase, lately observed, in the use of immersive VR environments and the great technological advance that can be found in the acquisition and mani...
Conference Paper
We present an intelligent embodied conversation agent with linguistic, social and emotional competence. Unlike the vast majority of the state-of-the-art conversation agents, the proposed agent is constructed around an ontology-based knowledge model that allows for flexible reasoning-driven dialogue planning, instead of using predefined dialogue scr...
Article
Full-text available
We present work in progress on an intelligent embodied conversation agent that is supposed to act as a social companion with linguistic and emotional competence in the context of basic and health care. The core of the agent is an ontology-based knowledge model that supports flexible reasoning-driven conversation planning strategies. A dedicated sea...
Article
Full-text available
Patent search is recall-driven, which goes hand in hand with at least a partial sacrifice of precision. As a consequence, patent analysts have to regularly view and examine a large amount of patents. This implies a very high workload. Interactive analysis aids that help to minimize this workload are thus of high demand. Still, these aids do not red...
Conference Paper
Full-text available
We present work in progress on an intelligent embodied conversation agent in the basic care and healthcare domain. In contrast to most of the existing agents, the presented agent is designed to have the linguistic, cultural, social and emotional competence needed to interact with the elderly and migrants. It is composed of an ontology-based and reasoning-drive...
Conference Paper
Full-text available
We present work in progress on an intelligent embodied conversation agent in the basic care and healthcare domain. In contrast to most of the existing agents, the presented agent is designed to have the linguistic, cultural, social and emotional competence needed to interact with the elderly and migrants. It is composed of an ontology-based and reasoning-drive...
Conference Paper
Full-text available
We report here on progress toward a pipeline for the deep generation of metaphorical expressions in natural language. Our approach uses a combination of artificial intelligence and deep natural language generation. Metaphor is ubiquitous in forms of everyday discourse [1], [2], such as ordinary conversation, news articles, popular novels, advertise...
Article
Full-text available
‘Deep-syntactic’ dependency structures that capture the argumentative, attributive and coordinative relations between full words of a sentence have a great potential for a number of NLP-applications. The abstraction degree of these structures is in between the output of a syntactic dependency parser (connected trees defined over all words of a sent...
Article
Data on observed and forecasted environmental conditions, such as weather, air quality and pollen, are offered in a great variety in the web and serve as basis for decisions taken by a wide range of the population. However, the value of these data is limited because their quality varies largely and because the burden of their interpretation in the...
Article
The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. I...
Article
The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. I...
Conference Paper
Full-text available
"Deep-syntactic" dependency structures bridge the gap between the surface-syntactic structures as produced by state-of-the-art dependency parsers and semantic logical forms in that they abstract away from surface-syntactic idiosyncrasies, but still keep the linguistic structure of a sentence. They thus have great potential for such downstream app...
Conference Paper
Full-text available
structures from which the generation naturally starts often do not contain any functional nodes, while surface-syntactic structures or a chain of tokens in a linearized tree contain all of them. Therefore, data-driven linguistic generation needs to be able to cope with the projection between non-isomorphic structures that differ in their topology a...
Article
In this article, we present an operational prototype of a workbench for intelligent patent document analysis and summarization that has been developed in the context of the R&D project TOPAS, partially funded by the European Commission. The workbench uses the GATE environment as infrastructure for document representation and algorithm integration....
Article
Environmental and meteorological conditions are of utmost importance for the population, as they are strongly related to the quality of life. Citizens are increasingly aware of this importance. This awareness results in an increasing demand for environmental information tailored to their specific needs and background. We present an environmental in...
Article
Semantic stochastic sentence realization is still in its fledgling stage. Most of the available stochastic realizers start from syntactic structures or shallow semantic structures, which still contain numerous syntactic features. This is unsatisfactory since sentence generation traditionally starts from abstract semantic or conceptual structures. H...
Conference Paper
Full-text available
Environmental and meteorological conditions are of utmost importance for the population, as they are strongly related to the quality of life. Citizens are increasingly aware of this importance. This awareness results in an increasing demand for environmental information tailored to their specific needs and background. We present an environmental in...
Conference Paper
Full-text available
Natural Language Generation (NLG) from knowledge bases (KBs) has repeatedly been subject of research. However, most proposals tend to have in common that they start from KBs of limited size that either already contain linguistically-oriented knowledge structures or to whose structures different ways of realization are explicitly assigned. To avoi...
Article
Team sports commentaries call for techniques that are able to select content and generate wordings to reflect the affinity of the targeted reader for one of the teams. The existing works tend to have in common that they either start from knowledge sources of limited size to whose structures then different ways of realization are explicitly assigned...
Conference Paper
Natural Language Generation (NLG) from knowledge bases (KBs) has repeatedly been subject of research. However, most proposals tend to have in common that they start from KBs of limited size that either already contain linguistically-oriented knowledge structures or to whose structures different ways of realization are explicitly assigned. To avoid...
Conference Paper
Full-text available
Citizens are increasingly aware of the influence of environmental and meteorological conditions on the quality of their life. This results in an increasing demand for personalized environmental information, i.e., information that is tailored to citizens’ specific context and background. In this demonstration, we present an environmental information...
Article
Full-text available
A treebank may contain the annotation of different phenomena such as word order, morphological features, syntactic and semantic relations, etc., which are rather different in their nature. Quite often, the annotation of these phenomena is combined in a single structure, which leads to low-quality training results and is verifiably deficient from a theo...
Conference Paper
Full-text available
The Surface Realisation Shared Task was first run in 2011. Two common-ground input representations were developed and for the first time several independently developed surface realisers produced realisations from the same shared inputs. However, the input representations had several shortcomings which we have been aiming to address in the time s...
Conference Paper
Full-text available
Until recently, deep stochastic surface realization has been hindered by the lack of semantically annotated corpora. This is about to change. Such corpora are increasingly available, e.g., in the context of CoNLL shared tasks. However, recent experiments with CoNLL 2009 corpora show that these popular resources, which serve well for other appl...
Conference Paper
Full-text available
The common use of a single de facto standard annotation scheme for dependency treebank creation leaves the question open to what extent the performance of an application trained on a treebank depends on this annotation scheme and whether a linguistically richer scheme would imply a decrease of the performance of the application. We investigate the...
Article
Full-text available
In this paper we describe the development of a text simplification system for Spanish. Text simplification is the adaptation of a text to the special needs of certain groups of readers, such as language learners, people with cognitive difficulties and elderly people, among others. There is a clear need for simplified texts, but manual production an...
Conference Paper
Full-text available
Citizens are increasingly aware of the influence of environmental and meteorological conditions on the quality of their life. This results in an increasing demand for personalized environmental information, i.e., information that is tailored to citizens’ specific context and background. In this work we describe the development of an environmental i...
Conference Paper
Full-text available
Semantic stochastic sentence realization is still in its fledgling stage. Most of the available stochastic realizers start from syntactic structures or shallow semantic input structures which still contain numerous syntactic features. This is unsatisfactory since sentence generation traditionally starts from abstract semantic or conceptual stru...
Conference Paper
Full-text available
Over the last decade, the prominence of statistical NLP applications that use syntactic rather than only word-based shallow clues increased very significantly. This prominence triggered the creation of large scale treebanks, i.e., corpora annotated with syntactic structures. However, a look at the annotation schemata used across these treebanks...
Conference Paper
Full-text available
Description of the statistical generator submitted at the First Surface Realization Shared Task.
Conference Paper
Full-text available
Communicative structure is central to the linguistic representation at nearly all levels of the Meaning-Text Models (MTMs). Its correlation with lexical and syntactic features makes it also essential for such natural language processing applications as text generation, which is about to undergo a significant shift from the symbolic, rule-based para...
Conference Paper
Full-text available
The relevance of syntactic dependency annotated corpora is nowadays unquestioned. However, a broad debate on the optimal set of dependency relation tags has not yet taken place. As a result, tag sets of widely varying content and size are used in different annotation initiatives. We propose a hierarchical dependency structure annotation schem...
Conference Paper
Full-text available
Most of the known stochastic sentence generators use syntactically annotated corpora, performing the projection to the surface in one stage. However, in full-fledged text generation, sentence realization usually starts from semantic (predicate-argument) structures. To be able to deal with semantic structures, stochastic generators require semantica...
Conference Paper
Citizens are increasingly aware of the influence of environmental and meteorological conditions on the quality of their life. The consequence of this awareness is the demand for personalized environmental information, i.e., information that is tailored to their specific context and background. The EU-funded project PESCaDO addresses this demand in...
Conference Paper
Full-text available
With their abstract vocabulary and overly long sentences, patent claims, like several other genres of legal discourse, are notoriously difficult to read and comprehend. The enormous number of both native and non-native users reading patent claims on a daily basis raises the demand for means that make them easier and faster to understand. An obvious...
Conference Paper
Full-text available
We present a cost-effective strategy for the creation of a mid-size fine-grained dependency treebank of surface- and deep-syntactic structures as defined in the Meaning-Text Theory for Spanish. The strategy starts from a small seed dependency corpus, the AnCora corpus, whose annotation is considerably more coarse-grained than our target annotation....
Conference Paper
Full-text available
Hardly any other kind of text structures is as notoriously difficult to read as patents - which is first of all due to their abstract vocabulary and their very complex syntactic constructions. Especially the claims in a patent are a challenge: in accordance with international patent writing regulations, each claim must be rendered in a single...
Article
Full-text available
Hardly any other type of textual material is as difficult to read and comprehend as patents. Especially the claims in a patent reveal very complex syntactic constructions which are difficult to process even for native speakers, let alone for foreigners who do not master well the language in which the patent is written. Therefore, multilingual summa...
