Laurent Romary

National Institute for Research in Computer Science and Control | INRIA · ALPAGE - Large-scale Deep Linguistic Processing Research Team

Dr. Habil.

331
Publications
26,169
Reads
2,617
Citations
Additional affiliations
January 2008 - August 2015
Humboldt-Universität zu Berlin
Position
  • Guest researcher
January 1997 - August 2006
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Position
  • Researcher

Publications (331)
Presentation
Full-text available
Presentation outlining a proposal for a deep revision of ISO 24611 (Morphosyntactic Annotation Framework, MAF). The proposal has been subsequently approved by the committee ballot.
Article
Full-text available
A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus...
Preprint
Full-text available
A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus...
Article
Factoid Question-Answering (QA) Systems were developed to provide an accurate answer to a factoid question expressed in a natural language. The prime knowledge resources for most factoid QA systems are online databases. However, the unstructured information in these resources raises the complexity of the information retrieval task. In this paper,...
Article
This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the nati...
Article
This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context...
Conference Paper
Full-text available
In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demo...
Article
Full-text available
This article provides an in-depth comparison and proposal for mapping between Simple Knowledge Organization System (SKOS) and TermBase eXchange (TBX), two important exchange standards within the knowledge and terminology landscape. The attempt to develop an interface or conversion routine between SKOS and TBX is rooted in a strong demand in the la...
Preprint
Full-text available
Lexical Markup Framework (LMF) or ISO 24613 [1] is a de jure standard that provides a framework for modelling and encoding lexical information in retrodigitised print dictionaries and NLP lexical databases. An in-depth review is currently underway within the standardisation subcommittee, ISO-TC37/SC4/WG4, to find a more modular, flexible and durab...
Chapter
A segmentation tool for a hadith corpus is necessary to prepare the TEI hadith encoding process. In this context, we aim to develop a tool allowing the segmentation of hadith text from Sahih al-Bukhari corpus. To achieve this objective, we start by identifying different hadith structures. Then, we elaborate an automatic processing tool for hadith s...
Chapter
The standardization of Al-Hadith Al-Shareef can guarantee interoperability and interchangeability with other textual sources and takes the processing of the Al-Hadith corpus to a higher level. Still, research on Hadith corpora has not previously considered standardization as a real objective, especially for some standards such as TEI (Text...
Poster
This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the nati...
Article
This paper discusses the digital dictionary component in an ongoing language documentation project for the Mixtepec-Mixtec language (ISO 639-3: mix). Mixtepec-Mixtec (Sa'an Savi 'rain language') is an Otomanguean language spoken by roughly 9,000-10,000 people in the Juxtlahuaca district of Oaxaca and in parts of the Guerrero and Puebla states of Me...
Chapter
Full-text available
This chapter provides a broad overview of the state-of-the-art in standards development for language resources, beginning with a brief historical overview to serve as context. It describes in some detail several current, major efforts that define the standardization landscape for language resources today, with the aim of outlining their differences...
Article
This paper aims to provide a comprehensive modeling and representation of etymological data in digital dictionaries. The purpose is to integrate in one coherent framework both digital representations of legacy dictionaries, and also born-digital lexical databases that are constructed manually or semi-automatically. We want to propose a systematic a...
Chapter
Lexical resources are increasingly multiplatform due to the diverse needs of linguists as well as other various user communities. Merging, comparing, making correspondences and deducing differences between these lexical resources remain difficult tasks. Thus, interoperability between these resources is an extremely difficult task. In this context,...
Article
This paper provides both an update concerning the setting up of the European DARIAH infrastructure and a series of strong action lines related to the development of a data-centred strategy for the humanities in the coming years. In particular we tackle various aspects of data management: data hosting, the setting up of a DARIAH seal of approval, the...
Book
Full-text available
The humanities have convincingly argued that they need transnational research opportunities and, through the digital transformation of their disciplines, also have the means to pursue them on a hitherto unknown scale. The digital transformation of research and its resources means that many of the artifacts, documents, materials, etc. that interest...
Article
This paper provides an overview of the various projects carried out within ISO committee TC 37/SC 4 dealing with the management of language (digital) resources. On the basis of the technical experience gained in the committee and the wider standardization landscape the paper identifies some possible trends for the future.
Article
We are pleased to present here our poster about "A TEI conformant pivot format for HAL back-office". We explain how the French open archive HAL uses the TEI guidelines for the construction of its XML pivot format. With a little customization, we have described the resource metadata used at all levels of representation in HAL: submissions, queries,...
Conference Paper
Full-text available
The automatic development of terminological databases, especially in a standardized format, is crucial for multiple applications related to technical and scientific knowledge that require semantic and terminological descriptions covering multiple domains. In this context, we address two challenges in this paper: the first is the automati...
Article
Full-text available
This paper introduces an XML format developed to serialise the object model defined by the ISO Syntactic Annotation Framework SynAF. Based on widespread best practices, we adapt a popular XML format for syntactic annotations, TigerXML, with additional features to support a variety of syntactic phenomena including constituent and dependency structu...
Article
Full-text available
The report is the output of the ALLEA Working Group on E-Humanities, chaired by Dr Sandra Collins, Director of the Digital Repository of Ireland at the Royal Irish Academy. Edited by Dr. Natalie Harrower, the report was co-written by experts in the Digital Humanities from six Academies across Europe. The report discusses the state of the art of D...
Article
Full-text available
This article outlines a proposal for a consistent encoding of stand-off annotations in the frame of the TEI standard. The proposed encoding requires the extension of the current TEI schema with three additional elements, directly related to the encoding of stand-off annotations that provide a generic and flexible structure for encoding stand-off an...
Conference Paper
Full-text available
Patent applications are similarly structured worldwide. They consist of a cover page, a specification, claims, drawings (if necessary) and an abstract. In addition to their content (text, numbers and citations), all patent publications contain a relatively rich set of well-defined metadata. In the Arabic world, there is no North African or Arabian...
Article
Full-text available
This paper delineates the main characteristics of the Episciences platform, an environment for overlay peer-reviewing that complements existing publication repositories, designed by the Centre pour la Communication Scientifique directe (CCSD) service unit. We describe the main characteristics of the platform and present the first experiment of laun...
Article
In recent years, new developments in the area of lexicography have altered not only the management, processing and publishing of lexicographical data, but also created new types of products such as electronic dictionaries and thesauri. These expand the range of possible uses of lexical data and support users with more flexibility, for instance in a...
Article
This paper presents an attempt to customise the TEI (Text Encoding Initiative) guidelines in order to offer the possibility to incorporate TBX (TermBase eXchange) based terminological entries within any kind of TEI documents. After presenting the general historical, conceptual and technical contexts, we describe the various design choices we had to...
Article
In recent years, new developments in the area of lexicography have altered not only the management, processing and publishing of lexicographical data, but also created new types of products such as electronic dictionaries and thesauri. These expand the range of possible uses of lexical data and support users with more flexibility, for instance in a...
Article
Full-text available
Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. And as increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and other language resources that are needed by dictionaries and diction...
Article
Full-text available
A variety of initiatives for developing virtual research environments, research infrastructures, and cyberinfrastructures have been funded in recent years. The systems produced vary considerably, but they all face the issue of sustainability, namely how to ensure the continued existence of a resource once the project that created it has finished. T...
Article
Full-text available
In recent years, European governments and funders, universities and academic societies have increasingly discovered the digital humanities as a new and exciting field that promises new discoveries in humanities research. The funded projects are, however, often ad hoc experiments and stand in isolation from other national and international work. Wha...
Article
Full-text available
The present paper explores various arguments in favour of making the Text Encoding Initiative (TEI) guidelines an appropriate serialisation for ISO standard 24613:2008 (LMF, Lexical Mark-up Framework). It also identifies the issues that would have to be resolved in order to reach an appropriate implementation of these ideas, in particular in ter...
Article
Full-text available
This paper presents the application of the format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 (SynAF) standard. After outlining the main characteristics of both the SynAF meta-model and the format, as extended from the initial Tiger XML format (König & Lezius, 2000), we show through a range...
Article
Our paper outlines a proposal for the consistent modeling of heterogeneous lexical structures in semasiological dictionaries, based on the element structures described in detail in chapter 9 (Dictionaries) of the TEI Guidelines. The core of our proposal describes a system of relatively autonomous lexical “crystals” that can, within the constraints...
Article
Full-text available
An elaborate approach to creating a reference terminology has been developed. First, terms are selected from everyday interactions between doctors (occurrences in medical records, in guidelines), based on frequency count and relevance. Then an onomasiological approach is applied to make a choice on basic concepts most intricately related to the mos...
Article
After two decades of repository development, some conclusions may be drawn as to which type of repository and what kind of service best supports digital scholarly communication. In this regard, four types of publication repository may be distinguished, namely the subject-based repository, research repository, national repository system, and institu...
Article
Full-text available
The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, t...
Article
Full-text available
The comparative evaluation of Arabic HPSG grammar lexica requires a deep study of their linguistic coverage. The complexity of this task results mainly from the heterogeneity of the descriptive components within those lexica (underlying linguistic resources and different data categories, for example). It is therefore essential to define more homoge...
Article
The current system of so-called institutional repositories, even if it was a sensible response at an earlier stage, may not answer the needs of the scholarly community, scientific communication and accompanied stakeholders in a sustainable way. However, having a robust repository infrastructure is essential to academic work. Yet, current institutio...
Article
Full-text available
In this paper, we present SALT, a framework for mapping heterogeneous linguistic formats to one another based on a model-based approach, i.e. independently of the actual formats in which the corresponding linguistic data is being expressed. While we describe the underlying concept of this framework, we identify how it echoes past ongoing standar...
Conference Paper
Full-text available
French researchers are frequently required to translate into French the description of their work published in English. At the same time, the need for French people to access articles in English, or for international researchers to access theses or papers in French, is incorrectly resolved via the use of generic translation tools. We propose the...
Chapter
After two decades of repository development, some conclusions may be drawn as to which type of repository and what kind of service best supports digital scholarly communication. In this regard, four types of publication repository may be distinguished, namely the subject-based repository, research repository, national repository system, and institu...
Chapter
The current system of so-called institutional repositories, even if it was a sensible response at an earlier stage, may not answer the needs of the scholarly community, scientific communication and accompanied stakeholders in a sustainable way. However, having a robust repository infrastructure is essential to academic work. Yet, current institutio...
Conference Paper
Full-text available
Health care professionals experience difficulties in the correct medical registration of clinical work and in the efficient searching for answers to clinical questions. These difficulties arise often from a deficient interface between human and machine language. Terminological solutions are often naive attempts to standardize language and terms, wi...
Article
Full-text available
This paper addresses the need for a meta-model for corpus annotation schemes. Through the analysis of a specific annotation level - the reference level defined in the MATE project - and the experience we gained in annotating multimodal corpora for , we show that the current MATE annotation scheme can be extended to a general framework. The available ta...
Article
Full-text available
The goal of the present chapter is to explore the possibility of providing the research (but also the industrial) community that commonly uses spoken corpora with a stable portfolio of well-documented standardised formats that allow a high re-use rate of annotated spoken resources and, as a consequence, better interoperability across tools used to...
Article
Full-text available
This document proposes an overview of the current (at the time of writing) scene towards an Interoperability Framework and acts as a reference point for the standards that our community supports. This initiative is in close synchronization with other relevant initiatives such as CLARIN, ELRA, ISO, TEI and META-SHARE. The document builds on th...
Article
Full-text available
The chapter tackles the role of scholarly publication in the research process (quality, preservation) and looks at the consequences of new information technologies in the organization of the scholarly communication ecology. It will then show how new technologies have had an impact on the scholarly communication process and made it depart from the t...
Article
Full-text available
We present the infrastructure for mapping publishers' metadata formats into a standardized TEI representation in the context of the EU PEER project. Initiated as an experiment to observe the consequence of large scale author manuscript deposit in publication repositories, the project led to the design and implementation of an information HUB (the P...
Article
Full-text available
It is usual to consider that standards generate mixed feelings among scientists. They are often seen as not really reflecting the state of the art in a given domain and a hindrance to scientific creativity. Still, scientists should theoretically be at the best place to bring their expertise into standard developments, being even more neutral on iss...
Article
Full-text available
This technical note presents the system built for the IP track of CLEF 2010 based on PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS), the modular search infrastructure initially realized for CLEF IP 2009. We largely reused the system of the previous CLEF IP but at a relatively smaller scale and with the improvement of three main comp...
Article
Full-text available
This report presents the main components of the Tiger2 format for the representation of syntactic annotations for linguistic data. Derived from the existing Tiger format and in compliance with ISO standard 24615 (SynAF), it offers mechanisms covering the wide range of constituency and dependency annotations.
Article
Full-text available
The SemEval task 5 was an opportunity for experimenting with the key term extraction module of GROBID, a system for extracting and generating bibliographical information from technical and scientific documents. The tool first uses GROBID's facilities for analyzing the structure of scientific articles, resulting in a first set of structural features...
Article
Full-text available
Could the internet sound the death knell for scientific journals? While the proponents of online publication hope for a change in the economic model of these journals, their priority is to improve the dissemination of research results.
Conference Paper
Full-text available