Citations

... This evaluation measured the impact of various parameters on the approach. Second, a manual evaluation was carried out on the Taxon dataset about plant taxonomy, composed of four large populated ontologies: AgronomicTaxon [25], AgroVoc [5], DBpedia [3] and TaxRef-LD [16]. Six CQAs from AgronomicTaxon were manually generated. ...
... The Java implementation of the evaluation system, as well as the Populated Conference dataset, is available. The variants of the approach were compared to its baseline (Table 2). The parameters not described in this table, such as the path length threshold (3), the DL formula filtering threshold (0.6), and the structural similarity constants (0.5 for a path, 0 for a class expression), are set as presented in Section 5.1. ...
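To keep the quoted values in one place, here is a minimal sketch of how such a parameter set could be grouped in code; the key names are hypothetical, only the values come from the excerpt above.

```python
# Hypothetical grouping of the CANARD parameters quoted above;
# the key names are illustrative, only the values come from the excerpt.
CANARD_PARAMS = {
    "path_length_threshold": 3,              # maximum path length explored
    "dl_formula_filtering_threshold": 0.6,   # minimum score to keep a DL formula
    "structural_similarity_path": 0.5,       # constant assigned to a path
    "structural_similarity_class_expr": 0.0, # constant assigned to a class expression
}
```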
... The Taxon dataset is composed of four ontologies that describe the classification of species: AgronomicTaxon [25], AgroVoc [5], DBpedia [3] and TaxRef-LD [16]. The CQAs used in this evaluation are the ones presented in [31], which were manually written from AgronomicTaxon CQs [25]. ...
Article
Ontology matching aims at making ontologies interoperable. While the field has matured over the last years, most approaches are still limited to the generation of simple correspondences. More expressiveness is, however, required to better address the different kinds of ontology heterogeneities. This paper presents CANARD (Complex Alignment Need and A-box based Relation Discovery), an approach for generating expressive correspondences that relies on the notion of competency questions for alignment (CQAs). A CQA expresses the user's knowledge needs in terms of alignment and aims at reducing the alignment space. The approach takes as input a set of CQAs as SPARQL queries over the source ontology. The generation of correspondences is performed by matching the subgraph from the source CQA to the similar surroundings of the instances from the target ontology. Evaluation is carried out on both synthetic and real-world datasets, and the impact of several approach parameters is discussed. Experiments have shown that CANARD performs, overall, better on CQA coverage than on precision, and that using existing sameAs links between the instances of the source and target ontologies gives better results than exact matches of their labels. The use of CQAs also improved both CQA coverage and precision with respect to using automatically generated queries. The reassessment of counter-examples significantly increased precision, to the detriment of runtime. Finally, experiments on large datasets showed that CANARD is one of the few systems that can perform on large knowledge bases, but it depends on regularly populated knowledge bases and on the quality of instance links.
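To make the input format concrete, here is a minimal sketch of a CQA expressed as a SPARQL query over a source ontology, the form in which CANARD expects its input; the namespace and property names are illustrative, not taken from the paper.

```python
# Hypothetical CQA ("Which author writes which paper?") expressed as a
# SPARQL query over the source ontology o1, the input format CANARD expects.
# The namespace and property names are illustrative.
CQA_PAPER_AUTHORS = """
PREFIX o1: <http://example.org/conference/onto1#>
SELECT ?paper ?author WHERE {
    ?paper a o1:Paper .
    ?paper o1:writtenBy ?author .
}
"""
```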
... Since then, word representations have changed language modelling [53]. Follow-up work includes applications to automatic speech recognition and machine translation [54], [55], and a wide range of Natural Language Processing (NLP) tasks [56]-[62]. Word embeddings have been used in combination with ML, improving results in biomedical named entity recognition [63], capturing word analogies [64], extracting latent knowledge from scientific literature and moving towards a generalized approach to the process of mining scientific literature [65], etc. Word embeddings are vector space models (VSM) that represent words as real-valued vectors in a low-dimensional semantic space (much smaller than the vocabulary size). ...
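As a concrete illustration of such a low-dimensional VSM, a minimal sketch using gensim's Word2Vec on a toy corpus; the corpus and hyperparameters are illustrative only.

```python
# Minimal sketch: learning a low-dimensional vector space model of words
# with gensim's Word2Vec. The toy corpus and hyperparameters are illustrative.
from gensim.models import Word2Vec

corpus = [
    ["word", "embeddings", "represent", "words", "as", "vectors"],
    ["vectors", "live", "in", "a", "low", "dimensional", "space"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)
vec = model.wv["vectors"]                             # a 50-dimensional real-valued vector
neighbours = model.wv.most_similar("words", topn=3)   # nearest words in the space
```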
Thesis
Human knowledge about food and nutrition has evolved drastically with time. With food- and nutrition-related data being mass-produced and easily accessible, the next step is to use Artificial Intelligence (AI) to translate data into knowledge. The majority of AI research is model-driven, and classical Machine Learning (ML) pipelines follow the model-centric approach, prioritizing training the best model for a specific task and focusing on improving model parameters while overlooking the importance of data. We propose a novel ML pipeline that fuses data- and domain-driven knowledge for a predictive task from the food and nutrition domain: fast prediction of nutrient values from unstructured recipe text. Our proposed pipeline consists of three parts: representation learning (RL), unsupervised ML, and supervised ML. In the RL part, word and paragraph embeddings are learned from short text descriptions of foods (recipe titles); in the unsupervised ML part, the recipes are separated into clusters based on a domain-specific coding (the FoodEx2 classification) from an external domain resource; and in the supervised ML part, the two are combined: separate predictive models are trained for each cluster and each nutrient, using the learned embeddings as input features. The pipeline is evaluated with criteria defined using domain knowledge (nutrient tolerance levels) and compared to baselines calculated using the same criteria. As the evaluation results showed that including the domain knowledge in the unsupervised ML part improved the results compared to the baseline, we propose an alteration of the ML pipeline: we include two different external sources of domain knowledge for clustering in the unsupervised ML part, to explore the domain bias for the same prediction task. To further improve the ML pipeline, we include domain knowledge in the RL part. Instead of obtaining recipe title embeddings, we introduce a domain heuristic for merging embeddings of the ingredients of a recipe. This proved to be a successful way to train well-performing predictive models for nutrient values, as the accuracies obtained were significantly higher than the baseline. As the domain-specific embeddings proved to perform well, through data normalization using dictionary- and rule-based Named Entity Recognition and data mapping to a Food Composition Database from six heterogeneous multilingual recipe datasets, we composed two predefined corpora of embeddings: ingredient and recipe embeddings. Training embeddings tailored to a specific task is a very time-consuming process; therefore, these corpora of predefined embeddings can be used for research purposes as well as transferred to other tasks for application purposes. To explore the major impact data has on model performance, we focused on the generalization of predictive models by defining a generalizability index that indicates the trustworthiness of transferring a predictive model learned on one dataset to another. Going a step further to show the importance of data in predictive modeling, we show different ways of selecting a representative training dataset, and the results show how different selections of the training dataset produce different outcomes. The training data should be representative of the data expected in deployment, covering all variations that the deployment data will present.
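A minimal sketch of the supervised part of such a pipeline, training one regressor per cluster and per nutrient on precomputed embeddings; the choice of model and all names are placeholders, not the thesis's actual implementation.

```python
# Sketch of the per-cluster supervised step described above: one regression
# model per (cluster, nutrient) pair, with embeddings as features.
# Model choice and all names are placeholders, not the thesis's implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_per_cluster(embeddings, cluster_labels, nutrient_targets):
    """embeddings: (n, d) array; cluster_labels: length-n array from the
    domain coding; nutrient_targets: dict mapping nutrient -> (n,) array."""
    models = {}
    for cluster in np.unique(cluster_labels):
        mask = cluster_labels == cluster
        for nutrient, y in nutrient_targets.items():
            reg = RandomForestRegressor(n_estimators=100, random_state=0)
            reg.fit(embeddings[mask], y[mask])
            models[(cluster, nutrient)] = reg
    return models

def predict(models, cluster, nutrient, embedding):
    return models[(cluster, nutrient)].predict(embedding.reshape(1, -1))[0]
```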
...
Work | Approach | Level | Type of resource | Type of linking
(Caselli et al., 2014) | semantic similarity | monolingual | lexical | lexical
(Bennett and Fellbaum, 2006; Kwong, 1998) | formalisms | monolingual | lexical | ontological
(Niles and Pease, 2003) | formalisms | monolingual | lexical | ontological
(Sánchez-Rada and Iglesias, 2016) | formalisms | monolingual | ontological | structural
(Diosan et al., 2008) | machine learning | monolingual | lexical | lexical
(Subirats and Sato, 2004) | corpus | monolingual | lexical | lexical
(Bond and Foster, 2013) | formalisms | multilingual | lexical | structural
(Caracciolo et al., 2012; Cimiano et al., 2020b; Moussallem et al., 2018) | string similarity | multilingual | lexical | ontological
(Lesnikova, 2013; Lesnikova et al., 2016) | machine translation | multilingual | lexical | structural
(Gracia, 2015; | formalisms | multilingual | lexical | structural
 | machine learning | multilingual | ontological | structural
(Damova et al., 2013) | formalisms | multilingual | lexical/ontological | structural
 | formalisms | multilingual | knowledge graph | ontological
 | distributional semantics | multilingual | lexical | lexical/ontological
(Chen et al., 2016c) | graph neural networks | cross-lingual | knowledge graph | structural
(Schuster et al., 2019) | mapping of word spaces | cross-lingual | contextual embeddings | lexical
...
Preprint
The focus of this thesis is broadly on the alignment of lexicographical data, particularly dictionaries. In order to tackle some of the challenges in this field, two main tasks are addressed: word sense alignment and translation inference. The first task aims to find an optimal alignment given the sense definitions of a headword in two different monolingual dictionaries. This is a challenging task, especially due to differences in sense granularity, coverage and description across resources. After describing the characteristics of various lexical semantic resources, we introduce a benchmark containing 17 datasets of 15 languages, where monolingual word senses and definitions are manually annotated across different resources by experts. In the creation of the benchmark, lexicographers' knowledge is incorporated through the annotations, where a semantic relation, namely exact, narrower, broader, related or none, is selected for each sense pair. This benchmark can be used for evaluating word sense alignment systems. The performance of several alignment techniques based on textual and non-textual semantic similarity detection and semantic relation induction is evaluated using the benchmark. Finally, we extend this work to translation inference, where translation pairs are induced to generate bilingual lexicons in an unsupervised way using various approaches based on graph analysis. This task is of particular interest for the creation of lexicographical resources for less-resourced and under-represented languages, and it also assists in increasing the coverage of existing resources. From a practical point of view, the techniques and methods developed in this thesis are implemented within a tool that can facilitate the alignment task.
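As an illustration of the textual-similarity family of techniques mentioned above, a minimal sketch that scores candidate sense pairs by TF-IDF cosine similarity; this is a generic baseline, not the thesis's specific method, and the threshold is arbitrary.

```python
# Generic textual-similarity baseline for word sense alignment: score all
# definition pairs across two dictionaries and keep the best-scoring match
# per sense. Not the thesis's specific method; the threshold is arbitrary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_senses(defs_a, defs_b, threshold=0.3):
    vectorizer = TfidfVectorizer().fit(defs_a + defs_b)
    sims = cosine_similarity(vectorizer.transform(defs_a),
                             vectorizer.transform(defs_b))
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))  # candidate sense link
    return pairs
```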
... In the linked data cloud, it is linked with over twenty other agriculturally relevant data resources [89]. Data is further organized with SKOS (Simple Knowledge Organization System) and brought into the Semantic Web with OWL (Web Ontology Language) [90]. The web addresses of the databases are identified by URIs (Uniform Resource Identifiers) with corresponding URLs (Uniform Resource Locators). ...
... The standard query language SPARQL (SPARQL Protocol and RDF Query Language) offers the advantage of a single point of access. For logical reasons, it is recommended to make efforts to include non-RDF applications and models into today's editing [90]. ...
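As an illustration of that single point of access, a minimal sketch querying AGROVOC's public SPARQL endpoint from Python; the endpoint URL and the query are assumptions to verify before use.

```python
# Sketch: using one SPARQL access point to query a linked-data resource.
# The AGROVOC endpoint URL is an assumption; verify it before use.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://agrovoc.fao.org/sparql")
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
        ?concept skos:prefLabel ?label .
        FILTER (lang(?label) = "en" && CONTAINS(LCASE(STR(?label)), "wheat"))
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["concept"]["value"], "->", row["label"]["value"])
```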
Article
Digitization in agriculture is rapidly advancing. New technologies and solutions are being developed and invented that ease farmers' daily lives and help them and their partners gain knowledge about farming processes and environmental interrelations. This knowledge leads to better decisions and contributes to increased farm productivity, resource efficiency, and environmental health. Along with numerous advantages, some negative aspects and dependencies put the seamless workflow of agricultural production at risk. Therefore, this study presents the state of the art of digitization in agriculture and points out vulnerabilities in digitized farming processes. The most important are the lack of interoperability and the dependency on an internet connection. Hence, requirements are posed to address these vulnerabilities in future IT (information technology) systems, resulting in successive levels of resilience that cover the individual needs of farms, adjusted to their mobile and landline internet supply. These findings are incorporated into a conceptual framework for a highly digitized fictitious farm. Resilience is ensured by decentralized storage and computing capacities and internet-independent communication networks, including cooperation with machinery rings and contractors.
... The Food Ontology Knowledge Base [25] is a basic model of food nutritional information for basic food types from the Republic of Turkey's Ministry of Food database. The Food and Agriculture Organization of the United Nations provides AGROVOC [26], a linked terminology for expert researchers that covers over 35,000 concepts. The Food Product Ontology [27] extends the well-known GoodRelations ontology [28] to represent concepts for food products, their pricing, and the associated business entities. ...
Article
Background Fast food, with its abundance and availability to consumers, may have health consequences due to high calorie intake, a major contributor to life-threatening diseases. Providing nutritional information has some impact on consumer decisions to self-regulate and promote healthier diets, and thus government regulations have mandated the publishing of nutritional content to assist consumers, including for fast food. However, fast food nutritional information is fragmented, and we see a benefit in collating nutritional data to synthesize knowledge for individuals. Methods We developed the ontology of fast food facts as an opportunity to standardize knowledge of fast food and link nutritional data that could be analyzed and aggregated for the information needs of consumers and experts. The ontology is based on metadata from 21 fast food establishment nutritional resources and authored in OWL2 using Protégé. Results Three evaluators reviewed the logical structure of the ontology through natural language translation of the axioms. While there is majority agreement (76.1% pairwise agreement) on the veracity of the ontology, we identified 103 of the 430 statements as erroneous. We revised the ontology and publicly released its initial version. The ontology has 413 classes, 21 object properties, 13 data properties, and 494 logical axioms. Conclusion With the initial release of the ontology of fast food facts, we discuss some future visions for the continued evolution of this knowledge base and the challenges we plan to address, like the management and publication of voluminous amounts of semantically linked fast food nutritional data.
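For readers who want to inspect such an OWL2 artifact programmatically, a minimal sketch with owlready2; the file path is a placeholder for wherever the published ontology is obtained.

```python
# Sketch: loading an OWL2 ontology and inspecting its classes and axioms
# with owlready2. The file path is a placeholder.
from owlready2 import get_ontology

onto = get_ontology("file:///path/to/fastfoodfacts.owl").load()
print(len(list(onto.classes())))              # class count (413 reported above)
print(len(list(onto.object_properties())))    # object properties (21 reported)
for cls in list(onto.classes())[:5]:
    print(cls.name, "subclass of", cls.is_a)  # superclass axioms per class
```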
... The translation of a subgraph into a SPARQL query is the same for binary and unary CQAs. Therefore, the subgraph will be transformed into a SPARQL query and saved as the following DL formula: dom(o₂:Paper) ∘ o₂:writes⁻. ...
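To make the correspondence concrete, a sketch of a SPARQL query matching that DL formula, under one plausible reading of writes⁻ as an inverse role whose domain is restricted to o₂:Paper; the namespace is illustrative.

```python
# Sketch: a SPARQL form of the subgraph behind the DL formula above,
# reading o2:writes^- as an inverse role with its domain restricted to
# o2:Paper. The namespace is illustrative.
SUBGRAPH_QUERY = """
PREFIX o2: <http://example.org/conference/onto2#>
SELECT DISTINCT ?x ?y WHERE {
    ?x a o2:Paper .
    ?y o2:writes ?x .   # writes^- pairs each paper ?x with its author ?y
}
"""
```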
... This evaluation measured the impact of various parameters on the approach. Second, a manual evaluation was carried out on the Taxon dataset about plant taxonomy, composed of four large populated ontologies: AgronomicTaxon [23], AgroVoc [4], DBpedia [2] and TaxRef-LD [13]. Six CQAs from AgronomicTaxon were manually generated. ...
Chapter
Ontology matching aims at making different ontologies interoperable. While most approaches have addressed the generation of simple correspondences, more expressiveness is required to better address the different kinds of ontology heterogeneities. This paper presents an approach for generating complex correspondences that relies on the notion of competency questions for alignment (CQA). A CQA expresses the user's knowledge needs in terms of alignment and aims at reducing the alignment scope. The approach takes as input a set of CQAs as SPARQL queries over the source ontology. The generation of correspondences is performed by matching the subgraph from the source CQA to the lexically similar surroundings of the instances from the target ontology. Evaluation of the approach has been carried out on both synthetically generated and real-world datasets.
... Since then, word representations have changed language modelling [18]. Follow-up work includes applications to automatic speech recognition and machine translation [19,20], and a wide range of Natural Language Processing (NLP) tasks [21][22][23][24][25][26][27]. Word embeddings have been used in combination with machine learning, improving results in biomedical named entity recognition [28], capturing word analogies [29], extracting latent knowledge from scientific literature and moving towards a generalized approach to the process of mining scientific literature [30], etc. ...
Article
Assessing nutritional content is very relevant for patients suffering from various diseases and for professional athletes, and for health reasons it is becoming part of everyday life for many. However, it is a very challenging task, as it requires complete and reliable sources. We introduce a machine learning pipeline for predicting macronutrient values of foods using learned vector representations from short text descriptions of food products. On a dataset from health specialists containing short descriptions of foods and macronutrient values, we generate paragraph embeddings, introduce clustering into food groups using graph-based vector representations that include food domain knowledge, and train regression models for each cluster. The predictions are for four macronutrients: carbohydrates, fat, protein and water. The highest accuracy was obtained for carbohydrate predictions – 86%, compared to the baselines – 27% and 36%. The protein predictions yielded the best results across all clusters: 53%–77% of the values fall within the tolerance-level range. These results were obtained using short descriptions; the embeddings can be improved if they are learned on longer descriptions, which would lead to better prediction results. Since the task of calculating macronutrients requires exact quantities of ingredients, these results obtained only from short descriptions are a huge leap forward.
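The paragraph-embedding step mentioned above can be sketched with gensim's Doc2Vec; the toy descriptions and hyperparameters are illustrative, not those of the paper.

```python
# Sketch: paragraph embeddings from short food descriptions via gensim's
# Doc2Vec, usable as regressor features. Toy data and settings only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

descriptions = ["whole wheat bread", "grilled chicken breast", "apple juice"]
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(descriptions)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
features = model.infer_vector("rye bread".split())  # embedding for a new description
```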
... Some of them are developed specifically for the food domain (e.g. FoodWiki [6], AGROVOC [5], Open Food Facts [4], Food Product Ontology [18], FOODS [30], and FoodOn [15]), while others are related to the biomedical domain but also include links to food and environmental concepts (e.g. SNOMED CT [10], MeSH [9], SNMI [32], and NDDF [8]). ...
Chapter
Nowadays, several available biomedical vocabularies and standards play a crucial role in understanding health information. While there is a large number of available resources in the biomedical domain, only a limited number of resources can be utilized in the food domain. There are only a few annotated corpora with food concepts, as well as a small number of rule-based food named-entity recognition systems for food concept extraction. Additionally, several food ontologies exist, each developed for a specific application scenario. To address the issue of ontology alignment, we previously created a resource, named FoodOntoMap, that consists of food concepts extracted from recipes. The extracted concepts were annotated with semantic tags from four different food ontologies. To make the resource more comprehensive, as well as more representative of the domain, in this paper we extend it with a second version, appropriately named FoodOntoMapV2, by including an additional four ontologies that contain food concepts. Moreover, this resource can be used for normalizing food concepts across ontologies and for developing applications for understanding the relation between food systems, human health, and the environment.
... The Linked Data approach has been widely used by different governments to increase data re-usability, as it allows data from different disciplines to be interconnected (Zaveri et al., 2013; Gür et al., 2012; Oren et al., 2008; Caracciolo et al., 2012). Linked statistical datasets can be applied in various domains for different purposes. ...
Article
Governments are publishing enormous amounts of open data on the web every day in an effort to increase transparency and reusability. Linking data from multiple sources on the web enables advanced data analytics, which can lead to the development of valuable services and data products. However, Canada's open government data portals are isolated from one another and remain unlinked to other resources on the web. In this paper, we first expose the statistical data sets in Canadian provincial open data portals as Linked Data, and then integrate them using the RDF Cube vocabulary, thereby making different open data portals available through a single search endpoint. We leverage Semantic Web technologies to publish open data sets taken from two provincial portals (Nova Scotia and Alberta) as RDF (the Linked Data format), and to connect them to one another. The success of our approach illustrates its high potential for linking open government data sets across Canada, which will in turn enable greater data accessibility and improved search results.
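To make the integration step concrete, a minimal sketch that publishes one statistical observation with the RDF Data Cube vocabulary via rdflib; the dataset URI, dimension, measure, and value are placeholders, not the paper's actual data.

```python
# Sketch: one statistical observation in the RDF Data Cube vocabulary,
# built with rdflib. URIs, properties, and the value are placeholders.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/opendata/")

g = Graph()
obs = URIRef(EX["obs/ns-population-2020"])
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/population"]))
g.add((obs, EX.refArea, Literal("Nova Scotia")))   # dimension (placeholder)
g.add((obs, EX.population, Literal(100000)))       # measure (dummy value)
print(g.serialize(format="turtle"))
```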
... Since then, this idea has been successfully applied to statistical language modeling [2]. Follow-up work includes applications to automatic speech recognition and machine translation [6], [7], and a wide range of NLP tasks [8]-[14]. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. ...