Marco A. Casanova, Ph.D.
Pontifical Catholic University of Rio de Janeiro · Department of Informatics (INF)
About
466 Publications · 93,050 Reads
4,048 Citations
Introduction
Marco A. Casanova is a Full Professor at the Department of Informatics of PUC-Rio. He obtained a Ph.D. in Applied Mathematics from Harvard University in 1979. His research interests concentrate on database conceptual modeling and the construction of database management systems. He has written 7 books, 48 journal articles, and over 200 conference papers, and has advised 15 Ph.D. theses and 53 M.Sc. dissertations. In July 2012, he received the Scientific Merit Award from the Brazilian Computer Society.
Publications (466)
The Text-to-SQL task involves generating SQL queries based on a given relational database and a Natural Language (NL) question. Although Large Language Models (LLMs) show good performance on well-known benchmarks, these benchmarks evaluate them on databases with relatively simple schemas. This dissertation first evaluates their effectiveness on a complex and openly avai...
An Enterprise Knowledge Graph (EKG) is a robust foundation for knowledge management, data integration, and advanced analytics across organizations. It achieves this by offering a semantic view that integrates the various data sources within an organization’s data lake. This paper introduces a novel data design pattern (DDP) aimed at constr...
Oil and gas industry applications often require querying data of various types and integrating the query results. Data range from structured tables stored in databases to documents and images organized in digital libraries. The users typically have technical training but are not necessarily versed in Information Technology, meaning the data process...
This paper proposes the use of narrative patterns as an effective guide to preserve thematic consistency in the composition of stories using Large Language Models (LLMs). Our approach drew inspiration from a well-accepted, thorough, and overarching classification of folklore types and the deservedly famous Monomyth characterization of heroic quest...
A method for generating narratives by analyzing single images or image sequences is presented, inspired by the age-old tradition of Narrative Art. The proposed method explores the multimodal capabilities of GPT-4o to interpret visual content and create engaging stories, which are illustrated by a Stable Diffusion XL model. The method is sup...
This paper first presents DANKE, a data and knowledge management platform that allows users to submit keyword queries to a centralized database. DANKE uses a knowledge graph to provide a semantic view of the centralized database in a vocabulary familiar to the users. The paper then describes DANKE-U, a specialized module that enables DANKE to handl...
This article presents a novel and highly interactive process to generate natural language narratives based on our ongoing work on semiotic relations, providing four criteria for composing new narratives from existing stories. The wide applicability of this semiotic reconstruction process is suggested by a reputed literary scholar’s deconstructive c...
The field of Personal Knowledge Management (PKM) has seen a surge in popularity in recent years. Interestingly, Natural Language Processing (NLP) and Large Language Models are also becoming mainstream, but PKM has not seen much integration with NLP. With this motivation, this article first introduces a methodology to automatically interconnect isol...
Text-to-SQL refers to the task defined as “given a relational database D and a natural language sentence S that describes a question on D, generate an SQL query Q over D that expresses S”. Numerous tools have addressed this task with relative success over well-known benchmarks. Recently, several LLM-based text-to-SQL tools, that is, text-to-SQL too...
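As a rough illustration of this task setup (a sketch only, not any particular tool from these papers), the following Python fragment pairs a schema D with a question S and delegates compilation to an LLM. The llm_complete callable is a hypothetical stand-in for any LLM client; all names are invented for the example.

# Build a zero-shot prompt pairing the schema D with the NL question S,
# then ask an LLM (any client, passed in as `llm_complete`) for the SQL.
def build_text_to_sql_prompt(schema_ddl: str, question: str) -> str:
    return (
        "Given the following relational database schema:\n"
        f"{schema_ddl}\n\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )

def text_to_sql(schema_ddl: str, question: str, llm_complete) -> str:
    """Compile the NL sentence S into an SQL query Q over D via the LLM."""
    return llm_complete(build_text_to_sql_prompt(schema_ddl, question)).strip()

# Example usage with a toy schema and a dummy "LLM":
ddl = "CREATE TABLE employee (id INT, name TEXT, salary REAL);"
fake_llm = lambda prompt: "SELECT name FROM employee WHERE salary > 100000;"
print(text_to_sql(ddl, "Which employees earn more than 100k?", fake_llm))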
The leaderboards of familiar benchmarks indicate that the best text-to-SQL tools are based on Large Language Models (LLMs). However, when applied to real-world databases, the performance of LLM-based text-to-SQL tools is significantly lower than that reported for these benchmarks. A closer analysis reveals that one of the problems lies in that the r...
This poster paper proposes a family of Natural Language (NL) interfaces for databases (NLIDBs) that use ChatGPT and LangChain features to compile NL sentences expressing database questions into SQL queries or to extract keywords from NL sentences, which are passed to a database keyword search tool. The use of ChatGPT reduces dealing with NL questio...
In this paper we introduce a novel, highly interactive process to generate natural language narratives on the basis of our ongoing work on semiotic relations. To the two basic components of interactive systems, namely a software tool and a user interface, we add a third component: AI agents, understood as an upgraded rendition of software agents. Ou...
This paper addresses the access control problem in the context of database keyword search, when a user defines a query by a list of keywords, and not by SQL (or SPARQL) code. It describes the solutions implemented in DANKE, a database keyword search platform currently used in several industrial applications. DANKE offers two alternatives for managi...
Assuming that the term 'metaverse' could be understood as a computer-based implementation of multiverse applications, in the present work we look for a logic powerful enough to handle the situations arising both in the real and in the fictional underlying application domains. Realizing that first-order logic fails to accoun...
Recently, the topic of Personal Knowledge Management (PKM) has seen a surge in popularity. This is illustrated by the accelerated growth of apps such as Notion, Obsidian, and Roam Research, as well as the appearance of books like “How to Take Smart Notes” and “Building a Second Brain.” However, the area of PKM has not seen much integration with Nat...
A knowledge base, expressed using the Resource Description Framework (RDF), can be viewed as a graph whose nodes represent entities and whose edges denote relationships. The entity relatedness problem refers to the problem of discovering and understanding how two entities are related, directly or indirectly, that is, how they are connected by paths...
In this paper we propose a new plot composition method based on situation calculus and Petri net models, which are applied, in a complementary fashion, to a narrative open to user co-authorship. The method starts with the specification of situation calculus schemas, which allow a planning algorithm to check if the specification covers the desired c...
The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This article addresses this problem by combining distributed RDF path search and ranking strategies in a framework called DCoEPinKB, which helps reduce the overall execution tim...
Purpose: Enterprise knowledge graphs (EKG) in resource description framework (RDF) consolidate and semantically integrate heterogeneous data sources into a comprehensive dataspace. However, to make an external relational data source accessible through an EKG, an RDF view of the underlying relational database, called an RDB2RDF view, must be created....
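To make the RDB2RDF idea concrete, here is a minimal Python sketch that materializes such a view by mapping each row of a relational table to RDF triples in N-Triples syntax. The table, namespace, and mapping convention are illustrative assumptions, not the paper's actual mapping language.

# Materialize an RDB2RDF view: each row of `table` becomes a set of triples
# <subject> <predicate> "object", with the primary key forming the subject IRI.
import sqlite3

EX = "http://example.org/"  # hypothetical namespace

def rdb2rdf_view(conn: sqlite3.Connection, table: str, key: str) -> list:
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    triples = []
    for row in cur:
        subj = f"<{EX}{table}/{row[cols.index(key)]}>"
        for col, val in zip(cols, row):
            if col != key and val is not None:
                triples.append(f'{subj} <{EX}{col}> "{val}" .')
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE well (id INTEGER, name TEXT, basin TEXT)")
conn.execute("INSERT INTO well VALUES (1, 'W-001', 'Campos')")
print("\n".join(rdb2rdf_view(conn, "well", "id")))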
The situation calculus logic model is convenient for modelling the actions that can occur in an information system application. The interplay of pre-conditions and post-conditions determines a semantically justified partial order of the defined actions and serves to enforce integrity constraints. This form of specification allows the use of plan-ge...
Keyword search systems provide users with a friendly alternative to access Resource Description Framework (RDF) datasets. Evaluating such systems requires adequate benchmarks, consisting of RDF datasets, keyword queries, and correct answers. However, available benchmarks often have small sets of queries and incomplete sets of answers, mainly becaus...
Keyword search is typically associated with information retrieval systems. However, recently, keyword search has been expanded to relational databases and RDF datasets, as an attractive alternative to traditional database access. This paper introduces DANKE, a platform for keyword search over databases, and discusses how third-party applications ca...
The answer to a query, submitted to a database or a knowledge base, is often long and may contain redundant data. The user is frequently forced to browse through a long answer or to refine and repeat the query until the answer reaches a manageable size. Without proper treatment, consuming the answer may indeed become a tedious task. This article then...
A Natural Language Interface to Database (NLIDB) refers to a database interface that translates a question asked in natural language into a structured query. Aggregation questions express aggregation functions, such as count, sum, average, minimum and maximum, and optionally a group by clause and a having clause. NLIDBs deliver good results for sta...
The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This question can be addressed by implementing a path search strategy, which combines an entity similarity measure with an expansion limit to reduce the path search space and...
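A minimal sketch of such a path search, assuming a toy graph represented as an adjacency dictionary: breadth-first search with a hop bound and an expansion limit per node. In the papers the expansion cut is guided by an entity similarity measure; here the neighbor list is simply truncated, and the graph and entities are invented.

# Enumerate relationship paths between two entities, bounding both the path
# length (max_hops) and how many neighbors are expanded per node.
from collections import deque

def relationship_paths(graph, source, target, max_hops=3, expansion_limit=5):
    paths, queue = [], deque([[source]])
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue
        # Expansion limit: a stand-in for the similarity-guided pruning.
        for nxt in graph.get(path[-1], [])[:expansion_limit]:
            if nxt not in path:  # avoid cycles
                queue.append(path + [nxt])
    return paths

g = {"Ada": ["Babbage", "London"], "Babbage": ["Analytical_Engine"],
     "London": ["Analytical_Engine"]}
print(relationship_paths(g, "Ada", "Analytical_Engine"))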
This article introduces an algorithm to automatically translate a user-specified keyword-based query K to a SPARQL query Q so that the answers Q returns are also answers for K. The algorithm does not rely on an RDF schema, but it synthesizes SPARQL queries by exploring the similarity between the property domains and ranges, and the class instance s...
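The following sketch conveys the flavor of the translation under the strong simplifying assumption that keywords match class and property labels exactly; the label tables and query shape are hypothetical, whereas the article's algorithm synthesizes queries from the similarity between property domains, ranges, and class instances.

# Map each keyword to a known class or property label, then synthesize a
# SPARQL query whose answers cover the keyword query.
CLASS_LABELS = {"person": "<http://example.org/Person>"}
PROP_LABELS = {"birthplace": "<http://example.org/birthPlace>"}

def keywords_to_sparql(keywords):
    classes = [CLASS_LABELS[k] for k in keywords if k in CLASS_LABELS]
    props = [PROP_LABELS[k] for k in keywords if k in PROP_LABELS]
    where = []
    if classes:
        where.append(f"?s a {classes[0]} .")
    for i, p in enumerate(props):
        where.append(f"?s {p} ?o{i} .")
    return "SELECT * WHERE { " + " ".join(where) + " }"

print(keywords_to_sparql(["person", "birthplace"]))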
Surveys are pervasive in the modern world, with uses ranging from customer satisfaction measurement to the tracking of global economic trends. Data collection is at the core of survey processes and is usually computer-aided. The development of data collection software involves the codification of questionnaires, which vary from simple...
This chapter first defines a set of operations that create new ontologies, including their constraints, out of other ontologies. The projection, union, and deprecation operations help define new ontologies by reusing fragments of other ontologies, the intersection operation constructs the constraints that hold in two ontologies, and the difference...
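As a set-theoretic approximation of these operations (the chapter's definitions are theory-based and richer), one can model an ontology as a pair of a vocabulary and a set of constraints; the triple encoding of constraints below is an assumption made only for illustration.

# An ontology as (vocabulary, constraints); constraints are (term, rel, term).
from dataclasses import dataclass

@dataclass(frozen=True)
class Ontology:
    vocabulary: frozenset
    constraints: frozenset

def union(o1, o2):
    return Ontology(o1.vocabulary | o2.vocabulary,
                    o1.constraints | o2.constraints)

def intersection(o1, o2):
    # Crude reading of "constraints that hold in both ontologies".
    return Ontology(o1.vocabulary & o2.vocabulary,
                    o1.constraints & o2.constraints)

def projection(o, terms):
    # Keep only constraints whose terms fall inside the selected fragment.
    terms = frozenset(terms)
    keep = frozenset(c for c in o.constraints if {c[0], c[2]} <= terms)
    return Ontology(o.vocabulary & terms, keep)

o1 = Ontology(frozenset({"Car", "Vehicle"}),
              frozenset({("Car", "subClassOf", "Vehicle")}))
o2 = Ontology(frozenset({"Vehicle", "Boat"}),
              frozenset({("Boat", "subClassOf", "Vehicle")}))
print(union(o1, o2).constraints)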
Natural Language Interface to Databases (NLIDB) systems usually do not deal with aggregations, which can be of two types: aggregation functions (such as count, sum, average, minimum, and maximum) and grouping functions (GROUP BY). This paper addresses the creation of a generic module, to be used in NLIDB systems, that allows such systems to perform...
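A minimal sketch of the kind of mapping such an aggregation module performs, with a hypothetical cue table and query shape (real systems parse the question far more carefully):

# Translate NL aggregation cues into SQL aggregate functions plus GROUP BY.
AGG_CUES = {"how many": "COUNT", "total": "SUM", "average": "AVG",
            "smallest": "MIN", "largest": "MAX"}

def aggregate_query(question, column, table, group_by=None):
    func = next((f for cue, f in AGG_CUES.items()
                 if cue in question.lower()), None)
    if func is None:
        return f"SELECT {column} FROM {table}"
    if group_by:
        return (f"SELECT {group_by}, {func}({column}) FROM {table} "
                f"GROUP BY {group_by}")
    return f"SELECT {func}({column}) FROM {table}"

print(aggregate_query("What is the average salary per department?",
                      "salary", "employee", group_by="department"))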
Keyword search is typically associated with information retrieval systems. However, recently, keyword search has been expanded to relational databases and RDF datasets, as an attractive alternative to traditional database access. With this motivation, this paper first introduces a platform for data and knowledge retrieval, called DANKE, concentrati...
This paper proposes a process that modifies the presentation of a query answer to improve the quality of the user’s experience. The process is particularly useful when the answer is long and repetitive. The process reorganizes the original query answer by applying heuristics to summarize the results and to select template questions that create a us...
Cloud computing is a general term that involves delivering hosted services over the Internet. With the accelerated growth of the volume of data used by applications, many organizations have moved their data into cloud servers to provide scalable, reliable and highly available services. A particularly challenging issue that arises in the context of...
Stop-and-move semantic trajectories are segmented trajectories where the stops and moves are semantically enriched with additional data. A query language for semantic trajectory datasets has to include selectors for stops or moves based on their enrichments and sequence expressions that define how to match the results of selectors with the sequence...
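To illustrate, here is a toy evaluator for a sequence expression over stop and move segments; the enrichment fields and pattern encoding are invented for the example and are not the paper's actual query syntax.

# A semantic trajectory as a sequence of (kind, enrichment) segments, and a
# pattern as a same-length sequence of (kind, predicate) selectors that must
# match the whole trajectory in order.
trajectory = [("stop", {"place": "home"}), ("move", {"mode": "bus"}),
              ("stop", {"place": "work"})]

def matches(trajectory, pattern):
    if len(pattern) != len(trajectory):
        return False
    return all(kind == seg_kind and pred(attrs)
               for (kind, pred), (seg_kind, attrs) in zip(pattern, trajectory))

# "a stop at home, then any move, then a stop at work"
pattern = [("stop", lambda a: a["place"] == "home"),
           ("move", lambda a: True),
           ("stop", lambda a: a["place"] == "work")]
print(matches(trajectory, pattern))  # True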
The proliferation of shared multimedia narratives on the Internet is due to three main factors: increasing number of narrative producers, availability of narrative-sharing services, and increasing popularization of mobile devices that allow recording, editing, and sharing narratives. These factors characterize the emergence of an environment we cal...
This article presents an in-depth analysis and comparison of two computer science degree offerings. The analysis is based on the student transcripts collected from the academic systems of both institutions over roughly one decade. The article starts with a description of the degrees and global statistics of the student population considered. Th...
This extended abstract first introduces the problem of keyword search over RDF datasets. Then, it expands the discussion to cover the question of serendipitous search as a strategy to diversify answers. Finally, it briefly presents the entity relatedness problem, which refers to the problem of exploring an RDF dataset to discover and understand how...
A key contributor to the success of keyword search systems is a ranking mechanism that considers the importance of the retrieved documents. The notion of importance in graphs is typically computed using centrality measures that highly depend on the degree of the nodes, such as PageRank. However, in RDF graphs, the notion of importance is not necess...
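For reference, a dependency-free sketch of the degree-driven PageRank baseline mentioned above, computed by plain power iteration over a small directed graph (invented for the example):

# PageRank by power iteration: each node shares a damped fraction of its
# rank equally among its out-neighbors on every iteration.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
        rank = new
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(g))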
For several applications, an integrated view of linked data, denoted linked data mashup, is a critical requirement. Nonetheless, the quality of linked data mashups highly depends on the quality of the data sources. In this sense, it is essential to analyze data source quality and to make this information explicit to consumers of such data. This pap...
The world-wide drive for academic excellence is placing new requirements on educational data analysis, triggering the need to find less-trivial educational patterns in non-identically distributed data with noise, missing values and non-constant relations. Biclustering, the discovery of a subset of objects (whether students, teachers, researchers, c...
Identifying and monitoring students who are likely to dropout is a vital issue for universities. Early detection allows institutions to intervene, addressing problems and retaining students. Prior research into the early detection of at-risk students has opted for the use of predictive models, but a comprehensive assessment of the suitability of di...
This article presents a novel approach to estimate semantic entity similarity using entity features available as Linked Data. The key idea is to exploit ranked lists of features, extracted from Linked Data sources, as a representation of the entities to be compared. The similarity between two entities is then estimated by comparing their ranked lis...
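A minimal sketch of the ranked-list comparison idea, using average overlap at increasing depths (a simplified relative of rank-biased overlap; the article's actual measure may differ), with feature lists invented for the example:

# Compare two entities via their ranked feature lists: average the overlap
# of the top-d prefixes for every depth d up to the shorter list's length.
def ranked_list_similarity(list_a, list_b):
    depth = min(len(list_a), len(list_b))
    overlaps = []
    for d in range(1, depth + 1):
        common = len(set(list_a[:d]) & set(list_b[:d]))
        overlaps.append(common / d)
    return sum(overlaps) / depth if depth else 0.0

berlin = ["Germany", "capital", "Spree", "museum"]
hamburg = ["Germany", "port", "Elbe", "museum"]
print(round(ranked_list_similarity(berlin, hamburg), 3))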
In the last decade, RDF emerged as a new kind of standardized data model, and a sizable body of knowledge from fields such as Information Retrieval was adapted to RDF graphs. One common task in graph databases is to define an importance score for nodes based on centrality measures, such as PageRank and HITS. The majority of the strategies highly de...
This paper argues that certain ontology design problems are profitably addressed by treating ontologies as theories and by defining a set of operations that create new ontologies, including their constraints, out of other ontologies. The paper first shows how to use the operations in the context of ontology reuse, how to take advantage of the opera...
Currently available datasets still have a large unexplored potential for interlinking. Ranking techniques contribute to this task by scoring datasets according to the likelihood of finding entities related to those of a target dataset. Ranked datasets can be either manually selected for standalone linking discovery tasks or automatically inspected...
This article defines, implements, and evaluates techniques to automatically compare and recommend conferences. The techniques for comparing conferences use familiar similarity measures and a new measure based on co-authorship communities, called co-authorship network community similarity index. The experiments reported in the article indicate that...
This paper describes an algorithm to perform keyword search over federated RDF datasets. The algorithm compiles keyword-based queries into federated SPARQL queries, without user intervention, under the assumption that the RDF datasets and the federation have a schema. The compilation process is explained in detail, including how to synthesize exter...
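The compilation target can be pictured as a SPARQL 1.1 query with SERVICE clauses. The sketch below assembles one from hypothetical endpoint/pattern pairs, whereas the paper's algorithm derives them automatically from the keywords and the federation schema.

# Assemble a federated SPARQL query that routes each triple pattern to its
# endpoint via a SERVICE clause.
def federated_query(fragments):
    """fragments: list of (endpoint_url, triple_pattern) pairs."""
    services = "\n".join(
        f"  SERVICE <{url}> {{ {pattern} }}" for url, pattern in fragments)
    return "SELECT * WHERE {\n" + services + "\n}"

print(federated_query([
    ("http://example.org/sparql/people", "?p a <http://example.org/Person> ."),
    ("http://example.org/sparql/places", "?p <http://example.org/bornIn> ?c ."),
]))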
This demo presents a framework for the live synchronization of an RDF view defined on top of a relational database. In the proposed framework, rules are responsible for computing and publishing the changeset required for the RDB-RDF view to stay synchronized with the relational database. The computed changesets are then used for the incremental maint...
Collecting huge volumes of trajectories opens up new opportunities to capture time-varying and uncertain travel costs of traversing segments on a network. This kind of analysis is typically conducted offline, by means of data mining on historical data. However, there is a need to deal with the incremental nature of spatio-temporal data and...
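A minimal sketch of the incremental flavor argued for here: per-segment travel costs kept as running means, updated as each traversal arrives rather than re-mined offline. The class and field names are illustrative assumptions.

# Maintain a running mean travel time per network segment, updated online
# with each newly observed traversal (incremental mean, no history replay).
from collections import defaultdict

class SegmentCosts:
    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, segment, travel_time):
        self.count[segment] += 1
        self.mean[segment] += ((travel_time - self.mean[segment])
                               / self.count[segment])

costs = SegmentCosts()
for t in (120, 90, 150):      # three observed traversals of segment "s1"
    costs.update("s1", t)
print(costs.mean["s1"])       # 120.0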
A knowledge base stores descriptions of entities and their relationships, often in the form of a very large RDF graph, such as DBpedia or Wikidata. The entity relatedness problem refers to the question of computing the relationship paths that better capture the connectivity between a given entity pair. This paper describes a dataset created to supp...
This paper first argues that ontology design may benefit from treating ontologies as theories and from the definition of a set of operations that map ontologies into ontologies, especially their constraints. The paper then defines the class of ontologies used and proposes four operations to manipulate them. It proceeds to discuss how the operation...