Mário J. Silva

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa · Information and Decision Support Systems

PhD

About

207
Publications
36,212
Reads
3,193
Citations
Additional affiliations
July 2011 - present
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Position
  • Senior Researcher
September 1996 - June 2011
University of Lisbon
Position
  • Professor (Full)
August 1990 - December 1994
University of California, Berkeley
Position
  • Research Assistant

Publications (207)
Chapter
This work presents an analysis of Brazilian political discourse from speeches and social media posts, focusing on the ability to transfer learned models’ knowledge between different contexts. The analysis is conducted through PoliS, a new resource containing two datasets of political discussions labeled for party and ideological leaning from congre...
Chapter
Most misinformation corpora are composed of explicitly real and fake news content. This results from the idea that misinformation can be approached as a binary classification problem. However, such an approach oversimplifies the diversity of properties usually associated with credibility of different textual genres and types. To address this problem,...
Article
Full-text available
EuroVoc is a thesaurus maintained by the European Union Publication Office, used to describe and index legislative documents. The EuroVoc concepts are organized following a hierarchical structure, with 21 domains, 127 micro-thesauri terms, and more than 6,700 detailed descriptors. The large number of concepts in the EuroVoc thesaurus makes the manu...
Preprint
Full-text available
This paper presents and characterizes MIND, a new Portuguese corpus comprised of different types of articles collected from online mainstream and alternative media sources, over a 10-month period. The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories. Throughout this paper,...
Article
Full-text available
This review discusses the dynamic mechanisms of misinformation creation and spreading used in social networks. It includes: (1) a conceptualization of misinformation and related terms, such as rumors and disinformation; (2) an analysis of the cognitive vulnerabilities that hinder the correction of the effects of an inaccurate narrative already assi...
Preprint
Full-text available
Phenotypic Disease Networks (PDNs), which capture comorbidities, can be used to study phenotypic differences between demographics. Networks are a useful way to understand disease progression as they tend to occur in the vicinity of current conditions. Additionally, the local and global structure of these networks, in combination with centrality me...
Preprint
Full-text available
The development of explainable news credibility prediction models is critical both for fighting the viral propagation of misinformation and improving media literacy. This work investigates a variety of content indicators approaching different semantic and discourse dimensions, such as title representativeness, reasoning errors, and sentiment intens...
Chapter
We propose a method for focused crawling of linked data with a frontier based on the semantic data elements in use within a knowledge domain. This method addresses the challenges of crawling large volumes of heterogeneous linked data, aiming to achieve improvements in crawling efficiency and accuracy. We present the results obtained by our method i...
Preprint
Understanding COVID-19 and its risk factors in the Portuguese population is critical to combat this condition. To study the impact of multimorbidity in the population with COVID-19 infection, we performed a descriptive analysis of a dataset extracted from all reported confirmed cases of COVID-19 in Portugal until April 28, 2020. We observ...
Chapter
Current approaches to the computational modelling of irony mostly address verbal irony and sarcasm, neglecting other productive types of irony, namely situational irony. The function of situational irony is to lay emphasis on (real or fictional) events that evoke peculiar and unexpected images, which usually create a comical effect on the audience....
Book
This two-volume set LNCS 12035 and 12036 constitutes the refereed proceedings of the 42nd European Conference on IR Research, ECIR 2020, held in Lisbon, Portugal, in April 2020. The 55 full papers presented together with 8 reproducibility papers, 46 short papers, 10 demonstration papers, 12 invited CLEF papers, 7 doctoral consortium papers, 4 works...
Book
This two-volume set LNCS 12035 and 12036 constitutes the refereed proceedings of the 42nd European Conference on IR Research, ECIR 2020, held in Lisbon, Portugal, in April 2020. The 55 full papers presented together with 8 reproducibility papers, 46 short papers, 10 demonstration papers, 12 invited CLEF papers, 7 doctoral consortium papers, 4 works...
Chapter
Non-invasive medical imaging techniques, such as radiography or computed tomography, are extensively used in hospitals and clinics for the diagnosis of diverse injuries or diseases. However, the interpretation of these images, which often results in a free-text radiology report and/or a classification, requires specialized medical professionals, le...
Chapter
EuroVoc is a thesaurus maintained by the European Union Publication Office, used to describe and index legislative documents. The EuroVoc concepts are organized following a hierarchical structure, with 21 domains, 127 micro-thesauri terms, and more than 6,700 detailed descriptors. The large number of concepts in the EuroVoc thesaurus makes the manu...
Preprint
Full-text available
This paper describes ongoing work on the creation of a multilingual rumour dataset on football transfer news, FTR-18. Transfer rumours are continuously published by sports media. They can both harm the image of a player or a club or increase the player's market value. The proposed dataset includes transfer articles written in English, Spanish and Por...
Article
We address the assignment of ICD-10 codes for causes of death by analyzing free-text descriptions in death certificates, together with the associated autopsy reports and clinical bulletins, from the Portuguese Ministry of Health. We leverage a deep neural network that combines word embeddings, recurrent units, and neural attention, for the generati...
Article
Full-text available
Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2...
Conference Paper
The assignment of disease codes to clinical texts has a wide range of applications, including epidemiological studies or disease surveillance. We address the task of automatically assigning the ICD-10 codes for the underlying cause of death, from the free-text descriptions included in death certificates obtained from the Portuguese Ministry of Heal...
Article
Mental illnesses adversely affect a significant proportion of the population worldwide. However, the methods traditionally used for estimating and characterizing the prevalence of mental health conditions are time-consuming and expensive. Consequently, best-available estimates concerning the prevalence of mental health conditions are often years ou...
Article
Full-text available
Recent approaches for sentiment lexicon induction have capitalized on pre-trained word embeddings that capture latent semantic properties. However, embeddings obtained by optimizing performance of a given task (e.g. predicting contextual words) are sub-optimal for other applications. In this paper, we address this problem by exploiting task-specifi...
Conference Paper
Full-text available
We introduce a deep neural network for automated sarcasm detection. Recent work has emphasized the need for models to capitalize on contextual features, beyond lexical and syntactic cues present in utterances. For example, different speakers will tend to employ sarcasm regarding different subjects and, thus, sarcasm detection models ought to encode...
Article
The authors introduce a group-based discretionary access control with decentralized permission and group management for scientific repositories. Currently, access control approaches for repositories have inflexible centralized administrations, which do not scale well to large numbers of users. Moreover, discretionary access control is a legal stand...
Article
We address the publication of a large academic information dataset while ensuring privacy. We evaluate anonymization techniques achieving the intended protection, while retaining the utility of the anonymized data. The published data can help to infer behaviors and study interaction patterns in an academic population. These could subsequently be us...
Article
Full-text available
The automatic content analysis of mass media in the social sciences has become necessary and possible with the rise of social media and computational power. One particularly promising avenue of research concerns the use of opinion mining. We design and implement the POPmine system which is able to collect texts from web-based conventional media (n...
Conference Paper
Full-text available
Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed relationships while limiting the semantic drift. We research bootstrapping for relationship extraction using word embeddings to find similar relationships. Experimental results show that relying on word embeddings achieves a bette...
Article
This paper describes the main characteristics of SentiLex-PT, a sentiment lexicon designed for the extraction of sentiment and opinion about human entities in Portuguese texts. The potential of this resource is illustrated on its application to two types of corpora, the SentiCorpus-PT, a social media corpus, consisting of user comments to news arti...
Article
Full-text available
We investigate a technique to adapt unsupervised word embeddings to specific applications, when only small and noisy labeled datasets are available. Current methods use pre-trained embeddings to initialize model parameters, and then use the labeled data to tailor them for the intended task. However, this approach is prone to overfitting when the tr...
Conference Paper
We address the publication of a large academic information dataset addressing privacy issues. We evaluate anonymization techniques achieving the intended protection, while retaining the utility of the anonymized data. The released data could help infer behaviors and subsequently find solutions for daily planning activities, such as cafeteria attend...
Conference Paper
Full-text available
Web archives already hold together more than 534 billion files and this number continues to grow as new initiatives arise. Searching on all versions of these files acquired throughout time is challenging, since users expect as fast and precise answers from web archives as the ones provided by current web search engines. This work studies, for the...
Article
Full-text available
Epidemiology is a data-intensive and multi-disciplinary subject, where data integration, curation and sharing are becoming increasingly relevant, given its global context and time constraints. The semantic annotation of epidemiology resources is a cornerstone to effectively support such activities. Although several ontologies cover some of the subd...
Article
Full-text available
Scientific repositories promote information sharing at a global scale. However, users will share sensitive resources in such repositories only if they are trustworthy. In addition, they must provide intuitive mechanisms to manage who can access resource collections. Current approaches, which rely on a central administration, are not flexible and d...
Conference Paper
Full-text available
Relationship extraction is concerned with the detection and classification of semantic relationships between entities mentioned in a collection of textual documents. This paper proposes a simple and on-line approach for addressing the automated extraction of semantic relations, based on the idea of nearest neighbor classification, and leveraging a minw...
Conference Paper
Full-text available
Web archives already hold more than 282 billion documents and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e. full-text search on the web within a user-specified time interval. Performance, scalability and ease of managem...
Article
Purpose ‐ POWER is an ontology of political processes and entities. It is designed for tracking politicians, political organizations and elections, both in mainstream and social media. The aim of this paper is to propose a data model to describe political agents and their relations over time. Design/methodology/approach ‐ The authors propose a data...
Conference Paper
Full-text available
The information published on the web, a representation of our collective memory, is rapidly vanishing. At least 77 web archives have been developed to cope with the web's transience problem, but despite their technology having achieved a good maturity level, the retrieval effectiveness of the search services they provide still presents unsatisfacto...
Article
Full-text available
Context-aware search in mobile web environments demands new retrieval methods that rank web resources based on the proximity to users' locations. This paper presents the indexing and ranking architecture of a new geographic web retrieval system that can accept the user location as input and ranks searched items based on the estimated distances b...
Conference Paper
Full-text available
We propose a new heuristic for toponym sense disambiguation, to be used when mapping toponyms in text to ontology concepts, using techniques based on semantic similarity measures. We evaluated the proposed approach using a collection of Portuguese news articles from which the geographic entity names were extracted and then manually mapped to concep...
Conference Paper
We present a methodology for automatically enlarging a Portuguese sentiment lexicon for mining social judgments from text, i.e., detecting opinions on human entities. Starting from publicly available language resources, the identification of human adjectives is performed through the combination of a linguistic-based strategy, for extracting human ad...
Conference Paper
Full-text available
Epidemiology is a domain of knowledge interconnected with many other domains, thus making it a good candidate for reusing existing ontologies that, despite having been created for different purposes, characterize information frequently manipulated by epidemiologists and public health scientists. This paper presents an evaluation of existing ontolog...
Chapter
Full-text available
We present a novel approach for epidemic data collection and integration based on the principles of interoperability and modularity. Accurate and timely epidemic models require large, fresh datasets. The World Wide Web, due to its explosion in data availability, represents a valuable source for epidemiological datasets. From an e-science perspectiv...
Article
Full-text available
The large-scale effort in developing, maintaining and making biomedical ontologies available motivates the application of similarity measures to compare ontology concepts or, by extension, the entities described therein. A common approach, known as semantic similarity, compares ontology concepts through the information content they share in the ont...
Conference Paper
Full-text available
POWER is an ontology of political processes. It is designed for tracking politicians, political organisations and elections, both in mainstream and social media. In social media, these entities (particularly humans) are frequently named by emergent abbreviations, non-standardized acronyms, nicknames, metaphoric expressions and neologisms. Politicia...
Article
Full-text available
Web archives are a huge source of information to mine the past. However, tools to explore web archives are still in their infancy, in part due to the reduced knowledge that we have of their users. We contribute to this knowledge by presenting the first search behavior characterization of web archive users. We obtained detailed statistics about the...
Conference Paper
Full-text available
A complete characterization of web archive users must respond to three questions: why, what and how do users search? This study focuses on the first two: what are the user intents and which topics are most interesting to them? Answers to these questions are essential for guiding the development of web archives towards better user satisfaction. We u...
Conference Paper
Full-text available
We investigate the expression of opinions about human entities in user-generated content (UGC). A set of 2,800 online news comments (8,000 sentences) was manually annotated, following a rich annotation scheme designed for this purpose. We conclude that the challenge in performing opinion mining in this type of content is correctly identifying the p...
Conference Paper
Full-text available
The Internet of Things makes it possible to adapt the behaviour of business processes in response to real-time context updates. In addition, physical items can run and validate parts of the business processes and optimise their execution, while reducing message transmissions. State-of-the-art event-driven, service-oriented architecture approaches c...
Chapter
Molecular Biology research projects produced vast amounts of data, part of which has been preserved in a variety of public databases. However, a large portion of the data contains a significant number of errors and therefore requires careful verification by curators, a painful and costly task, before being reliable enough to derive valid conclusion...
Article
Full-text available
We present a characterization of the information-seeking behavior of the users of a Portuguese web search engine, based on the analysis of its logs. We obtained detailed statistics about the users' sessions, queries, terms and searched topics over a period of two years. The results show that the users prefer fast and short sessions, composed of s...
Conference Paper
Full-text available
Semantic Web applications that include map visualization clients are becoming common. When the description of an entity contains coordinate pairs, semantic applications often lay them as pins on maps provided by Web mapping service applications, such as Google Maps. Nowadays, semantic applications cannot guarantee that those maps provide spatial in...
Article
Full-text available
Background: Geo-Net-PT is a geospatial ontology representing the Portuguese territory and the relations between the several locations within it. Yahoo! GeoPlanet™ is a geospatial ontology that covers the whole world. To interlink the two ontologies and reduce the effects of repeated information, we propose an automatic alignment between their admini...
Conference Paper
Full-text available
The performance of some tasks in Information Retrieval is strongly related to the extent and quality of the geographic knowledge about named places. This paper presents a conceptualization of the geographic knowledge, the Geo-Net vocabulary, and a tool for building large knowledge bases of named places, the GKB management system, developed in the G...
Conference Paper
Full-text available
The Epidemic Marketplace is part of a computational framework for organizing data for epidemic modeling and forecasting. It is a distributed data management platform where epidemiological data can be stored, managed and made available to the scientific community. It includes tools for the automatic interaction with other applications through web se...
Conference Paper
Full-text available
This paper analyzes the requirements and presents a novel approach to the development of a system for epidemiological data collection and integration based on the principles of interoperability and modularity. Accurate and timely epidemic models require the integration of large, fresh datasets. Thus, from an e-science perspective, collected data sh...
Conference Paper
Full-text available
Most geographic queries include references to entities (geographic and non-geographic). Grounding such entities is essential to properly understand the user's information need. As statistical-based query reformulation strategies work at term level, not entity level, they don't use the semantic information given by such entities, which is considerab...
Conference Paper
Full-text available
Geographic Information Retrieval (GIR) systems rely on the identification and disambiguation of place names in documents to determine the region about which they are relevant. The place names are mapped into geographic concepts and used to assign an encompassing concept (a scope) to each document. However, sometimes a single scope is too restrictiv...
Article
Full-text available
We propose and evaluate a method for automatically creating a reference corpus for training text classification procedures for mining political opinions in user-generated content. The process starts by compiling a collection of highly opinionated comments posted by users on an on-line newspaper. Then, we define and use a set of manually-crafted hig...
Article
Full-text available
We investigate the accuracy of a set of surface patterns in identifying ironic sentences in comments submitted by users to an on-line newspaper. The initial focus is on identifying irony in sentences containing positive predicates since these sentences are more exposed to irony, making their true polarity harder to recognize. We show that it is pos...