Altigran Soares da Silva

Federal University of Amazonas | UFAM · Institute of Computing (IComp)

Ph.D.

About

188
Publications
36,671
Reads
3,321
Citations
Introduction
Researcher, lecturer and advisor at the undergraduate, master's, and doctoral levels. His research interests involve Data Management, Information Retrieval, and Data Mining with emphasis on the World-Wide Web and Social Media.
Additional affiliations
August 2010 - July 2011
Federal University of Minas Gerais
Position
  • Researcher
March 1991 - June 2020
Federal University of Amazonas
Position
  • Professor (Full)

Publications (188)
Article
Full-text available
As online purchasing becomes more popular, users place more trust in information published on social media than in advertising content. Opinion mining is often applied to social media, and opinion target extraction is one of its main sub-tasks. In this paper, we focus on recognizing target entities related to electronic products. We propose a method calle...
Article
Full-text available
Product Graphs (PGs) are knowledge graphs that structure the relationships between products and their characteristics. They have become very popular lately due to their potential to enable AI-related tasks in e-commerce. With the rise of social media, much dynamic and subjective information on products and their characteristics has become widely available, c...
Preprint
Full-text available
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. These systems take the keywords from the input query, locate the elements of the target database that correspond to these keywords, and look for ways to "connect" thes...
Article
Full-text available
Schema matching is the problem of finding semantic correspondences between elements from different schemas. This is a challenging problem since disparate elements in the schemas often represent the same concept. Traditional instances of this problem involved a pair of schemas. However, recently, there has been an increasing interest in matching sev...
Conference Paper
Document stores, such as MongoDB and CouchDB, have become increasingly popular due to their flexibility in loading and retrieving large-scale data using semi-structured documents, since they avoid the need to define schemas before data ingestion. On the other hand, specifying queries in this type...
Conference Paper
Over the last decades, databases (DBs) have been the main computational resource used to store and manage data for the most varied types of applications. Typically, DBs store factual and objective information about real-world entities, which are represented as a set of attributes. However, there has been...
Chapter
Due to globalization and the technological advances of the last decades, a large amount of data is created every day, especially on the Web. Web Services are one of the main artifacts created on the Web; they can provide access to sources of data of many sizes and types. In this work, we approach the challenge of designing and evaluating a Web Craw...
Conference Paper
The approval of the General Data Protection Regulation (GDPR) brought a revolution in the way we treat data produced in digital media. The GDPR increases individuals’ participation in the treatment of their data, and it also introduces technical challenges; failing to meet them can lead to a fine of 4% of the organization’s annual revenue. Among many appr...
Preprint
Full-text available
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models in order to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from addit...
Chapter
A large number of opinions on products and their features are posted every day on e-commerce websites in user reviews. They are a valuable source of knowledge for both manufacturers and customers. However, reviews often carry so much information that it exceeds human reasoning capacity and hampers their effective use. Thus, researchers on how t...
Article
Several systems proposed for processing keyword queries over relational databases rely on the generation and evaluation of Candidate Networks (CNs), i.e., networks of joined database relations that when processed as SQL queries, provide a relevant answer to the input keyword query. Although the evaluation of CNs has been extensively addressed in th...
Article
Full-text available
In this study, we propose and evaluate a novel learning-to-rank (L2R) approach that produces results on par with those of the state-of-the-art L2R methods while being computationally effective. We start by presenting a modified gradient boosted regression tree algorithm to generate unified term impact (UTI) values at indexing time. Each unified ter...
Article
Medical data processing has found a new dimension with the extensive use of machine-learning techniques to classify and extract features. Machine learning strongly benefits from computing accelerators. However, such accelerators are not easily available at hospital premises, although they can be easily found on public cloud infrastructures or resea...
Conference Paper
Full-text available
When making purchasing decisions, customers usually rely on information from two types of sources: product specifications, provided by manufacturers, and reviews, posted by other customers. Both kinds of information are often available on e-commerce websites. While researchers have demonstrated the importance of product specifications and reviews a...
Article
A vast number of user opinions are available from reviews posted on e-commerce websites. Although these opinions are a valuable source of knowledge for both manufacturers and customers, they provide volumes of information that exceed human cognitive processing capacity, which can be a major bottleneck for their effective use. To address this p...
Chapter
Full-text available
User opinions posted on e-commerce websites are a valuable source to support purchase decision-making. Unfortunately, it is not generally feasible for an ordinary buyer to examine a large set of reviews on a given product for useful information on certain attributes. We present a system named Contender that can summarize product reviews aligned to...
Article
Full-text available
In this paper, we propose a method for enriching product catalogs, which traditionally include only objective data provided by manufacturers or retailers, with subjective information extracted from reviews written by customers. Our method was designed to associate opinions taken from reviews with the product attributes they refer to. This...
Conference Paper
The Steiner tree problem in graphs is NP-hard; however, algorithms that use heuristics can obtain results close to the optimal solution in polynomial time. In this work, we present a new algorithm based on an exact enumerative algorithm from the literature. The proposed heuristic selects vertices as candidates to be...
Article
Full-text available
This article addresses the problem of representation, indexing and retrieval of images through the signature-based bag of visual words (S-BoVW) paradigm, which maps features extracted from image blocks into a set of words without the need of clustering processes. Here, we propose the first ever method based on the S-BoVW paradigm that considers inf...
Article
Full-text available
In this paper, we present Waves, a novel document-at-a-time algorithm for fast computing of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results which we call waves. Each wave traverses the index, starting from a specific tier level i. Each...
Conference Paper
Online social media has become an essential part of our life. This media is often characterized by its diverse content, which is produced by ordinary users. The potential to easily express ideas and opinions has made social media a source of valuable information on a variety of topics. In particular, information containing comments about consumer p...
Article
In this paper we propose and evaluate the Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP. It is an extension of BMW-CS, a method previously proposed by us. Although very efficient, BMW-CS does not guarantee preserving the top-k results for a given query. Algorithms that do not preserve the top results m...
Article
In a stream environment, unlike in traditional databases, data arrive continuously, unindexed and potentially unbounded, whereas queries must be evaluated to produce results on the fly. In this article, we propose two new algorithms (called SLCAStream and ELCAStream) for processing multiple keyword queries over XML streams. Both algorithm...
Article
A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate contents. To crawl, store, and use such duplicated data implies a waste of resources, the building of low quality rankings, and poor user experiences. To deal with this problem, several studies have been proposed to detect and remove duplicate docu...
Article
In this paper, we revisit SDLC, an image retrieval method that adopts a signature-based approach to identify visual words, instead of the more conventional approach that identifies them by using clustering techniques. We start by providing a formal and generalized definition of the approach adopted in SDLC, which we call Signature-Based Bag of Visu...
Article
Relational keyword search (R-KwS) systems based on schema graphs take the keywords from the input query, find the tuples and tables where these keywords occur and look for ways to 'connect' these keywords using information on referential integrity constraints, i.e., key/foreign key pairs. The result is a number of expressions, called Candidate Netw...
Conference Paper
Important applications in product opinion mining such as opinion summarization and aspect extraction require the recognition of product mentions as a basic task. In the case of consumer electronic products, Web forums are important and popular sources of valuable opinions. Forum users often refer to products by means of their model numbers. In a po...
Article
Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an im...
Conference Paper
Full-text available
In this paper, we tackle the problem of processing various keyword-based queries over XML streams in a scalable way, improving recent multi-query processing approaches. We propose a customized algorithm, called MKStream, that relies on parsing stacks designed for simultaneously matching several queries. Particularly, it explores the possibility of...
Conference Paper
The Late-breaking Results, Doctoral Consortium and Workshop papers at the 25th ACM Conference on Hypertext and Social Media deal with various exciting topics related to emerging areas of research, with the aim of discussing best practices and innovative approaches. Late-breaking results and Doctoral Consortium works focus on some key i...
Article
Full-text available
A substantial fraction of web search queries contain references to entities, such as persons, organizations, and locations. Recently, methods that exploit named entities have been shown to be more effective for query expansion than traditional pseudo-relevance feedback methods. In this paper, we introduce a supervised learning approach that exploit...
Article
The quality of a Web search engine is influenced by several factors, including coverage and the freshness of the content gathered by the web crawler. Focusing particularly on freshness, one key challenge is to estimate the likelihood of a previously crawled webpage being modified. Such estimates are used to define the order in which those pages sho...
Chapter
This chapter presents ONDUX (On Demand Unsupervised Information Extraction), a method that relies on the presented unsupervised approach to deal with the Information Extraction by Text Segmentation problem. ONDUX was first presented in Cortez et al. (2010) and in Cortez and da Silva (2010). Subsequently, a tool based on ONDUX was presented in Porto et...
Chapter
This chapter presents the conclusions and discusses directions for future work based on the unsupervised approach presented here.
Chapter
This chapter presents iForm, a method for automatically filling form-based input interfaces using data-rich text; it relies on the presented unsupervised approach to deal with the Information Extraction by Text Segmentation problem. iForm was first presented in Toda et al. (2009, 2010). The chapter then describes the scenario where iForm is a...
Chapter
This chapter describes in detail a new approach for exploiting preexisting datasets to support Information Extraction by Text Segmentation methods. First, it presents a brief overview of the approach and introduces the concept of knowledge base. Next, it discusses all the steps involved in the unsupervised approach, including how to learn content-b...
Chapter
In the literature, different approaches have been proposed to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins by presenting a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), they are divid...
Chapter
This chapter presents Joint Unsupervised Structure Discovery and Information Extraction (JUDIE), a method for addressing the IETS problem. JUDIE was presented in Cortez et al. (2011). First, we introduce the scenario that JUDIE targets; then we go over the proposed solution, detailing all the steps that comprise JUDIE. Finally, an expe...
Conference Paper
In this work, we present DUSTER, a new approach to detect and eliminate redundant content when crawling the web. DUSTER takes advantage of a multi-sequence alignment strategy to learn rewriting rules able to transform URLs into other URLs likely to have similar content, when applicable. We show the alignment strategy that can lead to a reduction in th...
Conference Paper
A key challenge endured when designing a scheduling policy regarding freshness is to estimate the likelihood of a previously crawled webpage being modified on the web. This estimate is used to define the order in which those pages should be visited, and can be explored to reduce the cost of monitoring crawled webpages for keeping updated versions....
Conference Paper
In this paper we present two new algorithms designed to reduce the overall time required to process top-k queries. These algorithms are based on the document-at-a-time approach and modify the best baseline we found in the literature, Blockmax WAND (BMW), to take advantage of a two-tiered index, in which the first tier is a small index containing on...
Conference Paper
Full-text available
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messa...
Article
Full-text available
The schema matching problem can be defined as the task of finding semantic relationships between schema elements existing in different data repositories. Despite the existence of elaborate graphical tools for helping to find such matches, this task is usually done manually. In this paper, we propose a novel evolutionary approach to addressing the pr...
Article
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messa...
Book
A new unsupervised approach to the problem of Information Extraction by Text Segmentation (IETS) is proposed, implemented and evaluated herein. The authors' approach relies on information available on pre-existing data to learn how to associate segments in the input string with attributes of a given domain relying on a very effective set of content...
Article
Previous work in the literature has indicated that web page templates represent noisy information in web collections, and advocates that simply removing templates improves the quality of results provided by Web search systems. In this paper, we study the impact of template removal in two distinct scenarios: large scale web s...
Conference Paper
The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obta...
Article
State-of-the-art search engine ranking methods combine several distinct sources of relevance evidence to produce a high-quality ranking of results for each query. The fusion of information is currently done at query-processing time, which has a direct effect on the response time of search systems. Previous research also shows that an alternative to...
Article
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, there have been significant investments from private and government organizations for develop...
Conference Paper
Recent work on Content-Based Image Retrieval (CBIR) has presented alternative methods for fast image indexing and retrieval using Bags of Visual Words (BoVW). In such methods, images are represented as sets of visual words, which can be indexed and searched using well-known text retrieval techniques, allowing fast search on large image databases....
Conference Paper
Full-text available
Many user queries nowadays contain references to named entities, which has motivated the development of new methods that exploit entity semantics for query expansion. At the same time, Wikipedia has been widely recognized as a large network of named entities, where entity-related articles are organized into a comprehensive hierarchy of categories a...
Conference Paper
Full-text available
In this paper, we propose that various keyword-based queries be processed over XML streams in a multi-query processing way. Our algorithms rely on parsing stacks designed for simultaneously matching terms from several distinct queries and use new query indexes to speed up search operations when processing a large number of queries. Besides defining...
Conference Paper
Full-text available
The Web has become a huge repository of pages and search engines allow users to find relevant information in this repository. Web crawlers are an important component of search engines. They find, download, parse content and store pages in a repository. In this paper, we present a new algorithm for verifying URL uniqueness in a large-scale web crawl...
Article
In this article, we present a study of classification methods for large-scale categorization of product offers on e-shopping web sites. We studied the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information ab...
Article
Full-text available
We propose a strategy for automatically obtaining datasets from Wikipedia to support unsupervised Information Extraction by Text Segmentation (IETS) methods. Despite the importance of preexisting datasets to unsupervised IETS methods, there has been no proper discussion in the literature on how such datasets can be effectively obtained or built. We...
Conference Paper
Full-text available
Learning from unlabeled data provides innumerable advantages to a wide range of applications where there is a huge amount of unlabeled data freely available. Semi-supervised learning, which builds models from a small set of labeled examples and a potentially large set of unlabeled examples, is a paradigm that may effectively use those unlabeled data....
Conference Paper
Full-text available
As the number of research papers available on the Web has increased enormously over the years, paper recommender systems have been proposed to help researchers automatically find works of interest. The main problem with the current approaches is that they assume that recommending algorithms are provided with a rich set of evidence (e.g., docu...
Conference Paper
Full-text available
In this poster paper, we present an overview of CiênciaBrasil, a research social network involving researchers within the Brazilian INCT program. We describe its architecture and the solutions adopted for data collection, extraction, and deduplication, and for materializing and visualizing the network.
Conference Paper
In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records in the form of continuous text (e.g., bibliographic citations, postal addresses, classified ads, etc.) and having no explicit delimiters between them. While in state-of-the-art In...
Conference Paper
Information about how to segment a Web page can be used nowadays by applications such as segment aware Web search, classification and link analysis. In this research, we propose a fully automatic method for page segmentation and evaluate its application through experiments with four separate Web sites. While the method may be used in other applicat...
Conference Paper
Full-text available
In this paper, we present a novel method for automatically deriving structured XML queries from keyword-based queries and show how it was applied to the experimental tasks proposed for the INEX 2010 data-centric track. In our method, called StruX, users specify a schema-independent unstructured keyword-based query and it automatically generates a t...
Article
In this work, we investigate the problem of using the block structure of Web pages to improve ranking results. Starting with basic intuitions provided by the concepts of term frequency (TF) and inverse document frequency (IDF), we propose nine block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside...