Altigran Soares da Silva

  • Ph.D.
  • Professor (Full) at Federal University of Amazonas

About

Publications: 216
Reads: 74,935
Citations: 3,753
Introduction
Researcher, lecturer, and advisor at the undergraduate, master's, and doctoral levels. His research interests involve Data Management, Information Retrieval, and Data Mining with emphasis on the World-Wide Web and Social Media.
Current institution
Federal University of Amazonas
Current position
  • Professor (Full)
Additional affiliations
August 2010 - July 2011
Federal University of Minas Gerais
Position
  • Researcher
March 1991 - June 2020
Federal University of Amazonas
Position
  • Professor (Full)

Publications (216)
Preprint
Full-text available
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without requiring schema knowledge or query-language proficiency. Although numerous R-KwS methods have been proposed, most still focus on queries referring only to attribute values or primarily address performance enha...
Conference Paper
Full-text available
Online reviews play a key role in influencing customer decisions during their purchase journey. Consequently, negative feedback from customers can have an adverse impact on the sales of products or services, potentially leading to diminished revenue and market share. However, this effect can be mitigated by crafting thoughtful responses to these comm...
Conference Paper
Full-text available
This study explores the application of instruction tuning in open-source small language models for Portuguese End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA), focusing on restaurant reviews. Utilizing a diverse dataset from sources such as Google Reviews, TripAdvisor, Instagram, and iFood, the research evaluates the performance of PTT5 Base,...
Article
Full-text available
We tackle the challenge of conducting an approximate prefix search within datasets of strings. We explore using a bit-parallelism technique to compute the edit distance between distinct strings and illustrate its adaptation for an approximate prefix search procedure referred to as BWBEV. This technique employs a unary representation of edit vectors...
Conference Paper
Large Language Models (LLMs) have been used in recommender systems to improve the user experience and reduce information overload. With the popularity of generative AI, this approach is growing and showing promising results. Open LLMs are of great interest due to their accessibility and potential for adjus...
Conference Paper
Full-text available
This paper describes the method developed by the UFAM team in the 10th COLIEE for Task 1, the legal case retrieval task. In a nutshell, we propose a topic-based approach composed of two phases: filtering and ranking. In the filtering phase, a topic discovery technique is applied to the entire dataset to select an initial set of candidate cases for...
Article
Full-text available
The adoption of document stores, such as MongoDB or CouchDB, has increased drastically in recent years. Part of this popularity can certainly be explained by their flexibility in loading, storing, and retrieving semi-structured data on massive scales. However, adopting such systems presents challenges when exploring the data they store, since docum...
Article
Full-text available
This article presents an account of successful experiences at UFAM involving the creation of startups from the academic environment and the establishment of cooperation with large companies in research and development projects. It shows how the ecosystem created around the University fostered the emergence of new businesses and cooperation with compa...
Conference Paper
The term "Data Engineering" (DE) has been used frequently in the literature and in current curriculum proposals to refer to the processes of acquiring, organizing, and preparing data to be consumed in exploratory analyses, as input to systems and applications, or in other similar contexts. With the emergence of the Data Science field,...
Preprint
Full-text available
Named Entity Recognition (NER) is a machine learning task that traditionally relies on supervised learning and annotated data. Acquiring such data is often a challenge, particularly in specialized fields like medical, legal, and financial sectors. Those are commonly referred to as low-resource domains, which comprise long-tail entities, due to the...
Conference Paper
In this paper, we present a tool for querying relational DBs that uses a KG as an approach to generate SQL queries from NL specifications. In this approach, we argue that a KG representation of a relational DB schema can become an auxiliary tool in the translation process. Furthermore, we propose to automate the process of generating such a KG. Our...
Article
Full-text available
The proliferation of legal documents in various formats and their dispersion across multiple courts present a significant challenge for users seeking precise matches to their information requirements. Despite notable advancements in legal information retrieval systems, research into legal recommender systems remains limited. A plausible factor cont...
Conference Paper
This article describes a topic-based approach to the legal case retrieval problem. The method consists of two phases: filtering and ranking. In the first phase, a topic modeling technique is applied to the entire dataset to select an initial set of candidate cases for each quer...
Conference Paper
The term Data Engineering (DE) has been used to refer to the processes of acquiring, organizing, and preparing data to be consumed in analyses or applications. With the emergence of the Data Science field, this term has been used to encompass what was traditionally known as data management. In this study, we explore the...
Conference Paper
Several evaluation metrics for text generation have been proposed in recent years. However, many questions have arisen about how well they can assess the accuracy and quality of generated text. In this work, we study how some of the most popular text generation metrics behave when dealing with the task of text summarization in the d...
Conference Paper
There has been an explosion of available pre-trained and fine-tuned Generative Language Models (LM). They vary in the number of parameters, architecture, training strategy, and training set size. Aligned with it, alternative strategies exist to exploit these models, such as Fine-tuning and Prompt Engineering. However, many questions may arise throu...
Preprint
Full-text available
The adoption of document stores, such as MongoDB or CouchDB, has drastically increased in the past years. Part of this popularity can certainly be explained by their flexibility in terms of loading, storing, and retrieving semi-structured data on massive scales. However, adopting such systems presents challenges when exploring the data they store sinc...
Article
Full-text available
Technology has substantially transformed the way legal services operate in many different countries. With a large and complex collection of digitized legal documents, the judiciary system worldwide presents a promising scenario for the development of intelligent tools. In this work, we tackle the challenging task of organizing and summarizing the c...
Article
Full-text available
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. They take a keyword query, locate their corresponding elements in the target database, and connect them using information on PK/FK constraints. Although there are many...
Conference Paper
Full-text available
In this paper, we present a benchmark of several session-based, session-based with reminders and session-aware recommender systems that can be used to improve legal document recommendation in Jusbrasil, the largest legal search engine in Brazil. We focus this benchmark on the logged users, and the results show that some recommender systems can achi...
Article
Full-text available
In this work we address the problem of performing an error-tolerant prefix search on a set of string keys. While the ideas presented here could be adopted in other applications, our primary target application is error-tolerant query autocompletion. Tries and their variations have been adopted as the basic data structure to implement recently propos...
Article
Full-text available
As online purchasing becomes more popular, users trust more information published on social media than on advertisement content. Opinion mining is often applied to social media, and opinion target extraction is one of its main sub-tasks. In this paper, we focus on recognizing target entities related to electronic products. We propose a method calle...
Article
Full-text available
Product Graphs (PGs) are knowledge graphs that structure the relationship of products and their characteristics. They have become very popular lately due to their potential to enable AI-related tasks in e-commerce. With the rise of social media, a great deal of dynamic and subjective information on products and their characteristics became widely available, c...
Preprint
Full-text available
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. These systems take the keywords from the input query, locate the elements of the target database that correspond to these keywords, and look for ways to "connect" thes...
Article
Full-text available
Schema matching is the problem of finding semantic correspondences between elements from different schemas. This is a challenging problem since disparate elements in the schemas often represent the same concept. Traditional instances of this problem involved a pair of schemas. However, recently, there has been an increasing interest in matching sev...
Conference Paper
Full-text available
Document stores (gerenciadores de documentos, GDs), such as MongoDB and CouchDB, have become increasingly popular due to their flexibility in loading and retrieving data at large scale using semi-structured documents, since they avoid the need to define schemas before data ingestion. On the other hand, specifying queries in this type...
Conference Paper
Over the last decades, databases (DBs) have been the main computational resource used to store and manage data for the most varied types of applications. Typically, DBs store factual and objective information about real-world entities, which are represented as a set of attributes. However, there has been...
Chapter
Due to globalization and the technological advances of the last decades, a large amount of data is created every day, especially on the Web. Web Services are one of the main artifacts created on the Web; they can provide access to sources of data of many sizes and types. In this work, we approach the challenge of designing and evaluating a Web Craw...
Conference Paper
The approval of the General Data Protection Regulation (GDPR) brought a revolution in the way we treat data produced in digital media. The GDPR increases individuals’ participation in the treatment of their data, and it also introduces technical challenges, whose failure can lead to a fine of 4% of the organization’s annual revenue. Among many appr...
Preprint
Full-text available
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models in order to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from addit...
Chapter
A large number of opinions on products and their features are posted every day on e-commerce websites in user reviews. They are a valuable source of knowledge for both manufacturers and customers. However, reviews often bring so much information that it exceeds the human capacity of reasoning and hampers their effective use. Thus, researchers on how t...
Article
Several systems proposed for processing keyword queries over relational databases rely on the generation and evaluation of Candidate Networks (CNs), i.e., networks of joined database relations that when processed as SQL queries, provide a relevant answer to the input keyword query. Although the evaluation of CNs has been extensively addressed in th...
Article
Full-text available
In this study, we propose and evaluate a novel learning-to-rank (L2R) approach that produces results on par with those of the state-of-the-art L2R methods while being computationally effective. We start by presenting a modified gradient boosted regression tree algorithm to generate unified term impact (UTI) values at indexing time. Each unified ter...
Article
Medical data processing has found a new dimension with the extensive use of machine-learning techniques to classify and extract features. Machine learning strongly benefits from computing accelerators. However, such accelerators are not easily available at hospital premises, although they can be easily found on public cloud infrastructures or resea...
Conference Paper
Full-text available
When making purchasing decisions, customers usually rely on information from two types of sources: product specifications, provided by manufacturers, and reviews, posted by other customers. Both kinds of information are often available on e-commerce websites. While researchers have demonstrated the importance of product specifications and reviews a...
Article
A vast number of user opinions are available from reviews posted on e-commerce websites. Although these opinions are a valuable source of knowledge for both manufacturers and customers, they provide volumes of information that exceed the human cognitive processing capacity, which can be a major bottleneck for their effective use. To address this p...
Chapter
Full-text available
User opinions posted on e-commerce websites are a valuable source to support purchase decision-making. Unfortunately, it is not generally feasible for an ordinary buyer to examine a large set of reviews on a given product for useful information on certain attributes. We present a system named Contender that can summarize product reviews aligned to...
Article
Full-text available
In this paper, we propose a method for enriching product catalogs, which traditionally include only objective data provided by manufacturers or retailers, with subjective information extracted from reviews written by customers. Our method was designed to associate opinions taken from reviews with the product attributes they refer to. This...
Conference Paper
The Steiner tree problem in graphs is NP-hard; however, algorithms that use heuristics can obtain results close to the optimal solution in polynomial time. In this work, we present a new algorithm based on an exact enumerative algorithm from the literature. The proposed heuristic selects vertices as candidates to be...
Article
Full-text available
This article addresses the problem of representation, indexing and retrieval of images through the signature-based bag of visual words (S-BoVW) paradigm, which maps features extracted from image blocks into a set of words without the need of clustering processes. Here, we propose the first ever method based on the S-BoVW paradigm that considers inf...
Article
Full-text available
In this paper, we present Waves, a novel document-at-a-time algorithm for fast computing of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results which we call waves. Each wave traverses the index, starting from a specific tier level i. Each...
Conference Paper
Online social media has become an essential part of our life. This media is often characterized by its diverse content, which is produced by ordinary users. The potential to easily express ideas and opinions has made social media a source of valuable information on a variety of topics. In particular, information containing comments about consumer p...
Article
In this paper we propose and evaluate the Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP. It is an extension of BMW-CS, a method previously proposed by us. Although very efficient, BMW-CS does not guarantee preserving the top-k results for a given query. Algorithms that do not preserve the top results m...
Article
In a stream environment, differently from traditional databases, data arrive continuously, unindexed and potentially unbounded, whereas queries must be evaluated for producing results on the fly. In this article, we propose two new algorithms (called SLCAStream and ELCAStream) for processing multiple keyword queries over XML streams. Both algorithm...
Article
A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate contents. To crawl, store, and use such duplicated data implies a waste of resources, the building of low quality rankings, and poor user experiences. To deal with this problem, several studies have been proposed to detect and remove duplicate docu...
Article
In this paper, we revisit SDLC, an image retrieval method that adopts a signature-based approach to identify visual words, instead of the more conventional approach that identifies them by using clustering techniques. We start by providing a formal and generalized definition of the approach adopted in SDLC, which we call Signature-Based Bag of Visu...
Article
Relational keyword search (R-KwS) systems based on schema graphs take the keywords from the input query, find the tuples and tables where these keywords occur and look for ways to 'connect' these keywords using information on referential integrity constraints, i.e., key/foreign key pairs. The result is a number of expressions, called Candidate Netw...
Conference Paper
Important applications in product opinion mining such as opinion summarization and aspect extraction require the recognition of product mentions as a basic task. In the case of consumer electronic products, Web forums are important and popular sources of valuable opinions. Forum users often refer to products by means of their model numbers. In a po...
Article
Full-text available
Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an im...
Conference Paper
Full-text available
In this paper, we tackle the problem of processing various keyword-based queries over XML streams in a scalable way, improving recent multi-query processing approaches. We propose a customized algorithm, called MKStream, that relies on parsing stacks designed for simultaneously matching several queries. Particularly, it explores the possibility of...
Conference Paper
The late-breaking Results, the Doctoral Consortium and the Workshop papers at the 25th ACM conference on Hypertext and Social Media deal with different exciting topics related to emerging areas of research, with the aim of discussing best practices and innovative approaches. Late-breaking results and Doctoral Consortium works focus on some key i...
Article
Full-text available
A substantial fraction of web search queries contain references to entities, such as persons, organizations, and locations. Recently, methods that exploit named entities have been shown to be more effective for query expansion than traditional pseudo-relevance feedback methods. In this paper, we introduce a supervised learning approach that exploit...
Article
The quality of a Web search engine is influenced by several factors, including coverage and the freshness of the content gathered by the web crawler. Focusing particularly on freshness, one key challenge is to estimate the likelihood of a previously crawled webpage being modified. Such estimates are used to define the order in which those pages sho...
Chapter
This chapter presents ONDUX (On Demand Unsupervised Information Extraction), a method that relies on the presented unsupervised approach to deal with the Information Extraction by Text Segmentation problem. ONDUX was first presented in Cortez et al. (2010) and in Cortez and da Silva (2010). Subsequently, a tool based on ONDUX was presented in Porto et...
Chapter
This chapter presents the conclusions and discusses directions for future work based on the unsupervised approach presented here.
Chapter
This chapter presents iForm, a method for automatically using data-rich text for filling form-based input interfaces that relies on the presented unsupervised approach to deal with the Information Extraction by Text Segmentation problem. iForm was first presented in Toda et al. (2009, 2010). The following describes the scenario where iForm is a...
Chapter
This chapter describes in detail a new approach for exploiting preexisting datasets to support Information Extraction by Text Segmentation methods. First, it presents a brief overview of the approach and introduces the concept of knowledge base. Next, it discusses all the steps involved in the unsupervised approach, including how to learn content-b...
Chapter
In the literature, different approaches have been proposed to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins by presenting a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), they are divid...
Chapter
This chapter presents Joint Unsupervised Structure Discovery and Information Extraction (JUDIE), a method for addressing the IETS problem. JUDIE was presented in Cortez et al. (2011). First, the scenario that JUDIE targets is introduced; then we go over the proposed solution, detailing all the steps that comprise JUDIE. Finally, an expe...
Conference Paper
In this work, we present DUSTER, a new approach to detect and eliminate redundant content when crawling the web. DUSTER takes advantage of a multi-sequence alignment strategy to learn rewriting rules able to transform URLs into others likely to have similar content, when that is the case. We show the alignment strategy that can lead to a reduction in th...
Conference Paper
A key challenge endured when designing a scheduling policy regarding freshness is to estimate the likelihood of a previously crawled webpage being modified on the web. This estimate is used to define the order in which those pages should be visited, and can be explored to reduce the cost of monitoring crawled webpages for keeping updated versions....
Conference Paper
In this paper we present two new algorithms designed to reduce the overall time required to process top-k queries. These algorithms are based on the document-at-a-time approach and modify the best baseline we found in the literature, Blockmax WAND (BMW), to take advantage of a two-tiered index, in which the first tier is a small index containing on...
Conference Paper
Full-text available
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messa...
Article
Full-text available
The schema matching problem can be defined as the task of finding semantic relationships between schema elements existing in different data repositories. Despite the existence of elaborated graphic tools for helping to find such matches, this task is usually manually done. In this paper, we propose a novel evolutionary approach to addressing the pr...
Article
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messa...
Book
A new unsupervised approach to the problem of Information Extraction by Text Segmentation (IETS) is proposed, implemented and evaluated herein. The authors' approach relies on information available on pre-existing data to learn how to associate segments in the input string with attributes of a given domain relying on a very effective set of content...
Article
Previous work in the literature has indicated that templates of web pages represent noisy information in web collections, and advocates that the simple removal of templates results in improvements in the quality of results provided by Web search systems. In this paper, we study the impact of template removal in two distinct scenarios: large scale web s...
Conference Paper
The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obta...
Article
State-of-the-art search engine ranking methods combine several distinct sources of relevance evidence to produce a high-quality ranking of results for each query. The fusion of information is currently done at query-processing time, which has a direct effect on the response time of search systems. Previous research also shows that an alternative to...
Article
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, there have been significant investments from private and government organizations for develop...
Conference Paper
Recent work on Content-Based Image Retrieval (CBIR) has presented alternative methods for fast image indexing and retrieval using Bags of Visual Words (BoVW). In such methods, images are represented as sets of visual words, which can be indexed and searched using well-known text retrieval techniques, allowing fast search on large image databases....
Conference Paper
Full-text available
Many user queries nowadays contain references to named entities, which has motivated the development of new methods that exploit entity semantics for query expansion. At the same time, Wikipedia has been widely recognized as a large network of named entities, where entity-related articles are organized into a comprehensive hierarchy of categories a...
Conference Paper
Full-text available
In this paper, we propose that various keyword-based queries be processed over XML streams in a multi-query processing way. Our algorithms rely on parsing stacks designed for simultaneously matching terms from several distinct queries and use new query indexes to speed up search operations when processing a large number of queries. Besides defining...
Conference Paper
Full-text available
The Web has become a huge repository of pages and search engines allow users to find relevant information in this repository. Web crawlers are an important component of search engines. They find, download, parse content and store pages in a repository. In this paper, we present a new algorithm for verifying URL uniqueness in a large-scale web crawl...
Article
In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We present a study about the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information ab...
Conference Paper
Information about how to segment a Web page can be used nowadays by applications such as segment aware Web search, classification and link analysis. In this research, we propose a fully automatic method for page segmentation and evaluate its application through experiments with four separate Web sites. While the method may be used in other applicat...
Conference Paper
Full-text available
Learning from unlabeled data provides innumerable advantages to a wide range of applications where there is a huge amount of unlabeled data freely available. Semi-supervised learning, which builds models from a small set of labeled examples and a potential large set of unlabeled examples, is a paradigm that may effectively use those unlabeled data....
Conference Paper
Full-text available
As the number of research papers available on the Web has increased enormously over the years, paper recommender systems have been proposed to help researchers automatically find works of interest. The main problem with the current approaches is that they assume that recommending algorithms are provided with a rich set of evidence (e.g., docu...
Conference Paper
Full-text available
In this poster paper, we present an overview of CiênciaBrasil, a research social network involving researchers within the Brazilian INCT program. We describe its architecture and the solutions adopted for data collection, extraction, and deduplication, and for materializing and visualizing the network.
Conference Paper
In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records in the form of continuous text (e.g., bibliographic citations, postal addresses, classified ads, etc.) and having no explicit delimiters between them. While in state-of-the-art In...
Article
Full-text available
We propose a strategy for automatically obtaining datasets from Wikipedia to support unsupervised Information Extraction by Text Segmentation (IETS) methods. Despite the importance of preexisting datasets to unsupervised IETS methods, there has been no proper discussion in the literature on how such datasets can be effectively obtained or built. We...
Conference Paper
Full-text available
In this paper, we present a novel method for automatically deriving structured XML queries from keyword-based queries and show how it was applied to the experimental tasks proposed for the INEX 2010 data-centric track. In our method, called StruX, users specify a schema-independent unstructured keyword-based query and it automatically generates a t...
