
Ruggero G. Pensa
Università degli Studi di Torino | UNITO · Dipartimento di Informatica
Associate Professor
About
Publications: 93
Reads: 18,072
Citations: 1,200
Introduction
My main research interests are data mining and knowledge discovery, bioinformatics (gene expression data analysis), privacy-preserving algorithms for data management, and social network analysis.
Additional affiliations
December 2011 - December 2011
December 2011 - present
November 2010 - October 2011
Education
November 2003 - November 2006
October 1998 - November 2003
Publications (93)
User-generated contents often contain private information, even when they are shared publicly on social media and on the web in general. Although many filtering and natural language approaches for automatically detecting obscenities or hate speech have been proposed, determining whether a shared post contains sensitive information is still an open...
Semi-supervised learning is crucial in many applications where accessing class labels is unaffordable or costly. The most promising approaches are graph-based but they are transductive and they do not provide a generalized model working on inductive scenarios. To address this problem, we propose a generic framework for inductive semi-supervised lea...
Distance-based machine learning methods have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Nonetheless, categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Although...
Owing to their stability and detectability, faecal microRNAs are promising molecules with potential clinical interest as non-invasive diagnostic and prognostic biomarkers. However, there is no evidence on how stool miRNA profiles change according to an individual’s age, sex, and body mass index (BMI) or how lifestyle habits influence the expression...
Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the comp...
Semi-supervised learning is crucial in many applications where accessing class labels is unaffordable or costly. The most promising approaches are graph-based but they are transductive and they do not provide a generalized model working on inductive scenarios. To address this problem, we propose a generic framework, ESA☆, for inductive semi-supervi...
Objectives: MicroRNA (miRNA) profiles have been evaluated in several biospecimens in relation to common diseases for which diet may have a considerable impact. We aimed at characterising how specific diets are associated with the miRNome in stool of vegans, vegetarians and omnivores and how this is reflected in the gut microbial composition, as thi...
The majority of the data produced by human activities and modern cyber-physical systems involve complex relations among their features. Such relations can often be represented by means of tensors, which can be viewed as a generalization of matrices and, as such, can be analyzed by using higher-order extensions of existing machine learning methods, su...
Satellite image time series (SITS) collected by modern Earth Observation (EO) systems represent a valuable source of information that supports several tasks related to the monitoring of the Earth surface dynamics over large areas. A main challenge is then to design methods able to leverage the complementarity between the temporal dynamics and the s...
Contagion processes have been widely studied in epidemiology and life science in general, but their implications are largely tangible in other research areas, such as in network science and computational social science. Contagion models, in particular, have proven helpful in the study of information diffusion, a very topical issue thanks to its app...
With the availability of user-generated content on the Web, malicious users have at their disposal huge repositories of private (and often sensitive) information about a large part of the world’s population. The self-disclosure of personal information, in the form of text, pictures and videos, exposes the authors of such content (and not only them) to man...
Semi-supervised learning is a family of classification methods conceived to reduce the amount of required labeled information in the training phase. Graph-based methods are among the most popular semi-supervised strategies: a nearest neighbor graph is built in such a way that the manifold of the data is captured and the labeled information is propa...
The quality of the transport system offered at city level constitutes an important and challenging goal for society, for local authorities, and transport operators. Therefore, appropriate evaluation of travellers' satisfaction is required to support service performance monitoring, benchmarking, and market analysis. This aspect implies the collectio...
In most real-world scenarios, experts have limited background knowledge that they can exploit to guide the analysis process. In this context, semi-supervised clustering can be employed to leverage such knowledge and enable the discovery of clusters that meet the analysts’ expectations. To this end, we propose a semi-supervised deep embeddi...
Tensor co-clustering has proven useful in many applications, thanks to its ability to cope with high-dimensional data and sparsity. However, setting up a co-clustering algorithm properly requires the specification of the desired number of clusters for each mode as input parameters. This choice is already difficult in relatively easy settings,...
Introduction: ISTAT data (2018) report that 86% of minors aged 11 to 14 use the internet, 5% more than in 2014. Digital Natives represent the new generation of students, who have grown up with digital devices that allow them to connect to the network at any time of day. The use of new technologies and the greater...
In most real-world scenarios, experts have limited background knowledge that they can exploit to guide the analysis process. In this context, semi-supervised clustering can be employed to leverage such knowledge and enable the discovery of clusters that meet the analysts' expectations. To this end, we propose a semi-supervised deep embeddi...
In this paper we point out some relevant issues in relation to privacy when providing holistic recommendations. We emphasize that a holistic recommender should be fair, explainable and privacy-preserving to ensure the ethicality of the recommendation process. Further, we point out relevant research questions that should be addressed in the future,...
Online social networks expose their users to privacy leakage risks. To measure the risk, privacy scores can be computed to quantify the users' profile exposure according to their privacy preferences or attitude. However, user privacy can be also influenced by external factors (e.g., the relative risk of the network, the position of the user within...
This book constitutes revised selected papers from two workshops held at the 18th European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, in Dublin, Ireland, in September 2018, namely MIDAS 2018 – Third Workshop on Mining Data for Financial Applications and PAP 2018 – Second International Workshop on Personal...
Modern Earth Observation systems provide remote sensing data at different temporal and spatial resolutions. Among the available space missions, the Sentinel-2 program today supplies images with high temporal resolution (every five days) and high spatial resolution (HSR) (10 m) that can be useful to monitor land cover dynamics. On the other hand, very HSR (VHS...
The success of a film is usually measured through its box-office revenue or through the opinion of professional critics; such measures, however, may be influenced by external factors, such as advertisement or trends, and are not able to capture the impact of a film over time. Thanks to the recent availability of data on references among movies, som...
In this paper, we address the problem of enhancing young people's awareness of the mechanisms involving privacy in online social networks by presenting an innovative approach based on gamification. In particular, we propose a web application that allows kids and teenagers to experience the typical dynamics of information spread through a realistic...
The success of a movie is usually measured through its box-office revenue or the opinion of professional critics, but such measures may be influenced by external factors, such as advertisement or trends, and are not able to capture the impact of a film over time. A more effective measure should account for the extent to which a given movie has influenced ot...
Public participation has become an important driver in increasing public acceptance of policy decisions, especially in the forestry sector, where conflicting interests among the actors are frequent. Stakeholder Analysis, complemented by Social Network Analysis techniques, was used to support the participatory process and to understand the complex r...
Modern Earth Observation systems provide sensing data at different temporal and spatial resolutions. Among optical sensors, today the Sentinel-2 program supplies images with high temporal resolution (every 5 days) and high spatial resolution (10 m) that can be useful to monitor land cover dynamics. On the other hand, Very High Spatial Resolution images (...
The problem of user privacy enforcement in online social networks (OSNs) cannot be ignored and, in recent years, Facebook and other providers have considerably improved their privacy protection tools. However, in OSNs the most powerful data protection “weapons” are the users themselves. The behavior of an individual acting in an OSN highly depends...
Information diffusion is a widely studied topic thanks to its applications to social media/network analysis, viral marketing campaigns, influence maximization and prediction. In bibliographic networks, for instance, an information diffusion process takes place when some authors who publish papers on a given topic influence some of their neighbor...
The maturity of structured knowledge bases and semantic resources has contributed to the enhancement of document clustering algorithms, which may take advantage of conceptual representations as an alternative to classic bag-of-words models. However, operating in the semantic space is not always the best choice in those domains where the choice of te...
During our digital social life, we share terabytes of information that can potentially reveal private facts and personality traits to unexpected strangers. Despite the research efforts aiming at providing efficient solutions for the anonymization of huge databases (including networked data), in online social networks the most powerful privacy prote...
In this paper, we introduce a new approach of semisupervised anomaly detection that deals with categorical data. Given a training set of instances (all belonging to the normal class), we analyze the relationship among features for the extraction of a discriminative characterization of the anomalous instances. Our key idea is to build a model that c...
Humans like to disseminate ideas and news, as proved by the huge success of online social networking platforms such as Facebook or Twitter. On the other hand, these platforms have emphasized the dark side of information spreading, such as the diffusion of private facts and rumors in the society. Fortunately, in some cases, online social network use...
The risks due to a global and unaware diffusion of our personal data cannot be overlooked when more than two billion people are estimated to be registered in at least one of the most popular online social networks. As a consequence, privacy has become a primary concern among social network analysts and Web/data scientists. Some studies propose to "...
The way we watch television is changing with the introduction of attractive Web activities that move users away from TV to other media. The social multimedia and user-generated contents are dramatically changing all phases of the value chain of contents (production, distribution and consumption). We propose a concept-level integration framework in...
Location-based social networks (LBSN) capture large amounts of data about the whereabouts of their users. This has become a social phenomenon that is changing the usual means of communication and opens new research perspectives on how to compute descriptive models from this collection of geo-spatial data. In this paper, we propose a metho...
The valorization and promotion of worldwide Cultural Heritage by the adoption of Information and Communication Technologies represent nowadays some of the most important research issues with a large variety of potential applications. This challenge is particularly perceived in the Italian scenario, where the artistic patrimony is one of the most di...
In common binary classification scenarios, the presence of both positive and negative examples in training data is needed to build an efficient classifier. Unfortunately, in many domains, this requirement is not satisfied and only one class of examples is available. To cope with this setting, classification algorithms have been introduced that lear...
In the last decade, the spread of broadband Internet connections even for mobile devices has contributed to an increased availability of multimedia information on the Web. At the same time, due to the decrease of storage cost and the increasing popularity of storage services in the cloud, the problem of information overload has become extremely ser...
The increasing availability of gene expression data has encouraged the development of purposely-built intelligent data analysis techniques. Grouping genes characterized by similar expression patterns is a widely accepted - and often mandatory - analysis step. Despite the fact that a number of biclustering methods have been developed to discover clu...
The increasing availability of personal data of a sequential nature, such as time-stamped transaction or location data, enables increasingly sophisticated sequential pattern mining techniques. However, privacy is at risk if it is possible to reconstruct the identity of individuals from sequential data. Therefore, it is important to develop privacy-...
In this paper, we present a research prototype for creating geographic summaries using the whereabouts of Foursquare users. Exploiting the density of the venue types in a particular region, the system adds a layer over any typical cartography geographic maps service, creating a first glance summary over the venues sampled from the Foursquare knowle...
In this work, we present a general framework for Cultural Heritage applications able to uniformly manage heterogeneous multimedia data coming from several web repositories and to provide context-aware recommendation services in order to generate dynamic multimedia visiting paths useful for the users during the exploration of different kinds of cul...
In this chapter the authors propose a new methodology that minimizes the intervention of the analyst within the coclustering process and that provides meaningful coclusters whose discovery and interpretation are enhanced by embedding gene ontology (GO) annotations. To show the effectiveness of this approach, the authors apply their methodology on a...
Italy’s Cultural Heritage is the world’s most diverse and rich patrimony and attracts millions of visitors every year to monuments, archaeological sites and museums. The valorization of cultural heritage represents nowadays one of the most important research challenges in the Italian scenario. In this paper, we present a general multimedia recommen...
People on the Web talk about television. TV users' social activities implicitly connect the concepts referred to by videos, news, comments, and posts. The strength of such connections may change as the perception of users on the Web changes over time. With the goal of leveraging users' social activities to better understand how TV programs are perc...
Searching, browsing and analyzing web content is a more challenging problem today than in the early days of the Internet, because web content is multimedia, social and dynamic. Moreover, the concepts referred to by videos, news, comments and posts are implicitly linked by the fact that people on the Web talk about something, somewhere at so...
The availability of data represented with multiple features coming from heterogeneous domains is getting more and more common in real world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these da...
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. In this article, we propose a framework to learn a context-based distance for categorical attri...
Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, with high-dimensional data the classic metrics fail to identify the real similarities between objects. The second challenge is the evolving nature of the observed phenomena which make...
In the generic setting of objects × attributes matrix data analysis, co-clustering appears as an interesting unsupervised data mining method. A co-clustering task provides a bi-partition made of co-clusters: each co-cluster is a group of objects associated to a group of attributes and these associations can support expert interpretations. Many cons...
The huge volume of gene expression data produced by microarrays and other high-throughput techniques has encouraged the development of new computational techniques to evaluate the data and to formulate new biological hypotheses. To this purpose, co-clustering techniques are widely used: these identify groups of genes that show similar activity patt...
Clustering high-dimensional data is challenging. Classic metrics fail to identify real similarities between objects. Moreover, the huge number of features makes cluster interpretation hard. To tackle these problems, several co-clustering approaches have been proposed which try to compute a partition of objects and a partition of features s...
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attribute...
Today, digital bibliographies are a powerful instrument that collects a great amount of data about scientific publications. Digital bibliographies have been used as the basis of many studies focused on knowledge extraction from databases. Here we present a new methodology for mining knowledge in this field. Our approach aims to apply the potential of s...
There is an increasing need in transcriptome research for gene expression data and pattern warehouses. It is important to integrate in these warehouses both raw transcriptomic data and some properties encoded in these data, like local patterns. We have developed an application called SQUAT (SAGE Querying and Analysis Tools) which is ava...
SQUAT relational schema. This figure displays the tables of the SQUAT database and the relations between them.
We investigate a co-clustering framework (i.e., a method that provides a partition of objects and a linked partition of features) for binary data sets. So far, constrained co-clustering has seldom been explored. First, we consider straightforward extensions of the classical instance level constraints (must-link, cannot-link) to express relationsh...
In many applications, the expert interpretation of co-clustering is easier than for mono-dimensional clustering. Co-clustering aims at computing a bi-partition that is a collection of co-clusters: each co-cluster is a group of objects associated to a group of attributes and these associations can support interpretations. Many constrained cluster...