About
Publications: 127
Reads: 28,349
Citations: 2,424
Additional affiliations
September 1995 - July 1996
January 1997 - present
Publications (127)
The present study aims to clarify the role of the fraction of patients under antiretroviral therapy (ART) achieving viral suppression (VS) (i.e. having plasma viral load below the detectability threshold) on the human immunodeficiency virus (HIV) epidemic in Italy. Based on the hypothesis that VS makes the virus untransmittable, we extend a previou...
Background
In the Next Generation Sequencing (NGS) era a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow an enhanced accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experimen...
Background
The high growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human interpretable models compo...
Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer are growing quickly. Indeed, the largest public database of cancer, The Cancer Genome Atlas (TCGA), contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of...
In this paper we consider a particular graph-optimization problem. Given an edge-colored graph and a set of constraints on the sequence of the colors, one is to find the longest path whose colored edges obey the constraints on the sequence of the colors. In the actual formulation, the problem generalizes already known NP-Complete problems, and, evi...
Background:
Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by progressive dementia, for which no cure is currently known. Early detection of patients affected by AD can be obtained by analyzing their electroencephalography (EEG) signals, which show a reduction of the complexity, a perturbation of the synchrony, and a sl...
Common operation scheduling (COS) problems arise in real-world applications, such as industrial processes of material cutting or component dismantling. In COS, distinct jobs may share operations, and when an operation is done, it is done for all the jobs that share it. We here propose a 0-1 LP formulation with exponentially many inequalities to min...
There are several examples of dual propulsion vehicles: hybrid cars, bi-fuel vehicles, electric bikes. Computing a path from a starting point to a destination for these types of vehicles requires the evaluation of many alternatives. In this paper we develop a mathematical model, able to compute paths for dual propulsion vehicles, that takes in accou...
The analysis of high throughput gene expression patients/controls experiments is based on the determination of differentially expressed genes according to standard statistical tests. A typical bioinformatics approach to this problem is composed of two separate steps: first, a subset of genes with altered expression level is identified; then the pat...
A substantial connection exists between supervised learning from data represented in logic form and the solution of the Minimum Cost Satisfiability Problem (MinCostSAT). Methods based on this connection have been developed and successfully applied in many contexts. The deployment of such methods to large-scale learning problems is often hindered by...
Increasing evidence points to a key role played by epithelial-mesenchymal transition (EMT) in cancer progression and drug resistance. In this study, we used wet and in silico approaches to investigate whether EMT phenotypes are associated with resistance to targeted therapy in a non-small cell lung cancer model system harboring activating mutations of...
Data integration is one of the most challenging research topics in many knowledge domains, and biology is surely one of them. However, theory and state-of-the-art methods make this task complex for most small research centers. Fortunately, several organizations are focusing on collecting heterogeneous data, making it an easier task to design analy...
In the present paper we propose a simple time-varying ODE model to describe the evolution of the HIV epidemic in Italy. The model considers a single population of susceptibles, without distinction of high-risk groups within the general population, and accounts for the presence of immigration and emigration, modelling their effects on both the general d...
Data mining is one of the main activities in bioinformatics, specifically to extract knowledge from massive data sets related to gene expression measurement, CNV, DNA strings, and others. A wide array of methods is used to perform such tasks, ranging from the more established parametric statistical analysis to non-parametric techniques, to classi...
Background
Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addre...
Due to the great advances of Next Generation Sequencing (NGS) techniques, bioinformaticians are faced with large amounts of genomic and clinical data, which are growing exponentially. A striking example is The Cancer Genome Atlas (TCGA), whose aim is to provide a comprehensive archive of biomedical data about tumors. Indeed, TCGA contains more than...
The Closest String Problem (CSP) calls for finding an n-string that minimizes its maximum distance from m given n-strings. Integer linear programming (ILP) proved to be able to solve large CSPs under the Hamming distance, whereas for the Levenshtein distance, preferred in computational biology, no ILP formulation has so far been investigated. Recent...
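As a rough illustration of the CSP objective under the Hamming distance, the sketch below enumerates every candidate n-string over a given alphabet and keeps the one whose maximum distance from the input strings is smallest. This is a brute-force sketch for tiny instances only, not the ILP approach mentioned in the abstract; the function names `hamming` and `closest_string_bruteforce` are hypothetical and chosen here for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_string_bruteforce(strings, alphabet):
    """Return (center, radius) minimizing the maximum Hamming distance.

    Enumerates all |alphabet|**n candidates, so it is only usable for
    very small n; it illustrates the objective, not a practical solver.
    """
    n = len(strings[0])
    best, best_radius = None, n + 1
    for candidate in product(alphabet, repeat=n):
        center = "".join(candidate)
        # The CSP objective: the largest distance to any input string.
        radius = max(hamming(center, s) for s in strings)
        if radius < best_radius:
            best, best_radius = center, radius
    return best, best_radius
```

For example, `closest_string_bruteforce(["AAC", "ACA", "CAA"], "AC")` examines the eight candidate strings of length 3 over the alphabet {A, C} and returns the first one achieving the smallest maximum distance.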
In this paper we present a new bound obtained with the probabilistic method
for the solution of the Set Covering problem with unit costs. The bound is
valid for problems of fixed dimension, thus extending previous similar
asymptotic results, and it depends only on the number of rows of the
coefficient matrix and the row densities. We also consider...
Table of contents
A1 Highlights from the eleventh ISCB Student Council Symposium 2015
Katie Wilkins, Mehedi Hassan, Margherita Francescatto, Jakob Jespersen, R. Gonzalo Parra, Bart Cuypers, Dan DeBlasio, Alexander Junge, Anupama Jigisha, Farzana Rahman
O1 Prioritizing a drug’s targets using both gene expression and structural similarity
Griet Laene...
We propose a new Robust Optimization method for the energy offering problem
of a price-taker generating company that wants to build offering curves for its
generation units, in order to maximize its profit while taking into account the
uncertainty of market price. Our investigations have been motivated by a
critique of another Robust Optimization m...
The analysis of gene expression profiles from microarray/RNA sequencing (RNA-Seq) experimental samples demands new efficient methods from statistics and computer science. This chapter considers two main types of gene expression data analysis, namely gene clustering and experiment classification. It introduces the transcriptome analysis, highlightin...
Global sourcing in complex assembly production systems entails the management of potentially high variability and multiple risks in costs, quality and lead times. Additionally, current strategies of many companies or environmental regulatory frameworks impose - or will impose - on industries worldwide to take control, among others, of CO2 emissions...
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences.
In this paper, we present Logic Ali...
Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe in nature today. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality,...
Motivation:
Nowadays, knowledge extraction methods for Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorith...
In this paper we propose a new method to measure the contribution of discretized features
for supervised learning and discuss its applications to biological data analysis. We
restrict the description and the experiments to the most representative case of
discretization in two intervals and of samples belonging to two classes. In order to test
the v...
The EURO Working Group on Operations Research in Computational Biology, Bioinformatics
and Medicine held its fourth conference in Poznan-Biedrusko, Poland, June 26–28, 2014. The
editorial board of RAIRO-OR invited submissions of papers to a special issue on Recent
Advances in Operations Research in Computational Biology, Bioinformatics and Medicine...
Feature selection methods are used in machine learning and data analysis to select a subset of features that may be successfully used in the construction of a model for the data. These methods are applied under the assumption that often many of the available features are redundant for the purpose of the analysis. In this paper, we focus on a partic...
Many approaches exist to integrate protein-protein interaction data with other sources of information, most notably with gene co-expression data, to obtain information on network dynamics. It is of interest to look at groups of interacting gene products that form a protein complex. We were interested in applying new tools to the characterization of...
Alzheimer's Disease (AD) and its preliminary stage - Mild Cognitive Impairment (MCI) - are the most widespread neurodegenerative disorders, and their investigation remains an open challenge. ElectroEncephalography (EEG) appears as a non-invasive and repeatable technique to diagnose brain abnormalities. Despite technical advances, the analysis of EE...
Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis.
We propose a method to evaluate the similarity between reads. This method does not rely...
We study an operation scheduling problem where a finite set of jobs with due dates must be completed by one machine: each job is completed as soon as a specific subset of unit operations is done. Distinct jobs may share operations, and when an operation is done, it is done for all the jobs that share it. The goal is to schedule operations so that t...
Understanding the function of a network requires understanding its topology, since the topology is designed to support the function and influences the efficiency with which the function is carried out. For this reason, topological analysis of complex networks has been an intensely researched area in the last decade.
Result...
Objective: Alzheimer's Disease (AD) is the most common form of dementia, for which
no cure is currently known [1]. Different studies have shown that AD has (at least) three major
effects on electroencephalography (EEG) signals: enhanced complexity, slowing of signals,
and perturbations in EEG synchrony [2]. The aim of this work is to achieve an a...
Much of the valuable information in supporting decision making processes originates in text-based documents. Although these documents can be effectively searched and ranked by modern search engines, actionable knowledge needs to be extracted and transformed into a structured form before being used in a decision process. In this paper we describe how t...
Alignment-free methods are routinely used in large-scale, gene-independent phylogeny reconstruction. Such methods measure the similarity of two genomes by comparing the frequency of all their distinct substrings of length k. In this paper we apply logic data mining methods to discover a minimal subset of k-mers whose frequency information is suffici...
In this paper we address the issue of solving a Unit Commitment (UC) problem including the transmission network with Active Switching (AS). The switching operation consists in a dynamic reconfiguration of the network, i.e. a tripping of some lines; this paradigm is named UC with Optimal Transmission Switching (UCOTS). The UCOTS is a novel way to le...
Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcodes and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been...
In this paper we introduce the Mathematical Desk for Italian Industry, a project based on applied and industrial mathematics developed by a team of researchers from the Italian National Research Council in collaboration with two major Italian associations for applied mathematics, SIMAI and AIRO. The scope of this paper is to clarify the motivations...
The large integration of wind energy into electrical systems poses important challenges to the power operators in the scheduling of the production and in the management of the network. This leads to the necessity to modify the current industry procedures, such as the Unit Commitment (UC) and the Economic Dispatch (ED), to take into account large am...
Abstract Experimental co-expression data and protein-protein interaction networks are frequently used to analyze the interactions among genes or proteins. Recent studies have investigated methods to integrate these two sources of information. We propose a new method to integrate co-expression data obtained through DNA microarray analysis (MA) and p...
The Orderly Colored Longest Path (OCLP) problem consists of finding the longest path in a graph whose edges are colored with a given number of colors, under the constraint that the path follows a predefined order of colors. The problem has not been widely studied in the previous literature, especially for more than two colors in th...
Background / Purpose:
We propose a filtering method for read pairs based on an alignment-free distance. The similarity of two reads is assessed by comparing the frequencies of their substrings of fixed length (k-mers).
Main conclusion:
We present computational results that show the efficacy of an alignment free distance in estimating a good r...
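The idea of comparing reads through their k-mer frequencies can be sketched as follows. The abstract does not state which distance is used, so the Euclidean distance between relative k-mer frequency vectors below is an assumption made for illustration, and the function names `kmer_frequencies` and `kmer_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read: str, k: int) -> dict:
    """Relative frequency of every k-mer (overlapping substring of length k)."""
    total = len(read) - k + 1
    if k <= 0 or total <= 0:
        raise ValueError("k must be positive and no longer than the read")
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def kmer_distance(read_a: str, read_b: str, k: int) -> float:
    """Euclidean distance between the k-mer frequency vectors of two reads.

    Assumption: the source only says frequencies are compared; the
    Euclidean distance is one simple choice among many.
    """
    freq_a = kmer_frequencies(read_a, k)
    freq_b = kmer_frequencies(read_b, k)
    # k-mers absent from a read contribute a frequency of zero.
    all_kmers = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2
                    for m in all_kmers))
```

Under this sketch, identical reads are at distance 0, and a filter could discard read pairs whose distance exceeds a chosen threshold; both the threshold and the value of k would need to be tuned on real data.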
The wide spread of electronic data collection in medical environments leads to an exponential growth of clinical data extracted from heterogeneous patient samples. Collecting, managing, integrating and analyzing these data are essential activities in order to shed light on diseases and on related therapies. The major issues in clinical data analysi...
Machine Learning (ML) algorithms are used to train computers to perform a
variety of complex tasks and improve with experience. Computers learn how to
recognize patterns, make unintended decisions, or react to a dynamic
environment. Certain trained machines may be more effective than others because
they are based on more suitable ML algorithms or b...
The Sensor Networks Localization Problem (SNLP) consists in seeking an embedding in R^2 of a
given weighted undirected graph where the vertices represent the sensors in a global coordinate
system and the weight of each edge is the Euclidean distance between two sensors. The SNLP
belongs to the class of Distance Geometry Problems (DGP). In thi...
In this paper we describe an effective approach to design, implement, and operate a traffic control system based on logic programming. With this approach it is possible to implement very flexible control strategies that can be easily developed by traffic engineers using a simple description language. An important feature of the system is the use of...
BLOG (Barcoding with LOGic) is a diagnostic and character-based DNA Barcode analysis method. Its aim is to classify specimens to species based on DNA Barcode sequences and on a supervised machine learning approach, using classification rules that compactly characterize species in terms of DNA Barcode locations of key diagnostic nucleotides. The BLO...
Methods. DNA sequence assembly: The DNA sequence assembly process is based on the alignment and merging of reads (stretches of sequence) in order to reconstruct the original primary structure of the DNA sample sequences. Given a set of sequences S = {s1, s2, …, sn}, where s ∈ S is a fragment of the primary structure of DNA (a read) (e.g., s = {ATTCGA... CTGACT})...
Microarray Logic Analyzer (MALA) is a clustering and classification software, particularly engineered for microarray gene expression analysis. The aims of MALA are to cluster the microarray gene expression profiles in order to reduce the amount of data to be analyzed and to classify the microarray experiments. To fulfil this objective MALA uses a...
Differences in genomic sequences are crucial for the classification of viruses into different species. In this work, viral DNA sequences belonging to the human polyomaviruses BKPyV, JCPyV, KIPyV, WUPyV, and MCPyV are analyzed using a logic data mining method in order to identify the nucleotides which are able to distinguish the five different human...
Separating formulas for ST gene region. All the discriminating base pairs for the virus classification within the gene region ST.
Separating formulas for LT gene region. All the discriminating base pairs for the virus classification within the gene region LT.
Appendix. Test Plan and statistical experiments.
Recently diverged species are challenging for identification, yet they are frequently of special interest scientifically as well as from a regulatory perspective. DNA barcoding has proven instrumental in species identification, especially in insects and vertebrates, but for the identification of recently diverged species it has been reported to be...
Relative method performance based on simulated data for all species. Boxplots of query identification success (N = 300) of six methods that were applied to ‘recently diverged’ species in simulated query data sets. NJ = neighbor joining, PAR = parsimony, NN = nearest neighbor. Success scores not significantly different in post-hoc pairwise Wilcoxon...
Influence of divergence time on species identification success per method compared.
(PDF)
Simulated ultrametric species tree. Tree with 50 species simulated under the Yule model and with a total tree depth of 1 million generations. Terminal branches subtending species considered as ‘recently diverged’ are in red, those subtending species considered as ‘old’ are in blue.
(TIF)
Method performance based on simulated data for all species.
(PDF)
Results for all 112 species represented by 5 or more sequences in the Cypraeidae empirical data set.
(PDF)
Influence of effective population size (Ne) on species identification success per method compared.
(PDF)