About
68
Publications
21,395
Reads
1,397
Citations
Introduction
Additional affiliations
May 2015 - August 2018
June 2013 - May 2015
June 2013 - May 2015
Publications (68)
Feature selection methods are used in machine learning and data analysis to select a subset of features that may be successfully used in the construction of a model for the data. These methods are applied under the assumption that often many of the available features are redundant for the purpose of the analysis. In this paper, we focus on a partic...
Motivation:
Methods for extracting knowledge from Next Generation Sequencing data are currently in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorith...
Background
Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughput experiments, m...
DNA methylation is a well-studied genetic modification crucial to regulate the functioning of the genome. Its alterations play an important role in tumorigenesis and tumor-suppression. Thus, studying DNA methylation data may help biomarker discovery in cancer. Since public data on DNA methylation have become abundant – and considering the high number of...
Background:
Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by progressive dementia, for which no cure is currently known. An early detection of patients affected by AD can be obtained by analyzing their electroencephalography (EEG) signals, which show a reduction of the complexity, a perturbation of the synchrony, and a sl...
The continually decreasing cost of next-generation sequencing has recently led to a significant increase in the number of microbiome-related studies, providing invaluable information for understanding host-microbiome interactions and their relation to diseases. A common approach in metagenomics consists of determining the composition of samples in...
Electroencephalography (EEG) signal analysis is a fast, inexpensive, and accessible technique to detect the early stages of dementia, such as Mild Cognitive Impairment (MCI) and Alzheimer’s disease (AD). In recent years, EEG signal analysis has become an important topic of research to extract suitable biomarkers to determine the subject’s cogniti...
Big Data technologies have significantly increased the possibility for sellers to adopt personalisation strategies, especially in digital markets. Among such strategies, price discrimination, a practice where the same commodity is sold at different prices, either to the same customer or to different customers, stands out. Particularly, the online a...
The advent of Next Generation Sequencing (NGS) technologies and the reduction of sequencing costs have led, over the last decades, to a massive production of experimental data. These data cover a wide range of biological experiments derived from several sequencing strategies, producing a large amount of heterogeneous data. They are often linked to a s...
Recent advances in cancer genomics have put the spotlight on DNA methylation, a genetic modification that regulates the functioning of the genome and whose alterations play an important role in tumorigenesis and tumor suppression. Because of the high dimensionality and the enormous amount of genomic data that are produced through the l...
With Next Generation DNA Sequencing techniques (NGS) we are witnessing a rapid growth of genomic data. In this work, we focus on the NGS DNA methylation experiment, whose aim is to shed light on the biological process that controls the functioning of the genome and whose modifications are deeply investigated in cancer studies for biomarker discovery...
Next Generation Sequencing technologies have produced a substantial increase of publicly available genomic data and related clinical/biospecimen information. New models and methods to easily access, integrate and search them effectively are needed. An effort was made by the Genomic Data Commons (GDC), which defined strict procedures for harmonizing...
In this work, new implementations of the U-BRAIN (Uncertainty-managing Batch Relevance-Based Artificial Intelligence) supervised machine learning algorithm are described. The implementations, referred to as SP-BRAIN (SP stands for Spark), aim to efficiently process large datasets. Given the iterative nature of the algorithm together with its dependence...
Thanks to next-generation sequencing techniques, a very large amount of genomic data is available. Consequently, biomedical databases have grown considerably in recent years. Analyzing this large amount of data with bioinformatics and big data techniques could lead to the discovery of new knowledge for the treatment of serious diseases. In this work...
The continuous growth of experimental data generated by Next Generation Sequencing (NGS) machines has led to the adoption of advanced techniques to intelligently manage them. The advent of the Big Data era posed new challenges that led to the development of novel methods and tools, which were initially developed to address computational science proble...
Methods for extracting knowledge from Next Generation Sequencing (NGS) data are currently in high demand. This technology has led to an explosion in the amount of genomic data. However, the efficiency of NGS has posed a challenge for the analysis of these vast genomic data and for gene interaction and expression studies. In this work, we focus on RNA-seq gene expres...
Background
In the Next Generation Sequencing (NGS) era a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow an enhanced accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experimen...
Background
The rapid growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human-interpretable models compo...
Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer are growing quickly. Indeed, the largest public database of cancer, called The Cancer Genome Atlas (TCGA), contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of...
The development of high-throughput Next Generation Sequencing (NGS) technologies allows the massive, low-cost extraction of an extremely large amount of biological sequences in the form of reads, i.e., short fragments of an organism’s genome. The advent of NGS poses new issues for computer scientists and bioinformaticians, leading to the design of alg...
Data integration is one of the most challenging research topics in many knowledge domains, and biology is surely one of them. However, theory and state of the art methods make this task complex for most small research centers. Fortunately, several organizations are focusing on collecting heterogeneous data, making it easier to design analy...
In bioinformatics and biology researchers annotate experimental data in many different ways. When other researchers need to query these data, they are typically unaware of the specificity of the annotations; often they encounter possible mismatches between the granularity of the query and the granularity of the annotations. In this work, we propose...
Background
Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addre...
The IN-CLOUD project, funded in the framework of the Erasmus+ Programme Strategic Partnership, aims to raise awareness regarding how cloud computing can boost economic growth and innovation and to qualify professionals able to introduce Cloud technologies in SMEs and public administrations. The project aims to deliver and award VET qualificat...
Due to the great advances of Next Generation Sequencing (NGS) techniques, bioinformaticians are faced with large amounts of genomic and clinical data, which are growing exponentially. A striking example is The Cancer Genome Atlas (TCGA), whose aim is to provide a comprehensive archive of biomedical data about tumors. Indeed, TCGA contains more than...
Table of contents
A1 Highlights from the eleventh ISCB Student Council Symposium 2015
Katie Wilkins, Mehedi Hassan, Margherita Francescatto, Jakob Jespersen, R. Gonzalo Parra, Bart Cuypers, Dan DeBlasio, Alexander Junge, Anupama Jigisha, Farzana Rahman
O1 Prioritizing a drug’s targets using both gene expression and structural similarity
Griet Laene...
The analysis of gene expression profiles from microarray/RNA sequencing (RNA-Seq) experimental samples demands new efficient methods from statistics and computer science. This chapter considers two main types of gene expression data analysis, namely gene clustering and experiment classification. It introduces the transcriptome analysis, highlightin...
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences.
In this paper, we present Logic Ali...
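The k-mer frequency comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `kmer_frequencies` and `kmer_distance`, the default `k = 3`, and the choice of Euclidean distance are assumptions made for the example, not the method of the paper.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of every distinct k-mer occurring in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of a and b.

    Euclidean distance is an assumption for illustration; alignment-free
    methods use a variety of distance measures.
    """
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    kmers = fa.keys() | fb.keys()
    return sqrt(sum((fa.get(m, 0.0) - fb.get(m, 0.0)) ** 2 for m in kmers))


if __name__ == "__main__":
    # Hypothetical toy sequences: no alignment is computed at any point.
    print(kmer_distance("ACGTACGTAC", "ACGTTCGTAC", k=3))
```

Because only substring counts are compared, the cost grows linearly with sequence length, which is why such measures scale to whole genomes.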
In this paper we propose a new method to measure the contribution of discretized features for supervised learning and discuss its applications to biological data analysis. We restrict the description and the experiments to the most representative case of discretization in two intervals and of samples belonging to two classes. In order to test the v...
Alzheimer's Disease (AD) and its preliminary stage - Mild Cognitive Impairment (MCI) - are the most widespread neurodegenerative disorders, and their investigation remains an open challenge. ElectroEncephalography (EEG) appears as a non-invasive and repeatable technique to diagnose brain abnormalities. Despite technical advances, the analysis of EE...
Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis.
We propose a method to evaluate the similarity between reads. This method does not rely...
Objective: Alzheimer's Disease (AD) is the most common form of dementia, for which no cure is currently known [1]. Different studies have shown that AD has (at least) three major effects on electroencephalography (EEG) signals: enhanced complexity, slowing of signals, and perturbations in EEG synchrony [2]. The aim of this work is to achieve an a...
Alignment-free methods are routinely used in large-scale, gene-independent phylogeny reconstruction. Such methods measure the similarity of two genomes by comparing the frequency of all their distinct substrings of length k. In this paper we apply logic data mining methods to discover a minimal subset of k-mers whose frequency information is suffici...
Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been...
Background / Purpose:
We propose a filtering method for read pairs based on alignment free distance. The similarity of two reads is assessed by comparing the frequencies of their substrings of fixed dimensions (k-mers).
Main conclusion:
We present computational results that show the efficacy of an alignment free distance in estimating a good r...
The wide spread of electronic data collection in medical environments leads to an exponential growth of clinical data extracted from heterogeneous patient samples. Collecting, managing, integrating and analyzing these data are essential activities in order to shed light on diseases and on related therapies. The major issues in clinical data analysi...
BLOG (Barcoding with LOGic) is a diagnostic and character-based DNA Barcode analysis method. Its aim is to classify specimens to species based on DNA Barcode sequences and on a supervised machine learning approach, using classification rules that compactly characterize species in terms of DNA Barcode locations of key diagnostic nucleotides. The BLO...
Methods: DNA sequence assembly. The DNA sequence assembly process is based on the alignment and merging of reads (stretches of sequence) in order to reconstruct the original primary structure of the DNA sample sequences. Given a set of sequences S = {s1, s2, …, sn}, where s ∈ S is a fragment of the primary structure of DNA (a read) (e.g., s = {ATTCGA... CTGACT})...
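The idea of aligning and merging reads can be illustrated with a greedy suffix-prefix overlap merge. This is a sketch under stated assumptions, not the procedure of the work summarized above: the functions `overlap` and `greedy_assemble`, the `min_len` threshold, and the toy reads are all hypothetical, and real assemblers must also handle sequencing errors, repeats, and reverse complements.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining pair overlaps enough to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


if __name__ == "__main__":
    # Hypothetical reads that tile a short sequence.
    print(greedy_assemble(["ATTCGA", "CGATTG", "TTGCAC"]))
```

The greedy strategy is chosen here for brevity; it is quadratic in the number of reads per merge step and can mis-assemble repetitive regions.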
Microarray Logic Analyzer (MALA) is a clustering and classification software, particularly engineered for microarray gene expression analysis. The aims of MALA are to cluster the microarray gene expression profiles in order to reduce the amount of data to be analyzed and to classify the microarray experiments. To fulfil this objective MALA uses a...
Differences in genomic sequences are crucial for the classification of viruses into different species. In this work, viral DNA sequences belonging to the human polyomaviruses BKPyV, JCPyV, KIPyV, WUPyV, and MCPyV are analyzed using a logic data mining method in order to identify the nucleotides which are able to distinguish the five different human...
Separating formulas for ST gene region. All the discriminating base pairs for the virus classification within the gene region ST.
Separating formulas for LT gene region. All the discriminating base pairs for the virus classification within the gene region LT.
Appendix. Test Plan and statistical experiments.
Recently diverged species are challenging for identification, yet they are frequently of special interest scientifically as well as from a regulatory perspective. DNA barcoding has proven instrumental in species identification, especially in insects and vertebrates, but for the identification of recently diverged species it has been reported to be...
Relative method performance based on simulated data for all species. Boxplots of query identification success (N = 300) of six methods that were applied to ‘recently diverged’ species in simulated query data sets. NJ = neighbor joining, PAR = parsimony, NN = nearest neighbor. Success scores not significantly different in post-hoc pairwise Wilcoxon...
Influence of divergence time on species identification success per method compared.
(PDF)
Simulated ultrametric species tree. Tree with 50 species simulated under the Yule model and with a total tree depth of 1 million generations. Terminal branches subtending species considered as ‘recently diverged’ are in red, those subtending species considered as ‘old’ are in blue.
(TIF)
Method performance based on simulated data for all species.
(PDF)
Results for all 112 species represented by 5 or more sequences in the Cypraeidae empirical data set.
(PDF)
Influence of effective population size (Ne) on species identification success per method compared.
(PDF)
Results for all 15 species represented by 5 or more sequences in the Drosophila empirical data set.
(PDF)
Results for all 35 species represented by 5 or more sequences in the Inga empirical data set.
(PDF)
In this work we consider a method for the extraction of knowledge expressed in Disjunctive Normal Form (DNF) from data. The method is mainly designed for classification purposes, and is based on three main steps: Discretization, Feature Selection, and Formula Extraction. The three steps are formulated as optimization problems and solved with ad hoc...
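To make the notion of knowledge in Disjunctive Normal Form concrete, the sketch below discretizes numeric features into two intervals and evaluates a DNF formula on the result. Everything here is an assumption for illustration: the names `discretize` and `evaluate_dnf`, the cutpoints, the feature names `g1` and `g2`, and the formula itself are hypothetical. The sketch shows only how an extracted formula classifies a sample, not the optimization-based Discretization, Feature Selection, and Formula Extraction steps themselves.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]   # (feature name, required binary value)
Clause = List[Literal]       # conjunction (AND) of literals
DNF = List[Clause]           # disjunction (OR) of clauses


def discretize(sample: Dict[str, float],
               cutpoints: Dict[str, float]) -> Dict[str, bool]:
    """Map each numeric feature to True if it is at or above its cutpoint."""
    return {name: sample[name] >= cut for name, cut in cutpoints.items()}


def evaluate_dnf(formula: DNF, binary: Dict[str, bool]) -> bool:
    """A DNF formula holds if at least one clause has all literals satisfied."""
    return any(all(binary[name] == value for name, value in clause)
               for clause in formula)


if __name__ == "__main__":
    # Hypothetical cutpoints and formula: (g1 high AND g2 low) OR (g2 high).
    cutpoints = {"g1": 0.5, "g2": 2.0}
    formula: DNF = [[("g1", True), ("g2", False)], [("g2", True)]]
    sample = {"g1": 0.7, "g2": 1.0}
    print(evaluate_dnf(formula, discretize(sample, cutpoints)))
```

Each clause reads as a human-interpretable rule, which is the main appeal of DNF models for classification.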
Background / Purpose:
Data Mining Big (DMB) is a collection of logic data mining software tools engineered for the analysis and classification of biological data. Three different analysis tools are presented, specially designed for DNA Barcode classification (BLOG), microarray gene expression profiles characterization (MALA) and multipurpose soft...
The identification of early and stage-specific biomarkers for Alzheimer’s disease (AD) is critical, as the development of disease-modification therapies may depend on the discovery and validation of such markers. The identification of early reliable biomarkers depends on the development of new diagnostic algorithms to computationally exploit the in...
According to many field experts, specimen classification based on morphological keys needs to be supported with automated techniques based on the analysis of DNA fragments. The most successful results in this area are those obtained from a particular fragment of mitochondrial DNA, the gene cytochrome c oxidase I (COI) (the "barcode"). Since 2004 t...