Conference Paper

Abstract

Recently, the rise of the Big Data paradigm has had a great impact on several fields, and Bioinformatics is one of them. In fact, Bioinformatics has had to evolve in order to adapt to this phenomenon: the exponential increase of available biological information has forced researchers to find new solutions to handle these new challenges. In this paper we present our point of view on the problems intrinsic to Big Data (volume, velocity, variety and veracity), how they affect the Bioinformatics field, and some solutions that can help Bioinformatics practitioners deal with the difficulties presented by Big Data.
... One of the areas of research in which great progress has been made in recent years to address the aforementioned big data challenges is biclustering [1][2][3][4]. This analytical technique of data mining, which is also known as subspace clustering, coclustering, block clustering, or 2-mode clustering, has already become an essential tool for gene expression analysis because it is capable of capturing similar gene expression profiles under different subsets of experimental conditions [5]. It is not without reason that biclustering has found hundreds of applications in bioinformatics, and, as a result, there has been a call for increased use of this approach [6]. ...
... Apache Spark [https://spark.apache.org], and massively parallel systems with multiple GPUs [5]. ...
... Considering the algorithm design, it is crucial to understand where and how parallelization in the method may be exploited to provide the speedup that is so desirable for analyzing large datasets [5]. Alternatively, the development of a new method could start from an understanding of hardware limitations. ...
Article
Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.
... In particular, Biclustering techniques find groups of genes that share a common behaviour under a certain group of experimental conditions (biclusters) [10]. However, nowadays, they present two main issues: the computational performance when processing large input gene expression datasets and the need to validate the huge number of generated biclusters [11]. ...
Article
Nowadays, Biclustering is one of the most widely used machine learning techniques to discover local patterns in datasets from different areas such as energy consumption, marketing, social networks or bioinformatics, among others. In bioinformatics in particular, Biclustering techniques have become extremely time-consuming, and the number of results generated is huge, due to the continuous increase in the size of databases over the last few years. For this reason, validation techniques must be adapted to this new environment in order to help researchers focus their efforts on a specific subset of results in an efficient, fast and reliable way. This situation may well be considered a Big Data context. In this sense, multiple machine learning techniques have been implemented using Graphics Processing Unit (GPU) technology and the CUDA architecture to accelerate the processing of large databases. However, as far as we know, this technology has not yet been applied to any bicluster validation technique. In this work, a multi-GPU version of one of the most used bicluster validation measures, Mean Squared Residue (MSR), is presented. It takes advantage of all the hardware and memory resources offered by GPU devices. Thanks to this, gMSR is able to validate a massive number of biclusters in any Biclustering-based study within a Big Data context.
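For readers unfamiliar with the measure, the Mean Squared Residue from Cheng and Church's definition can be sketched in a few lines of pure Python; the matrix and index sets below are invented for illustration, and the gMSR implementation itself is a far more elaborate multi-GPU version.

```python
# Minimal sketch of the Mean Squared Residue (MSR) of a bicluster:
# the mean of (a_ij - rowmean_i - colmean_j + overall_mean)^2 over the
# selected rows and columns. A perfectly additive bicluster scores 0.

def msr(matrix, rows, cols):
    """MSR of the bicluster defined by row indices `rows` x column indices `cols`."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in sub]
    col_mean = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(map(sum, sub)) / (n * m)
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)

data = [
    [1.0, 2.0, 9.0],
    [3.0, 4.0, 1.0],
    [5.0, 6.0, 7.0],
]
# Rows 0-2 restricted to columns 0-1 form a perfectly additive bicluster.
print(msr(data, [0, 1, 2], [0, 1]))      # 0.0 for the ideal bicluster
print(msr(data, [0, 1, 2], [0, 1, 2]))   # > 0 once column 2 breaks the pattern
```

Validating millions of candidate biclusters means evaluating this double sum millions of times, which is what motivates offloading it to GPUs.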
Article
Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets.
Article
New technologies are revolutionising biological research and its applications by making it easier and cheaper to generate ever-greater volumes and types of data. In response, the services and infrastructure of the European Bioinformatics Institute (EMBL-EBI, www.ebi.ac.uk) are continually expanding: total disk capacity increases significantly every year to keep pace with demand (75 petabytes as of December 2015), and interoperability between resources remains a strategic priority. Since 2014 we have launched two new resources: the European Variation Archive for genetic variation data and EMPIAR for two-dimensional electron microscopy data, as well as a Resource Description Framework platform. We also launched the Embassy Cloud service, which allows users to run large analyses in a virtual environment next to EMBL-EBI's vast public data resources.
Article
This article presents the benefits and limitations of designing a parallel biclustering algorithm on a GPU. A definition of biclustering is provided together with a brief description of the GPU architecture. We then review algorithm strategy patterns, which are helpful in providing efficient implementations on the GPU. Finally, we highlight programming aspects of implementing biclustering algorithms in the CUDA/OpenCL programming languages. © 2015, Wydawnictwo SIGMA-NOT Sp. z o.o. All rights reserved.
Article
In recent years, advances in technology have led to increasingly high-dimensional datasets. This increase in dimensionality, along with the presence of irrelevant and redundant features, makes the feature selection process challenging with respect to efficiency and effectiveness. In this context, approximate algorithms are typically applied since they provide good solutions in a reasonable time. On the other hand, feature grouping has arisen as a powerful approach to reduce dimensionality in high-dimensional data. Recently, some authors have focused their attention on developing methods that combine feature grouping and feature selection to improve the model. In this paper, we propose a feature selection strategy that utilizes feature grouping to increase the effectiveness of the search. As the feature selection strategy, we propose a Variable Neighborhood Search (VNS) metaheuristic. We then propose to group the input space into subsets of features using the concept of Markov blankets. To the best of our knowledge, this is the first time the Markov blanket has been used for grouping features. We test the performance of VNS by conducting experiments on several high-dimensional datasets from two different domains: microarray and text mining. We compare VNS with popular and competitive techniques. Results show that VNS is a competitive strategy capable of finding a small set of features with predictive power similar to that obtained with the other algorithms used in this study.
Article
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts to guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: Biclustering Algorithms based on Evaluation Measures and Non-Metric-based Biclustering Algorithms. In both cases, they have been classified according to the type of meta-heuristics on which they are based. Copyright © 2015. Published by Elsevier Inc.
Conference Paper
DNA sequencing is one of the most important areas of research today. It is used in a variety of fields such as forensic science, agriculture and medicine. Disease diagnosis from DNA sequencing is one such application: a harmless method of finding the chances of disease occurrence. Both chemical and sequential changes in DNA can lead to disease. DNA forms a large database, so an efficient algorithm is needed to carry out disease diagnosis. This study proposes sequential as well as GPGPU-based implementations of the multi-string pattern-matching Aho-Corasick algorithm to find the chances of occurrence of certain nucleotide-repeat diseases and some cancer types in different DNA sequences. The results demonstrate that the algorithm performs better with the GPGPU-based parallel approach and gives better speedup as the number of patterns increases. The proposed approach is thus well suited to bioinformatics applications.
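The Aho-Corasick automaton at the core of this approach can be sketched in a few dozen lines. The following pure-Python version (the toy DNA sequence and motifs are invented here) illustrates the trie-plus-failure-links construction that such work maps onto a GPGPU; it is not the paper's implementation.

```python
# Sketch of Aho-Corasick multi-string matching: build a trie over all
# patterns, add BFS failure links (longest proper suffix that is also a
# prefix of some pattern), then scan the text once, reporting every match.
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]  # state 0 is the trie root
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())       # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]    # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

# Toy example: scan one sequence for two repeat motifs in a single pass.
print(search("GATCAGCAGCAGT", ["CAG", "GAT"]))
```

The single-pass scan over the text is what makes the algorithm attractive for GPGPU parallelization: the automaton is built once, and different chunks of the sequence (or different sequences) can be scanned by independent threads.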
Conference Paper
The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to tackle this issue. Such a challenging task may become even more difficult when the number of majority class examples is very big. In this scenario, the use of the evolutionary model becomes impractical due to memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. Aiming at addressing this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, allowing us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model with several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
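The key point of that scheme, partitioning only the majority class while replicating the full minority class into every subset, can be sketched without Spark. Plain Python lists stand in for partitions here, and all names and data are invented for illustration.

```python
# Sketch: split the majority class into chunks, and pair each chunk with
# ALL minority examples, so no subset lacks minority-class density.
from itertools import islice

def balanced_chunks(majority, minority, n_chunks):
    """Yield one subset per chunk: a slice of the majority class plus
    the complete minority class."""
    size = -(-len(majority) // n_chunks)  # ceiling division
    it = iter(majority)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk + list(minority)

# Invented toy data: 10 majority examples, only 2 minority examples.
majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(2)]
subsets = list(balanced_chunks(majority, minority, n_chunks=2))
print(len(subsets), [len(s) for s in subsets])  # 2 [7, 7]
```

A naive split of the whole dataset into two subsets could leave one of them with a single minority example (or none); here every subset sees both minority examples, at the cost of replicating them.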
Article
The Reactome Knowledgebase (www.reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations - an extended version of a classic metabolic map, in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression pattern surveys or somatic mutation catalogues from tumour cells. Over the last two years we redeveloped major components of the Reactome web interface to improve usability, responsiveness and data visualization. A new pathway diagram viewer provides a faster, clearer interface and smooth zooming from the entire reaction network to the details of individual reactions. Tool performance for analysis of user datasets has been substantially improved, now generating detailed results for genome-wide expression datasets within seconds. The analysis module can now be accessed through a RESTful interface, facilitating its inclusion in third-party applications. A new overview module allows the visualization of analysis results on a genome-wide Reactome pathway hierarchy using a single screen page. The search interface now provides auto-completion as well as a faceted search to narrow result lists efficiently.
Article
Gene association networks have become one of the most important approaches to modelling biological processes by means of gene expression data. According to the literature, co-expression-based methods are the main approaches to identification of gene association networks because such methods can identify gene expression patterns in a dataset and can determine relations among genes. These methods usually have two fundamental drawbacks. Firstly, they depend on the quality of the input dataset for construction of reliable models because of their sensitivity to data noise. Secondly, these methods require the user to select a threshold that determines whether a relation is biologically relevant. Due to these shortcomings, such methods may ignore some relevant information. We present a novel fuzzy approach named FyNE (Fuzzy NEtworks) for modelling gene association networks. FyNE has two fundamental features. Firstly, it can deal with data noise using a fuzzy-set-based protocol. Secondly, the proposed approach can incorporate prior biological knowledge into the modelling phase through a fuzzy aggregation function. These features help to gain some insights into doubtful gene relations. The performance of FyNE was tested in four different experiments. Firstly, the improvement offered by FyNE over the results of a co-expression-based method in terms of identification of gene networks was demonstrated on different datasets from different organisms. Secondly, the results produced by FyNE showed its low sensitivity to noisy data in a randomness experiment. Additionally, FyNE could infer gene networks with a biological structure in a topological analysis. Finally, the validity of our proposed method was confirmed by comparing its performance with that of some representative methods for identification of gene networks.
Article
The Cheng-Church (CC) biclustering algorithm is a popular algorithm for gene expression data mining at present. However, it can find only one bicluster at a time, and biclusters that overlap each other can hardly be found with this algorithm. This article puts forward a modified algorithm for gene expression data mining that uses intermediate biclustering results in a randomization process, digging up more eligible biclusters. It also proposes a parallel computing method that uses multi-core processors or a cluster environment to improve efficiency. Experimental verification proves that the modified algorithm enhances the precision and efficiency of gene expression data mining to a certain degree.
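As context for the modification, the single-node deletion step at the heart of the original CC algorithm can be sketched as follows: the row or column with the highest mean residue is removed repeatedly until the bicluster's Mean Squared Residue drops below a threshold delta. This is a pure-Python illustration on invented data, not the paper's parallel implementation.

```python
# Sketch of Cheng-Church single-node deletion. residue() returns the
# squared residue of each cell; msr() averages it over the bicluster.

def residue(sub):
    n, m = len(sub), len(sub[0])
    row = [sum(r) / m for r in sub]
    col = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(map(sum, sub)) / (n * m)
    return [[(sub[i][j] - row[i] - col[j] + overall) ** 2 for j in range(m)]
            for i in range(n)]

def msr(sub):
    return sum(map(sum, residue(sub))) / (len(sub) * len(sub[0]))

def single_node_deletion(matrix, rows, cols, delta):
    """Shrink (rows, cols) by dropping the worst row/column until MSR <= delta."""
    while len(rows) > 1 and len(cols) > 1:
        sub = [[matrix[i][j] for j in cols] for i in rows]
        if msr(sub) <= delta:
            break
        r = residue(sub)
        row_scores = [sum(r[i]) / len(cols) for i in range(len(rows))]
        col_scores = [sum(r[i][j] for i in range(len(rows))) / len(rows)
                      for j in range(len(cols))]
        if max(row_scores) >= max(col_scores):
            rows.pop(row_scores.index(max(row_scores)))
        else:
            cols.pop(col_scores.index(max(col_scores)))
    return rows, cols

data = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [9.0, 0.0, 5.0],   # outlier row that breaks the additive pattern
]
rows, cols = single_node_deletion(data, [0, 1, 2], [0, 1, 2], delta=0.01)
print(rows, cols)  # [0, 1] [0, 1, 2]: the outlier row was removed
```

Because each deletion requires recomputing means and residues over the whole submatrix, and the modified algorithm repeats randomized runs to find overlapping biclusters, the per-run work multiplies, which is what makes the multi-core/cluster parallelization worthwhile.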