Figure 2 - available via license: Creative Commons Attribution 2.0 Generic


# Software transactional memory

Software transactional memory circumvents the need for explicit locking of resources. All changes to shared data are encapsulated in transactions: each thread works on its own copy of the data and may modify it freely. When the changes are committed to shared memory, the consistency of the internal state is checked. If no interim changes have occurred, the commit is performed. If another thread working on another copy of the same data has committed its changes in the meantime, the transaction is rejected and restarted with a fresh copy of the data.
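The commit/retry cycle described above can be sketched in a few lines. This is a minimal illustration of optimistic concurrency for a single shared variable, not a full STM implementation, and all names (`TVar`, `atomically`) are ours, not taken from any library:

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()  # guards only the short commit step

    def read(self):
        # take a consistent snapshot of (value, version)
        with self._lock:
            return self.value, self.version

    def try_commit(self, new_value, read_version):
        # commit succeeds only if nobody else committed in between
        with self._lock:
            if self.version != read_version:
                return False  # conflict: another transaction won
            self.value = new_value
            self.version += 1
            return True

def atomically(tvar, update):
    """Retry loop: re-read and re-apply `update` until the commit succeeds."""
    while True:
        snapshot, version = tvar.read()
        if tvar.try_commit(update(snapshot), version):
            return

counter = TVar(0)

def worker():
    for _ in range(1000):
        atomically(counter, lambda v: v + 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)  # 4000
```

Each thread re-reads the value and retries its update whenever another thread committed first, so no increment is lost even though the read-modify-write cycle as a whole is never locked. Real STM systems replace the version lock with atomic hardware instructions and support multi-variable transactions.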

## Source publication

In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most ap...

## Similar publications

In view of the challenges of group Lasso penalty methods for multi-cancer microarray data analysis, e.g., the need to divide genes into groups in advance and limited biological interpretability, we propose a robust adaptive multinomial regression with sparse group Lasso penalty (RAMRSGL) model. By adopting an overlapping clustering strategy, affinity propagation...

## Citations

... Therefore, the number of clusters must be specified before execution. Clustering algorithms are used in areas such as supply chain management [22,23], natural language processing [24,25], the Internet of Things (IoT) [26], medicine [27,28], and the technical sciences [29]. ...

Clustering is an ideal tool for working with big data and for discovering structure in a data set. Clustering aims at maximizing the similarity between data within a cluster while minimizing the similarity between data in different clusters. This study presents a new and improved Particle Swarm Optimization (PSO) algorithm, Multistart Pattern Reduction-Enhanced PSO (MPREPSO), which uses pattern reduction to cut the clustering computation time. The method adds two operators to the PSO algorithm: a pattern reduction operator and a multistart operator. The goal of the pattern reduction operator is to reduce computation time by compressing static patterns. The purpose of the multistart operator is to avoid getting trapped in local optima by enforcing diversity in the population. Both operators are combined with the PSO algorithm to evaluate the performance of this method.
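As an illustration of the underlying idea, PSO-based clustering encodes a full set of candidate centroids in each particle and minimizes the within-cluster scatter. The sketch below is a plain PSO clusterer without the paper's pattern-reduction and multistart operators; all parameter values and names are illustrative, not taken from the publication:

```python
import numpy as np

rng = np.random.default_rng(0)

def sse(centroids, data):
    """Sum of squared distances from each point to its nearest centroid."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def pso_cluster(data, k, n_particles=10, iters=50, w=0.7, c1=1.5, c2=1.5):
    n, dim = data.shape
    # each particle encodes k centroids, initialised on random data points
    pos = data[rng.integers(0, n, (n_particles, k))].copy()
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([sse(p, data) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # standard PSO velocity update: inertia + cognitive + social terms
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([sse(p, data) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

data = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
centroids = pso_cluster(data, k=2)
```

The fitness evaluation (`sse`) dominates the cost, which is exactly where the paper's pattern-reduction operator intervenes by removing points whose assignment no longer changes.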

... Several efforts have been directed toward parallel implementations of the k-means algorithm in various high-performance computing environments in recent years. Significant algorithms are described, for example, in [16] for distributed-memory architectures, in [17][18][19] for multi-core CPUs, and in [20] for GPU-based systems. Almost all these studies emphasize the role of a large amount of data as a critical feature enabling an implementation based on the data-parallelism programming model. ...
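The data-parallel formulation these studies rely on is straightforward: each worker assigns its chunk of points to the nearest centroid and returns partial sums and counts, which are merged into new centroids. A minimal sketch (threads stand in for cores here purely for portability; all names are ours):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_stats(chunk, centroids):
    """One worker's step: assign the chunk's points, return per-cluster sums/counts."""
    labels = ((chunk[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        mask = labels == j
        sums[j] = chunk[mask].sum(0)
        counts[j] = mask.sum()
    return sums, counts

def parallel_kmeans(data, k, iters=20, workers=4):
    rng = np.random.default_rng(0)
    centroids = data[rng.choice(len(data), k, replace=False)].astype(float)
    chunks = np.array_split(data, workers)
    with ThreadPoolExecutor(workers) as pool:
        for _ in range(iters):
            # map: every worker processes its chunk against the current centroids
            parts = list(pool.map(lambda ch: partial_stats(ch, centroids), chunks))
            # reduce: merge the partial statistics into new centroids
            sums = sum(s for s, _ in parts)
            counts = sum(c for _, c in parts)
            centroids = sums / np.maximum(counts, 1)[:, None]
    return centroids

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(8, 0.3, (100, 2))])
centroids = parallel_kmeans(data, k=2)
```

Because the partial sums are merged exactly, each iteration produces the same centroids a serial Lloyd's step would, regardless of how the data is chunked.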

The synergy between Artificial Intelligence and the Edge Computing paradigm promises to transfer decision-making processes to the periphery of sensor networks without the involvement of central data servers. For this reason, we have recently witnessed a rapid development of devices that integrate sensors and computing resources on a single board to process data directly at the collection site. Due to the particular context where they are used, the main feature of these boards is reduced energy consumption, even though they do not offer absolute computing power comparable to modern high-end CPUs. Among the most popular Artificial Intelligence techniques, clustering algorithms are practical tools for discovering correlations or affinities within data collected in large datasets, but a parallel implementation is an essential requirement because of their high computational cost. Therefore, in the present work, we investigate how to implement clustering algorithms on parallel and low-energy devices for edge computing environments. In particular, we present experiments on two devices with different features, the quad-core UDOO X86 Advanced+ board and the GPU-based NVIDIA Jetson Nano board, evaluating them from the performance and energy consumption points of view. The experiments show that they realize a more favorable trade-off between these two requirements than other high-end computing devices.

... Earlier, the work published in Kraus and Kestler (2010) suggested parallel implementations of the k-means and k-modes algorithms on a multi-core platform with transactional memory in order to process large amounts of data. The k-means and k-modes algorithms were implemented as follows: the first step consists of distributing the initial datasets to the cores. ...

Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In the past years, considerable progress has been made in this field, leading to the development of innovative and promising clustering algorithms. These traditional clustering algorithms present serious issues in connection with speed-up, throughput, and scalability, so they can no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, research today is turning to parallel computing, giving rise to the so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processor, graphics processing unit (GPU), and field-programmable gate array (FPGA) platforms. In addition, it compares the performance of the reviewed algorithms based on common criteria of clustering validation in the Big Data context, providing the reader with an overall view of current parallel clustering techniques.

... Clustering algorithms, which reveal the internal relationships within the data, have found wide application to such data (Aggarwal & Reddy, 2014; Shirkhorshidi et al., 2014). Clustering algorithms are used in areas such as supply chain management (Kuo et al., 2018; Yin et al., 2013; Maghsoodi et al., 2018; Kleyner et al., 2015), natural language processing (Allen, Sui, & Parker, 2017), the Internet of Things (IoT) (Tong & Fong, 2018), and medicine (Othman et al., 2004; Kraus & Kestler, 2010). ...

... Many k-means modifications have been proposed to address these shortcomings (Ghesmoune, Lebbah, & Azzag, 2015; Husch, Schyska, & von Bremen, 2018; Kantabutra & Couch, 2000; Kraus & Kestler, 2010), but the real Big Data clustering question remains open. Currently, the parallelization technique is widely used (Zhao et al., 2009; Sardar & Ansari, 2018; Cuomo et al., 2019). ...

The application of clustering algorithms is expanding due to the rapid growth of data volumes. Nevertheless, existing algorithms are not always effective because of their high computational complexity. A new parallel batch clustering algorithm based on the k-means algorithm is proposed. The proposed algorithm splits a dataset into equal partitions and reduces the exponential growth of computations. The goal is to preserve the characteristics of the dataset while increasing the clustering speed. Cluster centers are calculated for each partition and are later merged and clustered themselves. An approach to determining the optimal batch size is also considered, and the statistical significance of the proposed approach is established. Six experimental datasets are used to evaluate the effectiveness of the proposed parallel batch clustering, and the obtained results are compared with the k-means algorithm. The analysis shows the practical applicability of the proposed algorithm to Big Data.
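The two-stage scheme described in the abstract — cluster each partition, then cluster the collected centers — can be sketched as follows. This is our own minimal reading of the idea, omitting the batch-size selection step; all names are illustrative:

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((data[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = data[labels == j].mean(0)
    return centroids

def batch_kmeans(data, k, n_batches=4):
    # stage 1: cluster each batch independently (these runs could be parallel)
    local = [kmeans(b, k, seed=i)
             for i, b in enumerate(np.array_split(data, n_batches))]
    # stage 2: cluster the collected per-batch centers down to k final centers
    return kmeans(np.vstack(local), k)

rng = np.random.default_rng(2)
data = rng.permutation(np.vstack([rng.normal(0, 0.3, (100, 2)),
                                  rng.normal(8, 0.3, (100, 2))]))
centers = batch_kmeans(data, k=2)
```

Each batch touches only 1/`n_batches` of the data, and the second stage works on `n_batches * k` points rather than the full dataset, which is where the speed-up comes from.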

... Finally, in recent years, further efforts have been directed toward parallel implementations of the k-means algorithm in several high-performance computing environments. Significant algorithms are described, for example, in [11] for distributed-memory architectures, in [22,24,33] for multi-core CPUs, and in [7,8] for GPU-based systems. Almost all these studies exploit the role that a large amount of data can play in an implementation based on the data-parallelism programming model. ...

... McKmeans, described in [24], is based on the concept of transactional memory to guarantee thread safety indirectly. ...

The K-means algorithm is one of the most popular algorithms in Data Science; it aims to discover similarities among the elements of large datasets, partitioning them into K distinct groups called clusters. The main weakness of this technique is that, in real problems, it is often impossible to define the value of K as input data. Furthermore, the large amount of data used in useful simulations makes the execution of the algorithm on traditional architectures impracticable. In this paper, we address these two issues. On the one hand, we propose a method to define the value of K dynamically by optimizing a suitable quality index with special care for the computational cost. On the other hand, to improve the performance and effectiveness of the algorithm, we propose a strategy for parallel implementation on modern multi-core CPUs.
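Choosing K by optimizing a quality index can be sketched as below. We use the Calinski-Harabasz index purely as a stand-in; the paper's actual index and its cost-saving scheme are not reproduced, and the farthest-point seeding is our own simplification:

```python
import numpy as np

def kmeans(data, k, iters=30, seed=0):
    """Lloyd's k-means with farthest-point seeding; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    c = [data[rng.integers(len(data))]]
    for _ in range(k - 1):  # spread the remaining seeds across the data
        d = ((data[:, None] - np.array(c)[None]) ** 2).sum(-1).min(1)
        c.append(data[d.argmax()])
    c = np.array(c, dtype=float)
    for _ in range(iters):
        labels = ((data[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                c[j] = data[labels == j].mean(0)
    labels = ((data[:, None] - c[None]) ** 2).sum(-1).argmin(1)
    return c, labels

def calinski_harabasz(data, labels, c):
    """Between-/within-cluster dispersion ratio; larger is better."""
    n, k = len(data), len(c)
    m = data.mean(0)
    B = sum((labels == j).sum() * ((c[j] - m) ** 2).sum() for j in range(k))
    W = sum(((data[labels == j] - c[j]) ** 2).sum() for j in range(k))
    return (B / (k - 1)) / (W / (n - k))

def choose_k(data, k_max=6):
    """Try each candidate K and keep the one with the best quality index."""
    best_k, best = None, -np.inf
    for k in range(2, k_max + 1):
        c, labels = kmeans(data, k)
        score = calinski_harabasz(data, labels, c)
        if score > best:
            best_k, best = k, score
    return best_k

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(m, 0.3, (60, 2)) for m in (0, 5, 10)])
best_k = choose_k(data)
```

Re-running the clustering for every candidate K is the expensive part, which is why the paper's emphasis on the computational cost of the index matters.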

... In recent years, the k-means algorithm and its modifications have been the subject of research on the analysis of large volumes of data [16,20-23]. ...

... A multi-core parallelization of the k-means/k-modes algorithm for biological data clustering that provides complex cluster-number estimations for Big Data on a single computer was proposed [21]. However, this approach requires additional effort and equipment (specialized hardware for fast communication between computers, multiple software installations in heterogeneous environments). ...

Big Data analysis requires large computing power, which is not always available, so new clustering algorithms capable of processing such data have become necessary. This study proposes a new parallel clustering algorithm based on the k-means algorithm that significantly reduces the exponential growth of computations. The proposed algorithm splits a dataset into batches while preserving the characteristics of the initial dataset and increasing the clustering speed. The idea is to determine cluster centroids for each batch and then cluster the centroids themselves. Each data point is then assigned to the cluster with the nearest resulting centroid. Real large datasets are used in experiments to evaluate the effectiveness of the proposed approach, which is compared with k-means and one of its modifications. The experiments show that the proposed algorithm is a promising tool for clustering large datasets in comparison with the k-means algorithm.

... Strict protocols and specifications are needed to limit their influence a priori. Examples of existing data science methods to counteract noise effects are global normalization techniques [48], robustness procedures [8,42], and invariant models that are insensitive to data transformations [45]. Methods for aggregated data modalities. Besides improving the analysis of single data modalities via additional samples and external domain knowledge, big data also brings up the challenge and promise of combining multiple data modalities (Fig. 1). ...

Recent snapshots of the European progress on big data in health care and precision medicine reveal diverse perceptions among experts and the public, leading to the impression that algorithmic issues have the largest share among the challenges all health systems face. Yet, a comparison of different countries makes it evident that the adaptation and integration of heterogeneous data sources have a major impact on the advancement of precision medicine. Legal regulations for the implementation and operation of healthcare networking are actively discussed in public and gradually implemented in several countries. With unified documentation, they are an ideal precondition for integrating distributed healthcare data into a big data platform with reliable fact representation. Now, basic and clinical scientists have to be motivated to share their work with these data platforms. In this work, we aim to provide an overview of the common issues in big healthcare data applications and address the challenges for the involved scientific, clinical, and administrative partners. We propose a possible strategy for comprehensive data integration by iterating data harmonization, semantic enrichment, and data analysis processes.

... Many works with different approaches have been proposed to parallelize the clustering process using shared memory [6,23,24,38]. In [23], the authors proposed a solution based on message passing between the processes. ...

... In [24], the authors presented an approach based on multi-core processors, exploiting their cores to parallelize clustering algorithms. They parallelized the k-means and k-modes algorithms to cluster gene expression data. ...

Clustering data consists of partitioning it into clusters such that there is strong similarity between data in the same cluster and weak similarity between data in different clusters. With the significant increase in data volume, the clustering process becomes an expensive task in terms of computation, and several solutions have been proposed to overcome this issue using parallelism with the MapReduce paradigm. The solutions proposed in the literature aim to optimize the execution time while keeping the clustering quality close or identical to that of sequential execution. One commonly used parallel clustering strategy under the MapReduce framework consists of partitioning the data and processing each partition separately; the results obtained from each partition are merged to obtain the final cluster configuration. Using a random data-distribution strategy and an inappropriate merging technique leads to inaccurate final centroids and mediocre clustering quality. Hence, in this paper we propose a parallel scheme for partitional clustering algorithms based on MapReduce with non-conventional data-distribution and result-merging strategies to improve the clustering quality. With this solution, in addition to optimizing the execution time, we exploit the parallel environment to enhance the clustering quality. The experimental results demonstrate the effectiveness and scalability of our solution in comparison with other recently proposed works. We also apply our approach to the community detection problem; the results demonstrate its ability to provide effective and relevant results.
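One way to read the partition-and-merge strategy discussed above is as a single MapReduce round: the map phase clusters each partition locally and emits weighted centroids, and the reduce phase merges them with a weighted k-means. The sketch below reflects the generic baseline strategy the paper improves upon, not the paper's own distribution and merging schemes; all names are ours:

```python
import numpy as np

def map_phase(partition, k, seed=0):
    """Map: cluster one partition locally; emit (centroids, weights)."""
    rng = np.random.default_rng(seed)
    c = partition[rng.choice(len(partition), k, replace=False)].astype(float)
    for _ in range(15):
        labels = ((partition[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                c[j] = partition[labels == j].mean(0)
    labels = ((partition[:, None] - c[None]) ** 2).sum(-1).argmin(1)
    return c, np.bincount(labels, minlength=k)

def reduce_phase(mapped, k, iters=15):
    """Reduce: weighted k-means over all local centroids -> k final centers."""
    pts = np.vstack([c for c, _ in mapped])
    w = np.concatenate([wt for _, wt in mapped]).astype(float)
    centers = pts[:k].copy()
    for _ in range(iters):
        labels = ((pts[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            mask = labels == j
            if w[mask].sum() > 0:
                centers[j] = np.average(pts[mask], axis=0, weights=w[mask])
    return centers

rng = np.random.default_rng(4)
data = rng.permutation(np.vstack([rng.normal(0, 0.3, (100, 2)),
                                  rng.normal(8, 0.3, (100, 2))]))
mapped = [map_phase(p, 2, seed=i) for i, p in enumerate(np.array_split(data, 4))]
centers = reduce_phase(mapped, 2)
```

Carrying the cluster sizes as weights into the merge keeps large partitions from being outvoted by small ones, which is one of the pitfalls of naive merging the paper points out.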

... Reducing the complexity of data prior to clustering is one of the widely discussed issues in big data clustering (Mohebi et al., 2015). Clustering algorithms have been implemented on cloud and graphical processing unit (GPU) platforms extensively over the years, as discussed in Table 2. [The remainder of this excerpt is the flattened contents of Table 2, which pairs techniques with references — e.g. k-means on a GPU platform (Wu et al., 2009); parallelised k-means and k-modes clustering of microarray and gene expression data (Kraus & Kestler, 2010); CloudVista, clustering in a cloud computing environment (Xu et al., 2012); followed by association rule mining, logistic regression, and support vector machine applications — not reproduced here.] ...

The unparalleled growth of data in bioinformatics over the years poses a major challenge for storage and management. Such massive data must be handled efficiently to disseminate knowledge. Computational advancements in information technology offer feasible analytical solutions for processing such data. In this context, this paper highlights the influence of big data in bioinformatics. The concepts emphasised include the definition of big data and the architectural platforms supporting data analytics, followed by the application of the analytical techniques mentioned above to complex problems in bioinformatics. The challenges and future prospects of big data analytics in bioinformatics are briefly discussed. This paper provides a comprehensive summary of several data-analytical techniques available to bioinformatics researchers and computer scientists.
