About
37 Publications
7,616 Reads
517 Citations
Additional affiliations
June 2014 - present
Publications (37)
The computational complexity of many key bioinformatics problems has resulted in numerous alternative heuristic solutions, where no single approach consistently outperforms all others. This creates difficulties for users trying to identify the most suitable tool for their dataset and for developers managing and evaluating alternative methods. As da...
Motivation
The introduction of DeepMind's AlphaFold 2 enabled the prediction of protein structures at an unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens...
Viromics produces millions of viral genomes and fragments annually, overwhelming traditional sequence comparison methods. We introduce Vclust, a novel approach that determines average nucleotide identity by Lempel-Ziv parsing and clusters viral genomes with thresholds endorsed by authoritative viral genomics and taxonomy consortia. Vclust demonstra...
Motivation
The introduction of DeepMind's AlphaFold 2 enabled the prediction of protein structures at an unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of tera...
Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest a loss of accuracy when scaling beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel...
The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm...
Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In the paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering, a well-established h...
PHIST (Phage-Host Interaction Search Tool) predicts prokaryotic hosts of viruses from their genomic sequences. It improves host prediction accuracy at species level over current alignment-based tools (on average by 3 percentage points) as well as alignment-free and CRISPR-based tools (by 14-20 percentage points). PHIST is also two orders of magnitu...
The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomics. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to...
Identification of genetic variants is of crucial importance in the forthcoming era of precision medicine. Since the majority of variant callers require mapping reads to a reference genome, the reliability of the mapping is a key factor determining the accuracy of downstream analyses. We present Whisper 2, a short-read-mapping software providing supe...
Rule-based models are often used for data analysis as they combine interpretability with predictive power. We present RuleKit, a versatile tool for rule learning. Based on a sequential covering induction algorithm, it is suitable for classification, regression, and survival problems. The presence of a user-guided induction facilitates verifying hyp...
Whisper2 is a short-read-mapping software providing superior quality of indel variant calling. Its running times place it among the fastest existing tools.
Availability and Implementation: https://github.com/refresh-bio/whisper
Contact: sebastian.deorowicz@polsl.pl
Supplementary information: Supplementary data are available at publisher's Web site.
Rule-based models are often used for data analysis as they combine interpretability with predictive power. We present RuleKit, a versatile tool for rule learning. Based on a sequential covering induction algorithm, it is suitable for classification, regression, and survival problems. The presence of a user-guided induction facilitates verifying hyp...
This article presents GuideR, a user-guided rule induction algorithm, which overcomes the largest limitation of existing methods: the inability to introduce the user's preferences or domain knowledge into the rule learning process. Automatic selection of attributes and attribute ranges often leads to a situation in which resulting rule...
Motivation:
Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time.
Results:
We present Whisper, an accurate and high-performance mapping tool, based on the idea of so...
Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40,715 pathogens in less than 7 minutes (on a modern workstation), 26 times faster than Mash, its main competitor...
This article presents GuideR, a user-guided rule induction algorithm, which overcomes the largest limitation of existing methods: the inability to introduce the user's preferences or domain knowledge into the rule learning process. Automatic selection of attributes and attribute ranges often leads to a situation in which resulting rule...
Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40,715 pathogens in less than 4 minutes (on a modern workstation), 44 times faster than Mash, its main competito...
Motivation
Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily.
Results
We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays f...
Background
Survival analysis is an important element of reasoning from data. Applied in a number of fields, it has become particularly useful in medicine to estimate the survival rate of patients on the basis of their condition, examination results, and the treatment they undergo. Recent developments in next-generation sequencing open new opportu...
The rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate...
The rapid development of modern sequencing platforms has enabled an unprecedented growth of protein family databases. The abundance of sets composed of hundreds of thousands of sequences is a great challenge for multiple sequence alignment algorithms. In the article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment of...
In the face of the increasing size of sequence databases caused by the development of high-throughput sequencing, multiple alignment algorithms face one of the greatest challenges yet. In this paper we show that well-established techniques, refinement and consistency, are ineffective when large protein families are of interest. We present QuickProbs 2,...
The increasing size of sequence databases, caused by the development of high-throughput sequencing, presents multiple alignment algorithms with one of the greatest challenges yet. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are of interes...
Camera calibration is one of the basic problems concerning intelligent video analysis in networks of multiple cameras with changeable pan and tilt (PT). Traditional calibration methods give satisfactory results, but are labour-intensive. In this paper we introduce a method of camera calibration and navigation based on continuous tracking, whi...
Methods of tracking human motion in video sequences can be used to count people, identify pedestrian traffic patterns, analyze behavior statistics of shoppers, or as a preliminary step in the analysis and recognition of a person’s actions and behavior. A novel method for tracking multiple people in a video sequence is presented, based on clustering...
Expanding the capabilities of intelligent surveillance systems and advancing research in human motion analysis require massive amounts of video data for training learning methods and classifiers and for testing the solutions under realistic conditions. While there are many publicly available video sequences which are meant for training and testing, the ex...
Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs custo...
Determining similarities between species is a crucial issue in life sciences. This task is usually done by comparing fragments of the genomic or proteomic sequences of the organisms under analysis. The basic procedure which facilitates these comparisons is called multiple sequence alignment. Many algorithms address this problem,...
Background
Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current methods based on machine learning suffer from some drawbacks, including not addressing the class imbalance problem properly. It may l...
Splicing is one of the major contributors to observed spatiotemporal diversification of transcripts and proteins in metazoans. There are numerous factors that affect the process, but splice sites themselves along with the adjacent splicing signals are critical here. Unfortunately, there is still little known about splicing in plants and, consequent...
In the paper we present CHIRA, an algorithm performing decision rule aggregation. New elementary conditions, which are linear combinations of attributes, may appear in rule premises during the aggregation, leading to so-called oblique rules. The algorithm merges rules iteratively, in pairs, according to a certain order specified in advance. It appl...
Multiple sequence alignment (MSA) is one of the most important problems in computational biology. As availability of genomic and proteomic data constantly increases, new tools for processing this data in reasonable time are needed. One method of addressing this issue is parallelization. Nowadays, graphical processing units offer much more computati...
Modern graphical processing units (GPUs) offer much more computational power than modern CPUs, so it is natural that GPUs are often used for solving many computationally-intensive problems. One of the tasks of huge importance in bioinformatics is sequence alignment. We investigate its variant introduced a few years ago in which some additional requ...
Decision trees and decision rules are usually applied to classification problems in which the legibility and interpretability of the obtained data model are as important as good classification ability. Besides trees, rules are the most frequently used knowledge representation applied by knowledge discovery algorithms. Rules gener...