# Daming Zhu · Shandong University | SDU · School of Computer Science and Technology

Daming Zhu

Doctor of Philosophy

## About

- Publications: 179
- Reads: 8,216
- Citations: 753

Introduction

Daming Zhu currently works at the School of Computer Science and Technology, Shandong University. Daming's research interests include algorithms and complexity, bioinformatics, and computational biology. Current projects focus on algorithms for genome data analysis.

Additional affiliations

January 2000 - present

**ShanDong University**

Position

- Professor

## Publications

Publications (179)

General-purpose protein structure embedding can be used for many important protein biology tasks, such as protein design, drug design and binding affinity prediction. Recent research has shown that attention-based encoder layers are better suited to learning high-level features. Based on this key observation, we propose a two-level general-purpose...

Sorting permutations by block moves is a fundamental combinatorial problem in genome rearrangements. The classic block move operation is called transposition, which switches two adjacent blocks, or equivalently, moves a block to some other position. However, the movement of large blocks rarely occurs during real evolutionary events. A natural restriction of tr...

Motivation
Drawing peaks in a data window of an MS data set happens all the time in MS data visualization applications. This requires retrieving from an MS data set some selected peaks in a data window whose image in a display window reflects the visual features of all peaks in the data window. If an algorithm for this purpose is asked to output high qu...

Viewing peaks in LC–MS data sets is a routine task in liquid-chromatography mass spectrometry data analysis. It is challenging to develop an LC–MS data storage and retrieval tool that can get some peaks in a data window whose image, presented with their intensities in a display window, resembles as closely as possible that of all peaks in the data windo...

A permutation is happy if it can be transformed into the identity permutation using a number of short swaps equal to one third of the number of inversions in the permutation. The complexity of the decision version of sorting a permutation by short swaps is still open. We present an O(n) time algorithm to decide whether it is true for a permutation to be...
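The definition of a happy permutation can be checked by brute force on small inputs. The sketch below is not the O(n) algorithm from the abstract; it is an exponential-time illustration only. It assumes that a short swap exchanges two elements with at most one element between them, and the function names (`count_inversions`, `short_swap_distance`, `is_happy`) are hypothetical.

```python
from collections import deque


def count_inversions(perm):
    """Count pairs (i, j) with i < j and perm[i] > perm[j]."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def short_swap_neighbors(perm):
    """Yield every permutation reachable by one short swap.

    Assumption: a short swap exchanges the elements at positions i and j
    where j - i is 1 or 2 (at most one element in between).
    """
    n = len(perm)
    for i in range(n):
        for gap in (1, 2):
            j = i + gap
            if j < n:
                nxt = list(perm)
                nxt[i], nxt[j] = nxt[j], nxt[i]
                yield tuple(nxt)


def short_swap_distance(perm):
    """Minimum number of short swaps that sort perm (breadth-first search)."""
    start = tuple(perm)
    target = tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in short_swap_neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("unreachable: short swaps can sort any permutation")


def is_happy(perm):
    """True if the short swap distance equals one third of the inversions."""
    return 3 * short_swap_distance(perm) == count_inversions(perm)
```

For example, `[3, 2, 1]` has three inversions and is sorted by the single short swap of its first and last elements, so it is happy, whereas `[2, 1]` has one inversion and needs one swap, so it is not.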

The Maximum Vertex Coverage problem (abbreviated as MVC) is to maximize the number of edges covered by a set of vertices of size exactly K on a graph. This problem is the dual of the vertex cover problem and has attracted a lot of interest in the literature on approximation algorithms. So far, the best approximation factor for the MVC problem is 3/4...
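To make the MVC objective concrete, the following sketch evaluates how many edges a vertex set covers and finds an optimum by exhaustive search. It is an exponential-time illustration of the problem statement, not an approximation algorithm from the paper; the function names and the edge-list representation are assumptions.

```python
from itertools import combinations


def covered_edges(edges, chosen):
    """Number of edges with at least one endpoint in the chosen vertex set."""
    chosen = set(chosen)
    return sum(1 for u, v in edges if u in chosen or v in chosen)


def max_vertex_coverage_bruteforce(vertices, edges, k):
    """Try every vertex subset of size exactly k and keep the best one."""
    best = max(combinations(vertices, k), key=lambda s: covered_edges(edges, s))
    return set(best), covered_edges(edges, best)
```

On the path with edges `(0, 1)`, `(1, 2)`, `(2, 3)`, one vertex can cover at most two edges, while two vertices can cover all three.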

Scaffold filling is a critical step in DNA assembly. Its basic task is to fill the missing genes (fragments) into an incomplete genome (scaffold) to make it similar to the reference genome. There has been a lot of work under distinct measures in the genome comparison literature. For genomes with gene duplications, common string partition re...

Recent advances in RNA-seq technology have made identification of expressed genes affordable, thus boosting the rapid development of transcriptomic studies. Transcriptome assembly, reconstructing all expressed transcripts from RNA-seq reads, is an essential step to understand genes, proteins, and cell functions. Transcriptome assembly remains a ch...

Genome structural variants (SVs) have a great impact on human phenotype and diversity, and have been linked to numerous diseases. Long-read sequencing technologies have made it possible to find SVs as long as 10,000 nucleotides. Thus, long read-based SV detection has drawn the attention of many recent research projects, and many tools have...

In this paper, we propose an algorithm which approximates the Two-Sided Scaffold Filling problem to a better performance ratio of 1.4+ε. This is achieved through a deep investigation of the optimal solution structure of Two-Sided Scaffold Filling. We make use of a relevant graph aiming at a solution of a Two-Sided Scaffold Filling instance, and eva...

In this paper, we study the Maximum Internal Spanning Tree Problem (MIST). Given an undirected simple graph G, the task for the Maximum Internal Spanning Tree problem is to find a spanning tree of G with the maximum number of internal vertices. We present an approximation algorithm with performance ratio $\frac{4}{3}$, which improves upon the best known performan...

We propose a new problem whose input data are two linear genomes together with two indexed gene subsequences of them, which asks to find a longest common exemplar subsequence of the two given genomes with a subsequence identical to the given indexed gene subsequences. We present an algorithm for this problem such that the algorithm is allowed to ta...

Presents corrections to author information in the above named paper.

Genome structural variants have a great impact on human phenotype and diversity, and have been linked to numerous diseases. Long read sequencing technologies have made it possible to find structural variants as long as ten thousand nucleotides. Thus, long read based structural variant detection has drawn the attention of many recent resear...

Translocation has long been known as a basic operation to rearrange the structure of a genome. Translocation sorting asks to find a shortest sequence of translocations that transforms one genome into another, which has attracted the attention of many researchers in algorithm design. Signed translocation sorting can be solved in polynomial time. Unsign...

To address the problems of parameter optimization and insufficient utilization of split reads in the detection of copy number variation (CNV), a new definition of relative read depth (RRD) and a randomized sampling strategy (RGN) are proposed in this paper. Compared to the raw read depth, the RRD parameter has weak correlation with GC content, mapp...

Maximum stacking base pairs is a fundamental combinatorial problem from ribonucleic acid (RNA) secondary structure prediction under the energy model. The basic maximum stacking base pairs problem can be described as: given an RNA sequence, find a maximum number of base pairs such that each chosen base pair has at least one parallel and adjacent par...

Background:
Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability to reconstruct transcripts. However, the massive size of short reads makes de novo transcript assembly an algorithmic challenge....

Although plenty of structural variant detection approaches for human genomes can be found in the literature, little is known about the effectiveness of these structural variant tools on plant genomes. It has been demonstrated that these structural variant detection tools frequently find too many false structural...

Sorting permutations by block moves is a fundamental combinatorial problem in genome rearrangements. The classic block move operation is called transposition, which switches two consecutive blocks, or equivalently, moves a block to some other position. However, the movement of large blocks rarely occurs during real evolutionary events. A natural restriction of...

Genome rearrangement problems have been extensively studied for more than two decades, intended to understand the species evolutionary relationships in terms of the long range genetic mutations at the genome level. While most earlier studies focus on the simplified genomes ignoring gene duplicates, thousands of whole genome sequencing projects reve...

Copy number variation (CNV) is a prevalent kind of genetic structural variation which leads to an abnormal number of copies of large genomic regions, such as gain or loss of DNA segments larger than 1 kb. CNV exists not only in human genomes but also in plant genomes. Current research has shown that CNV is associated with many complex diseases....

The prediction of RNA structure with pseudoknots is an NP-hard problem. According to minimum free energy models and computational methods, we investigate the RNA pseudoknotted structures and their characteristics. The paper presents an efficient algorithm for predicting RNA structures with pseudoknots, and the algorithm runs in O(n³) time and O(n²...

Background:
Database search has been the main approach for proteoform identification by top-down tandem mass spectrometry. However, when the target proteoform that produced the spectrum contains post-translational modifications (PTMs) and/or mutations, it is quite time consuming to align a query spectrum against all protein sequences without any P...

The genomic scaffold filling problem has attracted a lot of attention recently. The problem is on filling an incomplete sequence (scaffold) I into I′, with respect to a complete reference genome G, such that the number of common/shared adjacencies between G and I′ is maximized. The problem is NP-complete, and admits a constant-factor approximation....

In this paper, we aim to find structural variants including deletions, insertions, and inversions which occur in the Hedou12 genome in contrast to the Williams82 genome. To find as many potential structural variants as possible, we try to develop new principles to detect discordant and split read map sets supporting structural variants. Aiming to enhan...

Indel is a molecular biology term for an insertion or deletion of bases in the genome. Much research has demonstrated that structural variations, including indels, are closely related to human disease and health. The detection of indels is therefore very important in the life sciences. Currently, there is a certain number of structural variation detect...

In this article, we develop a novel radical framework for de novo transcriptome assembly based on suffix trees, called DTAST. DTAST extends contigs by reads that have the longest overlaps with the contigs’ termini. These reads can be found in time linear in the length of the reads through a well-designed suffix tree structure. Besides, DTAST pro...

The prediction of RNA structure with pseudoknots is a nondeterministic polynomial-time hard (NP-hard) problem; according to minimum free energy models and computational methods, we investigate the RNA-pseudoknotted structure. Our paper presents an efficient algorithm for predicting RNA structure with pseudoknots, and the algorithm takes O((Formula...

High-throughput sequencing of mRNA has made the deep and efficient probing of the transcriptome more affordable. However, the vast amounts of short RNA-seq reads make de novo transcriptome assembly an algorithmic challenge. In this work, we present IsoTree, a novel framework for transcript reconstruction in the absence of reference genomes. Unlike mos...

Database search is the main approach for identifying proteoforms using top-down tandem mass spectra. However, it is extremely slow to align a query spectrum against all protein sequences in a large database when the target proteoform that produced the spectrum contains post-translational modifications and/or mutations. As a result, efficient and se...

The breakpoint graph has been widely used as a key data structure in algorithm design for genome rearrangements. The problem of breakpoint graph cycle decomposition, which asks for a largest collection of edge-disjoint cycles, is crucial in computing rearrangement distances between genomes. This problem is NP-hard, and can be approximated to 1.4193+ε(l...

High-throughput sequencing of mRNA has made the deep and efficient probing of transcriptomes more affordable. However, the vast amounts of short RNA-seq reads make de novo transcriptome assembly an algorithmic challenge. In this work, we present IsoTree, a novel framework for transcript reconstruction in the absence of reference genomes. Unlike mo...

Scaffold Filling aims to obtain, by computation, what can be used as whole genomes from scaffolds. Given two scaffolds, Two-Sided Scaffold Filling asks to fill each scaffold with those genes that occur in the other scaffold but not in the scaffold itself, so that the two resulting scaffolds have as many common adjacencies as possible. This problem has lo...

Translocation has long been known as a basic operation to rearrange genomes. Signed translocation sorting can be solved in polynomial time. Unsigned translocation sorting turns out to be NP-Hard and Max-SNP-Hard. The best known algorithm to date for unsigned translocation sorting achieves a performance ratio of 1.408. In this paper, we propose a new a...

A clause is not-all-equal satisfied if it has at least one literal assigned true and one literal assigned false. Given a boolean variable set U and a clause set C, Max NAE-SAT asks to find an assignment of U such that the number of not-all-equal satisfied clauses in C is maximized. Max NAE-SAT turns into Max NAE-k-SAT if each claus...
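The not-all-equal condition can be illustrated with a small checker and an exhaustive solver. This is a sketch of the problem definition only, not the algorithm from the paper. It assumes a DIMACS-like encoding in which a positive integer `k` denotes variable `k` and `-k` denotes its negation; all function names are hypothetical.

```python
from itertools import product


def literal_value(lit, assignment):
    """Truth value of a literal under an assignment {variable: bool}."""
    value = assignment[abs(lit)]
    return value if lit > 0 else not value


def nae_satisfied(clause, assignment):
    """True if the clause has at least one true and at least one false literal."""
    values = [literal_value(lit, assignment) for lit in clause]
    return any(values) and not all(values)


def max_nae_sat_bruteforce(variables, clauses):
    """Enumerate all assignments; return (best count, one best assignment)."""
    best_count, best_assignment = -1, None
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        count = sum(nae_satisfied(clause, assignment) for clause in clauses)
        if count > best_count:
            best_count, best_assignment = count, assignment
    return best_count, best_assignment
```

With only positive two-literal clauses the problem coincides with Max Cut, so the three clauses `[1, 2]`, `[2, 3]`, `[1, 3]` (a triangle) admit at most two not-all-equal satisfied clauses.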

To run a dataflow at as low a cost as possible, one often has to decide which data-sets in a data-set sequence should be stored, with the rest regenerated. The Intermediate Data-set Storage problem arises from this situation. The current best algorithm for this problem takes time. In this paper, we present two improved algorithms f...

This book constitutes the proceedings of the 10th International Workshop on Frontiers in Algorithmics, FAW 2016, held in Qingdao, China, in June/July 2016.
The 25 full papers presented in this volume were carefully reviewed and selected from 54 submissions. They deal with algorithm, complexity, problem, reduction, NP-complete, graph, approximation,...

Background:
The protein-protein interaction plays a key role in the control of many biological functions, such as drug design and functional analysis. Determination of binding sites is widely applied in molecular biology research. Therefore, many efficient methods have been developed for identifying binding sites. In this paper, we calculate struc...

Scaffold filling is an interesting combinatorial optimization problem from genome sequencing. The one-sided scaffold filling problem can be stated as: given an incomplete scaffold with some genes missing and a reference scaffold, the purpose is to insert the missing genes back into the incomplete scaffold (called "filling the scaffold"), such that...

This paper focuses on finding a spanning tree of a graph to maximize the number of its internal vertices. We propose a new upper bound for the number of internal vertices of a spanning tree, which shows that for any undirected simple graph, any spanning tree has fewer internal vertices than the edges a maximum path-cycle cover has. Thus starting wit...

Running a dataflow in a cloud environment usually generates many useful intermediate datasets. A strategy for running a dataflow is to decide which datasets should be stored, while the rest of them are regenerated. The intermediate dataset storage (IDS) problem asks to find a strategy for running a dataflow, such that the total cost is minimized. T...

A clause is not-all-equal satisfied if it has at least one literal assigned by \(T\) and one literal assigned by \(F\). Max NAE-SAT is given by a set \(U\) of boolean variables and a set \(C\) of clauses, and asks to find an assignment of \(U\), such that the not-all-equal satisfied clauses of \(C\) are maximized. Max NAE-SAT turns into Max NAE-\(k...

The exemplar breakpoint distance problem is motivated by finding conserved sets of genes between two genomes. It asks to find respective exemplars in two genomes to minimize the breakpoint distance between them. If one genome has no repeated gene (called trivial genome) and the other has genes repeating at most twice, it is referred to as the (1, 2...

Sorting genomes by translocations is a classic combinatorial problem in genome rearrangements. The translocation distance for signed genomes can be computed exactly in polynomial time, but for unsigned genomes the problem becomes NP-hard and the current best approximation ratio is . In this paper, we investigate the problem of sorting unsigned geno...

We consider the emerging problem of comparing the similarity between (unlabeled) pedigrees. More specifically, we focus on the simplest pedigrees, namely, the 2-generation pedigrees. We show that the isomorphism testing for two 2-generation pedigrees is GI-hard. If the 2-generation pedigrees are monogamous (i.e., each individual at level-1 can mate...

The exemplar breakpoint distance problem is motivated by finding conserved sets of genes between two genomes. It asks to find respective exemplars in two genomes to minimize the breakpoint distance between them. If one genome has no repeated gene (called trivial genome) and the other has genes repeating at most twice, it is referred to as the (1,2)...

Sorting permutations by short block moves is an interesting combinatorial problem derived from genome rearrangements. A short block move is an operation on a permutation that moves an element at most two positions away from its original position. The problem of sorting permutations by short block moves is to sort a permutation by the minimum number...
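The operation described above can be explored on small permutations with a breadth-first search. The sketch below is an exponential-time illustration, not the method of the paper. It follows the abstract's wording and models a short block move as relocating a single element by one or two positions; the function names are hypothetical.

```python
from collections import deque


def short_block_moves(perm):
    """Yield every permutation obtained by moving one element 1 or 2 positions."""
    n = len(perm)
    for i in range(n):
        for shift in (-2, -1, 1, 2):
            j = i + shift
            if 0 <= j < n:
                nxt = list(perm)
                # Remove the element at position i and reinsert it at position j.
                nxt.insert(j, nxt.pop(i))
                yield tuple(nxt)


def short_block_move_distance(perm):
    """Minimum number of short block moves that sort perm (breadth-first search)."""
    start = tuple(perm)
    target = tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in short_block_moves(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("unreachable: adjacent swaps alone can sort any permutation")
```

For instance, `[3, 1, 2]` is sorted by moving `3` two positions to the right, while `[3, 2, 1]` needs two moves.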

This paper focuses on finding a spanning tree of a graph to maximize the number of its internal vertices. We propose a new upper bound for the number of internal vertices in a spanning tree, which shows that for any undirected simple graph, any spanning tree has fewer internal vertices than the edges a maximum path-cycle cover has. Thus starting with a...

The duplication-loss problem is to infer a species supertree from a collection of gene trees that are confounded by complex histories of gene duplication and loss events. The decision variant of this problem is NP-complete. The utility of this NP-hard problem for large-scale phylogenetic analyses has been largely limited by the high time complexit...

The exemplar breakpoint distance problem (EBD for short) is NP-hard, even when one of the genomes, called G-1, has no repetition, and the other genome, called G-2, has genes that appear at most twice in the genome ((1, 2)-Exemplar Breakpoint Distance Problem or EBD(1, 2) for short). Unless P=NP, there is no polynomial time algorithm for EBD(1, 2)....

Background
With the continuous discovery of novel RNA molecules with key cellular functions and of novel pathways and interaction networks, the need for structural information of RNA is still increasing. In order to predict the structure of long RNA and understand its natural folding mechanism, exploring the characteristics of RNA structure is an import...

The scaffold filling problem aims to set up the whole genomes by filling those missing genes into the scaffolds to optimize a similarity measure of genomes. A typical and frequently used measure for the similarity of two genomes is the number of common adjacencies. One-sided scaffold filling is given by a scaffold and a whole genome, and asks to fi...

This paper focuses on finding a spanning tree of a graph to maximize the
number of its internal vertices. We present an approximation algorithm for this
problem which can achieve a performance ratio $\frac{4}{3}$ on undirected
simple graphs. This improves upon the best known approximation algorithm with
performance ratio $\frac{5}{3}$ before. Our a...

RNA secondary structures with pseudoknots are often predicted by minimizing free energy, which is NP-hard. Most RNAs fold during transcription from DNA into RNA through a hierarchical pathway wherein secondary structures form prior to tertiary structures. Real RNA secondary structures often have local instead of global optimization because of kinet...

The PKNOTS data set includes 116 sequences: 25 tRNA sequences randomly selected from the Sprinzl tRNA database, HIV-1-RT-ligand RNA pseudoknots, and some viral RNAs. The columns len and base pair indicate the length and the number of native base pairs of each sequence. For each algorithm, the sensitivity and specificity are shown. Average is the usual a...

A biclustering problem consists of objects and an attribute vector for each object. Biclustering aims at finding a bicluster - a subset of objects that exhibit similar behavior across a subset of attributes, or vice versa. Biclustering in matrices with binary entries ("0"/"1") can be simplified into the problem of finding submatrices with entries o...

Most viruses have RNA genomes; their biological functions are expressed more by folded architecture than by sequence. Among the various RNA structures, pseudoknots are the most typical. In general, RNA secondary structure prediction does not include pseudoknots because they are difficult to model. Here we present an algorithm of dynamic matching...

Sorting genomes by translocations is a classic combinatorial problem in genome rearrangements. The translocation distance for signed genomes can be computed exactly in polynomial time, but for unsigned genomes the problem becomes NP-Hard and the current best approximation ratio is 1·5+ε. In this paper, we investigate the problem of sorting unsigned...

Predicting ribonucleic acid (RNA) structure with pseudoknots is NP-hard. To find the optimal RNA pseudoknotted structure, we investigate the RNA pseudoknotted structure based on computational methods and models with minimum free energy (MFE). The contribution of this paper is to obtain an efficient algorithm for predicting RNA pseu...

The duplication-loss problem is to infer a species supertree from a collection of gene trees that are confounded by complex histories of gene duplication and loss events. The decision variant of this problem is NP-complete. The utility of this NP-hard problem for large-scale phylogenetic analyses has been largely limited by the high time complexity...

The paper further investigates the computational problem and complexity of predicting Ribonucleic Acid structure. In order to find a way to optimize the Ribonucleic Acid pseudoknotted structure, we investigate the Ribonucleic Acid pseudoknotted structure based on a thermodynamic model; computational methods and minimum free energy are adopted to predi...

Computability logic (CoL) is a formal theory of interactive computation. It
understands computational problems as games played by two players: a machine
and its environment, uses logical formalism to describe valid principles of
computability and formulas to represent computational problems. Logic CL1 is a
deductive system for a fragment of CoL. Th...

The exemplar breakpoint distance problem is one of the most important problems in genome comparison and has been extensively studied in the literature. The exemplar breakpoint distance problem cannot be approximated within any factor even if each gene family occurs at most twice in a genome. This is due to the fact that its decision version, the ze...

Compared with elliptic curve (EC) cryptosystems, hyperelliptic curve (HEC) cryptosystems offer a high level of security with a shorter key size. Scalar multiplication is the most important operation in cryptosystems built on HEC and EC. The Montgomery Ladder algorithm is an efficient and important algorithm to implement scalar multiplications for d...

Scaffold filling is a new combinatorial optimization problem in genome sequencing. The one-sided scaffold filling problem can be described as given an incomplete genome $(I)$ and a complete (reference) genome $(G)$, fill the missing genes into $(I)$ such that the number of common (string) adjacencies between the resulting genome $(I^{\prime })$ and...

Elections are an important preference aggregation model in a variety of areas. Given a pool of n potential voters, the chair may strategically select k voters from the pool to feed to an election system, in order to control the final outcome of the election system. This type of control, called control by voter selection, is closely related to tw...

Translocation is a prevalent rearrangement event in the evolution of multi-chromosomal species which exchanges ends between two chromosomes. A translocation is reciprocal if none of the exchanged ends is empty; otherwise, non-reciprocal. The problem of sorting by translocations asks to find a shortest sequence of translocations transforming one gen...

Scaffold filling is a new combinatorial optimization problem in genome sequencing and can improve the accuracy of the sequencing results. The two-sided Scaffold Filling to Maximize the Number of String Adjacencies (SF-MNSA) problem can be described as: given two incomplete gene sequences A and B, respectively fill the missing genes into A and B suc...

Computational models and methods for predicting secondary structure of RNA sequence are in demand. Based on MFE principle and the relative stability of the n-stems in RNA molecules, Minimum Free Energy method is adopted widely to predict RNA secondary structure. An improved heuristic algorithm is presented to predict RNA pseudoknotted structure, an...

The ability to predict protein-protein binding sites has a wide range of applications, including signal transduction studies, de novo drug design, structure identification and comparison of functional sites. The interface in a complex involves two structurally matched protein subunits, and the binding sites can be predicted by identifying structura...

We introduce a new, substantially simplified version of the
toggling-branching recurrence operation of Computability Logic, prove its
equivalence to Japaridze's old, "canonical" version, and also prove that both
versions preserve the static property of their arguments.

Based on the MFE principle and the relative stability of the n-stems in RNA molecules, the minimum free energy method is widely adopted to predict RNA secondary structure. An improved approximation algorithm is presented to predict RNA pseudoknotted structure. The algorithm can solve arbitrary nested or parallel pseudoknots, and it takes O(n³) time a...

De novo peptide sequencing is an important method for peptide sequencing and protein identification. It can be transformed into a special case of paths avoiding forbidden pairs problem (PAFP). General PAFP is NP-complete. The definition of PAFP is the following: Given a directed acyclic graph G = (V, E), two distinguished vertices source s, sink t...