# Zeev Volkovich

ORT Braude College · Department of Software Engineering

Professor

## About

- Publications: 155
- Reads: 6,942


- Citations: 535


Research Experience

March 2005 - present

**University of Maryland, Baltimore County**

Position

- Affiliated professor

February 1991 - present

**ORT Braude College**

Position

- Professor (Full)

Description

- Head, M.Sc. in Software Engineering

## Publications

Publications (155)

A problem of linking vertices (objects) by a connecting tree is studied under the condition that objects appear at different given times. In this case, the target function depends not only on the total length of the connecting tree but also on the times of constructing its fragments. This problem is shown to be NP-complete even when the linking is...

Determination of metagenome composition is still one of the most interesting problems of bioinformatics. It involves a wide range of mathematical methods, from probabilistic models of combinatorics to cluster analysis and pattern recognition techniques. The successful advance of rapid sequencing methods and fast and precise metagenome analysis will...

We dedicate this work to the blessed memory of Vladimir Mikhailovich Zolotarev, to whom we owe our interest in heavy-tailed distributions. Abstract: A model of scientific citation distribution is given. We apply it to understand the role of the Hirsch index as an indicator of scientific publication importance in Mathematics and some related fields....

A new method for the recognition of meaningful changes in social state based on transformations of the linguistic content in Arabic newspapers is suggested. The detected alterations of the linguistic material in Arabic newspapers play an indicator role. The currently proposed approach acts in an “online” fashion and uses pre-trained vector represen...

A model of scientific citation distribution is given. We apply it to understand the role of the Hirsch index as an indicator of scientific publication importance in Mathematics and some related fields. The proposed model is based on a generalization of such well-known distributions as the geometric and Sibuya laws, now included in a family of distributi...

In 1938, Schoenberg [20] posed a few seemingly simple problems in the area of elementary probability theory. The main goal was to characterize all the pseudo-isotropic distributions; that is, probability distributions in which all the one-dimensional projections are the same up to a scale parameter c. However, except for rotationally invariant and...

Detection of computer-generated papers using One-Class SVM and cluster approaches

The paper presents a novel methodology intended to distinguish between real and artificially generated manuscripts. The approach employs inherent differences between the human and artificially generated writing styles. Taking into account the nature of the generation process, we suggest that the human style is essentially more “diverse” and “rich” in...

This chapter utilizes both software and hardware hierarchical systems as logical structures. It expresses the properties to be tested in different extensions of first‐order logic. Coverage analysis is aimed to guarantee that the runs of the tests fully capture the functionality of the system. The chapter proposes a method to analyze quantitative co...

This paper suggests a new methodology for patterning writing style evolution using dynamic similarity. We divide a text into sequential, disjoint portions (chunks) of the same size and exploit the Mean Dependence measure, aspiring to model the writing process via association between the current text chunk and its predecessors. To expose the evoluti...

Testing of concurrent programmes is difficult since the scheduling nondeterminism requires one to test a huge number of different thread interleavings. Moreover, repeated test executions that are performed in the same environment will typically examine similar interleavings only. One possible way to deal with this problem is to use the noise in...

A new method for classifying objects of various nature, named “2”-soft classification, is proposed. It assigns objects to one of two types with optimal entropy probability for the available collection of learning data, taking into account additive errors therein. A decision rule of randomized parameters and probability density f...

This paper discusses a novel time series methodology for writing process modeling, taking into account the dependency between sequentially written text parts. A series of consecutive sub-documents of a given document are represented via histograms of the appropriately chosen terms. To characterize the document overall style and its fluctuations, a...

Metagenome, a mixture of different genomes (as a rule, bacterial), represents a pattern, and the analysis of its composition is, currently, one of the challenging problems of bioinformatics. In the present study, the possibility of evaluating metagenome composition by DNA-marker methods is investigated. These methods are based on using primers, sho...

The paper proposes a new approach for intrinsic plagiarism detection, based on a new unique method, which enables identifying style changes in a text using novel chronology-based similarity measures. A model for finding significant deviations in the style across a given document is constructed aiming to indicate text parts which are suspected to be...

The paper establishes a connection between a special case of cluster analysis, the deconvolution problem, and the classical moment problem. Namely, the methods used there are applied to solve the deconvolution problem for the case of one known distribution and another one concentrated in an unknown finite number of points. These results can be applied to...

In this paper, a novel method for analyzing media in Arabic using new quantitative characteristics is proposed. A sequence of newspaper daily issues is represented as histograms of occurrences of informative terms. The closeness of the histograms is evaluated via a rank correlation coefficient by treating the terms as ordinal data consistent with their fr...

In this paper, we address the problem of literary writing style determination using a comparison of the randomness of two given texts. We attempt to comprehend if these texts are generated from distinct probability sources that can reveal a difference between the literary writing styles of the corresponding authors. We propose a new approach based...

We introduce the notion of strongly distributed multi-agent systems and present a uniform approach to incremental problem solving on them. The approach is based on the systematic use of two logical reduction techniques: Feferman-Vaught reductions and syntactically defined translation schemes. The multi-agent systems are presented as logical structu...

As a general rule, each user must provide the tool applied with particular values of its input parameters. An inexperienced user may hardly figure out their values and the tool developer must define the default values in order to help her/him. We present an approach to solve the problem with the help of multi-criteria optimization that is new in th...

As a rule, when a user applies some algorithm, she/he submits to it an input as well as provides values for its parameters. Any particular choice of the parameters affects the received final results and may vary significantly for different kinds of inputs. On the other hand, an inexperienced user may hardly figure out what we are talking about. Tha...

The necessity to operate with the huge number of anonymous documents abounding on the Internet is initiating the study of new methods for authorship recognition. The principal weakness of the methods used in this area is that they assess the similarity of text styles without any regard to their surroundings. This paper proposes a novel mathematical...

In this article we offer an algorithm that recurrently divides a dataset by searching for partitions via a one-dimensional subspace discovered by optimizing a projection pursuit function. To assess the model order, a resampling technique is employed. For each number of clusters, bounded by a predefined limit, samples from the projected data a...

A complete weighted graph, \(G(X,\varGamma ,W)\), is considered. Let \(\tilde{X}\subset X\) be some subset of vertices and, by definition, a Steiner tree is any tree in the graph G such that the set of the tree vertices includes set \(\tilde{X}\). The Steiner tree problem consists of constructing the minimum-length Steiner tree in graph G, for a gi...

We introduce the notion of strongly distributed structures and present a uniform approach to incremental automated reasoning on them. The approach is based on a systematic use of two logical reduction techniques: Feferman-Vaught reductions and syntactically defined translation schemes. The distributed systems are presented as logical structures A’s...

The aim of the paper is writing style investigation. The method used is based on re-sampling approach. We present the text as a series of characters generated by distinct probability sources. A re-sampling procedure is applied in order to simulate samples from the texts. To check if samples are generated from the same population we use a KNN-based...

The article presents the theoretical foundations of the algorithm for calculating the number of different genomes in the medium under study and of two algorithms for determining the presence of a particular (known) genome in this medium. The approach is based on the analysis of the compositional spectra of subsequently sequenced samples of the medi...

In the fields of data mining and control, the huge amount of unstructured data and the presence of uncertainty in system descriptions have always been critical issues. The book Randomized Algorithms in Automatic Control and Data Mining introduces the readers to the fundamentals of randomized algorithm applications in data mining (especially cluster...

Realization of a control affects the information contributing to new changes in the object of information. Formed control u enters the system and affects the state x changing it in many cases.

For adaptive control the identification approach is often used. This approach constructs estimates for the possible values of the unknown parameters x* based on observation sequences, and these estimates are then used in a parameterized feedback loop that, if properly selected, normally assumed or established the quality of a closed-loop system tha...

A historical overview of the cluster validation methods is presented in Chapter 2. As was mentioned there, selecting the model to be used in cluster analysis, and in particular estimating the “true” number of clusters, is an ill-posed problem of essential relevance in cluster analysis. This chapter reviews some recently developed approaches to this...

Multidimensional stochastic optimization plays an important role in the analysis and control of many technical systems. Randomized algorithms of stochastic approximation with perturbed input have been suggested for solving the challenging multidimensional problems of optimization. These algorithms have simple forms and provide consistent estimates...

The problem of classification of input signals (stimuli, objects) is a typical problem in learning theory. The classification process yields signs that characterize a group of objects as a data set or class (clusters). On these grounds, each signal can be attributed to a particular class. Clustering results in the partition of signals into groups i...

This section is concerned with parameter estimation and the filtering of linear models. We focus on algorithms which facilitate avoiding standard requirements for observation noise.

The minimax theorem is one of the first positive theoretical answers for the question: Why should randomization be beneficial? The theorem was proved by John von Neumann in 1928 [343]. He considers two-person zero-sum games defined by a matrix \(A = \{a_{i,j}\}\) with perfect information (i.e., in which each time players know all the moves that have take...

An Evolutionary Network (EN) in formatted protein sequence space is a very large graph representing information about sequence similarity of relatively short protein fragments. This graph can be used for detecting hidden relatedness between proteins, which is highly significant in protein annotation. Effective EN analysis requires an appropriate gr...

Stochastic modeling in image analysis aims to represent image features with a small number of parameters so as to recognize the source producing the images. In this paper we address the image segmentation problem in the case of significantly different segment sizes. A probabilistic model dealing with the distribution of gray level in the observed imag...

In this paper, we consider quantitative optimization problems on decomposable discrete systems. We restrict ourselves to labeled trees as the description of the systems and we use weighted automata on them as our computational model. We introduce a new kind of labeled decomposable trees, sum-like weighted labeled trees, and propose a method, which...

In this paper, we try to build a bridge between the purely theoretical approach to computations on decomposable graphs and the heuristics used in practice for treating particular cases of them. In theory, Feferman and Vaught in 1959 proposed a method to reduce solution of First Order definable problems on Disjoint Union of structures to solutions of der...

Graph clustering becomes difficult as the graph size and complexity increase. In particular, in interaction graphs, the clusters are small and the data on the underlying interaction are not only complex, but also noisy due to the lack of information and experimental errors. The graphs representing such data consist of (possibly overlapping...

We have shown, in a previous paper, that tandem repeating sequences, especially triplet repeats, play a very important role in gene evolution. This result led to the formulation of the following hypothesis: most of the genomic sequences evolved through everlasting acts of tandem repeat expansions with subsequent accumulation of changes. In order to...

An appropriate distance is an essential ingredient in various real-world learning tasks. Distance metric learning proposes to study a metric that reflects the data configuration much better than the commonly used methods. We offer an algorithm for simultaneously learning the Mahalanobis-like distance and K-means cluste...

Mobile phones are quickly becoming the primary source for social, behavioral, and environmental sensing and data collection. Today's smartphones are equipped with increasingly more sensors and accessible data types that enable the collection of literally dozens of signals related to the phone, its user, and its environment. A great deal of research...

In this paper we introduce a new heuristic approach for local clustering of protein-protein interaction networks (PPIN), which can be applied to very large graphs. The method is based on the idea of repeated bisections (rbr) proposed earlier for global clustering of PPIN. Each round of bisection is carried out by multilevel graph clusterization met...

Daily activity of the users of a mobile-phone network is represented as a sequence of input and output calls and of input and output text messages. Each such sequence corresponds to its spectrum, the distribution of short two-letter sequences of the same type. It is shown that the spectra of any user's sequences are stable, i.e. reproduced daily. B...

Cluster validation is the task of estimating the quality of a given partition of a data set into clusters of similar objects. Normally, a clustering algorithm requires a desired number of clusters as a parameter. We consider the cluster validation problem of determining the optimal “true” number of clusters. We adopt the stability testing approach,...

An open problem in spectral clustering concerning automatically finding the number of clusters is studied. We generalize the method for selecting the scale parameter offered in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with the distance learning methodology. Values of the scaling parameter estimated via clustering of samples drawn...

If we define a genetic code as a widespread DNA sequence pattern that carries a message with an impact on biology, then there are multiple genetic codes. Sequences involved in these codes overlap and, thus, both interact with and constrain each other, such as for the triplet code, the intron-splicing code, the code for amphipathic alpha helices, an...

In this paper, we propose a method to classify prokaryotic genomes using the agglomerative information bottleneck method for unsupervised clustering. Although the method we present here is closely related to a group of methods based on detecting the presence or absence of genes, our method is different because it uses gene lengths as well. We show...

One of the most difficult problems in cluster analysis is the identification of the number of groups in a given data set. In this paper we offer the randomized approach in the rate distortion framework. A randomized algorithm has been suggested to allocate this position. The scenario approach is used to significantly reduce the computational comple...

In this research, we consider a mixture of genome fragments of a certain bacteria set. The problem of mixture separation is studied under the assumption that all the genomes present in the mixture are completely sequenced or are close to those already sequenced. Such an assumption is relevant, e.g., in regular observations of ecological or biomedical...

Among the areas of data and text mining which are employed today in OR, science, economy and technology, clustering theory serves as a preprocessing step in data analysis. An important component of clustering theory is determination of the true number of clusters. This problem has not been satisfactorily solved. In our paper, this problem is a...

Superposition of signals in a DNA molecule is a sufficiently general principle of information coding. The necessary requirement for such superposition is the degeneracy of the code, which allows placing different messages on the same DNA fragment. Code words that are equivalent in the informational sense (i.e., synonyms) form a synonymous group and the...

In this paper, an efficient gene selection algorithm is proposed, which employs a two-sample distribution-free test statistic for evaluating the gene expression differences in the compared datasets. The experimental results obtained for the Acute Lymphoblastic Leukemia (ALL) Dataset confirm the efficiency of the algorithm.

One of the important problems arising in cluster analysis is the estimation of the appropriate number of clusters. In the case when the expected number of clusters is sufficiently large, the majority of the existing methods involve high complexity computations. This difficulty can be avoided by using a suitable confidence interval to estimate the n...

An open problem in spectral clustering concerning automatically finding the number of clusters is studied. We generalize the method for selecting the scale parameter offered in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with the distance learning methodology. Values of the scaling parameter estimated via clustering of samples dr...

This work addresses the cluster validation problem of determining the ‘right’ number of clusters. We consider a cluster stability property based on the k-nearest neighbour type coincidences model. Quality of a clustering is measured by the deviation from this model, where a small deviation indicates a good clustering. The true number of clusters c...

This communication reports on the nucleosome positioning patterns (bendability matrices) for the human genome, derived from over 8 million nucleosome DNA sequences obtained from apoptotically digested lymphocytes. This digestion procedure is used here for the first time for the purpose of extraction and sequencing of the nucleosome DNA fragments. T...

In cluster analysis, selecting the number of clusters is an “ill-posed” problem of crucial importance. In this paper we propose a re-sampling method for assessing cluster stability. Our model suggests that samples’ occurrences in clusters can be considered as realizations of the same random variable in the case of the “true” number of clusters. Thu...

One of the most difficult problems in cluster analysis is the identification of the number of groups in a given data set. In this paper we offer the approach in the framework of the common "elbow" methodology such that the true number of clusters is recognized as the slope discontinuity of the index function. A randomized algorithm has been suggest...

K-Nearest Neighbors is a widely used technique for classifying and clustering data. In the current article, we address the cluster stability problem based upon probabilistic characteristics of this approach. We estimate the stability of partitions obtained from clustering pairs of samples. Partitions are presumed to be consistent if their clusters...

HoPLLS (Hierarchy of protein loop-lock structures) (http://leah.haifa.ac.il/~skogan/Apache/mydata1/main.html) is a web server that identifies closed loops - a structural basis for protein domain hierarchy. The server is based on the loop-and-lock theory for structural organisation of natural proteins. We describe this web server, the algorithms for...

Clustering is actively studied in such fields as statistics, pattern recognition, and machine learning. A new randomized algorithm is suggested and substantiated for finding the number of clusters in a data set; its efficiency is demonstrated by examples of simulation modeling on synthetic data with thousands of clusters.

In: International IFNA-ANS Journal "Problems of Nonlinear Analysis in Engineering Systems" (ISSN 1727-687X) 17, n.1 (35) (2011)

In: International IFNA-ANS Journal "Problems of Nonlinear Analysis in Engineering Systems" (ISSN 1727-687X) 144-151 (in Russian)

The article proposes a new standpoint to the cluster validation problem based on a fractal dimension cluster quality model. The suggested method uses the fractal property to describe cluster geometrical configuration. This notion is applied for further exploration of cluster validity, assuming that its low variability, calculated via different samp...

The 24th European Conference on Operational Research (EURO XXIV) featured a wide range of research and applications in the area of stochastic modeling. The following is a review of presentations in the stream on “Stochastic Modeling and Simulation” organized by Erik Kropat (Universität der Bundeswehr München, Germany), Zeev Volkovich (ORT Braude College...

In the present paper, 188 prokaryote genomes are classified by separately calculating the compositional spectra for the coding and the non-coding parts of the genomes. For each subsequence, the compositional spectrum is transformed into the corresponding point in a vector space. This enables the categorization of genomes into meaningful groups by a...

Content: Cluster Analysis is an essential tool for “unsupervised” learning, which categorizes data (objects, instances) into groups such that the likeness within a group is much higher than that between groups. This resemblance is often described by a distance function. Clustering algorithms are intended to construct data partitions int...

Reference:
Z. Volkovich, G.-W. Weber, R. Avros and O. Yahalom, On an adjacency cluster merit approach, Int. J. Operational Research 13, 3 (2012) 239-255.

The exon-intron structures of fungi genes are quite different from each other, and the evolution of such structures raises many questions. We tried to address some of these questions with an accent on methods of revealing evolutionary factors based on the analysis of gene exon-intron structures using statistical analysis. Taking whole genomes of...