Oliver Serang’s research while affiliated with University of Montana and other places


Publications (42)


Median of heaps: linear-time selection by recursively constructing binary heaps
  • Preprint

April 2023 · 16 Reads

Oliver Serang

The first worst-case linear-time algorithm for selection was discovered in 1973; however, linear-time binary heap construction was first published in 1964. Here we describe another worst-case linear selection algorithm, which is simply implemented and uses binary heap construction as its principal engine. The algorithm is implemented in place and is shown to perform similarly to in-place median of medians.
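As a point of reference for the abstract above, the following is a minimal sketch of heap-based selection. It is not the algorithm from the preprint: it is a simple baseline that builds a binary heap in linear time and then pops k - 1 values, giving O(n + k log n) rather than worst-case linear time. The function name `select_kth_smallest` is hypothetical.

```python
import heapq


def select_kth_smallest(values, k):
    """Return the k-th smallest value (1-indexed) using a binary heap.

    Baseline only: heapify is linear time, but the k - 1 pops cost
    O(k log n), so this is not a worst-case linear selection algorithm.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    heap = list(values)      # copy so the caller's data is left untouched
    heapq.heapify(heap)      # linear-time binary heap construction
    for _ in range(k - 1):   # discard the k - 1 smallest values
        heapq.heappop(heap)
    return heap[0]           # the heap root is now the k-th smallest


if __name__ == "__main__":
    print(select_kth_smallest([5, 1, 4, 2, 3], 3))
```

For small k this baseline is already close to linear; the interest of worst-case linear selection is that the bound holds for every k, including the median.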


Adversarial network training using higher-order moments in a modified Wasserstein distance
  • Preprint
  • File available

October 2022 · 15 Reads

Generative-adversarial networks (GANs) have been used to produce data closely resembling example data in a compressed, latent space that is close to sufficient for reconstruction in the original vector space. The Wasserstein metric has been used as an alternative to binary cross-entropy, producing more numerically stable GANs with greater mode-covering behavior. Here, a generalization of the Wasserstein distance, using higher-order moments than the mean, is derived. Training a GAN with this higher-order Wasserstein metric is demonstrated to exhibit superior performance, even when adjusted for slightly higher computational cost. This is illustrated by generating synthetic antibody sequences.


Optimally selecting the top k values from X + Y with layer-ordered heaps

May 2021 · 28 Reads · 1 Citation

Selection on and sorting of the Cartesian sum, X + Y, are classic and important problems. Here, a new algorithm is presented, which generates the top k values of the form X_i + Y_j. The algorithm relies on layer-ordered heaps, partial orderings of exponentially sized layers. The algorithm relies only on median-of-medians and is simple to implement. Furthermore, it uses data structures that are contiguous in memory, is cache efficient, and is fast in practice. The presented algorithm is demonstrated to be theoretically optimal.
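To make the problem statement concrete, here is a minimal sketch of a sort-and-heap baseline for selection on X + Y. It is not the layer-ordered heap algorithm from the paper and it is not optimal: it sorts both inputs and then runs a k-way merge over the rows of the implicit sum matrix, costing O(n log n + k log n). It assumes that "top k" means the k smallest sums, and the function name `k_smallest_pairwise_sums` is hypothetical.

```python
import heapq


def k_smallest_pairwise_sums(x, y, k):
    """Return the k smallest values x[i] + y[j] in ascending order.

    Baseline only: sorts both inputs, then merges the rows of the
    implicit matrix of sums with a binary heap.
    """
    if k <= 0 or not x or not y:
        return []
    xs, ys = sorted(x), sorted(y)
    # Row i of the implicit matrix is xs[i] + ys[0], xs[i] + ys[1], ...
    # Only the first min(k, len(xs)) rows can contribute to the answer.
    heap = [(xs[i] + ys[0], i, 0) for i in range(min(k, len(xs)))]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        total, i, j = heapq.heappop(heap)
        result.append(total)
        if j + 1 < len(ys):
            # Advance along row i to its next-smallest sum.
            heapq.heappush(heap, (xs[i] + ys[j + 1], i, j + 1))
    return result


if __name__ == "__main__":
    print(k_smallest_pairwise_sums([1, 7, 11], [2, 4, 6], 4))
```

The sorting step is what an optimal method avoids: a full sort fixes far more order than a selection of k values needs.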


Figure 1. (A) Nine layer products of A + B. (B) The layer product tuples in the order they would pop from the heap, the number of values in their Cartesian product, and s, the cumulative size of the layer products whose max tuples have been popped. The two axes are the input arrays after being LOHified. The values of all 18 possible layer product tuples are shown (nine min tuples in blue and nine max tuples in green). If k = 10, then the tuples will be popped in the order shown in (B). After (20,(3,1),true) is popped, s (the total number of items in the Cartesian product of all max tuples) exceeds k. Note that the values in the layers of A and B are not necessarily in sorted order. DOI: 10.7717/peerjcs.483/fig-1

Figure 2. The process of adding a layer to a pairwise selection node's LOH, G, when both children are pairwise selection nodes. Each node has a LOH it generates for its parent to access as well as a (not realized) matrix formed by the Cartesian product of its children's LOHs. Blue layer products currently have their min tuple in the heap. Green layer products have had at least their min tuple popped from the heap (and thus have inserted other tuples into the heap). (A) The triplet before adding a new layer to G. (B) The parent generating the next layer in its LOH. The parent pops (1,3) and must now insert (1,3) and (1,4); however, the left child has not yet generated the fourth layer in its LOH, C, so the parent cannot insert (1,4). (C) The left child generating the fourth layer in its LOH, C. The left child pops (2,2) then inserts (2,2) and (2,3). The left child continues and pops (2,2) and (2,3) and performs the appropriate insertions: (2,3) and (2,4). Finally, the left child pops (2,3), at which point it has enough values to select the next layer in C. Now that C has its fourth layer, the parent is able to insert (1,4) and continue. The parent then pops (1,3) and selects its third layer. The parent did not need F to generate a new layer, and so the right child remains the same as before. (D) The triplet after the parent and left child perform the necessary operations to generate the next layer in G. DOI: 10.7717/peerjcs.483/fig-2

Selection on X_1 + X_2 + ⋯ + X_m via Cartesian product trees

April 2021 · 59 Reads

Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on A + B, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of A + B selections was proposed to perform selection on X_1 + X_2 + ⋯ + X_m in o(n⋅m + k⋅m), where each X_i has length n. Here, that o(n⋅m + k⋅m) algorithm is combined with a novel, optimal LOH-based algorithm for selection on A + B (without a soft heap). The performance of algorithms for selection on X_1 + X_2 + ⋯ + X_m is compared empirically, demonstrating the benefit of the algorithm proposed here.


Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps

February 2021 · 12 Reads

Journal of Proteome Research

Kyle Lucke · Jake Pennington · Patrick Kreitzberg · [...] · Oliver Serang

Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus given enough data, sorting becomes the computational bottleneck, even with very optimized implementations such as the C++ standard library routine, std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.


Selection on X_1 + X_2 + ⋯ + X_m via Cartesian product tree

August 2020 · 6 Reads

Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on X + Y, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of X + Y selections was proposed to perform k-selection on X_1 + X_2 + ⋯ + X_m in o(n⋅m + k⋅m), where each X_i has length n. Here, that o(n⋅m + k⋅m) algorithm is combined with a novel, optimal LOH-based algorithm for selection on X + Y (without a soft heap). The performance of algorithms for selection on X_1 + X_2 + ⋯ + X_m is compared empirically, demonstrating the benefit of the algorithm proposed here.


Optimal construction of a layer-ordered heap

July 2020 · 12 Reads

The layer-ordered heap (LOH) is a simple, recently proposed data structure used in optimal selection on X + Y, in the algorithm with the best known runtime for selection on X_1 + X_2 + ⋯ + X_m, and in the fastest method in practice for computing the most abundant isotope peaks in a chemical compound. Here, we introduce a few algorithms for constructing LOHs, analyze their complexity, and demonstrate that one algorithm is optimal for building a LOH of any rank α. These results are shown to correspond with empirical experiments of runtimes when applying the LOH construction algorithms to a common task in machine learning.
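The layer-ordering property can be illustrated with a minimal sketch. It builds a LOH by fully sorting and then slicing, which is the simplest construction but costs O(n log n) and is therefore not the optimal construction the abstract refers to. It assumes that layer i has size ⌈α^i⌉ (with the last layer truncated to fit) and that every value in a layer is no greater than every value in the next layer; the function names are hypothetical.

```python
import math


def layer_sizes(n, alpha=2.0):
    """Sizes of successive layers: ceil(alpha**i), truncated to total n."""
    sizes, total, i = [], 0, 0
    while total < n:
        size = min(math.ceil(alpha ** i), n - total)
        sizes.append(size)
        total += size
        i += 1
    return sizes


def lohify_by_sorting(values, alpha=2.0):
    """Build a layer-ordered heap as a list of layers (sort-based baseline)."""
    ordered = sorted(values)
    layers, start = [], 0
    for size in layer_sizes(len(ordered), alpha):
        layers.append(ordered[start:start + size])
        start += size
    return layers


def is_layer_ordered(layers):
    """Check that max(layer i) <= min(layer i + 1) for all adjacent layers."""
    return all(max(a) <= min(b) for a, b in zip(layers, layers[1:]))


if __name__ == "__main__":
    loh = lohify_by_sorting([7, 3, 9, 1, 8, 2, 6, 5, 4, 0])
    print(loh, is_layer_ordered(loh))
```

Sorting also orders the values inside each layer, which a LOH does not require; constructions based on linear-time selection leave each layer unordered, and that is where the savings over sorting come from.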


Fast Exact Computation of the k Most Abundant Isotope Peaks with Layer-Ordered Heaps

July 2020 · 13 Reads · 6 Citations

Analytical Chemistry

Computation of the isotopic distribution of compounds is crucial to applications of mass spectrometry, particularly as machine precision continues to improve. In the last decade, several tools have been created for doing so. In this paper we present a novel algorithm for calculating the most abundant k isotopologue peaks of a compound. The algorithm uses Serang's optimal method of selection on Cartesian products. The method is significantly faster than the state of the art on large compounds (e.g., the titin protein) and on compounds whose elements have many isotopes (e.g., palladium alloys).


Fast exact computation of the k most abundant isotope peaks with layer-ordered heaps

April 2020 · 7 Reads

The theoretical computation of the isotopic distribution of compounds is crucial in many important applications of mass spectrometry, especially as machine precision grows. A considerable number of good tools have been created in the last decade for doing so. In this paper we present a novel algorithm for calculating the top k peaks of a given compound. The algorithm takes advantage of layer-ordered heaps used in an optimal method of selection on X + Y and is able to efficiently calculate the top k peaks on very large molecules. Among its peers, this algorithm shows a significant speedup on molecules whose elements have many isotopes. The algorithm obtains a speedup of more than 31x when compared to IsoSpec on Au2Ca10Ga10Pd76 when computing 47409787 peaks, which covers 0.999 of the total abundance.


Optimal selection on X+Y simplified with layer-ordered heaps

January 2020 · 3 Reads

Selection on the Cartesian sum, A + B, is a classic and important problem. Frederickson's 1993 paper gave the first algorithm that made an optimal runtime possible. Kaplan et al.'s 2018 paper described an alternative optimal algorithm using Chazelle's soft heaps. These extant optimal algorithms are very complex; this complexity can lead to difficulty implementing them and to poor performance in practice. Here, a new optimal algorithm is presented, which uses layer-ordered heaps. This new algorithm is both simple to implement and practically efficient.


Citations (25)


... IsoSpec is another popular simulator with bindings to Python (IsoSpecPy), R (IsoSpecR), and C. There are also more recent offerings like ElemCor, [28] which specialized in the simulation of isotopically enriched metabolites; NEUTRONSTAR, which is optimized for the efficient calculation of large molecules; [29] and DEUTERIUM, which extends Rockwood's FFT algorithm into a 2D FFT algorithm for enhanced detail. [30] Here we present a straightforward approach for fitting simulated isotope patterns to experimental data using a modification to the MATLAB isotope simulator, isotopicdist. ...

Reference:

Automated Assignment of 15N and 13C Enrichment Levels in Doubly-Labeled Proteins
Fast Exact Computation of the k Most Abundant Isotope Peaks with Layer-Ordered Heaps
  • Citing Article
  • July 2020

Analytical Chemistry

... Datasets from the ABRF RG studies, including those from sPRG studies resulting in commercially available standards, have seen frequent reuse by researchers developing new algorithms and software. [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74] NIST offers numerous other materials that include human, animal, and plant tissues, human biofluids, etc., but are not specifically fit for purpose for proteomics. These are available directly from NIST at https://shop.nist.gov. ...

EPIFANY - A method for efficient high-confidence protein inference
  • Citing Article
  • January 2020

Journal of Proteome Research

... This makes it particularly suited to act as a pre-selection funnel for selecting promising candidates for further exploration using computationally more expensive approaches such as in silico fragmentation trees which also are able to provide comparable improvements detecting structurally highly related compound [28]. In addition, one could think of adding relevant mass differences as input to train the model, for example following the approach of Kreitzberg et al. [29]. In future work we are also keen to explore how Spec2Vec can be combined with the concept of hypothetical neutral losses as proposed by [30]. ...

The alphabet projection of spectra

Journal of Proteome Research

... The library includes modular, from-scratch implementations of several tools used in inference: these include real and complex FFT (using a template-recursive approach), p-convolution (using a lazy approach that may terminate early without computing the full family of convolutions in L_p spaces), PMFs, and message passing methods for graphical models. The template-recursive TRIOT tensor library is used for manipulating distributions of arbitrary dimension (and dimension unknown at compile time) [23]. ...

TRIOT: Faster tensor manipulation in C++11
  • Citing Article
  • April 2017

The Art Science and Engineering of Programming

... Although the Cooley-Tukey approach is not quite as good for large numbers of dimensions (because each dimension must be zero padded, and reaching the next power of two in each dimension may result in a ≈ 2^d slowdown where d is the number of dimensions [28]), this implementation is lightweight, produced completely in house, and is fast in practice for small numbers of dimensions. ...
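The zero-padding cost mentioned in the excerpt above can be checked with a short arithmetic sketch. It assumes each axis is padded independently up to the next power of two and measures only the growth in the number of elements, not actual FFT runtime; the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def padding_factor(shape):
    """Ratio of padded element count to original element count."""
    original, padded = 1, 1
    for n in shape:
        original *= n
        padded *= next_power_of_two(n)
    return padded / original


if __name__ == "__main__":
    # Each axis of length 65 is padded to 128, so the element count grows
    # by (128 / 65) ** 3, which approaches 2 ** 3 = 8 for d = 3 dimensions.
    print(padding_factor((65, 65, 65)))
    # Axes that are already powers of two incur no padding.
    print(padding_factor((64, 64, 64)))
```

The worst case occurs when every axis is one element past a power of two; per-axis growth then approaches a factor of 2, and the total approaches 2^d.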

An exact, cache-localized algorithm for the sub-quadratic convolution of hypercubes
  • Citing Article
  • July 2016

... A reported minimum probability was chosen to achieve a 1% FDR at both peptide and protein levels. Peptide abundance was calculated using the FeatureFinderMultiplex tool from OpenMS (v.2.3) [43][44][45]. Peptide abundance features were mapped (IDMapper) to the identified peptides (iProphet) followed by IDConflictResolver and MultiplexResolver in OpenMS (v.2.3). Peptides and proteins with their corresponding abundances were assembled in R (v3.6.1, ...

Yeast membrane proteomics using leucine metabolic labelling: Bioinformatic data processing and exemplary application to the ER-intramembrane protease Ypf1
  • Citing Article
  • July 2016

Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics

... Statistical analyses in proteomics can be complex and have been covered extensively (Handler and Haynes, 2020; Jung, 2016; Serang and Käll, 2015). However, there is little consensus for which statistical methods and tests are best suited to proteomics studies, resulting in a variety of strategies employed. ...

Solution to Statistical Challenges in Proteomics Is More Statistics, Not Less
  • Citing Article
  • August 2015

Journal of Proteome Research