
Jehoshua Bruck - PhD
California Institute of Technology
About
456 Publications
45,340 Reads
16,939 Citations
Current institution
Additional affiliations
- January 1994 - present
- June 1986 - June 1994
Education
- September 1985 - March 1989
Publications (456)
Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable o...
DNA-based storage is an emerging technology that provides high information density and longevity. Noise and errors are present in almost every stage of the process: writing, storing, and reading. Thus, efficient error correction is crucial to guarantee high reliability and low cost of storing data in DNA. Due to technological constraints and biolog...
The von Neumann Computer Architecture has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of associative computation where memory is defined by a set of vectors...
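To make the associative flavor of the abstract above concrete, here is a minimal Python sketch (sizes and names are illustrative, not the paper's model) in which memory is a set of stored vectors and recall retrieves the stored vector closest to a noisy cue:

```python
import numpy as np

# Toy content-addressable memory: the "memory" is a set of stored +/-1 vectors,
# and recall returns the stored vector with the largest inner product with a cue.
# This only illustrates the general associative idea; the paper's model of
# associative computation may differ.
rng = np.random.default_rng(0)
memory = rng.choice([-1, 1], size=(16, 64))    # 16 stored patterns of length 64

def recall(cue):
    return memory[np.argmax(memory @ cue)]

original = memory[3].copy()
noisy = original.copy()
noisy[rng.choice(64, size=10, replace=False)] *= -1   # corrupt 10 coordinates
print(np.array_equal(recall(noisy), original))        # typically True
```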
One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, which in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per track, resulting in multiple copies (potentially erroneous) an...
Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (the number of distinct functions they can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a we...
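As background for the weight-size issue mentioned above, the sketch below shows a plain linear threshold gate together with the classic COMPARISON function (is X >= Y for two n-bit numbers?), whose natural single-gate realization uses power-of-two weights that grow exponentially with the input length; the function and parameters are illustrative only.

```python
# A linear threshold gate outputs 1 iff the weighted sum of its inputs reaches a
# threshold.  COMPARISON of two n-bit numbers is a standard example whose natural
# single-gate realization uses exponentially large (power-of-two) weights.
def threshold_gate(weights, threshold, bits):
    return int(sum(w * b for w, b in zip(weights, bits)) >= threshold)

def comparison_gate(x_bits, y_bits):
    """1 iff the number with bits x_bits (MSB first) is >= the one with y_bits."""
    n = len(x_bits)
    weights = [2 ** (n - 1 - i) for i in range(n)]
    return threshold_gate(weights + [-w for w in weights], 0, x_bits + y_bits)

assert comparison_gate([1, 0, 1], [0, 1, 1]) == 1   # 5 >= 3
assert comparison_gate([0, 1, 0], [1, 0, 0]) == 0   # 2 <  4
```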
In this paper, we study a model that mimics the programming operation of memory cells. This model was first introduced by Lastras-Montano et al. for continuous-alphabet channels, and later by Bunte and Lapidoth for discrete memoryless channels (DMC). Under this paradigm we assume that cells are programmed sequentially and individually. The progra...
In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the state machine generating the genomic sequence of in...
The current approach to cancer detection is based on identifying genetic mutations typical of tumor cells. This approach is effective only when cancer has already emerged; however, by then it might be at a stage too advanced for effective treatment. Cancer is caused by the continuous accumulation of mutations; is it possible to measure the time-depende...
Consider multiple experts with overlapping expertise working on a classification problem under uncertain input. What constitutes a consistent set of opinions? How can we predict the opinions of experts on missing sub-domains? In this paper, we define a framework to analyze this problem, termed "expert graphs." In an expert graph, vertices repres...
Consider a set of classes and an uncertain input. Suppose, we do not have access to data and only have knowledge of perfect experts between a few classes in the set. What constitutes a consistent set of opinions? How can we use this to predict the opinions of experts on missing sub-domains? In this paper, we define a framework to analyze this probl...
We study a new representation of neural networks based on DOMINATION functions. Specifically, we show that a threshold function can be computed by its variables connected via an unweighted bipartite graph to a universal gate computing a DOMINATION function. The DOMINATION function consists of fixed weights that are ascending powers of 2. We derive...
The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string x ∈ {0,1}^n with high probability, where the samples are independently obtained by passing x through a random deletion channel with deletion probability q. The problem is receiving significant attention recently due to its applications in DNA se...
The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string x ∈ {0,1}^n with high probability, where the samples are independently obtained by passing x through a random deletion channel with deletion probability p. The problem is receiving significant attention recently...
The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string x ∈ {0, 1}^n with high probability, where the samples are independently obtained by passing x through a random deletion channel with deletion probability p. The problem is receiving significant attention recently due to its applications in DNA s...
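For readers unfamiliar with the channel model in the trace-reconstruction abstracts above, the following sketch simply generates traces by passing a binary string through a random deletion channel; the string length and deletion probability are illustrative.

```python
import random

def deletion_channel(x, p):
    """Delete each bit of x independently with probability p; return the trace."""
    return [b for b in x if random.random() > p]

random.seed(1)
x = [random.randint(0, 1) for _ in range(20)]
for _ in range(5):                       # five independent noisy samples (traces)
    print(''.join(map(str, deletion_channel(x, p=0.1))))
```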
Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datas...
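The density-ratio reweighting that the abstract says current approaches rely on can be illustrated with a toy 1-D example. Here both training and target densities are known Gaussians, so the ratio is exact; in practice it must be estimated, which is the source of the instability mentioned. This is background only, not the paper's proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu):                      # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)     # training samples from N(0, 1)
f = x ** 2                                 # statistic whose target-mean we want
w = gauss_pdf(x, 1.0) / gauss_pdf(x, 0.0)  # weights: target N(1,1) over training N(0,1)

print("naive training mean:", f.mean())                   # ~1 = E_{N(0,1)}[x^2]
print("reweighted estimate:", np.average(f, weights=w))   # ~2 = E_{N(1,1)}[x^2]
```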
Levenshtein introduced the problem of constructing k-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is O(k log N) for constant k, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy k-deletion correctin...
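The VT construction referenced above is concrete enough to show in a few lines: the code with syndrome a consists of the binary words whose weighted bit-sum is a mod (n+1), and it corrects a single deletion. The decoder below simply tries every reinsertion for clarity (the standard VT decoder is linear-time); parameters are illustrative.

```python
from itertools import product

def vt_syndrome(x):
    """Varshamov-Tenengolts syndrome: sum of i * x_i (1-indexed) mod (n + 1)."""
    return sum((i + 1) * b for i, b in enumerate(x)) % (len(x) + 1)

def vt_codewords(n, a=0):
    return [list(w) for w in product([0, 1], repeat=n) if vt_syndrome(list(w)) == a]

def correct_single_deletion(y, n, a=0):
    """Brute-force decoder: reinsert one bit in every way and keep the codeword."""
    candidates = {tuple(y[:i] + [b] + y[i:])
                  for i in range(n) for b in (0, 1)
                  if vt_syndrome(y[:i] + [b] + y[i:]) == a}
    assert len(candidates) == 1          # single-deletion correction is unique
    return list(candidates.pop())

x = vt_codewords(6)[5]                   # a codeword of VT_0(6)
y = x[:2] + x[3:]                        # one bit deleted
assert correct_single_deletion(y, 6) == x
```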
Evolution is a process of change where mutations in the viral RNA are selected based on their fitness for replication and survival. Given that current phylogenetic analysis of SARS-CoV-2 identifies new viral clades after they exhibit evolutionary selections, one wonders whether we can identify the viral selection and predict the emergence of new vi...
The channel model of encoding data as a set of unordered strings is receiving great attention as it captures the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we solve this open problem and present an order-wise optimal construction of codes t...
We introduce a general technique that we call syndrome compression, for designing low-redundancy error correcting codes. The technique allows us to boost the redundancy efficiency of hash/labeling-based codes by further compressing the labeling. We apply syndrome compression to different types of adversarial deletion channels and present code const...
The problem of constructing optimal multiple-deletion correcting codes had long been open until a recent breakthrough for the binary case. Yet comparatively less progress has been made in the non-binary setting, where the only rate-one non-binary deletion codes are Tenengolts' construction, which corrects a single deletion. In this paper, we present severa...
Systematic deletion correcting codes play an important role in applications of document exchange. Yet despite a series of recent advances made in deletion correcting codes, most of them are non-systematic. To the best of the authors' knowledge, the only known deterministic systematic t-deletion correcting code constructions with rate approaching 1...
Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset’s quality by a quantity we call the expected diameter, which measures the ex...
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writin...
The genome is traditionally viewed as a time-independent source of information; a paradigm that drives researchers to seek correlations between the presence of certain genes and a patient's risk of disease. This analysis neglects genomic temporal changes, which we believe to be a crucial signal for predicting an individual's susceptibility to cance...
Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment...
Deep neural networks (DNNs) typically have many weights. When errors appear in their weights, which are usually stored in non-volatile memories, their performance can degrade significantly. We review two recently presented approaches that improve the robustness of DNNs in complementary ways. In the first approach, we use error-correcting codes as e...
Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek et al., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of mag...
Levenshtein introduced the problem of constructing k-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is O(k log N), and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy k-deletion correcting codes r...
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processe...
Levenshtein introduced the problem of constructing k-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is O(k log N), and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy k-deletion correcting codes remaine...
We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems, or in a probabilistic setting, variou...
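As a rough illustration of the generative processes behind these string-duplication models, the sketch below repeatedly duplicates a randomly chosen substring in place; the substring length, number of steps, and uniform choice of position are assumptions made for illustration, not the exact Pólya string model.

```python
import random

def random_tandem_duplication(s, dup_len, steps, seed=0):
    """Grow a string by repeatedly duplicating, in place, a randomly chosen
    substring of length dup_len (a toy stand-in for a string-duplication model)."""
    random.seed(seed)
    s = list(s)
    for _ in range(steps):
        i = random.randrange(len(s) - dup_len + 1)
        s[i:i] = s[i:i + dup_len]        # insert a second copy next to the original
    return ''.join(s)

print(random_tandem_duplication("ACGT", dup_len=2, steps=5))
```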
One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, which in turn affects the reliability of reading and writing the data. The current proposal for combating deletions in racetrack memories is to use redundant heads per track, resulting in multiple copies (potentially erroneous)...
Lagrange Coded Computing (LCC) is a recently proposed technique for resilient, secure, and private computation of arbitrary polynomials in distributed environments. By mapping such computations to composition of polynomials, LCC allows the master node to complete the computation by accessing a minimal number of workers and downloading all of their...
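A minimal sketch of the Lagrange-coding idea for the toy computation f(X) = X^2 over a small prime field is given below; the field size, evaluation points, and function are assumptions made for illustration, and the resiliency, security, and privacy aspects of LCC are omitted.

```python
P = 257                                   # small prime field, chosen for illustration

def lagrange_eval(xs, ys, z):
    """Evaluate at z the unique polynomial through the points (xs[i], ys[i]) mod P."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (z - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

f = lambda v: v * v % P                   # the polynomial computed on each data block

data = [17, 42, 99]                       # data blocks X_1..X_k (k = 3)
alphas = [1, 2, 3]                        # points where u(z) interpolates the data
betas = [10, 11, 12, 13, 14]              # points handed to 5 workers

encoded = [lagrange_eval(alphas, data, b) for b in betas]   # worker j gets u(beta_j)
results = [f(e) for e in encoded]                           # each worker applies f

# f(u(z)) has degree deg(f)*(k-1) = 4, so 5 worker results determine it; reading it
# off at the alphas recovers f(X_i).
recovered = [lagrange_eval(betas, results, a) for a in alphas]
assert recovered == [f(x) for x in data]
```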
The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide a code construction for a single substitution that is...
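For context, the simplest way to cope with the lack of ordering in this channel is the indexing baseline sketched below, which prepends each string with its index so the decoder can restore the order; it does nothing about the substitution errors that the abstract's constructions handle, and all parameters are illustrative.

```python
import math

def encode_as_set(payloads):
    """Prepend a fixed-width binary index to each payload so order can be restored."""
    width = max(1, math.ceil(math.log2(len(payloads))))
    return {format(i, f'0{width}b') + p for i, p in enumerate(payloads)}

def decode_from_set(strings, num_strings):
    width = max(1, math.ceil(math.log2(num_strings)))
    ordered = sorted(strings, key=lambda s: int(s[:width], 2))
    return [s[width:] for s in ordered]

msgs = ['0110', '1111', '0001', '1010']
assert decode_from_set(encode_as_set(msgs), len(msgs)) == msgs
```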
Background: Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have...
The current paradigm in data science is based on the belief that given sufficient amounts of data, classifiers are likely to uncover the distinction between true and false hypotheses. In particular, the abundance of genomic data creates opportunities for discovering disease risk associations and helping in screening and treatment. However, working wit...
The genome is traditionally viewed as a time-independent source of information; a paradigm that drives researchers to seek correlations between the presence of certain genes and a patient's risk of disease. This analysis neglects genomic temporal changes, which we believe to be a crucial signal for predicting an individual's susceptibility to canc...
In this paper, we study a model, which was first presented by Bunte and Lapidoth, that mimics the programming operation of memory cells. Under this paradigm we assume that cells are programmed sequentially and individually. The programming process is modeled as transmission over a channel, while it is possible to read the cell state in order to det...
We (people) are memory machines. Our decision processes, emotions, and interactions with the world around us are based on and driven by associations to our memories. This natural association paradigm will become critical in future memory systems, namely, the key question will not be “How do I store more information?” but rather, “Do I have the rele...
The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be a...
We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact cap...
Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek et al., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of m...
When sensitive data is stored in the cloud, the only way to ensure its secrecy is by encrypting it before it is uploaded. The emerging multi-cloud model, in which data is stored redundantly in two or more independent clouds, provides an opportunity to protect sensitive data with secret-sharing schemes. Both data-protection approaches are considered...
Erwin Chargaff in 1950 made an experimental observation that the count of A is equal to the count of T and the count of C is equal to the count of G in DNA. This observation played a crucial role in the discovery of the double stranded helix structure by Watson and Crick. However, this symmetry was also observed in single stranded DNA. This phenome...
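The counting behind this observation is trivial to state in code; the sketch below tallies the four bases of a strand (a randomly generated string is used purely to show the bookkeeping, not as biological evidence of the rule).

```python
import random
from collections import Counter

random.seed(0)
strand = ''.join(random.choice('ACGT') for _ in range(100_000))   # placeholder sequence
counts = Counter(strand)
print('A:', counts['A'], ' T:', counts['T'])   # Chargaff's symmetry: these counts are close
print('C:', counts['C'], ' G:', counts['G'])   # in real double- and single-stranded DNA
```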
Encryption is a useful tool to protect data confidentiality. Yet it is still challenging to hide the very presence of encrypted, secret data from a powerful adversary. This paper presents a new technique to hide data in flash by manipulating the voltage level of pseudo-randomly selected flash cells to encode two bits (rather than one) in the cell. I...
A natural feature of molecular systems is their inherent stochastic behavior. A fundamental challenge related to the programming of molecular information processing systems is to develop a circuit architecture that controls the stochastic states of individual molecular events. Here we present a systematic implementation of probabilistic switching c...
Erwin Chargaff in 1950 made an experimental observation that the count of A is equal to the count of T and the count of C is equal to the count of G in DNA. This observation played a crucial role in the discovery of the double stranded helix structure by Watson and Crick. However, this symmetry was also observed in single stranded DNA. This phenome...
This work studies the Stopping-Set Elimination Problem, namely, given a stopping set, how to remove the fewest erasures so that the remaining erasures can be decoded by belief propagation in k iterations (including k = ∞). The NP-hardness of the problem is proven. An approximation algorithm is presented for k = 1, and efficient exact algorithms are...
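A stopping set is exactly an erasure pattern on which iterative (peeling) erasure decoding stalls; the sketch below implements that peeling process for even-parity checks on a tiny illustrative graph, not the elimination algorithms of the paper.

```python
def peel(checks, values, erased):
    """Iterative erasure decoding: any even-parity check with exactly one erased
    variable recovers it; repeat until no progress.  Returns the still-erased set."""
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for chk in checks:
            unknown = [v for v in chk if v in erased]
            if len(unknown) == 1:
                v = unknown[0]
                values[v] = sum(values[u] for u in chk if u != v) % 2
                erased.discard(v)
                progress = True
    return erased

# Variables 0..3 with parity checks {0,1,2} and {1,2,3}; the codeword is 1,0,1,1.
checks = [[0, 1, 2], [1, 2, 3]]
print(peel(checks, {0: 1, 3: 1}, erased={1, 2}))        # {1, 2} is a stopping set: stalls
print(peel(checks, {0: 1, 1: 0, 3: 1}, erased={2}))     # set(): the erasure is recovered
```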
The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat that may be generated from AGTCTGC by a tandem duplication of length 2. In this work,...
This paper studies the problem of repairing secret sharing schemes, i.e., schemes that encode a message into n shares, assigned to n nodes, so that any n - r nodes can decode the message but any colluding z nodes cannot infer any information about the message. In the event of node failures so that shares held by the failed nodes are lost, the...
We study secure RAID, i.e., low-complexity schemes to store information in a distributed manner that is resilient to node failures and resistant to node eavesdropping. We describe a technique to shorten the secure EVENODD scheme in [6], which can optimally tolerate 2 node failures and 2 eavesdropping nodes. The shortening technique allows us to obt...
This paper studies the communication efficiency of threshold secret sharing schemes. We construct a family of Shamir's schemes with asymptotically optimal decoding bandwidth for arbitrary parameters. We also construct a family of secret sharing schemes with both optimal decoding and optimal repair bandwidth for arbitrary parameters. The constructio...
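For readers unfamiliar with the underlying scheme, here is a minimal sketch of Shamir's t-out-of-n threshold secret sharing over a prime field; the field and parameters are illustrative, and the bandwidth-optimal decoding and repair that the abstract is actually about are not shown.

```python
import random

P = 2 ** 31 - 1                    # prime field (illustrative choice)

def share(secret, n, t, seed=0):
    """Shares are evaluations of a random degree-(t-1) polynomial whose constant
    term is the secret."""
    random.seed(seed)
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    poly = lambda x: sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(i, poly(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 from any t shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = share(secret=123456789, n=6, t=3)
assert reconstruct(shares[:3]) == 123456789
assert reconstruct(shares[2:5]) == 123456789
```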
Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into acco...
Network switches and routers scale in rate by distributing the packet read/write operations across multiple memory banks. Rate scaling is achieved so long as sufficiently many packets can be written and read in parallel. However, due to the non-determinism of the read process, parallel pending read requests may contend on memory banks, and thus sig...
Maximum distance separable (MDS) array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with r redundancy nodes can correct any r node erasures by accessing (reading) all the remaining information in the surviving nodes. However, in practice, e erasures are a more likely f...
For the storage of big data, there are significant challenges with its long-term reliability. This paper studies how to use the natural redundancy in data for error correction, and how to combine it with error-correcting codes to effectively improve data reliability. It explores several aspects of natural redundancy, including the discovery of natu...
This paper studies the communication efficiency of threshold secret sharing schemes. We construct a family of Shamir’s schemes with asymptotically optimal decoding bandwidth for arbitrary parameters. We also construct a family of secret sharing schemes with both optimal decoding bandwidth and optimal repair bandwidth for arbitrary parameters. The c...
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form x = abc → y = abbc, where x and y are sequences and a, b, and c are their subs...
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and t...
A secret sharing scheme is a method to store information securely and reliably. Particularly, in a threshold secret sharing scheme, a secret is encoded into n shares, such that any set of at least t_1 shares suffice to decode the secret, and any set of at most t_2 < t_1 shares reveal no information about the secret. Assuming that each party holds a...
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form x = abc → y = abbc, where x and y are sequences and a, b, and c are their substrings, needed to generate a binary sequence of length n starting from a square-free sequ...
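The de-duplication direction can be illustrated directly: repeatedly find a square bb and delete one copy of b until the string is square-free. The greedy sketch below produces a root of the input; the duplication distance itself (the quantity studied above) is the harder minimization and is not computed here.

```python
def remove_one_square(s):
    """Find a substring of the form bb and delete one copy of b, if any exists."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        for i in range(n - 2 * length + 1):
            if s[i:i + length] == s[i + length:i + 2 * length]:
                return s[:i + length] + s[i + 2 * length:]
    return None

def root(s):
    """Greedily remove squares until the string is square-free."""
    while (t := remove_one_square(s)) is not None:
        s = t
    return s

print(root("0110110"))   # de-duplicates to a short square-free string ("010")
```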
In distributed storage, a file is stored in a set of nodes and protected by erasure-correcting codes. A regenerating code is a type of code with two properties: first, it can reconstruct the entire file in the presence of any r node erasures for some specified integer r; second, it can efficiently repair an erased node from any subset of remaining n...
We propose efficient coding schemes for two communication settings: 1) asymmetric channels and 2) channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In a...
We study the tandem duplication distance between binary sequences and their roots. This distance is motivated by genomic tandem duplication mutations and counts the smallest number of tandem duplication events that are required to take one sequence to another. We consider both exact and approximate tandem duplications, the latter leading to a combi...
We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for...
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected...
We propose secure RAID, i.e., low-complexity schemes to store information in a distributed manner that is resilient to node failures and resistant to node eavesdropping. We generalize the concept of systematic encoding to secure RAID and show that systematic schemes have significant advantages in the efficiencies of encoding, decoding and random ac...
Multi-permutations, and in particular permutations, appear in various applications in information theory. New applications, such as rank modulation for flash memories, have suggested the need to consider error-correcting codes for multi-permutations. In this paper, we study systematic error-correcting codes for multi-permutations in general and fo...