Conference Paper

Towards learning-based image compression for storage on DNA support

Conference Paper
Full-text available
The exponentially increasing demand for data storage has been facing more and more challenges over the past years. The energy cost it represents is also increasing, and the availability of storage hardware cannot keep pace with the demand. The short lifespan of conventional storage media (10 to 20 years) forces the duplication of hardware and worsens the situation. The majority of this storage demand concerns "cold" data, i.e. data that is very rarely accessed but has to be kept for long periods of time. The coding abilities of synthetic DNA and its long durability (several hundred years) make it a serious candidate as an alternative storage medium for "cold" data. In this paper, we propose a variable-length coding algorithm adapted to DNA data storage with improved performance. The proposed algorithm is based on a modified Shannon-Fano code that respects biochemical constraints imposed by the synthesis chemistry. We inserted this code into a JPEG compression algorithm adapted to DNA image storage and observed an improvement of the compression ratio ranging from 0.5 up to 2 bits per nucleotide compared to the state-of-the-art solution, without affecting the reconstruction quality.
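As a hedged illustration of the kind of biochemical constraints mentioned above (the specific limits used here, a maximum homopolymer run of 3 and a 40-60% GC content, are illustrative assumptions, not the paper's actual constraint set), the sketch below enumerates quaternary codewords that satisfy such restrictions:

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    longest, run = 1, 1
    for prev, cur in zip(strand, strand[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def gc_content(strand: str) -> float:
    """Fraction of G/C nucleotides in the strand."""
    return sum(strand.count(n) for n in "GC") / len(strand)

def constrained_words(length, max_run=3, gc_range=(0.4, 0.6)):
    """Enumerate quaternary words of the given length that respect two
    commonly cited synthesis constraints: bounded homopolymer runs and
    roughly balanced GC content."""
    words = []
    for tup in product(NUCLEOTIDES, repeat=length):
        word = "".join(tup)
        if (max_homopolymer_run(word) <= max_run
                and gc_range[0] <= gc_content(word) <= gc_range[1]):
            words.append(word)
    return words

# Example: count how many 6-nt codewords survive the constraints
# (out of 4**6 = 4096 unconstrained candidates).
print(len(constrained_words(6)))
```

A variable-length construction such as the modified Shannon-Fano code described in the abstract would then draw its codewords from a constrained pool of this kind.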
Article
Full-text available
Synthetic DNA is a growing alternative to electronic-based technologies in fields such as data storage, product tagging, or signal processing. Its value lies in its characteristic attributes, namely Watson-Crick base pairing, array synthesis, sequencing, toehold displacement and polymerase chain reaction (PCR) capabilities. In this review, we provide an overview of the most prevalent applications of synthetic DNA that could shape the future of information technology. We emphasize the reasons why the biomolecule can be a valuable alternative for conventional electronic-based media, and give insights on where the DNA-analog technology stands with respect to its electronic counterparts.
Article
Full-text available
Background: DNA is a promising storage medium for high-density, long-term digital data storage. Since DNA synthesis and sequencing are still relatively expensive, the coding methods used to store digital data in DNA should correct errors and avoid unstable or error-prone DNA sequences. Near-optimal rateless erasure codes, also called fountain codes, are particularly interesting candidates for high-capacity, low-error DNA storage systems, as shown by Erlich and Zielinski in their approach based on the Luby transform (LT) code. Since LT is the most basic fountain code, there is a large untapped potential for improvement in using near-optimal erasure codes for DNA storage. Results: We present NOREC4DNA, a software framework to use, test, compare, and improve near-optimal rateless erasure codes (NORECs) for DNA storage systems. These codes can effectively be used to store digital information in DNA and cope with the restrictions of the DNA medium. Additionally, they can adapt to possible variable lengths of DNA strands and have nearly zero overhead. We describe the design and implementation of NOREC4DNA. Furthermore, we present experimental results demonstrating that NOREC4DNA can flexibly be used to evaluate the use of NORECs in DNA storage systems. In particular, we show that NORECs that apparently have not yet been used for DNA storage, such as Raptor and Online codes, can achieve significant improvements over the LT codes used in previous work. NOREC4DNA is available at https://github.com/umr-ds/NOREC4DNA. Conclusion: NOREC4DNA is a flexible and extensible software framework for using, evaluating, and comparing NORECs for DNA storage systems.
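For orientation, here is a minimal sketch of the LT-style (fountain) encoding that such frameworks build on; the uniform degree distribution and the block/packet sizes are simplifications for illustration, not NOREC4DNA's implementation:

```python
import os
import random

def lt_encode_packet(blocks, seed):
    """Produce one LT-style output packet: the XOR of a pseudo-randomly
    chosen subset of equally sized source blocks. Transmitting the seed
    lets the decoder regenerate the same subset. The degree distribution
    here is uniform for brevity; practical codes use a robust soliton
    distribution (LT) or add a precode (Raptor, Online codes)."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(blocks))
    chosen = rng.sample(range(len(blocks)), degree)
    packet = bytearray(len(blocks[0]))
    for idx in chosen:
        for i, byte in enumerate(blocks[idx]):
            packet[i] ^= byte
    return seed, bytes(packet)

# Example: split 1 KiB of data into 32-byte blocks and emit 40 packets;
# a sufficiently large subset of packets lets a peeling decoder recover
# the source blocks with high probability.
data = os.urandom(1024)
blocks = [data[i:i + 32] for i in range(0, len(data), 32)]
packets = [lt_encode_packet(blocks, seed) for seed in range(40)]
```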
Conference Paper
Full-text available
We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate-distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods. More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.
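As a hedged sketch of the training objective described above (the module names, the MSE distortion and the lambda value are placeholders, not the paper's exact formulation), quantization can be relaxed to additive uniform noise so that the rate-distortion loss stays differentiable:

```python
import torch

def rate_distortion_loss(x, analysis, synthesis, entropy_model, lmbda=0.01):
    """One evaluation of the relaxed rate-distortion objective.
    `analysis`, `synthesis` and `entropy_model` are placeholder modules:
    analysis(x) -> latent y, synthesis(y_noisy) -> reconstruction, and
    entropy_model(y_noisy) -> per-element likelihoods under the learned prior."""
    y = analysis(x)
    # Differentiable proxy for quantization: additive uniform noise in [-0.5, 0.5).
    y_noisy = y + (torch.rand_like(y) - 0.5)
    x_hat = synthesis(y_noisy)

    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(entropy_model(y_noisy)).sum() / num_pixels  # estimated bits per pixel
    distortion = torch.mean((x - x_hat) ** 2)                          # MSE distortion
    return rate_bpp + lmbda * 255 ** 2 * distortion                    # lambda sets the trade-off point
```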
Article
Full-text available
LT codes are a new class of codes introduced by Luby for the purpose of scalable and fault-tolerant distribution of data over computer networks. In this paper, we introduce Raptor codes, an extension of LT codes with linear-time encoding and decoding. We exhibit a class of universal Raptor codes: for a given integer k and any real ε > 0, Raptor codes in this class produce a potentially infinite stream of symbols such that any subset of symbols of size k(1 + ε) is sufficient to recover the original k symbols with high probability. Each output symbol is generated using O(log(1/ε)) operations, and the original symbols are recovered from the collected ones with O(k log(1/ε)) operations. We also introduce novel techniques for the analysis of the error probability of the decoder for finite-length Raptor codes. Moreover, we introduce and analyze systematic versions of Raptor codes, i.e., versions in which the first output elements of the coding system coincide with the original k elements.
Article
Full-text available
Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling up, used no robust error correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 × 10^6 bits, into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
Article
Full-text available
Digital information is accumulating at an astounding rate, straining our ability to store and archive it. DNA is among the most dense and stable information media known. The development of new technologies in both DNA synthesis and sequencing make DNA an increasingly feasible digital storage medium. We developed a strategy to encode arbitrary digital information in DNA, wrote a 5.27-megabit book using DNA microchips, and read the book by using next-generation DNA sequencing.
Article
Full-text available
We introduce online codes – a class of near-optimal codes for a very general loss channel which we call the free channel. Online codes are linear encoding/decoding time codes, based on sparse bipartite graphs, similar to Tornado codes, with a couple of novel properties: local encodability and rateless-ness. Local encodability is the property that each block of the encoding of a message can be computed independently from the others in constant time. This also implies that each encoding block is only dependent on a constant-sized part of the message and a few preprocessed bits. Rateless-ness is the property that each message has an encoding of practically infinite size. We argue that rateless codes are more appropriate than fixed-rate codes for most situations where erasure codes were considered a solution. Furthermore, rateless codes meet new areas of application, where they are not replaceable by fixed-rate codes. One such area is information dispersal over peer-to-peer networks.
Conference Paper
Full-text available
LT codes are the first realization of a class of erasure codes called universal erasure codes. LT codes are universal in the sense that they are simultaneously near optimal for every erasure channel and they are very efficient as the data length grows. The key to the design and analysis of the LT codes is the introduction and analysis of the LT process. A full analysis of the behavior of the LT process using first principles of probability theory precisely captures the behavior of the data recovery process.
Article
Full-text available
Many state-of-the-art perceptual image quality assessment (IQA) algorithms share a common two-stage structure: local quality/distortion measurement followed by pooling. While significant progress has been made in measuring local image quality/distortion, the pooling stage is often done in ad-hoc ways, lacking theoretical principles and reliable computational models. This paper aims to test the hypothesis that when viewing natural images, the optimal perceptual weights for pooling should be proportional to local information content, which can be estimated in units of bit using advanced statistical models of natural images. Our extensive studies based upon six publicly-available subject-rated image databases concluded with three useful findings. First, information content weighting leads to consistent improvement in the performance of IQA algorithms. Second, surprisingly, with information content weighting, even the widely criticized peak signal-to-noise-ratio can be converted to a competitive perceptual quality measure when compared with state-of-the-art algorithms. Third, the best overall performance is achieved by combining information content weighting with multiscale structural similarity measures.
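Schematically, the pooling hypothesis tested in that work amounts to weighting each local quality score by its estimated information content (generic notation, not the paper's exact symbols):

\[
Q = \frac{\sum_{j} w_j\, q_j}{\sum_{j} w_j}, \qquad w_j \propto I_j ,
\]

where \(q_j\) is the local quality/distortion measurement at location \(j\) and \(I_j\) is the local information content, in bits, estimated from a statistical model of natural images.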
Conference Paper
Full-text available
The structural similarity image quality paradigm is based on the assumption that the human visual system is highly adapted for extracting structural information from the scene, and therefore a measure of structural similarity can provide a good approximation to perceived image quality. This paper proposes a multiscale structural similarity method, which supplies more flexibility than previous single-scale methods in incorporating the variations of viewing conditions. We develop an image synthesis method to calibrate the parameters that define the relative importance of different scales. Experimental comparisons demonstrate the effectiveness of the proposed method.
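In compact form, the multiscale index combines the SSIM comparison terms computed after successive low-pass filtering and downsampling by a factor of two (the exponent notation follows the usual presentation of MS-SSIM):

\[
\text{MS-SSIM}(x, y) = \left[l_M(x, y)\right]^{\alpha_M} \prod_{j=1}^{M} \left[c_j(x, y)\right]^{\beta_j} \left[s_j(x, y)\right]^{\gamma_j},
\]

where \(l_M\), \(c_j\) and \(s_j\) are the luminance, contrast and structure comparison terms at scale \(j\), \(M\) is the coarsest scale, and the exponents encode the relative importance of each scale, calibrated by the image-synthesis experiment described above.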
Conference Paper
Over the past years, the ever-growing demand for data storage, more specifically for "cold" (i.e. rarely accessed) data, has motivated research into alternative storage systems. Because of their biochemical characteristics, synthetic DNA molecules are now considered serious candidates for this new kind of storage. This paper introduces a novel arithmetic coder for DNA data storage and presents results on a lossy JPEG 2000-based image compression method adapted to DNA data storage that uses this novel coder. The DNA coding algorithms presented here have been designed to efficiently compress images, encode them into a quaternary code, and finally store them in synthetic DNA molecules. This work also aims at making the compression models better fit the constraints encountered when storing data in DNA, namely the fact that DNA writing, storage and reading are error-prone processes. The main takeaway of this work is our arithmetic coder and its integration into a performant image codec.
Article
Molecular data storage is an attractive alternative for dense and durable information storage, which is sorely needed to deal with the growing gap between information production and the ability to store data. DNA is a clear example of effective archival data storage in molecular form. In this Review, we provide an overview of the process, the state of the art in this area and challenges for mainstream adoption. We also survey the field of in vivo molecular memory systems that record and store information within the DNA of living cells, which, together with in vitro DNA data storage, lie at the growing intersection of computer systems and biotechnology. Throughout evolution, DNA has been the primary medium of biological information storage. In this article, Ceze, Nivala and Strauss discuss how DNA can be adopted as a storage medium for custom data, as a potential future complement to current data storage media such as computer hard disks, optical disks and tape. They discuss strategies for coding, decoding and error correction and give examples of implementation both in vitro and in vivo.
Article
We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs, but largely unexplored in image compression using artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a complex prior jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR). Furthermore, we provide a qualitative comparison of models trained for different distortion metrics.
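Hyperprior models of this kind are available in the CompressAI library cited in the reference list below; the following sketch follows its documented usage pattern (the model choice, quality level and random input are assumptions for illustration):

```python
import math
import torch
from compressai.zoo import bmshj2018_hyperprior  # scale-hyperprior model from this paper

# Load a pretrained scale-hyperprior model (quality levels trade bitrate against distortion).
net = bmshj2018_hyperprior(quality=3, pretrained=True).eval()

x = torch.rand(1, 3, 256, 256)  # stand-in for an RGB image normalized to [0, 1]
with torch.no_grad():
    out = net(x)                # forward pass returns the reconstruction and latent likelihoods

x_hat = out["x_hat"].clamp(0, 1)
num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
# Estimated rate in bits per pixel, summing the latent and hyper-latent contributions.
bpp = sum(torch.log(l).sum() for l in out["likelihoods"].values()) / (-math.log(2) * num_pixels)
```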
Article
A reliable and efficient DNA storage architecture. DNA has the potential to provide large-capacity information storage. However, current methods have only been able to use a fraction of the theoretical maximum. Erlich and Zielinski present a method, DNA Fountain, which approaches the theoretical maximum for information stored per nucleotide. They demonstrated efficient encoding of information, including a full computer operating system, into DNA that could be retrieved at scale after multiple rounds of polymerase chain reaction. Science, this issue, p. 950.
Article
We present an image quality metric based on the transformations associated with the early visual system: local luminance subtraction and local gain control. Images are decomposed using a Laplacian pyramid, which subtracts a local estimate of the mean luminance at multiple scales. Each pyramid coefficient is then divided by a local estimate of amplitude (weighted sum of absolute values of neighbors), where the weights are optimized for prediction of amplitude using (undistorted) images from a separate database. We define the quality of a distorted image, relative to its undistorted original, as the root mean squared error in this “normalized Laplacian” domain. We show that both luminance subtraction and amplitude division stages lead to significant reductions in redundancy relative to the original image pixels. We also show that the resulting quality metric provides a better account of human perceptual judgements than either MS-SSIM or a recently-published gain-control metric based on oriented filters.
Article
We report on a strong capacity boost in storing digital data in synthetic DNA. In principle, synthetic DNA is an ideal medium for archiving digital data over very long times, because its achievable data density and longevity far outperform today's digital data storage media. On the other hand, neither the synthesis nor the amplification and sequencing of DNA strands can be performed error-free today or in the foreseeable future. In order to make synthetic DNA available as a digital data storage medium, specifically tailored forward error correction schemes have to be applied. For the purpose of realizing DNA data storage, we have developed an efficient and robust forward error-correcting scheme adapted to the DNA channel. We based the design of the required DNA channel model on data from a proof of concept conducted in 2012 by a team from the Harvard Medical School [1]. Our forward error correction scheme is able to cope with all error types of today's DNA synthesis, amplification and sequencing processes, e.g. insertion, deletion, and swap errors. In a recent successful experiment, we were able to store and retrieve 22 MByte of digital data in synthetic DNA error-free. The residual error probability found is already of the same order as that of hard disk drives and can easily be improved further. This proves the feasibility of using synthetic DNA as a long-term digital data storage medium.
Article
Information, such as text printed on paper or images projected onto microfilm, can survive for over 500 years. However, the storage of digital information for time frames exceeding 50 years is challenging. Here we show that digital information can be stored on DNA and recovered without errors for considerably longer time frames. To allow for the perfect recovery of the information, we encapsulate the DNA in an inorganic matrix, and employ error-correcting codes to correct storage-related errors. Specifically, we translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica. Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions. The original information could be recovered error free, even after treating the DNA in silica at 70 °C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.
Article
Image quality assessment (IQA) aims to use computational models to measure image quality consistently with subjective evaluations. The well-known structural similarity index brought IQA from the pixel-based to the structure-based stage. In this paper, a novel feature similarity (FSIM) index for full-reference IQA is proposed, based on the fact that the human visual system (HVS) understands an image mainly according to its low-level features. Specifically, phase congruency (PC), which is a dimensionless measure of the significance of a local structure, is used as the primary feature in FSIM. Considering that PC is contrast invariant while contrast information does affect the HVS's perception of image quality, the image gradient magnitude (GM) is employed as the secondary feature in FSIM. PC and GM play complementary roles in characterizing the local image quality. After obtaining the local quality map, we use PC again as a weighting function to derive a single quality score. Extensive experiments performed on six benchmark IQA databases demonstrate that FSIM achieves much higher consistency with subjective evaluations than state-of-the-art IQA metrics.
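Schematically, the local similarity combines the phase-congruency and gradient-magnitude terms, and phase congruency is reused as the pooling weight (exponents on the two terms are set to 1 in the paper's default configuration):

\[
S_L(\mathbf{x}) = S_{PC}(\mathbf{x}) \cdot S_{G}(\mathbf{x}),
\qquad
\mathrm{FSIM} = \frac{\sum_{\mathbf{x} \in \Omega} S_L(\mathbf{x})\, PC_m(\mathbf{x})}{\sum_{\mathbf{x} \in \Omega} PC_m(\mathbf{x})},
\]

where \(PC_m(\mathbf{x})\) is the larger of the two images' phase congruency values at location \(\mathbf{x}\) and \(\Omega\) is the spatial image domain.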
Article
Measurement of visual quality is of fundamental importance to numerous image and video processing applications. The goal of quality assessment (QA) research is to design algorithms that can automatically assess the quality of images or videos in a perceptually consistent manner. Image QA algorithms generally interpret image quality as fidelity or similarity with a "reference" or "perfect" image in some perceptual space. Such "full-reference" QA methods attempt to achieve consistency in quality prediction by modeling salient physiological and psychovisual features of the human visual system (HVS), or by signal fidelity measures. In this paper, we approach the image QA problem as an information fidelity problem. Specifically, we propose to quantify the loss of image information to the distortion process and explore the relationship between image information and visual quality. QA systems are invariably involved with judging the visual quality of "natural" images and videos that are meant for "human consumption." Researchers have developed sophisticated models to capture the statistics of such natural signals. Using these models, we previously presented an information fidelity criterion for image QA that related image quality with the amount of information shared between a reference and a distorted image. In this paper, we propose an image information measure that quantifies the information that is present in the reference image and how much of this reference information can be extracted from the distorted image. Combining these two quantities, we propose a visual information fidelity measure for image QA. We validate the performance of our algorithm with an extensive subjective study involving 779 images and show that our method outperforms recent state-of-the-art image QA algorithms by a sizeable margin in our simulations. The code and the data from the subjective study are available at the LIVE website.
DNA-based Media Storage: State-of-the-Art, Challenges, Use Cases and Requirements, version 8.0
  • Antonini
Analysis of the influence of errors in DNA-based image coding
  • Ramos
Image storage on synthetic DNA using autoencoders
  • Pic
Toward a practical perceptual video quality metric
  • Li
Architecting datacenters for sustainability: greener data storage using synthetic DNA
  • Nguyen
Joint autoregressive and hierarchical priors for learned image compression
  • Minnen
High-fidelity generative image compression
  • Mentzer
CompressAI: a PyTorch library and evaluation platform for end-to-end compression research
  • Bégaint