Article

A Method for the Construction of Minimum-Redundancy Codes

Authors:
David A. Huffman

Abstract

An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
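The construction underlying the paper is the familiar bottom-up merging of the two least probable messages. Below is a minimal Python sketch of that procedure; the symbol probabilities are purely illustrative and the helper name huffman_code is ours, not the paper's.

import heapq
from itertools import count

def huffman_code(probs):
    """Build a binary minimum-redundancy (Huffman) code.

    probs: dict mapping symbol -> probability (or frequency).
    Returns: dict mapping symbol -> binary codeword string.
    """
    if len(probs) == 1:
        return {s: "0" for s in probs}
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Illustrative ensemble of messages and their probabilities.
probs = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)

For any input distribution, the resulting average codeword length lies between the entropy of the ensemble and the entropy plus one bit, which is what makes the code minimum-redundancy among prefix codes.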

... The classical problem of determining the value of a discrete r.v. X by asking general binary questions is well studied in information theory and source coding, going back to Shannon [3] and Huffman [4]. As is well known, this problem leads to the notion of Shannon entropy H(X) as the essential fundamental limit for the minimal number of questions required on average to describe a single copy of X, and as the exact number of questions per instance (with high probability) required to describe i.i.d. ...
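To make the role of H(X) concrete, the short sketch below (the distribution is illustrative, chosen by us) computes the entropy and the Shannon code lengths ⌈−log₂ p⌉, whose expected value always falls between H(X) and H(X) + 1.

import math

def entropy(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative distribution for X.
probs = [0.5, 0.25, 0.125, 0.125]

H = entropy(probs)
# Shannon code lengths l_i = ceil(-log2 p_i) satisfy H <= E[l] < H + 1.
lengths = [math.ceil(-math.log2(p)) for p in probs]
avg = sum(p * l for p, l in zip(probs, lengths))
print(H, avg)  # for dyadic probabilities the two coincide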
... To that end, we only need to take into account cross-pair edges, since edges within each pair are in the cut both before and after the transformation. Consider two generic pairs (1,2) and (3,4) that are in a C-pair and C̄-pair position, respectively (they need not be adjacent). Let us write W_CC̄ and W_CC to denote the total weight of edges between the distinct pairs before and after we transform the pair (3,4) into a C-pair, respectively. ...
... Edges between all 4 members of the C, C-pairs are positive. Since the edges (1,3) and (2,4) are overlapping, by additivity there is no change in the weight: ...
Preprint
Alice holds a random variable X, and Bob is trying to guess its value by asking questions of the form "is X=x?". Alice answers truthfully and the game terminates once Bob guesses correctly. Before the game begins, Bob is allowed to reach out to an oracle, Carole, and ask her any yes/no question, i.e., a question of the form "is X ∈ A?". Carole is known to lie with a given probability p. What should Bob ask Carole if he would like to minimize his expected guessing time? When Carole is always truthful (p=0), it is not difficult to check that Bob should order the symbol probabilities in descending order and ask Carole whether the index of X w.r.t. this order is even or odd. We show that this strategy is almost optimal for any lying probability p, up to a small additive constant upper bounded by 1/4. We discuss a connection to the cutoff rate of the BSC with feedback.
... In the classical setting, to search for the optimal code one has to find the set of integers {ℓ_i} that minimizes the average length subject to the Kraft-McMillan inequality. It is well known that the Huffman code provides the optimal solution [21]. Let us see that the quantum optimal code, or the quantum version of the Huffman code, is obtained for an encoding scheme U with basis given by the eigenstates of ρ and the classical code c given by the Huffman code for the symbols {1, . . . ...
... where {c_opt(i)} is the classical optimal code given by the Huffman code [21] of the symbols {1, . . . , d} with corresponding probabilities {ρ_1, . . . ...
... To take integer values, one can again consider the excess integer part of these values, ℓ_i = ⌈− log_k ρ_{t,i}⌉, and construct a corresponding code using the Kraft tree, that is, the Shannon code corresponding to the escort probabilities {ρ_{t,i}}. However, independently of the explicit expression of the generalized optimal code (20), it is possible to upper and lower bound the optimal t-exponential average quantum codeword length (21) in terms of the quantum Rényi entropy of the source. ...
Preprint
Based on the problem of lossless quantum data compression, we present here an operational interpretation for the family of quantum Rényi entropies. In order to do this, we appeal to a very general quantum encoding scheme that satisfies a quantum version of the Kraft-McMillan inequality. Then, in the standard situation, where one intends to minimize the usual average length of the quantum codewords, we recover the known results, namely that the von Neumann entropy of the source bounds the average length of the optimal codes. Otherwise, we show that by invoking an exponential average length, related to an exponential penalization of long codewords, the quantum Rényi entropies arise as the natural quantities relating the optimal encoding schemes with the source description, playing a role analogous to that of the von Neumann entropy.
... Lemma 2 (Huffman [3] p. 1099). For any source, if a prefix code is optimal, then it is complete and monotone. ...
... Huffman codes [3] were invented in 1952 and are used today in many practical data compression applications, such as for text, audio, image, and video coding, and are known to be optimal [1]. ...
... Figure 1 depicts the main result relating to Theorem 1, along with known prior art (Huffman [3], Gallager [13], Manickman [12]). ...
Article
Full-text available
A property of prefix codes called strong monotonicity is introduced, and it is proven that for a given source, a prefix code is optimal if and only if it is complete and strongly monotone.
... If the channel varies slightly, most changes are close to zero. Then, they use exponential distribution mapping, which is a sort of Huffman coding [5] and encodes small values with short codewords, thus improving compression. ...
... These differential values ∆ are compressed using a variable-length prefix code. Although the Huffman code is optimal [5], it is complex, while the exponential distribution encoder [2] has lower complexity and achieves similar gains. This encoder assumes high probabilities for small ∆ and low probabilities for large ∆. ...
Article
Full-text available
The multiple-input multiple-output (MIMO) technology improves Wi-Fi throughput by increasing the number of antennas. However, with more antennas and developing coordinated MIMO operations, the amount of channel state information (CSI) increases. As the CSI-induced overhead limits the performance of Wi-Fi networks, the paper proposes a new low-complexity dual differential CSI compression method that significantly reduces this overhead. Experimental and simulation results in diverse scenarios show that the proposed algorithm outperforms state-of-the-art algorithms of similar complexity, reducing the feedback size by up to 40%.
... Previous schemes for recompressing the VQ index table perform poorly on texture images. To achieve a high compression rate for the index table, we propose a scheme that combines principal component analysis (PCA) [18] and Huffman coding [19]. The real challenge in compressing texture images is finding the right balance between compression efficiency and preserving quality. ...
... Huffman coding is a common encoding algorithm used for data compression. The algorithm was developed by David A. Huffman [19] back in 1952 to represent data by variable-length codes. The key concept of the method is to represent frequently used symbols with fewer bits. ...
Article
Full-text available
With the development of the information age, all walks of life are inseparable from the internet. Every day, huge amounts of data are transmitted and stored on the internet. Therefore, to improve transmission efficiency and reduce storage occupancy, compression technology is becoming increasingly important. Depending on the application scenario, it is divided into lossless data compression and lossy data compression, the latter allowing a certain degree of information loss. Vector quantization (VQ) is a widely used lossy compression technology. Building upon VQ compression technology, we propose a lossless compression scheme for the VQ index table. In other words, our work aims to recompress the output of VQ compression and restore it to the VQ compression carrier without loss. It is worth noting that our method specifically targets texture images. By leveraging the spatial symmetry inherent in these images, our approach generates high-frequency symbols through difference calculations, which facilitates the use of adaptive Huffman coding for efficient compression. Experimental results show that our scheme has better compression performance than other schemes.
... Huffman coding (HUFF) matches each element, depending on its probability (frequency of occurrence), with a prefix code [12] [13] [14] [15]. This coding is most often performed in the following sequence: the probabilities (frequencies) of the individual elements are calculated; the elements are arranged in descending order of probability; the two elements with the lowest probabilities (most often the last ones in the list) are iteratively combined into a single element, with the code 0 added to the first and 1 to the second; the probabilities of the selected elements are summed to give the probability of the newly formed element, which is inserted into the sorted list of probabilities; finally, the HUFF codes are formed by writing the accumulated codes in reverse order, from the top down to each element. ...
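As a complement to the construction described above, the short sketch below shows how any such prefix code is decoded unambiguously by reading bits until a codeword matches; the code table is a hypothetical example of the kind of output the procedure produces.

def decode(bits, code):
    """Decode a bit string with a prefix code given as symbol -> codeword."""
    inverse = {w: s for s, w in code.items()}  # codeword -> symbol
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:        # prefix-freeness: at most one match
            out.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

# Hypothetical prefix-code table.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
encoded = "".join(code[s] for s in "abacad")
print(encoded, "->", decode(encoded, code))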
... To speed up the development of graphic formats, data compression formats that combine proven context-dependent and context-independent algorithms are created, improved, and used. One successful example of such a combination is the DEFLATE dictionary compression format [21], which processes the incoming stream with the context-dependent LZ77 algorithm [7] and then compresses its output with dynamic Huffman codes [12]. This format is used in many popular archivers (for example, GZIP [22]) and graphic formats (for example, PNG [23]) and does not require licenses for use in application software. ...
... Another way of compressing a WT is to use a prefix-free variable-length encoding for the symbols. For example, the Huffman [50] code can be used to build a Huffman-shaped WT [51], where the tree is no longer balanced. The size reduces to n(H₀(S) + 1) + o(n(H₀(S) + 1)) + O(σ log n), and the average time becomes O(H₀(S)) for rank, access, and select (worst-case time is still O(log σ) [52]). ...
Preprint
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users' trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments. CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu-Tucker-shaped Wavelet-tree or a Wavelet Matrix that solves range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50-70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1-1000 microseconds) of CTR.
... For example, in the case N = 2 with q₁ > 1/2 and q₂ > (1 − q₁)/q₁, both procedures D′ and S are preferred to individual testing and are optimal. The optimality follows from the fact that both procedures are equivalent to the optimal prefix Huffman code (Huffman, 1952) with expected length L(N), N = 2. For N ≥ 3, the optimum group testing strategy does not attain L(N). ...
Preprint
Group testing is a useful method that has broad applications in medicine, engineering, and even in airport security control. Consider a finite population of N items, where item i has probability p_i of being defective. The goal is to identify all items by means of group testing. This is the generalized group testing problem. The optimum procedure, with respect to the expected total number of tests, is unknown even in the case when all p_i are equal. [H1975] proved that an ordered partition (with respect to p_i) is optimal for the Dorfman procedure (procedure D), and obtained an optimum solution (i.e., found an optimal partition) by dynamic programming. In this paper, we investigate the Sterrett procedure (procedure S). We provide a closed-form expression for the expected total number of tests, which allows us to find the optimum arrangement of the items within a particular group. We also show that an ordered partition is not optimal for procedure S, or even for a slightly modified Dorfman procedure (procedure D′). This discovery implies that finding an optimal procedure S appears to be a hard computational problem. However, by using an optimal ordered partition for all procedures, we show that procedure D′ is uniformly better than procedure D, and, based on numerical comparisons, procedure S is uniformly and significantly better than procedures D and D′.
... The special case of uniform multicast costs (with nonuniform member weights) bears a strong resemblance to the Huffman encoding problem [11]. Indeed, it can be easily seen that an optimal binary hierarchy in this special case is given by the Huffman code. ...
Preprint
Many data dissemination and publish-subscribe systems that guarantee the privacy and authenticity of the participants rely on symmetric key cryptography. An important problem in such a system is to maintain the shared group key as the group membership changes. We consider the problem of determining a key hierarchy that minimizes the average communication cost of an update, given update frequencies of the group members and an edge-weighted undirected graph that captures routing costs. We first present a polynomial-time approximation scheme for minimizing the average number of multicast messages needed for an update. We next show that when routing costs are considered, the problem is NP-hard even when the underlying routing network is a tree network or even when every group member has the same update frequency. Our main result is a polynomial time constant-factor approximation algorithm for the general case where the routing network is an arbitrary weighted graph and group members have nonuniform update frequencies.
... We say that an autocorrelation is perfect if ∑_{k=0}^{n−1} c_k c_{(k+m) mod n} = 0 for every 1 ≤ m ≤ n − 1. Sequences with low autocorrelation are of fundamental importance in radar signal theory [7], data transmission, and data compression [13]. It is thus interesting to search for new finite sequences having perfect autocorrelation, in a similar way to how Huffman generalized Barker sequences [14]. ...
Preprint
We study the existence and construction of circulant matrices C of order n ≥ 2 with diagonal entries d ≥ 0, off-diagonal entries ±1, and mutually orthogonal rows. These matrices generalize circulant conference (d=0) and circulant Hadamard (d=1) matrices. We demonstrate that matrices C exist for every order n and for d chosen such that n = 2d + 2, and we find all solutions C with this property. Furthermore, we prove that if C is symmetric, or n − 1 is prime, or d is not an odd integer, then necessarily n = 2d + 2. Finally, we conjecture that the relation n = 2d + 2 holds for every matrix C, which generalizes the circulant Hadamard conjecture. We support the proposed conjecture by computing all the existing solutions up to n = 50.
... Spångberg et al. propose Tng-Mf1 [16], a class of algorithms that use quantization, delta coding within and between frames, a custom 0th-order variable-length integer compression, and optionally a combination of the Burrows-Wheeler transformation [4], Lempel-Ziv coding [18], and Huffman coding [8]. Marais et al. [10] use quantization, an arithmetic encoder, and interframe prediction with polynomials of order zero or one. Additionally, they use a priori knowledge about the spatial structure of water to exploit redundancy in the positions and orientations of adjacent water molecules. ...
Preprint
Molecular dynamics simulations yield large amounts of trajectory data. For their durable storage and accessibility an efficient compression algorithm is paramount. State of the art domain-specific algorithms combine quantization, Huffman encoding and occasionally domain knowledge. We propose the high resolution trajectory compression scheme (HRTC) that relies on piecewise linear functions to approximate quantized trajectories. By splitting the error budget between quantization and approximation, our approach beats the current state of the art by several orders of magnitude given the same error tolerance. It allows storing samples at far less than one bit per sample. It is simple and fast enough to be integrated into the inner simulation loop, store every time step, and become the primary representation of trajectory data.
... If the demon wants to operate with maximum efficiency, it must use an optimal coding procedure, i.e., Huffman coding [5]. In this context, the question arises as to how the record length l_i for the i-th state can be interpreted. ...
Preprint
If p is the probability of a letter of a memoryless source, the length l of the corresponding binary Huffman codeword can be very different from the value -log p. We show that, nevertheless, for a typical letter, l is approximately equal to -log p. More precisely, the probability that l differs from -log p by more than m decreases exponentially with m.
... It is well known from the Kraft and McMillan theorems [1], [2] that the codeword lengths of any uniquely decodable FV code must satisfy Kraft's inequality, and such codeword lengths can be realized by an instantaneous FV code. Hence, the Huffman code [3], which attains the best compression ratio in the class of instantaneous FV codes, is also optimal in the class of uniquely decodable FV codes. However, it was implicitly assumed in [2] that a single code tree is used for a uniquely decodable FV code. ...
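As a small illustration of the inequality in question, the sketch below (the codeword lengths are hypothetical) checks whether a set of binary codeword lengths satisfies Kraft's inequality, i.e. whether an instantaneous code with those lengths can exist.

from fractions import Fraction

def satisfies_kraft(lengths, arity=2):
    """Check the Kraft inequality: sum(arity**(-l) for l in lengths) <= 1."""
    return sum(Fraction(1, arity ** l) for l in lengths) <= 1

# Lengths of a valid binary prefix code (e.g. {0, 10, 110, 111}) ...
print(satisfies_kraft([1, 2, 3, 3]))   # True
# ... and lengths that no uniquely decodable binary code can achieve.
print(satisfies_kraft([1, 1, 2]))      # False: 1/2 + 1/2 + 1/4 > 1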
Preprint
Binary AIFV codes are lossless codes that generalize the class of instantaneous FV codes. The code uses two code trees and assigns source symbols to incomplete internal nodes as well as to leaves. AIFV codes are empirically shown to attain better compression ratios than Huffman codes. Nevertheless, an upper bound on the redundancy of optimal binary AIFV codes is only known to be 1, which is the same as the bound for Huffman codes. In this paper, the upper bound is improved to 1/2, which is shown to coincide with the worst-case redundancy of the codes. Along with this, the worst-case redundancy is derived in terms of p_max ≥ 1/2, where p_max is the probability of the most likely source symbol. Additionally, we propose an extension of binary AIFV codes, which use m code trees and allow at most m-bit decoding delay. We show that the worst-case redundancy of the extended binary AIFV codes is 1/m for m ≤ 4.
... End-Tagged Dense Code (ETDC). It is a semi-static statistical byte-oriented encoder/decoder [22,23] that achieves very good compression and decompression times while keeping compression ratios similar to those obtained by Plain Huffman [24] (the byte-oriented version of Huffman [25] that obtains optimum byte-oriented prefix codes). ...
Preprint
We introduce a dynamic data structure for the compact representation of binary relations R ⊆ A × B. The data structure is a dynamic variant of the k²-tree, a static compact representation that takes advantage of clustering in the binary relation to achieve compression. Our structure can efficiently check whether two objects (a,b) ∈ A × B are related, and list the objects of B related to some a ∈ A and vice versa. Additionally, our structure allows inserting and deleting pairs (a,b) in the relation, as well as modifying the base sets A and B. We test our dynamic data structure in different contexts, including the representation of Web graphs and RDF databases. Our experiments show that our dynamic data structure achieves good compression ratios and fast query times, close to those of a static representation, while also providing efficient support for updates in the represented binary relation.
... Data compression started with the statistical methods of Shannon [2] and Huffman [3]. Shannon introduced the concept of information entropy H(X) for a random variable X, which is computed by (1). ...
Article
Full-text available
After a boom that coincided with the advent of the internet, digital cameras, digital video and audio storage and playback devices, the research on data compression has rested on its laurels for a quarter of a century. Domain-dependent lossy algorithms of the time, such as JPEG, AVC, MP3 and others, achieved remarkable compression ratios and encoding and decoding speeds with acceptable data quality, which has kept them in common use to this day. However, recent computing paradigms such as cloud computing, edge computing, the Internet of Things (IoT), and digital preservation have gradually posed new challenges, and, as a consequence, development trends in data compression are focusing on concepts that were not previously in the spotlight. In this article, we try to critically evaluate the most prominent of these trends and to explore their parallels, complementarities, and differences. Digital data restoration mimics the human ability to omit memorising information that is satisfactorily retrievable from the context. Feature-based data compression introduces a two-level data representation with higher-level semantic features and with residuals that correct the feature-restored (predicted) data. The integration of the advantages of individual domain-specific data compression methods into a general approach is also challenging. To the best of our knowledge, a method that addresses all these trends does not exist yet. Our methodology, COMPROMISE, has been developed exactly to make as many solutions to these challenges as possible inter-operable. It incorporates features and digital restoration. Furthermore, it is largely domain-independent (general), asymmetric, and universal. The latter refers to the ability to compress data in a common framework in a lossy, lossless, and near-lossless mode. COMPROMISE may also be considered an umbrella that links many existing domain-dependent and independent methods, supports hybrid lossless–lossy techniques, and encourages the development of new data compression algorithms.
... | E_m, in the best possible order. Since the cost of computing A + B is proportional to |A| + |B|, if it were the case that |A + B| = |A| + |B|, the best possible order would be given by building the Huffman tree [47] of the matrices A_i = M(E_i) using |A_i| as their weight (see Sect. 3 for the definition of M(E_i)). Since, instead, it holds that max(|A|, |B|) ≤ |A + B| ≤ |A| + |B|, we opt for a heuristic that simulates Huffman's algorithm on the actual size of the matrices as they are produced. ...
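The heuristic described here can be sketched as follows. The code assumes only an add operation and a size measure for the matrices; the names greedy_sum, add, and size are ours, and plain Python sets stand in for sparse Boolean matrices.

import heapq
from itertools import count

def greedy_sum(matrices, add, size):
    """Combine matrices by repeatedly adding the two currently smallest ones,
    a Huffman-like greedy heuristic re-weighted by the actual size of each sum.

    matrices: non-empty list of matrix objects (opaque here).
    add(a, b): returns a + b.
    size(a): returns |a| (e.g. number of nonzeros).
    """
    tiebreak = count()  # avoids comparing matrix objects on weight ties
    heap = [(size(m), next(tiebreak), m) for m in matrices]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        c = add(a, b)
        heapq.heappush(heap, (size(c), next(tiebreak), c))  # re-weigh by actual size
    return heap[0][2]

# Example with sets standing in for sparse Boolean matrices.
mats = [{1, 2}, {2, 3, 4}, {5}, {1, 5, 6, 7}]
print(greedy_sum(mats, add=lambda a, b: a | b, size=len))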
Article
Full-text available
Regular Path Queries (RPQs), which are essentially regular expressions to be matched against the labels of paths in labeled graphs, are at the core of graph database query languages like SPARQL and GQL. A way to solve RPQs is to translate them into a sequence of operations on the adjacency matrices of each label. We design and implement a Boolean algebra on sparse matrix representations and, as an application, use them to handle RPQs. Our baseline representation uses the same space and time as the previously most compact index for RPQs, outperforming it on the hardest types of queries: those where both RPQ endpoints are unspecified. Our more succinct structure, based on k²-trees, is 4 times smaller than any existing representation that handles RPQs. While slower, it still solves complex RPQs in a few seconds and slightly outperforms the smallest previous structure on the hardest RPQs. Our new sparse-matrix-based solutions dominate a good portion of the space/time tradeoff map, being outperformed only by representations that use much more space. They also implement an algebra of Boolean matrices that is of independent interest beyond solving RPQs.
... For data compression techniques, works mainly adopt two types of compressors: lossless compressors and error-bounded lossy compressors. Lossless methods [22], [23] maintain the complete original information, so the compressed data can be decoded without any loss. However, lossless methods have relatively low compression ratios. ...
Preprint
With the development of quantum computing, quantum processors demonstrate potential supremacy in specific applications, such as Grover's database search and popular quantum neural networks (QNNs). To better calibrate quantum algorithms and machines, quantum circuit simulation on classical computers becomes crucial. However, as the number of quantum bits (qubits) increases, the memory requirement grows exponentially. In order to reduce memory usage and accelerate simulation, we propose a multi-level optimization, namely Mera, by exploring memory and computation redundancy. First, for the large number of sparse quantum gates, we propose two compressed structures for low-level full-state simulation. The corresponding gate operations are designed for practical implementation and are relieved from time-consuming compression and decompression. Second, for the dense Hadamard gate, which is invariably used to construct the superposition, we design a customized structure for significant memory saving as a regularity-oriented simulation. Meanwhile, an on-demand amplitude updating process is optimized to accelerate execution. Experiments show that our compressed structures increase the number of qubits from 17 to 35, and achieve up to 6.9 times acceleration for QNN.
... 4.2. This ordering facilitates entropy encoding by placing first the low-frequency coefficients, which are more likely to be nonzero, and then the high-frequency coefficients, which are mostly zero. The coding methods are based on the standards [42], [43] and Huffman codes (given in the Appendix) [44]. In the intermediate entropy encoding stage, each AC coefficient is represented in combination with the 'runlength' of zero-valued AC coefficients that precedes it in the zigzag sequence. Each such runlength/non-zero combination is represented by a pair of symbols, where Symbol 1 represents RUNLENGTH and SIZE (Figure 4.2: DC coefficients and zig-zag pattern of AC coefficients). ...
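A highly simplified sketch of that intermediate run-length stage is shown below; it ignores the SIZE categories and the 16-zero ZRL limit of the actual JPEG standard, and the function name and sample coefficients are ours.

def run_length_pairs(ac_coeffs):
    """Group zigzag-ordered AC coefficients into (zero-run, value) pairs,
    roughly as in JPEG's intermediate entropy-coding stage."""
    pairs, run = [], 0
    for c in ac_coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    if run:
        pairs.append("EOB")  # end-of-block marker once only zeros remain
    return pairs

print(run_length_pairs([5, 0, 0, -3, 0, 1, 0, 0, 0, 0]))
# [(0, 5), (2, -3), (1, 1), 'EOB']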
... The main techniques employed in lossless compression are based on repetition removal (stemming from the seminal works of Lempel and Ziv (Ziv & Lempel, 1977; 1978), and dubbed LZ compression) and entropy encoding (e.g., Huffman, 1952; Rissanen, 1976). LZ compressors find multiple-byte repetitions (typically of at least 4 bytes) and replace these with shorter back-pointers, hence saving space. ...
Preprint
Full-text available
With the growth of model sizes and the scale of their deployment, their sheer size burdens the infrastructure, requiring more network bandwidth and more storage to accommodate them. While there is a vast model compression literature on deleting parts of the model weights for faster inference, we investigate a more traditional type of compression, one that represents the model in a compact form and is coupled with a decompression algorithm that returns it to its original form and size, namely lossless compression. We present ZipNN, a lossless compression method tailored to neural networks. Somewhat surprisingly, we show that specific lossless compression can gain significant network and storage reduction on popular models, often saving 33% and at times reducing over 50% of the model size. We investigate the source of model compressibility and introduce specialized compression variants tailored for models that further increase the effectiveness of compression. On popular models (e.g. Llama 3) ZipNN shows space savings that are over 17% better than vanilla compression while also improving compression and decompression speeds by 62%. We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like Hugging Face.
... [Huf52]). Let k be any random variable. There exists an encoding function C(·) (called the Huffman code) such that E[|C(k)|] ≤ H(k) + 1. ...
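A compressed sketch of the standard argument behind this lemma, assuming a binary code alphabet and base-2 entropy: the Shannon lengths ℓ_x = ⌈−log₂ Pr[k = x]⌉ satisfy Kraft's inequality, so some prefix code achieves them, and the Huffman code is at least as short on average.

\mathbb{E}\,|C(k)|
  \le \sum_x \Pr[k = x]\,\lceil -\log_2 \Pr[k = x] \rceil
  < \sum_x \Pr[k = x]\,\bigl(-\log_2 \Pr[k = x] + 1\bigr)
  = H(k) + 1 .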
Preprint
Full-text available
We exhibit a total search problem whose communication complexity in the quantum SMP (simultaneous message passing) model is exponentially smaller than in the classical two-way randomized model. Moreover, the quantum protocol is computationally efficient and its solutions are classically verifiable, that is, the problem lies in the communication analogue of the class TFNP. Our problem is a bipartite version of a query complexity problem recently introduced by Yamakawa and Zhandry (JACM 2024). We prove the classical lower bound using the structure-vs-randomness paradigm for analyzing communication protocols.
... It is important to note that the zip format, widely used for file compression and archival, employs the DEFLATE algorithm [5] as its primary compression method. This algorithm combines the LZ77 algorithm [6] and Huffman coding [7] to achieve its compression. While the zip format also supports other compression methods, DEFLATE is also the algorithm Cox used to create his quine. ...
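For a concrete feel of DEFLATE (though not of the quine construction itself), Python's standard zlib module implements the same LZ77-plus-Huffman pipeline; the snippet below, with input data of our choosing, simply round-trips some repetitive bytes.

import zlib

data = b"abracadabra abracadabra abracadabra"  # repetitive, so LZ77 finds matches
compressed = zlib.compress(data, level=9)      # DEFLATE stream wrapped in a zlib header
restored = zlib.decompress(compressed)

assert restored == data
print(len(data), "->", len(compressed), "bytes")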
Article
Full-text available
This paper explores the concept of zip quines, which are zip files that contain themselves upon extraction, extending the idea of computational self-reference. While only two individuals, Russ Cox and Erling Ellingsen, have created such entities, this study focuses on Cox’s method to develop a generator for these files. Overcoming the initial limitations, the generator allows for the inclusion of additional files within the zip quine. Additionally, this research explores the concept of looped zip files, wherein a zip archive contains another archive. This archive then contains the initial zip file. By offering practical methodologies and insights, this study advances the understanding and application of quines in computer science.
... The standard for portable network graphics (PNGs) [18] is based on DEFLATE compression [19], Lempel-Ziv-Welch compression and Huffman coding, where the dictionary is built up dynamically [20]. Alternatively, static Huffman coding can also be used, but this one requires a prior analysis of encoding data probability distribution. ...
Article
Full-text available
Digital image compression is applied to reduce camera bandwidth and storage requirements, but real-time lossless compression on a high-speed high-resolution camera is a challenging task. The article presents hardware implementation of a Bayer colour filter array lossless image compression algorithm on an FPGA-based camera. The compression algorithm reduces colour and spatial redundancy and employs Golomb–Rice entropy coding. A rule limiting the maximum code length is introduced for the edge cases. The proposed algorithm is based on integer operators for efficient hardware implementation. The algorithm is first verified as a C++ model and later implemented on AMD-Xilinx Zynq UltraScale+ device using VHDL. An effective tree-like pipeline structure is proposed to concatenate codes of compressed pixel data to generate a bitstream representing data of 16 parallel pixels. The proposed parallel compression achieves up to 56% reduction in image size for high-resolution images. Pipelined implementation without any state machine ensures operating frequencies up to 320 MHz. Parallelised operation on 16 pixels effectively increases data throughput to 40 Gbit/s while keeping the total memory requirements low due to real-time processing.
... Finally, a dictionary containing the symbols representing specific sub-strings is added to the compressed file for use in decompression. If a string contains many different patterns that require symbols of varying size for representation, the Huffman algorithm will make sure that the most frequent patterns, or sub-strings, are assigned progressively shorter representations [25]. The size of such representations therefore correlates negatively with occurrence frequency. ...
Article
Full-text available
Background: To what degree a string of symbols can be compressed reveals important details about its complexity. For instance, strings that are not compressible are random and carry a low information potential, while the opposite is true for highly compressible strings. We explore to what extent microbial genomes are amenable to compression, as they vary considerably both with respect to size and base composition. For instance, microbial genome sizes vary from less than 100,000 base pairs in symbionts to more than 10 million in soil-dwellers. Genomic base composition, often summarized as genomic AT or GC content due to the similar frequencies of adenine and thymine on one hand and cytosine and guanine on the other, also varies substantially; the most extreme microbes can have genomes with AT content below 25% or above 85%. Base composition determines the frequency of DNA words, consisting of multiple nucleotides or oligonucleotides, and may therefore also influence compressibility. Using 4,713 RefSeq genomes, we examined the association between compressibility, using both a DNA-based (MBGC) and a general-purpose (ZPAQ) compression algorithm, and genome size, AT content, and genomic oligonucleotide usage variance (OUV), using generalized additive models. Results: We find that genome size (p < 0.001) and OUV (p < 0.001) are both strongly associated with genome redundancy for both types of file compressors. The DNA-based MBGC compressor managed to improve compression by approximately 3% on average with respect to ZPAQ. Moreover, MBGC detected a significant (p < 0.001) compression ratio difference between AT-poor and AT-rich genomes which was not detected with ZPAQ. Conclusion: As lack of compressibility is equivalent to randomness, our findings suggest that smaller and AT-rich genomes may have accumulated more random mutations on average than larger and AT-poor genomes which, in turn, were significantly more redundant. Moreover, we find that OUV is a strong proxy for genome compressibility in microbial genomes. The ZPAQ compressor was found to agree with the MBGC compressor, albeit with a poorer performance, except for the compressibility of AT-rich and AT-poor/GC-rich genomes.
... Rank queries can be answered in O(log σ) time by traversing the tree to the leaf level. Using Huffman codes [41] to encode symbols in a sequence creates Huffman-shaped wavelet trees [13] that are highly compressible and provide simple random access in time proportional to the length of the code word. In practice, Huffman-shaped wavelet trees can be constructed efficiently [23,24], with various techniques to improve query performance [19,40]. ...
Preprint
Full-text available
Processing large-scale graphs, containing billions of entities, is critical across fields like bioinformatics, high-performance computing, navigation and route planning, among others. Efficient graph partitioning, which divides a graph into sub-graphs while minimizing inter-block edges, is essential to graph processing, as it optimizes parallel computing and enhances data locality. Traditional in-memory partitioners, such as METIS and KaHIP, offer high-quality partitions but are often infeasible for enormous graphs due to their substantial memory overhead. Streaming partitioners reduce memory usage to O(n), where 'n' is the number of nodes of the graph, by loading nodes sequentially and assigning them to blocks on-the-fly. This paper introduces StreamCPI, a novel framework that further reduces the memory overhead of streaming partitioners through run-length compression of block assignments. Notably, StreamCPI enables the partitioning of trillion-edge graphs on edge devices. Additionally, within this framework, we propose a modification to the LA-vector bit vector for append support, which can be used for online run-length compression in other streaming applications. Empirical results show that StreamCPI reduces memory usage while maintaining or improving partition quality. For instance, using StreamCPI, the Fennel partitioner effectively partitions a graph with 17 billion nodes and 1.03 trillion edges on a Raspberry Pi, achieving significantly better solution quality than Hashing, the only other feasible algorithm on edge devices. StreamCPI thus advances graph processing by enabling high-quality partitioning on low-cost machines.
Preprint
The hidden metric space behind complex network topologies is a fervid topic in current network science, and the hyperbolic space is one of the most studied, because it seems associated with the structural organization of many real complex systems. The Popularity-Similarity-Optimization (PSO) model simulates how random geometric graphs grow in the hyperbolic space, reproducing strong clustering and a scale-free degree distribution; however, it fails to reproduce an important feature of real complex networks, which is the community organization. The Geometrical-Preferential-Attachment (GPA) model was recently developed to confer on the PSO model a community structure as well, which is obtained by forcing different angular regions of the hyperbolic disk to have variable levels of attractiveness. However, the number and size of the communities cannot be explicitly controlled in the GPA, which is a clear limitation for real applications. Here, we introduce the nonuniform PSO (nPSO) model which, differently from GPA, forces heterogeneous angular node attractiveness by sampling the angular coordinates from a tailored nonuniform probability distribution, for instance a mixture of Gaussians. The nPSO differs from GPA in three other aspects: it allows the number and size of communities to be fixed explicitly; it allows their mixing to be tuned through the network temperature; and it generates networks with high clustering efficiently. After several tests, we propose the nPSO as a valid and efficient model to generate networks with communities in the hyperbolic space, which can be adopted as a realistic benchmark for different tasks such as community detection and link prediction.
Article
Nowadays, brain-computer interfaces (BCIs) are being extensively explored by researchers to recover the abilities of motor-impaired individuals and improve their communication. Text entry is one of the most important everyday communication tasks, and BCI has high application potential for developing a speller. Although BCI has been a growing research topic for the last decade, more progress is yet to be made for BCI-based spellers. There are two challenges in BCI speller development: designing an effective graphical user interface with an optimal arrangement of symbols that requires a minimum number of control commands, and reducing users' efforts in error correction by following an efficient error-correction policy. With this scope in mind, this work proposes a novel BCI speller paradigm with an efficient symbol arrangement to improve the text entry rate. Additionally, it provides a user-friendly error-correction approach through six text-cursor navigation keys to enhance the accuracy of text entry. The proposed speller includes 36 target symbols and is operated with two control commands obtained from electroencephalogram motor-imagery signals. The experimental results revealed that the proposed speller outperforms existing motor imagery-based BCI spellers when tested on ten motor-impaired users. The speller achieved a mean performance of 5.20 characters per minute without any typing error and 6.04 characters per minute with a 2.9 percent mean error. The mean correction efficiency was 0.69; that is, users corrected 69 percent of incorrect inputs with a single correction key press.
Article
Full-text available
Data compression algorithms tend to reduce information entropy, which is crucial, especially in the case of images, as they are data intensive. In this regard, lossless image data compression is especially challenging. Many popular lossless compression methods incorporate predictions and various types of pixel transformations, in order to reduce the information entropy of an image. In this paper, a block optimisation programming framework Φ is introduced to support various experiments on raster images, divided into blocks of pixels. Eleven methods were implemented within Φ , including prediction methods, string transformation methods, and inverse distance weighting, as a representative of interpolation methods. Thirty-two different greyscale raster images with varying resolutions and contents were used in the experiments. It was shown that Φ reduces information entropy better than the popular JPEG LS and CALIC predictors. The additional information associated with each block in Φ is then evaluated. It was confirmed that, despite this additional cost, the estimated size in bytes is smaller in comparison to the sizes achieved by the JPEG LS and CALIC predictors.
Article
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that there are generally various degrees of temporal redundancy in dynamic scenes, which consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during the training or rendering regardless of the video length. Moreover, we design a Compact Appearance Model that mixes diffuse and view-dependent Gaussians to further minimize the model size while maintaining the rendering quality. We also develop a rasterization pipeline of Gaussian primitives based on the hardware-accelerated technique to improve rendering speed. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling hours of volumetric video data while maintaining state-of-the-art rendering quality.
Article
A quantitative measure of “information” is developed which is based on physical as contrasted with psychological considerations. How the rate of transmission of this information over a system is limited by the distortion resulting from storage of energy is discussed from the transient viewpoint. The relation between the transient and steady state viewpoints is reviewed. It is shown that when the storage of energy is used to restrict the steady state transmission to a limited range of frequencies the amount of information that can be transmitted is proportional to the product of the width of the frequency-range by the time it is available. Several illustrations of the application of this principle to practical systems are included. In the case of picture transmission and television the spacial variation of intensity is analyzed by a steady state method analogous to that commonly used for variations with time.