To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?

Computing Research Repository (CoRR), 01/2007
Source: arXiv


Application designers often face the question of whether to store large
objects in a filesystem or in a database. Often this decision is made for
application design simplicity. Sometimes, performance measurements are also
used. This paper looks at the question of fragmentation, one of the
operational issues that can affect the performance and/or manageability of the
system as deployed long term. As expected from the common wisdom, objects
smaller than 256KB are best stored in a database, while objects larger than 1MB
are best stored in the filesystem. Between 256KB and 1MB, the read:write ratio
and the rate of object overwrite or replacement are important factors. We used
the notion of "storage age", or number of object overwrites, as a way of
normalizing wall clock time. Storage age allows our results, or similar such
results, to be applied across a number of read:write ratios and object
replacement rates.
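
As a rough illustration of the size thresholds stated in the abstract, the following Python sketch encodes them as a small decision helper. The function name `suggest_store`, the returned labels, and the choice to treat 1KB as 1024 bytes are assumptions made for illustration and do not come from the paper. The middle band is deliberately left undecided, because the abstract says the answer there depends on the read:write ratio and the overwrite rate.

```python
# Hypothetical helper: names, labels, and binary units are assumptions.
KB = 1024                      # assumption: 1KB = 1024 bytes
DB_THRESHOLD = 256 * KB        # below this, the abstract favors a database
FS_THRESHOLD = 1024 * KB       # above this (1MB), it favors the filesystem


def suggest_store(size_bytes: int) -> str:
    """Return a storage suggestion based only on object size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < DB_THRESHOLD:
        return "database"
    if size_bytes > FS_THRESHOLD:
        return "filesystem"
    # Between 256KB and 1MB the abstract gives no size-only answer:
    # read:write ratio and overwrite/replacement rate decide.
    return "workload-dependent"
```

A caller could use the `"workload-dependent"` result as a cue to measure its own workload rather than pick a store by size alone.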

Available from:
  • Source
    • "Data access pattern is mainly determined by the requirement of the specific application. An optimal data layout should be able to maximize the continuous I/O, and minimize the accesses of unnecessary (unrelated) data [15] "
    ABSTRACT: Due to the explosive growth in the size of scientific data-sets, data-intensive computing and analysing are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimise the data layout or format beforehand to facilitate the future data access. On the other hand, current research shows an increasing popularity of MapReduce framework for large-scale data processing. However, the data access patterns which are generally applied to scientific data-set are not supported by current MapReduce framework directly. This gap motivates us to provide support for these scientific data access patterns in MapReduce framework. In our work, we study the data access patterns in matrix files and propose a new concentric data layout solution to facilitate matrix data access and analysis in MapReduce framework. Concentric data layout is a data layout which maintains the dimensional property in chunk level. Contrary to the continuous data layout adopted in the current Hadoop framework, concentric data layout stores the data from the same sub-matrix into one chunk. This layout can guarantee that the average performance of data access is optimal regardless of the various access patterns. The concentric data layout requires reorganising the data before it is being analysed or processed. Our experiments are launched on a real-world halo-finding application; the results indicate that the concentric data layout improves the overall performance by up to 38%.
    Full-text · Article · Oct 2013 · International Journal of Parallel, Emergent and Distributed Systems
  • Source
    • "As bandwidth fragmentation in EONs is similar to file system fragmentation in computer storage, we borrow the corresponding definition of the fragmentation ratio [13]. "
    ABSTRACT: In elastic optical networks (EONs), bandwidth fragmentation refers to the existence of non-aligned, isolated and small-sized blocks of contiguous subcarrier slots in the optical spectrum. As they are neither contiguous in the spectrum domain nor aligned along the routing paths, the network operator will have difficulty to use these slots for future connections. In this work, we analyze the effect of bandwidth fragmentation on the blocking probability in EONs. Our theoretical analysis indicates that two factors related to bandwidth fragmentation have effects on the blocking probability: 1) the extent that the available slot-blocks (i.e., blocks of contiguous slots) on different links are aligned on spectrum locations, and 2) the sizes of the available slot-blocks in links' spectra for future requests. When an EON's spectrum becomes more fragmented, the first factor actually reduces the blocking probability, while the second one increases the blocking probability. Their mixed effect determines the overall trend of how the blocking probability will change with bandwidth fragmentation. Our theoretical model can forecast this trend and reveal the relation among the blocking probability, bandwidth fragmentation, request bandwidth distribution, and spectrum utilization. We have also conducted numerical simulations to verify the theoretical analysis, and the simulation results exhibit similar trends as predicted by the theoretical model.
    Full-text · Article · Jul 2013 · IEEE Transactions on Communications
  • Source
    • "On the one hand, storing an image as a BLOB in a database has certain advantages over storing it as a file, such as lower read throughput for objects <1 MB in size for short intervals and atomicity for overwriting. On the other hand, storing an image as a file provides other advantages such as high throughput for files >1 MB and low fragmentation in the long term (16). We have decided to use files for storing images in the BioDIG system to minimize the load on the database. "
    ABSTRACT: Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas. Database URL: BioDIG website: BioDIG source code repository: The MyDIG database:
    Full-text · Article · Jan 2013 · Database: The Journal of Biological Databases and Curation