To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?

Computing Research Repository (CoRR), 01/2007
Source: arXiv

ABSTRACT Application designers often face the question of whether to store large
objects in a filesystem or in a database. Often this decision is made for
application design simplicity; sometimes performance measurements are also
used. This paper looks at the question of fragmentation - one of the
operational issues that can affect the performance and/or manageability of the
system as deployed long term. As expected from the common wisdom, objects
smaller than 256KB are best stored in a database, while objects larger than 1MB
are best stored in the filesystem. Between 256KB and 1MB, the read:write ratio
and the rate of object overwrite or replacement are important factors. We used
the notion of "storage age", or number of object overwrites, as a way of
normalizing wall-clock time. Storage age allows our results, or similar such
results, to be applied across a range of read:write ratios and object
replacement rates.
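The abstract's placement rule can be sketched as a small decision function. This is a minimal illustrative sketch, not the paper's implementation: the 256KB/1MB thresholds come from the abstract, but the function name, parameters, and the gray-zone tie-break values are assumptions.

```python
KB = 1024
MB = 1024 * KB

def choose_store(object_size, read_write_ratio=None, overwrite_rate=None):
    """Return 'database' or 'filesystem' for a BLOB of the given size.

    Below 256KB the database wins; above 1MB the filesystem wins.
    In between, the paper says the read:write ratio and the overwrite
    rate decide; the specific cutoffs below are illustrative guesses.
    """
    if object_size < 256 * KB:
        return "database"
    if object_size > 1 * MB:
        return "filesystem"
    # Gray zone: favor the database only for read-heavy, rarely
    # overwritten objects (assumed tie-break, not from the paper).
    if read_write_ratio is not None and overwrite_rate is not None:
        if read_write_ratio > 10 and overwrite_rate < 0.1:
            return "database"
    return "filesystem"

print(choose_store(100 * KB))  # small object -> database
print(choose_store(10 * MB))   # large object -> filesystem
```

The "storage age" normalization matters here because fragmentation accumulates with overwrites, not with elapsed time, so a gray-zone object that is frequently replaced drifts toward filesystem storage.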

  • ABSTRACT: Organizing the peer review process for a scientific conference can be a cumbersome task. Electronic conference management systems support chairs and reviewers in managing the huge number of submissions. These systems implement the complete workflow of a scientific conference. We present a new approach to such systems: by providing an open API framework instead of a closed system, it enables external programs to harvest and utilize open information sources available on the internet today.
    17th International Conference on Electronic Publishing; 06/2013
  • ABSTRACT: In this paper we describe our database-centered workflow for acquisition, enrichment, long-term storage and web-based analysis of multi- and hyperspectral image data and measurement metadata. We propose a standardized way of storing large amounts of measurement data using a hybrid approach consisting of a relational database and direct file access through a common data access layer. Data import and export is performed either using proprietary file formats like ENVI or using a universal XML-based data format. To allow preview and analysis of image data, a web-based application has been developed which removes the need for client-side installation of tools like MATLAB to perform spectral or spatial analysis of the data, while still allowing third-party applications to retrieve data from the database to perform in-depth analysis such as automated filter design. Successful implementation of the workflow is demonstrated by the example of rapid application development for mineral sorting.
    OCM 2013 - Optical Characterization of Materials - conference proceedings; 01/2013
  • ABSTRACT: Due to the explosive growth in the size of scientific data-sets, data-intensive computing and analysis are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimise the data layout or format beforehand to facilitate future data access. On the other hand, current research shows an increasing popularity of the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data-sets are not directly supported by the current MapReduce framework. This gap motivates us to provide support for these scientific data access patterns in the MapReduce framework. In our work, we study the data access patterns in matrix files and propose a new concentric data layout solution to facilitate matrix data access and analysis in the MapReduce framework. Concentric data layout is a data layout which maintains the dimensional property at the chunk level. Contrary to the continuous data layout adopted in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk. This layout can guarantee that the average performance of data access is optimal regardless of the access pattern. The concentric data layout requires reorganising the data before it is analysed or processed. Our experiments are launched on a real-world halo-finding application; the results indicate that the concentric data layout improves overall performance by up to 38%.
    International Journal of Parallel Emergent and Distributed Systems 10/2013; 28(5):407-433.
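The concentric layout described in the last entry, where all elements of a sub-matrix share one chunk, can be sketched with a simple chunk-addressing function. This is an illustrative sketch under assumed names and a square-block layout, not the cited paper's implementation.

```python
def chunk_of(row, col, n, block):
    """Chunk id for element (row, col) of an n x n matrix stored with
    block x block sub-matrices, one sub-matrix per chunk (concentric
    layout). Assumes block divides n evenly."""
    blocks_per_row = n // block
    return (row // block) * blocks_per_row + (col // block)

# 4x4 matrix in 2x2 blocks -> 4 chunks; every element of a sub-matrix
# lands in the same chunk, unlike a plain row-major layout where a
# column access would touch many chunks.
assert chunk_of(0, 0, 4, 2) == chunk_of(1, 1, 4, 2) == 0
assert chunk_of(0, 2, 4, 2) == 1
```

Because both row-oriented and column-oriented accesses touch the same number of chunks under this layout, the average access cost is insensitive to the access pattern, which is the property the abstract claims.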

