To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?

Computing Research Repository - CoRR 01/2007;
Source: arXiv

ABSTRACT Application designers often face the question of whether to store large
objects in a filesystem or in a database. Often this decision is made for
application-design simplicity; sometimes performance measurements are also
used. This paper looks at the question of fragmentation - one of the
operational issues that can affect the performance and/or manageability of the
system as deployed over the long term. As expected from the common wisdom, objects
smaller than 256 KB are best stored in a database, while objects larger than 1 MB
are best stored in the filesystem. Between 256 KB and 1 MB, the read:write ratio
and the rate of object overwrite or replacement are the important factors. We used the
notion of "storage age", the number of object overwrites, as a way of normalizing
wall-clock time. Storage age allows our results, or similar results, to be
applied across a range of read:write ratios and object replacement rates.
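The size thresholds above suggest a simple placement policy. The following sketch routes a blob by size, leaving the 256 KB-1 MB gray zone to a workload hint; the function and parameter names are illustrative assumptions, not an API from the paper.

```python
DB_MAX_BYTES = 256 * 1024   # below this, the paper finds the database wins
FS_MIN_BYTES = 1024 * 1024  # above this, the filesystem wins

def choose_store(size_bytes, write_heavy=None):
    """Pick a blob store by size (a sketch of the paper's guidance).

    write_heavy is a hypothetical workload hint for the gray zone:
    frequent overwrites drive the filesystem fragmentation that the
    paper identifies as the key long-term cost.
    """
    if size_bytes < DB_MAX_BYTES:
        return "database"
    if size_bytes > FS_MIN_BYTES:
        return "filesystem"
    # 256 KB - 1 MB: read:write ratio and overwrite rate decide.
    if write_heavy:
        return "database"   # heuristic: limit fragmentation from rewrites
    return "filesystem" if write_heavy is False else "undecided"
```

In the gray zone the function deliberately returns "undecided" when no workload hint is given, mirroring the paper's point that no size rule alone settles that range.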

  • ABSTRACT: Due to the explosive growth in the size of scientific data-sets, data-intensive computing and analysis are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimise the data layout or format beforehand to facilitate future data access. At the same time, current research shows the increasing popularity of the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data-sets are not directly supported by the current MapReduce framework. This gap motivates us to support these scientific data access patterns in the MapReduce framework. In our work, we study the data access patterns in matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. A concentric data layout maintains the dimensional property at the chunk level: contrary to the continuous data layout adopted in the current Hadoop framework, it stores the data from the same sub-matrix in one chunk. This layout guarantees that the average performance of data access is optimal regardless of the access pattern. The concentric data layout requires reorganising the data before it is analysed or processed. Our experiments are run on a real-world halo-finding application; the results indicate that the concentric data layout improves overall performance by up to 38%.
    International Journal of Parallel, Emergent and Distributed Systems 10/2013; 28(5):407-433. DOI:10.1080/17445760.2012.720982
  • ABSTRACT: This paper compares the performance of an SQL solution that implements a relational data model with the document store MongoDB. We report on the performance of a single-node configuration of each data store and assume the database is small enough to fit in main memory. We analyze the utilization of the CPU cores and the network bandwidth to compare the two data stores. Our key findings are as follows. First, for those social networking actions that read and write a small amount of data, the join operator of the SQL solution is no slower than the JSON representation of MongoDB. Second, with a mix of actions, the SQL solution either matches the performance of MongoDB or outperforms it by 20%. Third, a middle-tier cache enhances the performance of both data stores, as query-result lookup is significantly faster than query processing with either system.
    Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM); 10/2013
  • ABSTRACT: Given the explosive growth of data in genome research, efficient research-data management, combined with secure and sustainable archiving, is becoming increasingly important in this field as well. The last of three articles in the series "Research data management of genome data" gives a general description of the life cycle of research data, from its planning, through the selection and ingest of data for storage, to the necessary preservation measures and access by data users. Archives play a role in almost every phase of this cycle and are therefore an important component of genome-data processing. Three public European archives for genome data are presented as examples: the database of the European Molecular Biology Laboratory (EMBL), the Sequence Read Archive, and the Trace Archive. Since each of these facilities specializes in a particular kind of data, however, there remains a need for additional long-term archives that can handle various data types flexibly or be extended to additional data types. A generic concept for such archives is described, together with recommendations for its practical implementation.
    Medizinische Genetik 09/2013; 25(3). DOI:10.1007/s11825-013-0403-y
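The concentric layout described in the first abstract above, storing each sub-matrix contiguously as one chunk rather than in row-major order, can be sketched as follows. The function name and pure-Python nested-list representation are illustrative assumptions, not the authors' Hadoop implementation.

```python
def concentric_chunks(matrix, block):
    """Split a 2-D matrix (list of lists) into block x block sub-matrices,
    each kept together as one chunk, keyed by its (block-row, block-col)
    position. A sketch of the concentric-layout idea: any access pattern
    aligned to a sub-matrix touches exactly one chunk.
    """
    chunks = {}
    for r in range(0, len(matrix), block):
        for c in range(0, len(matrix[0]), block):
            chunks[(r // block, c // block)] = [
                row[c:c + block] for row in matrix[r:r + block]]
    return chunks
```

Under a continuous (row-major) layout the same sub-matrix would be scattered across several chunks, which is the access-pattern mismatch the abstract reports the concentric layout avoids.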