The LSST Data Mining Research Agenda

11/2008; DOI: 10.1063/1.3059074
Source: arXiv

ABSTRACT: We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scale) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more. Comment: 5 pages, Presented at the "Classification and Discovery in Large Astronomical Surveys" meeting, Ringberg Castle, 14-17 October, 2008
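The classification-brokering idea can be illustrated with a minimal sketch. The class names, scoring rules, and threshold below are hypothetical, not the LSST pipeline's actual API; a real broker would apply trained machine-learning classifiers rather than hand-written rules:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    alert_id: int
    delta_mag: float      # brightness change in magnitudes
    n_detections: int     # prior detections at this sky position

def classify(alert):
    """Toy rule-based scores standing in for real ML classifiers."""
    scores = {}
    scores["transient"] = 0.9 if alert.n_detections <= 2 else 0.1
    scores["variable"] = (0.8 if alert.n_detections > 2
                          and abs(alert.delta_mag) > 0.5 else 0.2)
    return scores

def broker(alerts, threshold=0.7):
    """Route each alert to every class whose score clears the threshold."""
    routed = {"transient": [], "variable": []}
    for a in alerts:
        for cls, score in classify(a).items():
            if score >= threshold:
                routed[cls].append(a.alert_id)
    return routed

alerts = [Alert(1, 2.0, 1), Alert(2, 0.8, 10), Alert(3, 0.1, 5)]
print(broker(alerts))  # {'transient': [1], 'variable': [2]}
```

The point of the sketch is the routing pattern: each incoming event is scored against every class, and consumers subscribe by class, which is what makes the design robust to nightly volumes of 10,000+ alerts.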

  • ABSTRACT: Traditional relational databases have not always been well matched to the needs of data-intensive sciences, and to the needs of high energy physics data stores in particular. To address this mismatch, members of the database community and people involved with large scientific data stores in a variety of disciplines have inaugurated an open-source project, SciDB, that aims to develop and deliver database technologies suited to the needs of data-intensive sciences. This paper describes early experience using the first release of SciDB with an initial subset of high energy physics data structures and query patterns. It examines the early capabilities of SciDB, and describes requirements that further development must address if emerging database technologies such as SciDB are to accommodate the data structures, query patterns, computations, and use cases of high energy physics.
    Journal of Physics Conference Series 12/2011; 331(4):042016.
  • ABSTRACT: Traditional relational databases have not always been well matched to the needs of data-intensive sciences, but efforts are underway within the database community to attempt to address many of the requirements of large-scale scientific data management. One such effort is the open-source project SciDB. Since its earliest incarnations, SciDB has been designed for scalability in parallel and distributed environments, with a particular emphasis upon native support for array constructs and operations. Such scalability is of course a requirement of any strategy for large-scale scientific data handling, and array constructs are certainly useful in many contexts, but these features alone do not suffice to qualify a database product as an appropriate technology for hosting particle physics or cosmology data. In what constitutes its 1.0 release in June 2011, SciDB has extended its feature set to address additional requirements of scientific data, with support for user-defined types and functions, for data versioning, and more. This paper describes an evaluation of the capabilities of SciDB for two very different kinds of physics data: event-level metadata records from proton collisions at the Large Hadron Collider (LHC), and the output of cosmological simulations run on very-large-scale supercomputers. This evaluation exercises the spectrum of SciDB capabilities in a suite of tests that aim to be representative and realistic, including, for example, definition of four-vector data types and natural operations thereon, and computational queries that match the natural use cases for these data.
    Journal of Physics Conference Series 01/2012; 368(1).
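The "four-vector data types and natural operations thereon" mentioned above can be sketched outside any database. This is an illustrative Python stand-in, not SciDB's user-defined-type syntax; it shows the kind of operations (addition, invariant mass in natural units) such a type must support:

```python
import math
from dataclasses import dataclass

@dataclass
class FourVector:
    # Components (E, px, py, pz) in consistent units, e.g. GeV
    e: float
    px: float
    py: float
    pz: float

    def __add__(self, other):
        """Component-wise sum, as when combining decay products."""
        return FourVector(self.e + other.e, self.px + other.px,
                          self.py + other.py, self.pz + other.pz)

    def mass(self):
        """Invariant mass: m^2 = E^2 - |p|^2 in natural units (c = 1)."""
        m2 = self.e**2 - (self.px**2 + self.py**2 + self.pz**2)
        return math.sqrt(max(m2, 0.0))

# Two massless particles back-to-back along x: combined invariant mass = 2E
a = FourVector(10.0, 10.0, 0.0, 0.0)
b = FourVector(10.0, -10.0, 0.0, 0.0)
print((a + b).mass())  # 20.0
```

A database-side implementation of the same type lets queries like "find all particle pairs with invariant mass near 91 GeV" run inside the store instead of shipping raw events to the client.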
  • ABSTRACT: Time-domain astronomy is becoming a fundamental aspect of the next generation of astronomical instruments. The timing properties will revolutionise the studies of all kinds of astronomical objects. Consequently, the huge complex data volumes and high cadences of these facilities will force us to overhaul and extend current software solutions. LOFAR, laying the groundwork for this, will produce a continuously updated spectral light-curve catalogue of all detected sources, with real-time capabilities to cope with the growth of 50 - 100 TB/yr, making it the largest dynamic astronomical catalogue. Automated pipelines use the column-store MonetDB as their key component. We exploit SciLens, a 300+ node, 4-tier locally distributed cluster focussed on massive I/O. Introduction of the new array-based query language, SciQL, simplifies data exploration and mining. I will demonstrate how MonetDB/SQL & SciQL on its SciLens platform manages the millions of lightcurves for LOFAR. Initial benchmark results confirm the linear scale-up performance over tens of TBs using tens of nodes.
    Proc SPIE 09/2012;
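The "continuously updated light-curve catalogue" pattern can be sketched in miniature. This toy in-memory version uses hypothetical names and Python dicts; the LOFAR pipeline does this in MonetDB/SciQL at tens-of-TB scale, which plain Python cannot:

```python
from collections import defaultdict

# Dynamic catalogue: source id -> ordered list of (timestamp, flux) pairs.
catalogue = defaultdict(list)

def ingest(source_id, timestamp, flux):
    """Append one new flux measurement to a source's light curve."""
    catalogue[source_id].append((timestamp, flux))

def mean_flux(source_id):
    """Aggregate over the light curve; in SciQL this would be a query."""
    curve = catalogue[source_id]
    return sum(f for _, f in curve) / len(curve)

for t, f in [(0, 1.0), (1, 3.0), (2, 2.0)]:
    ingest("src-42", t, f)
print(mean_flux("src-42"))  # 2.0
```

The essential property being modelled is that the catalogue is append-only and queried while it grows, which is why a column store with array operations is a natural fit for the real system.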
