The LSST Data Mining Research Agenda
ABSTRACT We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more. Comment: 5 pages, Presented at the "Classification and Discovery in Large Astronomical Surveys" meeting, Ringberg Castle, 14-17 October, 2008
arXiv:0811.0167v1 [astro-ph] 2 Nov 2008
The LSST Data Mining Research Agenda
K. Borne∗, J. Becla†, I. Davidson∗∗, A. Szalay‡and J. A. Tyson§
∗Computational and Data Sciences, George Mason University, MS 6A2, Fairfax, VA 22030
†Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94309
∗∗Department of Computer Science, University of California, Davis, CA 95616
‡Department of Physics & Astronomy, Johns Hopkins University, Baltimore, MD 21228
§Physics Department, University of California, Davis, CA 95616
Abstract. We describe features of the LSST science database that are amenable to scientific data
mining, object classification, outlier identification, anomaly detection, image quality assurance, and
survey science validation. The data mining research agenda includes: scalability (at petabyte scales)
of existing machine learning and data mining algorithms; development of grid-enabled parallel data
mining algorithms; designing a robust system for brokering classifications from the LSST event
pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods
for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical
databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.
Keywords: catalogs - surveys - methods: data analysis - astronomical data bases: miscellaneous
PACS: 95.80.+p, 95.75.Pq
DATA-INTENSIVE ASTRONOMY AND THE LSST SKY SURVEY
The development of models to describe and understand scientific phenomena has his-
torically proceeded at a pace driven by new data. The more we know, the more we are
driven to tweak or to revolutionize our models, thereby advancing our scientific under-
standing. This data-driven modeling and discovery linkage has entered a new paradigm.
The acquisition of scientific data in all disciplines is now accelerating and causing
a nearly insurmountable data avalanche. In astronomy in particular, rapid advances
in three technology areas (telescopes, detectors, and computation) have continued un-
abated – all of these advances lead to more and more data. With this accelerated
advance in data generation capabilities, we will require novel, increasingly automated,
and increasingly more effective scientific knowledge discovery systems.
Astronomers have been doing data mining for centuries: “the data are mine, and you
can’t have them!”. Seriously, astronomers are trained as data miners, because we are
trained to: (a) characterize the known (i.e., unsupervised learning, clustering); (b) assign
the new (i.e., supervised learning, classification); and (c) discover the unknown (i.e.,
semi-supervised learning, outlier detection) [6, 7, 8]. These skills are more critical than
ever since astronomy is now a data-intensive science, and it will become even more
data-intensive in the coming decade [4, 9, 10]. New surveys may produce hundreds of
terabytes (TB) up to 100 (or more) petabytes (PB) both in the image data archive and
in the object catalogs (databases). Discovering the ensuing hidden wealth of new scien-
tific knowledge will require more sophisticated algorithms and networks that discover,
integrate, and learn from distributed petascale databases more effectively.
The problem therefore is this: astronomy researchers will soon (if not already) lose the
ability to assimilate or to keep up with any of these things: the data flood, the scientific
discoveries buried within, the development of new models of those phenomena, and the
resulting new data-driven follow-up observing strategies that are imposed on telescope
facilities to collect new data needed to validate and augment new discoveries.
One of the most impressive astronomical sky surveys being planned for the next
decade is the Large Synoptic Survey Telescope project (LSST at www.lsst.org).
The three fundamental distinguishing astronomical attributes of the LSST project are:
1. Repeated temporal measurements of all observable objects in the sky, correspond-
ing to thousands of observations per each object over a 10-year period, expected
to generate 10,000-100,000 alerts each night to the astronomical research commu-
nity that something has changed at that location on the sky: either the brightness or
position of an object, or the serendipitous appearance of some totally new object;
2. Wide-angle imaging that will repeatedly cover most of the night sky within 3 to 4
nights (= tens of billions of objects); and
3. Deep co-added images of each observable patch of sky (summed over 10 years:
2015-2025), reaching far fainter objects and to greater distance over more area of
sky than other sky surveys.
Compared to other astronomical sky surveys, the LSST survey will deliver time
domain coverage for orders of magnitude more objects. It is envisioned that this project
will produce ∼30 TB of data per each night of observation for 10 years. The final
image archive will be greater than 60 PB (and possibly much more), and the final LSST
astronomical object catalog (object-attribute database) is expected to be ∼10-20 PB
(or more). Additional information about the LSST survey and scientific program is
described by Ivezic et al. and provided elsewhere in these proceedings.
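
The nightly data rate quoted above can be checked against the archive size with simple arithmetic. The following Python sketch is illustrative only: the number of usable observing nights per year is an assumption of the sketch (it is not stated in the text), and the function name is hypothetical.

```python
# Back-of-the-envelope check of the survey data volume.
TB_PER_NIGHT = 30        # from the text: ~30 TB per night of observation
SURVEY_YEARS = 10        # from the text: 10-year survey
NIGHTS_PER_YEAR = 300    # ASSUMPTION: usable observing nights per year


def raw_volume_pb(tb_per_night=TB_PER_NIGHT,
                  years=SURVEY_YEARS,
                  nights_per_year=NIGHTS_PER_YEAR):
    """Return the accumulated raw data volume in petabytes (1 PB = 1000 TB)."""
    return tb_per_night * nights_per_year * years / 1000.0


if __name__ == "__main__":
    for nights in (200, 300):
        print(nights, "nights/yr ->", raw_volume_pb(nights_per_year=nights), "PB")
```

With 300 assumed nights per year the total is 90 PB, and with 200 it is 60 PB, which is the same order as the archive size quoted above.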
Since it is anticipated that LSST will generate many thousands (probably tens of thou-
sands) of new astronomical event alerts per night of observation, there is a critical need
for innovative follow-up procedures. These procedures necessarily must include mod-
eling of the events – to determine their classification, time-criticality, astronomical rel-
evance, rarity, and the scientifically most productive set of follow-up measurements.
Rapid time-critical follow-up observations, with a wide range of time scales from sec-
onds to days, are essential for proper identification, classification, characterization, anal-
ysis, interpretation, and understanding of nearly every astrophysical phenomenon (e.g.,
supernovae, novae, accreting black holes, microquasars, gamma-ray bursts, gravitational
microlensing events, extrasolar planetary transits across distant stars, new comets, in-
coming asteroids, trans-Neptunian objects, dwarf planets, optical transients, variable
stars of all classes, and anything that goes “bump in the night”) [17, 18].
Petascale Mining of Large Astronomical Sky Surveys
LSST and similar large sky surveys have enormous potential to enable countless
astronomical discoveries. Such discoveries will span the full spectrum of statistics:
from rare one-in-a-billion (or one-in-a-trillion) type objects, to a complete statistical
and astrophysical specification of a class of objects (based upon millions of instances of
the class). One of the key scientific requirements of these projects therefore is to learn
rapidly from what they see. This means: (a) to identify the serendipitous as well as the
known; (b) to identify rare events that our models say should be there; (c) to identify
new classes of objects that fall outside the bounds of model expectations; (d) to find new
attributes of known classes; (e) to provide statistically robust tests of existing models;
and (f) to generate the vital inputs for new models. All of this requires integrating and
mining all known data: to train classification models and to apply classification models.
LSST alone is likely to throw such data mining and knowledge discovery efforts into
the petascale realm. For example: astronomers currently discover a few hundred new su-
pernovae per year. Since the beginning of human history, perhaps ∼10,000 supernovae
enable fundamental (Dark Energy) science, it is imperative for astronomers to respond
quickly to each new event with rapid follow-up observations in many measurement
modes (light curves; spectroscopy; and images of the host galaxy and its environment).
Historically, with <10 new supernovae being discovered each week, such follow-up has
been feasible. But now, LSST promises to produce a list of 1000 new supernovae each
night for 10 years, which represent a small fraction of the total (10-100 thousand)
alerts expected each night! Astronomers are faced with the enormous challenge of effi-
ciently mining, correctly classifying, and intelligently prioritizing a staggering number
of new events for follow-up observation each night for a decade.
The major features and contents of the LSST scientific database include: >100
database tables; image metadata (675M rows); source catalog (260B rows); object cat-
alog (22B rows, with 200+ attributes); moving object catalog; variable object catalog;
alerts catalog; calibration metadata; configuration metadata; processing metadata; and
years), comprising 60-100 PB of pixel data. This enormous LSST data archive and ob-
ject database enables a diverse multidisciplinary research program: astronomy & astro-
physics; machine learning (data mining); exploratory data analysis; XLDB (extremely
large databases); scientific visualization; computational science & distributed comput-
ing; and inquiry-based science education (using data in the classroom).
Many possible scientific data mining use cases are anticipated with the LSST
• Provide rapid probabilistic classifications for all 10,000 LSST events each night;
• Find new “fundamental planes” of correlated astrophysical parameters (e.g., the
fundamental plane of elliptical galaxies);
• Find new correlations, associations, relationships of all kinds from 100+ attributes
in the LSST science database, integrated with distributed VO-accessible data;
• Compute multi-point multi-dimensional correlation functions over the full panoply
of astrophysical parameter spaces;
• Discover zones of avoidance in interesting parameter spaces (e.g., period gaps);
• Discover new properties of known classes;
• Discover new and improved rules for classifying known classes of objects (e.g.,
photometric redshifts);
• Discover new and exotic classes of astronomical objects;
• Identify novel, unexpected temporal behavior in all classes of objects;
• Hypothesis testing – verify existing (or generate new) astronomical hypotheses
with strong statistical confidence, using millions of training samples;
• Serendipity – discover rare one-in-a-billion objects through novelty detection;
• Image processing – identify non-astronomical features, classify them, and separate
them from the astronomical catalog inputs [21, 22]; and
• Quality assurance – identify system glitches, instrument anomalies, and pipeline
errors through near-real-time deviation detection.
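
As one possible illustration of the deviation detection mentioned in the quality assurance use case, the sketch below flags measurements that lie far from the median in units of a robust scale estimate (the median absolute deviation, MAD). This is a generic textbook technique chosen for the example, not the LSST pipeline's method; the threshold of 5 and the sample values are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Return (value - median) / (1.4826 * MAD) for each value.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data. If the MAD is zero (at least half of
    the values are identical), all scores are returned as 0.0, so this
    simple sketch cannot flag anything in that degenerate case.
    """
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]


def flag_deviations(values, threshold=5.0):
    """Return the indices whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


if __name__ == "__main__":
    # Hypothetical per-exposure image quality metric (arbitrary units).
    metric = [0.70, 0.72, 0.69, 0.71, 0.70, 2.50, 0.68, 0.71]
    print(flag_deviations(metric))  # index of the anomalous exposure
```

The median and MAD are used instead of the mean and standard deviation because a single large glitch would otherwise inflate the scale estimate and mask itself.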
Some of the data mining research challenge areas posed by the arrival of petascale
scientific databases include:
• indexing and associative memory techniques (trees, graphs, networks) for multi-
attribute (highly-dimensional) astronomical databases (beyond RA-Dec indexing);
• scalability of statistical, computational, machine learning, and data mining algo-
rithms to multi-petabyte scales;
• algorithms for optimization of simultaneous multi-point fitting across massive
multi-dimensional data cubes;
• multi-resolution methods and structures for exploration of petascale databases;
• petascale analytics for visual exploratory data analysis of massive databases; and
• rapid query, search, and retrieval algorithms for petabyte databases.
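
To make the first challenge area concrete, the following Python sketch builds a small k-d tree over several object attributes (not sky position) and answers a nearest-neighbor query. It is a minimal in-memory illustration of tree-based multi-attribute indexing, assuming already-normalized attributes and an unweighted Euclidean distance; it does not address the petabyte-scale, distributed setting, and the attribute values are hypothetical.

```python
from math import dist


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree from a list of equal-length tuples."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the attributes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median point splits the set
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or dist(point, target) < dist(best, target):
        best = point
    diff = target[axis] - point[axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than
    # the best match found so far; otherwise it cannot hold a closer point.
    if abs(diff) < dist(best, target):
        best = nearest(far, target, best)
    return best


if __name__ == "__main__":
    # Hypothetical normalized attributes: (magnitude, color, log period).
    catalog = [(0.2, 0.1, 0.9), (0.8, 0.4, 0.3), (0.5, 0.5, 0.5), (0.1, 0.9, 0.2)]
    tree = build_kdtree(catalog)
    print(nearest(tree, (0.45, 0.55, 0.5)))
```

The pruning step is what makes such an index useful: whole subtrees are skipped when they cannot contain a closer match, although its effectiveness is known to degrade as the number of attributes grows.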
Additional and more in-depth discussion of the petascale data challenges posed by
the LSST sky survey are available (at www.lsst.org/Project/docs/data-challenge.pdf and
A Classification Broker for Astronomy
We envision an astroinformatics (data-intensive astronomy) research paradigm (for
data integration and mining) to address the petascale needs of large astronomical surveys
10,000 times, while the time-criticality requirement (for event/object classification and
characterization) drastically drops from months (or weeks) down to minutes (or tens of
seconds). In addition to the follow-up classification problem (described earlier), we will
want to find every possible new scientific discovery (pattern, correlation, relationship,
outlier, new class, etc.) buried within these new enormous databases. This might lead to
a petascale data mining compute engine that runs in parallel alongside the data archive
– to test every possible N-point correlation, multi-parameter association, and classifica-
tion rule. In addition to such a “batch discovery machine”, a rapid-response data mining
engine (i.e., classification broker) is needed in order to produce and distribute scientif-
ically robust near-real-time classifications of astronomical sources, events, objects, or
event host objects (e.g., we need the redshift of the host galaxy in order to interpret
and classify a supernova accurately) [23, 24, 25]. These classifications are derived from
integrating and mining data, information, and knowledge from multiple distributed VO-
accessible data repositories, robotic telescopes, and astronomical alert networks world-
wide. Incoming event alert data will be subjected to a suite of machine learning (ML)
algorithms for event classification, outlier detection, object characterization, and novelty
discovery [18, 23, 24, 25, 26, 27]. Probabilistic ML models will produce rank-ordered
lists, to guide follow-up observations on the 10-100K alertable astronomical events that
will be identified each night by the LSST sky survey alone. The classification broker will
thereby enable rapid follow-up science for the most important and exciting astronomical
discoveries of the coming decade, on a wide range of time scales from seconds to days,
corresponding to a plethora of exotic astrophysical phenomena.
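
The rank-ordering step of such a broker can be sketched as follows. This is a toy illustration under stated assumptions, not a design for the LSST broker: the `Alert` structure, the class labels, the per-class scientific weights, and the scoring rule (expected weight under the class probabilities) are all hypothetical.

```python
import heapq
from dataclasses import dataclass

# ASSUMPTION: relative follow-up value of each class (hypothetical numbers).
CLASS_WEIGHTS = {
    "supernova": 1.0,
    "microlensing": 0.8,
    "variable_star": 0.2,
    "artifact": 0.0,
}


@dataclass
class Alert:
    alert_id: str
    class_probs: dict  # class label -> probability from some ML classifier


def priority(alert, weights=CLASS_WEIGHTS):
    """Expected follow-up value: sum over classes of P(class) * weight."""
    return sum(p * weights.get(label, 0.0)
               for label, p in alert.class_probs.items())


def rank_alerts(alerts, top_n):
    """Return the top_n alerts in decreasing order of priority."""
    return heapq.nlargest(top_n, alerts, key=priority)


if __name__ == "__main__":
    tonight = [
        Alert("a", {"supernova": 0.9, "artifact": 0.1}),
        Alert("b", {"variable_star": 0.95, "supernova": 0.05}),
        Alert("c", {"microlensing": 0.7, "artifact": 0.3}),
    ]
    for alert in rank_alerts(tonight, top_n=2):
        print(alert.alert_id, round(priority(alert), 2))
```

Using `heapq.nlargest` avoids sorting the full nightly alert list when only a short follow-up list is needed. A realistic score would also fold in time-criticality and rarity, which this sketch omits.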
We thank our LSST (www.lsst.org) collaborators for their valuable contributions.
1. Mahootian, F., & Eastman, T. 2008, “Complementary Frameworks of Scientific Inquiry: Hypothetico-
Deductive, Hypothetico-Inductive, and Observational-Inductive,” World Futures journal, in press.
2. Bell, G., Gray, J., & Szalay, A. 2005, arxiv.org/abs/cs/0701165.
3. Gray, J., & Szalay, A. 2004, Microsoft technical report MSR-TR-2004-110.
4. Becla, J., et al. 2006, arxiv.org/abs/cs/0604112.
5. Borne, K. D. 2006, “Data-Driven Discovery through e-Science Technologies,” in the proceedings of
the 2nd IEEE Conference on Space Mission Challenges for Information Technology.
6. Borne, K. D. 2001a, astro-ph/0008307.
7. Borne, K. D. 2001b, astro-ph/0010583.
8. Borne, K. D. 2009, “Scientific Data Mining in Astronomy,” in Next Generation Data Mining, CRC
Press, in press.
9. Brunner, R. J., et al. 2001, astro-ph/0106481.
10. Szalay, A. S., Gray, J., & VandenBerg, J. 2002, arxiv.org/abs/cs/0208013.
11. Gray, J., et al. 2002, arxiv.org/abs/cs/0202014.
12. Longo, G., et al. 2001, “Advanced Data Mining Tools for Exploring Large Astronomical Databases,”
Proceedings of the SPIE, volume 4477, Astronomical Data Analysis, p. 61.
13. Tyson, J. A. 2004, “The Large Synoptic Survey Telescope: Science & Design,” downloaded from
14. Strauss, M. 2004, “Towards a Design Reference Mission for the LSST,” downloaded from
15. Ivezic, Z., et al. 2008a, astro-ph/0805.2366.
16. Ivezic, Z., et al. 2008b, “Parametrization and Classification of 20 Billion LSST Objects,” in the
proceedings of Classification and Discovery in Large Astronomical Surveys, New York: AIP.
17. Paczynski, B. 2000, Publications of the ASP, 112, p. 1281.
18. Becker, A. 2008, Astronomische Nachrichten, 329, p. 280.
19. Djorgovski, S., & Davis, M. 1987, Astrophysical Journal, 313, p. 59.
20. Way, M. J., & Srivastava, A. N. 2006, Astrophysical Journal, 647, p. 102.
21. Salzberg, S., et al. 1995, Publications of the ASP, 107, p. 279.
22. Waniak, W. 2006, Experimental Astronomy, vol. 21, issue 3, p. 151.
23. Borne, K. D. 2008, Astronomische Nachrichten, 329, p. 255.
24. Bloom, J. S., et al. 2008, Astronomische Nachrichten, 329, p. 284.
25. Mahabal, A., et al. 2008, Astronomische Nachrichten, 329, p. 288.
26. Ball, N. M., et al. 2006, Astrophysical Journal, 650, p. 497.
27. Ball, N. M., et al. 2007, Astrophysical Journal, 663, p. 774.