The Palomar‐Quest digital synoptic sky survey
ABSTRACT We describe briefly the Palomar-Quest (PQ) digital synoptic sky survey, including its parameters, data processing, status, and plans. Exploration of the time domain is now the central scientific and technological focus of the survey. To this end, we have developed a real-time pipeline for detection of transient sources. We describe some of the early results, and lessons learned which may be useful for other, similar projects, and time-domain astronomy in general. Finally, we discuss some issues and challenges posed by the real-time analysis and scientific exploitation of massive data streams from modern synoptic sky surveys. (© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)
arXiv:0801.3005v1 [astro-ph] 21 Jan 2008
Astron. Nachr. / AN 329, No. 3 (2008) / Ref. Proc. "Hotwiring the Transient Universe", eds. A. Allan, R. Seaman, J. Bloom
The Palomar-Quest Digital Synoptic Sky Survey
S.G. Djorgovski1,⋆, C. Baltay2, A.A. Mahabal1, A.J. Drake3, R. Williams3, D. Rabinowitz2, M.J. Graham3,
C. Donalek1, E. Glikman1, A. Bauer2, R. Scalzo2, N. Ellman2, and J. Jerke2
1Astronomy, MS 105-24, Caltech, Pasadena, CA 91125, USA
2Physics Dept., Yale University, New Haven, CT 06520, USA
3Center for Advanced Computing Research, MS 158-79, Caltech, Pasadena, CA 91125, USA
Received 01 Sep 2007, accepted 25 Dec 2007
Sky Surveys – Transients – Software Systems
We describe briefly the Palomar-Quest (PQ) digital synoptic sky survey, including its parameters, data processing, status,
and plans. Exploration of the time domain is now the central scientific and technological focus of the survey. To this end,
we have developed a real-time pipeline for detection of transient sources. We describe some of the early results, and lessons
learned which may be useful for other, similar projects, and time-domain astronomy in general. Finally, we discuss some
issues and challenges posed by the real-time analysis and scientific exploitation of massive data streams from modern
synoptic sky surveys.
1 A Brief Description of the PQ Survey
The Palomar-Quest (PQ) digital synoptic sky survey is a
collaborative project between groups at Yale University and
Caltech (Co-PIs: C. Baltay and S.G. Djorgovski), with an
extended network of collaborations with other groups
worldwide, including Indiana U. (M. Gebhard et al.), NCSA
(R. Brunner et al.), LBNL Nearby SN Factory (NSNF; S.
Perlmutter et al.), INAOE (Puebla, Mexico; L. Carrasco, O.
Lopez-Cruz et al.), EPFL (Switzerland; G. Meylan et al.),
and Caltech/JPL (M. Brown et al.). The data are obtained
at the Palomar Observatory's Samuel Oschin telescope (the
48-inch Schmidt) using the QUEST-2 112-CCD, 161 Mpix
camera (Baltay et al. 2007). Approx. 45% of the telescope
time is used for the PQ survey. The survey started in the late
summer of 2003, and will finish in late 2008.
In the first phase of the survey, data were obtained in
the drift scan mode in 4.6°-wide strips of constant Dec, in
the range −25° < δ < +25°, excluding the Galactic plane.
The total area coverage is ∼ 15,000 deg², with multiple
passes, ranging from a few to about 25, and typically 5 –
10 times, with time baselines ranging from hours to years.
There are some thin-strip gaps in the coverage, due to a
combination of inter-CCD gaps, bad CCDs, and a suboptimal
dithering strategy. Typical area coverage rate is up to
∼ 500 deg²/night in 4 filters. The raw data rate is on average
∼ 70 GB per clear night. To date, about 25 TB of usable
data have been collected in the drift scan mode.
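The survey figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a rough consistency check: it assumes decimal units (1 TB = 1000 GB) and treats the usable data volume as if it accumulated at the average raw rate, which need not hold in practice. The function names are illustrative and do not come from any survey software.

```python
import math

# Figures quoted in the text.
RAW_RATE_GB_PER_NIGHT = 70.0   # average raw data rate per clear night
USABLE_DATA_TB = 25.0          # usable drift-scan data collected to date
SURVEY_AREA_DEG2 = 15_000.0    # total area coverage

# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000.0


def equivalent_clear_nights(total_tb, rate_gb_per_night):
    """Number of clear nights needed to accumulate total_tb at the given rate."""
    return total_tb * GB_PER_TB / rate_gb_per_night


def sky_fraction(area_deg2):
    """Fraction of the full celestial sphere (4*pi steradians) covered."""
    full_sky_deg2 = 4.0 * math.pi * (180.0 / math.pi) ** 2  # about 41,253 deg^2
    return area_deg2 / full_sky_deg2


nights = equivalent_clear_nights(USABLE_DATA_TB, RAW_RATE_GB_PER_NIGHT)
fraction = sky_fraction(SURVEY_AREA_DEG2)
print(f"equivalent clear nights: {nights:.0f}")
print(f"sky fraction covered:    {fraction:.1%}")
```

Under these assumptions the data volume corresponds to a few hundred clear nights, and the area to a little over a third of the sky.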
Data were obtained with two filter sets, Johnson UBRI
and Gunn/SDSS rizz, recently changed to griz. Effective
exposures are ∼ 150 sec / cos δ per pass. Typical estimated
limiting magnitudes are r_lim ≈ 21.5, i_lim ≈ 20.5, z_lim ≈
19.5, R_lim ≈ 22, and I_lim ≈ 21 mag, depending on the
⋆Corresponding author: Djorgovski, e-mail: email@example.com
seeing, lunar phase, etc. Coadding of ∼ 8 passes reaches the
are done independently at Yale and Caltech, mainly using
the overlap region with SDSS.
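The 1 / cos δ factor in the effective exposure arises because, in drift scan mode, the sky crosses the detector at a rate proportional to cos δ, so a source takes longer to cross at higher |δ|. A minimal sketch of this dependence over the survey's declination range follows. The coadd helper additionally assumes an idealized, background-limited case in which signal-to-noise grows as the square root of the number of passes; that is an assumption of this sketch, not a depth quoted in the text.

```python
import math

BASE_EXPOSURE_S = 150.0  # effective exposure per pass at Dec = 0, from the text


def drift_scan_exposure(dec_deg, base_s=BASE_EXPOSURE_S):
    """Effective drift-scan exposure (seconds) at declination dec_deg."""
    return base_s / math.cos(math.radians(dec_deg))


def coadd_depth_gain_mag(n_passes):
    """Idealized depth gain (mag) from coadding n_passes equal exposures.

    Assumption: S/N scales as sqrt(N), i.e. a gain of 2.5*log10(sqrt(N)) mag.
    Real coadds are limited by artifacts, seeing variations, and so on.
    """
    return 2.5 * math.log10(math.sqrt(n_passes))


for dec in (0.0, 10.0, 25.0):
    print(f"Dec = {dec:+5.1f} deg: {drift_scan_exposure(dec):6.1f} s")
print(f"idealized gain for 8 passes: {coadd_depth_gain_mag(8):.2f} mag")
```

Across −25° < δ < +25° the exposure varies by only about 10%, so the depth is nearly uniform over the survey strips.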
In the second phase of the survey, which started in the
spring of 2007, data are obtained in the traditional point-
and-track mode, in a single, wide-band red filter (RG610),
with ∼ 10% of the time in the drift scan mode. The cover-
age and the cadence are optimized for the nearby supernova
search, in collaboration with the LBNL NSNF group, and a
search for dwarf planets, in collaboration with M. Brown.
Data are processed with several different pipelines, op-
timized for different scientific goals. This includes the Yale
pipeline (Andrews et al. 2007), which does the PSF fitting
and was designed for a search for gravitationally lensed
quasars; the Caltech data cleaning pipeline, used to remove
numerous instrumental artifacts present in the data; the Caltech
real-time pipeline for detection of transient events, as described below; the LBNL NSNF pipeline,
based on image subtraction and designed for detection of
nearby SNe; and a pipeline for an optimal coadding of im-
ages and detection of sources in them, now developed at
Caltech. Images and resulting catalogs are stored in multi-
ple locations, using a variety of databases.
PQ is the first major digital sky survey fully designed
and implemented in the Virtual Observatory (VO) era, and
it uses VO standards and protocols throughout. Public data
releases will also be made through VO-type interfaces. The
first public data release is imminent, pending the completion
of various data quality control and assessment tests.
The survey is feeding multiple scientific goals and projects.
The initial motivation was a search for > 10⁵ QSOs,
using colors and variability, in order to discover > 100
strong gravitational lenses, and use them to constrain cos-
mology and/or history of mass assembly. Another project
was a search for high-z QSOs, to be used as probes of reionization
and early structure formation. Both of them are now
finally starting to yield results; the progress was slow due to
numerous problems with the data, all of which have been
solved, and will be documented in detail elsewhere. Our
principal scientific focus now is exploration of the time domain,
as described below.
Our main public outreach effort to date has been the creation
of the Griffith Observatory’s “Big Picture”, and the associated
website, http://bigpicture.caltech.edu. This exhibit
will be seen by millions of visitors, serving multiple educa-
tional roles in the years to come.
2 PQ Exploration of the Time Domain:
Some Preliminary Results
With a data set covering nearly 40% of the entire sky, with
multiple passes reaching ∼ 21 mag each, and time baselines
ranging from minutes (between different CCDs) to hours
(repeated scans in the same night), days (within the same
lunation), months, and years, and (using the cross-matches
to DPOSS and SDSS catalogs) up to decades, PQ is in a
unique position to explore time-variable sky in a systematic
fashion. For some early reports, see Graham et al. (2004),
Mahabal et al. (2004, 2005), or Djorgovski et al. (2006).
One major effort is a search for nearby (z ∼ 0.1) SNe
Ia, to be used as the low-z calibration of the Hubble dia-
gram. This project is led by the Yale group in collaboration
with the LBNL NSNF. To date, this effort has found a total
of about 500 SNe, about a half of which were spectroscop-
ically confirmed, and among them about 70 Type Ias with
10 or more spectra taken; as well as a plethora of other SNe
(including some peculiar ones) and transients. All are pub-
lished in IAU Circulars, CBETs, and ATels. The work uses
an image subtraction technique, in order to remove the light
of well-detected host galaxies. The Caltech real-time pipeline
is now also starting to detect SNe, using a search for tran-
sients in the catalog domain.
We are now using the archives of our data to study sys-
tematically the variability of QSOs, and especially Blazars.
Some examples are shown in Fig. 1. The main goal is to
devise an algorithm based on colors and variability alone
to define a purely optically selected sample of Blazars, and
thus check on the selection effects in the traditional radio
and X-ray approaches. These sources may be the main contributors
to the extragalactic γ-ray background, a subject
of considerable interest with the upcoming launch of the
GLAST mission. They are also implicated as sources of
ultra-high energy cosmic rays (UHECR). These cosmic ac-
celerators can reach energies several orders of magnitude
higher than any predictable terrestrial accelerators. Their
census and detailed studies are thus of a considerable and
Fig. 1 Examples of 3 known Blazars, as seen in the PQ
data. The top row shows them in a relatively high state, the
bottom row in a relatively low state.
Our exploration of the archival PQ data has yielded a
large number of transients, operationally defined as PSF-like
sources detected in only one epoch, with no detectable
apparent motion between different CCDs in a single pass.
Subsequent studies have revealed counterparts for some of
them in deeper, coadded images. We believe that many of
them are probablyasteroids caught near the stationary point
(see below). However, this has underscored the need to de-
tect and follow transients in a real or near-real time, in order
to determine their physical nature.
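The operational definition above (a source present in one epoch with no counterpart in the others) amounts to a positional cross-match in the catalog domain. The following is a minimal sketch of that idea, not the survey's actual code: the 2 arcsec match radius, the (RA, Dec) tuple layout, and the function names are assumptions made for illustration, and the brute-force loop would be replaced by a spatial index for catalogs with millions of sources.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def unmatched_sources(new_catalog, baseline_catalog, radius_arcsec=2.0):
    """Return sources in new_catalog with no baseline counterpart within radius.

    Each catalog is a list of (ra_deg, dec_deg) tuples. The radius is a
    hypothetical tolerance; a real value depends on the astrometric accuracy.
    """
    candidates = []
    for ra, dec in new_catalog:
        has_counterpart = any(
            angular_sep_arcsec(ra, dec, bra, bdec) <= radius_arcsec
            for bra, bdec in baseline_catalog
        )
        if not has_counterpart:
            candidates.append((ra, dec))
    return candidates


baseline = [(150.0, 10.0), (150.01, 10.0)]
new_pass = [(150.0, 10.0001), (150.02, 10.0)]  # first one is 0.36 arcsec off
candidates = unmatched_sources(new_pass, baseline)
print(candidates)
```

Sources that survive this step are only candidates; instrumental artifacts and moving objects still have to be removed before any of them can be called a transient.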
We have thus developed a real-time pipeline, which is
now operational. The pipeline does the standard removal of
instrumental signatures, pushes the data through the Cal-
ments astrometry, compares the new catalogs to those from
the previous passes, finds newly detected sources, imple-
ments a number of software filters to eliminate the residual
instrumental artifacts, known asteroids or variables, moving
objects (uncatalogued asteroids), produces cutout images
and webpages for the candidate transients, and publishes
them using the VOEvent protocols and on VOEN website,
We typically do a ∼ 4-hour long scan, then re-scan the
same area again, with the real-time pipeline running. In a
typical half-night scan, we may detect a couple of million
sources, and about a thousand potential transients. Removal
of residual instrumental artifacts leaves a few hundred gen-
uine detections, nearly all of which are asteroids; of them,
typically only a half are among the previously catalogued;
the rest are largely removed after the second scan. The num-
procedures were improved. Over the past year or so, nearly
4800 events have been submitted, with an average rate of ∼
200 per night. About 85% of these were immediately clas-
sified as asteroids, and the majority of the remaining ones
are as well. Finally, there are only a few (< 10/night) ap-
follow-up observations to date show that they are a mixture
of SNe, AGN, probable flaring M dwarfs, and the rest are of
as yet unknown nature. Some are re-discovered on different
Fig. 2 An example of a transient detected with our real-time
pipeline, PQT 070519:143304+150707; see Drake et
al., ATel 1083. The top row are detection images in the g
and r bands; the bottom row are the comparison baseline
images. The source faded slowly, but got redder rapidly; it
may be a rare type of a SN.
3 Some Lessons Learned
Combining the current PQ experiences with the older work
with DPOSS (see, e.g., Mahabal et al. 2005), we estimate
that in a single-pass snapshot survey there are ∼ 10⁻² astrophysical
transients/deg² down to ∼ 20 mag at high Galactic
latitudes. Many of them are known, highly variable types of
objects, where the “low state” is below the detection of the
baseline data, with variable stars of different kinds domi-
nating on the short time scales (∼ minutes to months), and
AGN (mainly Blazars and OVVs) dominating on the longer
time scales (years and longer). Some are a variety of stel-
lar explosions. Some may be as-yet unknown types of ob-
jects and phenomena, but real-time spectroscopic and other
follow-up is necessary in order to discover them.
We find that the principal contaminants for optical surveys
are slow-moving asteroids; there are ∼ 1 − 3 of them
per deg² down to ∼ 21 mag, depending very much on the
Ecliptic latitude; i.e., > 100 asteroids for each astrophysical
transient. A joint analysis for moving and variable objects is
necessary, and any type of synoptic sky survey data stream
can feed both scientific domains simultaneously. Improving
the existing catalogs of asteroids is an urgent task. At
least two epochs are needed in order to eliminate previously
unknown asteroids in any synoptic survey, and their base-
line will define the effective time resolution of any transient
search (we also note that at least 3 properly spaced epochs
are needed to compute even a rough orbit).
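The surface densities quoted in this section translate directly into expected nightly counts. The sketch below is a back-of-the-envelope calculation only: it uses the upper figure of ∼ 500 deg²/night for the area covered, and it ignores the different magnitude limits of the two densities (∼ 20 mag for transients, ∼ 21 mag for asteroids), so the numbers are indicative rather than predictive.

```python
# Surface densities quoted in the text.
TRANSIENTS_PER_DEG2 = 1e-2        # astrophysical transients, to ~20 mag
ASTEROIDS_PER_DEG2 = (1.0, 3.0)   # slow-moving asteroids, to ~21 mag (low, high)

# Assumption: a full night at the maximum quoted coverage rate.
AREA_PER_NIGHT_DEG2 = 500.0


def expected_counts(area_deg2):
    """Expected transients and (low, high) asteroids in area_deg2."""
    transients = TRANSIENTS_PER_DEG2 * area_deg2
    asteroids = tuple(density * area_deg2 for density in ASTEROIDS_PER_DEG2)
    return transients, asteroids


def asteroids_per_transient():
    """(low, high) number of asteroids per astrophysical transient."""
    return tuple(density / TRANSIENTS_PER_DEG2 for density in ASTEROIDS_PER_DEG2)


transients, asteroids = expected_counts(AREA_PER_NIGHT_DEG2)
ratio = asteroids_per_transient()
print(f"transients per night: ~{transients:.0f}")
print(f"asteroids per night:  ~{asteroids[0]:.0f} to {asteroids[1]:.0f}")
print(f"asteroids per transient: ~{ratio[0]:.0f} to {ratio[1]:.0f}")
```

With these inputs the contaminants outnumber the astrophysical transients by two orders of magnitude, which is why a second epoch for rejecting moving objects matters so much.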
The quality of the baseline or fiducial sky against which
current observations are compared is a key issue. It must
be deep, clean, complete, and wavelength-matched. Gen-
erating a standard, dynamically evolving, annotated, multi-
wavelength baseline sky may be a good community (VO)
project; we are developing a prototype from PQ and other
publicly available panoramic imaging data sets.
Achieving a high completeness (few real transients
missed) and a low contamination (few false alarms) is a
huge challenge. Interesting sources are discovered as out-
liers in some parameter space; problems with the data also
generate outliers in some parameter space. In a large data
set, most unlikely things will happen, and most of them are
bad. Robust and reliable data cleaning is a key requirement.
This is hard to do in a cutting-edge software system.
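Completeness and contamination, as used above, can be made concrete for any test set in which the real transients are known. The sketch below shows one common way to compute both; the function name and the event identifiers are hypothetical and serve only to illustrate the two definitions.

```python
def score_candidates(reported_ids, real_ids):
    """Return (completeness, contamination) for a list of reported events.

    completeness  = fraction of real transients that were reported
    contamination = fraction of reported events that are not real transients
    """
    reported, real = set(reported_ids), set(real_ids)
    recovered = reported & real
    false_alarms = reported - real
    completeness = len(recovered) / len(real) if real else 1.0
    contamination = len(false_alarms) / len(reported) if reported else 0.0
    return completeness, contamination


# Hypothetical test set: four real transients, of which three are recovered,
# plus two false alarms (e.g., an uncatalogued asteroid and an artifact).
real = ["t1", "t2", "t3", "t4"]
reported = ["t1", "t2", "t3", "asteroid1", "artifact1"]
completeness, contamination = score_candidates(reported, real)
print(f"completeness = {completeness:.2f}, contamination = {contamination:.2f}")
```

The two quantities pull against each other: loosening the selection raises completeness but admits more false alarms, so both need to be tracked together when tuning the software filters.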
Data systems (pipelines, archives, and analysis) and op-
erational procedures for synoptic sky surveys are subject to
a substantial tension between static and dynamic compo-
nents, including both real-time and subsequent (non-time-
critical) analysis and distribution, data ingestion, database
updating and recomputing, etc. This has implications both
for survey strategies and system architecture design.
Another key challenge is an automated classification of
events for prioritized follow-up, as discussed by Mahabal
et al. and Bloom et al. elsewhere in this volume. This will
certainly require use of machine learning tools, as described
by Vestrand et al. in this volume.
All of these challenges will grow much sharper, as the
data volume and data flux increase dramatically in upcoming
synoptic sky surveys. We are now dealing with data
streams of the order of 0.1 TB/night, and ∼ 10 transients/night.
On a time scale of ∼ 1 − 5 years, this will increase to ∼ 1
TB/night and ∼ 10⁴ transients/night (e.g., PanSTARRS),
and on a time scale of ∼ 5 − 10 years, this will increase to
∼ 20 TB/night and ∼ 10⁵ − 10⁶ transients/night (LSST).
Development and testing of software, methodologies, and
operational and follow-up procedures is an urgent task, in
which surveys such as PQ can play an important role.
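The projected rates above can be compared with the current ones in a few lines. The sketch below simply tabulates growth factors relative to the present figures; the stage labels and the dictionary layout are illustrative choices, and the numbers are the order-of-magnitude estimates quoted in the text.

```python
# (label, TB per night, transients per night: low, high), as quoted in the text.
STAGES = [
    ("now (PQ)", 0.1, 10.0, 10.0),
    ("~1-5 yr (e.g., PanSTARRS)", 1.0, 1e4, 1e4),
    ("~5-10 yr (LSST)", 20.0, 1e5, 1e6),
]


def growth_table(stages):
    """Growth factors relative to the first stage, plus transients per TB."""
    _, tb0, lo0, hi0 = stages[0]
    rows = []
    for label, tb, lo, hi in stages:
        rows.append({
            "label": label,
            "volume_factor": tb / tb0,
            "transient_factor": (lo / lo0, hi / hi0),
            "transients_per_tb": (lo / tb, hi / tb),
        })
    return rows


rows = growth_table(STAGES)
for row in rows:
    lo, hi = row["transient_factor"]
    print(f"{row['label']:28s} volume x{row['volume_factor']:.0f}, "
          f"transients x{lo:.0f} to x{hi:.0f}")
```

On these estimates the transient rate grows far faster than the data volume, which suggests that classification and follow-up, rather than storage, become the tighter constraint.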
Acknowledgements. We thank many collaborators who have made
essential contributions to the survey, and the staff of Palomar Ob-
servatory for their tireless efforts during the survey operations.
This work was supported in part by the NSF grants AST-0407448,
AST-0326524, and CNS-0540369, by the Ajax Foundation, and
other private donors. SGD acknowledges the stimulating atmosphere
of the Aspen Center for Physics. Finally, we thank the workshop
organizers for an excellent and productive meeting.
References
Andrews, P., et al.: 2007, PASP, in press (astro-ph/0703446)
Baltay, C., et al.: 2007, PASP, in press (astro-ph/0702590)
Djorgovski, S.G., et al.: 2006, in Proc. ICPR2006, eds. Y.Y. Tang
et al., IEEE Press, p. 856 (astro-ph/0608638)
Graham, M., et al. (the PQ Survey Team): 2004, in Proc. ADASS
XIII, eds. F. Ochsenbein et al., ASPCS 314, 14
Mahabal, A., et al.: 2004, in press (astro-ph/0408035)
Mahabal, A., et al. (the PQ Survey Team): 2005, in Proc. ADASS
XIV, eds. P. Shopbell et al., ASPCS 347, 604